Fit data to parametric distribution


























I have data with a nice bell-shaped histogram (PDF). However, fitting a Normal distribution (by calculating the sample mean and variance) does not work, as the figure below shows.



[Figure: normal fit failed]



My question is whether there are other distributions I should try, according to your experience. In other words, which distribution should be used to fit nicely bell-shaped data that a Normal distribution does not fit well?



My ultimate goal is to have an approximate analytic form of the cumulative distribution function, to analyze the corresponding probabilities. Thus any advice towards this goal is appreciated.



I include the data (space as delimiter): [Data to fit].



Update: Q-Q plot
[Figure: Q-Q plot]





























  • 3
    Welcome to the site, Anna. What is your ultimate goal? Why are you trying to fit a distribution to these data? What are you hoping to achieve in the end?
    – COOLSerdash
    Jan 5 at 10:53






  • 1
    @Xi'an it is not. My question is which distribution should be used to fit nicely bell-shaped data that a Normal distribution does not fit well.
    – Anna Noie
    Jan 5 at 11:06








  • 4
    A better plot to show us would be a Q-Q plot.
    – kjetil b halvorsen
    Jan 5 at 11:24






  • 6
    There is a different global view on the problem: if you don't know the functional form of the real distribution and hope to judge any fit by its agreement with the observed histogram, the ultimate fit will have the precision of the histogram, due to model uncertainty. So I would just compute the empirical cumulative distribution function, a nonparametric estimator, and be done. This is the cumulative histogram when there is no binning of the data.
    – Frank Harrell
    Jan 5 at 12:32






  • 1
    One thing to try is my online statistical distribution fitter at zunzun.com/StatisticalDistributions/1 to see if it suggests any good candidate distributions. It fits the data to over 80 of the continuous statistical distributions in scipy, and is open source.
    – James Phillips
    Jan 5 at 16:18
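A minimal sketch of the empirical-CDF approach suggested in the comments (my illustration in Python/NumPy; the thread itself is language-agnostic):

```python
import numpy as np

def ecdf(sample):
    """Return the empirical CDF of `sample` as a function.

    F(x) = (number of observations <= x) / n -- a nonparametric
    estimator that requires no distributional assumptions.
    """
    xs = np.sort(np.asarray(sample, dtype=float))
    n = xs.size

    def F(x):
        # side="right" counts observations <= x
        return np.searchsorted(xs, x, side="right") / n

    return F

# Toy usage
F = ecdf([1.0, 2.0, 2.0, 5.0])
print(F(2.0))  # 0.75
print(F(0.5))  # 0.0
```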

















Tags: distributions, normal-distribution, descriptive-statistics, fitting






asked Jan 5 at 10:40 by Anna Noie
edited Jan 7 at 20:32 by kjetil b halvorsen


















3 Answers



















The histogram, as presented by the OP, gives the impression that the data is symmetrical. Given that the data is noticeably more peaked than Normal, and if the data is roughly symmetrical, then a natural suggestion to try is the Student's $t$ with location parameter $\mu$, scale parameter $\sigma$, and $v$ degrees of freedom, and pdf $f(x)$:



$$f(x) = \frac{1}{\sigma \sqrt{v} \; B\!\left(\frac{v}{2},\frac{1}{2}\right)} \left(\frac{v}{v+\frac{(x-\mu)^2}{\sigma^2}}\right)^{\frac{v+1}{2}} \quad \text{defined on the real line}$$



Student t fit



The following diagram shows a sample fit using the Student's $t$, with $\mu = 5.45$, $\sigma = 6.61$ and $v = 2.97$:



[Figure: fitted Student's t pdf (dashed red) over the empirical pdf (blue)]



In the diagram:




  • the dashed red curve is the fitted Student's t pdf


  • the squiggly blue curve is the empirical pdf (frequency polygon) of the raw data



On the upside, this appears to be a significantly better fit than the Normal, using the same raw data set provided.



On the possible downside, I am not sure I would fully agree with the OP's opening statement: "I have data with nice bell-shaped histogram PDF". In particular, if one looks more closely at your data set (which contains 100,000 samples), the maximum is 37.45, while the minimum is -910. Moreover, there is not just one large negative value, but a whole bunch of them. This suggests that your data set is not symmetrical but negatively skewed, and that there are other things going on in the tails; if so, other distributions may be better suited. Zooming out, again with the same Student's t fit, we can see this feature of the data in the right and left tails:



[Figure: the same Student's t fit, zoomed out to show the tails]
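For reference, parameters of this kind can be obtained by maximum likelihood. A sketch using SciPy (my addition, run on simulated stand-in data since the fitting tool used in this answer is not stated; `scipy.stats.t.fit` returns degrees of freedom, location, and scale):

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the question's data: heavy-tailed and bell-shaped
rng = np.random.default_rng(0)
sample = stats.t.rvs(df=3.0, loc=5.4, scale=6.6, size=100_000, random_state=rng)

# Maximum-likelihood fit of the three-parameter Student's t
df, loc, scale = stats.t.fit(sample)
print(df, loc, scale)  # roughly (3, 5.4, 6.6) for this simulated sample

# The fitted analytic CDF, usable for the probability calculations
# the question asks about
p_below_zero = stats.t.cdf(0.0, df, loc=loc, scale=scale)
```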






answered Jan 5 at 17:31 by wolfies


  • I also found Student's t to work well, per your answer. Additionally, I found the Johnson SU to work well here.
    – James Phillips
    Jan 5 at 22:06




















In short: your two plots show a big discrepancy. The smallest value shown in the histogram is about $-30$, while the Q-Q plot shows values down to around $-900$. All those long-tailed outliers amount to only about 0.7% of the sample, but they dominate the Q-Q plot. So you need to ask yourself what produces those outliers, and that should guide you in what to do with your data. If I make a Q-Q plot after eliminating that long tail, it looks much closer to normal, but not perfect. Look at these:



mean(Y)
[1] 3.9657
mean(Y[Y>= -30])
[1] 4.414797


but the effect on standard deviation is larger:



sd(Y)
[1] 10.92237
sd(Y[Y>= -30])
[1] 8.006223


and that explains the strange form of your first plot (the histogram): the fitted normal curve you show is influenced by the long tail you omitted from the plot.
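The effect described here is easy to reproduce. A small sketch (my illustration; the simulated data stands in for the question's sample, with made-up numbers mimicking a roughly 0.7% contaminated negative tail):

```python
import numpy as np

rng = np.random.default_rng(1)
# Bulk of the data: roughly normal, as in the histogram
bulk = rng.normal(4.4, 8.0, size=99_300)
# A small (~0.7%) long negative tail, as revealed by the Q-Q plot
tail = rng.normal(-200.0, 150.0, size=700)
y = np.concatenate([bulk, tail])

trimmed = y[y >= -30]

# The outliers shift the mean somewhat...
print(y.mean(), trimmed.mean())
# ...but their effect on the standard deviation is much larger
print(y.std(ddof=1), trimmed.std(ddof=1))
```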






answered Jan 5 at 13:03, last edited Jan 7 at 20:33, by kjetil b halvorsen


  • 1
    Thanks, great advice. I understand that a perfect fit of the probability density function is not very feasible. As I just edited my question, my goal is to have an approximate analytic CDF. I am thinking about your advice about the dominating outliers.
    – Anna Noie
    Jan 5 at 14:16


  • 4
    To give better advice, we really need to know the context. What does your variable measure, and what is the goal of modeling?
    – kjetil b halvorsen
    Jan 5 at 14:18




















You might try a Gaussian mixture, which is easy using Mclust in the mclust package for R.



library(mclust)
mc.fit = Mclust(data$V1)
summary(mc.fit, parameters = TRUE)

This gives a three-component Gaussian mixture (8 parameters total), with components



1: N(-69.269908, 6995.71627), p1 = 0.003970506
2: N( -4.314187,  171.76873), p2 = 0.115329209
3: N(  5.380137,   46.26587), p3 = 0.880700285



The log likelihood is -352620.4, which you can use to compare other possible fits such as those suggested.
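To sketch how such a comparison works (my illustration, not part of the answer): the log likelihood of any candidate density is the sum of its log densities over the observations, so different fitted models can be scored on the same data.

```python
import numpy as np
from scipy import stats

def gmm_loglik(y, weights, means, variances):
    """Log likelihood of data y under a Gaussian mixture model."""
    y = np.asarray(y, dtype=float)[:, None]
    # Component densities, one column per mixture component
    dens = stats.norm.pdf(y, loc=means, scale=np.sqrt(variances))
    return float(np.log(dens @ np.asarray(weights)).sum())

# Toy check on simulated data; on real data, compare this number
# against the log likelihood of, say, a fitted Student's t
# (higher is better, penalizing parameter count via AIC/BIC)
rng = np.random.default_rng(3)
y = rng.normal(5.4, 6.8, size=1000)
ll = gmm_loglik(y, [0.12, 0.88], [-4.3, 5.4], [171.8, 46.3])
```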



The long left tail is captured by the first two components, especially the first.



The cumulative distribution estimate at "x" is (in R form)



p1*pnorm(x, -69.269908, sqrt(6995.71627)) + p2*pnorm(x, -4.314187, sqrt(171.76873)) +
  p3*pnorm(x, 5.380137, sqrt(46.26587))



I tried various quantiles (x) from .0001 to .9999 and the accuracy of the estimate seems reasonable to me.
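The same CDF can be written in Python with SciPy (my translation of the R expression above, using the Mclust parameters reported in this answer):

```python
from scipy import stats

# Mixture weights, means, and variances as reported by Mclust
weights   = [0.003970506, 0.115329209, 0.880700285]
means     = [-69.269908, -4.314187, 5.380137]
variances = [6995.71627, 171.76873, 46.26587]

def mixture_cdf(x):
    """CDF of the three-component Gaussian mixture at x."""
    return sum(w * stats.norm.cdf(x, loc=m, scale=v ** 0.5)
               for w, m, v in zip(weights, means, variances))

# A bit above 0.5: the dominant component is centered here, and the
# two smaller components sit mostly to the left
print(mixture_cdf(5.380137))
```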






share|cite|improve this answer









$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "65"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f385728%2ffit-data-to-parametric-distribution%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    3












    $begingroup$

    The histogram, as presented by the OP, gives the impression that the data is symmetrical. Given that the data is noticeably more peaked than Normal, and if the data is roughly symmetrical, then a natural suggestion to try is the Student's t with location parameter $mu$, scale parameter $sigma$, and $v$ degrees of freedom, and pdf $f(x)$:



    $$f = frac{1}{sigma sqrt{v} ; Bleft(frac{v}{2},frac{1}{2}right)} left(frac{v}{v+frac{(x-mu )^2}{sigma ^2}}right)^{frac{v+1}{2}} quad text{defined on the real line}$$



    Student t fit



    The following diagram shows a sample fit using the Student's t, with $mu = 5.45$, $sigma = 6.61$ and $v = 2.97$:



    enter image description here



    In the diagram:




    • the dashed red curve is the fitted Student's t pdf


    • the squiggly blue curve is the empirical pdf (frequency polygon) of the raw data



    On the upside, this appears to be a significantly better fit than the Normal, using the same raw data set provided.



    On the possible downside, I am not sure I would fully agree with the OP's opening statement: "I have data with nice bell-shaped histogram PDF". In particular, if one looks more closely at your data set (which contains 100,000 samples), the maximum is 37.45, while the minimum is -910. Moreover, there is not just one large negative value, but a whole bunch of them. This suggests that your data set is not symmetrical, but negatively skewed ... and that there other things going on in the tails, and if so, other distributions may perhaps be better suited. Zooming out, again with the same Student's t fit, we can see this feature of the data, in the right and left tails:



    enter image description here






    share|cite|improve this answer









    $endgroup$













    • $begingroup$
      I also found student;s t to work well per your answer. Additionally, I found the Johnson SU to work well here.
      $endgroup$
      – James Phillips
      Jan 5 at 22:06
















    3












    $begingroup$

    The histogram, as presented by the OP, gives the impression that the data is symmetrical. Given that the data is noticeably more peaked than Normal, and if the data is roughly symmetrical, then a natural suggestion to try is the Student's t with location parameter $mu$, scale parameter $sigma$, and $v$ degrees of freedom, and pdf $f(x)$:



    $$f = frac{1}{sigma sqrt{v} ; Bleft(frac{v}{2},frac{1}{2}right)} left(frac{v}{v+frac{(x-mu )^2}{sigma ^2}}right)^{frac{v+1}{2}} quad text{defined on the real line}$$



    Student t fit



    The following diagram shows a sample fit using the Student's t, with $mu = 5.45$, $sigma = 6.61$ and $v = 2.97$:



    enter image description here



    In the diagram:




    • the dashed red curve is the fitted Student's t pdf


    • the squiggly blue curve is the empirical pdf (frequency polygon) of the raw data



    On the upside, this appears to be a significantly better fit than the Normal, using the same raw data set provided.



    On the possible downside, I am not sure I would fully agree with the OP's opening statement: "I have data with nice bell-shaped histogram PDF". In particular, if one looks more closely at your data set (which contains 100,000 samples), the maximum is 37.45, while the minimum is -910. Moreover, there is not just one large negative value, but a whole bunch of them. This suggests that your data set is not symmetrical, but negatively skewed ... and that there other things going on in the tails, and if so, other distributions may perhaps be better suited. Zooming out, again with the same Student's t fit, we can see this feature of the data, in the right and left tails:



    enter image description here






    share|cite|improve this answer









    $endgroup$













    • $begingroup$
      I also found student;s t to work well per your answer. Additionally, I found the Johnson SU to work well here.
      $endgroup$
      – James Phillips
      Jan 5 at 22:06














    3












    3








    3





    $begingroup$

    The histogram, as presented by the OP, gives the impression that the data is symmetrical. Given that the data is noticeably more peaked than Normal, and if the data is roughly symmetrical, then a natural suggestion to try is the Student's t with location parameter $mu$, scale parameter $sigma$, and $v$ degrees of freedom, and pdf $f(x)$:



    $$f = frac{1}{sigma sqrt{v} ; Bleft(frac{v}{2},frac{1}{2}right)} left(frac{v}{v+frac{(x-mu )^2}{sigma ^2}}right)^{frac{v+1}{2}} quad text{defined on the real line}$$



    Student t fit



    The following diagram shows a sample fit using the Student's t, with $mu = 5.45$, $sigma = 6.61$ and $v = 2.97$:



    enter image description here



    In the diagram:




    • the dashed red curve is the fitted Student's t pdf


    • the squiggly blue curve is the empirical pdf (frequency polygon) of the raw data



    On the upside, this appears to be a significantly better fit than the Normal, using the same raw data set provided.



    On the possible downside, I am not sure I would fully agree with the OP's opening statement: "I have data with nice bell-shaped histogram PDF". In particular, if one looks more closely at your data set (which contains 100,000 samples), the maximum is 37.45, while the minimum is -910. Moreover, there is not just one large negative value, but a whole bunch of them. This suggests that your data set is not symmetrical, but negatively skewed ... and that there other things going on in the tails, and if so, other distributions may perhaps be better suited. Zooming out, again with the same Student's t fit, we can see this feature of the data, in the right and left tails:



    enter image description here






    share|cite|improve this answer









    $endgroup$



    The histogram, as presented by the OP, gives the impression that the data is symmetrical. Given that the data is noticeably more peaked than Normal, and if the data is roughly symmetrical, then a natural suggestion to try is the Student's t with location parameter $mu$, scale parameter $sigma$, and $v$ degrees of freedom, and pdf $f(x)$:



    $$f = frac{1}{sigma sqrt{v} ; Bleft(frac{v}{2},frac{1}{2}right)} left(frac{v}{v+frac{(x-mu )^2}{sigma ^2}}right)^{frac{v+1}{2}} quad text{defined on the real line}$$



    Student t fit



    The following diagram shows a sample fit using the Student's t, with $mu = 5.45$, $sigma = 6.61$ and $v = 2.97$:



    enter image description here



    In the diagram:




    • the dashed red curve is the fitted Student's t pdf


    • the squiggly blue curve is the empirical pdf (frequency polygon) of the raw data



    On the upside, this appears to be a significantly better fit than the Normal, using the same raw data set provided.



    On the possible downside, I am not sure I would fully agree with the OP's opening statement: "I have data with nice bell-shaped histogram PDF". In particular, if one looks more closely at your data set (which contains 100,000 samples), the maximum is 37.45, while the minimum is -910. Moreover, there is not just one large negative value, but a whole bunch of them. This suggests that your data set is not symmetrical, but negatively skewed ... and that there other things going on in the tails, and if so, other distributions may perhaps be better suited. Zooming out, again with the same Student's t fit, we can see this feature of the data, in the right and left tails:



    enter image description here







    share|cite|improve this answer












    share|cite|improve this answer



    share|cite|improve this answer










    answered Jan 5 at 17:31









    wolfieswolfies

    5,65011722




    5,65011722












    • $begingroup$
      I also found student;s t to work well per your answer. Additionally, I found the Johnson SU to work well here.
      $endgroup$
      – James Phillips
      Jan 5 at 22:06


















    • $begingroup$
      I also found student;s t to work well per your answer. Additionally, I found the Johnson SU to work well here.
      $endgroup$
      – James Phillips
      Jan 5 at 22:06
















    $begingroup$
    I also found student;s t to work well per your answer. Additionally, I found the Johnson SU to work well here.
    $endgroup$
    – James Phillips
    Jan 5 at 22:06




    $begingroup$
    I also found student;s t to work well per your answer. Additionally, I found the Johnson SU to work well here.
    $endgroup$
    – James Phillips
    Jan 5 at 22:06













    3












    $begingroup$

    In short: your two plots show a big discrepancy, the smallest values shown in the histogram is about $-30$, while the qqplot shows values down to around $-900$. All those lom-tailed outliers is about 0.7% of the sample, but dominates the qqplot. So you need to ask yourself what produces those outliers! and that should guide you to what to do with your data. If I make a qqplot after eliminating that long tail, it looks much closer to normal, but not perfect. Look at these:



    mean(Y)
    [1] 3.9657
    mean(Y[Y>= -30])
    [1] 4.414797


    but the effect on standard deviation is larger:



    sd(Y)
    [1] 10.92237
    sd(Y[Y>= -30])
    [1] 8.006223


    and that explains the strange form of your first plot (histogram): the fitted normal curve you shows is influenced by that long tail you omitted from the plot.






    share|cite|improve this answer











    $endgroup$









    • 1




      $begingroup$
      Thanks, great advice. I understand that perfect fitting of the probability density function is not very feasible. As I just edit my question, my goal is to have an approximated analytic CDF. I am thinking about your advice about dominated outliners.
      $endgroup$
      – Anna Noie
      Jan 5 at 14:16






    • 4




      $begingroup$
      To give better advice, we really need to know the context. What does your variable measure, and what is the goal of modeling?
      $endgroup$
      – kjetil b halvorsen
      Jan 5 at 14:18
















    3












    $begingroup$

    In short: your two plots show a big discrepancy, the smallest values shown in the histogram is about $-30$, while the qqplot shows values down to around $-900$. All those lom-tailed outliers is about 0.7% of the sample, but dominates the qqplot. So you need to ask yourself what produces those outliers! and that should guide you to what to do with your data. If I make a qqplot after eliminating that long tail, it looks much closer to normal, but not perfect. Look at these:



    mean(Y)
    [1] 3.9657
    mean(Y[Y>= -30])
    [1] 4.414797


    but the effect on standard deviation is larger:



    sd(Y)
    [1] 10.92237
    sd(Y[Y>= -30])
    [1] 8.006223


    and that explains the strange form of your first plot (histogram): the fitted normal curve you shows is influenced by that long tail you omitted from the plot.






    share|cite|improve this answer











    $endgroup$









    • 1




      $begingroup$
      Thanks, great advice. I understand that perfect fitting of the probability density function is not very feasible. As I just edit my question, my goal is to have an approximated analytic CDF. I am thinking about your advice about dominated outliners.
      $endgroup$
      – Anna Noie
      Jan 5 at 14:16






    • 4




      $begingroup$
      To give better advice, we really need to know the context. What does your variable measure, and what is the goal of modeling?
      $endgroup$
      – kjetil b halvorsen
      Jan 5 at 14:18














    3












    3








    3





    $begingroup$

    In short: your two plots show a big discrepancy, the smallest values shown in the histogram is about $-30$, while the qqplot shows values down to around $-900$. All those lom-tailed outliers is about 0.7% of the sample, but dominates the qqplot. So you need to ask yourself what produces those outliers! and that should guide you to what to do with your data. If I make a qqplot after eliminating that long tail, it looks much closer to normal, but not perfect. Look at these:



    mean(Y)
    [1] 3.9657
    mean(Y[Y>= -30])
    [1] 4.414797


    but the effect on standard deviation is larger:



    sd(Y)
    [1] 10.92237
    sd(Y[Y>= -30])
    [1] 8.006223


    and that explains the strange form of your first plot (histogram): the fitted normal curve you shows is influenced by that long tail you omitted from the plot.






    share|cite|improve this answer











    $endgroup$



    edited Jan 7 at 20:33

























    answered Jan 5 at 13:03









    kjetil b halvorsen

    31.2k983224























    0












    $begingroup$

    You might try a Gaussian mixture, which is easy to fit with the Mclust function in the R package mclust.



    library(mclust)



    mc.fit = Mclust(data$V1)



    summary(mc.fit,parameters=TRUE)



    This gives a three-component Gaussian mixture (8 parameters total), with components



    1: N(-69.269908, 6995.71627), p1 = 0.003970506



    2: N( -4.314187, 171.76873), p2 = 0.115329209



    3: N( 5.380137, 46.26587), p3 = 0.880700285



    The log likelihood is -352620.4, which you can use to compare other possible fits such as those suggested.



    The long left tail is captured by the first two components, especially the first.



    The cumulative distribution estimate at "x" is (in R form)



    p1*pnorm(x, -69.269908, sqrt(6995.71627)) + p2*pnorm(x, -4.314187, sqrt(171.76873))
    + p3*pnorm(x, 5.380137, sqrt(46.26587))



    I tried various quantiles (x) from .0001 to .9999 and the accuracy of the estimate seems reasonable to me.
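    The mixture CDF above can be wrapped as a small R function (a sketch using the Mclust estimates quoted above; the name `pmix` is introduced here for illustration):

```r
# Mclust estimates quoted above: weights, means, and variances of the 3 components
p  <- c(0.003970506, 0.115329209, 0.880700285)
mu <- c(-69.269908, -4.314187, 5.380137)
s  <- sqrt(c(6995.71627, 171.76873, 46.26587))  # pnorm takes sd, not variance

# Mixture CDF at a scalar x: weighted sum of the component normal CDFs
pmix <- function(x) sum(p * pnorm(x, mean = mu, sd = s))

pmix(0)                          # estimated P(X <= 0)
sapply(c(-30, 0, 5, 20), pmix)   # evaluate at several x via sapply
```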















    $endgroup$


















        answered Jan 10 at 16:20









        Peter Westfall

        940412


































            Thanks for contributing an answer to Cross Validated!

