Simple Statistics/Probability Problem
I have used a Python script to identify target sequences in a DNA sequence file.
There are two classes of sequence: coding and non-coding. I have identified $728$ sequences of interest: $597$ of these fall into the coding regions and $131$ fall into the non-coding regions. That is $18\%$ non-coding, but the non-coding regions make up only $13\%$ of the sequence file.
Is there a statistical tool to demonstrate that the Python script identified target sequences in a non-random fashion?
If the script had identified sequences that were randomly distributed, then about $13\%$ of the $728$ sequences would have been found in the non-coding regions. This seems like it should be reliable.
I hope my question is clear.
probability statistics biology
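As a quick arithmetic check of the figures in the question, here is a minimal Python sketch. It assumes the 13% non-coding fraction is exact, and the variable names are purely illustrative.

```python
# Counts reported in the question.
n_total = 728
n_coding = 597
n_noncoding = 131

# Assumed background fraction of non-coding sequence (13%, as quoted in the question).
p_noncoding = 0.13

# Observed share of target sequences that fall in non-coding regions.
observed_fraction = n_noncoding / n_total

# Counts expected if the targets were scattered at random over the genome.
expected_noncoding = p_noncoding * n_total
expected_coding = (1 - p_noncoding) * n_total

print(f"observed non-coding fraction: {observed_fraction:.4f}")
print(f"expected non-coding count:    {expected_noncoding:.2f}")
print(f"expected coding count:        {expected_coding:.2f}")
```

The gap between the observed count of 131 and the expected count is what a significance test has to judge against ordinary sampling variation.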
edited Jan 7 at 22:36 by Zacky
asked Jan 7 at 21:30 by Ryan_J_Hope
I tried to make your question look better; is it okay how I did it? Also, can you clarify a few things? Did you already know, before the experiment, that the non-coding region makes up $13\%$ of the sequence file? And does that mean you expected only about $95$ non-coding sequences ($13\%$ of $728$) instead of $131$?
– Zacky, Jan 7 at 22:39
Are the coding and non-coding sequences the same length?
– N. F. Taussig, Jan 7 at 22:41
Yes, the calculation was conducted by another person and is taken from the scientific literature. Although this figure is important, an error in that calculation could account for my discrepancy. Nevertheless, I am trying to identify specific sequences and I want to be sure that they are not just background noise. If the sequences were noise, I would expect them to be evenly distributed across the whole genomic sequence file, so I would find 13% of my target sequences in the non-coding region and 87% in the coding region.
– Ryan_J_Hope, Jan 7 at 22:42
The coding and non-coding sequences are not the same length, although I don't see why this would affect the result. The entire genome is made up of 13% non-coding and 87% coding sequence. There are 3871 coding sections separated by intergenic non-coding sections.
– Ryan_J_Hope, Jan 7 at 22:49
This is more a statistics question than a mathematical one. You might get better answers by posting to Cross Validated (stats.stackexchange.com).
– awkward, Jan 8 at 15:19
1 Answer
Your null hypothesis is $H_0: p = 0.13$ against the alternative
$H_a: p \ne 0.13,$ where $p = P(\text{Non Coding}).$
You observe $X = 131$ non-coding sequences among $n = 728$ observed,
which gives you $\hat p = 131/728 = 0.1799$ as the observed frequency.
Because the observed frequency is substantially different from $p = 0.13$
you wonder whether this might have been an 'unlucky' draw, or whether
you have statistically significant evidence that the method of sampling is unfair.
This is called a "one-sample binomial test". Often this test is done by
using a normal approximation to the binomial distribution. You can find
that method in elementary statistics textbooks. The output below from
Minitab statistical software uses the binomial distribution to give an
exact P-value. [It seems that SciPy also implements a version of this test, but I have not tried it.]
If the P-value is less than 5%, one says that the null
hypothesis is rejected at the 5% level of significance. Here the P-value
is printed as 0.000, which means that the P-value is smaller than 0.0005.
So it is extremely unlikely that an unbiased draw would give an observed
proportion of non-coding sequences so far from $p = 0.13.$
[Note that the Minitab output and the R computation below were run with $723$ rather than $728$ trials, so the output shows a sample proportion of $0.1812$; the difference is far too small to change the conclusion.]

    Test and CI for One Proportion
    Test of p = 0.13 vs p ≠ 0.13
    Exact
    Sample    X    N  Sample p                95% CI  P-Value
    1       131  723  0.181189  (0.153769, 0.211239)    0.000

Another way to interpret the output is that a 95% confidence interval
for $p$ is $(0.154, 0.211),$ which is roughly centered at the observed proportion but
does not contain $p = 0.13.$ Thus it is difficult to believe that
the true proportion of non-coding sequences under the sampling procedure is $p = 0.13.$
Note: Yet another approach is to note that quantiles .025 and .975 of
the 'null distribution' $\mathsf{Binom}(n = 723, p = 0.13)$ are 77 and 112, respectively. Thus the observed value $X = 131$ falls considerably
above the upper 'critical value' of the null distribution for a two-sided test at the 5% level. (Computation in R.)

    qbinom(c(.025,.975), 723, .13)
    [1] 77 112
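For readers who want to reproduce the exact test in Python, here is a minimal sketch that uses only the standard library. It assumes one common convention for the two-sided P-value (add up the null probabilities of every outcome that is no more likely than the observed one); Minitab may define its exact P-value slightly differently, so treat the result as an illustration rather than a replica of the output above. The function names are hypothetical. In practice `scipy.stats.binomtest` is the library routine for this test.

```python
from math import exp, lgamma, log


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), computed on the log scale (needs 0 < p < 1)."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + k * log(p) + (n - k) * log(1 - p))


def exact_two_sided_pvalue(x: int, n: int, p0: float) -> float:
    """Sum the null probabilities of all outcomes no more likely than x."""
    p_obs = binom_pmf(x, n, p0)
    # Small relative tolerance so that ties are not lost to rounding error.
    cutoff = p_obs * (1 + 1e-7)
    total = sum(
        pm for pm in (binom_pmf(k, n, p0) for k in range(n + 1)) if pm <= cutoff
    )
    return min(1.0, total)


# Figures from the question: 131 non-coding hits out of 728, null value 0.13.
x, n, p0 = 131, 728, 0.13
p_value = exact_two_sided_pvalue(x, n, p0)

print(f"observed proportion     = {x / n:.4f}")
print(f"exact two-sided P-value = {p_value:.2e}")
```

Working on the log scale avoids overflow in the binomial coefficient and underflow in the powers when $n$ is in the hundreds.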
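The normal approximation and the critical-value approach mentioned in the answer can also be sketched in standard-library Python. This assumes the usual $z$ statistic without a continuity correction, and it defines the binomial quantile as the smallest $k$ with $P(X \le k) \ge q$, which is how R documents `qbinom`. The helper names are hypothetical.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), computed on the log scale (needs 0 < p < 1)."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + k * log(p) + (n - k) * log(1 - p))


def binom_quantile(q: float, n: int, p: float) -> int:
    """Smallest k with P(X <= k) >= q."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p)
        if cumulative >= q:
            return k
    return n


x, n, p0 = 131, 728, 0.13

# Normal approximation: standardize the observed count under the null.
mean = n * p0
sd = sqrt(n * p0 * (1 - p0))
z = (x - mean) / sd
p_normal = 2 * (1 - NormalDist().cdf(abs(z)))

# Critical values of the null distribution for a two-sided 5% test.
lower = binom_quantile(0.025, n, p0)
upper = binom_quantile(0.975, n, p0)

print(f"z = {z:.3f}, approximate two-sided P-value = {p_normal:.2e}")
print(f"critical values: {lower} and {upper}; observed X = {x}")
```

Both routes lead to the same decision as the exact test: the observed count lies well above the upper critical value, and the approximate P-value is far below 5%.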
edited Jan 11 at 8:58
answered Jan 11 at 8:38 by BruceET