Simple Statistics/Probability Problem

I have used a Python script to identify target sequences in a DNA sequence file.

There are two classes of sequence: coding and non-coding. I have identified $728$ sequences of interest. $597$ of these fall into the coding regions and $131$ fall into the non-coding regions. That is $18\%$ non-coding, whereas the non-coding regions make up only $13\%$ of the sequence file.

Is there a statistical tool to demonstrate that the Python script identified target sequences in a non-random way?

If the script identified sequences that were randomly distributed, then about $13\%$ of the $728$ sequences would have been found in the non-coding region. With a sample of this size, the comparison seems like it should be reliable.

I hope my question is clear.

edited Jan 7 at 22:36

Zacky


asked Jan 7 at 21:30

Ryan_J_Hope


I tried to make your question look better; is it okay how I did it? Also, can you clarify a few things: did we already know, before the experiment, that the total non-coding region in the sequence file is $13\%$? And does that mean you expected only about $94$ non-coding sequences (equivalent to $13\%$) instead of $131$?
– Zacky
Jan 7 at 22:39

Are the coding and non-coding sequences the same length?
– N. F. Taussig
Jan 7 at 22:41

Yes, the calculation was conducted by another person and is referenced from the scientific literature. Although this is important, an error in that calculation could account for my discrepancy. Nevertheless, I am trying to identify specific sequences and I want to be sure that the sequences are not just background noise. If the sequences were noise, then I would expect them to be evenly distributed across the whole genomic sequence file, and I would find 13% of my target sequences in the non-coding region and 87% in the coding region.
– Ryan_J_Hope
Jan 7 at 22:42

The coding sequences and non-coding sequences are not the same length, although I don't see why this would affect the result. The entire genome is made up of 13% non-coding and 87% coding. There are 3871 coding sections separated by intergenic non-coding sections.
– Ryan_J_Hope
Jan 7 at 22:49

This is more a statistics question than a mathematical one. You might get better answers by posting to Cross Validated (stats.stackexchange.com).
– awkward
Jan 8 at 15:19



probability statistics biology

1 Answer

Your null hypothesis is $H_0: p = 0.13$ against the alternative
$H_a: p \ne 0.13,$ where $p = P(\text{Non Coding}).$
You observe $X = 131$ non-coding sequences among $n = 728$ observed,
which gives you $\hat p = 131/728 \approx 0.180$ as the observed frequency.
(The software output further down was run with $N = 723$ rather than $728,$
which is why it reports $\hat p = 0.1812;$ the difference is too small to affect the conclusion.)
Because the observed frequency is substantially different from $p = 0.13,$
you wonder whether this might have been an 'unlucky' draw, or whether
you have statistically significant evidence that the method of sampling is unfair.

This is called a "one-sample binomial test". Often this test is done by
using a normal approximation to the binomial distribution. You can find
that method in elementary statistics textbooks. The output below from
Minitab statistical software uses the binomial distribution to give an
exact P-value. [It seems that SciPy also implements a version of this test, but I have not tried it.]
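The normal-approximation version mentioned above can be sketched in a few lines of Python using only the standard library. This is an illustration, not the Minitab computation: the function name `one_sample_z_test` is made up for this example, no continuity correction is applied, and it uses $n = 728$ from the question.

```python
import math

def one_sample_z_test(x, n, p0):
    """Two-sided z-test for one proportion (normal approximation, no continuity correction)."""
    expected = n * p0                      # mean of Binom(n, p0) under H0
    sd = math.sqrt(n * p0 * (1 - p0))      # standard deviation under H0
    z = (x - expected) / sd
    # Two-sided tail area of the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = one_sample_z_test(131, 728, 0.13)
print(f"z = {z:.3f}, two-sided P-value = {p_value:.2e}")
```

The observed count lies roughly four standard deviations above the expected count of about 94.6, so the approximate P-value is far below 5%.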

If the P-value is less than 5%, one says that the null
hypothesis is rejected at the 5% level of significance. Here the P-value
is printed as 0.000 which means that the P-value is smaller than 0.0005.
So it is extremely unlikely that an unbiased draw would give an observed
proportion of non-coding sequences so far from $p = 0.13.$
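An exact two-sided P-value can also be computed directly from the binomial distribution. The sketch below assumes one common convention for "two-sided" (add up the probabilities of all outcomes that are no more likely than the observed one); Minitab's exact method may differ in detail. The helper names `binom_pmf` and `exact_binomial_p_value` are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binom(n, p), 0 < p < 1, computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def exact_binomial_p_value(x, n, p0):
    """Two-sided exact P-value: total probability of outcomes no more likely than x under H0."""
    # Small relative tolerance so that exact ties are not lost to rounding error.
    cutoff = binom_pmf(x, n, p0) * (1 + 1e-7)
    total = 0.0
    for k in range(n + 1):
        prob = binom_pmf(k, n, p0)
        if prob <= cutoff:
            total += prob
    return min(1.0, total)

p_value = exact_binomial_p_value(131, 728, 0.13)
print(f"exact two-sided P-value = {p_value:.2e}")
print("reject H0 at the 5% level:", p_value < 0.05)
```

Recent SciPy versions offer `scipy.stats.binomtest` for the same purpose; the hand-rolled version is used here only so the example has no dependencies.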

Test and CI for One Proportion

Test of p = 0.13 vs p ≠ 0.13

                                                    Exact
Sample    X    N  Sample p         95% CI         P-Value
1       131  723  0.181189  (0.153769, 0.211239)    0.000

Another way to interpret the output is that a 95% confidence interval
for $p$ is $(0.154, 0.211),$ which contains the sample proportion $\hat p = 0.1812$ but
does not contain $p = 0.13.$ Thus it is difficult to believe that
the true proportion for this sampling procedure is $p = 0.13.$
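An exact interval of this kind can be sketched in Python without any statistics library. The code assumes the interval is of the Clopper-Pearson type (the usual "exact" binomial interval) and finds each endpoint by bisection on the binomial tail probability; the function names are made up for illustration. With $X = 131$ and $N = 723$ it should come out close to the interval printed in the output above.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binom(n, p), 0 < p < 1."""
    total = 0.0
    for i in range(min(k, n) + 1):
        total += math.exp(math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                          + i * math.log(p) + (n - i) * math.log1p(-p))
    return min(1.0, total)

def bisect_root(f, lo, hi, iterations=60):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, conf=0.95):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    alpha = 1 - conf
    eps = 1e-12
    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else bisect_root(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2, eps, 1 - eps)
    # Upper limit: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else bisect_root(
        lambda p: binom_cdf(x, n, p) - alpha / 2, eps, 1 - eps)
    return lower, upper

lower, upper = clopper_pearson(131, 723)
print(f"95% CI: ({lower:.6f}, {upper:.6f})")
print("interval contains 0.13:", lower <= 0.13 <= upper)
```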

Note: Yet another approach is to note that quantiles .025 and .975 of
the 'null distribution' $\mathsf{Binom}(n = 723, p = 0.13)$ are 77 and 112, respectively. Thus the observed value $X = 131$ falls considerably
above the upper 'critical value' of the null distribution for a two-sided test at the 5% level. (Computation in R.)

 qbinom(c(.025,.975), 723, .13)

 [1]  77 112
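For readers working in Python rather than R, the same critical values can be sketched with the standard library. The function `binom_quantile` is a hypothetical helper that follows the `qbinom` convention (the smallest $k$ with $P(X \le k) \ge q$); it is not part of any library.

```python
import math

def binom_quantile(q, n, p):
    """Smallest k with P(X <= k) >= q for X ~ Binom(n, p), 0 < p < 1."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                               + k * math.log(p) + (n - k) * math.log1p(-p))
        if cumulative >= q:
            return k
    return n

# Same inputs as the R call above.
lower_crit = binom_quantile(0.025, 723, 0.13)
upper_crit = binom_quantile(0.975, 723, 0.13)
print(lower_crit, upper_crit)
print("X = 131 is beyond the upper critical value:", 131 > upper_crit)
```

Passing `728` instead of `723` gives the critical values for the sample size stated in the question.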

edited Jan 11 at 8:58

answered Jan 11 at 8:38

BruceET

