Relation between Shannon Entropy and Total Variation distance
Let $p_1(\cdot), p_2(\cdot)$ be two discrete distributions on $\mathbb{Z}$. The total variation distance is defined as $d_{TV}(p_1,p_2)= \frac{1}{2} \displaystyle\sum_{k \in \mathbb{Z}}|p_1(k)-p_2(k)|$ and Shannon entropy is defined the usual way, i.e.
$$
H(p_1)=\sum_k p_1(k) \log\frac{1}{p_1(k)}.
$$
The binary entropy function $h(\cdot)$ is defined by $h(x)=x \log(1/x)+(1-x)\log(1/(1-x)), \ \forall x \in (0,1)$.
I am trying to prove that $H\left(\frac{p_1+p_2}{2}\right)-\frac{1}{2}H(p_1)-\frac{1}{2}H(p_2) \leq h\left(d_{TV}(p_1,p_2)/2\right)$. Can anyone guide me in this direction?
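For concreteness, here is a small Python sketch that computes both sides of the claimed inequality for two distributions on a finite support (the two distributions below are arbitrary examples I picked, not part of the problem):

```python
# Quick numerical check of the claimed inequality for two example distributions.
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with the convention 0 * log(1/0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def binary_entropy(x):
    return entropy([x, 1.0 - x])

p1 = np.array([0.5, 0.3, 0.2, 0.0])   # arbitrary example distributions
p2 = np.array([0.1, 0.2, 0.3, 0.4])

lhs = entropy((p1 + p2) / 2) - 0.5 * entropy(p1) - 0.5 * entropy(p2)
d_tv = 0.5 * np.sum(np.abs(p1 - p2))
rhs = binary_entropy(d_tv / 2)

print(lhs, rhs, lhs <= rhs)   # roughly 0.2215  0.5623  True
```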
probability-theory measure-theory information-theory
asked Oct 15 '15 at 23:27 by pikachuchameleon, edited Oct 17 '15 at 2:41
Out of curiosity, where did that question arise?
– Clement C.
Oct 15 '15 at 23:47
I would write of a function $h$ rather than of a function $h(\cdot)$, reserving the parentheses to express a value of the function at some argument, as in "$h(x) = \text{some expression depending on } x$". However, if you feel strongly that you need the parentheses, the proper notation is $h(\cdot)$ rather than $h(.)$. I edited accordingly.
– Michael Hardy
Oct 15 '15 at 23:53
@ClementC.: The exact problem statement is as follows: $X \sim \mathrm{Bern}(0.5)$, $\mathbb{P}(Y=k \mid X=0)=p_1(k)$, $\mathbb{P}(Y=k \mid X=1)=p_2(k)$. I am trying to prove $I(X;Y) \leq h(d_{TV}(p_1,p_2))$.
– pikachuchameleon
Oct 16 '15 at 2:37
@AshokVardhan I am deleting my previous comments, since they are no longer relevant to the question after the correction/edit you made. On a side note, I wonder if looking at the other expression of TV, namely $\sup_S (p_1(S) - p_2(S))$, would help as a first step.
– Clement C.
Oct 17 '15 at 14:22
Without some assumptions on the entropies of $p_1,p_2$, it seems that what you are trying to prove may lead into trouble, because the right-hand side of the inequality, namely $h(d_{TV}(p_1,p_2)/2)$, is always finite (clearly $d_{TV}(p_1,p_2)\leq 1$), but the left-hand side can be infinite. For example, take $p_1$ with $H(p_1)=\infty$ and a second distribution $p_2$ for which $H(p_2)$ is finite. You get $$\infty-\frac{1}{2}\infty-\frac{1}{2}H(p_2)\leq C$$ for some positive number $C>0$, which does not make much sense.
– uniquesolution
Oct 23 '15 at 21:44
1 Answer
The second part below is rather inelegant, and I think it can probably be improved. Suggestions are welcome.
Note that the LHS is the Jensen-Shannon divergence ($\mathrm{JSD}$) between $P_1$ and $P_2$, and that $\mathrm{JSD}$ is an $f$-divergence. For $f$-divergences generated by $f$ and $g$, the joint ranges of $D_f, D_g$ are defined as
\begin{align}
\mathbb{R}^2 \supset \mathcal{R} :=&\ \{ (D_f(P\|Q), D_g(P\|Q)) : P, Q \textrm{ are distributions on some measurable space} \}, \\
\mathcal{R}_k :=&\ \{ (D_f(P\|Q), D_g(P\|Q)) : P, Q \textrm{ are distributions on } ([1:k], 2^{[1:k]}) \}.
\end{align}
A remarkable theorem of Harremoës & Vajda (see also these notes by Wu) states that for any pair of $f$-divergences, $$\mathcal{R} = \mathrm{co}(\mathcal{R}_2),$$ where $\mathrm{co}$ is the convex hull operator.
Now, we want to show the relation $\mathrm{JSD} \le h(d_{TV}/2)$. Since both $\mathrm{JSD}$ and $d_{TV}$ are $f$-divergences, and since the set $\mathcal{S} := \{(x,y) : y - h(x/2) \le 0\}$ is convex in $\mathbb{R}^2$ (it is the hypograph of the concave function $x \mapsto h(x/2)$), it suffices to show this inequality for distributions on $2$ symbols: by convexity, $\mathcal{R}_2 \subset \mathcal{S} \implies \mathrm{co}(\mathcal{R}_2) \subset \mathcal{S}$, as the convex hull of a set is the intersection of all convex sets containing it. The remainder of this answer will thus concentrate on showing $\mathcal{R}_2 \subset \mathcal{S}$.
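As a numerical sanity check of this reduction (not part of the argument), here is a short Python sketch that samples random pairs of distributions on alphabets of several sizes (the Dirichlet sampling and alphabet sizes are my own choices) and verifies that every resulting point $(d_{TV}, \mathrm{JSD})$ lies in $\mathcal{S}$:

```python
# Sample random distribution pairs on alphabets of size 2..6 and check
# that JSD <= h(d_TV / 2) in every case (illustration only).
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))   # natural log, 0 log 0 := 0

def h(x):
    return entropy(np.array([x, 1.0 - x]))

def jsd(p, q):
    return entropy((p + q) / 2) - 0.5 * entropy(p) - 0.5 * entropy(q)

for k in range(2, 7):
    for _ in range(2000):
        p, q = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
        d_tv = 0.5 * np.sum(np.abs(p - q))
        assert jsd(p, q) <= h(d_tv / 2) + 1e-12
print("all sampled (d_TV, JSD) points satisfy JSD <= h(d_TV/2)")
```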
Let $p := \pi + \delta$, $q := \pi - \delta$, where $\delta \in [0,1/2]$ and $\pi \in [\delta, 1-\delta]$. We will show that $$ \mathrm{JSD}(\mathrm{Bern}(p)\|\mathrm{Bern}(q)) \le h\left(\tfrac{1}{2}\,d_{TV}(\mathrm{Bern}(p)\|\mathrm{Bern}(q))\right) = h(\delta), \tag{1}$$ which suffices to show the relation on $2$-letter distributions. Note that above $p \ge q$ always, but this doesn't matter since both $\mathrm{JSD}$ and $d_{TV}$ are symmetric in their arguments.
For conciseness I'll represent the $\mathrm{JSD}$ above by $J$. All '$\log$'s in the following are natural, and we will make use of the simple identities, for $p \in (0,1)$, $$ \frac{\mathrm{d}}{\mathrm{d}p} h(p) = \log \frac{1-p}{p}, \qquad \frac{\mathrm{d}^2}{\mathrm{d}p^2} h(p) = -\frac{1}{p} - \frac{1}{1-p}. $$
By the expansion in the question, $$J(\pi, \delta) = h(\pi) - \frac{1}{2} h(\pi + \delta) - \frac{1}{2} h(\pi - \delta).$$
It is trivial to see that the relation $(1)$ holds if $\delta = 0$. Let us thus assume that $\delta > 0$. For $\pi \in (\delta, 1-\delta)$, we have
\begin{align} \frac{\partial}{\partial \pi} J &= \log \frac{1-\pi}{\pi} - \frac{1}{2} \left( \log \frac{1 - \pi - \delta}{\pi + \delta} + \log \frac{1 - \pi + \delta}{\pi - \delta}\right) \end{align} and \begin{align} \frac{\partial^2}{\partial \pi^2} J &= \frac{1}{2} \left( \frac{1}{\pi + \delta} + \frac{1}{\pi - \delta} + \frac{1}{1 - \pi - \delta} + \frac{1}{1 - \pi + \delta} \right) - \frac{1}{\pi} - \frac{1}{1-\pi} \\
&= \frac{\pi}{\pi^2 - \delta^2} - \frac{1}{\pi} + \frac{1 - \pi}{(1-\pi)^2 - \delta^2} - \frac{1}{1-\pi} \\
&= \frac{\delta^2}{\pi(\pi^2 - \delta^2)} + \frac{\delta^2}{(1-\pi)\left((1-\pi)^2 - \delta^2\right)} > 0,
\end{align}
where the final inequality uses $\delta > 0$ and that $\pi \in (\delta, 1-\delta)$.
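For what it's worth, the algebra above can be double-checked symbolically. The following sketch (assuming sympy is available) verifies that $\partial^2 J/\partial \pi^2$ indeed simplifies to the last expression:

```python
# Symbolic check that the second derivative of J in pi equals the claimed form.
import sympy as sp

pi_, delta = sp.symbols('pi delta', positive=True)

def h(x):
    return x * sp.log(1 / x) + (1 - x) * sp.log(1 / (1 - x))

J = h(pi_) - sp.Rational(1, 2) * h(pi_ + delta) - sp.Rational(1, 2) * h(pi_ - delta)

d2 = sp.diff(J, pi_, 2)
claimed = delta**2 / (pi_ * (pi_**2 - delta**2)) \
        + delta**2 / ((1 - pi_) * ((1 - pi_)**2 - delta**2))

print(sp.simplify(d2 - claimed))   # expected output: 0
```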
As a consequence, for every fixed $\delta > 0$, $J$ is strictly convex in $\pi$ on $(\delta, 1-\delta)$. Since a convex function on an interval attains its maximum at an endpoint (here using that $J$ extends continuously to $[\delta, 1-\delta]$), we have $$ J(\pi, \delta) \le \max\left( J(\delta, \delta), J(1-\delta, \delta) \right).$$
But $$J(\delta, \delta) = h(\delta) - \frac{1}{2} \left(h(2\delta) + h(0)\right) = h(\delta) - \frac{1}{2} h(2\delta),$$ and similarly $$J(1-\delta, \delta) = h(1-\delta) - \frac{1}{2} h(1-2\delta) = h(\delta) - \frac{1}{2} h(2\delta)$$ by the symmetry of $h$ (and since $h(0) = h(1) = 0$). We immediately get that for every $\delta \in [0,1/2]$ and $\pi \in [\delta, 1-\delta]$, $$J(\pi, \delta) \le h(\delta) - \frac{1}{2} h(2\delta) \le h(\delta),$$ finishing the argument.
Note that the last line gives something stronger for $2$-symbol distributions: $J(\pi, \delta) \le h(\delta) - h(2\delta)/2$. Unfortunately the RHS is a convex function of $\delta$, so this doesn't directly extend to all alphabets via the joint-range argument. It would be interesting to know whether a bound with this advantage can be shown in general.
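A brute-force check of this stronger two-symbol bound (again just an illustration; the grid resolution is arbitrary): the sketch below evaluates $J(\pi,\delta)$ on a grid and confirms it never exceeds $h(\delta) - h(2\delta)/2$.

```python
# Grid check of J(pi, delta) <= h(delta) - h(2*delta)/2 for two-symbol distributions.
import numpy as np

def h(x):
    """Binary entropy in nats with the convention 0 * log(1/0) = 0."""
    terms = [t for t in (x, 1.0 - x) if t > 0]
    return -sum(t * np.log(t) for t in terms)

def J(pi_, delta):
    return h(pi_) - 0.5 * h(pi_ + delta) - 0.5 * h(pi_ - delta)

worst = -np.inf
for delta in np.linspace(0.0, 0.5, 101):
    for pi_ in np.linspace(delta, 1.0 - delta, 101):
        worst = max(worst, J(pi_, delta) - (h(delta) - 0.5 * h(2 * delta)))
print("max gap over the grid:", worst)   # ~0 up to floating-point noise, i.e. the bound holds
```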
answered Dec 10 '18 at 20:45 by stochasticboy321, edited Dec 10 '18 at 21:50