Deriving Bayesian logistic regression
I'm attempting to understand Bayesian logistic regression clearly, and I'm uncertain about (among other things) what is the most clear or most correct notation to use. Here is my current attempt at explaining Bayesian logistic regression:
Suppose that we have a large population of objects (such as emails), each of which is described
by a feature vector $x \in \mathbb R^p$ and a class label $y \in \{0,1\}$ (spam or not spam, for example).
Consider the experiment of selecting an item at random from this population.
Let $X$ and $Y$ be the feature vector and the label of the selected item (so $X$ and $Y$ are random variables).
The goal of logistic regression is to find an approximation to the function
$$
x \mapsto P(Y = 1 \mid X = x).
$$
To achieve this goal, logistic regression makes a modeling assumption that there exists a vector $a \in \mathbb R^p$
and a scalar $b \in \mathbb R$
such that
$$
P(Y = 1 \mid X = x) = S(a^T x + b),
$$
where $S$ is the sigmoid function defined by
$$
S(u) = \frac{e^u}{1 + e^u}.
$$
Logistic regression computes the parameters $a$ and $b$ using maximum likelihood estimation.
However, we can also take a Bayesian approach to estimating $a$ and $b$ by incorporating prior beliefs about the values of $a$ and $b$.
The parameters $a$ and $b$ are now viewed as being random variables.
To be concrete, let's make the assumption that $b$ and the components of $a$ are independent and normally distributed with mean $0$ and variance $\sigma^2$. The joint PDF of $a$ and $b$ will be denoted by $f_{a,b}$.
Consider the experiment of selecting $n$ objects at random (with replacement) from the population. Let $X_i$ and $Y_i$ be the feature vector and corresponding class label for the $i$th object selected (so $X_i$ and $Y_i$ are random variables). Let $x_i$ and $y_i$ be the specific values of $X_i$ and $Y_i$ that we observe when we collect our training data.
[Here is where I think the notation gets difficult.]
Let $f$ be the PDF of the random variable $(X_1,Y_1,\ldots,X_n,Y_n)$. Let $f_{X_i,Y_i}$ be the joint PDF of $X_i$ and $Y_i$, and let $f_{X_i}$ be the PDF of $X_i$.
According to Bayes' theorem, we have
\begin{align*}
&\quad f_{a,b}(\hat a, \hat b \mid (X_i,Y_i) = (x_i,y_i) \text{ for } i = 1,\ldots, n) \\
\propto &\quad f(x_1,y_1,\ldots,x_n,y_n \mid a = \hat a, b = \hat b)\, f_{a,b}(\hat a, \hat b) \\
=&\quad f_{a,b}(\hat a, \hat b) \prod_{i=1}^n f_{X_i,Y_i}(x_i,y_i \mid a = \hat a, b = \hat b) \\
=&\quad f_{a,b}(\hat a, \hat b) \prod_{i=1}^n P(Y_i = y_i \mid X_i = x_i, a = \hat a, b = \hat b)\, f_{X_i}(x_i \mid a = \hat a, b = \hat b) \\
=&\quad f_{a,b}(\hat a, \hat b) \prod_{i=1}^n P(Y_i = y_i \mid X_i = x_i, a = \hat a, b = \hat b)\, f_{X_i}(x_i).
\end{align*}
In the last step, we assumed that $X_i, a$, and $b$ are independent.
Now taking the natural log of the final expression and omitting terms that do not depend on $\hat a$ or $\hat b$, we see that a maximum a posteriori estimate of $a$
and $b$ can be found by maximizing
$$
\frac{-\| \hat a \|^2 - \hat b^2}{2 \sigma^2} + \sum_{i=1}^n \log(S(\hat a^T x_i + \hat b))
$$
with respect to $\hat a$ and $\hat b$.
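For completeness, the first term above is just the log of the joint Gaussian prior, up to an additive constant:
$$
\log f_{a,b}(\hat a, \hat b) = \log\left( (2\pi\sigma^2)^{-(p+1)/2} \exp\!\left( \frac{-\|\hat a\|^2 - \hat b^2}{2\sigma^2} \right) \right) = \text{const} + \frac{-\|\hat a\|^2 - \hat b^2}{2\sigma^2}.
$$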
This is a convex optimization problem, which can be solved using a method such as gradient ascent. (Further simplifications to the objective function are possible, but I'm not concerned with how to explain the remaining steps.)
Here are some questions:
- Is this derivation correct? Do you see any errors?
- How can the notation be improved, clarified, or simplified? I want the notation to be perfectly correct (by the standards of a mathematician) but also clear.
- I think the statement
Let $f$ be the PDF of the random variable $(X_1,Y_1,\ldots,X_n,Y_n)$
is incorrect because the random variables $Y_i$ are not continuous, so the term "PDF" can't be
used here. How should this statement be rephrased?
I'd be interested in hearing any comments or opinions about how to clarify the above derivation, including comments about picky details or tangential comments.
1 Answer
Your explanation of Bayesian logistic regression looks pretty reasonable. A few remarks:
1. Using different priors for $a$ and $b$. Can I suggest that we use different variance hyperparameters $\sigma^2_a$ and $\sigma^2_b$ for $\mathbf a$ and $b$? Since $b$ is the intercept, you probably want to pick quite a flat prior for $b$, by picking a large value for $\sigma^2_b$.
It would also make notation simpler later on if we incorporate $b$ into $\mathbf a$, as an extra component. In other words, $\mathbf a_{\rm mine} = (\mathbf a_{\rm yours}, b) \in \mathbb R^{p+1}$. This is valid, provided that we also add an additional constant component to each of the feature vectors $\mathbf x_i$, i.e. $\mathbf x_{i,{\rm mine}} = (\mathbf x_{i,{\rm yours}}, 1) \in \mathbb R^{p + 1}$.
The prior covariance for $\mathbf a$ is then a $(p+1)\times(p+1)$ diagonal matrix $\boldsymbol\Sigma$, whose first $p$ diagonal entries are $\sigma_a^2$ and whose final entry is $\sigma_b^2$.
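For concreteness, here is a minimal NumPy sketch of this bookkeeping. The stand-in data and the names `X_raw`, `y`, `sigma_a`, `sigma_b` are purely illustrative assumptions, not part of the original post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: N examples, p raw features, binary labels.
N, p = 100, 3
X_raw = rng.normal(size=(N, p))
y = (rng.random(N) < 0.5).astype(float)

# Absorb the intercept: append a constant 1 to every feature vector.
X = np.hstack([X_raw, np.ones((N, 1))])           # shape (N, p + 1)

# Diagonal prior covariance: sigma_a^2 for the weights, sigma_b^2 for the intercept.
sigma_a, sigma_b = 1.0, 10.0
Sigma = np.diag([sigma_a**2] * p + [sigma_b**2])  # (p+1) x (p+1)
Sigma_inv = np.linalg.inv(Sigma)
```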
2. The $\mathbf x_i$'s don't need to be random variables. The presentation might be clearer if we abandon the probability distributions for the $\mathbf x_i$'s, as they are unimportant for the discussion. What is important is that the model is governed by the parameter vector $\mathbf a \in \mathbb R^{p+1}$, and a priori, this parameter vector is normally distributed:
$$ P(\mathbf a) = \mathcal N(\mathbf a \mid \mathbf 0, \boldsymbol\Sigma). $$
Given the value of $\mathbf a$, and given the feature vectors $\mathbf x_i$, the labels $y_i$ are independent of each other, and follow Bernoulli distributions of the form:
$$ P(y_i \mid \mathbf x_i, \mathbf a) = \begin{cases} s(\mathbf a \cdot \mathbf x_i) & \text{if } y_i = 1 \\ 1 - s(\mathbf a \cdot \mathbf x_i) & \text{if } y_i = 0,\end{cases}$$
where $s(u) := e^u / (1 + e^u)$ is the logistic function.
The joint distribution of the parameters and the labels is then
$$ P(\mathbf a, y_1, \dots, y_N \mid \mathbf x_1, \dots, \mathbf x_N) = P(\mathbf a)\prod_{i=1}^N P(y_i \mid \mathbf x_i, \mathbf a).$$
And by the way, "joint probability distribution" is probably a more appropriate phrase than "PDF".
3. The expression for the log of the posterior distribution. Your expression for the log of the posterior distribution seems to assume that all the $y_i$ are equal to $1$! I think it should be:
\begin{multline} \log P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \\ = {\rm const} - \frac 1 2 \mathbf a^T \boldsymbol\Sigma^{-1} \mathbf a + \sum_{i}\left(y_i \log s(\mathbf a \cdot \mathbf x_i) + (1 - y_i) \log (1 - s(\mathbf a \cdot \mathbf x_i))\right).\end{multline}
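As a sanity check, here is a short NumPy sketch of this log posterior (up to the constant), continuing the illustrative `X`, `y`, `Sigma_inv` from the earlier snippet; it uses the identity $\log(1 - s(u)) = \log s(-u)$ for numerical stability:

```python
def log_sigmoid(u):
    # log s(u) = -log(1 + e^{-u}), computed stably.
    return -np.logaddexp(0.0, -u)

def log_posterior(a, X, y, Sigma_inv):
    """log P(a | y, X) up to an additive constant."""
    u = X @ a                                    # a . x_i for every example
    log_lik = np.sum(y * log_sigmoid(u) + (1.0 - y) * log_sigmoid(-u))
    log_prior = -0.5 * a @ Sigma_inv @ a         # quadratic penalty from the Gaussian prior
    return log_prior + log_lik
```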
4. Information to extract from the model: going beyond point estimates. Given the training set $(\mathbf x_1, y_1), \dots, (\mathbf x_N, y_N)$, the posterior distribution for the parameters is
$$ P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \propto P(y_1, \dots, y_N \mid \mathbf x_1, \dots, \mathbf x_N, \mathbf a)\, P(\mathbf a), $$
by Bayes' theorem. So one of our objectives must surely be to understand this posterior distribution.
One perfectly valid question to ask is: what value of $\mathbf a$ maximises the posterior probability? In other words, the task is to find
$$ \mathbf a_{\rm MAP} = {\rm argmax}_{\mathbf a}\, P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N). $$
This is what you discussed in your final paragraph: you take logs, and ignore the terms that are independent of $\mathbf a$, and optimise the log probability using gradient descent or Newton-Raphson or some similar iterative optimisation technique. (The subscript ${\rm MAP}$ stands for maximum a posteriori, by the way.)
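A hedged sketch of that optimisation step, reusing the illustrative `log_posterior` above and handing the negative log posterior to `scipy.optimize.minimize` (BFGS here, though a Newton-type method works just as well):

```python
from scipy.optimize import minimize

res = minimize(lambda a: -log_posterior(a, X, y, Sigma_inv),  # minimise the negative log posterior
               x0=np.zeros(X.shape[1]),                       # start from the prior mean
               method="BFGS")
a_map = res.x                                                 # the MAP estimate
```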
But this only gives us a point estimate for $\mathbf a$. We don't need the Bayesian approach to do this at all. In fact, if we follow the standard non-Bayesian approach with L2-regularisation, we will arrive at exactly the same answer. So if this is all we're getting, then there is very little point in adopting the Bayesian approach at all! (Obviously that was a personal opinion - you may disagree!)
In my opinion, the real magic in the Bayesian approach is that we can get more than just a point estimate for $\mathbf a$! The distribution $P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$ also tells us about the uncertainties in this estimate. For certain features, the posterior distribution for the corresponding component of $\mathbf a$ will be quite narrow, indicating that we are very sure about the coefficients that the model should assign to these features. For other features, it will be the other way round, indicating that we are less sure about what coefficient to assign them.
The posterior distribution $P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$ is quite a complicated function, however. To get to grips with it, a common technique (known as the Laplace approximation) is to Taylor expand the log of this distribution to quadratic order:
$$ \log P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \approx {\rm const} + \frac 1 2 (\mathbf a - \mathbf a_{\rm MAP})^T \mathbf H\, (\mathbf a - \mathbf a_{\rm MAP}),$$
where $\mathbf H$ is the Hessian matrix for $\log P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$, evaluated at $\mathbf a_{\rm MAP}$.
Exponentiating this, we see that $\mathbf a$ is approximately normally distributed, with mean $\mathbf a_{\rm MAP}$ and covariance matrix $-\mathbf H^{-1}$ (note that $\mathbf H$ is negative definite at the maximum, so $-\mathbf H^{-1}$ is a genuine covariance matrix). Thus we have obtained not just a point estimate for $\mathbf a$, but also some estimate of the uncertainty.
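Continuing the illustrative sketch, the Hessian is available in closed form for logistic regression, $\mathbf H = -\boldsymbol\Sigma^{-1} - \sum_i s_i(1 - s_i)\,\mathbf x_i \mathbf x_i^T$ with $s_i = s(\mathbf a_{\rm MAP} \cdot \mathbf x_i)$, so the Laplace approximation can be formed directly:

```python
from scipy.special import expit   # the logistic function s(u)

s = expit(X @ a_map)                        # s_i = s(a_MAP . x_i)
W = s * (1.0 - s)                           # per-example weights s_i (1 - s_i)
H = -Sigma_inv - (X * W[:, None]).T @ X     # Hessian of the log posterior at a_MAP
laplace_cov = np.linalg.inv(-H)             # covariance of the Gaussian approximation
```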
So if I was explaining this to people, I would probably try to put the characterisation of the posterior distribution centre stage. It just so happens that computing $\mathbf a_{\rm MAP}$ is part of this, but I wouldn't make that the end goal. (Again, this is obviously a personal opinion...)
4a. Making Bayesian predictions.
Now, suppose we receive a new datapoint, with feature vector $\mathbf x_{\rm new}$. Our task is to predict the probability of the corresponding label $y_{\rm new}$ being $1$. According to the Bayesian approach, the prediction should be:
$$ P(y_{\rm new} = 1 \mid \mathbf x_{\rm new}, \mathbf x_1, \dots, \mathbf x_N, y_1, \dots, y_N) = \int P(y_{\rm new} = 1 \mid \mathbf x_{\rm new}, \mathbf a)\, P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)\, d\mathbf a. $$
I think this part of the story - how to predict stuff (!!!) - is a really important thing to cover in your explanation of Bayesian logistic regression. If anything, I would probably mention this before any of the stuff about estimating the posterior distribution - in fact, I would use the desire to make predictions as the motivation for wanting to estimate the posterior distribution!
And I want to emphasise that we're not just using the point estimate $\mathbf a_{\rm MAP}$ to make the prediction. Instead, we're effectively pooling the predictions we get from all possible values for $\mathbf a$, but assigning higher weight to the predictions coming from choices of $\mathbf a$ that are deemed more likely given the training data. The use of a Bayesian approach, rather than just a simple frequentist/optimisation approach, is what makes this possible.
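To make the pooling concrete, here is one hedged way to approximate the predictive integral, by drawing samples of $\mathbf a$ from the Laplace approximation built above and averaging the per-sample predictions (the new feature vector `x_new` is again just a made-up stand-in):

```python
# Hypothetical new example, with the trailing 1 for the absorbed intercept.
x_new = np.append(rng.normal(size=p), 1.0)

# Sample parameter vectors from the Gaussian (Laplace) approximation to the posterior.
samples = rng.multivariate_normal(a_map, laplace_cov, size=5000)

# Monte Carlo estimate of P(y_new = 1 | x_new, data): average s(a . x_new) over samples.
p_bayes = expit(samples @ x_new).mean()

# For comparison, the plug-in prediction that uses only the point estimate a_MAP.
p_map = expit(a_map @ x_new)
```

When the posterior is sharply peaked the two numbers nearly coincide; when it is diffuse, the averaged Bayesian prediction is pulled towards $1/2$, reflecting the extra parameter uncertainty.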
A lot of my notation is taken from Bishop's "Pattern Recognition and Machine Learning", Chapter 4.3-4.5, and the book is freely available online.
Comments:
Thank you, very helpful. Point 3 was a blatant error on my part, glad you caught that. – littleO, Dec 31 '18 at 23:46
@littleO Out of interest, who are you teaching this to? Mostly computer scientists? Or statisticians? (I took an interest in your question because I have to explain something similar to people at work next week, and find it helpful trying to formulate these ideas on math.SE. My colleagues at work like to do things the point-estimate / optimisation way, so the Bayesian approach will be something new!) – Kenny Wong, Dec 31 '18 at 23:49
I'll be teaching this to math and data science majors. Yeah, the point 4 material is very interesting. – littleO, Dec 31 '18 at 23:56
@littleO Good luck! – Kenny Wong, Dec 31 '18 at 23:56