Deriving Bayesian logistic regression














I'm attempting to understand Bayesian logistic regression clearly, and I'm uncertain about (among other things) what is the most clear or most correct notation to use. Here is my current attempt at explaining Bayesian logistic regression:





Suppose that we have a large population of objects (such as emails), each of which is described
by a feature vector $x \in \mathbb{R}^p$ and a class label $y \in \{0,1\}$ (spam or not spam, for example).
Consider the experiment of selecting an item at random from this population.
Let $X$ and $Y$ be the feature vector and the label of the selected item (so $X$ and $Y$ are random variables).
The goal of logistic regression is to find an approximation to the function
$$
x \mapsto P(Y = 1 \mid X = x).
$$



To achieve this goal, logistic regression makes the modeling assumption that there exist a vector $a \in \mathbb{R}^p$
and a scalar $b \in \mathbb{R}$
such that
$$
P(Y = 1 \mid X = x) = S(a^T x + b),
$$

where $S$ is the sigmoid function defined by
$$
S(u) = \frac{e^u}{1 + e^u}.
$$

Logistic regression computes the parameters $a$ and $b$ using maximum likelihood estimation.



However, we can also take a Bayesian approach to estimating $a$ and $b$ by incorporating prior beliefs about the values of $a$ and $b$.
The parameters $a$ and $b$ are now viewed as being random variables.
To be concrete, let's make the assumption that $b$ and the components of $a$ are independent and normally distributed with mean $0$ and variance $\sigma^2$. The joint PDF of $a$ and $b$ will be denoted by $f_{a,b}$.



Consider the experiment of selecting $n$ objects at random (with replacement) from the population. Let $X_i$ and $Y_i$ be the feature vector and corresponding class label for the $i$th object selected (so $X_i$ and $Y_i$ are random variables). Let $x_i$ and $y_i$ be the specific values of $X_i$ and $Y_i$ that we observe when we collect our training data.



[Here is where I think the notation gets difficult.]



Let $f$ be the PDF of the random variable $(X_1,Y_1,\ldots,X_n,Y_n)$. Let $f_{X_i,Y_i}$ be the joint PDF of $X_i$ and $Y_i$, and let $f_{X_i}$ be the PDF of $X_i$.
According to Bayes' theorem, we have



\begin{align*}
&\quad f_{a,b}(\hat a, \hat b \mid (X_i,Y_i) = (x_i,y_i) \text{ for } i = 1,\ldots, n) \\
\propto &\quad f(x_1,y_1,\ldots,x_n,y_n \mid a = \hat a, b = \hat b) \, f_{a,b}(\hat a, \hat b) \\
=&\quad f_{a,b}(\hat a, \hat b) \prod_{i=1}^n f_{X_i,Y_i}(x_i,y_i \mid a = \hat a, b = \hat b) \\
=&\quad f_{a,b}(\hat a, \hat b) \prod_{i=1}^n P(Y_i = y_i \mid X_i = x_i, a = \hat a, b = \hat b) \, f_{X_i}(x_i \mid a = \hat a, b = \hat b) \\
=&\quad f_{a,b}(\hat a, \hat b) \prod_{i=1}^n P(Y_i = y_i \mid X_i = x_i, a = \hat a, b = \hat b) \, f_{X_i}(x_i).
\end{align*}

In the last step, we assumed that $X_i, a$, and $b$ are independent.



Now taking the natural log of the final expression and omitting terms that do not depend on $hat a$ or $hat b$, we see that a maximum a posteriori estimate of $a$
and $b$ can be found by maximizing
$$
\frac{-\| \hat a \|^2 - \hat b^2}{2 \sigma^2} + \sum_{i=1}^n \log\bigl(S(\hat a^T x_i + \hat b)\bigr)
$$

with respect to $hat a$ and $hat b$.
This is a convex optimization problem, which can be solved using a method such as gradient ascent. (Further simplifications to the objective function are possible, but I'm not concerned with how to explain the remaining steps.)
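
For concreteness, here is a minimal numerical sketch of this MAP computation in Python (the synthetic data and names are only illustrative). It uses the full Bernoulli log-likelihood $\sum_i \bigl( y_i \log S(\hat a^T x_i + \hat b) + (1-y_i)\log(1 - S(\hat a^T x_i + \hat b)) \bigr)$, which reduces to the sum above when every $y_i = 1$, and maximizes the objective by plain gradient ascent:

```python
import numpy as np

def sigmoid(u):
    # S(u) = e^u / (1 + e^u) = 1 / (1 + e^{-u})
    return 1.0 / (1.0 + np.exp(-u))

def map_estimate(X, y, sigma2=10.0, lr=0.01, n_iters=10000):
    """Gradient ascent on the log posterior (up to an additive constant).

    Objective: -(||a||^2 + b^2) / (2 sigma^2)
               + sum_i [ y_i log S(a.x_i + b) + (1 - y_i) log(1 - S(a.x_i + b)) ]
    """
    n, p = X.shape
    a, b = np.zeros(p), 0.0
    for _ in range(n_iters):
        s = sigmoid(X @ a + b)            # P(Y_i = 1 | x_i, a, b)
        resid = y - s                     # gradient of the log-likelihood term
        grad_a = X.T @ resid - a / sigma2 # prior contributes -a / sigma^2
        grad_b = resid.sum() - b / sigma2
        a += lr * grad_a                  # small constant step; fine for this toy problem
        b += lr * grad_b
    return a, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_a, true_b = np.array([1.5, -2.0, 0.5]), 0.3
    y = (rng.uniform(size=200) < sigmoid(X @ true_a + true_b)).astype(float)
    a_hat, b_hat = map_estimate(X, y)
    print("a:", a_hat, "b:", b_hat)
```

With the prior variance $\sigma^2$ made very large, the estimate approaches the ordinary maximum likelihood solution.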





Here are some questions:




  1. Is this derivation correct? Do you see any errors?

  2. How can the notation be improved, clarified, or simplified? I want the notation to be perfectly correct (by the standards of a mathematician) but also clear.

  3. I think the statement

        Let $f$ be the PDF of the random variable $(X_1,Y_1,\ldots,X_n,Y_n)$

     is incorrect because the random variables $Y_i$ are not continuous, so the term "PDF" can't be used here. How should this statement be rephrased?



I'd be interested in hearing any comments or opinions about how to clarify the above derivation, including comments about picky details or tangential comments.



























probability statistics regression

asked Dec 29 '18 at 12:11 by littleO (edited Dec 31 '18 at 11:20)


1 Answer

Your explanation of Bayesian logistic regression looks pretty reasonable. A few remarks:



1. Using different priors for $a$ and $b$. Can I suggest that we use different variance hyperparameters $\sigma^2_a$ and $\sigma^2_b$ for $\mathbf a$ and $b$? Since $b$ is the intercept, you probably want to pick quite a flat prior for $b$, by picking a large value for $\sigma^2_b$.



It would also make notation simpler later on if we incorporate $b$ into $\mathbf a$, as an extra component. In other words, $\mathbf a_{\rm mine} = (\mathbf a_{\rm yours}, b) \in \mathbb R^{p+1}$. This is valid, provided that we also add an additional constant component to each of the feature vectors $\mathbf x_i$, i.e. $\mathbf x_{i, \rm mine} = (\mathbf x_{i, \rm yours}, 1) \in \mathbb R^{p + 1}$.



The prior covariance for $\mathbf a$ is then a $(p+1)\times(p+1)$ diagonal matrix $\mathbf\Sigma$, whose first $p$ diagonal entries are $\sigma_a^2$ and whose final entry is $\sigma_b^2$.
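
As a small sketch of this bookkeeping (the array names and sizes are made up for illustration):

```python
import numpy as np

p = 3                                    # number of original features
sigma2_a, sigma2_b = 1.0, 100.0          # flat-ish prior on the intercept

rng = np.random.default_rng(0)
X = rng.normal(size=(10, p))             # toy feature matrix

# absorb the intercept b into a: append a constant 1 to every feature vector
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # shape (N, p + 1)

# prior covariance: first p diagonal entries sigma_a^2, final entry sigma_b^2
Sigma = np.diag([sigma2_a] * p + [sigma2_b])       # shape (p + 1, p + 1)
```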



2. The $\mathbf x_i$'s don't need to be random variables. The presentation might be clearer if we abandon the probability distributions for the $\mathbf x_i$'s, as they are unimportant for the discussion. What is important is that the model is governed by the parameter vector $\mathbf a \in \mathbb R^{p+1}$, and a priori, this parameter vector is normally distributed:
$$ P(\mathbf a) = \mathcal N(\mathbf a \mid \mathbf 0, \mathbf\Sigma ). $$
Given the value of $\mathbf a$, and given the feature vectors $\mathbf x_i$, the labels $y_i$ are independent of each other, and follow Bernoulli distributions of the form:
$$ P(y_i \mid \mathbf x_i, \mathbf a) = \begin{cases} s(\mathbf a \cdot \mathbf x_i) & {\rm if }\ y_i = 1, \\ 1 - s(\mathbf a \cdot \mathbf x_i ) & {\rm if }\ y_i = 0,\end{cases}$$
where $s(u) := e^u / (1 + e^u)$ is the logistic function.



The joint distribution of the labels and the parameters (given the features) is then
$$ P(y_1, \dots, y_N, \mathbf a \mid \mathbf x_1, \dots, \mathbf x_N ) = P(\mathbf a)\prod_{i=1}^N P(y_i \mid \mathbf x_i , \mathbf a).$$



And by the way, "joint probability distribution" is probably a more appropriate phrase than "PDF".



3. The expression for the log of the posterior distribution. Your expression for the log of the posterior distribution seems to assume that all the $y_i$ are positive! I think it should be:
\begin{multline} \log P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \\ = {\rm const} - \frac 1 2 \mathbf a^T \mathbf\Sigma^{-1} \mathbf a + \sum_{i}\left(y_i \log s (\mathbf a \cdot \mathbf x_i) + (1 - y_i) \log (1 - s(\mathbf a \cdot \mathbf x_i))\right)\end{multline}
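
In code, this log posterior (up to the additive constant) might look like the following sketch, which uses the augmented parameterisation and prior covariance $\mathbf\Sigma$ from point 1 (the function and argument names are just illustrative):

```python
import numpy as np

def log_sigmoid(u):
    # numerically stable log s(u) = -log(1 + e^{-u})
    return -np.logaddexp(0.0, -u)

def log_posterior(a, X_aug, y, Sigma_inv):
    """log P(a | y, X) up to an additive constant.

    a         : parameter vector of length p + 1 (weights plus intercept)
    X_aug     : (N, p+1) feature matrix with a trailing column of ones
    y         : (N,) array of 0/1 labels
    Sigma_inv : (p+1, p+1) inverse of the prior covariance
    """
    z = X_aug @ a
    log_prior = -0.5 * a @ Sigma_inv @ a
    # log(1 - s(z)) = log s(-z), so both label cases use log_sigmoid
    log_lik = np.sum(y * log_sigmoid(z) + (1.0 - y) * log_sigmoid(-z))
    return log_prior + log_lik
```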



4. Information to extract from the model: going beyond point estimates. Given the training set $(\mathbf x_1, y_1), \dots, (\mathbf x_N, y_N)$, the posterior distribution for the parameters is
$$ P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \propto P(y_1, \dots, y_N \mid \mathbf x_1, \dots, \mathbf x_N, \mathbf a )\, P(\mathbf a), $$
by Bayes' theorem. So one of our objectives must surely be to understand this posterior distribution.



One perfectly valid question to ask is: what value of $\mathbf a$ maximises the posterior probability? In other words, the task is to find

$$ \mathbf a_{\rm MAP} = {\rm argmax}_{\mathbf a}\; P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) .$$
This is what you discussed in your final paragraph: you take logs, ignore the terms that are independent of $\mathbf a$, and optimise the log probability using gradient descent or Newton-Raphson or some similar iterative optimisation technique. (The subscript ${\rm MAP}$ stands for maximum a posteriori, by the way.)



But this only gives us a point estimate for $\mathbf a$. We don't need the Bayesian approach to do this at all. In fact, if we follow the standard non-Bayesian approach with L2-regularisation, we will arrive at exactly the same answer. So if this is all we're getting, then there is very little point in adopting the Bayesian approach at all! (Obviously that was a personal opinion - you may disagree!)



In my opinion, the real magic in the Bayesian approach is that we can get more than just a point estimate for $\mathbf a$! The distribution $P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$ also tells us about the uncertainties in this estimate. For certain features, the posterior distribution for the corresponding component of $\mathbf a$ will be quite narrow, indicating that we are very sure about the coefficients that the model should assign to these features. For other features, it will be the other way round, indicating that we are less sure about what coefficient to assign them.



The posterior distribution $P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$ is quite a complicated function, however. To get to grips with it, a common technique (known as the Laplace approximation) is to Taylor expand the log of this distribution to quadratic order:
$$ \log P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \approx {\rm const } + \frac 1 2 (\mathbf a - \mathbf a_{\rm MAP})^T \mathbf H (\mathbf a - \mathbf a_{\rm MAP}),$$



where $\mathbf H$ is the Hessian matrix for $\log P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$, evaluated at $\mathbf a_{\rm MAP}$.



Exponentiating this, we see that $\mathbf a$ is approximately normally distributed, with mean $\mathbf a_{\rm MAP}$ and covariance $-\mathbf H^{-1}$ (note that $\mathbf H$ is negative definite at the maximum, so $-\mathbf H^{-1}$ is a valid covariance matrix). Thus we have obtained not just a point estimate for $\mathbf a$, but also some estimate of the uncertainty.
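
A rough sketch of this Laplace approximation, assuming the `log_posterior`-style model above (scipy's generic minimiser is used purely for illustration, and the Hessian is written out in the standard closed form for logistic regression with a Gaussian prior):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def laplace_approximation(X_aug, y, Sigma):
    """Return (a_MAP, cov), where cov plays the role of -H^{-1}."""
    Sigma_inv = np.linalg.inv(Sigma)

    def neg_log_post(a):
        z = X_aug @ a
        # negative log posterior up to a constant:
        #   0.5 a^T Sigma^{-1} a + sum_i [ log(1 + e^{z_i}) - y_i z_i ]
        return 0.5 * a @ Sigma_inv @ a + np.sum(np.logaddexp(0.0, z) - y * z)

    a_map = minimize(neg_log_post, np.zeros(X_aug.shape[1])).x

    # Hessian of the *negative* log posterior at the mode:
    #   Sigma^{-1} + sum_i s_i (1 - s_i) x_i x_i^T   (positive definite)
    s = sigmoid(X_aug @ a_map)
    H_neg = Sigma_inv + X_aug.T @ (X_aug * (s * (1.0 - s))[:, None])
    cov = np.linalg.inv(H_neg)    # = -H^{-1}, with H the Hessian of the log posterior
    return a_map, cov
```

The diagonal of `cov` then gives per-coefficient uncertainty estimates of the kind described above.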



So if I was explaining this to people, I would probably try to put the characterisation of the posterior distribution centre stage. It just so happens that computing $\mathbf a_{\rm MAP}$ is part of this, but I wouldn't make that the end goal. (Again, this is obviously a personal opinion...)



4a. Making Bayesian predictions.
Now, suppose we receive a new datapoint, with feature vector $\mathbf x_{\rm new}$. Our task is to predict the probability of the corresponding label $y_{\rm new}$ being $1$. According to the Bayesian approach, the prediction should be:



$$ P(y_{\rm new} = 1 \mid \mathbf x_{\rm new}, \mathbf x_1, \dots, \mathbf x_N, y_1, \dots, y_N) = \int d\mathbf a \, P(y_{\rm new} = 1 \mid \mathbf x_{\rm new}, \mathbf a)\, P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N). $$



I think this part of the story - how to predict stuff (!!!) - is a really important thing to cover in your explanation of Bayesian logistic regression. If anything, I would probably mention this before any of the stuff about estimating the posterior distribution - in fact, I would use the desire to make predictions as the motivation for wanting to estimate the posterior distribution!



And I want to emphasise that we're not just using the point estimate $\mathbf a_{\rm MAP}$ to make the prediction. Instead, we're effectively pooling the predictions we get from all possible values for $\mathbf a$, but assigning higher weight to the predictions coming from choices of $\mathbf a$ that are deemed more likely given the training data. The use of a Bayesian approach, rather than just a simple frequentist/optimisation approach, is what makes this possible.
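
One simple way to approximate this integral, under the Laplace approximation above, is Monte Carlo: draw samples of $\mathbf a$ from $\mathcal N(\mathbf a_{\rm MAP}, -\mathbf H^{-1})$ and average the resulting sigmoid predictions. A hedged sketch (the names follow the earlier snippets):

```python
import numpy as np

def predictive_probability(x_new_aug, a_map, cov, n_samples=5000, seed=0):
    """Monte Carlo estimate of P(y_new = 1 | x_new, training data)
    under the Laplace (Gaussian) approximation to the posterior."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(a_map, cov, size=n_samples)   # (n_samples, p+1)
    probs = 1.0 / (1.0 + np.exp(-(samples @ x_new_aug)))            # s(a . x_new) per sample
    return probs.mean()
```

Averaging the sigmoid over posterior samples, rather than applying the sigmoid to a single point estimate, is exactly the pooling described above.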



A lot of my notation is taken from Bishop's "Pattern Recognition and Machine Learning", Sections 4.3-4.5, and the book is freely available online.



















• Thank you, very helpful. Point 3 was a blatant error on my part, glad you caught that. – littleO, Dec 31 '18 at 23:46

• @littleO Out of interest, who are you teaching this to? Mostly computer scientists? Or statisticians? (I took an interest in your question because I have to explain something similar to people at work next week, and find it helpful trying to formulate these ideas on math.SE. My colleagues at work like to do things the point-estimate / optimisation way, so the Bayesian approach will be something new!) – Kenny Wong, Dec 31 '18 at 23:49

• I'll be teaching this to math and data science majors. Yeah, the point 4 material is very interesting. – littleO, Dec 31 '18 at 23:56

• @littleO Good luck! – Kenny Wong, Dec 31 '18 at 23:56











          Your Answer





          StackExchange.ifUsing("editor", function () {
          return StackExchange.using("mathjaxEditing", function () {
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
          });
          });
          }, "mathjax-editing");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "69"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          noCode: true, onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3055784%2fderiving-bayesian-logistic-regression%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          2





          +200







          $begingroup$

          Your explanation of Bayesian logistic regression looks pretty reasonable. A few remarks:



          1. Using different priors for $a$ and $b$. Can I suggest that we use different variance hyperparameters $sigma^2_a$ and $sigma^2_b$ for $mathbf a$ and $b$? Since $b$ is the intercept, you probably want to pick quite a flat prior for $b$, by picking a large value for $sigma^2_b$.



          It would also make notation simpler later on if we incorporate $b$ into $mathbf a$, as an extra component. In other words, $mathbf a_{rm mine} = (mathbf a_{rm yours}, b) in mathbf R^{p+1}$. This is valid, provided that we also add an additional constant component to each of the feature vectors $mathbf x_i$, i.e. $mathbf x_{i, rm mine} = (mathbf x_{i, rm yours}, 1) in mathbf R^{p + 1}$.



          The prior variance for $mathbf a$ is then a $(p+1)times(p+1)$ diagonal matrix $mathbf Sigma$, whose first $p$ diagonal entries are $sigma_a^2$ and whose final entry is $sigma_b^2$.



          2. The $mathbf x_i$'s don't need to be random variables. The presentation might be clearer if we abandon the probability distributions for the $mathbf x_i$'s, as they are unimportant for the discussion. What is important is that the model is governed by the parameter vector $mathbf a in mathbb R^{p+1}$, and a priori, this parameter vector is normally distributed:
          $$ P(mathbf a) = mathcal N(mathbf a | mathbf 0, mathbf Sigma ). $$
          Given the value of $mathbf a$, and given the feature vectors $mathbf x_i$, the labels $y_i$ are independent of each other, and follow Bernouilli distributions of the form:
          $$ P(y_i | mathbf x_i, mathbf a) = begin{cases} s(mathbf a .mathbf x_i) & {rm if }y_i = 1 \ 1 - s(mathbf a . mathbf x_i ) & {rm if } y_i = 0end{cases},$$
          where $s(u) := e^u / (1 + e^u)$ is the logistic function.



          The joint probability distribution is then
          $$ P(y_1, dots, y_N | mathbf x_1, dots, mathbf x_N, mathbf a ) = P(mathbf a)prod_{i=1}^N P(y_i | mathbf x_i , mathbf a).$$



          And by the way, "joint probability distribution" is probably a more appropriate phrase than "PDF".



          3. The expression for the log of the posterior distribution. Your expression for the log of the posterior distribution seems to assume that all the $y_i$ are positive! I think it should be:
          begin{multline} log P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) \ = {rm const} + frac 1 2 mathbf a mathbf Sigma^{-1} mathbf a + sum_{i}left(y_i log s (mathbf a_i . mathbf x_i) + (1 - y_i) log (1 - s(mathbf a_i . mathbf x_i))right)end{multline}



          4. Information to extract from the model: going beyond point estimates. Given the training set $(mathbf x_1, y_1), dots, (mathbf x_N, y_N)$, the posterior distribution for the parameters is
          $$ P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) propto P(y_1, dots, y_N | mathbf x_1, dots, mathbf x_N, mathbf a ), $$
          by Bayes theorem. So one of our objectives must surely be to understand this posterior distribution.



          One perfectly valid question to ask is: what value of $mathbf a$ maximises the posterior probability? In other words, the task is to find



          $$ mathbf a_{rm MAP} = {rm argmax}_{mathbf a} P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) .$$
          This is what you discussed in your final paragraph: you take logs, and ignore the terms that are independent of $mathbf a$, and optimise the log probability using gradient descent or Newton-Raphson or some similar iterative optimisation technique. (The subscript ${rm MAP}$ stands for maximum a posteriori, by the way.)



          But this only gives us a point estimate for $mathbf a$ . We don't need the Bayesian approach to get do this at all. In fact, if we follow the standard non-Bayesian approach with L2-regularisation, we will arrive at exactly the same answer. So if this is all we're getting, then there is very little point in adopting the Bayesian approach at all! (Obviously that was a personal opinion - you may disagree!)



          In my opinion, the real magic in the Bayesian approach is that we get we can get more than just a point estimate for $mathbf a$! The distribution $P(mathbf a| y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) $ also tells us about the uncertainties in this estimate. For certain features, the posterior distribution for the corresponding component of $mathbf a$ will be quite narrow, indicating that we are very sure about the coefficients that the model should assign to these features. For other features, it will be the other way round, indicating that we are less sure about what coefficient to assign them.



          The posterior distribution $P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) $ is quite a complicated function, however. To get to grips with it, a common technique (known as the Laplace approximation) is to Taylor expand the log of this distribution to quadratic order:
          $$ log P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) approx {rm const } + frac 1 2 mathbf (mathbf a - mathbf a_{rm MAP})^T mathbf H mathbf (mathbf a - mathbf a_{rm MAP}),$$



          where $mathbf H$ is the Hessian matrix for $log P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N)$, evaluated at $mathbf a_{rm MAP}$.



          Exponentiating this, we see that $mathbf a$ is approximately normally distributed, with mean $mathbf a_{rm MAP}$ and variance $mathbf H^{-1}$. Thus we have obtained not just a point estimate for $mathbf a$, but also some estimate of the uncertainty.



          So if I was explaining this to people, I would probably try to put the characterisation of the posterior distribution centre stage. It just so happens that computing $mathbf a_{rm MAP}$ is part of this, but I wouldn't make that the end goal. (Again, this is obviously a personal opinion...)



          4a. Making Bayesian predictions.
          Now, suppose we receive a new datapoint, with feature vector $mathbf x_{rm new}$. Our task is to predict the probability of the corresponding label $y_{rm new}$ being $1$. According to the Bayesian approach, the prediction should be:



          $$ P(y_{rm new} = 1 | mathbf x_{rm new}, x_1, dots, x_N, y_1, dots, y_N) = int dmathbf a P(y_{rm new} | mathbf x_{rm new}, mathbf a) P(mathbf a | y_1, dots, y_N, x_1, dots, x_N). $$



          I think this part of the story - how to predict stuff (!!!) - is a really important thing to cover in your explanation of Bayesian logistic regression. If anything, I would probably mention this before any of the stuff about estimating the posterior distribution - in fact, I would use the desire to make predictions as the motivation for wanting to estimate the posterior distribution!



          And I want to emphasise that we're not just using the point estimate $mathbf a_{rm MAP}$ to make the prediction. Instead, we're effectively pooling the predictions we get from all possible values for $mathbf a$, but assigning higher weight to the predictions coming from choices of $mathbf a$ that are deemed more likely given the training data. The use of a Bayesian approach, rather than just a simple frequentist/optimisation approach, is what makes this possible.



          A lot of my notation is taken from Bishop's "Pattern Recognition and Machine Learning", Chapter 4.3-4.5, and the book is freely available online.






          share|cite|improve this answer











          $endgroup$













          • $begingroup$
            Thank you, very helpful. Point 3 was a blatant error on my part, glad you caught that.
            $endgroup$
            – littleO
            Dec 31 '18 at 23:46










          • $begingroup$
            @littleO Out of interest, who are you teaching this to? Mostly computer scientists? Or statisticians? (I took an interest in your question because I have to explain something similar to people at work next week, and find it helpful trying to formulate these ideas on math.SE. My colleagues at work like to do things the point-estimate / optimisation way, so the Bayesian approach will be something new!)
            $endgroup$
            – Kenny Wong
            Dec 31 '18 at 23:49










          • $begingroup$
            I'll be teaching this to math and data science majors. Yeah, the point 4 material is very interesting.
            $endgroup$
            – littleO
            Dec 31 '18 at 23:56










          • $begingroup$
            @littleO Good luck!
            $endgroup$
            – Kenny Wong
            Dec 31 '18 at 23:56
















          2





          +200







          $begingroup$

          Your explanation of Bayesian logistic regression looks pretty reasonable. A few remarks:



          1. Using different priors for $a$ and $b$. Can I suggest that we use different variance hyperparameters $sigma^2_a$ and $sigma^2_b$ for $mathbf a$ and $b$? Since $b$ is the intercept, you probably want to pick quite a flat prior for $b$, by picking a large value for $sigma^2_b$.



          It would also make notation simpler later on if we incorporate $b$ into $mathbf a$, as an extra component. In other words, $mathbf a_{rm mine} = (mathbf a_{rm yours}, b) in mathbf R^{p+1}$. This is valid, provided that we also add an additional constant component to each of the feature vectors $mathbf x_i$, i.e. $mathbf x_{i, rm mine} = (mathbf x_{i, rm yours}, 1) in mathbf R^{p + 1}$.



          The prior variance for $mathbf a$ is then a $(p+1)times(p+1)$ diagonal matrix $mathbf Sigma$, whose first $p$ diagonal entries are $sigma_a^2$ and whose final entry is $sigma_b^2$.



          2. The $mathbf x_i$'s don't need to be random variables. The presentation might be clearer if we abandon the probability distributions for the $mathbf x_i$'s, as they are unimportant for the discussion. What is important is that the model is governed by the parameter vector $mathbf a in mathbb R^{p+1}$, and a priori, this parameter vector is normally distributed:
          $$ P(mathbf a) = mathcal N(mathbf a | mathbf 0, mathbf Sigma ). $$
          Given the value of $mathbf a$, and given the feature vectors $mathbf x_i$, the labels $y_i$ are independent of each other, and follow Bernouilli distributions of the form:
          $$ P(y_i | mathbf x_i, mathbf a) = begin{cases} s(mathbf a .mathbf x_i) & {rm if }y_i = 1 \ 1 - s(mathbf a . mathbf x_i ) & {rm if } y_i = 0end{cases},$$
          where $s(u) := e^u / (1 + e^u)$ is the logistic function.



          The joint probability distribution is then
          $$ P(y_1, dots, y_N | mathbf x_1, dots, mathbf x_N, mathbf a ) = P(mathbf a)prod_{i=1}^N P(y_i | mathbf x_i , mathbf a).$$



          And by the way, "joint probability distribution" is probably a more appropriate phrase than "PDF".



          3. The expression for the log of the posterior distribution. Your expression for the log of the posterior distribution seems to assume that all the $y_i$ are positive! I think it should be:
          begin{multline} log P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) \ = {rm const} + frac 1 2 mathbf a mathbf Sigma^{-1} mathbf a + sum_{i}left(y_i log s (mathbf a_i . mathbf x_i) + (1 - y_i) log (1 - s(mathbf a_i . mathbf x_i))right)end{multline}



          4. Information to extract from the model: going beyond point estimates. Given the training set $(mathbf x_1, y_1), dots, (mathbf x_N, y_N)$, the posterior distribution for the parameters is
          $$ P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) propto P(y_1, dots, y_N | mathbf x_1, dots, mathbf x_N, mathbf a ), $$
          by Bayes theorem. So one of our objectives must surely be to understand this posterior distribution.



          One perfectly valid question to ask is: what value of $mathbf a$ maximises the posterior probability? In other words, the task is to find



          $$ mathbf a_{rm MAP} = {rm argmax}_{mathbf a} P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) .$$
          This is what you discussed in your final paragraph: you take logs, and ignore the terms that are independent of $mathbf a$, and optimise the log probability using gradient descent or Newton-Raphson or some similar iterative optimisation technique. (The subscript ${rm MAP}$ stands for maximum a posteriori, by the way.)



          But this only gives us a point estimate for $mathbf a$ . We don't need the Bayesian approach to get do this at all. In fact, if we follow the standard non-Bayesian approach with L2-regularisation, we will arrive at exactly the same answer. So if this is all we're getting, then there is very little point in adopting the Bayesian approach at all! (Obviously that was a personal opinion - you may disagree!)



          In my opinion, the real magic in the Bayesian approach is that we get we can get more than just a point estimate for $mathbf a$! The distribution $P(mathbf a| y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) $ also tells us about the uncertainties in this estimate. For certain features, the posterior distribution for the corresponding component of $mathbf a$ will be quite narrow, indicating that we are very sure about the coefficients that the model should assign to these features. For other features, it will be the other way round, indicating that we are less sure about what coefficient to assign them.



          The posterior distribution $P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) $ is quite a complicated function, however. To get to grips with it, a common technique (known as the Laplace approximation) is to Taylor expand the log of this distribution to quadratic order:
          $$ log P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) approx {rm const } + frac 1 2 mathbf (mathbf a - mathbf a_{rm MAP})^T mathbf H mathbf (mathbf a - mathbf a_{rm MAP}),$$



          where $mathbf H$ is the Hessian matrix for $log P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N)$, evaluated at $mathbf a_{rm MAP}$.



          Exponentiating this, we see that $mathbf a$ is approximately normally distributed, with mean $mathbf a_{rm MAP}$ and variance $mathbf H^{-1}$. Thus we have obtained not just a point estimate for $mathbf a$, but also some estimate of the uncertainty.



          So if I was explaining this to people, I would probably try to put the characterisation of the posterior distribution centre stage. It just so happens that computing $mathbf a_{rm MAP}$ is part of this, but I wouldn't make that the end goal. (Again, this is obviously a personal opinion...)



          4a. Making Bayesian predictions.
          Now, suppose we receive a new datapoint, with feature vector $mathbf x_{rm new}$. Our task is to predict the probability of the corresponding label $y_{rm new}$ being $1$. According to the Bayesian approach, the prediction should be:



          $$ P(y_{rm new} = 1 | mathbf x_{rm new}, x_1, dots, x_N, y_1, dots, y_N) = int dmathbf a P(y_{rm new} | mathbf x_{rm new}, mathbf a) P(mathbf a | y_1, dots, y_N, x_1, dots, x_N). $$



          I think this part of the story - how to predict stuff (!!!) - is a really important thing to cover in your explanation of Bayesian logistic regression. If anything, I would probably mention this before any of the stuff about estimating the posterior distribution - in fact, I would use the desire to make predictions as the motivation for wanting to estimate the posterior distribution!



          And I want to emphasise that we're not just using the point estimate $mathbf a_{rm MAP}$ to make the prediction. Instead, we're effectively pooling the predictions we get from all possible values for $mathbf a$, but assigning higher weight to the predictions coming from choices of $mathbf a$ that are deemed more likely given the training data. The use of a Bayesian approach, rather than just a simple frequentist/optimisation approach, is what makes this possible.



          A lot of my notation is taken from Bishop's "Pattern Recognition and Machine Learning", Chapter 4.3-4.5, and the book is freely available online.






          share|cite|improve this answer











          $endgroup$













          • $begingroup$
            Thank you, very helpful. Point 3 was a blatant error on my part, glad you caught that.
            $endgroup$
            – littleO
            Dec 31 '18 at 23:46










          • $begingroup$
            @littleO Out of interest, who are you teaching this to? Mostly computer scientists? Or statisticians? (I took an interest in your question because I have to explain something similar to people at work next week, and find it helpful trying to formulate these ideas on math.SE. My colleagues at work like to do things the point-estimate / optimisation way, so the Bayesian approach will be something new!)
            $endgroup$
            – Kenny Wong
            Dec 31 '18 at 23:49










          • $begingroup$
            I'll be teaching this to math and data science majors. Yeah, the point 4 material is very interesting.
            $endgroup$
            – littleO
            Dec 31 '18 at 23:56










          • $begingroup$
            @littleO Good luck!
            $endgroup$
            – Kenny Wong
            Dec 31 '18 at 23:56














          2





          +200







          2





          +200



          2




          +200



          $begingroup$

          Your explanation of Bayesian logistic regression looks pretty reasonable. A few remarks:



          1. Using different priors for $a$ and $b$. Can I suggest that we use different variance hyperparameters $sigma^2_a$ and $sigma^2_b$ for $mathbf a$ and $b$? Since $b$ is the intercept, you probably want to pick quite a flat prior for $b$, by picking a large value for $sigma^2_b$.



          It would also make notation simpler later on if we incorporate $b$ into $mathbf a$, as an extra component. In other words, $mathbf a_{rm mine} = (mathbf a_{rm yours}, b) in mathbf R^{p+1}$. This is valid, provided that we also add an additional constant component to each of the feature vectors $mathbf x_i$, i.e. $mathbf x_{i, rm mine} = (mathbf x_{i, rm yours}, 1) in mathbf R^{p + 1}$.



          The prior variance for $mathbf a$ is then a $(p+1)times(p+1)$ diagonal matrix $mathbf Sigma$, whose first $p$ diagonal entries are $sigma_a^2$ and whose final entry is $sigma_b^2$.



          2. The $mathbf x_i$'s don't need to be random variables. The presentation might be clearer if we abandon the probability distributions for the $mathbf x_i$'s, as they are unimportant for the discussion. What is important is that the model is governed by the parameter vector $mathbf a in mathbb R^{p+1}$, and a priori, this parameter vector is normally distributed:
          $$ P(mathbf a) = mathcal N(mathbf a | mathbf 0, mathbf Sigma ). $$
          Given the value of $mathbf a$, and given the feature vectors $mathbf x_i$, the labels $y_i$ are independent of each other, and follow Bernouilli distributions of the form:
          $$ P(y_i | mathbf x_i, mathbf a) = begin{cases} s(mathbf a .mathbf x_i) & {rm if }y_i = 1 \ 1 - s(mathbf a . mathbf x_i ) & {rm if } y_i = 0end{cases},$$
          where $s(u) := e^u / (1 + e^u)$ is the logistic function.



          The joint probability distribution is then
          $$ P(y_1, dots, y_N | mathbf x_1, dots, mathbf x_N, mathbf a ) = P(mathbf a)prod_{i=1}^N P(y_i | mathbf x_i , mathbf a).$$



          And by the way, "joint probability distribution" is probably a more appropriate phrase than "PDF".



          3. The expression for the log of the posterior distribution. Your expression for the log of the posterior distribution seems to assume that all the $y_i$ are positive! I think it should be:
          begin{multline} log P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) \ = {rm const} + frac 1 2 mathbf a mathbf Sigma^{-1} mathbf a + sum_{i}left(y_i log s (mathbf a_i . mathbf x_i) + (1 - y_i) log (1 - s(mathbf a_i . mathbf x_i))right)end{multline}



          4. Information to extract from the model: going beyond point estimates. Given the training set $(mathbf x_1, y_1), dots, (mathbf x_N, y_N)$, the posterior distribution for the parameters is
          $$ P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) propto P(y_1, dots, y_N | mathbf x_1, dots, mathbf x_N, mathbf a ), $$
          by Bayes theorem. So one of our objectives must surely be to understand this posterior distribution.



          One perfectly valid question to ask is: what value of $mathbf a$ maximises the posterior probability? In other words, the task is to find



          $$ mathbf a_{rm MAP} = {rm argmax}_{mathbf a} P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) .$$
          This is what you discussed in your final paragraph: you take logs, and ignore the terms that are independent of $mathbf a$, and optimise the log probability using gradient descent or Newton-Raphson or some similar iterative optimisation technique. (The subscript ${rm MAP}$ stands for maximum a posteriori, by the way.)



          But this only gives us a point estimate for $mathbf a$ . We don't need the Bayesian approach to get do this at all. In fact, if we follow the standard non-Bayesian approach with L2-regularisation, we will arrive at exactly the same answer. So if this is all we're getting, then there is very little point in adopting the Bayesian approach at all! (Obviously that was a personal opinion - you may disagree!)



          In my opinion, the real magic in the Bayesian approach is that we get we can get more than just a point estimate for $mathbf a$! The distribution $P(mathbf a| y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) $ also tells us about the uncertainties in this estimate. For certain features, the posterior distribution for the corresponding component of $mathbf a$ will be quite narrow, indicating that we are very sure about the coefficients that the model should assign to these features. For other features, it will be the other way round, indicating that we are less sure about what coefficient to assign them.



          The posterior distribution $P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) $ is quite a complicated function, however. To get to grips with it, a common technique (known as the Laplace approximation) is to Taylor expand the log of this distribution to quadratic order:
          $$ log P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) approx {rm const } + frac 1 2 mathbf (mathbf a - mathbf a_{rm MAP})^T mathbf H mathbf (mathbf a - mathbf a_{rm MAP}),$$



          where $mathbf H$ is the Hessian matrix for $log P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N)$, evaluated at $mathbf a_{rm MAP}$.



          Exponentiating this, we see that $mathbf a$ is approximately normally distributed, with mean $mathbf a_{rm MAP}$ and variance $mathbf H^{-1}$. Thus we have obtained not just a point estimate for $mathbf a$, but also some estimate of the uncertainty.



          So if I was explaining this to people, I would probably try to put the characterisation of the posterior distribution centre stage. It just so happens that computing $mathbf a_{rm MAP}$ is part of this, but I wouldn't make that the end goal. (Again, this is obviously a personal opinion...)



          4a. Making Bayesian predictions.
          Now, suppose we receive a new datapoint, with feature vector $mathbf x_{rm new}$. Our task is to predict the probability of the corresponding label $y_{rm new}$ being $1$. According to the Bayesian approach, the prediction should be:



          $$ P(y_{rm new} = 1 | mathbf x_{rm new}, x_1, dots, x_N, y_1, dots, y_N) = int dmathbf a P(y_{rm new} | mathbf x_{rm new}, mathbf a) P(mathbf a | y_1, dots, y_N, x_1, dots, x_N). $$



          I think this part of the story - how to predict stuff (!!!) - is a really important thing to cover in your explanation of Bayesian logistic regression. If anything, I would probably mention this before any of the stuff about estimating the posterior distribution - in fact, I would use the desire to make predictions as the motivation for wanting to estimate the posterior distribution!



          And I want to emphasise that we're not just using the point estimate $mathbf a_{rm MAP}$ to make the prediction. Instead, we're effectively pooling the predictions we get from all possible values for $mathbf a$, but assigning higher weight to the predictions coming from choices of $mathbf a$ that are deemed more likely given the training data. The use of a Bayesian approach, rather than just a simple frequentist/optimisation approach, is what makes this possible.



          A lot of my notation is taken from Bishop's "Pattern Recognition and Machine Learning", Chapter 4.3-4.5, and the book is freely available online.






          share|cite|improve this answer











          $endgroup$



          Your explanation of Bayesian logistic regression looks pretty reasonable. A few remarks:



          1. Using different priors for $a$ and $b$. Can I suggest that we use different variance hyperparameters $sigma^2_a$ and $sigma^2_b$ for $mathbf a$ and $b$? Since $b$ is the intercept, you probably want to pick quite a flat prior for $b$, by picking a large value for $sigma^2_b$.



          It would also make notation simpler later on if we incorporate $b$ into $mathbf a$, as an extra component. In other words, $mathbf a_{rm mine} = (mathbf a_{rm yours}, b) in mathbf R^{p+1}$. This is valid, provided that we also add an additional constant component to each of the feature vectors $mathbf x_i$, i.e. $mathbf x_{i, rm mine} = (mathbf x_{i, rm yours}, 1) in mathbf R^{p + 1}$.



          The prior variance for $mathbf a$ is then a $(p+1)times(p+1)$ diagonal matrix $mathbf Sigma$, whose first $p$ diagonal entries are $sigma_a^2$ and whose final entry is $sigma_b^2$.



          2. The $mathbf x_i$'s don't need to be random variables. The presentation might be clearer if we abandon the probability distributions for the $mathbf x_i$'s, as they are unimportant for the discussion. What is important is that the model is governed by the parameter vector $mathbf a in mathbb R^{p+1}$, and a priori, this parameter vector is normally distributed:
          $$ P(mathbf a) = mathcal N(mathbf a | mathbf 0, mathbf Sigma ). $$
          Given the value of $mathbf a$, and given the feature vectors $mathbf x_i$, the labels $y_i$ are independent of each other, and follow Bernouilli distributions of the form:
          $$ P(y_i | mathbf x_i, mathbf a) = begin{cases} s(mathbf a .mathbf x_i) & {rm if }y_i = 1 \ 1 - s(mathbf a . mathbf x_i ) & {rm if } y_i = 0end{cases},$$
          where $s(u) := e^u / (1 + e^u)$ is the logistic function.



          The joint probability distribution is then
          $$ P(y_1, dots, y_N | mathbf x_1, dots, mathbf x_N, mathbf a ) = P(mathbf a)prod_{i=1}^N P(y_i | mathbf x_i , mathbf a).$$



          And by the way, "joint probability distribution" is probably a more appropriate phrase than "PDF".



          3. The expression for the log of the posterior distribution. Your expression for the log of the posterior distribution seems to assume that all the $y_i$ are positive! I think it should be:
          begin{multline} log P(mathbf a | y_1, dots, y_N, mathbf x_1, dots, mathbf x_N) \ = {rm const} + frac 1 2 mathbf a mathbf Sigma^{-1} mathbf a + sum_{i}left(y_i log s (mathbf a_i . mathbf x_i) + (1 - y_i) log (1 - s(mathbf a_i . mathbf x_i))right)end{multline}



4. Information to extract from the model: going beyond point estimates. Given the training set $(\mathbf x_1, y_1), \dots, (\mathbf x_N, y_N)$, the posterior distribution for the parameters is
$$ P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \propto P(y_1, \dots, y_N \mid \mathbf x_1, \dots, \mathbf x_N, \mathbf a )\, P(\mathbf a), $$
by Bayes' theorem. So one of our objectives must surely be to understand this posterior distribution.



One perfectly valid question to ask is: what value of $\mathbf a$ maximises the posterior probability? In other words, the task is to find

$$ \mathbf a_{\rm MAP} = {\rm argmax}_{\mathbf a}\, P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) .$$
This is what you discussed in your final paragraph: you take logs, ignore the terms that are independent of $\mathbf a$, and maximise the log posterior using gradient ascent, Newton-Raphson, or some similar iterative optimisation technique (equivalently, minimise its negative by gradient descent). (The subscript ${\rm MAP}$ stands for maximum a posteriori, by the way.)
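For instance, one could hand the negative log posterior to a generic optimiser; here is a sketch using scipy's minimize with BFGS (any of the iterative methods mentioned above would do, and the numerical gradient it falls back on is fine for an illustration):

    from scipy.optimize import minimize

    # Maximise log P(a | data) by minimising its negative.
    result = minimize(lambda a: -log_posterior(a, X, y),
                      x0=np.zeros(p + 1), method="BFGS")
    a_map = result.x   # the MAP estimate a_MAP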



But this only gives us a point estimate for $\mathbf a$. We don't need the Bayesian approach to do this at all. In fact, if we follow the standard non-Bayesian approach with L2-regularisation, we will arrive at exactly the same answer. So if this is all we're getting, then there is very little point in adopting the Bayesian approach at all! (Obviously that was a personal opinion - you may disagree!)



In my opinion, the real magic in the Bayesian approach is that we can get more than just a point estimate for $\mathbf a$! The distribution $P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$ also tells us about the uncertainty in this estimate. For certain features, the posterior distribution of the corresponding component of $\mathbf a$ will be quite narrow, indicating that we are very sure about the coefficient the model should assign to that feature. For other features, it will be the other way round, indicating that we are less sure about what coefficient to assign them.



The posterior distribution $P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$ is quite a complicated function, however. To get to grips with it, a common technique (known as the Laplace approximation) is to Taylor expand the log of this distribution to quadratic order about its maximum:
$$ \log P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \approx {\rm const } - \frac 1 2 (\mathbf a - \mathbf a_{\rm MAP})^T \mathbf H (\mathbf a - \mathbf a_{\rm MAP}),$$



where $\mathbf H$ is the Hessian matrix of the negative log posterior $-\log P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$, evaluated at $\mathbf a_{\rm MAP}$. (There is no linear term because $\mathbf a_{\rm MAP}$ is a maximum, and $\mathbf H$ is positive definite there.)



Exponentiating this, we see that $\mathbf a$ is approximately normally distributed, with mean $\mathbf a_{\rm MAP}$ and covariance matrix $\mathbf H^{-1}$. Thus we have obtained not just a point estimate for $\mathbf a$, but also an estimate of its uncertainty.
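For this particular model the Hessian of the negative log posterior has a simple closed form, $\mathbf H = \mathbf\Sigma^{-1} + \sum_i s_i(1-s_i)\,\mathbf x_i \mathbf x_i^T$ with $s_i = s(\mathbf a_{\rm MAP}\cdot\mathbf x_i)$ (a standard result I'm quoting rather than deriving), so the Laplace approximation is easy to compute explicitly. Continuing the sketch:

    # Hessian of the negative log posterior at a_MAP:
    #   H = Sigma^{-1} + sum_i s_i (1 - s_i) x_i x_i^T
    s = sigmoid(X @ a_map)
    H = Sigma_inv + (X * (s * (1 - s))[:, None]).T @ X

    # Laplace approximation: a ~ N(a_MAP, H^{-1})
    posterior_cov = np.linalg.inv(H)
    coef_uncertainty = np.sqrt(np.diag(posterior_cov))  # per-coefficient std dev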



          So if I was explaining this to people, I would probably try to put the characterisation of the posterior distribution centre stage. It just so happens that computing $mathbf a_{rm MAP}$ is part of this, but I wouldn't make that the end goal. (Again, this is obviously a personal opinion...)



4a. Making Bayesian predictions.
Now suppose we receive a new datapoint, with feature vector $\mathbf x_{\rm new}$. Our task is to predict the probability of the corresponding label $y_{\rm new}$ being $1$. According to the Bayesian approach, the prediction should be:

$$ P(y_{\rm new} = 1 \mid \mathbf x_{\rm new}, \mathbf x_1, \dots, \mathbf x_N, y_1, \dots, y_N) = \int P(y_{\rm new} = 1 \mid \mathbf x_{\rm new}, \mathbf a)\, P(\mathbf a \mid y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)\, d\mathbf a. $$
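One simple way to approximate this integral, given the Gaussian (Laplace) approximation to the posterior from point 4, is plain Monte Carlo; a sketch, with x_new a made-up new datapoint (Bishop also gives a closed-form probit-based approximation, which I won't reproduce here):

    # Average s(a . x_new) over samples of a drawn from the approximate
    # posterior N(a_MAP, H^{-1}).
    x_new = np.append(rng.normal(size=p), 1.0)   # placeholder augmented feature vector
    a_samples = rng.multivariate_normal(a_map, posterior_cov, size=5000)
    p_new = sigmoid(a_samples @ x_new).mean()    # approx. P(y_new = 1 | x_new, data)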



          I think this part of the story - how to predict stuff (!!!) - is a really important thing to cover in your explanation of Bayesian logistic regression. If anything, I would probably mention this before any of the stuff about estimating the posterior distribution - in fact, I would use the desire to make predictions as the motivation for wanting to estimate the posterior distribution!



          And I want to emphasise that we're not just using the point estimate $mathbf a_{rm MAP}$ to make the prediction. Instead, we're effectively pooling the predictions we get from all possible values for $mathbf a$, but assigning higher weight to the predictions coming from choices of $mathbf a$ that are deemed more likely given the training data. The use of a Bayesian approach, rather than just a simple frequentist/optimisation approach, is what makes this possible.



A lot of my notation is taken from Bishop's "Pattern Recognition and Machine Learning", Sections 4.3-4.5; the book is freely available online.







answered Dec 31 '18 at 21:52, edited Dec 31 '18 at 23:38 – Kenny Wong

• Thank you, very helpful. Point 3 was a blatant error on my part, glad you caught that. – littleO Dec 31 '18 at 23:46

• @littleO Out of interest, who are you teaching this to? Mostly computer scientists? Or statisticians? (I took an interest in your question because I have to explain something similar to people at work next week, and find it helpful trying to formulate these ideas on math.SE. My colleagues at work like to do things the point-estimate / optimisation way, so the Bayesian approach will be something new!) – Kenny Wong Dec 31 '18 at 23:49

• I'll be teaching this to math and data science majors. Yeah, the point 4 material is very interesting. – littleO Dec 31 '18 at 23:56

• @littleO Good luck! – Kenny Wong Dec 31 '18 at 23:56

















