ELI5: The Logic Behind Coefficient Estimation in OLS Regression
Like a lot of people, I understand how to run a linear regression, I understand how to interpret its output, and I understand its limitations.
My understanding of the mathematical underpinnings of linear regression, however, is less developed. In particular, I do not understand the logic behind how we estimate beta using the following formula:
$$ \beta = (X'X)^{-1}X'Y $$
Would anyone care to offer an intuitive explanation as to why/how this process works? For example, what function each step in the equation performs and why it is necessary.
regression theory
asked Dec 11 '18 at 10:57
Jack Bailey
How many five year olds have learned anything about algebra, let alone matrices? I don't think it's a feasible request. Better to be clear about what kind/level of explanation you realistically seek. It would also help to clarify what it is you seek (that's not especially clear); are you asking for some outline explanation of how the formula is derived, or why a formula something like that makes sense?
– Glen_b♦
Dec 11 '18 at 11:52
2 Answers
Suppose you have a model of the form:
$$X\beta = Y$$
where $X$ is an ordinary 2-D matrix, for ease of visualisation.
Now, if the matrix $X$ is square and invertible, then getting $\beta$ is trivial:
$$\beta = X^{-1}Y$$
And that would be the end of it.
If this is not the case, to get $\beta$ you’ll have to find a way to “approximate” the result of an inverse matrix. $X^\dagger = (X'X)^{-1}X'$ is called the (left) pseudoinverse, and it has some nice properties that make it useful for this application.
In particular, it is unique, and $XX^\dagger X = X$, so it kind of works like an inverse matrix would $(XX^{-1}X = XI = X)$. Also, for an invertible and square matrix (i.e. if the inverse matrix exists), it is equal to $X^{-1}$.
It also gets the shape of the matrix right: if $X$ has order $n \times m$, our pseudoinverse should be $m \times n$ so we can multiply it with $Y$. This is achieved by multiplying $(X'X)^{-1}$, which is square $(m \times m)$, with $X'$, which is $(m \times n)$.
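To make this concrete, here is a small numerical sketch of my own (not from the original answer), using NumPy with illustrative variable names. It builds a tall design matrix, forms the left pseudoinverse $(X'X)^{-1}X'$ exactly as in the formula, and checks that it matches NumPy's built-in pseudoinverse and an ordinary least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tall design matrix: n = 100 rows (observations), m = 3 columns (features), so X is not square.
n, m = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])  # first column is an intercept
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)               # response with a little noise

# Left pseudoinverse built exactly as in the formula: (X'X)^{-1} X'
X_dagger = np.linalg.inv(X.T @ X) @ X.T    # shape (m, n), so X_dagger @ y is an m-vector
beta_hat = X_dagger @ y

# It behaves like an inverse in the sense X X^dagger X = X ...
assert np.allclose(X @ X_dagger @ X, X)

# ... and it agrees with NumPy's pseudoinverse and with a least-squares solver.
assert np.allclose(X_dagger, np.linalg.pinv(X))
assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])

print(beta_hat)  # close to [2.0, -1.0, 0.5]
```

In practice you would let a QR- or SVD-based solver such as `np.linalg.lstsq` do this rather than forming $(X'X)^{-1}$ explicitly, since inverting $X'X$ can be numerically unstable when the columns of $X$ are nearly collinear.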
answered Dec 11 '18 at 12:08 (edited Dec 11 '18 at 12:15)
Purple Rover

Thanks for your time. This was a great explanation and really useful.
– Jack Bailey
Dec 11 '18 at 14:40
If you look at sources such as Wikipedia, there are some good explanations of where this comes from. Here are some core ideas:
OLS aims to minimize the error $\|y - X\beta\|$.
The norm of a vector is minimized when its derivative is perpendicular to the vector. (Since you asked for ELI5, I won't go into a rigorous formulation of "derivative" in this context.)
The error is given in terms of $y$, $X$, and $\beta$. The first two are constants; we're varying only $\beta$. Thus, the derivative can be treated as being $X\beta'$, so we're looking for $(X\beta')^T(y - X\beta) = 0$. This is equivalent to $(\beta')^T X^T y = (\beta')^T X^T X\beta$. If we cancel the $(\beta')^T$ from both sides (normally in linear algebra, you can't just go around canceling things, but I'm not aiming for perfect rigor here, so I won't get into the justification), we're left with $X^T y = X^T X\beta$. Now, $X^T$ isn't invertible (it isn't even square), so we can't cancel it out, but it does turn out that $X^T X$ must be invertible (assuming that the features are linearly independent). So we can get $\beta = (X^T X)^{-1} X^T y$.
Going back to $X^T y = X^T X\beta$, recall that $X\beta$ is the estimate $\hat y$ that is calculated from a given $\beta$. $X^T y$ is a vector in which each entry is the dot product of one of the features with the response. So we have that $X^T y = X^T \hat y$, i.e., for each feature, the dot product between that feature and the actual response is equal to the dot product between that feature and the estimated response: $\forall i,\ x_i^T y = x_i^T \hat y$. We can view OLS, then, as solving $n$ equations $x_i^T y = x_i^T \hat y$, where $n$ is the number of features. So to see why this works, we just need to show that a solution exists, and that any estimate of the response other than this solution will have larger squared error.
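As a quick numerical sanity check of that last point (a sketch of my own, not part of the answer; the names are illustrative): with the OLS solution, each feature has the same dot product with the observed response as with the fitted response, i.e. the residual is orthogonal to every column of $X$, and any other coefficient vector gives a larger squared error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Design matrix with linearly independent columns: intercept plus two random features.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(scale=0.5, size=n)

# Solve the normal equations X'X beta = X'y for the OLS estimate.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

# Normal equations restated: X'y == X'y_hat, so the residual is orthogonal to each feature.
assert np.allclose(X.T @ y, X.T @ y_hat)
assert np.allclose(X.T @ (y - y_hat), 0.0)


def sse(b):
    """Sum of squared errors for coefficient vector b."""
    return np.sum((y - X @ b) ** 2)


# Any other coefficient vector has a larger sum of squared errors.
beta_other = beta_hat + rng.normal(scale=0.1, size=3)
assert sse(beta_hat) <= sse(beta_other)
print(sse(beta_hat), sse(beta_other))
```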
answered Dec 11 '18 at 19:43
Acccumulation

Thanks! Another good answer.
– Jack Bailey
Dec 12 '18 at 11:42