ELI5: The Logic Behind Coefficient Estimation in OLS Regression
Like a lot of people, I understand how to run a linear regression, I understand how to interpret its output, and I understand its limitations.
My understanding of the mathematical underpinnings of linear regression, however, is less developed. In particular, I do not understand the logic behind how we estimate beta using the following formula:
$$ \beta = (X'X)^{-1}X'Y $$
Would anyone care to offer an intuitive explanation as to why/how this process works? For example, what function each step in the equation performs and why it is necessary.
regression theory
asked Dec 11 '18 at 10:57
Jack Bailey
How many five year olds have learned anything about algebra, let alone matrices? I don't think it's a feasible request. Better to be clear about what kind/level of explanation you realistically seek. It would also help to clarify what it is you seek (that's not especially clear); are you asking for some outline explanation of how the formula is derived, or why a formula something like that makes sense?
– Glen_b♦
Dec 11 '18 at 11:52
2 Answers
Suppose you have a model of the form:
$$X\beta = Y$$
where $X$ is an ordinary 2-D matrix, for ease of visualisation.
Now, if the matrix $X$ is square and invertible, then getting $\beta$ is trivial:
$$\beta = X^{-1}Y$$
And that would be the end of it.
If this is not the case, to get $\beta$ you’ll have to find a way to “approximate” the result of an inverse matrix. $X^\dagger = (X'X)^{-1}X'$ is called the (left) pseudoinverse, and it has some nice properties that make it useful for this application.
In particular, it is unique, and $XX^\dagger X = X$, so it kind of works like an inverse matrix would $(XX^{-1}X = XI = X)$. Also, for an invertible and square matrix (i.e. if the inverse matrix exists), it is equal to $X^{-1}$.
It also gets the shape of the matrix right: if $X$ has order $n \times m$, our pseudoinverse should be $m \times n$ so we can multiply it with $Y$. This is achieved by multiplying $(X'X)^{-1}$, which is square $(m \times m)$, with $X'$, which is $(m \times n)$.
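To make this concrete, here is a small numerical sketch of my own (not from the original answer), using NumPy with illustrative variable names. It builds a tall design matrix, forms the left pseudoinverse $(X'X)^{-1}X'$ exactly as in the formula, and checks that it matches NumPy's built-in pseudoinverse and an ordinary least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tall design matrix: n = 100 rows (observations), m = 3 columns (features), so X is not square.
n, m = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])  # first column is an intercept
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)               # response with a little noise

# Left pseudoinverse built exactly as in the formula: (X'X)^{-1} X'
X_dagger = np.linalg.inv(X.T @ X) @ X.T    # shape (m, n), so X_dagger @ y is an m-vector
beta_hat = X_dagger @ y

# It behaves like an inverse in the sense X X^dagger X = X ...
assert np.allclose(X @ X_dagger @ X, X)

# ... and it agrees with NumPy's pseudoinverse and with a least-squares solver.
assert np.allclose(X_dagger, np.linalg.pinv(X))
assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])

print(beta_hat)  # close to [2.0, -1.0, 0.5]
```

In practice you would let a QR- or SVD-based solver such as `np.linalg.lstsq` do this rather than forming $(X'X)^{-1}$ explicitly, since inverting $X'X$ can be numerically unstable when the columns of $X$ are nearly collinear.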
answered Dec 11 '18 at 12:08 (edited Dec 11 '18 at 12:15)
Purple Rover

Thanks for your time. This was a great explanation and really useful.
– Jack Bailey
Dec 11 '18 at 14:40
If you look at sources such as Wikipedia, there are some good explanations of where this comes from. Here are some core ideas:
OLS aims to minimize the error $\|y - X\beta\|$.
The norm of a vector is minimized when its derivative is perpendicular to the vector. (Since you asked for ELI5, I won't go into a rigorous formulation of "derivative" in this context.)
The error is given in terms of $y$, $X$, and $\beta$. The first two are constants; we're varying only $\beta$. Thus, the derivative can be treated as being $X\beta'$, so we're looking for $(X\beta')^T(y - X\beta) = 0$. This is equivalent to $(\beta')^T X^T y = (\beta')^T X^T X\beta$. If we cancel the $(\beta')^T$ from both sides (normally in linear algebra, you can't just go around canceling things, but I'm not aiming for perfect rigor here, so I won't get into the justification), we're left with $X^T y = X^T X\beta$. Now, $X^T$ isn't invertible (it isn't even square), so we can't cancel it out, but it does turn out that $X^T X$ must be invertible (assuming that the features are linearly independent). So we can get $\beta = (X^T X)^{-1} X^T y$.
Going back to $X^T y = X^T X\beta$, recall that $X\beta$ is the estimate $\hat y$ that is calculated from a given $\beta$. $X^T y$ is a vector in which each entry is the dot product of one of the features with the response. So we have that $X^T y = X^T \hat y$, i.e., for each feature, the dot product between that feature and the actual response is equal to the dot product between that feature and the estimated response: $\forall i,\ x_i^T y = x_i^T \hat y$. We can view OLS, then, as solving $n$ equations $x_i^T y = x_i^T \hat y$, where $n$ is the number of features. So to see why this works, we just need to show that a solution exists, and that any estimate of the response other than this solution will have larger squared error.
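As a quick numerical sanity check of that last point (a sketch of my own, not part of the answer; the names are illustrative): with the OLS solution, each feature has the same dot product with the observed response as with the fitted response, i.e. the residual is orthogonal to every column of $X$, and any other coefficient vector gives a larger squared error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Design matrix with linearly independent columns: intercept plus two random features.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(scale=0.5, size=n)

# Solve the normal equations X'X beta = X'y for the OLS estimate.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

# Normal equations restated: X'y == X'y_hat, so the residual is orthogonal to each feature.
assert np.allclose(X.T @ y, X.T @ y_hat)
assert np.allclose(X.T @ (y - y_hat), 0.0)


def sse(b):
    """Sum of squared errors for coefficient vector b."""
    return np.sum((y - X @ b) ** 2)


# Any other coefficient vector has a larger sum of squared errors.
beta_other = beta_hat + rng.normal(scale=0.1, size=3)
assert sse(beta_hat) <= sse(beta_other)
print(sse(beta_hat), sse(beta_other))
```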
answered Dec 11 '18 at 19:43
Acccumulation

Thanks! Another good answer.
– Jack Bailey
Dec 12 '18 at 11:42