What is the motivation of proximal mapping?
$begingroup$
For a convex function $h$, its proximal operator is defined as:
$$operatorname{prox}_h(x)=argmin_u Big(h(u)+frac{1}{2}|u-x|^2Big)$$
Can anyone provide an intuitive explanation/motivation of proximal mapping?
convex-analysis convex-optimization
$endgroup$
add a comment |
$begingroup$
For a convex function $h$, its proximal operator is defined as:
$$operatorname{prox}_h(x)=argmin_u Big(h(u)+frac{1}{2}|u-x|^2Big)$$
Can anyone provide an intuitive explanation/motivation of proximal mapping?
convex-analysis convex-optimization
$endgroup$
$begingroup$
Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
$endgroup$
– copper.hat
Jul 8 '12 at 10:01
$begingroup$
Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
$endgroup$
– copper.hat
Jul 8 '12 at 16:10
$begingroup$
Thank you very much. I have ordered one copy.
$endgroup$
– mining
Jul 8 '12 at 23:59
$begingroup$
I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
$endgroup$
– rnegrinho
Jun 12 '14 at 11:42
add a comment |
$begingroup$
For a convex function $h$, its proximal operator is defined as:
$$operatorname{prox}_h(x)=argmin_u Big(h(u)+frac{1}{2}|u-x|^2Big)$$
Can anyone provide an intuitive explanation/motivation of proximal mapping?
convex-analysis convex-optimization
$endgroup$
For a convex function $h$, its proximal operator is defined as:
$$operatorname{prox}_h(x)=argmin_u Big(h(u)+frac{1}{2}|u-x|^2Big)$$
Can anyone provide an intuitive explanation/motivation of proximal mapping?
convex-analysis convex-optimization
convex-analysis convex-optimization
edited Dec 17 '18 at 12:45
littleO
29.6k645109
29.6k645109
asked Jul 8 '12 at 6:40
miningmining
465513
465513
$begingroup$
Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
$endgroup$
– copper.hat
Jul 8 '12 at 10:01
$begingroup$
Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
$endgroup$
– copper.hat
Jul 8 '12 at 16:10
$begingroup$
Thank you very much. I have ordered one copy.
$endgroup$
– mining
Jul 8 '12 at 23:59
$begingroup$
I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
$endgroup$
– rnegrinho
Jun 12 '14 at 11:42
add a comment |
$begingroup$
Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
$endgroup$
– copper.hat
Jul 8 '12 at 10:01
$begingroup$
Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
$endgroup$
– copper.hat
Jul 8 '12 at 16:10
$begingroup$
Thank you very much. I have ordered one copy.
$endgroup$
– mining
Jul 8 '12 at 23:59
$begingroup$
I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
$endgroup$
– rnegrinho
Jun 12 '14 at 11:42
$begingroup$
Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
$endgroup$
– copper.hat
Jul 8 '12 at 10:01
$begingroup$
Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
$endgroup$
– copper.hat
Jul 8 '12 at 10:01
$begingroup$
Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
$endgroup$
– copper.hat
Jul 8 '12 at 16:10
$begingroup$
Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
$endgroup$
– copper.hat
Jul 8 '12 at 16:10
$begingroup$
Thank you very much. I have ordered one copy.
$endgroup$
– mining
Jul 8 '12 at 23:59
$begingroup$
Thank you very much. I have ordered one copy.
$endgroup$
– mining
Jul 8 '12 at 23:59
$begingroup$
I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
$endgroup$
– rnegrinho
Jun 12 '14 at 11:42
$begingroup$
I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
$endgroup$
– rnegrinho
Jun 12 '14 at 11:42
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
Here are two interpretations which can lend useful intuition in some circumstances:
Prox is a generalized projection
Consider the indicator of some nonempty convex set $X$: $$i_X(x) = begin{cases}0 & text{if } x in X \ infty & text{if } x not in Xend{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$begin{align}text{prox}_{alpha i_X}(y) &= underset{x}{text{argmin}}(alpha i_X(x)+frac{1}{2}|x - y|^2) \ &= underset{x in X}{text{argmin}}(frac{1}{2}|x - y|^2) \ &= text{proj}_X(y).end{align}$$
Prox is backward Euler
Consider the backward Euler update: $x^{k+1} = x^k - alpha nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = text{prox}_{alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = nabla (alpha f(x^star) + frac{1}{2}|x^star - x^k|^2) = alpha nabla f(x^star) + x^star - x^k$.
$endgroup$
add a comment |
$begingroup$
If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
$$
text{prox}_{th}(hat x) = arg min_x , h(x) + frac{1}{2t} |x - hat x |_2^2.
$$
When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $hat x$.)
You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem
$$
text{minimize} quad g(x) + h(x)
$$
where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration:
$$
x^{k+1} = text{prox}_{th}(x^k - t nabla g(x^k)).
$$
(This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)
Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration:
$$
x^{k+1} = arg min_x quad h(x) + underbrace{g(x^k) + nabla g(x^k)^T(x - x^k)}_{approx , g(x)} + frac{1}{2t} |x - x^k |_2^2.
$$
Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that
begin{align}
x^{k+1} &= arg min_x , h(x) + frac{1}{2t} |x - (x^k - tnabla g(x^k)) |_2^2 \
&= text{prox}_{th}(x^k - tnabla g(x^k)).
end{align}
This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.
Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that
$$
tag{1} 0 in partial h(x).
$$
Here $partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x mapsto partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that
begin{align}
& 0 in partial h(x) \
iff & x in (I + t partial h)(x) \
iff & x in (I + t partial h)^{-1}(x).
end{align}
The operator $(I + t partial h)^{-1}$ is called the "resolvent" of the operator $partial h$, and it has nice properties such as the following: if $x in mathbb R^n$, then $(I + t partial h)^{-1}(x)$ is a singleton. Thus, $(I + t partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 in partial h(x)$ is equivalent to solving
$$
x = (I + t partial h)^{-1}(x).
$$
A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is
$$
x^{k+1} = (I + t partial h)^{-1}(x^k).
$$
Punch line: The function $(I + t partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "69"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f168159%2fwhat-is-the-motivation-of-proximal-mapping%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Here are two interpretations which can lend useful intuition in some circumstances:
Prox is a generalized projection
Consider the indicator of some nonempty convex set $X$: $$i_X(x) = begin{cases}0 & text{if } x in X \ infty & text{if } x not in Xend{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$begin{align}text{prox}_{alpha i_X}(y) &= underset{x}{text{argmin}}(alpha i_X(x)+frac{1}{2}|x - y|^2) \ &= underset{x in X}{text{argmin}}(frac{1}{2}|x - y|^2) \ &= text{proj}_X(y).end{align}$$
Prox is backward Euler
Consider the backward Euler update: $x^{k+1} = x^k - alpha nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = text{prox}_{alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = nabla (alpha f(x^star) + frac{1}{2}|x^star - x^k|^2) = alpha nabla f(x^star) + x^star - x^k$.
$endgroup$
add a comment |
$begingroup$
Here are two interpretations which can lend useful intuition in some circumstances:
Prox is a generalized projection
Consider the indicator of some nonempty convex set $X$: $$i_X(x) = begin{cases}0 & text{if } x in X \ infty & text{if } x not in Xend{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$begin{align}text{prox}_{alpha i_X}(y) &= underset{x}{text{argmin}}(alpha i_X(x)+frac{1}{2}|x - y|^2) \ &= underset{x in X}{text{argmin}}(frac{1}{2}|x - y|^2) \ &= text{proj}_X(y).end{align}$$
Prox is backward Euler
Consider the backward Euler update: $x^{k+1} = x^k - alpha nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = text{prox}_{alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = nabla (alpha f(x^star) + frac{1}{2}|x^star - x^k|^2) = alpha nabla f(x^star) + x^star - x^k$.
$endgroup$
add a comment |
$begingroup$
Here are two interpretations which can lend useful intuition in some circumstances:
Prox is a generalized projection
Consider the indicator of some nonempty convex set $X$: $$i_X(x) = begin{cases}0 & text{if } x in X \ infty & text{if } x not in Xend{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$begin{align}text{prox}_{alpha i_X}(y) &= underset{x}{text{argmin}}(alpha i_X(x)+frac{1}{2}|x - y|^2) \ &= underset{x in X}{text{argmin}}(frac{1}{2}|x - y|^2) \ &= text{proj}_X(y).end{align}$$
Prox is backward Euler
Consider the backward Euler update: $x^{k+1} = x^k - alpha nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = text{prox}_{alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = nabla (alpha f(x^star) + frac{1}{2}|x^star - x^k|^2) = alpha nabla f(x^star) + x^star - x^k$.
$endgroup$
Here are two interpretations which can lend useful intuition in some circumstances:
Prox is a generalized projection
Consider the indicator of some nonempty convex set $X$: $$i_X(x) = begin{cases}0 & text{if } x in X \ infty & text{if } x not in Xend{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$begin{align}text{prox}_{alpha i_X}(y) &= underset{x}{text{argmin}}(alpha i_X(x)+frac{1}{2}|x - y|^2) \ &= underset{x in X}{text{argmin}}(frac{1}{2}|x - y|^2) \ &= text{proj}_X(y).end{align}$$
Prox is backward Euler
Consider the backward Euler update: $x^{k+1} = x^k - alpha nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = text{prox}_{alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = nabla (alpha f(x^star) + frac{1}{2}|x^star - x^k|^2) = alpha nabla f(x^star) + x^star - x^k$.
answered Nov 2 '18 at 20:10
MoormanlyMoormanly
1314
1314
add a comment |
add a comment |
$begingroup$
If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
$$
text{prox}_{th}(hat x) = arg min_x , h(x) + frac{1}{2t} |x - hat x |_2^2.
$$
When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $hat x$.)
You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem
$$
text{minimize} quad g(x) + h(x)
$$
where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration:
$$
x^{k+1} = text{prox}_{th}(x^k - t nabla g(x^k)).
$$
(This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)
Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration:
$$
x^{k+1} = arg min_x quad h(x) + underbrace{g(x^k) + nabla g(x^k)^T(x - x^k)}_{approx , g(x)} + frac{1}{2t} |x - x^k |_2^2.
$$
Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that
begin{align}
x^{k+1} &= arg min_x , h(x) + frac{1}{2t} |x - (x^k - tnabla g(x^k)) |_2^2 \
&= text{prox}_{th}(x^k - tnabla g(x^k)).
end{align}
This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.
Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that
$$
tag{1} 0 in partial h(x).
$$
Here $partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x mapsto partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that
begin{align}
& 0 in partial h(x) \
iff & x in (I + t partial h)(x) \
iff & x in (I + t partial h)^{-1}(x).
end{align}
The operator $(I + t partial h)^{-1}$ is called the "resolvent" of the operator $partial h$, and it has nice properties such as the following: if $x in mathbb R^n$, then $(I + t partial h)^{-1}(x)$ is a singleton. Thus, $(I + t partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 in partial h(x)$ is equivalent to solving
$$
x = (I + t partial h)^{-1}(x).
$$
A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is
$$
x^{k+1} = (I + t partial h)^{-1}(x^k).
$$
Punch line: The function $(I + t partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".
$endgroup$
add a comment |
$begingroup$
If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
$$
text{prox}_{th}(hat x) = arg min_x , h(x) + frac{1}{2t} |x - hat x |_2^2.
$$
When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $hat x$.)
You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem
$$
text{minimize} quad g(x) + h(x)
$$
where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration:
$$
x^{k+1} = text{prox}_{th}(x^k - t nabla g(x^k)).
$$
(This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)
Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration:
$$
x^{k+1} = arg min_x quad h(x) + underbrace{g(x^k) + nabla g(x^k)^T(x - x^k)}_{approx , g(x)} + frac{1}{2t} |x - x^k |_2^2.
$$
Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that
begin{align}
x^{k+1} &= arg min_x , h(x) + frac{1}{2t} |x - (x^k - tnabla g(x^k)) |_2^2 \
&= text{prox}_{th}(x^k - tnabla g(x^k)).
end{align}
This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.
Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that
$$
tag{1} 0 in partial h(x).
$$
Here $partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x mapsto partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that
begin{align}
& 0 in partial h(x) \
iff & x in (I + t partial h)(x) \
iff & x in (I + t partial h)^{-1}(x).
end{align}
The operator $(I + t partial h)^{-1}$ is called the "resolvent" of the operator $partial h$, and it has nice properties such as the following: if $x in mathbb R^n$, then $(I + t partial h)^{-1}(x)$ is a singleton. Thus, $(I + t partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 in partial h(x)$ is equivalent to solving
$$
x = (I + t partial h)^{-1}(x).
$$
A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is
$$
x^{k+1} = (I + t partial h)^{-1}(x^k).
$$
Punch line: The function $(I + t partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".
$endgroup$
add a comment |
$begingroup$
If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
$$
text{prox}_{th}(hat x) = arg min_x , h(x) + frac{1}{2t} |x - hat x |_2^2.
$$
When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $hat x$.)
You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem
$$
text{minimize} quad g(x) + h(x)
$$
where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration:
$$
x^{k+1} = text{prox}_{th}(x^k - t nabla g(x^k)).
$$
(This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)
Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration:
$$
x^{k+1} = arg min_x quad h(x) + underbrace{g(x^k) + nabla g(x^k)^T(x - x^k)}_{approx , g(x)} + frac{1}{2t} |x - x^k |_2^2.
$$
Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that
begin{align}
x^{k+1} &= arg min_x , h(x) + frac{1}{2t} |x - (x^k - tnabla g(x^k)) |_2^2 \
&= text{prox}_{th}(x^k - tnabla g(x^k)).
end{align}
This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.
Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that
$$
tag{1} 0 in partial h(x).
$$
Here $partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x mapsto partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that
begin{align}
& 0 in partial h(x) \
iff & x in (I + t partial h)(x) \
iff & x in (I + t partial h)^{-1}(x).
end{align}
The operator $(I + t partial h)^{-1}$ is called the "resolvent" of the operator $partial h$, and it has nice properties such as the following: if $x in mathbb R^n$, then $(I + t partial h)^{-1}(x)$ is a singleton. Thus, $(I + t partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 in partial h(x)$ is equivalent to solving
$$
x = (I + t partial h)^{-1}(x).
$$
A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is
$$
x^{k+1} = (I + t partial h)^{-1}(x^k).
$$
Punch line: The function $(I + t partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".
$endgroup$
If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
$$
text{prox}_{th}(hat x) = arg min_x , h(x) + frac{1}{2t} |x - hat x |_2^2.
$$
When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $hat x$.)
You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem
$$
text{minimize} quad g(x) + h(x)
$$
where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration:
$$
x^{k+1} = text{prox}_{th}(x^k - t nabla g(x^k)).
$$
(This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)
Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration:
$$
x^{k+1} = arg min_x quad h(x) + underbrace{g(x^k) + nabla g(x^k)^T(x - x^k)}_{approx , g(x)} + frac{1}{2t} |x - x^k |_2^2.
$$
Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that
begin{align}
x^{k+1} &= arg min_x , h(x) + frac{1}{2t} |x - (x^k - tnabla g(x^k)) |_2^2 \
&= text{prox}_{th}(x^k - tnabla g(x^k)).
end{align}
This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.
Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that
$$
tag{1} 0 in partial h(x).
$$
Here $partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x mapsto partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that
begin{align}
& 0 in partial h(x) \
iff & x in (I + t partial h)(x) \
iff & x in (I + t partial h)^{-1}(x).
end{align}
The operator $(I + t partial h)^{-1}$ is called the "resolvent" of the operator $partial h$, and it has nice properties such as the following: if $x in mathbb R^n$, then $(I + t partial h)^{-1}(x)$ is a singleton. Thus, $(I + t partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 in partial h(x)$ is equivalent to solving
$$
x = (I + t partial h)^{-1}(x).
$$
A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is
$$
x^{k+1} = (I + t partial h)^{-1}(x^k).
$$
Punch line: The function $(I + t partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".
edited Dec 17 '18 at 12:47
answered Dec 17 '18 at 12:37
littleOlittleO
29.6k645109
29.6k645109
add a comment |
add a comment |
Thanks for contributing an answer to Mathematics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f168159%2fwhat-is-the-motivation-of-proximal-mapping%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
$endgroup$
– copper.hat
Jul 8 '12 at 10:01
$begingroup$
Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
$endgroup$
– copper.hat
Jul 8 '12 at 16:10
$begingroup$
Thank you very much. I have ordered one copy.
$endgroup$
– mining
Jul 8 '12 at 23:59
$begingroup$
I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
$endgroup$
– rnegrinho
Jun 12 '14 at 11:42