What is the motivation of proximal mapping?












6












$begingroup$


For a convex function $h$, its proximal operator is defined as:
$$operatorname{prox}_h(x)=argmin_u Big(h(u)+frac{1}{2}|u-x|^2Big)$$
Can anyone provide an intuitive explanation/motivation of proximal mapping?










share|cite|improve this question











$endgroup$












  • $begingroup$
    Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
    $endgroup$
    – copper.hat
    Jul 8 '12 at 10:01












  • $begingroup$
    Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
    $endgroup$
    – copper.hat
    Jul 8 '12 at 16:10










  • $begingroup$
    Thank you very much. I have ordered one copy.
    $endgroup$
    – mining
    Jul 8 '12 at 23:59










  • $begingroup$
    I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
    $endgroup$
    – rnegrinho
    Jun 12 '14 at 11:42
















6












$begingroup$


For a convex function $h$, its proximal operator is defined as:
$$operatorname{prox}_h(x)=argmin_u Big(h(u)+frac{1}{2}|u-x|^2Big)$$
Can anyone provide an intuitive explanation/motivation of proximal mapping?










share|cite|improve this question











$endgroup$












  • $begingroup$
    Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
    $endgroup$
    – copper.hat
    Jul 8 '12 at 10:01












  • $begingroup$
    Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
    $endgroup$
    – copper.hat
    Jul 8 '12 at 16:10










  • $begingroup$
    Thank you very much. I have ordered one copy.
    $endgroup$
    – mining
    Jul 8 '12 at 23:59










  • $begingroup$
    I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
    $endgroup$
    – rnegrinho
    Jun 12 '14 at 11:42














6












6








6


1



$begingroup$


For a convex function $h$, its proximal operator is defined as:
$$operatorname{prox}_h(x)=argmin_u Big(h(u)+frac{1}{2}|u-x|^2Big)$$
Can anyone provide an intuitive explanation/motivation of proximal mapping?










share|cite|improve this question











$endgroup$




For a convex function $h$, its proximal operator is defined as:
$$operatorname{prox}_h(x)=argmin_u Big(h(u)+frac{1}{2}|u-x|^2Big)$$
Can anyone provide an intuitive explanation/motivation of proximal mapping?







convex-analysis convex-optimization






share|cite|improve this question















share|cite|improve this question













share|cite|improve this question




share|cite|improve this question








edited Dec 17 '18 at 12:45









littleO

29.6k645109




29.6k645109










asked Jul 8 '12 at 6:40









miningmining

465513




465513












  • $begingroup$
    Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
    $endgroup$
    – copper.hat
    Jul 8 '12 at 10:01












  • $begingroup$
    Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
    $endgroup$
    – copper.hat
    Jul 8 '12 at 16:10










  • $begingroup$
    Thank you very much. I have ordered one copy.
    $endgroup$
    – mining
    Jul 8 '12 at 23:59










  • $begingroup$
    I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
    $endgroup$
    – rnegrinho
    Jun 12 '14 at 11:42


















  • $begingroup$
    Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
    $endgroup$
    – copper.hat
    Jul 8 '12 at 10:01












  • $begingroup$
    Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
    $endgroup$
    – copper.hat
    Jul 8 '12 at 16:10










  • $begingroup$
    Thank you very much. I have ordered one copy.
    $endgroup$
    – mining
    Jul 8 '12 at 23:59










  • $begingroup$
    I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
    $endgroup$
    – rnegrinho
    Jun 12 '14 at 11:42
















$begingroup$
Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
$endgroup$
– copper.hat
Jul 8 '12 at 10:01






$begingroup$
Not an answer, but... If you let $g(x) = frac{1}{2} ||x||^2$, then the function $h square g (x) = inf_{u} h(u)+g(x-u)$ is called the infimal convolution and corresponds to the function you get from the set $mathbb{epi}, h + mathbb{epi} , g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference. Predates the current popularization of convex analysis.
$endgroup$
– copper.hat
Jul 8 '12 at 10:01














$begingroup$
Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
$endgroup$
– copper.hat
Jul 8 '12 at 16:10




$begingroup$
Oops, I should have been more explicit, $mathbb{epi} , h square g = mathbb{epi} ,h + mathbb{epi} , g$. The epigraph defines the function.
$endgroup$
– copper.hat
Jul 8 '12 at 16:10












$begingroup$
Thank you very much. I have ordered one copy.
$endgroup$
– mining
Jul 8 '12 at 23:59




$begingroup$
Thank you very much. I have ordered one copy.
$endgroup$
– mining
Jul 8 '12 at 23:59












$begingroup$
I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
$endgroup$
– rnegrinho
Jun 12 '14 at 11:42




$begingroup$
I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at this monograph by Boyd for more details.
$endgroup$
– rnegrinho
Jun 12 '14 at 11:42










2 Answers
2






active

oldest

votes


















3












$begingroup$

Here are two interpretations which can lend useful intuition in some circumstances:



Prox is a generalized projection



Consider the indicator of some nonempty convex set $X$: $$i_X(x) = begin{cases}0 & text{if } x in X \ infty & text{if } x not in Xend{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$begin{align}text{prox}_{alpha i_X}(y) &= underset{x}{text{argmin}}(alpha i_X(x)+frac{1}{2}|x - y|^2) \ &= underset{x in X}{text{argmin}}(frac{1}{2}|x - y|^2) \ &= text{proj}_X(y).end{align}$$



Prox is backward Euler



Consider the backward Euler update: $x^{k+1} = x^k - alpha nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = text{prox}_{alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = nabla (alpha f(x^star) + frac{1}{2}|x^star - x^k|^2) = alpha nabla f(x^star) + x^star - x^k$.






share|cite|improve this answer









$endgroup$





















    3












    $begingroup$

    If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
    $$
    text{prox}_{th}(hat x) = arg min_x , h(x) + frac{1}{2t} |x - hat x |_2^2.
    $$

    When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $hat x$.)



    You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem
    $$
    text{minimize} quad g(x) + h(x)
    $$

    where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration:
    $$
    x^{k+1} = text{prox}_{th}(x^k - t nabla g(x^k)).
    $$

    (This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)





    Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration:
    $$
    x^{k+1} = arg min_x quad h(x) + underbrace{g(x^k) + nabla g(x^k)^T(x - x^k)}_{approx , g(x)} + frac{1}{2t} |x - x^k |_2^2.
    $$

    Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that
    begin{align}
    x^{k+1} &= arg min_x , h(x) + frac{1}{2t} |x - (x^k - tnabla g(x^k)) |_2^2 \
    &= text{prox}_{th}(x^k - tnabla g(x^k)).
    end{align}

    This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.





    Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that
    $$
    tag{1} 0 in partial h(x).
    $$

    Here $partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x mapsto partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that
    begin{align}
    & 0 in partial h(x) \
    iff & x in (I + t partial h)(x) \
    iff & x in (I + t partial h)^{-1}(x).
    end{align}

    The operator $(I + t partial h)^{-1}$ is called the "resolvent" of the operator $partial h$, and it has nice properties such as the following: if $x in mathbb R^n$, then $(I + t partial h)^{-1}(x)$ is a singleton. Thus, $(I + t partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 in partial h(x)$ is equivalent to solving



    $$
    x = (I + t partial h)^{-1}(x).
    $$

    A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is
    $$
    x^{k+1} = (I + t partial h)^{-1}(x^k).
    $$



    Punch line: The function $(I + t partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".






    share|cite|improve this answer











    $endgroup$













      Your Answer





      StackExchange.ifUsing("editor", function () {
      return StackExchange.using("mathjaxEditing", function () {
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
      });
      });
      }, "mathjax-editing");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "69"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      noCode: true, onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f168159%2fwhat-is-the-motivation-of-proximal-mapping%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      3












      $begingroup$

      Here are two interpretations which can lend useful intuition in some circumstances:



      Prox is a generalized projection



      Consider the indicator of some nonempty convex set $X$: $$i_X(x) = begin{cases}0 & text{if } x in X \ infty & text{if } x not in Xend{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$begin{align}text{prox}_{alpha i_X}(y) &= underset{x}{text{argmin}}(alpha i_X(x)+frac{1}{2}|x - y|^2) \ &= underset{x in X}{text{argmin}}(frac{1}{2}|x - y|^2) \ &= text{proj}_X(y).end{align}$$



      Prox is backward Euler



      Consider the backward Euler update: $x^{k+1} = x^k - alpha nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = text{prox}_{alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = nabla (alpha f(x^star) + frac{1}{2}|x^star - x^k|^2) = alpha nabla f(x^star) + x^star - x^k$.






      share|cite|improve this answer









      $endgroup$


















        3












        $begingroup$

        Here are two interpretations which can lend useful intuition in some circumstances:



        Prox is a generalized projection



        Consider the indicator of some nonempty convex set $X$: $$i_X(x) = begin{cases}0 & text{if } x in X \ infty & text{if } x not in Xend{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$begin{align}text{prox}_{alpha i_X}(y) &= underset{x}{text{argmin}}(alpha i_X(x)+frac{1}{2}|x - y|^2) \ &= underset{x in X}{text{argmin}}(frac{1}{2}|x - y|^2) \ &= text{proj}_X(y).end{align}$$



        Prox is backward Euler



        Consider the backward Euler update: $x^{k+1} = x^k - alpha nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = text{prox}_{alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = nabla (alpha f(x^star) + frac{1}{2}|x^star - x^k|^2) = alpha nabla f(x^star) + x^star - x^k$.






        share|cite|improve this answer









        $endgroup$
















          3












          3








          3





          $begingroup$

          Here are two interpretations which can lend useful intuition in some circumstances:



          Prox is a generalized projection



          Consider the indicator of some nonempty convex set $X$: $$i_X(x) = begin{cases}0 & text{if } x in X \ infty & text{if } x not in Xend{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$begin{align}text{prox}_{alpha i_X}(y) &= underset{x}{text{argmin}}(alpha i_X(x)+frac{1}{2}|x - y|^2) \ &= underset{x in X}{text{argmin}}(frac{1}{2}|x - y|^2) \ &= text{proj}_X(y).end{align}$$



          Prox is backward Euler



          Consider the backward Euler update: $x^{k+1} = x^k - alpha nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = text{prox}_{alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = nabla (alpha f(x^star) + frac{1}{2}|x^star - x^k|^2) = alpha nabla f(x^star) + x^star - x^k$.






          share|cite|improve this answer









          $endgroup$



          Here are two interpretations which can lend useful intuition in some circumstances:



          Prox is a generalized projection



          Consider the indicator of some nonempty convex set $X$: $$i_X(x) = begin{cases}0 & text{if } x in X \ infty & text{if } x not in Xend{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$begin{align}text{prox}_{alpha i_X}(y) &= underset{x}{text{argmin}}(alpha i_X(x)+frac{1}{2}|x - y|^2) \ &= underset{x in X}{text{argmin}}(frac{1}{2}|x - y|^2) \ &= text{proj}_X(y).end{align}$$



          Prox is backward Euler



          Consider the backward Euler update: $x^{k+1} = x^k - alpha nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = text{prox}_{alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = nabla (alpha f(x^star) + frac{1}{2}|x^star - x^k|^2) = alpha nabla f(x^star) + x^star - x^k$.







          share|cite|improve this answer












          share|cite|improve this answer



          share|cite|improve this answer










          answered Nov 2 '18 at 20:10









          MoormanlyMoormanly

          1314




          1314























              3












              $begingroup$

              If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
              $$
              text{prox}_{th}(hat x) = arg min_x , h(x) + frac{1}{2t} |x - hat x |_2^2.
              $$

              When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $hat x$.)



              You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem
              $$
              text{minimize} quad g(x) + h(x)
              $$

              where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration:
              $$
              x^{k+1} = text{prox}_{th}(x^k - t nabla g(x^k)).
              $$

              (This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)





              Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration:
              $$
              x^{k+1} = arg min_x quad h(x) + underbrace{g(x^k) + nabla g(x^k)^T(x - x^k)}_{approx , g(x)} + frac{1}{2t} |x - x^k |_2^2.
              $$

              Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that
              begin{align}
              x^{k+1} &= arg min_x , h(x) + frac{1}{2t} |x - (x^k - tnabla g(x^k)) |_2^2 \
              &= text{prox}_{th}(x^k - tnabla g(x^k)).
              end{align}

              This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.





              Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that
              $$
              tag{1} 0 in partial h(x).
              $$

              Here $partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x mapsto partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that
              begin{align}
              & 0 in partial h(x) \
              iff & x in (I + t partial h)(x) \
              iff & x in (I + t partial h)^{-1}(x).
              end{align}

              The operator $(I + t partial h)^{-1}$ is called the "resolvent" of the operator $partial h$, and it has nice properties such as the following: if $x in mathbb R^n$, then $(I + t partial h)^{-1}(x)$ is a singleton. Thus, $(I + t partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 in partial h(x)$ is equivalent to solving



              $$
              x = (I + t partial h)^{-1}(x).
              $$

              A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is
              $$
              x^{k+1} = (I + t partial h)^{-1}(x^k).
              $$



              Punch line: The function $(I + t partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".






              share|cite|improve this answer











              $endgroup$


















                3












                $begingroup$

                If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
                $$
                text{prox}_{th}(hat x) = arg min_x , h(x) + frac{1}{2t} |x - hat x |_2^2.
                $$

                When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $hat x$.)



                You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem
                $$
                text{minimize} quad g(x) + h(x)
                $$

                where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration:
                $$
                x^{k+1} = text{prox}_{th}(x^k - t nabla g(x^k)).
                $$

                (This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)





                Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration:
                $$
                x^{k+1} = arg min_x quad h(x) + underbrace{g(x^k) + nabla g(x^k)^T(x - x^k)}_{approx , g(x)} + frac{1}{2t} |x - x^k |_2^2.
                $$

                Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that
                begin{align}
                x^{k+1} &= arg min_x , h(x) + frac{1}{2t} |x - (x^k - tnabla g(x^k)) |_2^2 \
                &= text{prox}_{th}(x^k - tnabla g(x^k)).
                end{align}

                This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.





                Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that
                $$
                tag{1} 0 in partial h(x).
                $$

                Here $partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x mapsto partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that
                begin{align}
                & 0 in partial h(x) \
                iff & x in (I + t partial h)(x) \
                iff & x in (I + t partial h)^{-1}(x).
                end{align}

                The operator $(I + t partial h)^{-1}$ is called the "resolvent" of the operator $partial h$, and it has nice properties such as the following: if $x in mathbb R^n$, then $(I + t partial h)^{-1}(x)$ is a singleton. Thus, $(I + t partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 in partial h(x)$ is equivalent to solving



                $$
                x = (I + t partial h)^{-1}(x).
                $$

                A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is
                $$
                x^{k+1} = (I + t partial h)^{-1}(x^k).
                $$



                Punch line: The function $(I + t partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".






                share|cite|improve this answer











                $endgroup$
















                  3












                  3








                  3





                  $begingroup$

                  If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
                  $$
                  text{prox}_{th}(hat x) = arg min_x , h(x) + frac{1}{2t} |x - hat x |_2^2.
                  $$

                  When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $hat x$.)



                  You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem
                  $$
                  text{minimize} quad g(x) + h(x)
                  $$

                  where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration:
                  $$
                  x^{k+1} = text{prox}_{th}(x^k - t nabla g(x^k)).
                  $$

                  (This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)





                  Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration:
                  $$
                  x^{k+1} = arg min_x quad h(x) + underbrace{g(x^k) + nabla g(x^k)^T(x - x^k)}_{approx , g(x)} + frac{1}{2t} |x - x^k |_2^2.
                  $$

                  Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that
                  begin{align}
                  x^{k+1} &= arg min_x , h(x) + frac{1}{2t} |x - (x^k - tnabla g(x^k)) |_2^2 \
                  &= text{prox}_{th}(x^k - tnabla g(x^k)).
                  end{align}

                  This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.





                  Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that
                  $$
                  tag{1} 0 in partial h(x).
                  $$

                  Here $partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x mapsto partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that
                  begin{align}
                  & 0 in partial h(x) \
                  iff & x in (I + t partial h)(x) \
                  iff & x in (I + t partial h)^{-1}(x).
                  end{align}

                  The operator $(I + t partial h)^{-1}$ is called the "resolvent" of the operator $partial h$, and it has nice properties such as the following: if $x in mathbb R^n$, then $(I + t partial h)^{-1}(x)$ is a singleton. Thus, $(I + t partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 in partial h(x)$ is equivalent to solving



                  $$
                  x = (I + t partial h)^{-1}(x).
                  $$

                  A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is
                  $$
                  x^{k+1} = (I + t partial h)^{-1}(x^k).
                  $$



                  Punch line: The function $(I + t partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".






                  share|cite|improve this answer











                  $endgroup$



                  If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by
                  $$
                  text{prox}_{th}(hat x) = arg min_x , h(x) + frac{1}{2t} |x - hat x |_2^2.
                  $$

                  When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $hat x$.)



                  You can see that evaluating a prox-operator might be a useful sub-step in an optimization algorithm. For example, suppose that we want to solve the optimization problem
                  $$
                  text{minimize} quad g(x) + h(x)
                  $$

                  where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration:
                  $$
                  x^{k+1} = text{prox}_{th}(x^k - t nabla g(x^k)).
                  $$

                  (This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)





                  Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration:
                  $$
                  x^{k+1} = arg min_x quad h(x) + underbrace{g(x^k) + nabla g(x^k)^T(x - x^k)}_{approx , g(x)} + frac{1}{2t} |x - x^k |_2^2.
                  $$

                  Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that
                  begin{align}
                  x^{k+1} &= arg min_x , h(x) + frac{1}{2t} |x - (x^k - tnabla g(x^k)) |_2^2 \
                  &= text{prox}_{th}(x^k - tnabla g(x^k)).
                  end{align}

                  This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.





                  Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that
                  $$
                  tag{1} 0 in partial h(x).
                  $$

                  Here $partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x mapsto partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that
                  begin{align}
                  & 0 in partial h(x) \
                  iff & x in (I + t partial h)(x) \
                  iff & x in (I + t partial h)^{-1}(x).
                  end{align}

                  The operator $(I + t partial h)^{-1}$ is called the "resolvent" of the operator $partial h$, and it has nice properties such as the following: if $x in mathbb R^n$, then $(I + t partial h)^{-1}(x)$ is a singleton. Thus, $(I + t partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 in partial h(x)$ is equivalent to solving



                  $$
                  x = (I + t partial h)^{-1}(x).
                  $$

                  A natural idea is to solve for $x$ using fixed point iteration. The resulting iteration is
                  $$
                  x^{k+1} = (I + t partial h)^{-1}(x^k).
                  $$



                  Punch line: The function $(I + t partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".







                  share|cite|improve this answer














                  share|cite|improve this answer



                  share|cite|improve this answer








                  edited Dec 17 '18 at 12:47

























                  answered Dec 17 '18 at 12:37









                  littleOlittleO

                  29.6k645109




                  29.6k645109






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Mathematics Stack Exchange!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      Use MathJax to format equations. MathJax reference.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f168159%2fwhat-is-the-motivation-of-proximal-mapping%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Måne

                      Storängen

                      VLT Carioca