For convex problems, does gradient in Stochastic Gradient Descent (SGD) always point at the global extreme value?

Given a convex cost function that we optimize with SGD, we have a gradient (vector) at a certain point during the optimization process.



My question is: at a given point on the convex surface, does the gradient only point in the direction in which the function increases/decreases fastest, or does it always point towards the optimal/extreme point of the cost function?



The former is a local concept; the latter is a global one.



SGD eventually converges to the extreme value of the cost function. I'm wondering about the difference between the direction of the gradient at an arbitrary point on the convex surface and the direction pointing at the global extreme value.



The gradient's direction should be the direction in which the function increases/decreases fastest at that point, right?










Tags: optimization sgd convex

asked by Tyler 提督九门步军巡捕五营统领


3 Answers

















Answer by gunes (3 votes):













The local steepest-descent direction is not the same as the direction towards the global optimum. If it were, your gradient direction would never change: if you always moved towards the optimum, your direction vector would always point at the optimum. But that is not the case, and if it were, why bother recomputing the gradient at every iteration?
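To make this concrete, here is a minimal sketch in R (my illustration, not part of the original answer), using the elongated convex bowl f(a, b) = a^2 + 10*b^2, whose minimum is at the origin; at the point (1, 1) the steepest-descent direction is clearly not the direction pointing at that minimum:

# minimal sketch (not from the answer): elongated convex bowl f(a,b) = a^2 + 10*b^2,
# with its minimum at (0, 0)
f_grad <- function(p) c(2 * p[1], 20 * p[2])   # gradient of f

p      <- c(1, 1)          # an arbitrary point on the surface
g      <- f_grad(p)        # local steepest-ascent direction: (2, 20)
to_opt <- c(0, 0) - p      # direction pointing at the global minimum: (-1, -1)

# compare the two as unit vectors
print(-g / sqrt(sum(g^2)))            # steepest descent: approx (-0.10, -0.99)
print(to_opt / sqrt(sum(to_opt^2)))   # towards the optimum: approx (-0.71, -0.71)

The two unit vectors coincide only when the contours are circular; for any elongated convex surface they differ, which is why the gradient has to be recomputed at every step.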






Answer by Martijn Weterings (2 votes):















• Gradient descent methods use the local slope of the surface.

• This slope will not necessarily (and most likely will not) point directly towards the extreme point.

An intuitive view is to imagine a descent path that is curved. See, for instance, the example below.



This particular example looks a bit like a valley. If you wish to walk from a hill to some low point in the valley (say, the bottom of an empty lake), then a descending path is first to walk down the hill towards the river, and then to walk along the river to the lake.



On the hill you do not know in which direction the bottom of the lake lies, but you do know that you will reach the lake as long as you keep walking downwards.



[figure: descent path]

Code example:



            library("shape")
            set.seed(1)

            # model some data
            x <- seq(0,25,0.25)
            y <- exp(-0.25*x) - exp(-0.3*x) + rnorm(101, 0 ,0.01)

            # view data
            plot(x, y)


            # loss function
            fr <- function(p)
            a <- p[1]
            b <- p[2]
            ypred <- exp(-a*x)-exp(-b*x)
            sum((y-ypred)^2)


            # gradient of loss function
            gr <- function(p)
            a <- p[1]
            b <- p[2]
            ypred <- exp(-a*x)-exp(-b*x)
            dfda <- 2*sum(x*exp(-a*x)*(y-ypred))
            dfdb <- 2*sum(-x*exp(-b*x)*(y-ypred))
            c(dfda,dfdb)



            # compute loss function on a grid
            n=301
            xc <- 0.4*seq(0,1,length.out=n)
            yc <- 0.4*seq(0,1,length.out=n)
            z <- matrix(rep(0,n^2),n)
            for (i in 1:n)
            for(j in 1:n)
            z[i,j] <- fr(c(xc[i],yc[j]))




            # levels for plotting
            levels <- 10^seq(-2.5,2,0.5)
            key <- seq(-2.5,2,0.5)

            # colours for plotting
            colours <- function(n) hsv(c(seq(0.15,0.7,length.out=n),0),
            c(seq(0.2,0.4,length.out=n),0),
            c(seq(1,1,length.out=n),0.9))

            # empty plot
            plot(-1000,-1000,
            xlab="a",ylab="b",
            xlim=range(xc),
            ylim=range(yc))

            # add contours
            .filled.contour(xc,yc,z,
            col=colours(10),
            levels=levels)
            contour(xc,yc,z,add=1, levels=levels, labels = key)


            # compute path
            # start value
            new=c(0.2,0.02)

            # make lots of small steps
            for (i in 1:2000)
            ### safe old value
            old <- new

            ### compute step direction by using gradient
            grr <- gr(new)
            lg <- sqrt(100*grr[1]^2+grr[2]^2)
            step <- -grr/lg

            ### find best step size (yes this is a bit computation intensive)
            min <- fr(old)
            scale = 10^(1-2*i/5000)/4
            for (j in 10^-6*scale*10^seq(0,3,length.out=100))
            if (fr(old+step*j)<min)
            new <- old+step*j
            min <- fr(new)


            ### plot path
            lines(c(old[1],new[1]),c(old[2],new[2]),lw=3)


            # finsih plot with title and annotation
            title(expression(paste("optimizing", sum((e^-ax[i]-e^-bx[i]-y[i])^2,i==1,n))))
            points(0.2,0.02)
            text(0.2,0.02,"start",pos=4,cex=1)
            points(new[1],new[2])
            text(new[1],new[2],"end",pos=4,cex=1)
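Running the script reproduces the picture: a contour plot of the loss surface over the parameters a and b, with the descent path drawn on top. The path does not head straight for the minimum; it first drops into the valley and then bends to follow the valley floor towards the end point, which is the curved "walk down to the river, then follow the river" route described above.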





Answer by Jan Kukacka (1 vote):













They say an image is worth more than a thousand words. In the following example (courtesy of MS Paint, a handy tool for both amateur and professional statisticians), you can see a convex function surface and a point where the direction of steepest descent clearly differs from the direction towards the optimum.



[image: an elongated convex function with arrows showing that the direction of steepest descent is not the same as the direction towards the global optimum]





