Understanding minimizing cost correctly


I cannot wrap my head around this simple concept.

Suppose we have a linear regression with a single parameter $\theta$ to be optimized (for simplicity):

$h(x) = \theta \cdot x$

The error cost function could be defined as $J(\theta) = \frac{1}{m} \cdot \sum (h(x) - y(x))^2$, with the sum taken over each $x$.

Then $\theta$ would be updated as:

$\theta = \theta - \alpha \cdot \frac{1}{m} \cdot \sum (h(x) - y(x)) \cdot x$, with the sum taken over each $x$.

From my understanding, the multiplier after the $\alpha$ term is the derivative of the error cost function $J$. This term tells us the direction to head in, in order to arrive at the minimum by making a small step at a time. I understand the concept of "hill climbing" correctly, at least I think.

Here is what I can't seem to wrap my head around:

If the form of the error function is known (as in our case: we could visually plot the function if we take enough values of $\theta$ and plug them into the model), why can't we take the first derivative and set it to zero (the partial derivatives, if the function has multiple thetas)? This way we would find all the stationary points of the function, and with the second derivative we could determine whether each one is a minimum or a maximum.

I've seen this done in calculus for simple functions like $y = x^2 + 5x + 2$ (many years ago, so maybe I am wrong). What is stopping us from doing the same thing here?

Sorry for asking such a silly question.

Thank you.
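To make the comparison in the question concrete, here is a minimal Python sketch of both routes for the one-parameter model $h(x) = \theta \cdot x$. The data, the step size `alpha`, the step count, and the function names are assumptions made up for illustration; they are not from the post. The update rule is the one written in the question (the constant factor of 2 from differentiating the square is treated as absorbed into `alpha`), and the closed form comes from setting the derivative to zero, which gives $\theta = \sum x y / \sum x^2$.

```python
def gradient_descent(xs, ys, alpha=0.01, steps=1000, theta=0.0):
    """Repeat the update theta <- theta - alpha * (1/m) * sum((theta*x - y) * x)."""
    m = len(xs)
    for _ in range(steps):
        grad = sum((theta * x - y) * x for x, y in zip(xs, ys)) / m
        theta -= alpha * grad
    return theta


def closed_form(xs, ys):
    """Set the derivative to zero and solve: theta = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Made-up data lying roughly on y = 2x (an assumption for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

theta_gd = gradient_descent(xs, ys)
theta_cf = closed_form(xs, ys)
print(theta_gd, theta_cf)
```

On this toy data the two values should agree to many decimal places, which suggests that for this simple model nothing stops the calculus approach; the trade-offs only show up with many parameters or with models that have no closed form.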
















      linear-regression cost-function






edited 14 hours ago by Siong Thye Goh

asked 14 hours ago by zafirzarya

          1 Answer
Consider differentiating the squared error with respect to $\theta$ and setting the gradient to zero: $$\nabla_\theta \|X\theta - y\|^2 = 2X^T(X\theta - y) = 0$$

Rearranging gives $$X^TX\theta = X^Ty$$

Solving this system gives the optimal solution in theory. In practice, however, numerical stability is an issue, and so is computational complexity: solving a linear system directly takes time cubic in the number of unknowns.

Also, sometimes we do not even have a closed form, in which case a gradient-based approach is more applicable.
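As a small illustration of the system $X^TX\theta = X^Ty$, here is a sketch for a two-parameter model $h(x) = \theta_0 + \theta_1 x$, written in plain Python so the pieces of $X^TX$ and $X^Ty$ are visible. The model, the data, and the function name `normal_equations_2d` are assumptions for illustration only. With more parameters one would normally hand $X$ to a linear-algebra library; NumPy's `numpy.linalg.lstsq`, for example, works on $X$ directly rather than forming $X^TX$.

```python
def normal_equations_2d(xs, ys):
    """Solve X^T X theta = X^T y for the model h(x) = theta0 + theta1 * x.

    Each row of X is [1, x], so X^T X is the 2x2 matrix
    [[m, sum(x)], [sum(x), sum(x^2)]] and X^T y is [sum(y), sum(x*y)].
    """
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    det = m * sxx - sx * sx  # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular; there is no unique solution")

    # Cramer's rule for the 2x2 system.
    theta0 = (sy * sxx - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 1 + 2x
theta0, theta1 = normal_equations_2d(xs, ys)
print(theta0, theta1)
```

The division by `det` hints at the stability concern: when the columns of $X$ are nearly dependent, that determinant is close to zero and small rounding errors in the data are amplified in the solution.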








• Thank you for replying. However, I am not mathematically literate enough to understand your answer. Is there a simpler answer? – zafirzarya 14 hours ago










• I found an answer on MSE illustrating why computing $X^TX$ is bad. Most approaches that solve the normal equations directly are more expensive than a gradient-based approach. Gradient-based approaches have also been adapted to a sampling-based variant, known as stochastic gradient descent, that can handle very large data sets. – Siong Thye Goh 14 hours ago
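The stochastic variant mentioned in the comment above can be sketched roughly as follows for the one-parameter model $h(x) = \theta \cdot x$. This is only an illustrative sketch: the data, the step size `alpha`, the epoch count, the fixed seed, and the function name `sgd` are all assumptions, not something taken from the thread. The point is that each update looks at a single sample, so the full sum over the data is never computed.

```python
import random


def sgd(xs, ys, alpha=0.01, epochs=200, seed=0):
    """Stochastic gradient descent for h(x) = theta * x.

    Each step uses one sample (visited in random order) instead of the full sum.
    """
    rng = random.Random(seed)
    theta = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for i in indices:
            grad = (theta * xs[i] - ys[i]) * xs[i]  # gradient from a single sample
            theta -= alpha * grad
    return theta


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x, so every sample agrees on theta = 2
theta_sgd = sgd(xs, ys)
print(theta_sgd)
```

With noisy data the iterates would typically hover around the minimizer rather than settle exactly, which is why a decreasing step size is often used in practice.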










          Чепеларе Съдържание География | История | Население | Спортни и природни забележителности | Културни и исторически обекти | Религии | Обществени институции | Известни личности | Редовни събития | Галерия | Източници | Литература | Външни препратки | Навигация41°43′23.99″ с. ш. 24°41′09.99″ и. д. / 41.723333° с. ш. 24.686111° и. д.*ЧепелареЧепеларски Linux fest 2002Начало на Зимен сезон 2005/06Национални хайдушки празници „Капитан Петко Войвода“Град ЧепелареЧепеларе – народният ски курортbgrod.orgwww.terranatura.hit.bgСправка за населението на гр. Исперих, общ. Исперих, обл. РазградМузей на родопския карстМузей на спорта и скитеЧепеларебългарскибългарскианглийскитукИстория на градаСки писти в ЧепелареВремето в ЧепелареРадио и телевизия в ЧепелареЧепеларе мами с родопски чар и добри пистиЕвтин туризъм и снежни атракции в ЧепелареМестоположениеИнформация и снимки от музея на родопския карст3D панорами от ЧепелареЧепелареррр