A good linear regression model without correlation between the predictors and the result

I have trained a linear regression model using a set of variables/features, and the model performs well. However, I have noticed that no individual variable has a good correlation with the predicted variable. How is this possible?










  • These are great answers, but the question is missing a lot of details that the answers are striving to fill in. The biggest question in my mind is what you mean by "good correlation."
    – DHW, 4 hours ago
















Tags: regression, machine-learning, correlation






asked 15 hours ago by Zaratruta
2 Answers


Answer (20 votes):
A pair of variables may show high partial correlation (the correlation accounting for the impact of other variables) but low, or even zero, marginal correlation (pairwise correlation).



This means that the pairwise correlation between a response $y$ and some predictor $x$ may be of little value in identifying variables with (linear) predictive value among a collection of other variables.



Consider the following data:



    y  x
1   6  6
2  12 12
3  18 18
4  24 24
5   1 42
6   7 48
7  13 54
8  19 60

The correlation between y and x is $0$. If I draw the least squares line, it's perfectly horizontal and the $R^2$ is naturally going to be $0$.



But when you add a new variable g, which indicates which of two groups the observations came from, x becomes extremely informative:



    y  x  g
1   6  6  0
2  12 12  0
3  18 18  0
4  24 24  0
5   1 42  1
6   7 48  1
7  13 54  1
8  19 60  1


The $R^2$ of a linear regression model with both the x and g variables in it will be 1.
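This can be checked numerically. A minimal sketch in Python with NumPy (the choice of NumPy is mine; the original answer only shows the data), computing the pairwise correlation and then the $R^2$ of a least-squares fit on both x and g:

```python
import numpy as np

# Data from the answer: y is uncorrelated with x overall,
# but perfectly linear in x within each group g.
y = np.array([6, 12, 18, 24, 1, 7, 13, 19], dtype=float)
x = np.array([6, 12, 18, 24, 42, 48, 54, 60], dtype=float)
g = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Pairwise (marginal) correlation between x and y: essentially 0.
corr_xy = np.corrcoef(x, y)[0, 1]

# Least-squares fit of y on an intercept, x, and g.
X = np.column_stack([np.ones_like(x), x, g])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R^2 of the two-predictor model: 1, since y = x - 41*g holds exactly.
resid = y - X @ beta
r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

print(corr_xy, beta, r2)
```

The fitted coefficients recover y = x − 41·g, so adding the group indicator turns an apparently useless predictor into a perfect one.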



[Plot of y versus x with points colored by group: no overall pairwise linear relationship, but within each group the relationship is perfect.]



It's possible for this sort of thing to happen with every one of the variables in the model: each has a small pairwise correlation with the response, yet the model containing all of them is very good at predicting the response.



Additional reading:



https://en.wikipedia.org/wiki/Omitted-variable_bias



https://en.wikipedia.org/wiki/Simpson%27s_paradox






– Glen_b, answered 14 hours ago (edited 14 hours ago)


Answer (2 votes):
I assume you are training a multiple regression model, in which multiple independent variables $X_1, X_2, \ldots$ are regressed on $Y$. The simple answer is that checking a pairwise correlation is like running an underspecified regression model: you have omitted important variables.



More specifically, when you state "there is no variable with a good correlation with the predicted variable", it sounds like you are checking the pairwise correlation between each independent variable and the dependent variable $Y$. This can happen when $X_2$ brings in important new information and clears up the confounding between $X_1$ and $Y$; with that confounding present, we may not see a linear pairwise correlation between $X_1$ and $Y$. You may also want to compare the partial correlation $\rho_{X_2 Y \cdot X_1}$ with the multiple regression $y = \beta_1 X_1 + \beta_2 X_2 + \epsilon$. Multiple regression has a closer relationship with partial correlation than with the pairwise correlation $\rho_{X_1, Y}$.
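The pairwise-versus-partial distinction can be sketched in a few lines of Python/NumPy (my choice of tooling; `partial_corr` is a hypothetical helper that residualizes both variables on the controls before correlating), reusing the data from the other answer:

```python
import numpy as np

def partial_corr(a, b, controls):
    """Partial correlation of a and b, controlling for `controls`:
    residualize both on the controls (plus an intercept), then
    correlate the residuals."""
    Z = np.column_stack([np.ones(len(a)), controls])
    ra = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
    rb = b - Z @ np.linalg.lstsq(Z, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

# Data from the other answer: y = x in group 0, y = x - 41 in group 1.
y = np.array([6, 12, 18, 24, 1, 7, 13, 19], dtype=float)
x = np.array([6, 12, 18, 24, 42, 48, 54, 60], dtype=float)
g = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

pairwise = np.corrcoef(x, y)[0, 1]   # essentially 0
partial = partial_corr(x, y, g)      # 1: perfect once g is controlled for

print(pairwise, partial)
```

So the same predictor that looks useless in a pairwise check is a perfect predictor once the confounding group variable is partialled out.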






– Ray Yang (new contributor), answered 14 hours ago
















