Binary predictor with highly skewed distribution




















I am running a linear regression model and I have a binary predictor with a highly skewed distribution: one category accounts for 96% of the data, and the other 4% amounts to just 26 observations.

Should I keep or remove this binary predictor? And what is the rationale for doing so? Thank you in advance!










      regression binary-data skewness predictor














      edited 2 hours ago

























      asked 2 hours ago









      curiousmind





















          1 Answer













          In general, it's not an issue; you should keep the predictor if it makes sense for it to be in the model, which presumably it does or it wouldn't be there to begin with.
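As a rough sense of scale for the 26-observation case in the question, consider the simplest setting, where the dummy is the only predictor and the errors share a common standard deviation $\sigma$. Both are assumptions, not something the question states. The group sizes $n_1 = 26$ and $n_0 = 624$ further assume that 26 is exactly 4% of 650 observations. The dummy's coefficient is then the difference in group means, with standard error:

```latex
\operatorname{SE}(\hat\beta_{\text{dummy}})
  = \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_0}}
  = \sigma \sqrt{\frac{1}{26} + \frac{1}{624}}
  \approx 0.20\,\sigma
```

The rare category is estimated less precisely than a balanced split would allow, but 26 observations still carry real information. With other predictors in the model the exact value would differ.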



          Consider, for example, a model for weekly sales of chayote squash in the New Orleans area (see https://en.wikipedia.org/wiki/Chayote, down in the "Americas" section). Such a model would likely need a dummy variable for Thanksgiving week to capture the very large increase in chayote sales at Thanksgiving (more than 5x "regular" sales). This dummy variable would take the value 1 once every 52 weeks and 0 the rest of the time, so the "not Thanksgiving week" category represents roughly 98% of the data. If we take the dummy variable out, our Thanksgiving forecasts will be terrible, and all the rest of our forecasts will likely be a lot worse too, because the Thanksgiving data points would affect them in various ways (e.g., trends look much steeper if Thanksgiving is near the end of the modeling horizon).
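The effect described above can be sketched with a small simulation. Everything here is made up for illustration: the three years of weekly data, the baseline of 100, the trend of 0.1 per week, the Thanksgiving jump of 400, the noise level, and the choice of week 47 as Thanksgiving week. The sketch fits ordinary least squares with and without the rare dummy and compares the trend estimate and the fit on regular weeks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three 52-week years, Thanksgiving in week 47 of each.
n_weeks = 156
week = np.arange(n_weeks, dtype=float)
thanksgiving = (np.arange(n_weeks) % 52 == 47).astype(float)

true_slope = 0.1
# Baseline sales with a mild trend, a large holiday jump, and noise.
sales = (100.0 + true_slope * week + 400.0 * thanksgiving
         + rng.normal(0.0, 5.0, n_weeks))


def fit_ols(X, y):
    """Return least-squares coefficients for y ~ X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


ones = np.ones(n_weeks)
X_with = np.column_stack([ones, week, thanksgiving])   # includes the dummy
X_without = np.column_stack([ones, week])              # dummy omitted

coef_with = fit_ols(X_with, sales)
coef_without = fit_ols(X_without, sales)

# Root-mean-square error on the regular (non-Thanksgiving) weeks only.
regular = thanksgiving == 0


def rmse(X, coef, mask):
    resid = sales[mask] - X[mask] @ coef
    return float(np.sqrt(np.mean(resid ** 2)))


rmse_with = rmse(X_with, coef_with, regular)
rmse_without = rmse(X_without, coef_without, regular)

print("trend with dummy:   ", coef_with[1])
print("trend without dummy:", coef_without[1])
print("regular-week RMSE with / without dummy:", rmse_with, rmse_without)
```

In this setup the last Thanksgiving falls near the end of the series, so omitting the dummy pulls the estimated trend upward and degrades the fit on ordinary weeks as well, even though the dummy is 1 in only 3 of 156 rows.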



          It's important, however, to note the following caveat. @Henry's comment in response to the OP is of course correct: if you have only one observation in one of the two categories, including the dummy variable will, in effect, simply remove that observation from the data set. The dummy's coefficient fits that observation exactly, so all your other parameter estimates would be the same as if you had just deleted it.
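This caveat can be checked numerically. The sketch below uses made-up data (50 observations, one continuous predictor `x`, and an artificially shifted first observation) and compares two least-squares fits: one with a dummy that is 1 only for the first observation, and one with that observation deleted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y depends linearly on x, plus noise.
n = 50
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)
y[0] += 25.0  # make the first observation unusual

# Dummy that equals 1 for the first observation only.
dummy = np.zeros(n)
dummy[0] = 1.0

# Fit 1: all observations, with the single-observation dummy.
X_dummy = np.column_stack([np.ones(n), x, dummy])
coef_dummy, *_ = np.linalg.lstsq(X_dummy, y, rcond=None)

# Fit 2: first observation deleted, no dummy.
X_drop = np.column_stack([np.ones(n - 1), x[1:]])
coef_drop, *_ = np.linalg.lstsq(X_drop, y[1:], rcond=None)

# Intercept and slope agree between the two fits.
same_estimates = np.allclose(coef_dummy[:2], coef_drop)

# The dummy absorbs the first observation's residual entirely.
resid_first = float(y[0] - X_dummy[0] @ coef_dummy)

print("same intercept and slope:", same_estimates)
print("residual of the dummied observation:", resid_first)
```

The intercept and slope match up to floating-point error, and the residual of the dummied observation is numerically zero, which is what "removing the observation" means here. With several observations in the rare category, as in the question, this equivalence no longer holds and the dummy estimates a genuine group effect.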






















          • 1




            Thanks for your answer. I have made some edits to my question; does your response still hold? It seems it does, I just wanted to confirm with you.
            – curiousmind
            2 hours ago






          • 2




            Yes, it does. I'll leave the caveat in there so that the answer is more widely applicable than just to the case where you have several observations in the "rare" category.
            – jbowman
            2 hours ago






          • 1




            Thank you. This answer is helpful.
            – curiousmind
            1 hour ago








































          answered 2 hours ago









          jbowman
