Timestamps in Ridge Regression Scikit Learn

I am trying to transform data for use in regression, most likely the Ridge or Lasso technique implemented in sklearn.linear_model.



My training data contains timestamps, which I believe may have predictive power. The timestamps reflect the time at which a user placed an order for pizza. The field containing the target data (labels) is elapsed_time, which is expressed in seconds. Here is an example:



import pandas as pd
import sklearn.linear_model as linear_model

# Three example orders; elapsed_time (seconds) is the regression target
delivery_data = {
    'order_time': ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18'],
    'price': [34.54, 8.63, 21.24],
    'miles': [6, 3, 7],
    'home_type': ['apartment', 'house', 'apartment'],
    'elapsed_time': [2023, 1610, 1918]
}

df = pd.DataFrame(delivery_data)
# Parse the timestamp strings into datetime64 values
df['order_time'] = pd.to_datetime(df['order_time'])


The resulting DataFrame looks like this:



            order_time  price  miles  home_type  elapsed_time
0  2018-09-12 21:43:08  34.54      6  apartment          2023
1  2018-09-13 06:33:04   8.63      3      house          1610
2  2018-09-13 09:12:18  21.24      7  apartment          1918


I am trying to predict the time to deliver pizza (elapsed_time) given timestamp, quantitative, and categorical data.



I suspect that time of day is predictive but that date is less predictive.



So far, I am considering extracting only the hour from the timestamp. In this example, order_time would become [21, 6, 9]. My first concern is that 23:59 has an hour of 23 and 00:01 has an hour of 0. The two values are far apart numerically, even though the order times are only two minutes apart.
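A minimal sketch of this concern, using only the standard library. The helper name hour_of is made up for illustration, and the two timestamps on either side of midnight are hypothetical rather than taken from the dataset:

```python
from datetime import datetime

def hour_of(ts):
    """Return the hour (0-23) of a 'YYYY-MM-DD HH:MM:SS' timestamp string."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour

order_times = ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18']
hours = [hour_of(ts) for ts in order_times]  # [21, 6, 9]

# Hypothetical orders placed two minutes apart, across midnight
late = hour_of('2018-09-12 23:59:00')
early = hour_of('2018-09-13 00:01:00')
gap = abs(late - early)  # 23, although the orders are only two minutes apart
```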



Is there a better way to transform this datetime data?



Does it make a difference that the dataset contains other quantitative data (price, miles) and categorical data (home_type)?










      regression predictive-models scikit-learn feature-construction circular-statistics






      asked 5 hours ago by Jacob Quisenberry
      edited 5 hours ago by whuber♦

          2 Answers
          There's no need to round time to the nearest hour, as it's a continuous variable and rounding just discards information. If the store is only open for a certain period during the day, then you can express time as a fraction of this interval (e.g. 0=opening time, 1=closing time, 0.5=halfway through).
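As a rough sketch of the fraction-of-interval idea, assuming the store opens and closes on the same calendar day. The opening hours below (11:00 to 23:00) and the function names are hypothetical, since the question does not state them:

```python
from datetime import datetime, time

# Hypothetical opening hours; not given in the question
OPENING = time(11, 0)
CLOSING = time(23, 0)

def seconds_since_midnight(t):
    """Convert a datetime.time to seconds elapsed since 00:00:00."""
    return t.hour * 3600 + t.minute * 60 + t.second

def fraction_of_open_interval(ts, opening=OPENING, closing=CLOSING):
    """Map a timestamp to 0 at opening time and 1 at closing time."""
    order = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").time()
    start = seconds_since_midnight(opening)
    end = seconds_since_midnight(closing)
    return (seconds_since_midnight(order) - start) / (end - start)

halfway = fraction_of_open_interval('2018-09-12 17:00:00')  # 0.5
```

Timestamps outside the assumed opening hours would map below 0 or above 1, so the real hours would have to be substituted before using this.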



          If the store is open 24 hours, then things are more complicated because time is a circular variable (e.g. 23:59 and 00:01 are only two minutes apart, as you mentioned). In this case, one option is to transform time into two features that properly preserve the relative distance between timepoints. Suppose $t$ is the time in hours, and can take fractional values (e.g. 21.5 corresponds to 21:30). Then, let new features $t_x$ and $t_y$ be the Cartesian coordinates after mapping time onto the unit circle:



          $$t_x = \cos\left( \frac{\pi}{12} t \right), \quad
          t_y = \sin\left( \frac{\pi}{12} t \right)$$
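The mapping above could be sketched as follows, using only the standard library. The function names and the midnight and noon timestamps are illustrative assumptions, not part of the original post:

```python
import math
from datetime import datetime

def hours_since_midnight(ts):
    """Return the time of day as fractional hours, e.g. 21:30 -> 21.5."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return dt.hour + dt.minute / 60 + dt.second / 3600

def encode_time_of_day(ts):
    """Return (t_x, t_y), the point on the unit circle for this time of day."""
    angle = math.pi / 12 * hours_since_midnight(ts)  # 24 hours -> 2*pi
    return math.cos(angle), math.sin(angle)

def distance(a, b):
    """Euclidean distance between two encoded time points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Hypothetical timestamps chosen to show the midnight boundary
just_before = encode_time_of_day('2018-09-12 23:59:00')
just_after = encode_time_of_day('2018-09-13 00:01:00')
noon = encode_time_of_day('2018-09-13 12:00:00')

near = distance(just_before, just_after)  # small: two minutes apart
far = distance(just_before, noon)         # close to 2: nearly opposite on the circle
```

With a pandas column, the same value of $t$ could be obtained from the .dt.hour, .dt.minute, and .dt.second accessors. In a linear model such as Ridge, the two features enter as a weighted sum of a cosine and a sine, which amounts to fitting a single 24-hour sinusoid with a free amplitude and phase.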







          answered 3 hours ago by user20160

          • I think this solution is promising because it handles the problem of the midnight boundary. I am also glad it allows me to treat time as continuous, rather than a set of categories of arbitrary size.
            – Jacob Quisenberry
            1 hour ago












          From order_time you could also extract a categorical day-of-week variable or a binary workday indicator, assuming that traffic is heavier on workdays.
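A small sketch of these two features, using the standard library. The function name is hypothetical, and the workday flag here simply means Monday to Friday, ignoring public holidays:

```python
from datetime import datetime

def day_features(ts):
    """Return (day_of_week, is_workday) for a timestamp string."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    day_of_week = dt.weekday()          # Monday = 0 ... Sunday = 6
    is_workday = int(day_of_week < 5)   # assumption: holidays are not considered
    return day_of_week, is_workday

first_order = day_features('2018-09-12 21:43:08')  # a Wednesday
```

Since day of week is categorical, it would normally be one-hot encoded before being passed to a linear model.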



          If you want to use the hour, you need to transform it into a categorical variable using one-hot encoding. Instead of taking just the hour, you could also map the timestamp to a finer time slot by splitting every day into $n$ chunks. For example, $10$-minute intervals give $144$ time slots per day, as in this example: http://radiostud.io/beat-rush-hour-traffic-with-tensorflow-machine-learning/
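The slot idea could look roughly like this. The helper names are made up for illustration, and the hand-rolled one-hot vector stands in for whatever encoder is actually used:

```python
from datetime import datetime

MINUTES_PER_DAY = 24 * 60

def time_slot(ts, slot_minutes=10):
    """Return the index of the slot of the day that contains this timestamp."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return (dt.hour * 60 + dt.minute) // slot_minutes

def one_hot(index, n_slots):
    """Return a list of n_slots zeros with a single 1 at position index."""
    vec = [0] * n_slots
    vec[index] = 1
    return vec

n_slots = MINUTES_PER_DAY // 10            # 144 slots of 10 minutes
slot = time_slot('2018-09-12 21:43:08')    # 130
features = one_hot(slot, n_slots)
```

Note that 23:59 falls in slot 143 and 00:01 in slot 0, so the one-hot columns carry no notion that these two slots are adjacent.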



          On the other hand, you could create a broader categorical variable, such as part of day, with values like morning, noon, evening, and night.
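A possible sketch of the part-of-day feature. The hour boundaries below are arbitrary assumptions chosen for illustration; the answer does not specify them:

```python
from datetime import datetime

def part_of_day(ts):
    """Map a timestamp to a coarse part-of-day label (boundaries are hypothetical)."""
    hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour
    if 5 <= hour < 11:
        return 'morning'
    if 11 <= hour < 15:
        return 'noon'
    if 15 <= hour < 22:
        return 'evening'
    return 'night'  # 22:00 to 04:59 wraps around midnight

order_times = ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18']
labels = [part_of_day(ts) for ts in order_times]  # ['evening', 'morning', 'morning']
```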







          answered 3 hours ago by hellpanderrr

          • There are problems with making time of day a categorical variable: with too many categories, you might lose many degrees of freedom (creating a non-parsimonious model); with too few, you might lose too much precision. Categorizing it also wipes out all information about its circular nature, which is the issue in question. Often there are better solutions.
            – whuber♦
            3 hours ago










          • +1 for pointing out that day of week may be more useful than I had considered, and for mentioning one-hot encoding as one option while linking to an article that handles categories by assigning an integer to each category.
            – Jacob Quisenberry
            1 hour ago















