Timestamps in Ridge Regression Scikit Learn
I am trying to transform data for use in regression, most likely the Ridge or Lasso technique implemented in sklearn.linear_model.
My training data contains timestamps, which I believe may have predictive power. The timestamps reflect the time that a user placed an order for pizza. The field containing the target data / labels is elapsed_time, which is expressed in seconds. Here is an example:
import pandas as pd
import sklearn.linear_model as linear_model

# Sample orders; elapsed_time (in seconds) is the regression target.
delivery_data = {
    'order_time': ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18'],
    'price': [34.54, 8.63, 21.24],
    'miles': [6, 3, 7],
    'home_type': ['apartment', 'house', 'apartment'],
    'elapsed_time': [2023, 1610, 1918],
}
df = pd.DataFrame(delivery_data)
# Parse the timestamp strings into datetime64 values.
df['order_time'] = pd.to_datetime(df['order_time'])
The resulting DataFrame looks like this:
            order_time  price  miles  home_type  elapsed_time
0  2018-09-12 21:43:08  34.54      6  apartment          2023
1  2018-09-13 06:33:04   8.63      3      house          1610
2  2018-09-13 09:12:18  21.24      7  apartment          1918
I am trying to predict the time to deliver pizza (elapsed_time) given timestamp, quantitative, and categorical data.
I suspect that time of day is predictive but that date is less predictive.
So far, I am considering extracting only the hour from the timestamp. In this example, order_time would become [21, 6, 9]. My first concern is that 23:59 has an hour of 23 and 00:01 has an hour of 0. The two values are far apart, even though the order times are two minutes apart.
Is there a better way to transform this datetime data?
Does it make a difference that the dataset contains other quantitative data (price, miles) and categorical data (home_type)?
regression predictive-models scikit-learn feature-construction circular-statistics
edited 5 hours ago
whuber♦
asked 5 hours ago
Jacob Quisenberry
2 Answers
accepted answer, 3 votes
There's no need to round time to the nearest hour, as it's a continuous variable and rounding just discards information. If the store is only open for a certain period during the day, then you can express time as a fraction of this interval (e.g. 0=opening time, 1=closing time, 0.5=halfway through).
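As a rough sketch of the opening-hours scaling, assuming a hypothetical store that opens at 10:00 and closes at 22:00 (the question gives no opening hours, and `fraction_of_open_interval` is just an illustrative helper name):

```python
import pandas as pd

# Hypothetical opening hours; the question does not state them.
OPEN_HOUR = 10.0
CLOSE_HOUR = 22.0


def fraction_of_open_interval(order_time: pd.Series) -> pd.Series:
    """Map each timestamp to 0 at opening time and 1 at closing time."""
    # Time of day in fractional hours, e.g. 21:30 -> 21.5.
    hours = (order_time.dt.hour
             + order_time.dt.minute / 60.0
             + order_time.dt.second / 3600.0)
    return (hours - OPEN_HOUR) / (CLOSE_HOUR - OPEN_HOUR)


# Made-up order times chosen to hit the start, middle, and near the end of the day.
order_time = pd.to_datetime(pd.Series([
    '2018-09-12 10:00:00',
    '2018-09-12 16:00:00',
    '2018-09-12 21:30:00',
]))
open_fraction = fraction_of_open_interval(order_time)
```

This only makes sense if no orders arrive outside the opening interval; anything earlier or later would fall below 0 or above 1.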
If the store is open 24 hours, then things are more complicated because time is a circular variable (e.g. 23:59 and 00:01 are only two minutes apart, as you mentioned). In this case, one option is to transform time into two features that properly preserve the relative distance between timepoints. Suppose $t$ is the time in hours, and can take fractional values (e.g. 21.5 corresponds to 21:30). Then, let new features $t_x$ and $t_y$ be the Cartesian coordinates after mapping time onto the unit circle:
$$t_x = \cos\left(\frac{\pi}{12} t\right), \quad
t_y = \sin\left(\frac{\pi}{12} t\right)$$
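The mapping onto the unit circle could be computed along these lines. This is a minimal sketch: `encode_time_of_day` is a hypothetical helper name, and the three timestamps are made up to show the midnight boundary.

```python
import numpy as np
import pandas as pd


def encode_time_of_day(order_time: pd.Series) -> pd.DataFrame:
    """Return t_x = cos(pi * t / 12) and t_y = sin(pi * t / 12), with t in fractional hours."""
    t = (order_time.dt.hour
         + order_time.dt.minute / 60.0
         + order_time.dt.second / 3600.0)
    angle = np.pi * t / 12.0  # a full day (24 hours) maps to one turn of 2*pi
    return pd.DataFrame({'t_x': np.cos(angle), 't_y': np.sin(angle)},
                        index=order_time.index)


times = pd.to_datetime(pd.Series([
    '2018-09-12 23:59:00',
    '2018-09-13 00:01:00',
    '2018-09-13 12:00:00',
]))
features = encode_time_of_day(times)

# Euclidean distance between encoded points.
xy = features.to_numpy()
d_midnight = np.linalg.norm(xy[0] - xy[1])  # 23:59 vs 00:01: very small
d_noon = np.linalg.norm(xy[1] - xy[2])      # 00:01 vs 12:00: close to the diameter, 2
```

The two columns can then be placed alongside price and miles in the feature matrix. Note that a linear model on t_x and t_y alone can only represent a single sinusoid with a 24-hour period, so it may be too rigid if delivery times have several peaks during the day.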
answered 3 hours ago
user20160
I think this solution is promising because it handles the problem of the midnight boundary. I am also glad it allows me to treat time as continuous, rather than a set of categories of arbitrary size.
– Jacob Quisenberry, 1 hour ago
1 vote
From order_time you could also extract a categorical variable, day of week, or a binary variable, workday, assuming that traffic is heavier on workdays.
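For instance, with pandas the two calendar features might be derived as follows. This is only a sketch: "workday" is assumed here to mean Monday to Friday, public holidays are ignored, and the third timestamp is moved to a Saturday (unlike the question's sample) so that both values of workday appear.

```python
import pandas as pd

order_time = pd.to_datetime(pd.Series([
    '2018-09-12 21:43:08',  # Wednesday
    '2018-09-13 06:33:04',  # Thursday
    '2018-09-15 09:12:18',  # Saturday (date altered from the question's sample)
]))

calendar = pd.DataFrame({
    # pandas convention: Monday=0 ... Sunday=6
    'day_of_week': order_time.dt.dayofweek,
    # Assumption: workday means Monday to Friday; holidays are not handled.
    'workday': (order_time.dt.dayofweek < 5).astype(int),
})

# One-hot encode day of week so a linear model does not treat it as an ordered number.
day_dummies = pd.get_dummies(calendar['day_of_week'], prefix='dow')
```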
If you want to use the hour as a categorical variable, you need to one-hot encode it. Instead of just taking the hour, you could map the timestamp to a more precise time slot by splitting every day into $n$ chunks; e.g. $10$-minute intervals give $144$ time slots per day, as in this example: http://radiostud.io/beat-rush-hour-traffic-with-tensorflow-machine-learning/
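A sketch of the 10-minute slot idea with one-hot encoding, using the question's sample timestamps. The names `SLOT_MINUTES` and `slot_number` are illustrative, not taken from the linked article.

```python
import pandas as pd

SLOT_MINUTES = 10                          # slot width in minutes
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES    # 144 slots for 10-minute intervals

order_time = pd.to_datetime(pd.Series([
    '2018-09-12 21:43:08',
    '2018-09-13 06:33:04',
    '2018-09-13 09:12:18',
]))

# Index of the slot each order falls into, counted from midnight.
minutes_since_midnight = order_time.dt.hour * 60 + order_time.dt.minute
slot_number = minutes_since_midnight // SLOT_MINUTES

# Fix the category set so all 144 columns exist even for slots with no orders.
slot_cat = pd.Series(pd.Categorical(slot_number,
                                    categories=list(range(SLOTS_PER_DAY))))
slot_dummies = pd.get_dummies(slot_cat, prefix='slot')
```

With only a few orders per slot, most of the 144 coefficients will be poorly determined, so the regularization strength in Ridge or Lasso matters a lot here.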
Alternatively, you could create a broader categorical variable, such as part of day, with values such as morning, noon, evening, and night.
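A possible way to build such a part-of-day variable is sketched below. The hour boundaries are arbitrary assumptions made for illustration, and `part_of_day` is a hypothetical helper.

```python
import pandas as pd


def part_of_day(hour: int) -> str:
    """Bucket an hour (0-23) into a coarse label; the boundaries are assumptions."""
    if 6 <= hour < 11:
        return 'morning'
    if 11 <= hour < 15:
        return 'noon'
    if 15 <= hour < 21:
        return 'evening'
    return 'night'  # 21:00 to 05:59, wrapping around midnight


order_time = pd.to_datetime(pd.Series([
    '2018-09-12 21:43:08',
    '2018-09-13 06:33:04',
    '2018-09-13 09:12:18',
]))

part = order_time.dt.hour.map(part_of_day)
# One-hot encode the labels for use in a linear model.
part_dummies = pd.get_dummies(part, prefix='part')
```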
answered 3 hours ago
hellpanderrr
There are problems with making time of day a categorical variable: with too many categories, you might lose many degrees of freedom (creating a non-parsimonious model); with too few, you might lose too much precision. Categorizing it also wipes out all information about its circular nature, which is the issue in question. Often there are better solutions.
– whuber♦, 3 hours ago
+1 for pointing out that day of week may be more useful than I had considered, and for mentioning one-hot encoding as one option while linking to an article that handles categories by assigning an integer to each category.
– Jacob Quisenberry, 1 hour ago