Timestamps in Ridge Regression Scikit Learn

I am trying to transform data for use in regression, most likely the Ridge or Lasso technique implemented in sklearn.linear_model.

My training data contains timestamps, which I believe may have predictive power. The timestamps reflect the time that a user placed an order for pizza.

The field containing the target data / labels is elapsed_time, which is expressed in seconds. Here is an example:

import pandas as pd
import sklearn.linear_model as linear_model

delivery_data = {
    'order_time': ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18'],
    'price': [34.54, 8.63, 21.24],
    'miles': [6, 3, 7],
    'home_type': ['apartment', 'house', 'apartment'],
    'elapsed_time': [2023, 1610, 1918]
}

df = pd.DataFrame(delivery_data)
df['order_time'] = pd.to_datetime(df['order_time'])

The resulting DataFrame looks like this:

            order_time  price  miles  home_type  elapsed_time
0  2018-09-12 21:43:08  34.54      6  apartment          2023
1  2018-09-13 06:33:04   8.63      3      house          1610
2  2018-09-13 09:12:18  21.24      7  apartment          1918

I am trying to predict the time to deliver pizza (elapsed_time) given timestamp, quantitative, and categorical data.

I suspect that time of day is predictive but that date is less predictive.

So far, I am considering extracting only the hour from the time stamp. In this example, order_time would become [21, 6, 9]. My first concern is that 23:59 has an hour of 23 and 00:01 has an hour of 0. The two values are far apart, even though the order times are two minutes apart.

Is there a better way to transform this datetime data?

Does it make a difference that the dataset contains other quantitative data (price, miles) and categorical data (home_type)?

asked 5 hours ago by Jacob Quisenberry (new contributor); edited 5 hours ago by whuber♦

regression predictive-models scikit-learn feature-construction circular-statistics

2 Answers

Accepted answer:

There's no need to round time to the nearest hour, as it's a continuous variable and rounding just discards information. If the store is only open for a certain period during the day, then you can express time as a fraction of this interval (e.g. 0=opening time, 1=closing time, 0.5=halfway through).
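
As a minimal sketch of the fixed-opening-hours case, the snippet below rescales time of day so that opening time maps to 0 and closing time maps to 1. The opening hours (06:00 to 22:00) and the helper name `fraction_of_open_interval` are assumptions made for illustration; the question does not state when the store is open.

```python
import pandas as pd

# Hypothetical opening hours; the original post does not state them.
OPEN_HOUR = 6.0    # store opens at 06:00
CLOSE_HOUR = 22.0  # store closes at 22:00


def fraction_of_open_interval(order_time: pd.Series) -> pd.Series:
    """Map each timestamp to 0 at opening time and 1 at closing time."""
    # Time of day in hours, keeping minutes and seconds as a fraction.
    hours = (order_time.dt.hour
             + order_time.dt.minute / 60.0
             + order_time.dt.second / 3600.0)
    return (hours - OPEN_HOUR) / (CLOSE_HOUR - OPEN_HOUR)


order_time = pd.to_datetime(pd.Series(
    ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18']))
open_fraction = fraction_of_open_interval(order_time)
```

Orders placed outside the assumed opening hours would fall outside the range 0 to 1, so real data may need to be checked for such rows first.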

If the store is open 24 hours, then things are more complicated because time is a circular variable (e.g. 23:59 and 00:01 are only two minutes apart, as you mentioned). In this case, one option is to transform time into two features that properly preserve the relative distance between timepoints. Suppose $t$ is the time in hours, and can take fractional values (e.g. 21.5 corresponds to 21:30). Then, let new features $t_x$ and $t_y$ be the Cartesian coordinates after mapping time onto the unit circle:

$$t_x = \cos \left( \frac{\pi}{12} t \right), \quad
t_y = \sin \left( \frac{\pi}{12} t \right)$$
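
The mapping onto the unit circle could be computed with pandas and NumPy roughly as follows. This is only a sketch: the helper name `encode_time_of_day` and the two probe timestamps around midnight are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd


def encode_time_of_day(order_time: pd.Series) -> pd.DataFrame:
    """Return t_x = cos(pi/12 * t) and t_y = sin(pi/12 * t), with t in hours."""
    # Fractional hour of the day, e.g. 21.5 for 21:30.
    t = (order_time.dt.hour
         + order_time.dt.minute / 60.0
         + order_time.dt.second / 3600.0)
    angle = np.pi / 12.0 * t  # one full turn of the circle per 24 hours
    return pd.DataFrame({'t_x': np.cos(angle), 't_y': np.sin(angle)},
                        index=order_time.index)


# Two orders placed two minutes apart, on either side of midnight.
around_midnight = pd.to_datetime(pd.Series(
    ['2018-09-12 23:59:00', '2018-09-13 00:01:00']))
features = encode_time_of_day(around_midnight)

# Euclidean distance between the two encoded points. It is small because
# the points sit next to each other on the unit circle.
distance = np.linalg.norm(
    (features.iloc[0] - features.iloc[1]).to_numpy())
```

The two columns can then be added to the feature matrix alongside price and miles in place of the raw timestamp.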

answered 3 hours ago by user20160

I think this solution is promising because it handles the problem of the midnight boundary. I am also glad it allows me to treat time as continuous, rather than a set of categories of arbitrary size.
– Jacob Quisenberry, 1 hour ago

From order_time you could also extract a categorical day-of-week variable or a binary workday indicator, assuming that traffic is heavier on workdays.
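
One possible way to derive both features with pandas is sketched below. Treating Monday to Friday as workdays (and ignoring public holidays) is an assumption, as are the variable names and the `dow` column prefix.

```python
import pandas as pd

order_time = pd.to_datetime(pd.Series(
    ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18']))

# pandas numbers the days Monday=0 ... Sunday=6.
day_of_week = order_time.dt.dayofweek

# Assumption: "workday" means Monday to Friday; public holidays are ignored.
is_workday = (day_of_week < 5).astype(int)

# One-hot encode the day of week. Fixing the categories to all seven days
# keeps the columns identical no matter which days occur in the sample.
dow_dtype = pd.CategoricalDtype(categories=list(range(7)))
day_dummies = pd.get_dummies(day_of_week.astype(dow_dtype), prefix='dow')
```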

If you want to use the hour, you need to transform it into a categorical variable using one-hot encoding. Instead of taking just the hour, though, you could map the timestamp to a more precise time slot by splitting every day into $n$ chunks; e.g., $10$-minute intervals give $144$ time slots per day, as in this example: http://radiostud.io/beat-rush-hour-traffic-with-tensorflow-machine-learning/
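
A rough sketch of the 10-minute slot idea with pandas follows. The constant and variable names are illustrative, and this is not taken from the linked article.

```python
import pandas as pd

SLOT_MINUTES = 10                        # width of one time slot
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 144 slots for 10-minute intervals

order_time = pd.to_datetime(pd.Series(
    ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18']))

# Index of the slot each order falls into, counted from midnight.
minutes_since_midnight = order_time.dt.hour * 60 + order_time.dt.minute
time_slot = minutes_since_midnight // SLOT_MINUTES

# One-hot encode with a fixed set of categories so that every slot gets a
# column, including slots that never occur in the sample.
slot_dtype = pd.CategoricalDtype(categories=list(range(SLOTS_PER_DAY)))
slot_dummies = pd.get_dummies(time_slot.astype(slot_dtype), prefix='slot')
```

With only a few orders per slot, most of the 144 columns carry very little data, which is worth keeping in mind when choosing the slot width.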

On the other hand, you could create a broader categorical variable, such as part of day, with values such as morning, noon, evening, and night.
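
A part-of-day feature might be built as in the sketch below. The hour boundaries are arbitrary assumptions chosen only for illustration, and `part_of_day` is a hypothetical helper; sensible cut points depend on the data.

```python
import pandas as pd

PARTS = ['morning', 'noon', 'evening', 'night']


def part_of_day(hour: int) -> str:
    """Assign an hour (0-23) to a coarse part of day (assumed boundaries)."""
    if 6 <= hour < 11:
        return 'morning'
    if 11 <= hour < 14:
        return 'noon'
    if 14 <= hour < 22:
        return 'evening'
    return 'night'  # 22:00 to 05:59, wrapping around midnight


order_time = pd.to_datetime(pd.Series(
    ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18']))
part = order_time.dt.hour.map(part_of_day)

# One-hot encode with all four categories fixed up front.
part_dtype = pd.CategoricalDtype(categories=PARTS)
part_dummies = pd.get_dummies(part.astype(part_dtype), prefix='part')
```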

answered 3 hours ago by hellpanderrr

There are problems with making time of day a categorical variable: with too many categories, you might lose many degrees of freedom (creating a non-parsimonious model); with too few, you might lose too much precision. Categorizing it also wipes out all information about its circular nature, which is the issue in question. Often there are better solutions.
– whuber♦, 3 hours ago

+1 for pointing out that day of week may be more useful than I had considered, and for mentioning one-hot encoding as one option while linking to an article that handles categories by assigning an integer to each category.
– Jacob Quisenberry, 1 hour ago
