I'm pretty new to time series.
This is the dataset I'm working on:
Date Price Location
0 2012-01-01 1771.0 Marche
1 2012-01-01 1039.0 Calabria
2 2012-01-01 2193.0 Campania
3 2012-01-01 2015.0 Emilia-Romagna
4 2012-01-01 1483.0 Friuli-Venezia Giulia
... ... ... ...
2475 2022-04-01 1963.0 Lazio
2476 2022-04-01 1362.0 Friuli-Venezia Giulia
2477 2022-04-01 1674.0 Emilia-Romagna
2478 2022-04-01 1388.0 Marche
2479 2022-04-01 1103.0 Abruzzo
I'm trying to build an LSTM for price prediction, but I don't know how to handle the Location categorical feature: should I use one-hot encoding or a groupby?
What I want to predict is the price based on the location.
How can I achieve that? A Python solution is particularly appreciated.
Thanks in advance.
Suppose my dataset (df) is analogous to yours:
Date Price Location
0 2021-01-01 791.076890 Campania
1 2021-01-01 705.702464 Lombardia
2 2021-01-01 719.991382 Sicilia
3 2021-02-01 825.760917 Lombardia
4 2021-02-01 747.734309 Sicilia
... ... ... ...
31 2021-11-01 886.874348 Lombardia
32 2021-11-01 935.040583 Campania
33 2021-12-01 771.165378 Sicilia
34 2021-12-01 952.255227 Campania
35 2021-12-01 939.754515 Lombardia
In my case I have a Price record for 3 regions (Campania, Lombardia, Sicilia) every month. My idea is to treat the different regions as different features, so I would transform df like this:
df = df.set_index(["Date", "Location"]).Price.unstack()
Now my dataset is like:
Location Campania Lombardia Sicilia
Date
2021-01-01 791.076890 705.702464 719.991382
2021-02-01 758.872755 825.760917 747.734309
2021-03-01 880.038005 803.165998 837.738419
... ... ... ...
2021-10-01 908.402345 805.081193 792.369610
2021-11-01 935.040583 886.874348 736.862025
2021-12-01 952.255227 939.754515 771.165378
After this, make sure there are no NaN values (df.isna().sum()).
Now you can pass this data to a multi-feature RNN (or LSTM), as done in this example, or to a multi-channel 1D-CNN (choosing an appropriate kernel size). The only problem in both cases could be the small size of the dataset, so try not to over-parameterize the model (for example, reduce the number of neurons and layers), otherwise over-fitting will be unavoidable. To check for this, you can test the model on the last 20% of your time series:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, shuffle=False, test_size=.2)
The last part is to build matching (X, Y) pairs for supervised learning, but this depends on what model you are using and what your prediction task is. Another example here.
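For instance, a minimal sketch of that (X, Y) construction plus a deliberately small LSTM, assuming Keras is available; the window length and layer sizes here are arbitrary choices, not requirements:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(values, window=6):
    # values: array of shape (time, n_regions); each X is `window` consecutive
    # months, each Y is the following month (one value per region).
    X, Y = [], []
    for i in range(len(values) - window):
        X.append(values[i:i + window])
        Y.append(values[i + window])
    return np.array(X), np.array(Y)

X_train, Y_train = make_windows(df_train.values)
X_test, Y_test = make_windows(df_test.values)

model = Sequential([
    LSTM(16, input_shape=X_train.shape[1:]),  # small layer to limit over-fitting
    Dense(X_train.shape[2]),                  # one output per region
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, Y_train, epochs=100, verbose=0)
print(model.evaluate(X_test, Y_test, verbose=0))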
Related
I'm trying to convert a data frame with the following contents:
      Date        Change
1802  2017-09-14  -1.14%
462   2021-05-16  NaN
935   2020-01-29  0.04%
713   2020-09-07  2.39%
1471  2018-08-11  NaN
[1460 rows × 2 columns]
Into this:
TimeSeries (DataArray) (Month: 144, component: 1, sample: 1)
array([[[112.]],
       [[118.]],
       [[132.]],
       [[129.]],
       [[121.]],
       [[135.]],
       [[148.]],
       [[148.]],
       [[136.]],
       ...
Coordinates:
  * Month      (Month)     datetime64[ns] 2019-01-01 ... 2021-12-01
    component  (component) object 'Change'
Attributes:
    static_covariates: None
    hierarchy:         None
I need this format in order to run a neural network model on multiple time series.
Any help or advice is greatly appreciated!
The solution required removing the '%' sign from the column values and then converting the column to a float:
ftse_change['Change'] = ftse_change['Change'].str.rstrip('%').astype('float') / 100.0
did the trick.
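If the target format is a darts TimeSeries, as the repr above suggests, the cleaned column can then be wrapped; a hedged sketch, assuming the frame is named ftse_change and a monthly aggregation is wanted:

import pandas as pd
from darts import TimeSeries

ftse_change["Date"] = pd.to_datetime(ftse_change["Date"])
# One value per month, since the target series has a monthly 'Month' coordinate.
monthly = ftse_change.set_index("Date")["Change"].resample("MS").mean().to_frame()
series = TimeSeries.from_dataframe(monthly, value_cols="Change")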
I am trying to calculate the regression coefficient of weight for every animal_id and cycle_nr in my df:
animal_id  cycle_nr  feed_date   weight
1003       8         2020-02-06  221
1003       8         2020-02-10  226
1003       8         2020-02-14  230
1004       1         2020-02-20  231
1004       1         2020-02-21  243
What I tried, using this source:
import pandas as pd
import statsmodels.api as sm
def GroupRegress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'feed_date', ['weight'])
This code fails because my variable includes a date.
What I tried next:
I figured I could create a numeric column to use instead of my date column. I created a simple count_id column:
animal_id  cycle_nr  feed_date   weight  id
1003       8         2020-02-06  221     1
1003       8         2020-02-10  226     2
1003       8         2020-02-14  230     3
1004       1         2020-02-20  231     4
1004       1         2020-02-21  243     5
Then I ran my regression on this column
result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'id', ['weight'])
The slope calculation looks good, but the intercept of course makes no sense.
Then I realized that this method is only usable when the interval between measurements is regular. In most cases the interval is 7 days, but sometimes it is 10, 14 or 21 days.
I dropped records where the interval was not 7 days and re-ran my regression. It works, but I hate that I have to throw away perfectly fine data.
I'm wondering if there is a better approach where I can either include the date in my regression or can correct for the varying intervals of my dates. Any suggestions?
I'm wondering if there is a better approach where I can either include the date in my regression or can correct for the varying intervals of my dates.
If the feed dates are strings, make a datetime Series using pandas.to_datetime.
Use that new Series to calculate the actual time difference between feedings.
Use the resultant Timedeltas in your regression instead of a fabricated linear sequence. The Timedeltas have different attributes (e.g. microseconds, days) that can be used depending on the resolution you need.
My first instinct would be to produce the Timedeltas for each group separately; the first feeding in each group would of course be time zero (a sketch of this follows the example below).
Making the Timedeltas may not even be necessary - there are probably datetime-aware regression methods in NumPy or SciPy or maybe even Pandas; it is a common enough application.
Instead of Timedeltas, the datetime Series could be converted to ordinal values for use in the regression.
df = pd.DataFrame(
    {
        "feed_date": [
            "2020-02-06",
            "2020-02-10",
            "2020-02-14",
            "2020-02-20",
            "2020-02-21",
        ]
    }
)
>>> q = pd.to_datetime(df.feed_date)
>>> q
0 2020-02-06
1 2020-02-10
2 2020-02-14
3 2020-02-20
4 2020-02-21
Name: feed_date, dtype: datetime64[ns]
>>> q.apply(pd.Timestamp.toordinal)
0 737461
1 737465
2 737469
3 737475
4 737476
Name: feed_date, dtype: int64
>>>
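A rough sketch of the per-group idea (assuming the goal is the slope of weight over elapsed days, i.e. weight as the response and days since the first feeding as the predictor):

import pandas as pd
import statsmodels.api as sm

def slope_per_group(g):
    # Days elapsed since the first feeding in this group (time zero).
    days = (g["feed_date"] - g["feed_date"].min()).dt.days
    X = sm.add_constant(days.rename("days"))
    return sm.OLS(g["weight"], X).fit().params  # 'const' and 'days' (the slope)

df["feed_date"] = pd.to_datetime(df["feed_date"])
result = df.groupby(["animal_id", "cycle_nr"]).apply(slope_per_group)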
I have 2 tables.
Table A has 105 rows:
bbgid dt weekly_price_per_stock weekly_pct_change
0 BBG000J9HHN8 2018-12-31 13562.328 0.000000
1 BBG000J9HHN8 2019-01-07 34717.536 1.559851
2 BBG000J9HHN8 2019-01-14 28300.218 -0.184844
3 BBG000J9HHN8 2019-01-21 35370.134 0.249818
4 BBG000J9HHN8 2019-01-28 36104.512 0.020763
... ... ... ... ...
100 BBG000J9HHN8 2020-11-30 62065.827 0.278765
101 BBG000J9HHN8 2020-12-07 62145.445 0.001283
102 BBG000J9HHN8 2020-12-14 63516.146 0.022056
103 BBG000J9HHN8 2020-12-21 51283.187 -0.192596
104 BBG000J9HHN8 2020-12-28 51306.951 0.000463
Table B has 257970 rows:
bbgid dt weekly_price_per_stock weekly_pct_change
0 BBG000B9WJ55 2018-12-31 34.612737 0.000000
1 BBG000B9WJ55 2019-01-07 70.618471 1.040245
2 BBG000B9WJ55 2019-01-14 89.123337 0.262040
3 BBG000B9WJ55 2019-01-21 90.377643 0.014074
4 BBG000B9WJ55 2019-01-28 90.527678 0.001660
... ... ... ... ...
257965 BBG00YFR2NJ6 2020-12-21 30.825000 -0.251275
257966 BBG00YFR2NJ6 2020-12-28 40.960000 0.328792
257967 BBG00YM46B38 2020-12-14 0.155900 -0.996194
257968 BBG00YM46B38 2020-12-21 0.372860 1.391661
257969 BBG00YM46B38 2020-12-28 0.535650 0.436598
In table A there is only one group of stocks (CCPM), but in table B I have many different stock groups. I want to run a linear regression of table B's pct_change against table A's (CCPM) pct_change, so I can see how the stocks in table B move with respect to the CCPM stocks over the period covered by the dt column. The problem is that I only have 105 rows in table A, and when I group table B by bbgid I always get more rows, so I get an error saying that X and y must be the same size.
Both tables have already been grouped by week and their pct_change has been calculated weekly. I need to compare the pct_change variations from table B with those in table A by date, one table B group at a time against the CCPM stocks' pct_change.
I would like to extract the slope from each regression, store it in a column in the same table, and associate it with its corresponding group.
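To make the goal concrete, this is roughly the computation I have in mind (table_a and table_b are placeholder names for my two tables; the merge on dt and the regression direction are my assumptions):

import pandas as pd
from scipy.stats import linregress

# Align every stock in table B with the CCPM series week by week.
merged = table_b.merge(
    table_a[["dt", "weekly_pct_change"]], on="dt", suffixes=("", "_ccpm")
)

# One regression per stock: B's weekly_pct_change against CCPM's weekly_pct_change.
slopes = (
    merged.groupby("bbgid")
    .apply(lambda g: linregress(g["weekly_pct_change_ccpm"], g["weekly_pct_change"]).slope)
    .rename("slope")
    .reset_index()
)
table_b = table_b.merge(slopes, on="bbgid", how="left")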
I have tried the solutions in this post and this post without success.
Is there any workaround for this, or am I doing something wrong? Please help me fix this.
Thank you very much in advance.
Typically when we have a data frame we split it into train and test. For example, imagine my data frame is something like this:
> df.head()
Date y wind temperature
1 2019-10-03 00:00:00 33 12 15
2 2019-10-03 01:00:00 10 5 6
3 2019-10-03 02:00:00 39 6 5
4 2019-10-03 03:00:00 60 13 4
5 2019-10-03 04:00:00 21 3 7
I want to predict y based on the wind and temperature. We then do a split something like this:
df_train = df.loc[df.index <= split_date].copy()
df_test = df.loc[df.index > split_date].copy()
X1=df_train[['wind','temperature']]
y1=df_train['y']
X2=df_test[['wind','temperature']]
y2=df_test['y']
from sklearn.model_selection import train_test_split
X_train, y_train =X1, y1
X_test, y_test = X2,y2
model.fit(X_train,y_train)
We then predict on our test data. However, this uses the wind and temperature features from the test data frame. If I want to predict tomorrow's (unknown) y without knowing tomorrow's hourly temperature and wind, does the method no longer work? (For LSTM or XGBoost, for example.)
The way you train your model, each row is considered an independent sample, regardless of the order, i.e. which values are observed earlier or later. If you have reason to believe that the chronological order is relevant to predicting y from wind speed and temperature, you will need to change your model.
You could try, e.g., adding another column with the values of wind speed and temperature one hour earlier (shift them by one row), or, if you believe that y might depend on the weekday, compute the weekday from the date and add that as an input feature.
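A minimal sketch of that idea, using the column names from the frame above:

import pandas as pd

df = df.sort_values("Date")
df["wind_lag1"] = df["wind"].shift(1)                # wind one hour earlier
df["temperature_lag1"] = df["temperature"].shift(1)  # temperature one hour earlier
df["weekday"] = pd.to_datetime(df["Date"]).dt.dayofweek
df = df.dropna()                                     # the first row has no lagged values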
I have 2 datasets in which there are multiple repeated rows for each date, due to the different recording times of each health attribute pertaining to that date.
I want to shrink my dataset by aggregating the values of each column pertaining to the same day. I don't want to create a new data frame because I then need to merge it with other datasets. After trying the code below, my df still has the same number of rows. Any help would be appreciated.
sample data:
count calorie update_time speed distance date
101 4.290000 2018-04-30 18:35:00.291 1.527778 78.420000 2018-04-30
25 0.960000 2018-04-13 19:55:00.251 1.027778 14.360000 2018-04-13
38 1.530000 2018-04-02 10:14:58.210 1.194444 24.190000 2018-04-02
35 1.450000 2018-04-27 10:55:01.281 1.500000 27.450000 2018-04-27
0 0.000000 2018-04-21 13:46:36.801 0.000000 0.000000 2018-04-21
34 1.820000 2018-04-01 08:35:05.481 2.222222 30.260000 2018-04-01
df_SC['date']=df_SC.groupby('date').agg({"distance": "sum","calorie":"sum",
"count":"sum","speed":"mean"}).reset_index()
I expect the sum of distance, calorie and count, and the mean of speed, to show up under each respective column against each date.
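For reference, this is the shape of aggregation I'm after (assuming the result should replace the whole frame rather than be assigned to a single column, as in my attempt above):

df_SC = (
    df_SC.groupby("date", as_index=False)
         .agg({"distance": "sum", "calorie": "sum", "count": "sum", "speed": "mean"})
)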