I am currently using the Statsmodels library for forecasting. I am trying to run triple exponential smoothing with seasonality, trend, and a smoothing factor, but keep getting errors. Here is a sample pandas series with the frequency already set:
2018-01-01 25
2018-02-01 30
2018-03-01 40
2018-04-01 38
2018-05-01 33
2018-06-01 36
2018-07-01 34
2018-08-01 35
2018-09-01 37
2018-10-01 41
2018-11-01 36
2018-12-01 32
2019-01-01 31
2019-02-01 29
2019-03-01 28
2019-04-01 29
2019-05-01 30
Freq: MS, dtype: float64
Here is my code:
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing, SimpleExpSmoothing
def triple_expo(input_data_set, output_list1):
    data = input_data_set
    model1 = ExponentialSmoothing(data, trend='add', seasonal='mul', damped=False, seasonal_periods=12, freq='M').fit(smoothing_level=0.1, smoothing_slope=0.1, smoothing_seasonal=0.1, optimized=False)
    fcast1 = model1.forecast(12)
    fcast_list1 = list(fcast1)
    output_list1.append(fcast_list1)
for product in unique_products:
    product_slice = sorted_product_df["Product"] == product
    unique_slice = sorted_product_df[product_slice]
    amount = unique_slice["Amount"].tolist()
    index = unique_slice["Date"].tolist()
    unique_series = pd.Series(amount, index)
    unique_series.index = pd.DatetimeIndex(unique_series.index, freq=pd.infer_freq(unique_series.index))
    triple_expo(unique_series, triple_out_one)
I originally did not have a frequency argument at all because I was following the example on the statsmodels website, which is here: http://www.statsmodels.org/stable/examples/notebooks/generated/exponential_smoothing.html
They did not pass a frequency argument at all, since it was inferred from the pandas DatetimeIndex. When I have no frequency argument, my error is "operands could not be broadcast together with shapes (5,) (12,)". I have 17 months of data and pandas recognizes the frequency as 'MS'; the 5 and the 12 come from splitting my 17 months. I then pass freq='M' like I have in the code sample, and I get "The given frequency argument is incompatible with the given index". I then tried setting it to everything from len(data) to len(data)-1 and always got errors. I tried len(data)-1 because I was originally referencing this Stack Overflow post: seasonal_decompose: operands could not be broadcast together with shapes on a series
In that post, he said it would work if you set the frequency to one less than the length of the data set. It does not work for me, though. Any help would be appreciated. Thanks!
EDIT: The comment below suggested I include more than one year's worth of data and it worked when I did that.
I have a data frame like this (it's just the head):
Timestamp Function_code Node_id Delta
0 2000-01-01 10:39:51.790683 Tx_PDO_2 54 551.0
1 2000-01-01 10:39:51.791650 Tx_PDO_2 54 601.0
2 2000-01-01 10:39:51.792564 Tx_PDO_3 54 545.0
3 2000-01-01 10:39:51.793511 Tx_PDO_3 54 564.0
There are only two types of Function_code : Tx_PDO_2 and Tx_PDO_3
I plot in two windows, a graph with Timestamp on the x-axis and Delta on the y-axis. One for Tx_PDO_2 and the other for Tx_PDO_3 :
delta_rx_tx_df.groupby("Function_code").plot(x="Timestamp", y="Delta")
Now, I want to know which window corresponds to which Function_code
I tried to use title=delta_rx_tx_df.groupby("Function_code").groups but it did not work.
There may be a better way, but for starters, you can assign the titles to the plots after they are created:
plots = delta_rx_tx_df.groupby("Function_code").plot(x="Timestamp", y="Delta")
plots.reset_index()\
     .apply(lambda x: x[0].set_title(x['Function_code']), axis=1)
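A plain loop works too, since the groupby plot call returns a Series of axes keyed by group label. Here is a self-contained sketch with made-up data (the Agg backend is only selected so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, just so this runs without a display
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2000-01-01 10:39:51.790683", "2000-01-01 10:39:51.791650",
        "2000-01-01 10:39:51.792564", "2000-01-01 10:39:51.793511",
    ]),
    "Function_code": ["Tx_PDO_2", "Tx_PDO_2", "Tx_PDO_3", "Tx_PDO_3"],
    "Delta": [551.0, 601.0, 545.0, 564.0],
})

plots = df.groupby("Function_code").plot(x="Timestamp", y="Delta")
for code, ax in plots.items():  # plots is a Series: group label -> Axes
    ax.set_title(code)
```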
I want to integrate the following dataframe, such that I have the integrated value for every hour. I have roughly a 10 s sampling rate, but if it is necessary to have an even time interval, I guess I can just use df.resample().
Timestamp Power [W]
2022-05-05 06:00:05+02:00 2.0
2022-05-05 06:00:15+02:00 1.2
2022-05-05 06:00:25+02:00 0.3
2022-05-05 06:00:35+02:00 4.3
2022-05-05 06:00:45+02:00 1.1
...
2022-05-06 20:59:19+02:00 1.4
2022-05-06 20:59:29+02:00 2.0
2022-05-06 20:59:39+02:00 4.1
2022-05-06 20:59:49+02:00 1.3
2022-05-06 20:59:59+02:00 0.8
So I want to be able to integrate over both hours and days, so my output could look like:
Timestamp Energy [Wh]
2022-05-05 07:00:00+02:00 some values
2022-05-05 08:00:00+02:00 .
2022-05-05 09:00:00+02:00 .
2022-05-05 10:00:00+02:00 .
2022-05-05 11:00:00+02:00
...
2022-05-06 20:00:00+02:00
2022-05-06 21:00:00+02:00
(hour 07:00 is to include values between 06:00-07:00, and so on...)
and
Timestamp Energy [Wh]
2022-05-05 .
2022-05-06 .
So how do I achieve this? I was thinking I could use scipy.integrate, but my outputs look a bit weird.
Thank you.
You could create a new column representing your Timestamp truncated to hours:
df['Timestamp_hour'] = df['Timestamp'].dt.floor('h')
Please note that in that case, the rows from hour 6:00 to hour 6:59 will be grouped under hour 6, not hour 7.
Then you can group your rows by your new column before applying your integration computation:
df_integrated_hour = (
df
.groupby('Timestamp_hour')
.agg({
'Power': YOUR_INTEGRATION_FUNCTION
})
.rename(columns={'Power': 'Energy'})
.reset_index()
)
Hope this helps.
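As a sketch of what YOUR_INTEGRATION_FUNCTION could be, assuming evenly spaced 10-second samples: the mean power over the hour (in W) equals the energy for that hour (in Wh), i.e. the rectangle rule. The data below are constant made-up values just so the result is easy to check:

```python
import pandas as pd

# constant 2 W over two hours of 10-second samples (made-up data)
times = pd.date_range("2022-05-05 06:00:00+02:00", periods=720, freq="10s")
df = pd.DataFrame({"Timestamp": times, "Power": 2.0})

def integrate_hour(power):
    # rectangle rule over a full hour: mean W * 1 h = Wh
    return power.mean()

df["Timestamp_hour"] = df["Timestamp"].dt.floor("h")
df_integrated_hour = (
    df.groupby("Timestamp_hour")
      .agg({"Power": integrate_hour})
      .rename(columns={"Power": "Energy"})
      .reset_index()
)
# each full hour at constant 2 W comes out as 2.0 Wh
```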
Here's a very simple solution using rectangle integration, with rectangles spaced in 10-second intervals starting at zero and therefore NOT centered exactly on the data points (assuming that the data is delivered in regular intervals and no data is missing). In other words, a simple average.
from numpy import random
import pandas as pd
times = pd.date_range('2022-05-05 06:00:04+02:00', '2022-05-06 21:00:00+02:00', freq='10S')
watts = random.rand(len(times)) * 5
df = pd.DataFrame(index=times, data=watts, columns=["Power [W]"])
hourly = df.groupby([df.index.date, df.index.hour]).mean()
hourly.columns = ["Energy [Wh]"]
print(hourly)
hours_in_a_day = 24 # add special casing for leap days here, if required
daily = df.groupby(df.index.date).mean() * hours_in_a_day
daily.columns = ["Energy [Wh]"]
print(daily)
Output:
Energy [Wh]
2022-05-05 6 2.625499
7 2.365678
8 2.579349
9 2.569170
10 2.543611
11 2.742332
12 2.478145
13 2.444210
14 2.507821
15 2.485770
16 2.414057
17 2.567755
18 2.393725
19 2.609375
20 2.525746
21 2.421578
22 2.520466
23 2.653466
2022-05-06 0 2.559110
1 2.519032
2 2.472282
3 2.436023
4 2.378289
5 2.549572
6 2.558478
7 2.470721
8 2.429454
9 2.390543
10 2.538194
11 2.537564
12 2.492308
13 2.387632
14 2.435582
15 2.581616
16 2.389549
17 2.461523
18 2.576084
19 2.523577
20 2.572270
Energy [Wh]
2022-05-05 60.597007
2022-05-06 59.725029
Trapezoidal integration should give a slightly better approximation but it's harder to implement right. You'd have to deal carefully with the hour boundaries. That's basically just a matter of inserting interpolated values twice at the full hour (at 09:59:59.999 and 10:00:00). But then you'd also have to figure out a way to extrapolate to the start and end of the range, i.e. in your example go from 06:00:05 to 06:00:00. But careful, what to do if your measurements only start somewhere in the middle like 06:17:23?
This solution uses a package called staircase, which is part of the pandas ecosystem and exists to make working with step functions (i.e. piecewise constant) easier.
It will create a Stairs object (which represents a step function) from a pandas.Series, then bin across arbitrary DatetimeIndex values, then integrate.
This solution requires staircase 2.4.2 or above
setup
df = pd.DataFrame(
{
"Timestamp":pd.to_datetime(
[
"2022-05-05 06:00:05+02:00",
"2022-05-05 06:00:15+02:00",
"2022-05-05 06:00:25+02:00",
"2022-05-05 06:00:35+02:00",
"2022-05-05 06:00:45+02:00",
]
),
"Power [W]":[2.0, 1.2, 0.3, 4.3, 1.1]
}
)
solution
import staircase as sc
# create step function
sf = sc.Stairs.from_values(
initial_value=0,
values=df.set_index("Timestamp")["Power [W]"],
)
# optional: plot
sf.plot(style="hlines")
# create the bins (datetime index) over which you want to integrate
# using 20s intervals in this example
bins = pd.date_range(
"2022-05-05 06:00:00+02:00", "2022-05-05 06:01:00+02:00", freq="20s"
)
# slice into bins and integrate
result = sf.slice(bins).integral()
result will be a pandas.Series with an IntervalIndex and Timedelta values. The IntervalIndex retains timezone info, it just doesn't display it:
[2022-05-05 06:00:00, 2022-05-05 06:00:20) 0 days 00:00:26
[2022-05-05 06:00:20, 2022-05-05 06:00:40) 0 days 00:00:30.500000
[2022-05-05 06:00:40, 2022-05-05 06:01:00) 0 days 00:00:38
dtype: timedelta64[ns]
You can change the index to be the "left" values (and see this timezone info) like this:
result.index = result.index.left
You can change values to a float with division by an appropriate Timedelta. Eg to convert to minutes:
result/pd.Timedelta("1min")
note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
I am trying to calculate the regression coefficient of weight for every animal_id and cycle_nr in my df:
animal_id  cycle_nr  feed_date   weight
1003       8         2020-02-06  221
1003       8         2020-02-10  226
1003       8         2020-02-14  230
1004       1         2020-02-20  231
1004       1         2020-02-21  243
What I tried, using this source:
import pandas as pd
import statsmodels.api as sm
def GroupRegress(data, yvar, xvars):
Y = data[yvar]
X = data[xvars]
X['intercept'] = 1.
result = sm.OLS(Y, X).fit()
return result.params
result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'feed_date', ['weight'])
This code fails because my variable includes a date.
What I tried next:
I figured I could create a numeric column to use instead of my date column. I created a simple count_id column:
animal_id  cycle_nr  feed_date   weight  id
1003       8         2020-02-06  221     1
1003       8         2020-02-10  226     2
1003       8         2020-02-14  230     3
1004       1         2020-02-20  231     4
1004       1         2020-02-21  243     5
Then I ran my regression on this column
result = df.groupby(['animal_id', 'cycle_nr']).apply(GroupRegress, 'id', ['weight'])
The slope calculation looks good, but the intercept of course makes no sense.
Then I realized that this method is only usable when the interval between measurements is regular. In most cases the interval is 7 days, but sometimes it is 10, 14 or 21 days.
I dropped records where the interval was not 7 days and re-ran my regression... It works, but I hate that I have to throw away perfectly fine data.
I'm wondering if there is a better approach where I can either include the date in my regression or can correct for the varying intervals of my dates. Any suggestions?
I'm wondering if there is a better approach where I can either include the date in my regression or can correct for the varying intervals of my dates.
If the feed dates are strings make a datetime Series using pandas.to_datetime.
Use that new Series to calculate the actual time difference between feedings
Use the resultant timedeltas in your regression instead of a linear fabricated sequence. The timedeltas have different attributes, (i.e. microseconds, days), that can be used depending on the resolution you need.
My first instinct would be to produce the Timedeltas for each group separately. The first feeding in each group would of course be time zero.
Making the Timedeltas may not even be necessary - there are probably datetime aware regression methods in Numpy or Scipy or maybe even Pandas - I imagine there would have to be, it is a common enough application.
Instead of Timedeltas the datetime Series could be converted to ordinal values for use in the regression.
df = pd.DataFrame(
{
"feed_date": [
"2020-02-06",
"2020-02-10",
"2020-02-14",
"2020-02-20",
"2020-02-21",
]
}
)
>>> q = pd.to_datetime(df.feed_date)
>>> q
0 2020-02-06
1 2020-02-10
2 2020-02-14
3 2020-02-20
4 2020-02-21
Name: feed_date, dtype: datetime64[ns]
>>> q.apply(pd.Timestamp.toordinal)
0 737461
1 737465
2 737469
3 737475
4 737476
Name: feed_date, dtype: int64
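Putting the ordinal idea together with the groupby, here is a minimal sketch on the question's sample data, using numpy.polyfit for the per-group fit instead of statsmodels just to keep it small; the slope_per_day helper is an illustrative name, not part of any library:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "animal_id": [1003, 1003, 1003, 1004, 1004],
    "cycle_nr":  [8, 8, 8, 1, 1],
    "feed_date": ["2020-02-06", "2020-02-10", "2020-02-14",
                  "2020-02-20", "2020-02-21"],
    "weight":    [221, 226, 230, 231, 243],
})
df["t"] = pd.to_datetime(df["feed_date"]).apply(pd.Timestamp.toordinal)

def slope_per_day(g):
    # first feeding in the group is time zero; [0] is the slope per day
    t = g["t"] - g["t"].min()
    return np.polyfit(t, g["weight"], 1)[0]

slopes = df.groupby(["animal_id", "cycle_nr"])[["t", "weight"]].apply(slope_per_day)
```

The result is a Series keyed by (animal_id, cycle_nr), with the uneven gaps between feed dates handled correctly.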
So I have some sea surface temperature anomaly data. These data have been filtered down so that these are the values that are below a certain threshold. However, I am trying to identify cold spells - that is, to isolate events that last longer than 5 consecutive days. A sample of my data is below (I've been working between xarray datasets/dataarrays and pandas dataframes). Note, the 'day' is the day number of the month I am looking at (eventually will be expanded to the whole year). I have been scouring SO/the internet for ways to extract these 5-day-or-longer events based on the 'day' column, but I haven't gotten anything to work. I'm still relatively new to coding so my first thought was looping over the rows of the 'day' column but I'm not sure. Any insight is appreciated.
Here's what some of my data look like as a pandas df:
lat lon time day ssta
5940 24.125 262.375 1984-06-03 3 -1.233751
21072 24.125 262.375 1984-06-04 4 -1.394495
19752 24.125 262.375 1984-06-05 5 -1.379742
10223 24.125 262.375 1984-06-27 27 -1.276407
47355 24.125 262.375 1984-06-28 28 -1.840763
... ... ... ... ... ...
16738 30.875 278.875 2015-06-30 30 -1.345640
3739 30.875 278.875 2020-06-16 16 -1.212824
25335 30.875 278.875 2020-06-17 17 -1.446407
41891 30.875 278.875 2021-06-01 1 -1.714249
27740 30.875 278.875 2021-06-03 3 -1.477497
64228 rows × 5 columns
As a filtered xarray:
xarray.Dataset
Dimensions: lat: 28, lon: 68, time: 1174
Coordinates:
time (time) datetime64[ns] 1982-06-01 ... 2021-06-04
lon (lon) float32 262.1 262.4 262.6 ... 278.6 278.9
lat (lat) float32 24.12 24.38 24.62 ... 30.62 30.88
day (time) int64 1 2 3 4 5 6 7 ... 28 29 30 1 2 3 4
Data variables:
ssta (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (0)
TLDR; I want to identify (and retain the information of) events that are 5+ consecutive days, ie if there were a day 3 through day 8, or day 21 through day 30, etc.
I think rather than filtering your original data you should try to do it the pandas way, which in this case means obtaining a series with True/False values depending on your condition.
Your data seems not to include temperatures so here is my example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'temp':np.random.randint(10,high=40,size=64228,dtype='int64')})
Will generate a DataFrame with a single column containing random temperatures between 10 and 40 degrees. Notice that I can just work with the auto generated index but you might have to switch it to a column like time or date or something like that using .set_index. Say we are interested in the consecutive days with more than 30 degrees.
is_over_30 = df['temp'] > 30
will give us a True/False array with that information. Notice that this format is very useful since we can index with it. E.g. df[is_over_30] will give us the rows of the dataframe for days where the temperature is over 30 deg. Now we want to shift the True/False values in is_over_30 one spot forward and generate a new series that is true if both are true, like so
is_over_30 & np.roll(is_over_30, -1)
Basically we are done here and could write 3 more of those & rolls. But there is a way to write it more concise.
from functools import reduce
is_consecutively_over_30 = reduce(lambda a,b: a&b, [np.roll(is_over_30, -i) for i in range(5)])
Keep in mind that even though the last 4 days can't be consecutively over 30 deg, this might still happen here, since roll shifts the first values into the positions relevant for that. But you can just set the last 4 values to False to resolve this.
is_consecutively_over_30[-4:] = False
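Putting the pieces together on a tiny made-up series, so the pattern is easy to check:

```python
from functools import reduce
import numpy as np
import pandas as pd

temps = pd.Series([31, 32, 33, 34, 35, 20, 31, 32, 20, 20])
is_over_30 = temps > 30

# AND together the mask shifted by 0..4: True where a 5-day run starts
is_consecutively_over_30 = reduce(
    lambda a, b: a & b,
    [np.roll(is_over_30, -i) for i in range(5)],
)
is_consecutively_over_30[-4:] = False  # roll wraps around, so mask the tail
# only position 0 starts five consecutive days over 30 degrees
```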
You can pull the day ranges of the spells using this approach:
import numpy as np
import pandas as pd

min_spell_days = 6
days = {'day': [1,2,5,6,7,8,9,10,17,19,21,22,23,24,25,26,27,31]}
df = pd.DataFrame(days)
Find number of days between consecutive entries:
diff = df['day'].diff()
Mark the last day of a spell:
df['last'] = (diff == 1) & (diff.shift(-1) > 1)
Accumulate the number of days in each spell:
df['diff0'] = np.where(diff > 1, 0, diff)
df['cs'] = df['diff0'].eq(0).cumsum()
df['spell_days'] = df.groupby('cs')['diff0'].transform('cumsum')
Mark the last entry as the last day of a spell if applicable:
if diff.iat[-1] == 1:
df['last'].iat[-1] = True
Select the last day of all qualifying spells:
df_spells = (df[df['last'] & (df['spell_days'] >= (min_spell_days-1))]).copy()
Identify the start, end and duration of each spell:
df_spells['end_day'] = df_spells['day']
df_spells['start_day'] = (df_spells['day'] - df['spell_days'])
df_spells['spell_days'] = df['spell_days'] + 1
Resulting df:
df_spells[['start_day','end_day','spell_days']].astype('int')
start_day end_day spell_days
7 5 10 6
16 21 27 7
Also, using date arithmetic, 'day' could represent a serial day number relative to some base date, like 1900-01-01. That way spells that span month and year boundaries could be handled, and it would then be trivial to convert the serial number back to a date.
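As an alternative sketch on the same sample day list: consecutive days share a constant difference between the day number and its position in the series, so that difference can serve as a run id (the classic gaps-and-islands trick):

```python
import pandas as pd

days = pd.Series([1, 2, 5, 6, 7, 8, 9, 10, 17, 19,
                  21, 22, 23, 24, 25, 26, 27, 31])

# consecutive days share the same (day - position) value
run_id = days - pd.Series(range(len(days)))
runs = days.groupby(run_id).agg(["min", "max", "size"])
spells = runs[runs["size"] >= 6]  # keep only runs of 6+ days
```

This recovers the same two spells (5 through 10, and 21 through 27) in a few lines, though it only works on day numbers that don't cross month boundaries.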
Typically when we have a data frame we split it into train and test. For example, imagine my data frame is something like this:
> df.head()
Date y wind temperature
1 2019-10-03 00:00:00 33 12 15
2 2019-10-03 01:00:00 10 5 6
3 2019-10-03 02:00:00 39 6 5
4 2019-10-03 03:00:00 60 13 4
5 2019-10-03 04:00:00 21 3 7
I want to predict y based on the wind and temperature. We then do a split something like this:
df_train = df.loc[df.index <= split_date].copy()
df_test = df.loc[df.index > split_date].copy()
X1=df_train[['wind','temperature']]
y1=df_train['y']
X2=df_test[['wind','temperature']]
y2=df_test['y']
from sklearn.model_selection import train_test_split
X_train, y_train =X1, y1
X_test, y_test = X2,y2
model.fit(X_train,y_train)
And we then predict our test data. However, this uses the features of wind and temperature in the test data frame. If I want to predict (unknown) tomorrow y without knowing tomorrow's hourly temperature and wind, does the method no longer work? (For LSTM or XGBoost for example)
The way you train your model, each row is considered an independent sample, regardless of the order, i.e. what values are observed earlier or later. If you have reason to believe that the chronological order is relevant to predicting y from wind speed and temperature you will need to change your model.
You could try, e.g., to add another column with the values for wind speed and temperature one hour before (shift them by one row), or, if you believe that y might depend on the weekday, compute the weekday from the date and add that as an input feature.
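A small sketch of both suggestions on the question's sample frame; the lag-1 and weekday column names are illustrative, not required by any library:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2019-10-03", periods=5, freq="h"),
    "y": [33, 10, 39, 60, 21],
    "wind": [12, 5, 6, 13, 3],
    "temperature": [15, 6, 5, 4, 7],
})

# weather one hour earlier: features already known when predicting the next hour
df["wind_lag1"] = df["wind"].shift(1)
df["temperature_lag1"] = df["temperature"].shift(1)

# calendar feature derived from the date alone, known in advance
df["weekday"] = df["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
```

With features like these, tomorrow's prediction only needs past observations and the calendar, not tomorrow's actual weather.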