How can I standardize time series data? - Python

I'm working with OHLC trading data and I have different datasets with very different price ranges. For example, in one dataset the price ranges from 100 to 150, in another from 2 to 3, in another from 0.5 to 0.8, and so on, so the magnitudes differ a lot.
On each dataset I loop through the data and, for each point, compute the slope of the last five prices using np.polyfit().
Here is my code:
import numpy as np

x = df['Date'].to_numpy()
y = df['Close'].to_numpy()
fits = []
for idx, j in enumerate(y):
    arr_y = y[:idx]
    arr_x = x[:idx]
    p_y = arr_y[-5:]
    p_x = arr_x[-5:]
    if len(p_y) >= 4 and len(p_x) >= 4:
        fit = np.polyfit(p_x, p_y, 1)
        ang_coeff = fit[0]
        intercept = fit[1]
        fits.append(ang_coeff)
    else:
        fits.append(np.nan)
df['SLOPE'] = fits
Here is what the code does: loop through the prices and, for each price, calculate the slope based on the last five prices.
This code works well, but the problem is that, since I'm working with multiple datasets whose prices differ a lot, it becomes hard to perform any kind of analysis: a very high slope value in one dataset would be very low in another. My question is: how can I standardize or normalize (I know they are two different things) this data? How can I process my slope values so that a "high" slope value in one dataset is also high in another?
Here is a sample of my outputs:
Date Close Slope
2021-01-17 00:00:00 34031.098338 29.572362
2021-01-17 04:00:00 34034.475090 20.097445
2021-01-17 08:00:00 34034.982351 8.655060
2021-01-17 12:00:00 34044.665386 3.914707
2021-01-17 16:00:00 34049.372571 4.538112
2021-01-17 20:00:00 34059.458965 4.673876
2021-01-18 00:00:00 34063.656831 6.435797
2021-01-18 04:00:00 34070.819559 7.214254
2021-01-18 08:00:00 34086.331298 6.659261
2021-01-18 12:00:00 34099.272005 8.527805
2021-01-18 16:00:00 34099.560423 10.230055
2021-01-18 20:00:00 34106.109568 10.025963
2021-01-19 00:00:00 34110.932662 8.380914
2021-01-19 04:00:00 34122.312205 5.604029
2021-01-19 08:00:00 34134.855812 5.745264
2021-01-19 12:00:00 34162.275141 8.679342
2021-01-19 16:00:00 34190.550778 13.625430
2021-01-19 20:00:00 34211.505419 19.919917
2021-01-20 00:00:00 34222.969489 23.408140
2021-01-20 04:00:00 34237.699255 22.545763
2021-01-20 08:00:00 34240.094551 18.326694
2021-01-20 12:00:00 34239.827609 12.528138
2021-01-20 16:00:00 34239.900596 7.376944
2021-01-20 20:00:00 34246.295214 3.599057
2021-01-21 00:00:00 34248.790292 1.699797
2021-01-21 04:00:00 34251.656251 2.385909
2021-01-21 08:00:00 34211.135875 3.254698
2021-01-21 12:00:00 34150.903010 -5.216841
2021-01-21 16:00:00 34127.857586 -22.843883
2021-01-21 20:00:00 34072.463679 -34.261865
2021-01-22 00:00:00 34018.425804 -44.166343
2021-01-22 04:00:00 33974.399053 -46.385947
2021-01-22 08:00:00 33946.475779 -46.243970
2021-01-22 12:00:00 33929.852159 -46.082824
2021-01-22 16:00:00 33927.598892 -35.717306
2021-01-22 20:00:00 33918.627401 -22.620072
2021-01-23 00:00:00 33905.044709 -13.042019
2021-01-23 04:00:00 33894.973038 -9.408690
2021-01-23 08:00:00 33861.417022 -9.231243
And a different dataset:
Date Close Slope
2021-02-18 04:00:00 0.492204 4.013722e-04
2021-02-18 08:00:00 0.492488 4.721365e-04
2021-02-18 12:00:00 0.493027 4.831912e-04
2021-02-18 16:00:00 0.493569 4.591663e-04
2021-02-18 20:00:00 0.494286 4.463141e-04
2021-02-19 00:00:00 0.494799 5.245110e-04
2021-02-19 04:00:00 0.495515 5.880476e-04
2021-02-19 08:00:00 0.496172 6.204948e-04
2021-02-19 12:00:00 0.496634 6.435782e-04
2021-02-19 16:00:00 0.497133 6.069365e-04
2021-02-19 20:00:00 0.497526 5.787601e-04
2021-02-20 00:00:00 0.497712 4.983345e-04
2021-02-20 04:00:00 0.497762 3.972312e-04
2021-02-20 08:00:00 0.497956 2.835458e-04
2021-02-20 12:00:00 0.498307 1.880521e-04
2021-02-20 16:00:00 0.498692 1.804976e-04
2021-02-20 20:00:00 0.498813 2.505608e-04
2021-02-21 00:00:00 0.499153 2.839021e-04
2021-02-21 04:00:00 0.499364 2.901245e-04
2021-02-21 08:00:00 0.499471 2.574213e-04
2021-02-21 12:00:00 0.499556 2.107408e-04
2021-02-21 16:00:00 0.499902 1.803125e-04
2021-02-21 20:00:00 0.500177 1.690260e-04
2021-02-22 00:00:00 0.500221 2.059057e-04
2021-02-22 04:00:00 0.501403 2.121462e-04
2021-02-22 08:00:00 0.502194 4.012434e-04
2021-02-22 12:00:00 0.502318 5.809102e-04
2021-02-22 16:00:00 0.502852 6.255775e-04
2021-02-22 20:00:00 0.503182 6.177676e-04
2021-02-23 00:00:00 0.503209 4.214821e-04
2021-02-23 04:00:00 0.503271 2.893487e-04
2021-02-23 08:00:00 0.502459 2.262497e-04
2021-02-23 12:00:00 0.502190 -6.951268e-05
2021-02-23 16:00:00 0.501697 -2.733434e-04
2021-02-23 20:00:00 0.501526 -4.105911e-04
2021-02-24 00:00:00 0.501506 -4.251799e-04
2021-02-24 04:00:00 0.501420 -2.571382e-04
2021-02-24 08:00:00 0.501332 -1.730550e-04
2021-02-24 12:00:00 0.501099 -8.359633e-05
2021-02-24 16:00:00 0.500684 -1.027447e-04
2021-02-24 20:00:00 0.500341 -1.962963e-04
2021-02-25 00:00:00 0.500027 -2.806065e-04
2021-02-25 04:00:00 0.499747 -3.368647e-04
2021-02-25 08:00:00 0.499428 -3.361539e-04
2021-02-25 12:00:00 0.499212 -3.105732e-04
2021-02-25 16:00:00 0.498883 -2.857117e-04
So these two datasets have very different Close values, which means the slope values are completely different: a very "high" slope value in the second dataset is nothing compared to the first dataset's slope values. Is there any way I can solve this? Do I have to apply some sort of normalization or standardization? Or do I need to use a different kind of calculation or metric? Thanks in advance!

The Close values can be scaled using sklearn's MinMaxScaler().
You can also simplify the polyfit loop by using Rolling.apply() with a window size of 5:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
for df in [df1, df2]:
    df['Close'] = scaler.fit_transform(df['Close'].to_numpy().reshape(-1, 1))
    df['Slope'] = df['Close'].rolling(5, center=True).apply(lambda x: np.polyfit(x.index, x, 1)[0])
>>> df1
Date Close Slope
0 2021-01-17 00:00:00 0.434814 NaN
1 2021-01-17 04:00:00 0.443467 NaN
2 2021-01-17 08:00:00 0.444766 0.011977
3 2021-01-17 12:00:00 0.469580 0.016492
4 2021-01-17 16:00:00 0.481642 0.018487
...
34 2021-01-22 16:00:00 0.169593 -0.024110
35 2021-01-22 20:00:00 0.146603 -0.023655
36 2021-01-23 00:00:00 0.111797 -0.039980
37 2021-01-23 04:00:00 0.085988 NaN
38 2021-01-23 08:00:00 0.000000 NaN
>>> df2
Date Close Slope
0 2021-02-18 04:00:00 0.000000 NaN
1 2021-02-18 08:00:00 0.025662 NaN
2 2021-02-18 12:00:00 0.074365 0.047393
3 2021-02-18 16:00:00 0.123340 0.053140
4 2021-02-18 20:00:00 0.188127 0.056077
...
41 2021-02-25 00:00:00 0.706876 -0.028065
42 2021-02-25 04:00:00 0.681576 -0.025815
43 2021-02-25 08:00:00 0.652751 -0.025508
44 2021-02-25 12:00:00 0.633234 NaN
45 2021-02-25 16:00:00 0.603506 NaN

I recommend adjusting the scale by first calculating the Average True Range (ATR, see https://www.investopedia.com/terms/a/atr.asp) of one of the datasets and settling on a reasonable scale that gives a representative slope for that one. Then, for the other datasets, calculate the ratio of their ATR to the "standard" dataset's ATR and adjust the slope by that ratio.
For example, if a new dataset has an ATR that is only a tenth of your "standard" ATR, you multiply its slope measurements by 10 to put them on the same scale.
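For illustration, here is a minimal sketch of that idea. It assumes OHLC columns named 'High', 'Low' and 'Close', uses a simple rolling-mean ATR (Wilder's original uses exponential smoothing), and ref_atr is a placeholder you would compute once from your chosen "standard" dataset:
import pandas as pd

def average_true_range(ohlc, period=14):
    # simple rolling-mean ATR from 'High', 'Low', 'Close' columns
    prev_close = ohlc['Close'].shift(1)
    true_range = pd.concat([
        ohlc['High'] - ohlc['Low'],
        (ohlc['High'] - prev_close).abs(),
        (ohlc['Low'] - prev_close).abs(),
    ], axis=1).max(axis=1)
    return true_range.rolling(period).mean()

ref_atr = 250.0                             # placeholder: ATR of your "standard" dataset
atr_now = average_true_range(df).iloc[-1]   # current ATR of this dataset
df['SLOPE_SCALED'] = df['SLOPE'] * (ref_atr / atr_now)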

I recommend unit length scaling (scaling to unit length) or unit normal scaling (standardization) if you want the series to maintain their statistical properties but be scale free. It doesn't matter which one you use since you're just looking at slopes, and the fitted slopes from the two methods are identical (Montgomery et al., section 3.9).
Essentially, for UNS, take the z-score of all of your regressors and of the response variable, and fit the transformed data without an intercept. For ULS, take the mean-deviated regressor and response values divided by the square root of their corrected sums of squares.
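As a rough sketch of the unit normal scaling route applied to the question's setup, standardizing the whole Close series with a z-score and then re-running the rolling fit (note this uses the full-series mean and standard deviation, which looks ahead if this feeds a live trading signal):
import numpy as np

# unit normal scaling: z-score the Close series so it is scale free
close_z = (df['Close'] - df['Close'].mean()) / df['Close'].std()

# same 5-point rolling slope as before, now on the standardized series
df['SLOPE_Z'] = close_z.rolling(5).apply(
    lambda w: np.polyfit(np.arange(len(w)), w, 1)[0], raw=True
)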
There are other methods you can try. They fall under the heading of feature scaling and include min-max normalization and mean normalization (Wikipedia, 2021).
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis (5th ed.). John Wiley & Sons, Inc.

Related

Efficiently slicing non-integer multilevel indexes with integers in Pandas

The following code generates a sample DataFrame with a multilevel index. The first level is a string, the second level is a datetime.
Script
import pandas as pd
from datetime import datetime
import random

networks = ['ALPHA', 'BETA', 'GAMMA']
times = pd.date_range(datetime.strptime('2021-01-01 00:00:00', '%Y-%m-%d %H:%M:%S'),
                      datetime.strptime('2021-01-01 12:00:00', '%Y-%m-%d %H:%M:%S'),
                      7).tolist()
# build the rows first, then create the DataFrame (df.append was removed in pandas 2.0)
rows = []
for n in networks:
    for t in times:
        rows.append({'network': n, 'time': t,
                     'active_clients': random.randint(10, 30),
                     'throughput': random.randint(1500, 5000),
                     'speed': random.randint(10000, 12000)})
df = pd.DataFrame(rows, columns=['network', 'time', 'active_clients', 'throughput', 'speed'])
df.set_index(['network', 'time'], inplace=True)
print(df.to_string())
Output
active_clients throughput speed
network time
ALPHA 2021-01-01 00:00:00 16 4044 11023
2021-01-01 02:00:00 17 2966 10933
2021-01-01 04:00:00 10 4649 11981
2021-01-01 06:00:00 23 3629 10113
2021-01-01 08:00:00 30 2520 11159
2021-01-01 10:00:00 10 4200 11309
2021-01-01 12:00:00 16 3878 11366
BETA 2021-01-01 00:00:00 17 3073 11798
2021-01-01 02:00:00 20 1941 10640
2021-01-01 04:00:00 17 1980 11869
2021-01-01 06:00:00 23 3346 10002
2021-01-01 08:00:00 10 1952 10063
2021-01-01 10:00:00 28 3788 11047
2021-01-01 12:00:00 24 4993 10487
GAMMA 2021-01-01 00:00:00 21 4366 11587
2021-01-01 02:00:00 22 3404 11669
2021-01-01 04:00:00 20 1608 10344
2021-01-01 06:00:00 28 1849 10278
2021-01-01 08:00:00 14 3229 11925
2021-01-01 10:00:00 21 3408 10411
2021-01-01 12:00:00 12 1799 10492
For each item in the first level, I want to select the last three records in the second level. The catch is that I don't know the datetime values, so I need to select by integer-based index location instead. What's the most efficient way of slicing the DataFrame to achieve the following?
Desired output
active_clients throughput speed
network time
ALPHA 2021-01-01 08:00:00 30 2520 11159
2021-01-01 10:00:00 10 4200 11309
2021-01-01 12:00:00 16 3878 11366
BETA 2021-01-01 08:00:00 10 1952 10063
2021-01-01 10:00:00 28 3788 11047
2021-01-01 12:00:00 24 4993 10487
GAMMA 2021-01-01 08:00:00 14 3229 11925
2021-01-01 10:00:00 21 3408 10411
2021-01-01 12:00:00 12 1799 10492
My attempts
Returns the full dataframe:
df_sel = df.iloc[:,-3:]
Raises an error because loc doesn't support using integer values on datetime objects:
df_sel = df.loc[:,-3:]
Returns the last three entries in the second level, but only for the last entry in the first level:
df_sel = df.loc[:].iloc[-3:]
I have two methods to solve this problem:
Method 1:
As mentioned in the first comment by Quang Hoang, you can use groupby to do this, which I believe gives the shortest code:
df.groupby(level=0).tail(3)
Method 2:
You can also slice each network's rows and then concatenate them:
pd.concat([df.loc[[i]][-3:] for i in networks])
Both of these methods will output the result you want.
Another method is to do some reshaping:
df.unstack(0).iloc[-3:].stack().swaplevel(0,1).sort_index()
Output:
active_clients throughput speed
network time
ALPHA 2021-01-01 08:00:00 26 4081 11325
2021-01-01 10:00:00 13 3370 10716
2021-01-01 12:00:00 13 3691 10737
BETA 2021-01-01 08:00:00 28 2105 10465
2021-01-01 10:00:00 21 2444 10158
2021-01-01 12:00:00 24 1947 11226
GAMMA 2021-01-01 08:00:00 13 1850 10288
2021-01-01 10:00:00 23 2241 11521
2021-01-01 12:00:00 30 3515 11138
Details:
unstack the outermost index level, level=0
Use iloc to select the last three records in the dataframe
stack that level back into the index, then swaplevel and sort_index

Pandas, insert datetime values that increase one hour for each row

I made predictions with an ARIMA model that predicts the next 168 hours (one week) of cars on the road. I also want to add a column called "datetime" that starts at 00:00 01-01-2021 and increases by one hour for each row.
Is there an intelligent way of doing this?
You can do:
x=pd.to_datetime('2021-01-01 00:00')
y=pd.to_datetime('2021-01-07 23:59')
pd.Series(pd.date_range(x,y,freq='H'))
Output:
0 2021-01-01 00:00:00
1 2021-01-01 01:00:00
2 2021-01-01 02:00:00
3 2021-01-01 03:00:00
4 2021-01-01 04:00:00
...
163 2021-01-07 19:00:00
164 2021-01-07 20:00:00
165 2021-01-07 21:00:00
166 2021-01-07 22:00:00
167 2021-01-07 23:00:00
Length: 168, dtype: datetime64[ns]
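If you want that directly as a column of your predictions frame, a minimal sketch (preds_df is a hypothetical 168-row DataFrame holding the ARIMA output):
import pandas as pd

# hypothetical frame holding the 168 hourly predictions
preds_df = pd.DataFrame({'cars': range(168)})

# one timestamp per row, starting at 2021-01-01 00:00 and stepping by one hour
preds_df['datetime'] = pd.date_range('2021-01-01 00:00', periods=len(preds_df), freq='H')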

How to extract hourly data from a df in python?

I have the following df
dates Final
2020-01-01 00:15:00 94.7
2020-01-01 00:30:00 94.1
2020-01-01 00:45:00 94.1
2020-01-01 01:00:00 95.0
2020-01-01 01:15:00 96.6
2020-01-01 01:30:00 98.4
2020-01-01 01:45:00 99.8
2020-01-01 02:00:00 99.8
2020-01-01 02:15:00 98.0
2020-01-01 02:30:00 95.1
2020-01-01 02:45:00 91.9
2020-01-01 03:00:00 89.5
The entire dataset runs until 2021-01-01 00:00:00 (value 95.6), with a 15-minute gap between rows.
Since the frequency is 15 minutes, I would like to change it to 1 hour and maybe drop the middle values.
Expected output
dates Final
2020-01-01 01:00:00 95.0
2020-01-01 02:00:00 99.8
2020-01-01 03:00:00 89.5
With the last row being 2021-01-01 00:00:00 95.6
How can this be done?
Thanks
Use Series.dt.minute to perform boolean indexing:
df_filtered = df.loc[df['dates'].dt.minute.eq(0)]
#if necessary
#df_filtered = df.loc[pd.to_datetime(df['dates']).dt.minute.eq(0)]
print(df_filtered)
dates Final
3 2020-01-01 01:00:00 95.0
7 2020-01-01 02:00:00 99.8
11 2020-01-01 03:00:00 89.5
If you're doing data analysis or data science, I don't think dropping the middle values is a good approach at all! You should sum them, I guess (I don't know your use case, but I do know some things about time series data).
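As a hedged sketch of that aggregation idea, assuming 'dates' can be parsed as datetimes and that summing the four 15-minute readings per hour is appropriate for your data:
import pandas as pd

df['dates'] = pd.to_datetime(df['dates'])
hourly = df.set_index('dates')['Final'].resample('H').sum()
# .mean() may be more appropriate if the readings are levels rather than counts
print(hourly)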

Select groups using slicing based on the group index in pandas DataFrame

I have a DataFrame with users identified by the column 'user_id'. Each of these users has several entries in the DataFrame based on the date on which they did something, which is also a column. The DataFrame looks something like:
df:
user_id date
0 2019-04-13 02:00:00
0 2019-04-13 03:00:00
3 2019-02-18 22:00:00
3 2019-02-18 23:00:00
3 2019-02-19 00:00:00
3 2019-02-19 02:00:00
3 2019-02-19 03:00:00
3 2019-02-19 04:00:00
8 2019-04-05 04:00:00
8 2019-04-05 05:00:00
8 2019-04-05 06:00:00
8 2019-04-05 15:00:00
15 2019-04-28 19:00:00
15 2019-04-28 20:00:00
15 2019-04-29 01:00:00
23 2019-06-24 02:00:00
23 2019-06-24 05:00:00
23 2019-06-24 06:00:00
24 2019-03-27 12:00:00
24 2019-03-27 13:00:00
What I want to do is, for example, select the first 3 users. I wanted to do this with a code like this:
df.groupby('user_id').iloc[:3]
I know that groupby doesn't have an iloc, so how could I achieve the same kind of iloc behaviour on the groups, so that I am able to slice them?
I found a way based on crayxt's answer:
df[df['user_id'].isin(df['user_id'].unique()[:3])]
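An alternative sketch that avoids building the list of ids explicitly, using GroupBy.ngroup() to number the groups (group numbers follow sorted user_id order, which here matches the order of appearance):
# keep only rows belonging to the first three user_id groups
first_three = df[df.groupby('user_id').ngroup() < 3]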

scipy UnivariateSpline always return linear-ish spline when plotting

I have the following set of data (a pandas.DataFrame) which I would like to fit with scipy.interpolate.UnivariateSpline. Let's call the data data.
Date
2018-04-02 09:00:00 16249
2018-04-02 10:00:00 45473
2018-04-02 11:00:00 32050
2018-04-02 12:00:00 35898
2018-04-02 13:00:00 21577
2018-04-02 14:00:00 30545
2018-04-02 15:00:00 60925
2018-04-02 16:00:00 47124
2018-04-03 09:00:00 18534
2018-04-03 10:00:00 36064
2018-04-03 11:00:00 32387
2018-04-03 12:00:00 15903
2018-04-03 13:00:00 22291
2018-04-03 14:00:00 26367
2018-04-03 15:00:00 66269
2018-04-03 16:00:00 38478
2018-04-04 09:00:00 15803
2018-04-04 10:00:00 22511
2018-04-04 11:00:00 33123
2018-04-04 12:00:00 21000
2018-04-04 13:00:00 23132
2018-04-04 14:00:00 39270
2018-04-04 15:00:00 102544
2018-04-04 16:00:00 143421
2018-04-04 17:00:00 200
2018-04-05 09:00:00 23377
2018-04-05 10:00:00 52089
2018-04-05 11:00:00 99298
2018-04-05 12:00:00 24627
2018-04-05 13:00:00 33467
2018-04-05 14:00:00 26498
2018-04-05 15:00:00 114794
2018-04-05 16:00:00 44904
2018-04-06 09:00:00 12180
2018-04-06 10:00:00 41658
2018-04-06 11:00:00 64066
2018-04-06 12:00:00 12517
2018-04-06 13:00:00 12610
2018-04-06 14:00:00 43544
2018-04-06 15:00:00 65533
2018-04-06 16:00:00 123885
2018-04-09 09:00:00 13425
2018-04-09 10:00:00 38354
2018-04-09 11:00:00 59491
2018-04-09 12:00:00 21402
2018-04-09 13:00:00 24550
2018-04-09 14:00:00 25189
2018-04-09 15:00:00 67751
2018-04-09 16:00:00 16071
2018-04-10 09:00:00 35587
2018-04-10 10:00:00 58667
2018-04-10 11:00:00 41831
2018-04-10 12:00:00 35196
2018-04-10 13:00:00 22611
2018-04-10 14:00:00 23070
2018-04-10 15:00:00 40819
2018-04-10 16:00:00 20337
2018-04-11 09:00:00 7962
2018-04-11 10:00:00 23982
2018-04-11 11:00:00 21794
2018-04-11 12:00:00 16835
2018-04-11 13:00:00 16821
2018-04-11 14:00:00 13270
2018-04-11 15:00:00 34954
2018-04-11 16:00:00 15772
2018-04-12 09:00:00 8587
2018-04-12 10:00:00 47950
2018-04-12 11:00:00 24742
2018-04-12 12:00:00 16743
2018-04-12 13:00:00 21917
2018-04-12 14:00:00 43272
2018-04-12 15:00:00 50630
2018-04-12 16:00:00 104656
2018-04-13 09:00:00 15282
2018-04-13 10:00:00 30304
2018-04-13 11:00:00 65737
2018-04-13 12:00:00 17467
2018-04-13 13:00:00 10439
2018-04-13 14:00:00 19836
2018-04-13 15:00:00 52051
2018-04-13 16:00:00 99462
What I have done so far is:
import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate as interp
x = [i for i in range(1, data.size+1)] # this gives x as an array from 1 to 82.
spl = interp.UnivariateSpline(x, data.values, s=0.5)
xx = np.linspace(min(x), max(x), 1000) # 1000 is an arbitrary number here.
plt.plot(x, data.values, 'bo')
plt.plot(xx, spl(xx), 'r')
plt.show()
# the plot is below and it seems to be very linear and does not look like a cubic spline at all. Cubic Spline is the default.
When I plot spl against x, with everything else unchanged, i.e.:
plt.plot(x, spl(x), 'r')
I get the following:
The only difference is that the y axis tops out at 14,000, which seems to mean the previous plot showed some degree of curvature. (Or not?)
I am not sure what I am missing here, but apparently I missed something. I am still very new to spline fitting in Python generally.
Can you tell me how to correctly spline-fit my time series above?
EDIT
Upon your comment, I wanted to add another plot to hopefully explain myself a bit better. I didn't really mean it is linear, but I couldn't find a better word. To illustrate:
xxx = [10, 20, 40, 60, 80]
plt.plot(x, data.values, 'bo')
plt.plot(xxx, spl(xxx), 'r')
plt.show()
I think the plot below looks reasonably linear-ish in my sense. I am guessing my question should probably be: how does scipy.interpolate.UnivariateSpline really work?
Does it only plot the values evaluated at the points we supply (e.g. for this plot, xxx)?
I was expecting a much smoother plot with decent curvature. This question's answer shows a plot that I would expect; it looks more like a plot that piecewise cubic functions would generate, whereas mine looks, to me, and compared to that plot, linear-ish (or first order, if that is more appropriate).
The data set you have looks more like Rexthor, the dog-bearer, than something that a smooth curve can follow. You don't have an issue with SciPy; you have an issue with the data.
By increasing the parameter s you can get progressively smoother plots that deviate further and further from the data, eventually approaching the cubic polynomial that is the "best" least-squares fit for the data. But here "best" means "very bad, probably worthless". A smooth curve can be useful to display a pattern that the data already follows. If the data does not follow a smooth pattern, one should not draw a curve for the sake of drawing. The data points on the first plot should just be presented as is, without any connecting or approximating curves.
The data comes from hourly readings taken from 9:00 to 16:00 (with one stray 17:00 value mixed in; throw it out). This structure matters. Do not pretend that Tuesday 9:00 is what happens one hour after Monday 16:00.
The data can be meaningfully summarized by daily totals
Day Total
2018-04-02 289841
2018-04-03 256293
2018-04-04 401004
2018-04-05 419054
2018-04-06 375993
2018-04-09 266233
2018-04-10 278118
2018-04-11 151390
2018-04-12 318497
2018-04-13 310578
and by hourly averages (average number of events at 9:00, across all days, etc).
Hour Average
9:00:00 16698.6
10:00:00 39705.2
11:00:00 47451.9
12:00:00 21758.8
13:00:00 20941.5
14:00:00 29086.1
15:00:00 65627
16:00:00 65411
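In case it helps, a small sketch of how those two summaries could be produced from the original series, assuming data is the Series with the DatetimeIndex shown in the question:
# drop the stray 17:00 reading first
data = data[data.index.hour != 17]

daily_totals = data.groupby(data.index.date).sum()      # one total per day
hourly_averages = data.groupby(data.index.hour).mean()  # one average per hour of day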
In these things we can maybe observe some pattern. Here is the hourly one:
hourly_averages = np.array([16698.6, 39705.2, 47451.9, 21758.8, 20941.5, 29086.1, 65627, 65411])
hours = np.arange(9, 17)
hourly_s = 0.1*np.diff(hourly_averages).max()**2
hourly_spline = interp.UnivariateSpline(hours, hourly_averages, s=hourly_s)
xx = np.linspace(min(hours), max(hours), 1000) # 1000 is an arbitrary number here.
plt.plot(hours, hourly_averages, 'bo')
plt.plot(xx, hourly_spline(xx), 'r')
plt.show()
The curve shows the lunch break and the end-of-day rush. My choice of s as 0.1*np.diff(hourly_averages).max()**2 is not canonical, but it recognizes the fact that s scales as the square of the residuals (see the documentation). I'll use the same choice for the daily totals:
daily_totals = np.array([289841, 256293, 401004, 419054, 375993, 266233, 278118, 151390, 318497, 310578])
days = np.arange(len(daily_totals))
daily_s = 0.1*np.diff(daily_totals).max()**2
daily_spline = interp.UnivariateSpline(days, daily_totals, s=daily_s)
xx = np.linspace(min(days), max(days), 1000) # 1000 is an arbitrary number here.
plt.plot(days, daily_totals, 'bo')
plt.plot(xx, daily_spline(xx), 'r')
plt.show()
This is less useful. Maybe we need a longer period of observations. Maybe we should not pretend that Monday comes after Friday. Maybe averages should be taken for each day of the week to uncover a weekly pattern, but with only two weeks there is not enough to play with.
Technical details: the method UnivariateSpline chooses as few knots as possible so that a certain weighted sum of squared deviations from the data is at most s. With large s this means very few knots, until none remain and we get a single cubic polynomial. How large s needs to be depends on the amount of oscillation in the vertical direction, which is extremely high in this example.
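To see that limiting behaviour concretely, here is a small check reusing the hours and hourly_averages arrays from above; with a huge s the spline should collapse to essentially the same least-squares cubic that np.polyfit gives.
import numpy as np
import scipy.interpolate as interp

hours = np.arange(9, 17)
hourly_averages = np.array([16698.6, 39705.2, 47451.9, 21758.8,
                            20941.5, 29086.1, 65627, 65411])

# with a very large s no interior knots are kept, leaving one cubic segment
loose_spline = interp.UnivariateSpline(hours, hourly_averages, s=1e12)
cubic = np.polyfit(hours, hourly_averages, 3)

xx = np.linspace(9, 16, 5)
print(loose_spline(xx))
print(np.polyval(cubic, xx))   # the two rows should agree closely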
