How can I standardize time series data? (Python)
I'm working with OHLC trading data, and I have different datasets with very different price ranges. For example, in one dataset the price ranges from 100 to 150, in another from 2 to 3, in another from 0.5 to 0.8, and so on: very different magnitudes.
For each dataset, I loop through the data and, at each point, compute the slope over the last five prices using np.polyfit().
Here is my code:
x = df['Date'].to_numpy()
y = df['Close'].to_numpy()
fits = []
for idx, j in enumerate(y):
    arr_y = y[:idx]
    arr_x = x[:idx]
    p_y = arr_y[-5:]  # last five closes before the current point
    p_x = arr_x[-5:]
    if len(p_y) >= 4 and len(p_x) >= 4:
        fit = np.polyfit(p_x, p_y, 1)
        ang_coeff = fit[0]  # slope
        intercept = fit[1]
        fits.append(ang_coeff)
    else:
        fits.append(np.nan)
df['SLOPE'] = fits
Here is what the code does: it loops through the prices and, for each price, calculates the slope based on the last five prices.
This code works well, but since I'm working with multiple datasets whose prices differ a lot in magnitude, it becomes hard to perform any kind of cross-dataset analysis: a slope value that is very high in one dataset would be very low in another. My question is: how can I standardize or normalize (I know they are two different things) this data? How can I process my slope values so that a "high" slope value in one dataset is also high in another dataset?
Here is a sample of my outputs:
Date Close Slope
2021-01-17 00:00:00 34031.098338 29.572362
2021-01-17 04:00:00 34034.475090 20.097445
2021-01-17 08:00:00 34034.982351 8.655060
2021-01-17 12:00:00 34044.665386 3.914707
2021-01-17 16:00:00 34049.372571 4.538112
2021-01-17 20:00:00 34059.458965 4.673876
2021-01-18 00:00:00 34063.656831 6.435797
2021-01-18 04:00:00 34070.819559 7.214254
2021-01-18 08:00:00 34086.331298 6.659261
2021-01-18 12:00:00 34099.272005 8.527805
2021-01-18 16:00:00 34099.560423 10.230055
2021-01-18 20:00:00 34106.109568 10.025963
2021-01-19 00:00:00 34110.932662 8.380914
2021-01-19 04:00:00 34122.312205 5.604029
2021-01-19 08:00:00 34134.855812 5.745264
2021-01-19 12:00:00 34162.275141 8.679342
2021-01-19 16:00:00 34190.550778 13.625430
2021-01-19 20:00:00 34211.505419 19.919917
2021-01-20 00:00:00 34222.969489 23.408140
2021-01-20 04:00:00 34237.699255 22.545763
2021-01-20 08:00:00 34240.094551 18.326694
2021-01-20 12:00:00 34239.827609 12.528138
2021-01-20 16:00:00 34239.900596 7.376944
2021-01-20 20:00:00 34246.295214 3.599057
2021-01-21 00:00:00 34248.790292 1.699797
2021-01-21 04:00:00 34251.656251 2.385909
2021-01-21 08:00:00 34211.135875 3.254698
2021-01-21 12:00:00 34150.903010 -5.216841
2021-01-21 16:00:00 34127.857586 -22.843883
2021-01-21 20:00:00 34072.463679 -34.261865
2021-01-22 00:00:00 34018.425804 -44.166343
2021-01-22 04:00:00 33974.399053 -46.385947
2021-01-22 08:00:00 33946.475779 -46.243970
2021-01-22 12:00:00 33929.852159 -46.082824
2021-01-22 16:00:00 33927.598892 -35.717306
2021-01-22 20:00:00 33918.627401 -22.620072
2021-01-23 00:00:00 33905.044709 -13.042019
2021-01-23 04:00:00 33894.973038 -9.408690
2021-01-23 08:00:00 33861.417022 -9.231243
And a different dataset:
Date Close Slope
2021-02-18 04:00:00 0.492204 4.013722e-04
2021-02-18 08:00:00 0.492488 4.721365e-04
2021-02-18 12:00:00 0.493027 4.831912e-04
2021-02-18 16:00:00 0.493569 4.591663e-04
2021-02-18 20:00:00 0.494286 4.463141e-04
2021-02-19 00:00:00 0.494799 5.245110e-04
2021-02-19 04:00:00 0.495515 5.880476e-04
2021-02-19 08:00:00 0.496172 6.204948e-04
2021-02-19 12:00:00 0.496634 6.435782e-04
2021-02-19 16:00:00 0.497133 6.069365e-04
2021-02-19 20:00:00 0.497526 5.787601e-04
2021-02-20 00:00:00 0.497712 4.983345e-04
2021-02-20 04:00:00 0.497762 3.972312e-04
2021-02-20 08:00:00 0.497956 2.835458e-04
2021-02-20 12:00:00 0.498307 1.880521e-04
2021-02-20 16:00:00 0.498692 1.804976e-04
2021-02-20 20:00:00 0.498813 2.505608e-04
2021-02-21 00:00:00 0.499153 2.839021e-04
2021-02-21 04:00:00 0.499364 2.901245e-04
2021-02-21 08:00:00 0.499471 2.574213e-04
2021-02-21 12:00:00 0.499556 2.107408e-04
2021-02-21 16:00:00 0.499902 1.803125e-04
2021-02-21 20:00:00 0.500177 1.690260e-04
2021-02-22 00:00:00 0.500221 2.059057e-04
2021-02-22 04:00:00 0.501403 2.121462e-04
2021-02-22 08:00:00 0.502194 4.012434e-04
2021-02-22 12:00:00 0.502318 5.809102e-04
2021-02-22 16:00:00 0.502852 6.255775e-04
2021-02-22 20:00:00 0.503182 6.177676e-04
2021-02-23 00:00:00 0.503209 4.214821e-04
2021-02-23 04:00:00 0.503271 2.893487e-04
2021-02-23 08:00:00 0.502459 2.262497e-04
2021-02-23 12:00:00 0.502190 -6.951268e-05
2021-02-23 16:00:00 0.501697 -2.733434e-04
2021-02-23 20:00:00 0.501526 -4.105911e-04
2021-02-24 00:00:00 0.501506 -4.251799e-04
2021-02-24 04:00:00 0.501420 -2.571382e-04
2021-02-24 08:00:00 0.501332 -1.730550e-04
2021-02-24 12:00:00 0.501099 -8.359633e-05
2021-02-24 16:00:00 0.500684 -1.027447e-04
2021-02-24 20:00:00 0.500341 -1.962963e-04
2021-02-25 00:00:00 0.500027 -2.806065e-04
2021-02-25 04:00:00 0.499747 -3.368647e-04
2021-02-25 08:00:00 0.499428 -3.361539e-04
2021-02-25 12:00:00 0.499212 -3.105732e-04
2021-02-25 16:00:00 0.498883 -2.857117e-04
So these two datasets have very different Close values, which means the slope values are on completely different scales: a very "high" slope value in the second dataset is nothing compared to the first dataset's slope values. Is there any way I can solve this? Do I have to apply some sort of normalization or standardization, or do I need to use a different kind of calculation or metric? Thanks in advance!
The Close values can be scaled using sklearn's MinMaxScaler(). You can also simplify the polyfit loop by using Rolling.apply() with a window size of 5:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
for df in [df1, df2]:
    df['Close'] = scaler.fit_transform(df['Close'].to_numpy().reshape(-1, 1))
    df['Slope'] = df['Close'].rolling(5, center=True).apply(lambda x: np.polyfit(x.index, x, 1)[0])
>>> df1
Date Close Slope
0 2021-01-17 00:00:00 0.434814 NaN
1 2021-01-17 04:00:00 0.443467 NaN
2 2021-01-17 08:00:00 0.444766 0.011977
3 2021-01-17 12:00:00 0.469580 0.016492
4 2021-01-17 16:00:00 0.481642 0.018487
...
34 2021-01-22 16:00:00 0.169593 -0.024110
35 2021-01-22 20:00:00 0.146603 -0.023655
36 2021-01-23 00:00:00 0.111797 -0.039980
37 2021-01-23 04:00:00 0.085988 NaN
38 2021-01-23 08:00:00 0.000000 NaN
>>> df2
Date Close Slope
0 2021-02-18 04:00:00 0.000000 NaN
1 2021-02-18 08:00:00 0.025662 NaN
2 2021-02-18 12:00:00 0.074365 0.047393
3 2021-02-18 16:00:00 0.123340 0.053140
4 2021-02-18 20:00:00 0.188127 0.056077
...
41 2021-02-25 00:00:00 0.706876 -0.028065
42 2021-02-25 04:00:00 0.681576 -0.025815
43 2021-02-25 08:00:00 0.652751 -0.025508
44 2021-02-25 12:00:00 0.633234 NaN
45 2021-02-25 16:00:00 0.603506 NaN
I recommend adjusting the scale by first calculating the Average True Range (ATR, see https://www.investopedia.com/terms/a/atr.asp) of one of the datasets and working out a reasonable scale that yields a representative slope for it. Then, for the other datasets, calculate the ratio of their ATR to the "standard" dataset's ATR and adjust their slopes by that ratio.
For example, if a new dataset has an ATR that is only a tenth of your "standard" ATR, multiply its slope measurements by 10 to put them on the same scale.
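A minimal sketch of this ATR-ratio idea, assuming each DataFrame has 'High', 'Low' and 'Close' columns and already has the 'SLOPE' column from the question; the 14-period window, the column names, and the use of the last ATR value are assumptions, not part of the answer above:

import pandas as pd

def average_true_range(df, period=14):
    # True range: the largest of (high - low), |high - previous close|, |low - previous close|
    prev_close = df['Close'].shift(1)
    true_range = pd.concat([
        df['High'] - df['Low'],
        (df['High'] - prev_close).abs(),
        (df['Low'] - prev_close).abs(),
    ], axis=1).max(axis=1)
    return true_range.rolling(period).mean()

# Treat df1 as the "standard" dataset and rescale df2's slopes by the ATR ratio.
reference_atr = average_true_range(df1).iloc[-1]
other_atr = average_true_range(df2).iloc[-1]
df2['SLOPE_SCALED'] = df2['SLOPE'] * (reference_atr / other_atr)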
I recommend unit length scaling (scaling to unit length) or unit normal scaling (standardization) if you want the series to keep their statistical properties while becoming scale-free. It doesn't matter which one you use, since you're only looking at slopes, and the fitted slopes from the two methods are identical (Montgomery et al., section 3.9).
Essentially, for unit normal scaling, take the z-score of all of your regressors and the response variable, then fit the transformed data without an intercept. For unit length scaling, take the mean-deviated regressor and response values and divide them by the square root of their corrected sums of squares.
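A minimal sketch of applying unit normal scaling to this problem: standardize each dataset's Close series (z-score) and then compute the same rolling five-point slope on the standardized values. The names df1, df2 and 'SLOPE_Z' are placeholders and the per-dataset standardization is my assumption, not part of the answer above:

import numpy as np

def zscore(values):
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

for df in [df1, df2]:
    close_z = zscore(df['Close'])  # standardized Close: mean 0, standard deviation 1
    x = np.arange(len(close_z))
    slopes = [np.nan] * 4 + [
        np.polyfit(x[i - 4:i + 1], close_z[i - 4:i + 1], 1)[0]  # slope over the five points ending at i
        for i in range(4, len(close_z))
    ]
    df['SLOPE_Z'] = slopes

Because each standardized series has unit variance, a slope of a given size then means roughly the same thing in both datasets.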
There are other methods you can try. They fall under the heading of feature scaling and include min-max normalization and mean normalization (Wikipedia, 2021).
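For instance, mean normalization could be applied to Close before the slope calculation; this one-liner is only an illustration and 'Close_mn' is a made-up column name:

# Mean normalization: (x - mean(x)) / (max(x) - min(x)), giving values roughly in the [-1, 1] range
df['Close_mn'] = (df['Close'] - df['Close'].mean()) / (df['Close'].max() - df['Close'].min())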
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis (5th ed.). John Wiley & Sons, Inc.
Related
Efficiently slicing non-integer multilevel indexes with integers in Pandas
The following code generates a sample DataFrame with a multilevel index. The first level is a string, the second level is a datetime.

Script

import pandas as pd
from datetime import datetime
import random

df = pd.DataFrame(columns=['network','time','active_clients','throughput','speed'])
networks = ['ALPHA','BETA','GAMMA']
times = pd.date_range(datetime.strptime('2021-01-01 00:00:00','%Y-%m-%d %H:%M:%S'),
                      datetime.strptime('2021-01-01 12:00:00','%Y-%m-%d %H:%M:%S'), 7).tolist()
for n in networks:
    for t in times:
        df = df.append({'network':n,'time':t,'active_clients':random.randint(10,30),
                        'throughput':random.randint(1500,5000),'speed':random.randint(10000,12000)},
                       ignore_index=True)
df.set_index(['network','time'],inplace=True)
print(df.to_string())

Output

                             active_clients  throughput  speed
network time
ALPHA   2021-01-01 00:00:00              16        4044  11023
        2021-01-01 02:00:00              17        2966  10933
        2021-01-01 04:00:00              10        4649  11981
        2021-01-01 06:00:00              23        3629  10113
        2021-01-01 08:00:00              30        2520  11159
        2021-01-01 10:00:00              10        4200  11309
        2021-01-01 12:00:00              16        3878  11366
BETA    2021-01-01 00:00:00              17        3073  11798
        2021-01-01 02:00:00              20        1941  10640
        2021-01-01 04:00:00              17        1980  11869
        2021-01-01 06:00:00              23        3346  10002
        2021-01-01 08:00:00              10        1952  10063
        2021-01-01 10:00:00              28        3788  11047
        2021-01-01 12:00:00              24        4993  10487
GAMMA   2021-01-01 00:00:00              21        4366  11587
        2021-01-01 02:00:00              22        3404  11669
        2021-01-01 04:00:00              20        1608  10344
        2021-01-01 06:00:00              28        1849  10278
        2021-01-01 08:00:00              14        3229  11925
        2021-01-01 10:00:00              21        3408  10411
        2021-01-01 12:00:00              12        1799  10492

For each item in the first level, I want to select the last three records in the second level. The catch is that I don't know the datetime values, so I need to select by integer-based index location instead. What's the most efficient way of slicing the DataFrame to achieve the following?

Desired output

                             active_clients  throughput  speed
network time
ALPHA   2021-01-01 08:00:00              30        2520  11159
        2021-01-01 10:00:00              10        4200  11309
        2021-01-01 12:00:00              16        3878  11366
BETA    2021-01-01 08:00:00              10        1952  10063
        2021-01-01 10:00:00              28        3788  11047
        2021-01-01 12:00:00              24        4993  10487
GAMMA   2021-01-01 08:00:00              14        3229  11925
        2021-01-01 10:00:00              21        3408  10411
        2021-01-01 12:00:00              12        1799  10492

My attempts

Returns the full dataframe:

df_sel = df.iloc[:,-3:]

Raises an error because loc doesn't support using integer values on datetime objects:

df_sel = df.loc[:,-3:]

Returns the last three entries in the second level, but only for the last entry in the first level:

df_sel = df.loc[:].iloc[-3:]
I have two methods to solve this problem.

Method 1: As mentioned in the first comment, from Quang Hoang, you can use groupby, which I believe gives the shortest code:

df.groupby(level=0).tail(3)

Method 2: You can also slice each item in networks and then concat them:

pd.concat([df.loc[[i]][-3:] for i in networks])

Both of these methods produce the result you want.
Another method is to do some reshaping:

df.unstack(0).iloc[-3:].stack().swaplevel(0,1).sort_index()

Output:

                             active_clients  throughput  speed
network time
ALPHA   2021-01-01 08:00:00              26        4081  11325
        2021-01-01 10:00:00              13        3370  10716
        2021-01-01 12:00:00              13        3691  10737
BETA    2021-01-01 08:00:00              28        2105  10465
        2021-01-01 10:00:00              21        2444  10158
        2021-01-01 12:00:00              24        1947  11226
GAMMA   2021-01-01 08:00:00              13        1850  10288
        2021-01-01 10:00:00              23        2241  11521
        2021-01-01 12:00:00              30        3515  11138

Details:
- unstack the outermost index level, level=0
- use iloc to select the last three records in the dataframe
- stack that level back into the index
- swaplevel and sort_index
Pandas, insert datetime values that increase one hour for each row
I made predictions with an ARIMA model that predicts the next 168 hours (one week) of cars on the road. I also want to add a column called "datetime" that starts at 00:00 01-01-2021 and increases by one hour for each row. Is there an intelligent way of doing this?
You can do:

x = pd.to_datetime('2021-01-01 00:00')
y = pd.to_datetime('2021-01-07 23:59')
pd.Series(pd.date_range(x, y, freq='H'))

Output:

0     2021-01-01 00:00:00
1     2021-01-01 01:00:00
2     2021-01-01 02:00:00
3     2021-01-01 03:00:00
4     2021-01-01 04:00:00
              ...
163   2021-01-07 19:00:00
164   2021-01-07 20:00:00
165   2021-01-07 21:00:00
166   2021-01-07 22:00:00
167   2021-01-07 23:00:00
Length: 168, dtype: datetime64[ns]
How to extract hourly data from a df in python?
I have the following df:

dates                 Final
2020-01-01 00:15:00    94.7
2020-01-01 00:30:00    94.1
2020-01-01 00:45:00    94.1
2020-01-01 01:00:00    95.0
2020-01-01 01:15:00    96.6
2020-01-01 01:30:00    98.4
2020-01-01 01:45:00    99.8
2020-01-01 02:00:00    99.8
2020-01-01 02:15:00    98.0
2020-01-01 02:30:00    95.1
2020-01-01 02:45:00    91.9
2020-01-01 03:00:00    89.5

The entire dataset runs until 2021-01-01 00:00:00 (value 95.6), with a gap of 15 minutes between rows. Since the frequency is 15 minutes, I would like to change it to 1 hour and maybe drop the intermediate values.

Expected output:

dates                 Final
2020-01-01 01:00:00    95.0
2020-01-01 02:00:00    99.8
2020-01-01 03:00:00    89.5

with the last row being 2021-01-01 00:00:00, 95.6. How can this be done? Thanks
Use Series.dt.minute to perform boolean indexing:

df_filtered = df.loc[df['dates'].dt.minute.eq(0)]
#if necessary
#df_filtered = df.loc[pd.to_datetime(df['dates']).dt.minute.eq(0)]
print(df_filtered)

                 dates  Final
3  2020-01-01 01:00:00   95.0
7  2020-01-01 02:00:00   99.8
11 2020-01-01 03:00:00   89.5
If you're doing data analysis or data science, I don't think dropping the intermediate values is a good approach at all! You should probably sum them instead (I don't know your use case, but I know some things about time series data).
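A minimal sketch of that alternative using pandas resampling, assuming 'dates' is (or can be converted to) a datetime column; whether to sum or average depends on what 'Final' actually measures:

import pandas as pd

df['dates'] = pd.to_datetime(df['dates'])
hourly = df.set_index('dates')['Final'].resample('H').sum()  # or .mean(), depending on the metric
print(hourly.head())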
Select groups using slicing based on the group index in pandas DataFrame
I have a DataFrame with users indicated by the column 'user_id'. Each of these users has several entries in the dataframe based on the date on which they did something, which is also a column. The dataframe looks something like this:

user_id  date
0        2019-04-13 02:00:00
0        2019-04-13 03:00:00
3        2019-02-18 22:00:00
3        2019-02-18 23:00:00
3        2019-02-19 00:00:00
3        2019-02-19 02:00:00
3        2019-02-19 03:00:00
3        2019-02-19 04:00:00
8        2019-04-05 04:00:00
8        2019-04-05 05:00:00
8        2019-04-05 06:00:00
8        2019-04-05 15:00:00
15       2019-04-28 19:00:00
15       2019-04-28 20:00:00
15       2019-04-29 01:00:00
23       2019-06-24 02:00:00
23       2019-06-24 05:00:00
23       2019-06-24 06:00:00
24       2019-03-27 12:00:00
24       2019-03-27 13:00:00

What I want to do is, for example, select the first 3 users. I wanted to do this with code like this:

df.groupby('user_id').iloc[:3]

I know that groupby doesn't have an iloc, so how can I achieve the same thing as an iloc on the groups, so that I am able to slice them?
I found a way based on crayxt's answer:

df[df['user_id'].isin(df['user_id'].unique()[:3])]
scipy UnivariateSpline always return linear-ish spline when plotting
I have following set of data (pandas.DataFrame) which I would like to use scipy.interpolate.UnivariateSpline to fit. Let's call the data data. Date 2018-04-02 09:00:00 16249 2018-04-02 10:00:00 45473 2018-04-02 11:00:00 32050 2018-04-02 12:00:00 35898 2018-04-02 13:00:00 21577 2018-04-02 14:00:00 30545 2018-04-02 15:00:00 60925 2018-04-02 16:00:00 47124 2018-04-03 09:00:00 18534 2018-04-03 10:00:00 36064 2018-04-03 11:00:00 32387 2018-04-03 12:00:00 15903 2018-04-03 13:00:00 22291 2018-04-03 14:00:00 26367 2018-04-03 15:00:00 66269 2018-04-03 16:00:00 38478 2018-04-04 09:00:00 15803 2018-04-04 10:00:00 22511 2018-04-04 11:00:00 33123 2018-04-04 12:00:00 21000 2018-04-04 13:00:00 23132 2018-04-04 14:00:00 39270 2018-04-04 15:00:00 102544 2018-04-04 16:00:00 143421 2018-04-04 17:00:00 200 2018-04-05 09:00:00 23377 2018-04-05 10:00:00 52089 2018-04-05 11:00:00 99298 2018-04-05 12:00:00 24627 2018-04-05 13:00:00 33467 2018-04-05 14:00:00 26498 2018-04-05 15:00:00 114794 2018-04-05 16:00:00 44904 2018-04-06 09:00:00 12180 2018-04-06 10:00:00 41658 2018-04-06 11:00:00 64066 2018-04-06 12:00:00 12517 2018-04-06 13:00:00 12610 2018-04-06 14:00:00 43544 2018-04-06 15:00:00 65533 2018-04-06 16:00:00 123885 2018-04-09 09:00:00 13425 2018-04-09 10:00:00 38354 2018-04-09 11:00:00 59491 2018-04-09 12:00:00 21402 2018-04-09 13:00:00 24550 2018-04-09 14:00:00 25189 2018-04-09 15:00:00 67751 2018-04-09 16:00:00 16071 2018-04-10 09:00:00 35587 2018-04-10 10:00:00 58667 2018-04-10 11:00:00 41831 2018-04-10 12:00:00 35196 2018-04-10 13:00:00 22611 2018-04-10 14:00:00 23070 2018-04-10 15:00:00 40819 2018-04-10 16:00:00 20337 2018-04-11 09:00:00 7962 2018-04-11 10:00:00 23982 2018-04-11 11:00:00 21794 2018-04-11 12:00:00 16835 2018-04-11 13:00:00 16821 2018-04-11 14:00:00 13270 2018-04-11 15:00:00 34954 2018-04-11 16:00:00 15772 2018-04-12 09:00:00 8587 2018-04-12 10:00:00 47950 2018-04-12 11:00:00 24742 2018-04-12 12:00:00 16743 2018-04-12 13:00:00 21917 2018-04-12 14:00:00 43272 2018-04-12 15:00:00 50630 2018-04-12 16:00:00 104656 2018-04-13 09:00:00 15282 2018-04-13 10:00:00 30304 2018-04-13 11:00:00 65737 2018-04-13 12:00:00 17467 2018-04-13 13:00:00 10439 2018-04-13 14:00:00 19836 2018-04-13 15:00:00 52051 2018-04-13 16:00:00 99462 what I have done so far is: import matplotlib.pyplot as plt import numpy as np import scipy.interpolate as interp x = [i for i in range(1, data.size+1)] # this gives x as an array from 1 to 82. spl = interp.UnivariateSpline(x, data.values, s=0.5) xx = np.linspace(min(x), max(x), 1000) # 1000 is an arbitrary number here. plt.plot(x, data.values, 'bo') plt.plot(xx, spl(xx), 'r') plt.show() # the plot is below and it seems to be very linear and does not look like a cubic spline at all. Cubic Spline is the default. when I run spl against x, others remain unchanged, which is: plt.plot(x, spl(x), 'r') I get following: the only different is the y axis is topped at 14,000, which seems to mean the previous plot showed some degree of curvature. (or not?) I am not sure what I am missing here but I apparently missed something. I am still very new to spline fitting in python generally. can you tell me how I can correctly spline fit my time series above? EDIT upon comment from you, I wanted to add another plot to hopefully explain myself a bit better. I didn't really mean it is linear but I couldn't find a better word. To illustrate, xxx = [10,20,40,60,80] plt(x, data.values, 'bo') plt(xx, sp(xx), 'r') plt.show() I think below plot looks reasonably linear-ish in my sense. 
I am guessing my question should probably be: how does scipy.interpolate.UnivariateSpline really work? Does it only show the plot for the values evaluated at the points we supplied (e.g. for this plot, xxx)? I was expecting a much smoother plot with decent curvature. This question's answer shows a plot of the kind I would expect; it looks more like a plot that piecewise cubic functions would generate, whereas mine looks, to me, and compared to that plot, linear-ish (or first order, if that is more appropriate).
The data set you have looks more like Rexthor, the dog-bearer than something a smooth curve can follow. You don't have an issue with SciPy; you have an issue with data.

By increasing the parameter s you can get progressively smoother plots that deviate further and further from the data, eventually approaching the cubic polynomial that is the "best" least-squares fit for the data. But here "best" means "very bad, probably worthless". A smooth curve can be useful to display a pattern that the data already follows. If the data does not follow a smooth pattern, one should not draw a curve for the sake of drawing. The data points on the first plot should just be presented as is, without any connecting or approximating curves.

The data comes from hourly readings taken from 9:00 to 16:00 (with one stray 17:00 value mixed in - throw it out). This structure matters. Do not pretend that Tuesday 9:00 is what happens one hour after Monday 16:00. The data can be meaningfully summarized by daily totals

Day          Total
2018-04-02  289841
2018-04-03  256293
2018-04-04  401004
2018-04-05  419054
2018-04-06  375993
2018-04-09  266233
2018-04-10  278118
2018-04-11  151390
2018-04-12  318497
2018-04-13  310578

and by hourly averages (the average number of events at 9:00 across all days, and so on).

Hour      Average
9:00:00   16698.6
10:00:00  39705.2
11:00:00  47451.9
12:00:00  21758.8
13:00:00  20941.5
14:00:00  29086.1
15:00:00   65627
16:00:00   65411

In these things we can maybe observe some pattern. Here is the hourly one:

hourly_averages = np.array([16698.6, 39705.2, 47451.9, 21758.8, 20941.5, 29086.1, 65627, 65411])
hours = np.arange(9, 17)
hourly_s = 0.1*np.diff(hourly_averages).max()**2
hourly_spline = interp.UnivariateSpline(hours, hourly_averages, s=hourly_s)
xx = np.linspace(min(hours), max(hours), 1000)  # 1000 is an arbitrary number here.
plt.plot(hours, hourly_averages, 'bo')
plt.plot(xx, hourly_spline(xx), 'r')
plt.show()

The curve shows the lunch break and the end-of-day rush. My choice of s as 0.1*np.diff(hourly_averages).max()**2 is not canonical, but it recognizes the fact that s scales as the square of the residuals (documentation). I'll use the same choice for the daily totals:

daily_totals = np.array([289841, 256293, 401004, 419054, 375993, 266233, 278118, 151390, 318497, 310578])
days = np.arange(len(daily_totals))
daily_s = 0.1*np.diff(daily_totals).max()**2
daily_spline = interp.UnivariateSpline(days, daily_totals, s=daily_s)
xx = np.linspace(min(days), max(days), 1000)  # 1000 is an arbitrary number here.
plt.plot(days, daily_totals, 'bo')
plt.plot(xx, daily_spline(xx), 'r')
plt.show()

This is less useful. Maybe we need a longer period of observations. Maybe we should not pretend that Monday comes after Friday. Maybe averages should be taken for each day of the week to uncover a weekly pattern, but with only two weeks there is not enough to play with.

Technical details: the method UnivariateSpline chooses as few knots as possible so that a certain weighted sum of squared deviations from the data is at most s. With large s this will mean very few knots, until none remain and we get a single cubic polynomial. How large s needs to be depends on the amount of oscillation in the vertical direction, which is extremely high in this example.