STL decomposition Python - graph is plotted, values are N/A - python

I have a time series sampled at 1-hour intervals, which I'm trying to decompose with a weekly seasonality.
Time Total_request
2018-04-09 22:00:00 1019656
2018-04-09 23:00:00 961867
2018-04-10 00:00:00 881291
2018-04-10 01:00:00 892974
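import pandas as pd
import statsmodels.api as sm  # sm.tsa is exposed through statsmodels.api
d.reset_index(inplace=True)
d['env_time'] = pd.to_datetime(d['env_time'])
d = d.set_index('env_time')
# one week of hourly samples per seasonal cycle; newer statsmodels releases name this argument period
s = sm.tsa.seasonal_decompose(d.total_request, freq=24*7)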
This gives me plots of the seasonal, trend, and residual components: https://imgur.com/a/CjhWphO
But when I try to extract the residual values with s.resid, I get this:
env_time
2018-04-09 20:00:00 NaN
2018-04-09 21:00:00 NaN
2018-04-09 22:00:00 NaN
I get values when I modify it to a lower frequency. What's strange is that I can't retrieve the values even though they are being plotted. I have found similar questions, but none of the answers were relevant to this case.
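One possible explanation, offered as a sketch rather than a confirmed diagnosis of this data: by default seasonal_decompose estimates the trend with a centered moving average, so the first and last half period of the trend and residual are NaN, while the interior values exist and are what the plot draws. Printing s.resid shows only the head and tail of the series, which fall inside those NaN edges. The snippet below imitates the effect with a plain pandas rolling mean; the series is made up, and the rolling window is a stand-in for the statsmodels filter, not its exact implementation.

```python
import numpy as np
import pandas as pd

period = 24 * 7  # one week of hourly samples

# Made-up hourly series spanning four weeks (not the asker's data).
idx = pd.date_range("2018-04-09 20:00", periods=4 * period, freq="h")
y = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# A centered window needs period // 2 neighbours on each side, so the
# first and last period // 2 positions cannot be computed and stay NaN.
trend = y.rolling(window=period + 1, center=True).mean()

print(trend.head(3))           # NaN at the start of the series
print(trend.isna().sum())      # number of NaN positions at the two edges
print(trend.dropna().head(3))  # interior values, which a plot would draw
```

If this is what is happening, s.resid.dropna() should show the values that appear in the plot, and a shorter period leaves fewer NaN rows at each end.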

Related

How to match time series in python?

I have two high frequency time series of 3 months worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series by discarding the extra data, in order to run some regression analysis?
You can use the combine_first method of a pandas Series. Where both series contain the same index label, it keeps the element of the calling object; labels found in only one series are taken from that series.
The following code shows a minimal example:
import pandas as pd
idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)  # sized from the index, since ts1 does not exist yet
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)  # called on the Series, not on the index
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
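Since the question asks to discard the timestamps that only one series has, an inner join on the index may be closer to what is needed than combine_first, which keeps the union. A minimal sketch, assuming both series carry a DatetimeIndex; the data and the names ts1 and ts2 are made up for illustration:

```python
import pandas as pd

# Made-up hourly series with partly overlapping time ranges.
idx1 = pd.date_range("2018-01-01 00:00", periods=5, freq="h")
idx2 = pd.date_range("2018-01-01 01:00", periods=5, freq="h")
ts1 = pd.Series(range(len(idx1)), index=idx1, name="ts1")
ts2 = pd.Series(range(10, 10 + len(idx2)), index=idx2, name="ts2")

# join="inner" keeps only the timestamps present in both series.
aligned = pd.concat([ts1, ts2], axis=1, join="inner")
print(aligned)
```

The result has one column per series and only the four shared timestamps, which is the shape a regression routine usually expects.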

Custom resample function: only sample similar values hourly - Irregular time series

I am quite new to the game and can't seem to find an answer to my problem online.
I have a somewhat irregular time series in Python (I mostly use Pandas to work with it), which has a datetime index (roughly every 15 minutes) and multiple columns with values. I know that those values change approximately every hour, but the changes don't quite line up with the index I have. It looks something like this:
Values
2019-08-27 02:15:00 91.45
2019-08-27 02:30:00 91.44
2019-08-27 02:45:00 91.44
2019-08-27 03:00:00 91.43
2019-08-27 03:15:00 91.43
2019-08-27 03:30:00 91.43
2019-08-27 03:45:00 91.42
This is just an example, but one can see that the values change at random times (:15, :45, :00). Even though they should change every hour, sometimes a value covers only two 15-minute intervals, so I can't just say: take a group of 4 values and resample them to one hour.
So my idea was to use if and else to create something like this:
if a value is the same as the next one: resample those to an hour
else: add one hour to the resampled index.
How could I accomplish that in Python, and does my idea even make sense?
Thanks in advance for any kind of help!
You can use resample from pandas.
Example:
import pandas as pd
index = pd.date_range('2019-08-27 02:15:00', periods=30, freq='15min')
series = pd.Series(range(30), index=index)
series.resample('H').mean()  # hourly bins; '15min' would keep the original spacing
2019-08-27 02:00:00 1.0
2019-08-27 03:00:00 4.5
2019-08-27 04:00:00 8.5
2019-08-27 05:00:00 12.5
2019-08-27 06:00:00 16.5
2019-08-27 07:00:00 20.5
2019-08-27 08:00:00 24.5
2019-08-27 09:00:00 28.0
Freq: H, dtype: float64
Pandas is not Python.
When you use plain Python, you have a simple and nice procedural language and you iterate over values in containers. When you use Pandas, you should try hard to avoid explicit Python-level loops. The rationale is that Pandas (and numpy for the underlying containers) uses optimized C code, so you gain a lot by using pandas and numpy tools (this is called vectorization).
Here, what you want already exists in Pandas and is called resample.
In your example, and provided the index is a true DatetimeIndex (*), you just do:
df2 = df.resample('1H').mean()
It gives:
Values
2019-08-27 02:00:00 91.443333
2019-08-27 03:00:00 91.427500
(*) If not, convert it first with: df.index = pd.to_datetime(df.index)
From your edit, I think that you want to get one value from each period. A possible way would be to take the most frequent value in the interval from H-15min to H+30min.
You could then use:
# base and loffset are resample arguments from older pandas releases
pd.DataFrame(df.resample('60T', base=45, loffset=pd.Timedelta(minutes=15)).agg(
    lambda x: x['Values'].value_counts().index[0]).rename('Values'))
This gives:
Values
2019-08-27 02:00:00 91.45
2019-08-27 03:00:00 91.43
2019-08-27 04:00:00 91.42
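On recent pandas versions, where resample no longer accepts base and loffset, the same binning can be sketched with the offset argument plus a manual shift of the labels. This is one possible equivalent, not the answer author's code; the sample values are copied from the question.

```python
import pandas as pd

# Sample values copied from the question.
idx = pd.to_datetime([
    "2019-08-27 02:15", "2019-08-27 02:30", "2019-08-27 02:45",
    "2019-08-27 03:00", "2019-08-27 03:15", "2019-08-27 03:30",
    "2019-08-27 03:45",
])
df = pd.DataFrame(
    {"Values": [91.45, 91.44, 91.44, 91.43, 91.43, 91.43, 91.42]}, index=idx
)

# Hourly bins whose edges fall on :45, i.e. [H-15min, H+45min).
binned = df["Values"].resample("60min", offset="45min")

# Most frequent value per bin; assumes no bin is empty, since
# value_counts() on an empty bin has no first element.
result = binned.agg(lambda x: x.value_counts().index[0])

# Bins are labelled by their left edge (H-15min); shift the labels to H.
result.index = result.index + pd.Timedelta(minutes=15)
print(result)
```

When two values are equally frequent inside a bin, which one value_counts lists first is not something to rely on, so a tie-breaking rule may be worth adding for real data.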

Calculate the sum between the fixed time range using Pandas

My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum over a specified time range and save it to a new dataframe.
For example, I want the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00, a 6-hour window that spans two days. I want to compute this sum for every such window (10 PM to 4 AM the next day) and put the results in a different data frame, for example df_timerange_sum. Please note that each window covers two different dates.
What did I do?
I used sum() with a time-range filter like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum of the whole df rather than of the time range I specified.
I also tried resample, but I could not find a way to restrict it to specific times.
df['time'].dt.hour.between(10, 4) is always False because no number is larger than 10 and smaller than 4 at the same time. What you want is to mark between(4,21) and then negate that to get the other hours.
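The point about between can be checked on a single made-up day of hourly timestamps. This is only an illustration of the mask logic, with hypothetical variable names:

```python
import pandas as pd

# One made-up day of hourly timestamps.
hours = pd.Series(pd.date_range("2017-01-01", periods=24, freq="h")).dt.hour

# between(10, 4) asks for 10 <= hour <= 4, which no hour satisfies.
print(hours.between(10, 4).any())  # False

# Mark 04:00 to 21:00 (both ends included) and negate to get the night hours.
night = ~hours.between(4, 21)
print(hours[night].tolist())       # [0, 1, 2, 3, 22, 23]
```

Note that between includes both endpoints, so the negated mask covers 22:00 through 03:00 and leaves out the 04:00 row.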
Here's what I would do:
# mark the hours from 4 AM up to (but not including) 10 PM;
# the data we want is where s == False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)
# s.cumsum() stays constant across each consecutive run of False,
# so it labels the blocks on which we will take the sum
blocks = s.cumsum()
# again, we only care about ~s
(df[~s].groupby(blocks[~s], as_index=False)  # we don't need the blocks as index
       .agg({'time': 'min', 'Open': 'sum'})  # time: min -- select the beginning of each block
)                                            # Open: sum -- compute the sum of Open
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to reduce the code, but I am also relatively new to pandas:
df.set_index('time', inplace=True)  # make time the index
starts = df.index[df.index.hour == 22]  # all window start datetimes (10 PM)
# each window ends at 4 AM the next day, 6 hours after it starts
df2 = pd.DataFrame({'start_time': starts, 'end_time': starts + pd.Timedelta(hours=6)})
# label slicing on a sorted DatetimeIndex includes both endpoints
df2['sum_Open'] = [df.loc[s:e, 'Open'].sum() for s, e in zip(df2['start_time'], df2['end_time'])]
You'd still have to decide how to handle the last day, whose data ends at 11 PM, so its window is incomplete.

Data Frame in Panda with Time series data

I just started learning pandas. I came across this:
rng = date_range('1/1/2011', periods=72, freq='H')
s = Series(randn(len(rng)), index=rng)
I understood what the code above does, and I tried it in IPython:
import numpy as np
from numpy.random import randn
from pandas import date_range, Series, DataFrame  # names used unqualified below
import time
r = date_range('1/1/2011', periods=72, freq='H')
r
len(r)
[r[i] for i in range(len(r))]
s = Series(randn(len(r)), index=r)
s
s.plot()
df_new = DataFrame(data = s, columns=['Random Number Generated'])
Is this the correct way of creating a data frame?
The next step given is to: return a series where the absolute difference between a number and the next number in the series is less than 0.5.
Do I need to find the difference between each pair of consecutive random numbers and keep only those where the absolute difference is < 0.5? Can someone explain how I can do that in pandas?
I also tried to plot the series as a histogram with:
df_new.diff().hist()
The graph displays the random numbers on the x axis and a y axis from 0 to 18 (which I don't understand). Can someone explain this to me as well?
To give you some pointers in addition to @Dthal's comments:
r = pd.date_range('1/1/2011', periods=72, freq='H')
As commented by @Dthal, you can simplify the creation of your DataFrame, randomly sampled from the normal distribution, like so:
df = pd.DataFrame(index=r, data=randn(len(r)), columns=['Random Number Generated'])
To show only values that differ by less than 0.5 from the preceding value:
diff = df.diff()
diff[abs(diff['Random Number Generated']) < 0.5]
Random Number Generated
2011-01-01 02:00:00 0.061821
2011-01-01 05:00:00 0.463712
2011-01-01 09:00:00 -0.402802
2011-01-01 11:00:00 -0.000434
2011-01-01 22:00:00 0.295019
2011-01-02 03:00:00 0.215095
2011-01-02 05:00:00 0.424368
2011-01-02 08:00:00 -0.452416
2011-01-02 09:00:00 -0.474999
2011-01-02 11:00:00 0.385204
2011-01-02 12:00:00 -0.248396
2011-01-02 14:00:00 0.081890
2011-01-02 17:00:00 0.421897
2011-01-02 18:00:00 0.104898
2011-01-03 05:00:00 -0.071969
2011-01-03 15:00:00 0.101156
2011-01-03 18:00:00 -0.175296
2011-01-03 20:00:00 -0.371812
You can simplify by using .dropna() to get rid of the missing values.
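The filter above returns the differences themselves. If the exercise wants the original numbers whose distance to the next number is below 0.5, one possible reading can be sketched as follows; the series is made up, the seed is arbitrary, and the name close_to_next is hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up series; the seed is arbitrary and only fixes the random draw.
rng = pd.date_range("2011-01-01", periods=72, freq="h")
s = pd.Series(np.random.default_rng(0).standard_normal(len(rng)), index=rng)

# s.diff(-1) is s[t] - s[t+1]; the last entry is NaN and compares as False.
close_to_next = s.diff(-1).abs() < 0.5

# Boolean indexing keeps the original values, not the differences.
result = s[close_to_next]
print(result.head())
```

Using diff() instead of diff(-1) would compare each number with the preceding one, which matches the filter shown above.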
The pandas.Series.hist() docs state that the default number of bins is 10, so that's the number of bars you should expect. In this case the histogram turns out roughly symmetric around zero, spanning roughly [-4, +4].
Series.hist(by=None, ax=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, figsize=None, bins=10, **kwds)
diff.hist()
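As for the axes: the x axis is the value of the difference and the y axis is the number of differences that fall into each bin, so a top value near 18 means the fullest bin holds about 18 of the 71 differences. A small sketch with numpy.histogram shows the counts behind the bars; the data is made up, so the exact counts will differ from the plot in the question.

```python
import numpy as np
import pandas as pd

# Made-up series; the seed is arbitrary and only fixes the random draw.
rng = pd.date_range("2011-01-01", periods=72, freq="h")
s = pd.Series(np.random.default_rng(0).standard_normal(len(rng)), index=rng)
diff = s.diff().dropna()  # 71 differences; the first one is undefined

# Ten equal-width bins between the smallest and largest difference.
counts, edges = np.histogram(diff, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {n}")  # n is the height of one bar
print(counts.sum())  # 71, one count per difference
```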

Data changes while interpolating data frame using Pandas and numpy

I am trying to calculate degree hours based on hourly temperature values.
The data that I am using has some missing days, and I am trying to interpolate them. Below is part of the data:
2012-06-27 19:00:00 24
2012-06-27 20:00:00 23
2012-06-27 21:00:00 23
2012-06-27 22:00:00 16
2012-06-27 23:00:00 15
2012-06-29 00:00:00 15
2012-06-29 01:00:00 16
2012-06-29 02:00:00 16
2012-06-29 03:00:00 16
2012-06-29 04:00:00 17
2012-06-29 05:00:00 17
2012-06-29 06:00:00 18
....
2014-12-14 20:00:00 1
2014-12-14 21:00:00 0
2014-12-14 22:00:00 -1
2014-12-14 23:00:00 8
The full code is:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
filename = 'Temperature12.xls'
df_temp = pd.read_excel(filename)
df_temp = df_temp.set_index('datetime')
ts_temp = df_temp['temp']
def inter_lin_nan(ts_temp, rule):
    # older pandas API: resample(rule) returned the mean per bin directly
    ts_temp = ts_temp.resample(rule)
    mask = np.isnan(ts_temp)
    # interpolating missing values
    ts_temp[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask), ts_temp[~mask])
    return ts_temp
ts_temp = inter_lin_nan(ts_temp, '1H')
print(ts_temp['2014-06-28':'2014-06-29'])
def HDH(Tcurr, Tref=15.0):
    if Tref >= Tcurr:
        return (Tref - Tcurr) / 24
    else:
        return 0
df_temp['H-Degreehours'] = df_temp.apply(lambda row: HDH(row['temp']), axis=1)
df_temp['CDD-CUMSUM'] = df_temp['C-Degreehours'].cumsum()  # 'C-Degreehours' is not created in the code shown
df_temp['HDD-CUMSUM'] = df_temp['H-Degreehours'].cumsum()
# older pandas API; newer releases spell this .resample('H').sum()
df_temp1 = df_temp['H-Degreehours'].resample('H', how=sum)
print(df_temp1)
Now I have two questions. First, the inter_lin_nan function does interpolate the data, but it also changes the next day's data, which then differs completely from what is in the Excel file. Is this common, or have I missed something?
Second question: at the end of the code I am trying to add up the hourly degree day values, which is why I created another data frame, but when I print that data frame, it still has NaN values as in the original data file. Could you please tell me why this is happening?
I may be missing something very obvious as I am new to Python.
Don't use numpy when pandas has its own version.
df = pd.read_csv(filepath, index_col=0, parse_dates=True)  # assumes the first column holds the timestamps
df = df.asfreq('1d')  # get a time series with index timestamps each day; use '1H' for hourly data
df['somelabel'] = df['somelabel'].interpolate(method='linear')  # interpolate NaN values
Use asfreq to add the required frequency of timestamps to your time series, and use interpolate() to fill in only the NaN values.
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.Series.interpolate.html
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.asfreq.html
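Applied to hourly temperatures like those in the question, the asfreq plus interpolate approach can be sketched as below. The four rows are copied from the question's sample, and the check at the end is only there to show that values already present are left untouched; treat it as an illustration under those assumptions, not a drop-in replacement for the full script.

```python
import pandas as pd

# A few rows copied from the question; all of 2012-06-28 is missing.
idx = pd.to_datetime([
    "2012-06-27 22:00", "2012-06-27 23:00",
    "2012-06-29 00:00", "2012-06-29 01:00",
])
ts = pd.Series([16.0, 15.0, 15.0, 16.0], index=idx, name="temp")

# asfreq inserts the missing hourly timestamps as NaN without aggregating.
hourly = ts.asfreq("h")

# Linear interpolation fills only the NaN gaps.
filled = hourly.interpolate(method="linear")

# The observations that were already present are unchanged.
print((filled.loc[idx] == ts).all())
print(len(filled))
```

Computing the degree hours on the filled series, rather than on the original data frame, would also keep the gaps from reappearing as NaN in later steps.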
