Pandas: remove daily seasonality from data by subtracting the daily mean - python

I have a large amount of time series sensor data in a pandas dataframe. The resolution of the data is one observation every 15 minutes over 1 month for 876 sensors.
The data has some daily seasonality and some faulty measurements in single sensors on about 50% of the observations.
I want to remove the seasonality.
df.diff(periods=96)
This does not work, because then I have an outlier on 2 days (the day with the actual faulty measurement and the day after).
Therefore I wrote this snippet of code which does what it should and works fine:
for index in df.index:
    for column in df.columns:
        df[column][index] = df[column][index] - (
            df[column][df.index % 96 == index % 96]).mean()
The problem is that this is incredibly slow.
Is there a way to achieve the same thing with a pandas function significantly faster?

Iterating over a DataFrame/Series should be your last resort; it's very slow.
In this case, you can use groupby + transform to compute the mean of each seasonal slot for all the columns, and then subtract it from your DataFrame in a vectorized way.
Based on your code, it seems that this should work:
period = 96
season_mean = df.groupby(df.index % period).transform('mean')
df -= season_mean
Or, if you prefer:
period = 96
df = df.groupby(df.index % period).transform(lambda g: g - g.mean())
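For reference, a minimal self-contained sketch of the same idea on synthetic data (the sensor column names here are made up):
import numpy as np
import pandas as pd
period = 96  # one observation every 15 minutes -> 96 slots per day
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(period * 30, 3)), columns=['s1', 's2', 's3'])
# per-slot daily mean, broadcast back to the original shape and subtracted for all columns at once
season_mean = df.groupby(df.index % period).transform('mean')
deseasonalized = df - season_mean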

Related

Pandas: Resampling Hourly Data for each Group

I have a dataframe that contains GPS locations of vehicles received at various times in a day. For each vehicle, I want to resample the data hourly such that I have the median report (according to the timestamp) for each hour of the day. For hours where there are no corresponding rows, I want a blank row.
I am using the following code:
for i, j in enumerate(list(df.id.unique())):
    data = df.loc[df.id == j]
    data['hour'] = data['timestamp'].dt.hour
    data_grouped = data.groupby(['imo', 'hour']).median().reset_index()
    data = data_grouped.set_index('hour').reindex(idx).reset_index()  # idx is a list of integers from 0 to 23.
Since my dataframe has millions of ids, it takes a lot of time to iterate through all of them. Is there an efficient way of doing this?
Unlike Pandas reindex dates in Groupby, I have multiple rows for each hour, in addition to some hours having no rows at all.
Tested in the latest version of pandas: convert the hour column to a categorical with all possible categories and then aggregate without a loop:
df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
df1 = df.groupby(['id','imo','hour']).median().reset_index()
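For illustration, a minimal runnable sketch of this idea on made-up data (the question's imo column is omitted for brevity, and the column names are assumptions):
import pandas as pd
# two vehicles, each reporting in only one or two hours of the day
df = pd.DataFrame({
    'id': [1, 1, 2],
    'timestamp': pd.to_datetime(['2021-01-01 03:15', '2021-01-01 03:45', '2021-01-01 10:05']),
    'lat': [10.0, 10.5, 20.0],
})
df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
# observed=False keeps every hour category, so hours without reports show up as NaN rows
df1 = df.groupby(['id', 'hour'], observed=False)['lat'].median().reset_index()
# df1 now has 2 ids x 24 hours = 48 rows, with NaN where a vehicle had no reports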

Add hours column to regular list of minutes, group by it, and average the data in Python

I have looked for similar questions, but none seems to address the following challenge. I have a pandas dataframe with a list of minutes and corresponding values, like the following:
minute  value
0       454
1       434
2       254
The list is a year-long list, thus counting 60 minutes * 24 hours * 365 days = 525600 observations.
I would like to add a new column called hour, which expresses the hour of the day (minutes 0-59 are 12 AM, 60-119 are 1 AM, and so forth until the following day, where the sequence restarts).
Then, once the hour column is added, I would like to group observations by it and calculate the average value for each hour of the day across the year, ending up with a dataframe of 24 observations, each expressing the average value of the original data at hour n.
Using integer and remainder division you can get the hour.
df['hour'] = df['minute']//60%24
If you want other date information it can be useful to use January 1st of some year (not a leap year) as the origin and convert to a datetime. Then you can grab a lot of the date attributes, in this case hour.
df['hour'] = pd.to_datetime(df['minute'], unit='m', origin='2017-01-01').dt.hour
Then for your averages you get the resulting 24 row Series with:
df.groupby('hour')['value'].mean()
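As a quick sanity check on a tiny made-up frame (three hours' worth of minutes):
import pandas as pd
df = pd.DataFrame({'minute': range(180), 'value': range(180)})
df['hour'] = df['minute'] // 60 % 24
# hours 0, 1, 2 -> means 29.5, 89.5, 149.5
print(df.groupby('hour')['value'].mean())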
Here's a way to do it:
import numpy as np
import pandas as pd
# sample df
df = pd.DataFrame({'minute': np.arange(525600), 'value': np.arange(525600)})
# convert the minute counter to a timedelta
df['minute'] = pd.to_timedelta(df['minute'], unit='m')
# calculate the mean of 'value' within each hourly bin
df_new = df.groupby(pd.Grouper(key='minute', freq='1H'))['value'].mean().reset_index()
Although you don't need the hour column explicitly to calculate these values, if you want to get it you can do so with:
df_new['hour'] = pd.to_datetime(df_new['minute']).dt.hour

Resample/reindex sensor data

I want to do some data processing on sensor data (about 300 different sensors). This is an example of the raw data from a temperature sensor:
"2018-06-30T13:17:05.986Z" 30.5
"2018-06-30T13:12:05.984Z" 30.3
"2018-06-30T13:07:05.934Z" 29.5
"2018-06-30T13:02:05.873Z" 30.3
"2018-06-30T12:57:05.904Z" 30
I want to resample the data to smooth datetimes:
13:00:00
13:05:00
13:10:00
...
I have written some code that works, but it is incredibly slow when used on bigger files. My code just upsamples all the data to 1 second via linear interpolation and then downsamples it to the requested frequency.
Is there a faster method to achieve this?
EDIT:
Sensor data is written into a database, and my code loads data for an arbitrary time interval from the database.
EDIT2: My working code
upsampled = dataframe.resample('1S').asfreq()
upsampled = upsampled.interpolate(method=method, limit=limitT) # ffill or bfill for some sensors
resampled = upsampled.astype(float).resample(str(sampling_time) + 'S').mean() # for temperature
resampled = upsampled.astype(float).resample(str(sampling_time) + 'S').asfreq() # for everything else
You can first set the dataframe's index to the timestamp column, and then use the resample() method to bring it to 1-second or 5-minute intervals.
For example:
temp_df = pd.read_csv('temp.csv',header=None)
temp_df.columns = ['Timestamps','TEMP']
temp_df = temp_df.set_index('Timestamps') #set the timestamp column as index
temp_re_df = temp_df.TEMP.resample('5T').mean()
You can pass the period as an argument to resample(), i.e. T - minute, S - second, M - month, H - hour, etc., and apply a function like mean(), max(), or min() as the down-sampling method.
P.S.: This assumes your timestamps are already in pandas datetime format. Otherwise, use pd.to_datetime(temp_df['Timestamps'], unit='s') to convert the column before setting it as the index.
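For completeness, a hedged end-to-end sketch along these lines (file and column names are assumptions; the ISO timestamps shown above are parsed directly rather than via unit='s'):
import pandas as pd
temp_df = pd.read_csv('temp.csv', header=None, names=['Timestamps', 'TEMP'])
temp_df['Timestamps'] = pd.to_datetime(temp_df['Timestamps'])  # ISO-8601 strings -> datetime64
temp_df = temp_df.set_index('Timestamps').sort_index()
# 5-minute bins aligned on round clock times; empty bins filled by interpolation
resampled = temp_df['TEMP'].resample('5T').mean().interpolate()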

Performance issue on operations involving pandas dataframes

I have a pandas dataframe containing 1-minute OHLC data (19724 rows). I am looking at adding 2 new columns keeping track of the minimum and maximum price over the past 3 days (including today, up to the current bar in question, and ignoring missing days). However, I am running into performance issues, as a %timeit of the for loop indicates 57 seconds... I am looking for ways to speed this up (vectorization? I tried, but I must admit I am struggling a little bit).
#Import the data and put them in a DataFrame. The DataFrame should contain
#the following fields: DateTime (the index), Open, Close, High, Low, Volume.
#----------------------
#The following assumes the first column of the file is Datetime
import os
import numpy as np
import pandas as pd
dfData=pd.read_csv(os.path.join(DataLocation,FileName),index_col='Date')
dfData.index=pd.to_datetime(dfData.index,dayfirst=True)
dfData.index=dfData.index.tz_localize('Singapore')
# Calculate the list of unique dates in the dataframe to find T-2
ListOfDates=pd.to_datetime(dfData.index.date).unique()
#Add a ExtMin and and ExtMax to the dataFrame to keep track of the min and max over a certain window
dfData['ExtMin']=np.nan
dfData['ExtMax']=np.nan
#For each line in the dataframe, calculate the minimum price reached over the past 3 days including today.
def addMaxMin(dfData):
    for index,row in dfData.iterrows():
        #Find the index in ListOfDates, strip out the time, offset by -2 rows
        Start=ListOfDates[max(0,ListOfDates.get_loc(index.date())-2)]
        #Populate the ExtMin and ExtMax columns
        dfData.loc[index,'ExtMin']=dfData[(Start<=dfData.index) & (dfData.index<index)]['LOW'].min()
        dfData.loc[index,'ExtMax']=dfData[(Start<=dfData.index) & (dfData.index<index)]['HIGH'].max()
    return dfData
%timeit addMaxMin(dfData)
Thanks.
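One rough vectorized direction (a sketch only, using a calendar-day window rather than the exact trading-day logic described above) is a time-based rolling window on the DatetimeIndex, shifted by one bar:
# approximate "past 3 days up to, but excluding, the current bar" with a
# 3-calendar-day rolling window; this diverges from the trading-day logic
# whenever whole days are missing from the data
dfData['ExtMin'] = dfData['LOW'].rolling('3D').min().shift(1)
dfData['ExtMax'] = dfData['HIGH'].rolling('3D').max().shift(1)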

A Multi-Index Construction for Intraday TimeSeries (10 min price data)

I have a file with intraday prices every ten minutes: [0:41] times in a day, so each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the difference of the ten minute prices, restricting differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last: I cannot overlap differences between days of price from 16:20 - 09:30. The differences should start as 09:40 - 09:30 and end as 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded,data):
    '''This function accepts a column called rounded from 'data'
    The 2nd input 'data' is a dataframe
    '''
    df=rounded.shift(1)
    idf=data.set_index(['date', 'time'])
    data['diff']=['000']
    for i in range(0,length(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1]!=1620:
                    data['diff']=rounded[i]-df[i]
                else:
                    day+=1
                    time+=2
    data[['date','time','price','II','diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data=read_csv('file.csv')
rounded=roundSeries(data['price'],5)
diffSeries(rounded,data)
In the traceback I get an AssertionError.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import datetime, time
# Create a date range with 10 minute intervals, and filter out irrelevant times
times = pd.bdate_range(start=datetime(2012,11,14,0,0,0), end=datetime(2012,11,17,0,0,0), freq='10T')
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
prices = randn(len(filtered_times))
# Create a MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times],
          [x.time() for x in filtered_times]]
tuples = list(zip(*arrays))
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
                       prices
date       time
2012-11-14 09:30:00  0.696054
           09:40:00 -1.263852
           09:50:00  0.196662
           10:00:00 -0.942375
           10:10:00  1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
                       prices
date       time
2012-11-14 09:30:00       NaN
           09:40:00 -1.959906
           09:50:00  1.460514
           10:00:00 -1.139036
           10:10:00  2.857582
Since you are grouping by the date, the function is not applied for 16:20 - 09:30.
You might want to consider using a TimeSeries instead of a DataFrame, because it will give you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can easily convert it into a TimeSeries and perform a similar function to get the differences:
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(i.date,i.time) for i in data.index])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions, such as resampling. See the pandas time series documentation for more.
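For example (a sketch of what that enables), once ts has a DatetimeIndex you can resample it directly:
# hourly mean prices, or forward-fill onto a regular 10-minute grid
hourly = ts.resample('H').mean()
regular = ts.resample('10T').ffill()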
@MattiJohn's construction gives a filtered list of length 86,772 when run over 1/3/2007-8/30/2012 with 42 times per day (10 minute intervals). Observe the data cleaning issues.
Here the price data coming from the csv has length 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
DF=pd.read_csv('MR10min.csv')
prices = DF.price
# I.E. rather than the generic: prices = randn(len(filtered_times)) above.
The fact that the real data falls short of the length it "should be" means there are data cleaning issues. Often we do not have all the times that bdate_range will generate (half days in the market, holidays, etc.).
Your solution is elegant. But I am not sure how to overcome the mismatch between the actual data and the a priori, prescribed dataframe.
Your second TimeSeries suggestion still seems to require construction of a datetime index similar to the first one. For example, if I were to use the following lines to get the actual data of interest:
DF=pd.read_csv('MR10min.csv')
data=DF.set_index(['date','time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one make a bdate_range array completely informed by the actual data available?
Thank you to @MattiJohn and to anyone with interest in continuing this discussion.
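One possible way around the TypeError above (a sketch, assuming the csv's date and time columns are plain strings in a parseable format such as '2007-01-03' and '09:30') is to build the DatetimeIndex from the actual rows rather than from a prescribed bdate_range:
import pandas as pd
DF = pd.read_csv('MR10min.csv')
# let pandas parse the combined strings instead of calling datetime.combine on str objects
dt_index = pd.to_datetime(DF['date'].astype(str) + ' ' + DF['time'].astype(str))
ts = pd.Series(DF['price'].values, index=dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))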
