Time Series plot of timestamps in monthly buckets using Python/Pandas [duplicate]

I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I came up so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum is assigned to the next day. I've been able to solve that by using the parameter loffset='-1d'.
Now the actual problem is that the data may not start at 00:30 of a day but at any time of day. Also, the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold on the number of values necessary to calculate a daily sum? (e.g. if there are fewer than 40 values in a single day, then put NaN instead of a sum)
I believe it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to code the function itself.

You can do it directly in Pandas:
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128

A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
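Note that the plain Grouper sum above does not apply the 40-value threshold from the question. A minimal sketch that keeps it, assuming the Series s from the question (data values indexed by timestamps); the helper name sum_if_enough is purely illustrative:
import numpy as np
# return NaN for any day with fewer than 40 non-NaN observations, otherwise the daily sum
def sum_if_enough(x, min_count=40):
    return x.sum() if x.count() >= min_count else np.nan
d = s.resample('D').apply(sum_if_enough)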

Related

Pandas: easier way to sample interpolated time series data at given times (e.g. every full day)

Regularly I run into the problem that I have time series data that I want to interpolate and resample at given times. I have a solution, but it feels too labor intensive, i.e. I guess there should be a simpler way. Have a look at how I currently do it here: https://gist.github.com/cs224/012f393d5ced6931ae223e6ddc4fe6b2 (or the nicer version via nbviewer here: https://nbviewer.org/gist/cs224/012f393d5ced6931ae223e6ddc4fe6b2)
Perhaps a motivating example: I fill up my car about every two weeks and have the cost data of every refill. Now I would like to know the cumulative sum on a daily basis, where the day values are at midnight and interpolated.
Currently I create a new empty data frame that contains the time points at which I want to have my resampled values:
df_sampling = pd.DataFrame(index=pd.date_range(start, end, freq=freq))
And then either use pd.merge:
ldf = pd.merge(df_in, df_sampling, left_index=True, right_index=True, how='outer')
or pd.concat:
ldf = pd.concat([df_in, df_sampling], axis=1)
to create a combined time series that has the additional time points in the index. Based on that I can then use interpolate() and sub-select all index values given by df_sampling. See the gist for details.
All this feels too cumbersome and I guess there should be a better way to do it.
Instead of using either merge or concat inside your function generate_interpolated_time_series, I would rely on df.reindex. Something like this:
def f(df_in, freq='T', start=None):
    if start is None:
        start = df_in.index[0].floor('T')
        # refactored from: df_in.index[0].replace(second=0, microsecond=0, nanosecond=0)
    end = df_in.index[-1]
    idx = pd.date_range(start=start, end=end, freq=freq)
    ldf = df_in.reindex(df_in.index.union(idx)).interpolate().bfill()
    ldf = ldf[~ldf.index.isin(df_in.index.difference(idx))]
    return ldf
Test sample:
from pandas import Timestamp
d = {Timestamp('2022-10-07 11:06:09.957000'): 21.9,
Timestamp('2022-11-19 04:53:18.532000'): 47.5,
Timestamp('2022-11-19 16:30:04.564000'): 66.9,
Timestamp('2022-11-21 04:17:57.832000'): 96.9,
Timestamp('2022-12-05 22:26:48.354000'): 118.6}
df = pd.DataFrame.from_dict(d, orient='index', columns=['values'])
print(df)
values
2022-10-07 11:06:09.957 21.9
2022-11-19 04:53:18.532 47.5
2022-11-19 16:30:04.564 66.9
2022-11-21 04:17:57.832 96.9
2022-12-05 22:26:48.354 118.6
Check for equality:
merge = generate_interpolated_time_series(df, freq='D', method='merge')
concat = generate_interpolated_time_series(df, freq='D', method='concat')
reindex = f(df, freq='D')
print(all([merge.equals(concat),merge.equals(reindex)]))
# True
An added bonus would be some performance gain. Comparing the three methods with %timeit for different frequencies (['D','H','T','S']), reindex is the fastest in each case.
Aside: in your function, raise Exception('Method unknown: ' + metnhod) contains a typo; should be method.

How to average 2D array columns based on date header

I am working on some glacier borehole temperature data consisting of ~1,000 rows by 700 columns. The vertical index is depth (i.e. as you move down the array, depth increases) and the column headers are datetime values (i.e. as you move right along the array, you move forwards in time).
I am looking for a way to average all temperatures in the columns depending on a date sampling rate. For example, the early datetimes have a spacing of 10 minutes, but the later datetimes have a spacing of six hours.
It would be good to be able to put the sampling rate in as an input and get data out based on that rate, so that I can see which one works best.
It would also be good if, when I choose say 3-hour sampling, this is simply ignored for spacings above 3 hours and the data is left unchanged in that case (i.e. datetime spacings of 10 minutes are averaged, but datetime spacings of 6 hours are left unaffected).
All of this needs to come out as either a pandas dataframe with dates as column headers and depth as the index, or as a numpy array and a separate list of datetimes.
I'm fairly new to Python, and this is my first question on Stack Overflow! Thanks :)
(I know the following is not totally correct use of Pandas, but it works for the figure slider I've produced!)
import numpy as np
import pandas as pd
# example array
T = np.array([[-2,   -2,   -2,   -2.1, -2.3, -2.6],
              [-2.2, -2.3, -3,   -3.1, -3.3, -3.3],
              [-4,   -4,   -4.5, -4.4, -4.6, -4.5]])
# example headers at 8 and then 4 hour spacing
headers = (pd.date_range(start='2018-04-24 00:00:00', end='2018-04-24 08:00:00', periods=3).tolist() +
           pd.date_range(start='2018-04-24 12:00:00', end='2018-04-25 12:00:00', periods=3).tolist())
# pandas dataframe in the same setup as the much larger one I'm using
T_df = pd.DataFrame(T, columns=headers)
One trick you can use is to convert your time series to a numeric series, and then use the groupby method.
For instance, imagine you have
df = pd.DataFrame([['10:00:00', 1.],['10:10:00', 2.],['10:20:00', 3.],['10:30:00', 4.]],columns=['Time', 'Value'])
df.Time = pd.to_datetime(df.Time, format='%X')
You can convert your time series by:
df['DeltaT'] = (df.Time - df.Time.iloc[0]).dt.total_seconds().astype(int)  # seconds elapsed since the first sample, starting at 0
Then use the groupby method. You can for instance create a new column to floor the time interval you want:
myInterval = 1200.
df['group'] = (df['DeltaT']/myInterval).astype(int)
So you can use groupby followed by mean() (or a function you define)
df.groupby('group').mean()
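To relate this back to the question's wide layout, here is a hypothetical sketch of the same trick applied to T_df (datetime column headers, depth as row index); binSeconds, colSeconds, bins and T_avg are illustrative names, not part of the original answer:
import numpy as np
binSeconds = 3 * 3600                                              # 3-hour bins
colSeconds = np.array([ts.timestamp() for ts in T_df.columns])     # header datetimes as seconds
bins = ((colSeconds - colSeconds[0]) // binSeconds).astype(int)    # group id per column
T_avg = T_df.T.groupby(bins).mean().T                              # average columns within each bin; columns become group ids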
Hope this helps!

Creating time series DataFrame from event data

I have a dataset of locations of stores with dates of events (the date all stock was sold from that store) and quantities of the sold items, such as the following:
import numpy as np, pandas as pd
# Dates
start = pd.Timestamp("2014-02-26")
end = pd.Timestamp("2014-09-24")
# Generate some data
N = 1000
quantA = np.random.randint(10, 500, N)
quantB = np.random.randint(50, 250, N)
sell = np.random.randint(start.value, end.value, N)
sell = pd.to_datetime(np.array(sell, dtype="datetime64[ns]"))
df = pd.DataFrame({"sell_date": sell, "quantityA":quantA, "quantityB":quantB})
df.index = df.sell_date
I would like to create a new time series dataframe that has weekly summaries (or daily summaries, or summaries over a custom date_range) of these quantities A and B.
I can generate week number and aggregate sales based on those, like so...
df['week'] = df.sell_date.dt.week
df.pivot_table(values = ['quantityA', 'quantityB'], index = 'week', aggfunc = [np.sum, len])
But I don't see how to do the following:
expand this out to a full time series (based on a date_range object, such as period_range = pd.date_range(start = start, end = end, freq='7D')),
include the original date (as a 'week starting' variable), instead of integer week number, or
change the date variable to be the index of this new dataframe.
I'm not sure if this is what you want but you can try
df.set_index('sell_date', inplace=True)
resampled = df.resample('7D', [sum, len])
The resulting index might not be exactly what you want as it starts with the earliest datetime correct to the nanosecond. You could replace with datetimes which have 00:00:00 in the time by doing
resampled.index = pd.to_datetime(resampled.index.date)
EDIT:
You can actually just do
resampled = df.resample('W', [sum, len])
And the resulting index is exactly what you want. Interestingly, passing 'D' also gives the index you would expect but passing a multiple like '2D' results in the 'ugly' index, that is, starting at the earliest correct to the nanosecond and increasing in multiples of exactly 2 days. I guess the lesson is stick to singles like 'D', 'W', 'M' where possible.
EDIT:
The API for resampling changed at some point such that the above no longer works. Instead one can do:
resampled = df.resample('W').agg([sum, len])
.resample now returns a Resampler object which exposes methods, much like the groupby API.
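If you want bins anchored to a custom start (the date_range idea from the question), a minimal sketch, assuming the df built in the question and a reasonably recent pandas (resample's origin argument appeared in 1.1):
weekly = (df[['quantityA', 'quantityB']]
          .resample('7D', origin=start)
          .agg(['sum', 'count']))   # 7-day bins starting exactly at the question's start timestamp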

Pandas dataframe groupby datetime month

Consider a CSV file:
string,date,number
a string,2/5/11 9:16am,1.0
a string,3/5/11 10:44pm,2.0
a string,4/22/11 12:07pm,3.0
a string,4/22/11 12:10pm,4.0
a string,4/29/11 11:59am,1.0
a string,5/2/11 1:41pm,2.0
a string,5/2/11 2:02pm,3.0
a string,5/2/11 2:56pm,4.0
a string,5/2/11 3:00pm,5.0
a string,5/2/14 3:02pm,6.0
a string,5/2/14 3:18pm,7.0
I can read this in, and reformat the date column into datetime format:
b = pd.read_csv('b.dat')
b['date'] = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')
I have been trying to group the data by month. It seems like there should be an obvious way of accessing the month and grouping by that. But I can't seem to do it. Does anyone know how?
What I am currently trying is re-indexing by the date:
b.index = b['date']
I can access the month like so:
b.index.month
However I can't seem to find a function to lump together by month.
Managed to do it:
b = pd.read_csv('b.dat')
b.index = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')
b.groupby(by=[b.index.month, b.index.year])
Or
b.groupby(pd.Grouper(freq='M')) # update for v0.21+
(Update 2018)
Note that pd.TimeGrouper is deprecated and will be removed. Use instead:
df.groupby(pd.Grouper(freq='M'))
To group time-series data you can use the resample method. For example, to group by month:
df.resample(rule='M', on='date')['Values'].sum()
You can find the list of offset aliases in the pandas documentation.
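Applied to the question's data, a short sketch (assuming b with the parsed datetime 'date' column from above):
monthly = b.resample('M', on='date')['number'].sum()   # monthly totals of the 'number' column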
One solution which avoids MultiIndex is to create a new datetime column setting day = 1. Then group by this column.
Normalise day of month
df = pd.DataFrame({'Date': pd.to_datetime(['2017-10-05', '2017-10-20', '2017-10-01', '2017-09-01']),
'Values': [5, 10, 15, 20]})
# normalize day to beginning of month, 4 alternative methods below
df['YearMonth'] = df['Date'] + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
df['YearMonth'] = df['Date'] - pd.to_timedelta(df['Date'].dt.day-1, unit='D')
df['YearMonth'] = df['Date'].map(lambda dt: dt.replace(day=1))
df['YearMonth'] = df['Date'].dt.normalize().map(pd.tseries.offsets.MonthBegin().rollback)
Then use groupby as normal:
g = df.groupby('YearMonth')
res = g['Values'].sum()
# YearMonth
# 2017-09-01 20
# 2017-10-01 30
# Name: Values, dtype: int64
Comparison with pd.Grouper
The subtle benefit of this solution is, unlike pd.Grouper, the grouper index is normalized to the beginning of each month rather than the end, and therefore you can easily extract groups via get_group:
some_group = g.get_group('2017-10-01')
Calculating the last day of October is slightly more cumbersome. pd.Grouper, as of v0.23, does support a convention parameter, but this is only applicable for a PeriodIndex grouper.
Comparison with string conversion
An alternative to the above idea is to convert to a string, e.g. convert datetime 2017-10-XX to string '2017-10'. However, this is not recommended since you lose all the efficiency benefits of a datetime series (stored internally as numerical data in a contiguous memory block) versus an object series of strings (stored as an array of pointers).
A slightly alternative solution to #jpp's, but outputting a YearMonth string (the month is zero-padded so the strings sort correctly):
df['YearMonth'] = pd.to_datetime(df['Date']).apply(lambda x: '{year}-{month:02d}'.format(year=x.year, month=x.month))
res = df.groupby('YearMonth')['Values'].sum()

How to resample timedeltas?

I have been running an experiment that outputs data with two columns:
seconds since start of experiment (float)
a measurement. (float)
I would now like to load this into pandas to resample and plot the measurements. I've done this before, but in those cases my timestamps were since epoch or in datetime (YYYY-MM-DD HH:mm:ss) format. If I load my first column as integers I'm unable to do
data.resample('5Min', how='mean')
It also does not seem possible if I convert my first column to timedelta(seconds=...). My question is: is it possible to resample this data without resorting to epoch conversion?
You can use groupby with time // period to do this:
import pandas as pd
import numpy as np
t = np.random.rand(10000)*3600
t.sort()
v = np.random.rand(10000)
df = pd.DataFrame({"time":t, "value":v})
period = 5*60
s = df.groupby(df.time // period).value.mean()
s.index *= period
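As an optional follow-up (not part of the original answer), the bin labels are plain seconds; if you prefer a TimedeltaIndex for plotting, one extra line does it:
s.index = pd.to_timedelta(s.index, unit='s')   # turn the second-based labels into timedeltas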
I have the same structure of sensor data: the first column is seconds since the start of the experiment and the rest of the columns are values.
Here is the data structure:
time x y z
0 0.015948 0.403931 0.449005 -0.796860
1 0.036006 0.403915 0.448029 -0.795395
2 0.055885 0.404907 0.446548 -0.795853
Here is what worked for me:
Convert the time to a timedelta:
df.time = pd.to_timedelta(df.time, unit="s")
Set the time as index:
df.set_index("time", inplace=True)
Resample to the frequency you want:
df.resample("40ms").mean()
