Date ranges in Pandas - python

After fighting with NumPy and dateutil for days, I recently discovered the amazing Pandas library. I've been poring through the documentation and source code, but I can't figure out how to get date_range() to generate indices at the right breakpoints.
from datetime import date
import pandas as pd
start = date(2012, 1, 15)
end = date(2012, 9, 20)
# 'M' is month-end, instead I need same-day-of-month
pd.date_range(start, end, freq='M')
What I want:
2012-01-15
2012-02-15
2012-03-15
...
2012-09-15
What I get:
2012-01-31
2012-02-29
2012-03-31
...
2012-08-31
I need month-sized chunks that account for the variable number of days in a month. This is possible with dateutil.rrule:
rrule(freq=MONTHLY, dtstart=start, bymonthday=(start.day, -1), bysetpos=1)
Ugly and illegible, but it works. How can I do this with pandas? I've played with both date_range() and period_range(), so far with no luck.
My actual goal is to use groupby, crosstab and/or resample to calculate values for each period based on sums/means/etc of individual entries within the period. In other words, I want to transform data from:
total
2012-01-10 00:01 50
2012-01-15 01:01 55
2012-03-11 00:01 60
2012-04-28 00:01 80
#Hypothetical usage
dataframe.resample('total', how='sum', freq='M', start='2012-01-09', end='2012-04-15')
to
total
2012-01-09 105 # Values summed
2012-02-09 0 # Missing from dataframe
2012-03-09 60
2012-04-09 0 # Data past end date, not counted
Given that Pandas originated as a financial analysis tool, I'm virtually certain that there's a simple and fast way to do this. Help appreciated!

freq='M' is for month-end frequencies (see here). But you can use .shift to shift it by any number of days (or any frequency for that matter):
pd.date_range(start, end, freq='M').shift(15, freq=pd.datetools.day)

There actually is no "day of month" frequency (e.g. "DOMXX" like "DOM09"), but I don't see any reason not to add one.
http://github.com/pydata/pandas/issues/2289
I don't have a simple workaround for you at the moment because resample requires passing a known frequency rule. I think it should be augmented to be able to take any date range to be used as arbitrary bin edges, also. Just a matter of time and hacking...
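Until something like that exists, the "arbitrary bin edges" idea can be approximated by hand. The sketch below is only one possible workaround, not a built-in resample feature: it builds same-day-of-month edges with pd.DateOffset(months=1), assigns each row to the edge at or before it with searchsorted, and sums per edge. The variable names (edges, in_range, pos, sums) are illustrative, and the data is the four-row example from the question.

```python
import pandas as pd

# The four observations from the question.
df = pd.DataFrame(
    {"total": [50, 55, 60, 80]},
    index=pd.to_datetime(
        ["2012-01-10 00:01", "2012-01-15 01:01",
         "2012-03-11 00:01", "2012-04-28 00:01"]
    ),
)

start, end = pd.Timestamp("2012-01-09"), pd.Timestamp("2012-04-15")

# Bin edges on the same day of each month: 01-09, 02-09, 03-09, 04-09.
edges = pd.date_range(start, end, freq=pd.DateOffset(months=1))

# Keep only rows inside [start, end); later rows are not counted.
in_range = df[(df.index >= start) & (df.index < end)]

# Position of the last edge that is <= each timestamp.
pos = edges.searchsorted(in_range.index, side="right") - 1

# Sum per edge, then add the empty periods back as zeros.
sums = in_range["total"].groupby(edges[pos]).sum().reindex(edges, fill_value=0)
print(sums)
```

Periods with no rows come back as 0 because of the reindex step; drop fill_value if NaN is preferable.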

Try:
pd.date_range(start, end, freq=pd.DateOffset(months=1))
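As a quick check of the DateOffset suggestion, here is a small sketch using the dates from the question. It assumes a reasonably recent pandas. For start days after the 28th, the day of month may get clipped in shorter months, so that case is worth testing separately.

```python
from datetime import date

import pandas as pd

start = date(2012, 1, 15)
end = date(2012, 9, 20)

# One step per calendar month, keeping the day of month of `start`.
idx = pd.date_range(start, end, freq=pd.DateOffset(months=1))
print(idx)
```

The range stops at the last date that does not exceed end, so the final entry here falls in September.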

Related

Find if there is any holidays between two dates in a large dataset?

I am working on a dataset that has some 26 million rows and 13 columns, including two datetime columns, arr_date and dep_date. I am trying to create a new boolean column that indicates whether there are any US holidays between these dates.
I am applying a function to the entire dataframe, but the execution time is too slow. The code has been running for more than 48 hours now on Google Cloud Platform (24 GB RAM, 4 cores). Is there a faster way to do this?
The dataset looks like this:
[Image: sample data]
The code I am using is -
import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
    # Flag rows whose date range contains a holiday (the column name 'holiday' is illustrative).
    df['holiday'] = df.apply(lambda x: len(cal.holidays(start=x['dep_date'], end=x['arr_date'])) > 0 and x['num_days'] < 20, axis=1)
    return df
df = mark_holiday(df)
This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.
The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).
from bisect import bisect_left
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000 # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
                   'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.to_timedelta(df['interval'], unit='D')
df = df.drop('interval', axis=1)
# Get a sorted list of holidays since the first start date.
hols = calendar().holidays(df['start_date'].min())
# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
    df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))
>>> df.head(6)
start_date end_date holiday_in_range
0 2015-07-14 2015-07-31 False
1 2010-12-18 2010-12-30 True # 2010-12-24
2 2013-04-06 2013-04-16 False
3 2013-09-12 2013-09-24 False
4 2017-10-28 2017-10-31 False
5 2013-12-14 2013-12-29 True # 2013-12-25
So, for a given start_date timestamp (e.g. 2013-12-14), bisect_left(hols, pd.Timestamp('2013-12-14')) would yield 39, and hols[39] results in 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday is calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date: holiday_in_range is True if the end_date is greater than or equal to this holiday value; otherwise the holiday must fall after the end_date.
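The same lookup can also be done without a Python-level lambda per row. The following is a sketch of a vectorized variant using np.searchsorted in place of bisect_left; it is an alternative I am suggesting, not the code timed above. The three sample rows are taken from the output shown earlier, and the one-year padding on the holiday list is an assumption made so that every start date has a following holiday to index.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-07-14", "2010-12-18", "2013-12-14"]),
    "end_date": pd.to_datetime(["2015-07-31", "2010-12-30", "2013-12-29"]),
})

# Sorted holidays covering the data, padded by a year so the lookup
# below never runs past the end of the array.
hols = USFederalHolidayCalendar().holidays(
    start=df["start_date"].min(),
    end=df["end_date"].max() + pd.DateOffset(years=1),
)

# Index of the first holiday on or after each start date (like bisect_left).
pos = np.searchsorted(hols.values, df["start_date"].values, side="left")
next_holiday = hols.values[pos]

# A holiday lies in the range if that next holiday is not after the end date.
df["holiday_in_range"] = next_holiday <= df["end_date"].values
print(df)
```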
Have you already considered using pandas.merge_asof for this?
I could imagine that map and apply with lambda functions cannot be executed that efficiently.
UPDATE: Sorry, I just read that you only need a boolean indicating whether there are any holidays in between, which makes it much easier. In that case you only need steps 1-5: group the DataFrame resulting from step 5 by start/end date and use count as the aggregate function to get the number of holidays in each range. You can join this result to your original dataset as in step 8 below, then fill the remaining values with fillna(0). Then do something like joined_df['includes_holiday'] = joined_df['joined_count_column'] > 0. After that, you can delete joined_count_column from your DataFrame if you like.
If you use pandas.merge_asof, you could work through these steps (steps 6 and 7 are only necessary if you also need all the holidays between start and end in your result DataFrame, not just the booleans):
1. Load your holiday records into a DataFrame and index it on the date. The holidays should be one date per row (storing ranges, like Christmas from the 24th to the 26th, in one row would make it much more complex).
2. Create a copy of your dataframe with just the start and end date columns. UPDATE: every start/end date pair should occur only once in it, e.g. by using groupby.
3. Use merge_asof with a reasonable tolerance value (if you join on the start of the period, use direction='forward'; if you use the end date, use direction='backward'). merge_asof has no how parameter and always performs a left join, so drop the unmatched rows afterwards, e.g. with dropna on the holiday column.
4. As a result you have a merged DataFrame with your start and end columns and the date column from your holiday dataframe. You get only records for which a holiday was found within the given tolerance, but later you can merge this data back with your original DataFrame. You will probably now have duplicates of your original records.
5. Then check the joined holiday for your records by comparing it with the start and end columns, and remove the holidays that are not in between.
6. Sort the dataframe you obtained from step 5 (use something like df.sort_values(['start', 'end', 'holiday'], inplace=True)). Now insert a number column that numbers the holidays within each period (the ones you obtained after step 5) from 1 upwards, starting from 1 for each period. This is necessary in order to use unstack in the next step to get the holidays in columns.
7. Add an index on your dataframe based on period start date, period end date and the count column you inserted in step 6. Use df.unstack(level=-1) on the DataFrame you prepared in the previous steps. What you now have is a condensed DataFrame with your original periods and the holidays arranged column-wise.
8. Now you only have to merge this DataFrame back to your original data using original_df.merge(df_from_step7, left_on=['start', 'end'], right_index=True, how='left').
The result is your original data containing the date ranges, and for each date range the holidays that lie within the period are stored in separate columns after the data. Loosely speaking, the numbering in step 6 assigns the holidays to the columns and ensures that the columns are always filled in order (you wouldn't have a holiday in column 3 if column 1 is empty).
Step 6 is probably also a bit tricky, but you can do it, for example, by adding a series filled with a range and then fixing it so the numbering starts at 0 or 1 in each group, either by using shift or by grouping by start and end with aggregate({'idcol':'min'}) and joining the result back to subtract it from the value assigned by the range sequence.
In all, I think it sounds more complicated than it is, and it should perform quite efficiently. This is especially true if your periods are not that large, because then after step 5 your result set should be much smaller than your original dataframe. Even if that is not the case, it should still be quite efficient, since it can use compiled code.
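For the boolean-only case, the steps above collapse to something quite short. The sketch below is my reading of that simplified path rather than tested production code: it uses merge_asof with direction='forward' to pick the next holiday on or after each start date, then compares it with the end date. The column names (start_date, end_date, holiday, includes_holiday) and the sample rows are illustrative assumptions, and both key columns are cast to the same datetime dtype because merge_asof expects matching key types.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

periods = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-07-14", "2010-12-18", "2013-12-14"]),
    "end_date": pd.to_datetime(["2015-07-31", "2010-12-30", "2013-12-29"]),
}).astype("datetime64[ns]")

# One holiday per row, already sorted by date.
hols = USFederalHolidayCalendar().holidays(start="2010-01-01", end="2016-12-31")
holidays = pd.DataFrame({"holiday": hols}).astype("datetime64[ns]")

# merge_asof needs the left key sorted as well.
periods = periods.sort_values("start_date").reset_index(drop=True)

# For each period, attach the first holiday on or after its start date.
merged = pd.merge_asof(
    periods, holidays,
    left_on="start_date", right_on="holiday",
    direction="forward",
)

# Rows without a match hold NaT, which compares as False here.
merged["includes_holiday"] = merged["holiday"] <= merged["end_date"]
print(merged)
```

A tolerance argument could be added to cap how far ahead merge_asof looks, which mirrors the "reasonable tolerance value" mentioned in step 3.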

Pandas TimeGrouper: Drop "non full groups"

I'm grouping my data on some frequency, but it appears that TimeGrouper creates a last group on the right for some "left over" data.
df.groupby([pd.TimeGrouper("2AS", label='left')]).sum()['shopping'].plot()
I expect the data to be fairly constant over time, but the last data point at 2013 drops by almost half. I suspect this happens because, with two-year grouping, the second half of the last group (2014) is missing.
rolling_mean allows center=True, which will put NaN/drop remainders on the left and right. Is there a similar feature for the Grouper? I couldn't find any in the manual, but perhaps there is a workaround?
I don't think the issue here really concerns options available with TimeGrouper, but rather, how you want to deal with uneven data. You basically have 4 options that I can think of:
1) Drop enough observations (at the start or end) such that you have a multiple of 2 years worth of observations.
2) Extrapolate your starting (or ending) period such that it is comparable to the periods with complete data.
3) Normalize your data to 2 year sums based on underlying time periods of less than 2 years. This approach could be combined with the other two.
4) Instead of a groupby sort of approach, just do a rolling_sum.
Example dataframe:
rng = pd.date_range('1/1/2010', periods=60, freq='1m')
df = pd.DataFrame({ 'shopping' : np.random.choice(12,60) }, index=rng )
I just made the example data set with 5 years of data starting on Jan 1, so if you did this on an annual basis, you'd be done.
df.groupby([pd.TimeGrouper("AS", label='left')]).sum()['shopping']
Out[206]:
2010-01-01 78
2011-01-01 60
2012-01-01 76
2013-01-01 51
2014-01-01 60
Freq: AS-JAN, Name: shopping, dtype: int64
Here's your problem in table form, with the first 2 groups based on 2 years of data but the third group based on only 1 year of data.
df.groupby([pd.TimeGrouper("2AS", label='left')]).sum()['shopping']
Out[205]:
2010-01-01 138
2012-01-01 127
2014-01-01 60
Freq: 2AS-JAN, Name: shopping, dtype: int64
If you take approach (1) above, you just need to drop some observations. It's very easy to drop the later observations and retype the same command. It's a little trickier to drop the earlier observations because then your first observation doesn't begin on Jan 1 of an even year and you lose the automatic labelling and such. Here's an approach that will drop the first year and keep the last 4, but you lose the nice labelling (you can compare with annual data above to verify that this is correct):
In [202]: df2 = df[12:].copy()
In [203]: df2['group24'] = (np.arange( len(df2) ) / 24 ).astype(int)
In [204]: df2.groupby('group24').sum()['shopping']
Out[204]:
group24
0 136
1 111
Alternatively, let's try approach (2), extrapolating. To do this, just replace sum() with mean() and multiply by 24. For the last period, this just means we assume that the 60 in 2014 will be equaled by another 60 in 2015. Whether or not this is reasonable will be a judgement call for you to make and you'd probably want to label with an asterisk and call it an estimate.
df.groupby([pd.TimeGrouper("2AS")]).mean()['shopping']*24
Out[208]:
2010-01-01 138
2012-01-01 127
2014-01-01 120
Freq: 2AS-JAN, Name: shopping, dtype: float64
Also keep in mind this is just one simple (probably simplistic) way you could extrapolate at the end of the period. Whether this is the best way to do it (or whether it makes sense to extrapolate at all) is a judgement call for you to make depending on the situation.
Next, you could take approach (3) and do some sort of normalization. I'm not sure exactly what you want, so I'll just sketch the ideas. If you want to display two year sums, you could just use the earlier example of replacing "2AS" with "AS" and then multiply by 2. This basically makes the table look wrong, but would be a really simple way to make the graph look OK.
Finally, just use rolling sum:
pd.rolling_sum(df.shopping,window=24)
Doesn't table well, but would plot well.

Resample daily pandas timeseries with start at time other than midnight [duplicate]

I have a pandas timeseries of 10-min frequency data and need to find the maximum value in each 24-hour period. However, this 24-hour period needs to start each day at 5AM, not at the default midnight which pandas assumes.
I've been checking out DateOffset but so far am drawing blanks. I might have expected something akin to pandas.tseries.offsets.Week(weekday=n), e.g. pandas.tseries.offsets.Week(hour=5), but this is not supported as far as I can tell.
I can do a nasty work around by shifting the data first, but it's unintuitive and even coming back to the same code after just a week I have problems wrapping my head around the shift direction!
Any more elegant ideas would be much appreciated.
The base keyword can do the trick (see docs):
s.resample('24h', base=5)
Eg:
In [35]: idx = pd.date_range('2012-01-01 00:00:00', freq='5min', periods=24*12*3)
In [36]: s = pd.Series(np.arange(len(idx)), index=idx)
In [38]: s.resample('24h', base=5)
Out[38]:
2011-12-31 05:00:00 29.5
2012-01-01 05:00:00 203.5
2012-01-02 05:00:00 491.5
2012-01-03 05:00:00 749.5
Freq: 24H, dtype: float64
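In newer pandas releases the base keyword is no longer available, and resample returns an object that needs an explicit aggregation. A rough equivalent, assuming a pandas version that supports the offset argument of resample, might look like this (10-minute data and max() are used here to match the question):

```python
import numpy as np
import pandas as pd

# Three days of 10-minute data starting at midnight.
idx = pd.date_range("2012-01-01 00:00:00", freq="10min", periods=24 * 6 * 3)
s = pd.Series(np.arange(len(idx)), index=idx)

# 24-hour bins that start at 05:00 instead of midnight.
daily_max = s.resample("24h", offset="5h").max()
print(daily_max)
```

The first bin is labelled with 05:00 on the previous day because it holds the observations from midnight up to 04:50.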
I've just spotted an answered question which didn't come up on Google or Stack Overflow previously:
Resample hourly TimeSeries with certain starting hour
This uses the base parameter, which looks like an addition subsequent to Wes McKinney's Python for Data Analysis. I've given the parameter a go and it seems to do the trick.

Ensure consistent time indexing in statsmodels predict

I am trying to fit an AR(1) model to a Pandas time series and project forward. The data is annual with each year starting at 1 April. When I use statsmodels.tsa.ar_model.AR.predict to forecast from the estimated model the output is a Pandas time series with the annual forecasts centred on 31 December.
Code:
mod1 = sm.tsa.AR(ser['1972-01-04':'2007-01-04'], freq='A')
res1 = mod1.fit(order=1)
fcast1 = res1.predict('2007-01-04', '2018-01-04')
print(fcast1)
Output:
2007-12-31 988.121031
2008-12-31 1035.640294
2009-12-31 1081.584720
...
Can I get the predict method to create a time series indexed on 1 April, or do I have to re-index the forecast series after creating it? I'd like to be able to compare it to other series in the dataframe so the indexing is quite important.
Thanks for your help!
No, not at the moment, but you should be able to in the next release. The fix should be fairly trivial. The pandas time-series stuff is relatively new compared to when I wrote the TSA infrastructure, and I just haven't had a chance to catch up. Too much to do.
https://github.com/statsmodels/statsmodels/issues/319
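In the meantime, re-indexing the forecast after the fact is straightforward in pandas. The sketch below is only an illustration: it rebuilds the first three forecast values from the question by hand instead of calling statsmodels, and it assumes that each 31 December label should map to 1 April of the same calendar year. If your fiscal years are labelled differently, adjust the year accordingly.

```python
import pandas as pd

# Forecast values copied from the question, indexed at year end.
fcast = pd.Series(
    [988.121031, 1035.640294, 1081.584720],
    index=pd.to_datetime(["2007-12-31", "2008-12-31", "2009-12-31"]),
)

# Assumption: the label for year Y should become 1 April of year Y.
fcast.index = pd.DatetimeIndex(
    [pd.Timestamp(year=ts.year, month=4, day=1) for ts in fcast.index]
)
print(fcast)
```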

A Multi-Index Construction for Intraday TimeSeries (10 min price data)

I have a file with intraday prices every ten minutes, i.e. 42 times per day (numbered 0 to 41), so each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the difference of the ten-minute prices, restricting differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last: I cannot overlap differences between days of price from 16:20 - 09:30. The differences should start as 09:40 - 09:30 and end as 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded,data):
    '''This function accepts a column called rounded from 'data'
    The 2nd input 'data' is a dataframe
    '''
    df=rounded.shift(1)
    idf=data.set_index(['date', 'time'])
    data['diff']=['000']
    for i in range(0,length(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1]!=1620:
                    data['diff']=rounded[i]-df[i]
                else:
                    day+=1
                    time+=2
    data[['date','time','price','II','diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data=read_csv('file.csv')
rounded=roundSeries(data['price'],5)
diffSeries(rounded,data)
The traceback shows an AssertionError.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import datetime, time
# Create date range with 10 minute intervals, and filter out irrelevant times
times = pd.bdate_range(start=datetime(2012,11,14,0,0,0), end=datetime(2012,11,17,0,0,0), freq='10min')
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
prices = randn(len(filtered_times))
# Create MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times],
          [x.time() for x in filtered_times]]
tuples = list(zip(*arrays))
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
prices
date time
2012-11-14 09:30:00 0.696054
09:40:00 -1.263852
09:50:00 0.196662
10:00:00 -0.942375
10:10:00 1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
prices
date time
2012-11-14 09:30:00 NaN
09:40:00 -1.959906
09:50:00 1.460514
10:00:00 -1.139036
10:10:00 2.857582
Since you are grouping by the date, the function is not applied for 16:20 - 09:30.
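For reference, the same per-day differencing can also be written with groupby on the index level plus diff(), which avoids the lambdas. This is a small sketch on made-up prices (two days, three times each), not your actual data; the names index and data are illustrative.

```python
from datetime import date, time

import pandas as pd

# Tiny stand-in for the (date, time) indexed price data.
index = pd.MultiIndex.from_product(
    [[date(2012, 11, 14), date(2012, 11, 15)],
     [time(9, 30), time(9, 40), time(9, 50)]],
    names=["date", "time"],
)
data = pd.DataFrame(
    {"prices": [10.0, 10.5, 10.25, 11.0, 10.75, 11.5]}, index=index
)

# Difference from the previous row, restarting at the first time of each date.
data["diff"] = data.groupby(level="date")["prices"].diff()
print(data)
```

The first row of every date is NaN, so no difference is taken across the overnight gap.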
You might want to consider using a TimeSeries instead of a DataFrame, because it will give you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can easily convert it into a TimeSeries and perform a similar function to get the differences:
from datetime import datetime
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(i.date,i.time) for i in data.index])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions such as resampling. See here for more about time series in pandas.
@MattiJohn's construction gives a filtered list of length 86,772 when run over 1/3/2007-8/30/2012 with 42 times per day (10-minute intervals). Observe the data cleaning issues.
Here the data of prices coming from the csv is length: 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
DF=pd.read_csv('MR10min.csv')
prices = DF.price
# I.E. rather than the generic: prices = randn(len(filtered_times)) above.
The fact that the real data falls short of the length it "should be" means there are data cleaning issues. Often we do not have the full set of times that bdate_range will generate (half days in the market, holidays, etc.).
Your solution is elegant. But I am not sure how to overcome the mismatch between the actual data and the a priori, prescribed dataframe.
Your second TimeSeries suggestion still seems to require constructing a datetime index similar to the first one. For example, if I were to use the following lines to get the actual data of interest:
DF=pd.read_csv('MR10min.csv')
data=DF.set_index(['date','time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one make a bdate_range array completely informed by the actual data available?
Thank you to @MattiJohn and to anyone interested in continuing this discussion.
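One possible way around the TypeError is to skip the generated range entirely and build the datetime index from the strings in the file, so that only the timestamps that actually exist are used. The sketch below is a guess at the file layout, not the real MR10min.csv: it assumes dates like 2012-11-14 and times stored as numbers like 930 or 1620, and uses a small inline CSV as a stand-in.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV; the real column formats may differ.
csv_text = """date,time,price
2012-11-14,930,10.0
2012-11-14,940,10.5
2012-11-15,930,11.0
2012-11-15,940,10.75
"""

# Read date and time as strings so they can be combined and parsed together.
DF = pd.read_csv(io.StringIO(csv_text), dtype={"date": str, "time": str})

# Pad times such as "930" to "0930", then parse date and time in one step.
stamp = DF["date"] + " " + DF["time"].str.zfill(4)
DF.index = pd.DatetimeIndex(pd.to_datetime(stamp, format="%Y-%m-%d %H%M"))

# Differences within each calendar date, using only the rows that exist.
ts = DF["price"]
diffs = ts.groupby(ts.index.date).diff()
print(diffs)
```

Half days and holidays then need no special handling, because no row is ever created for a time that is missing from the file.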
