Business days between two columns of dates with Pandas Groupby - python

I have a DataFrame in Pandas with a letter and two dates as columns. I would like to calculate the difference between the two date columns for the previous row using shift(1), provided the Letter value is the same (using a groupby). The complex part is that I would like to calculate business days, not just elapsed days. The best way I have found to do that is numpy.busday_count, which takes two lists as arguments. I am essentially trying to use .apply to make each row its own list. Not sure if this is the best way to do it, but I'm running into some errors that are hard to interpret.
import pandas as pd
from datetime import datetime
import numpy as np
# create dataframe
df = pd.DataFrame(data=[['A', datetime(2016,01,07), datetime(2016,01,09)],
                        ['A', datetime(2016,03,01), datetime(2016,03,8)],
                        ['B', datetime(2016,05,01), datetime(2016,05,10)],
                        ['B', datetime(2016,06,05), datetime(2016,06,07)]],
                  columns=['Letter', 'First Day', 'Last Day'])
# convert to dates since pandas reads them in as time series
df['First Day'] = df['First Day'].apply(lambda x: x.to_datetime().date())
df['Last Day'] = df['Last Day'].apply(lambda x: x.to_datetime().date())
df['Gap'] = (df.groupby('Letter')
               .apply(lambda x: (
                   np.busday_count(x['First Day'].shift(1).tolist(),
                                   x['Last Day'].shift(1).tolist())))
               .reset_index(drop=True))
print df
I get the following error on the lambda function. I'm not sure what object it's having problems with as the two passed arguments should be dates:
ValueError: Could not convert object to NumPy datetime
Desired Output:
  Letter   First Day    Last Day  Gap
0      A  2016-01-07  2016-01-09  NaN
1      A  2016-03-01  2016-03-08    1
2      B  2016-05-01  2016-05-10  NaN
3      B  2016-06-05  2016-06-07    7

The following should work (after first removing the leading zeros from the date digits):
df = pd.DataFrame(data=[['A', datetime(2016, 1, 7), datetime(2016, 1, 9)],
                        ['A', datetime(2016, 3, 1), datetime(2016, 3, 8)],
                        ['B', datetime(2016, 5, 1), datetime(2016, 5, 10)],
                        ['B', datetime(2016, 6, 5), datetime(2016, 6, 7)]],
                  columns=['Letter', 'First Day', 'Last Day'])
df['Gap'] = (df.groupby('Letter')
               .apply(lambda x: pd.DataFrame(
                   np.busday_count(x['First Day'].tolist(),
                                   x['Last Day'].tolist())).shift())
               .reset_index(drop=True))
  Letter   First Day    Last Day  Gap
0      A  2016-01-07  2016-01-09  NaN
1      A  2016-03-01  2016-03-08  2.0
2      B  2016-05-01  2016-05-10  NaN
3      B  2016-06-05  2016-06-07  6.0
I don't think you need the .date() conversion.
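As an aside (not from the original answer), a minimal alternative sketch is to shift the two date columns within each Letter group first and then count business days row by row; on this data it gives the same 2/6 gaps as the answer above:
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame(data=[['A', datetime(2016, 1, 7), datetime(2016, 1, 9)],
                        ['A', datetime(2016, 3, 1), datetime(2016, 3, 8)],
                        ['B', datetime(2016, 5, 1), datetime(2016, 5, 10)],
                        ['B', datetime(2016, 6, 5), datetime(2016, 6, 7)]],
                  columns=['Letter', 'First Day', 'Last Day'])

# Shift both date columns within each group so every row sees the previous
# row's interval, then count business days one pair at a time.
prev_first = df.groupby('Letter')['First Day'].shift(1)
prev_last = df.groupby('Letter')['Last Day'].shift(1)
df['Gap'] = [np.busday_count(f.date(), l.date()) if pd.notna(f) else np.nan
             for f, l in zip(prev_first, prev_last)]
print(df)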

How to make new timeseries based on previous value in python pandas?

This is my first question on Stack Overflow, so the formatting might be a bit off. I have a problem that I know how to solve with a for loop in Python. However, I don't know if there is a way in pandas itself that does the same thing faster.
Problem:
Suppose I have a pandas Series 'in' with a date index, where every date has an integer value. There is also a Series 'out' with the same structure.
Ex:
in
date val
2022-12-01 5
2022-12-02 8
2022-12-03 19
out
date val
2022-12-01 3
2022-12-02 7
2022-12-03 21
If I want to make a Series of the number of events being processed each day, I could do it with a for loop where the value for every day is open.iloc[i] = open.iloc[i-1] + in.iloc[i] - out.iloc[i]. The result should be
open
date val
2022-12-01 2 #5-3
2022-12-02 3 #2+8-7
2022-12-03 1 #3+19-21
Is there a way to do this in pandas itself, without the need for a for loop?
new answer
Use cumsum:
ser_open = ser_in.sub(ser_out).cumsum()
Output:
2022-12-01 2
2022-12-02 3
2022-12-03 1
dtype: int64
Used input:
ser_in = pd.Series([5, 8, 19], index=['2022-12-01', '2022-12-02', '2022-12-03'])
ser_out = pd.Series([3, 7, 21], index=['2022-12-01', '2022-12-02', '2022-12-03'])
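As a quick sanity check (not part of the original answer), the explicit loop described in the question gives the same result as the cumsum one-liner on these inputs:
import pandas as pd

ser_in = pd.Series([5, 8, 19], index=['2022-12-01', '2022-12-02', '2022-12-03'])
ser_out = pd.Series([3, 7, 21], index=['2022-12-01', '2022-12-02', '2022-12-03'])

# Loop version of open[i] = open[i-1] + in[i] - out[i], starting from 0
ser_open = pd.Series(0, index=ser_in.index)
prev = 0
for date in ser_in.index:
    prev = prev + ser_in[date] - ser_out[date]
    ser_open[date] = prev

assert ser_open.equals(ser_in.sub(ser_out).cumsum())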
initial answer
Use shift after setting date as index:
out = (df_open
       .set_index('date').shift()
       .add(df_in.set_index('date') - df_out.set_index('date'),
            fill_value=0)
       .reset_index()
       )
Or, for assignment use variant with map:
df_open['val'] = df_open['date'].map(
    df_open
    .set_index('date').shift()
    .add(df_in.set_index('date') - df_out.set_index('date'),
         fill_value=0)
    ['val']
)
Output:
date val
0 2022-12-01 2.0
1 2022-12-02 3.0
2 2022-12-03 1.0
Used inputs:
df_in = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [5, 8, 19]})
df_out = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [3, 7, 21]})
df_open = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [2.0, 3.0, 1.0]})

How to prevent data from being recycled when using pd.merge_asof in Python

I am looking to join two data frames using the pd.merge_asof function. This function allows me to match data on a unique id and/or a nearest key. In this example, I am matching on the id as well as the nearest date that is less than or equal to the date in df1.
Is there a way to prevent the data from df2 being recycled when joining?
This is the code that I currently have that recycles the values in df2.
import pandas as pd
import datetime as dt
df1 = pd.DataFrame({'date': [dt.datetime(2020, 1, 2), dt.datetime(2020, 2, 2), dt.datetime(2020, 3, 2)],
                    'id': ['a', 'a', 'a']})
df2 = pd.DataFrame({'date': [dt.datetime(2020, 1, 1)],
                    'id': ['a'],
                    'value': ['1']})
pd.merge_asof(df1,
              df2,
              on='date',
              by='id',
              direction='backward',
              allow_exact_matches=True)
This is the output that I would like to see instead, where only the first match is successful.
Given your merge direction is backward, you can do a mask on duplicated id and df2's date after merge_asof:
out = pd.merge_asof(df1,
                    df2.rename(columns={'date': 'date1'}),  # rename df2's date
                    left_on='date',
                    right_on='date1',  # so we can work on it later
                    by='id',
                    direction='backward',
                    allow_exact_matches=True)
# mask the value
out['value'] = out['value'].mask(out.duplicated(['id', 'date1']))
# equivalently
# out.loc[out.duplicated(['id', 'date1']), 'value'] = np.nan
Output:
date id date1 value
0 2020-01-02 a 2020-01-01 1
1 2020-02-02 a 2020-01-01 NaN
2 2020-03-02 a 2020-01-01 NaN
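Putting the pieces together, a self-contained sketch of the approach above (dropping the helper date1 column at the end is optional):
import pandas as pd
import datetime as dt

df1 = pd.DataFrame({'date': [dt.datetime(2020, 1, 2), dt.datetime(2020, 2, 2), dt.datetime(2020, 3, 2)],
                    'id': ['a', 'a', 'a']})
df2 = pd.DataFrame({'date': [dt.datetime(2020, 1, 1)],
                    'id': ['a'],
                    'value': ['1']})

# Keep df2's date as a separate column so repeated matches can be identified.
out = pd.merge_asof(df1, df2.rename(columns={'date': 'date1'}),
                    left_on='date', right_on='date1',
                    by='id', direction='backward', allow_exact_matches=True)

# Only the first row matched to a given (id, date1) pair keeps its value.
out['value'] = out['value'].mask(out.duplicated(['id', 'date1']))
print(out.drop(columns='date1'))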

python: what is wrong with my date index?

I have a dataframe that uses dates as index. Although I can read the index values from series.index, I fail to get the corresponding record.
series = pd.DataFrame([[datetime.date(2019,1,1), 'A', 4], [datetime.date(2019,1,2), 'B', 6]], columns = ('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index', values='Value')
index = series2.index[0]
This far, everything works.
But this line of code fails:
row = series[index]
The error message is
KeyError: datetime.date(2019, 1, 1)
Why does it fail, and how can I fix it?
Use .loc for selecting, but on series2, because series has a RangeIndex, not dates:
row = series2.loc[index]
print (row)
Index
A 4.0
B NaN
Name: 2019-01-01, dtype: float64
Details:
print (series)
Date Index Value
0 2019-01-01 A 4
1 2019-01-02 B 6
print (series.index)
RangeIndex(start=0, stop=2, step=1)
print (series2)
Index A B
Date
2019-01-01 4.0 NaN
2019-01-02 NaN 6.0
Add this part after your three lines:
series.set_index('Date', inplace=True)
So, the whole thing is:
import pandas as pd
import datetime
series = pd.DataFrame([[datetime.date(2019,1,1), 'A', 4],
                       [datetime.date(2019,1,2), 'B', 6]],
                      columns=('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index',
                       values='Value')
index = series2.index[0]
series.set_index('Date', inplace=True) # this part was added
series.loc[index]
Out[57]:
Index A
Value 4
Name: 2019-01-01, dtype: object
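If you prefer not to change the index at all, a minimal alternative sketch (applied to the original series, before any set_index call) is to filter on the 'Date' column with a boolean mask:
import datetime
import pandas as pd

series = pd.DataFrame([[datetime.date(2019,1,1), 'A', 4],
                       [datetime.date(2019,1,2), 'B', 6]],
                      columns=('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index', values='Value')
index = series2.index[0]

# Boolean filtering keeps the RangeIndex and returns the matching row(s)
row = series[series['Date'] == index]
print(row)
#          Date Index  Value
# 0  2019-01-01     A      4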

Interval-matching PeriodIndex with DatetimeIndex during merge

Given the below DataFrames, I'd like to add series 'bar' from df_other2 into df2, so that the period of df2 (which I understand as an interval) "matches" the datetime index (not an interval) of df_other2 (also called "period", but is really a datetime). The matching criteria should be that df_other2.period is within df2's period (i.e. date is within the interval).
I was hoping that defining the target index as PeriodIndex and the source index as DatetimeIndex would be sufficient to do the matching, but that doesn't seem to be the case. What alternatives do I have to get this to work?
>>> df = pd.DataFrame({'period': pd.PeriodIndex(['2012-01', '2012-02', '2012-03'], freq='M'), 'foo': [1, 2, 3]})
>>> df2 = df.set_index('period')
>>> df2
         foo
period
2012-01    1
2012-02    2
2012-03    3
>>> df_other = pd.DataFrame({'period': [datetime.datetime(2012, 1, 1), datetime.datetime(2012, 2, 3), datetime.datetime(2012, 3, 10)], 'bar': ['a', 'b', 'c']})
>>> df_other2 = df_other.set_index('period')
>>> df_other2
bar
period
2012-01-01 a
2012-02-03 b
2012-03-10 c
>>> df2['x'] = df_other2['bar']
>>> df2
foo x
period
2012-01 1 NaN # expected x='a' as '2012-1-1' is part of this period
2012-02 2 NaN # expected x='b'
2012-03 3 NaN # expected x='c'
I decided to align all df_other2.period with df2.period (and make them DatetimeIndex) and merge as usual.
I'll wait for better support in future releases:
For regular time spans, pandas uses Period objects for scalar values
and PeriodIndex for sequences of spans. Better support for irregular
intervals with arbitrary start and end points are forth-coming in
future releases.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-stamps-vs-time-spans
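For reference, a minimal sketch of one way to do that alignment (going the other direction: collapsing the timestamps into monthly periods with to_period('M'), assuming at most one timestamp per period):
import datetime
import pandas as pd

df2 = pd.DataFrame({'foo': [1, 2, 3]},
                   index=pd.PeriodIndex(['2012-01', '2012-02', '2012-03'], freq='M', name='period'))
df_other2 = pd.DataFrame({'bar': ['a', 'b', 'c']},
                         index=pd.DatetimeIndex([datetime.datetime(2012, 1, 1),
                                                 datetime.datetime(2012, 2, 3),
                                                 datetime.datetime(2012, 3, 10)], name='period'))

# Collapse each timestamp into its containing monthly period, then align by index
bar_by_period = df_other2['bar'].copy()
bar_by_period.index = bar_by_period.index.to_period('M')
df2['x'] = bar_by_period
print(df2)
#          foo  x
# period
# 2012-01    1  a
# 2012-02    2  b
# 2012-03    3  c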

Pandas select columns with boolean date condition

I'd like to use a boolean index to select columns from a pandas dataframe with a datetime index as the column header:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(4, 6), index=list('ABCD'), columns=dates)
returns:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.173096 0.344348 1.059990 -1.246944 1.624399 -0.276052
B 0.277148 0.965226 -1.301612 -1.264500 -0.124489 1.704485
C -0.375106 0.103812 0.939749 -2.826329 -0.275420 0.664325
D 0.039756 0.631373 0.643565 -1.516543 -0.654626 -1.544038
I'd like to return only the first three columns.
I might do
>>> df.loc[:, df.columns <= datetime(2013, 1, 3)]
2013-01-01 2013-01-02 2013-01-03
A 1.058112 0.883429 -1.939846
B 0.753125 1.664276 -0.619355
C 0.014437 1.125824 -1.421609
D 1.879229 1.594623 -1.499875
You can do vectorized comparisons on the column index directly without using the map/lambda combination.
I had a nice long chat with the duck, and finally realised it was as simple as this:
print df.loc[:, :datetime(2013, 1, 3, 0, 0)]
I love Pandas.
EDIT:
Well, in fact that wasn't exactly what I wanted, because it relies on the 'query' date being present in the column headers. This is actually what I needed:
print df.loc[:, df.columns.map(lambda col: col < datetime(2013, 1, 3, 0, 0))]
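Putting the vectorized comparison together as a self-contained snippet (Python 3 print syntax; the values will differ because the data is random):
import numpy as np
import pandas as pd
from datetime import datetime

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(4, 6), index=list('ABCD'), columns=dates)

# Compare the DatetimeIndex column header directly; no map/lambda needed
mask = df.columns <= datetime(2013, 1, 3)
print(df.loc[:, mask])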
