Pandas resample MultiIndex dataframe with forward fill - python

I am trying to resample a MultiIndex dataframe to a less granular frequency (daily to month end) by taking the last valid daily observation in every month.
For example, given the dataframe below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': [pd.to_datetime('2012-03-29')]*4
                           + [pd.to_datetime('2012-03-30')]*4
                           + [pd.to_datetime('2012-04-01')]*4,
                   'groups': [1, 2, 3, 4]*3,
                   'values': np.random.normal(size=12)})
df = df.set_index(['date', 'groups'])
                     values
date       groups
2012-03-29 1       0.013681
           2       0.359522
           3      -0.525454
           4      -0.282541
2012-03-30 1       0.155501
           2      -1.053596
           3       0.003049
           4      -0.165875
2012-04-01 1      -0.049135
           2       2.701785
           3       2.240875
           4       0.057297
The desired final dataframe is:
                     values
date       groups
2012-03-31 1       0.155501
           2      -1.053596
           3       0.003049
           4      -0.165875
In a regular dataframe (with a single index), the desired output can be achieved with df.asfreq('M', method='ffill'), as shown below.
df = pd.DataFrame({'date': [pd.to_datetime('2012-03-29')]
                           + pd.date_range('2012-04-01', '2012-04-04').to_list(),
                   'values': np.random.normal(size=5)})
df = df.set_index('date')
df_monthly = df.asfreq('M', method='ffill')
Where df is:
              values
date
2012-03-29  1.988554
2012-04-01 -1.054163
2012-04-02 -1.112537
2012-04-03  0.224515
2012-04-04  0.152175
and df_monthly is:
              values
date
2012-03-31  1.988554
Any help is much appreciated. Thanks in advance.

Use:
df_monthly = (df.reset_index(level=1)
                .groupby('groups')[['values']]
                .apply(lambda x: x.asfreq('M', method='ffill'))
                .swaplevel(1, 0))
print(df_monthly)
                     values
date       groups
2012-03-31 1      -2.951662
           2      -1.495653
           3      -0.948413
           4       0.066219
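For reference, here is a self-contained sketch of an alternative that takes the last daily observation per (month, group) with a plain groupby, which sidesteps the month-end frequency alias ('M' vs. 'ME') differences between pandas versions. Note that unlike asfreq with ffill, this keeps only months that actually contain data and does not forward-fill empty months. The fixed seed is only there to make the example reproducible.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'date': [pd.to_datetime('2012-03-29')]*4
                           + [pd.to_datetime('2012-03-30')]*4
                           + [pd.to_datetime('2012-04-01')]*4,
                   'groups': [1, 2, 3, 4]*3,
                   'values': np.random.normal(size=12)})
df = df.set_index(['date', 'groups'])

# Group by (month period, group) and keep the last daily observation
month = df.index.get_level_values('date').to_period('M')
groups = df.index.get_level_values('groups')
monthly = df.groupby([month, groups]).last()
monthly.index.names = ['date', 'groups']
print(monthly)
```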

Related

Subtract one datetime column after a groupby with a time reference for each group from a second Pandas dataframe

I have one dataframe df1 with one admissiontime for each id.
id admissiontime
1 2117-04-03 19:15:00
2 2117-10-18 22:35:00
3 2163-10-17 19:15:00
4 2149-01-08 15:30:00
5 2144-06-06 16:15:00
And another dataframe df2 with several datetimes for each id:
id datetime
1 2135-07-28 07:50:00.000
1 2135-07-28 07:50:00.000
2 2135-07-28 07:57:15.900
3 2135-07-28 07:57:15.900
3 2135-07-28 07:57:15.900
I would like to subtract, for each id, its specific admissiontime from each datetime, in a new column of the second dataframe.
I think I have to use df2.groupby('id')['datetime'] - something, but I struggle to connect it with df1.
Use Series.sub, mapping the admission times from the other DataFrame onto each row's id with Series.map:
df1['admissiontime'] = pd.to_datetime(df1['admissiontime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])
df2['diff'] = df2['datetime'].sub(df2['id'].map(df1.set_index('id')['admissiontime']))
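A self-contained sketch of this approach, using hypothetical sample data shaped like the question's:

```python
import pandas as pd

# Hypothetical sample data mirroring the question's layout
df1 = pd.DataFrame({'id': [1, 2],
                    'admissiontime': pd.to_datetime(['2117-04-03 19:15:00',
                                                     '2117-10-18 22:35:00'])})
df2 = pd.DataFrame({'id': [1, 1, 2],
                    'datetime': pd.to_datetime(['2117-04-04 07:50:00',
                                                '2117-04-05 07:50:00',
                                                '2117-10-19 00:35:00'])})

# Map each row's id to its admission time, then subtract elementwise
df2['diff'] = df2['datetime'].sub(df2['id'].map(df1.set_index('id')['admissiontime']))
print(df2)
```

The result is a Timedelta column: each row holds how long after admission that id's event occurred.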

Create new Row in Data Frame with ID and date if ID and date do not exist in "x" timeframe [duplicate]

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()
In the above code, idx becomes a range of, say, 30 dates: 09-01-2013 to 09-30-2013.
However, s may only have 25 or 26 entries, because no events happened on some dates. I then get an AssertionError, as the sizes don't match when I try to plot:
fig, ax = plt.subplots()
ax.bar(idx.to_pydatetime(), s, color='green')
What's the proper way to tackle this? Should I remove dates with no values from idx, or (which I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?
Here's a snippet of s (df.groupby(['simpleDate']).size()); notice there are no entries for the 4th and 5th:
09-02-2013 2
09-03-2013 10
09-06-2013 5
09-07-2013 1
You could use Series.reindex:
import pandas as pd
idx = pd.date_range('09-01-2013', '09-30-2013')
s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
print(s)
yields
2013-09-01 0
2013-09-02 2
2013-09-03 10
2013-09-04 0
2013-09-05 0
2013-09-06 5
2013-09-07 1
2013-09-08 0
...
A quicker workaround is to use .asfreq(). This doesn't require creating a new index to pass to .reindex().
# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'),
                  pd.Timestamp('2012-05-04'),
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)
print(s.asfreq('D'))
2012-05-01 1.0
2012-05-02 NaN
2012-05-03 NaN
2012-05-04 2.0
2012-05-05 NaN
2012-05-06 3.0
Freq: D, dtype: float64
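Since the OP wanted zeros rather than NaN, note that .asfreq() also accepts a fill_value for the gaps it introduces. A minimal sketch:

```python
import pandas as pd

# Same staggered dates as above
dates = pd.Index([pd.Timestamp('2012-05-01'),
                  pd.Timestamp('2012-05-04'),
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)

# fill_value is applied only to the new slots created by the upsampling
filled = s.asfreq('D', fill_value=0)
print(filled)
```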
One issue is that reindex will fail if there are duplicate values. Say we're working with timestamped data, which we want to index by date:
df = pd.DataFrame({
    'timestamps': pd.to_datetime(['2016-11-15 1:00', '2016-11-16 2:00',
                                  '2016-11-16 3:00', '2016-11-18 4:00']),
    'values': ['a', 'b', 'c', 'd']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-18 "2016-11-18 04:00:00" d
Due to the duplicate 2016-11-16 date, an attempt to reindex:
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)
fails with:
...
ValueError: cannot reindex from a duplicate axis
(this means the index being reindexed contains duplicates, not that the new index does)
Instead, we can use .loc to look up entries for all dates in range:
df.loc[all_days]
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-17 NaN NaN
2016-11-18 "2016-11-18 04:00:00" d
fillna can be used on the column series to fill blanks if needed.
An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:
df.resample('D').mean()
resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.
Here is the original data, but with an extra entry for '2013-09-03':
val
date
2013-09-02 2
2013-09-03 10
2013-09-03 20 <- duplicate date added to OP's data
2013-09-06 5
2013-09-07 1
And here are the results:
val
date
2013-09-02 2.0
2013-09-03 15.0 <- mean of original values for 2013-09-03
2013-09-04 NaN <- NaN b/c date not present in orig
2013-09-05 NaN <- NaN b/c date not present in orig
2013-09-06 5.0
2013-09-07 1.0
I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
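To make the options above concrete, here is a minimal sketch using the OP's counts plus the duplicate 2013-09-03 row:

```python
import pandas as pd

# OP's counts, with an extra entry for 2013-09-03 (duplicate date)
s = pd.Series([2, 10, 20, 5, 1],
              index=pd.to_datetime(['2013-09-02', '2013-09-03', '2013-09-03',
                                    '2013-09-06', '2013-09-07']))

daily = s.resample('D').mean()   # duplicates averaged: 2013-09-03 becomes 15.0
zeros = daily.fillna(0)          # missing dates become 0, as the OP requested
smooth = daily.interpolate()     # or fill the gaps linearly from neighbors
print(zeros)
```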
Here's a convenient function for filling missing dates into a dataframe, with your choice of fill_value, how many days_back to fill in, and the sort order (date_order) by which to sort the dataframe:
from datetime import datetime, timedelta

import pandas as pd

def fill_in_missing_dates(df, date_col_name='date', date_order='asc',
                          fill_value=0, days_back=30):
    df.set_index(date_col_name, drop=True, inplace=True)
    df.index = pd.DatetimeIndex(df.index)
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq='D')
    df = df.reindex(idx, fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)
    return df
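A usage sketch, with the function restated so the example is self-contained. The sample dates are hypothetical and chosen relative to today, since the function anchors its range at datetime.now():

```python
from datetime import datetime, timedelta

import pandas as pd

def fill_in_missing_dates(df, date_col_name='date', date_order='asc',
                          fill_value=0, days_back=30):
    df.set_index(date_col_name, drop=True, inplace=True)
    df.index = pd.DatetimeIndex(df.index)
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq='D')
    df = df.reindex(idx, fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)
    return df

# Two events in the last few days; the rest of the window is filled with 0
today = pd.Timestamp(datetime.now().date())
df = pd.DataFrame({'date': [today - pd.Timedelta(days=3), today],
                   'value': [7, 9]})
filled = fill_in_missing_dates(df, days_back=5)
print(filled)
```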
You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.
# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
    'date': pd.to_datetime(['2022-02-10',
                            '2022-02-11',
                            '2022-02-14',
                            '2022-02-14',
                            '2022-02-24',
                            '2022-02-16']),
    'value': [10, 20, 5, 10, 15, 30]})
# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])
# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
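Completing that example: the left join keeps duplicate dates (2022-02-14 appears twice) and introduces NaN for the gaps, which can then be zero-filled if desired:

```python
import pandas as pd

missing_df = pd.DataFrame({
    'date': pd.to_datetime(['2022-02-10', '2022-02-11', '2022-02-14',
                            '2022-02-14', '2022-02-24', '2022-02-16']),
    'value': [10, 20, 5, 10, 15, 30]})

# all dates between min and max, then left-join the sparse data onto it
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(),
                                       missing_df['date'].max()),
                         columns=['date'])
new_df = all_dates.merge(right=missing_df, how='left', on='date')

# optionally replace the NaNs introduced for the missing dates
new_df['value'] = new_df['value'].fillna(0)
print(new_df)
```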
Another short option is to upsample to daily frequency, interpolate, and then downsample:
s.asfreq('D').interpolate().asfreq('Q')

Transposing Multiple dates by sorting min to max based on unique ID and appending to DataFrame in python

Given Data is
id  date
1   10/20/2019
2   11/02/2019
3   12/12/2019
1   02/06/2019
1   05/14/2018
3   5/13/2019
2   07/20/2018
3   08/23/2019
2   06/25/2018
I want in This format
id  date1       date2       date3
1   05/14/2018  02/06/2019  10/20/2019
2   06/25/2018  07/20/2018  11/02/2019
3   05/13/2019  08/23/2019  12/12/2019
I am using a for loop to implement this on 400,000+ unique ids, and it is time-consuming. Is there an easier method?
I am using the code below. Each policy number has multiple DATEs, and I want them arranged from min to max in a row, in separate columns, as in the second table above.
f = pd.DataFrame()
for i in range(0, len(uni_pol)):
    d = ct.loc[ct['Policy_no'] == uni_pol[i]]
    t = d.sort_values('DATE', ascending=True).T
    df = pd.DataFrame(t)
    a = df.loc['Policy_no']
    col = df.columns
    df['Policy_no'] = a.loc[col[0]]
    for j in range(0, len(col)):
        nn = str(j + 1)
        name = 'Paydt' + nn
        df[name] = df[col[j]]
        cc = col[j]
        df = df.drop([cc], axis=1)
    f = f.append(df.loc['DATE'])
Here's one approach:
sort_values by "date"; then groupby "id" and create a list from dates; this builds a Series. Then create a DataFrame from the lists in the Series:
df['date'] = pd.to_datetime(df['date'])
s = df.sort_values(by='date').groupby('id')['date'].agg(list)
out = pd.DataFrame(s.tolist(), index=s.index,
                   columns=[f'date{i}' for i in range(1, len(s.iat[0]) + 1)]).reset_index()
Output:
id date1 date2 date3
0 1 2018-05-14 2019-02-06 2019-10-20
1 2 2018-06-25 2018-07-20 2019-11-02
2 3 2019-05-13 2019-08-23 2019-12-12
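A self-contained version of this answer, reconstructing the question's data (assuming month/day/year date strings, so the format is passed explicitly):

```python
import pandas as pd

# Reconstructed sample data from the question
df = pd.DataFrame({'id': [1, 2, 3, 1, 1, 3, 2, 3, 2],
                   'date': ['10/20/2019', '11/02/2019', '12/12/2019',
                            '02/06/2019', '05/14/2018', '5/13/2019',
                            '07/20/2018', '08/23/2019', '06/25/2018']})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

# Sort once, collect each id's dates into an ordered list, then spread
# those lists into date1..dateN columns
s = df.sort_values(by='date').groupby('id')['date'].agg(list)
out = pd.DataFrame(s.tolist(), index=s.index,
                   columns=[f'date{i}' for i in range(1, len(s.iat[0]) + 1)])
out = out.reset_index()
print(out)
```

This is vectorized per group, so it avoids the per-id Python loop entirely.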

Pandas get the Month Ending Values from Series

I need to get the month-end balance from a series of entries.
Sample data:
          date   contrib   totalShrs
0   2009-04-23   5220.00   10000.000
1   2009-04-24  10210.00   20000.000
2   2009-04-27  16710.00   30000.000
3   2009-04-30  22610.00   40000.000
4   2009-05-05  28909.00   50000.000
5   2009-05-20  38409.00   60000.000
6   2009-05-28  46508.00   70000.000
7   2009-05-29  56308.00   80000.000
8   2009-06-01  66108.00   90000.000
9   2009-06-02  78108.00  100000.000
10  2009-06-12  86606.00  110000.000
11  2009-08-03  95606.00  120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like groupby.
Or would I have to do something like iterrows: find all the entries for each month, order them by date, and pick the last one?
Thanks.
Use Grouper with GroupBy.last, forward-filling missing values with ffill, then Series.reset_index:
# if necessary
# df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='m', key='date'))['totalShrs'].last().ffill().reset_index()
# alternative
# df = df.resample('m', on='date')['totalShrs'].last().ffill().reset_index()
print(df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
The following gives you the information you want, i.e. end-of-month values, though the format is not exactly what you asked for:
df['month'] = df['date'].str.split('-', expand=True)[1]  # split date column to get month column
newdf = pd.DataFrame(columns=df.columns)                 # create a new dataframe for output
grouped = df.groupby('month')                            # get grouped values
for g in grouped:                                        # for each group, take the last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf), :] = gdf.iloc[-1, :]           # fill new dataframe with that last row
newdf = newdf.drop('date', axis=1)                       # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08

Grouping by column groups on a data frama in python pandas

I have a data frame with columns for every month of every year from 2000 to 2016
df.columns
output
Index(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06',
       '2000-07', '2000-08', '2000-09', '2000-10',
       ...
       '2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
       '2016-05', '2016-06', '2016-07', '2016-08'],
      dtype='object', length=200)
and I would like to group over these column by quarters.
I have made a dictionary, believing the best method would be to use groupby and then aggregate with the mean:
m2q = {'2000q1': ['2000-01', '2000-02', '2000-03'],
       '2000q2': ['2000-04', '2000-05', '2000-06'],
       '2000q3': ['2000-07', '2000-08', '2000-09'],
       ...
       '2016q2': ['2016-04', '2016-05', '2016-06'],
       '2016q3': ['2016-07', '2016-08']}
but
df.groupby(m2q)
is not giving me the desired output.
In fact, it's giving me an empty grouping.
Any suggestions to make this grouping work?
Or perhaps a more pythonian solution to categorize by quarters taking the mean of the specified columns?
You can convert your index to a DatetimeIndex (example 1) or a PeriodIndex (example 2).
See also the Time Series / Date functionality section of the pandas docs for more detail.
import numpy as np
import pandas as pd
idx = ['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06',
       '2000-07', '2000-08', '2000-09', '2000-10', '2000-11', '2000-12']
df = pd.DataFrame(np.arange(12), index=idx, columns=['SAMPLE_DATA'])
print(df)
SAMPLE_DATA
2000-01 0
2000-02 1
2000-03 2
2000-04 3
2000-05 4
2000-06 5
2000-07 6
2000-08 7
2000-09 8
2000-10 9
2000-11 10
2000-12 11
# Handle your timeseries data with pandas timeseries / date functionality
df.index=pd.to_datetime(df.index)
example 1
print(df.resample('Q').sum())
SAMPLE_DATA
2000-03-31 3
2000-06-30 12
2000-09-30 21
2000-12-31 30
example 2
print(df.to_period('Q').groupby(level=0).sum())
SAMPLE_DATA
2000Q1 3
2000Q2 12
2000Q3 21
2000Q4 30
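Since the question asks for the mean rather than the sum, the same grouping works with mean(); and if the months are columns as in the question, transpose with df.T first so they become the index. A sketch along the lines of example 2:

```python
import numpy as np
import pandas as pd

idx = ['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06',
       '2000-07', '2000-08', '2000-09', '2000-10', '2000-11', '2000-12']
df = pd.DataFrame(np.arange(12), index=idx, columns=['SAMPLE_DATA'])
df.index = pd.to_datetime(df.index)

# Same idea as example 2, but aggregating with the mean per quarter
quarterly = df.to_period('Q').groupby(level=0).mean()
print(quarterly)
```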
