Complex pivot and resample - python

I'm not sure where to start with this so apologies for my lack of an attempt.
This is the initial shape of my data:
df = pd.DataFrame({
'Year-Mth': ['1900-01'
'Category': ['A','A','B','B','B','B','B'],
'SubCategory': ['X','Y','Y','Y','Z','Q','Y'],
'counter': [1,1,1,1,1,1,1]
This is the result I'd like to get to - the Mth-Year in the below has been resampled to 4 year buckets:
If possible I'd like to do this via a process that makes 'Year-Mth' resamplable - so I can easily switch to different buckets.

Here's my attempt:
df['Year'] = pd.cut(df['Year-Mth'].str[:4].astype(int),
                    bins=np.arange(1900, 1920, 5), right=False)
df.pivot_table(index=['SubCategory', 'Year'], columns='Category',
               values='counter', aggfunc='sum').dropna(how='all').fillna(0)
Category                    A    B
SubCategory Year
Q           [1910, 1915)  0.0  1.0
X           [1900, 1905)  1.0  0.0
Y           [1900, 1905)  1.0  2.0
            [1910, 1915)  0.0  1.0
Z           [1900, 1905)  0.0  1.0
The year column is not parameterized as pandas (or numpy) does not offer a cut option with step size, as far as I know. But I think it can be done with a little arithmetic on minimums/maximums. Something like:
df['Year'] = pd.to_datetime(df['Year-Mth']).dt.year
df['Year'] = pd.cut(df['Year'],
                    bins=np.arange(df['Year'].min(), df['Year'].max() + 5, 5),
                    right=False)
This wouldn't create nice bins like Excel does, though.
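The "little arithmetic" idea can also be done without pd.cut, by flooring each year to the start of its bucket. The following is only a sketch: the sample rows, the helper name bucket_years and its origin parameter are made up for illustration (the frame in the question is truncated), and the bucket width is a plain argument so it can be switched easily.

```python
import pandas as pd

# Hypothetical sample data; the original frame is truncated in the question.
df = pd.DataFrame({
    'Year-Mth': ['1900-01', '1901-02', '1903-06', '1912-11'],
    'Category': ['A', 'B', 'B', 'B'],
    'SubCategory': ['X', 'Y', 'Y', 'Q'],
    'counter': [1, 1, 1, 1],
})

def bucket_years(year_mth, width, origin=1900):
    """Return the first year of the `width`-year bucket each row falls into."""
    years = pd.to_datetime(year_mth).dt.year
    return origin + ((years - origin) // width) * width

# Change `width` to switch to different bucket sizes.
df['Bucket'] = bucket_years(df['Year-Mth'], width=5)
result = df.pivot_table(index=['SubCategory', 'Bucket'], columns='Category',
                        values='counter', aggfunc='sum', fill_value=0)
print(result)
```

Because the buckets are anchored at origin rather than at the data minimum, they stay on round years, which is closer to what a spreadsheet grouping would give.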

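Alternatively, with resample (the how= argument has been removed from pandas, so the aggregation is called as a method):
cols = [df.SubCategory, pd.to_datetime(df['Year-Mth']), df.Category]
df1 = df.set_index(cols).counter
df1.unstack('Year-Mth').T.resample('60M').sum().stack(0).swaplevel(0, 1).sort_index().fillna('')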


Upsampling and dividing data in pandas

I am trying to upsample a pandas datetime-indexed dataframe, so that resulting data is equally divided over the new entries.
For instance, let's say I have a dataframe which stores a cost each month, and I want to get a dataframe which summarizes the equivalent costs per day for each month:
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31],
                    [pd.to_datetime('2023-02-01'), 14]],
                   columns=['time', 'cost'])
      .set_index('time'))
Daily costs are 1$ (or whatever currency you like) in January, and 0.5$ in February. My goal in picture:
After a lot of struggle, I arrived at the following code snippet, which seems to do what I want:
from datetime import datetime
from dateutil.relativedelta import relativedelta
# add a value to perform a correct resampling
df.loc[df.index.max() + relativedelta(months=1)] = 0
# forward-fill over the right scale
# then divide each entry per the number of rows in the month
df = (df
      .resample('D')
      .ffill()
      .iloc[:-1]
      .groupby(lambda x: datetime(x.year, x.month, 1))
      .transform(lambda x: (x / x.count())))
However, this is not entirely satisfactory:
using transform forces me to have dataframes with a single column;
I need to hardcode my original frequency several times in different formats (when adding an extra value at the end of the dataframe, and in the groupby), which makes it hard to wrap in a function;
it only works with an evenly spaced datetime index (even if that's ok in my case);
it remains complex.
Does anyone have a suggestion to improve this code snippet?
What if we took df's month indices and expanded them into day ranges, dividing each of df's values by the number of days in its month and assigning the result to each day, all with list comprehensions (edit: for equally distributed values per day):
import pandas as pd
# initial DataFrame
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31],
                    [pd.to_datetime('2023-02-01'), 14]],
                   columns=['time', 'cost'])
      .set_index('time'))
# reformat to months
df.index = df.index.strftime('%m-%Y')
df1 = pd.concat(  # concatenate the resulting DataFrames into one
    [pd.DataFrame(  # make a DataFrame from a row in df
        [v / pd.Period(i).days_in_month  # each month's value divided by the number of days in the month
         for d in range(pd.Period(i).days_in_month)],  # repeated once per day of the month
        index=pd.date_range(start=i, periods=pd.Period(i).days_in_month, freq='D'))  # days range
     for i, v in df.iterrows()])  # for each df's index and value
2023-01-01 1.0
2023-01-02 1.0
2023-01-03 1.0
2023-01-04 1.0
2023-01-05 1.0
2023-01-06 1.0
2023-01-07 1.0
2023-01-08 1.0
2023-01-09 1.0
2023-01-10 1.0
2023-01-11 1.0
... ...
2023-02-13 0.5
2023-02-14 0.5
2023-02-15 0.5
2023-02-16 0.5
2023-02-17 0.5
2023-02-18 0.5
2023-02-19 0.5
2023-02-20 0.5
2023-02-21 0.5
2023-02-22 0.5
2023-02-23 0.5
2023-02-24 0.5
2023-02-25 0.5
2023-02-26 0.5
2023-02-27 0.5
2023-02-28 0.5
What could be done to avoid uniform distribution of daily costs and for the cases with multiple columns? Here's an extended df:
# additional columns and a row
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31, 62, 23],
                    [pd.to_datetime('2023-02-01'), 14, 28, 51],
                    [pd.to_datetime('2023-03-01'), 16, 33, 21]],
                   columns=['time', 'cost1', 'cost2', 'cost3'])
      .set_index('time'))
# reformat to months
df.index = df.index.strftime('%m-%Y')
cost1 cost2 cost3
01-2023 31 62 23
02-2023 14 28 51
03-2023 16 33 21
Here's what I came up with for the cases where monthly costs may be upsampled by randomized daily costs, inspired by this question. This solution scales with the number of columns and rows:
import numpy as np
df1 = pd.concat(  # concatenate the resulting DataFrames into one
    [pd.DataFrame(  # make a DataFrame from a row in df
        # here we make a Series with random Dirichlet distributed numbers
        # with length of a month and a column's value as the sum
        [pd.Series((np.random.dirichlet(np.ones(pd.Period(i).days_in_month), size=1)*v
                    ).flatten())  # the product is an ndarray that needs flattening
         for v in row],  # for every column value in a row
        # index renamed as columns because of the created DataFrame's shape
        index=df.columns
        # transpose and set the proper index
        ).T.set_index(
            pd.date_range(start=i, periods=pd.Period(i).days_in_month, freq='D'))
     for i, row in df.iterrows()])  # iterate over every row
cost1 cost2 cost3
2023-01-01 1.703177 1.444117 0.160151
2023-01-02 0.920706 3.664460 0.823405
2023-01-03 1.210426 1.194963 0.294093
2023-01-04 0.214737 1.286273 0.923881
2023-01-05 1.264553 0.380062 0.062829
... ... ... ...
2023-03-27 0.124092 0.615885 0.251369
2023-03-28 0.520578 1.505830 1.632373
2023-03-29 0.245154 3.094078 0.308173
2023-03-30 0.530927 0.406665 1.149860
2023-03-31 0.276992 1.115308 0.432090
90 rows × 3 columns
To check the monthly sums:
df1.resample('M').sum()
cost1 cost2 cost3
2023-01-31 31.0 62.0 23.0
2023-02-28 14.0 28.0 51.0
2023-03-31 16.0 33.0 21.0
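For the uniform case, the row-by-row concatenation can probably be avoided altogether. The sketch below assumes every row is stamped with the first day of a calendar month and stands for that whole month; the variable names days and daily are mine. It repeats each monthly row over its days with reindex and then divides by the month length, which handles any number of columns at once.

```python
import pandas as pd

# Monthly totals, indexed by the first day of each month (assumption).
df = pd.DataFrame({'cost1': [31, 14, 16], 'cost2': [62, 28, 33]},
                  index=pd.to_datetime(['2023-01-01', '2023-02-01', '2023-03-01']))

# One entry per day, from the first month start to the end of the last month.
days = pd.date_range(df.index.min(),
                     df.index.max() + pd.offsets.MonthEnd(0), freq='D')

# Repeat each monthly row over its days, then divide by the month length.
daily = (df.reindex(days, method='ffill')
           .div(days.days_in_month.to_numpy(), axis=0))

print(daily.head())
```

Summing daily back per month (for example with daily.resample('MS').sum()) should give the original monthly figures again, up to floating point error.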

Add missing timestamp values in dataframe column, in timerange [duplicate]

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()
In the above code idx becomes a range of, say, 30 dates: 09-01-2013 to 09-30-2013.
However, s may only have 25 or 26 days because no events happened on some dates. I then get an AssertionError because the sizes don't match when I try to plot:
fig, ax = plt.subplots()
ax.bar(idx.to_pydatetime(), s, color='green')
What's the proper way to tackle this? Should I remove the dates with no values from idx, or (which I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?
Here's a snippet of s ( df.groupby(['simpleDate']).size() ); notice there are no entries for 04 and 05.
09-02-2013 2
09-03-2013 10
09-06-2013 5
09-07-2013 1
You could use Series.reindex:
import pandas as pd
idx = pd.date_range('09-01-2013', '09-30-2013')
s = pd.Series({'09-02-2013': 2,
'09-03-2013': 10,
'09-06-2013': 5,
'09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
2013-09-01 0
2013-09-02 2
2013-09-03 10
2013-09-04 0
2013-09-05 0
2013-09-06 5
2013-09-07 1
2013-09-08 0
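Applied to the question's own workflow, the reindex can be chained straight onto the groupby so that idx and s always have the same length. This is a minimal sketch with a made-up event log; only the column name simpleDate comes from the question.

```python
import pandas as pd

# Hypothetical event log: one row per event, several events may share a date.
df = pd.DataFrame({'simpleDate': pd.to_datetime(
    ['2013-09-02', '2013-09-02', '2013-09-03', '2013-09-06'])})

# Every calendar day between the first and the last event.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())

# Count events per date, then add the missing dates with a count of 0.
s = df.groupby('simpleDate').size().reindex(idx, fill_value=0)

print(s)
```

After this, s has exactly one value per entry of idx, so the two can be passed to a plotting call together.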
A quicker workaround is to use .asfreq(). This doesn't require creating a new index to pass to .reindex().
# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'),
                  pd.Timestamp('2012-05-04'),
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)
s.asfreq('D')
2012-05-01 1.0
2012-05-02 NaN
2012-05-03 NaN
2012-05-04 2.0
2012-05-05 NaN
2012-05-06 3.0
Freq: D, dtype: float64
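If zeros are wanted instead of NaN, asfreq also takes a fill_value argument, which is used only for the rows it introduces. A short sketch with the same three staggered dates as above (the name filled is mine):

```python
import pandas as pd

dates = pd.DatetimeIndex(['2012-05-01', '2012-05-04', '2012-05-06'])
s = pd.Series([1, 2, 3], index=dates)

# Insert the missing days and give them the value 0 instead of NaN.
filled = s.asfreq('D', fill_value=0)

print(filled)
```

Like reindex, this needs an index without duplicate dates.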
One issue is that reindex will fail if the index contains duplicate values. Say we're working with timestamped data, which we want to index by date:
df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
    'values': ['a', 'b', 'c', 'd']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-18 "2016-11-18 04:00:00" d
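Due to the duplicate 2016-11-16 date, an attempt to reindex:
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)
fails with:
ValueError: cannot reindex from a duplicate axis
(by this it means the index has duplicates, not that it is itself a duplicate)
Instead, we can use .loc to look up entries for all dates in range (note that since pandas 1.0, .loc raises a KeyError when any of the requested labels is missing, so this only works on older versions):
df.loc[all_days]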
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-17 NaN NaN
2016-11-18 "2016-11-18 04:00:00" d
fillna can be used on the column series to fill blanks if needed.
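On pandas versions where .loc no longer accepts missing labels, one way to get the same table is a left join onto an empty frame that is indexed by every day. This is a sketch, not the answer's original method; the index name day and the variable filled are my own choices.

```python
import pandas as pd

df = pd.DataFrame({
    'timestamps': pd.to_datetime(['2016-11-15 01:00', '2016-11-16 02:00',
                                  '2016-11-16 03:00', '2016-11-18 04:00']),
    'values': ['a', 'b', 'c', 'd']})
# Index by calendar day; 2016-11-16 appears twice.
df.index = pd.DatetimeIndex(df['timestamps'].dt.floor('D'), name='day')

all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')

# Left-join the data onto an empty frame that has one row per day.
# Duplicate days are kept, missing days become NaN/NaT rows.
filled = pd.DataFrame(index=all_days).join(df)

print(filled)
```

The join keeps both rows for the duplicated date and adds a single empty row for 2016-11-17.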
An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:
df.resample('D').mean()
resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.
Here is the original data, but with an extra entry for '2013-09-03':
2013-09-02 2
2013-09-03 10
2013-09-03 20 <- duplicate date added to OP's data
2013-09-06 5
2013-09-07 1
And here are the results:
2013-09-02 2.0
2013-09-03 15.0 <- mean of original values for 2013-09-03
2013-09-04 NaN <- NaN b/c date not present in orig
2013-09-05 NaN <- NaN b/c date not present in orig
2013-09-06 5.0
2013-09-07 1.0
I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
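To make the resample approach concrete, here is a small sketch that rebuilds the five values listed above as a Series (the variable names are mine) and compares mean with sum. For count data, sum may be the more natural choice, since days without any rows come out as 0 directly.

```python
import pandas as pd

# The OP's data with the extra, duplicated 2013-09-03 entry.
s = pd.Series([2, 10, 20, 5, 1],
              index=pd.to_datetime(['2013-09-02', '2013-09-03', '2013-09-03',
                                    '2013-09-06', '2013-09-07']))

daily_mean = s.resample('D').mean()  # missing days become NaN
daily_sum = s.resample('D').sum()    # missing days become 0

print(daily_mean)
print(daily_sum)
```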
Here's a nice method to fill in missing dates into a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:
def fill_in_missing_dates(df, date_col_name = 'date',date_order = 'asc', fill_value = 0, days_back = 30):
df.index = pd.DatetimeIndex(df.index)
d =
d2 = d - timedelta(days = days_back)
idx = pd.date_range(d2, d, freq = "D")
df = df.reindex(idx,fill_value=fill_value)
df[date_col_name] = pd.DatetimeIndex(df.index)
return df
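The listing above does not show how the end date d is obtained, so the following is only a sketch of the same idea with the end date passed in explicitly. The function name fill_missing_dates and the end parameter are assumptions of mine, not part of the original answer, and the date column is assumed to hold unique dates.

```python
import pandas as pd

def fill_missing_dates(df, end, date_col='date', days_back=30, fill_value=0):
    """Return one row per day for the `days_back` days up to `end`.

    `end` is a hypothetical parameter; the original snippet elides it.
    """
    # Move the date column into a DatetimeIndex without mutating the input.
    out = df.set_index(pd.DatetimeIndex(df[date_col])).drop(columns=date_col)
    end = pd.Timestamp(end).normalize()
    idx = pd.date_range(end - pd.Timedelta(days=days_back), end,
                        freq='D', name=date_col)
    # Missing days get `fill_value`; the dates go back into a regular column.
    return out.reindex(idx, fill_value=fill_value).reset_index()

df = pd.DataFrame({'date': ['2024-01-02', '2024-01-04'], 'n': [3, 7]})
filled = fill_missing_dates(df, end='2024-01-05', days_back=4)
print(filled)
```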
You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.
# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])
# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
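Since the example frame is cut off above, here is a self-contained sketch of the same left join. The dates and the value column are invented for illustration.

```python
import pandas as pd

# Hypothetical data with 2022-04-12 and 2022-04-13 missing.
missing_df = pd.DataFrame({
    'date': pd.to_datetime(['2022-04-10', '2022-04-11', '2022-04-14']),
    'value': [10, 20, 5]})

# All dates between the first and the last date present.
all_dates = pd.DataFrame({'date': pd.date_range(missing_df['date'].min(),
                                                missing_df['date'].max())})

# Left join keeps every date; dates without a match get NaN.
new_df = all_dates.merge(missing_df, how='left', on='date')
new_df['value'] = new_df['value'].fillna(0).astype(int)

print(new_df)
```

Unlike reindex, the merge also works when missing_df has several rows for the same date.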

Create multiple dataframes out of several dataframes multipling value according to "Base"-Dictionary

I have a dictionary of dataframes, dataframes, which are time-based profiles with values x and y for each point in time. The following shows the dictionary and one dataframe of this dictionary:
dataframes={'SG1':Dataframe, 'SG2':Dataframe, 'SG3':Dataframe, 'SG4':Dataframe, 'SG5':Dataframe}
value x
value y
With those I want to create a dictionary of new dataframes; dataframes2, where the values x and y are a sum of the old dataframes multiplied with a value.
This value is contained in another nested dictionary:
base_dict={'area1':{'SG1':0.0,'SG2':1.0}, 'area2':{'SG1':1.0,'SG2':0.0}}
(Note: I shortened the dictionary.)
At the end dataframes2 should look like this:
dataframes2={'area1':Dataframe, 'area2':Dataframe}
While area1 looks like this:
value x                                  value y
0.0 * SG1 value x + 1.0 * SG2 value x    0.0 * SG1 value y + 1.0 * SG2 value y
0.0 * SG1 value x + 1.0 * SG2 value x    0.0 * SG1 value y + 1.0 * SG2 value y
I am thinking of using multiple for loops, but I am not really sure where to start here.
Can you help me?
If all your data frames have the same dates and times, you can do it in one for loop, iterating over the keys and values of base_dict and creating an entry in dataframes2 for each key:
dataframes2 = {}
for area, vals in base_dict.items():
    df_keys = list(vals.keys())
    dataframes2[area] = pd.DataFrame({'Date': dataframes[df_keys[0]].Date,
                                      'Time': dataframes[df_keys[0]].Time,
                                      'value x': dataframes[df_keys[0]]['value x']*vals[df_keys[0]] + dataframes[df_keys[1]]['value x']*vals[df_keys[1]],
                                      'value y': dataframes[df_keys[0]]['value y']*vals[df_keys[0]] + dataframes[df_keys[1]]['value y']*vals[df_keys[1]]})
If the timestamps are different, you can do a similar approach, but instead of just creating a new data frame, you'll need to work with merges.
Edit: We had this discussion in the comments but since I cannot post a code sample there, here is a full minimal example of my code that does not return a 1x1 Dataframe:
import pandas as pd
df1 = pd.DataFrame({'x': range(5), 'y': range(5, 10), 'Date': pd.date_range(start='1/1/2018', periods=5)})
df2 = pd.DataFrame({'x': range(10,15), 'y': range(15,20), 'Date': pd.date_range(start='1/1/2018', periods=5)})
dataframes2 = {}
dataframes={'SG1': df1, 'SG2': df2}
base_dict={'area1':{'SG1':0.0,'SG2':1.0}, 'area2':{'SG1':1.0,'SG2':0.0}}
for area, vals in base_dict.items():
    df_keys = list(vals.keys())
    dataframes2[area] = pd.DataFrame({'Date': dataframes[df_keys[0]].Date,
                                      'value x': dataframes[df_keys[0]]['x']*vals[df_keys[0]] + dataframes[df_keys[1]]['x']*vals[df_keys[1]]})
{'area1':         Date  value x
 0 2018-01-01     10.0
 1 2018-01-02     11.0
 2 2018-01-03     12.0
 3 2018-01-04     13.0
 4 2018-01-05     14.0,
 'area2':         Date  value x
 0 2018-01-01      0.0
 1 2018-01-02      1.0
 2 2018-01-03      2.0
 3 2018-01-04      3.0
 4 2018-01-05      4.0}
In fact, it is even possible to use a comprehension here. The trick is that you can use operations on full dataframes, provided they have same indexes and columns. So if you hide the Date and Time columns in the index, everything works fine:
dataframes2 ={k: sum(dataframes[name].set_index(['Date ', 'Time ']) * coeff
for name, coeff in d.items()).reset_index()
for k,d in base_dict.items()}
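The same idea written out step by step, as a sketch with made-up numbers: whole frames are scaled and added once Date sits in the index, so any number of SG entries per area works. The helper name weighted_sum is mine, and the frames are assumed to share identical dates and value columns.

```python
import pandas as pd

dates = pd.date_range('2018-01-01', periods=3)
dataframes = {
    'SG1': pd.DataFrame({'Date': dates, 'value x': [1.0, 2.0, 3.0],
                         'value y': [4.0, 5.0, 6.0]}),
    'SG2': pd.DataFrame({'Date': dates, 'value x': [10.0, 20.0, 30.0],
                         'value y': [40.0, 50.0, 60.0]}),
}
base_dict = {'area1': {'SG1': 0.0, 'SG2': 1.0},
             'area2': {'SG1': 0.5, 'SG2': 0.5}}

def weighted_sum(weights):
    """Add up all referenced frames, each scaled by its coefficient."""
    total = None
    for name, coeff in weights.items():
        # With Date in the index, only the value columns are multiplied.
        part = dataframes[name].set_index('Date') * coeff
        total = part if total is None else total + part
    return total.reset_index()

dataframes2 = {area: weighted_sum(weights) for area, weights in base_dict.items()}
print(dataframes2['area2'])
```

If the frames do not share the same dates, the addition aligns on the index and produces NaN for dates missing from either side, so a merge or reindex would be needed first.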
Note: This is based on Darina's answer. Since it is too much to put in the comments, I wrote it as a separate answer. I changed the code because in Darina's answer a Series object was created and the resulting dataframes had a size of 1x1, which is not the goal here.
for area, vals in base_dict.items():
    df_keys = list(vals.keys())
    d = dataframes[df_keys[0]]['Date']
    t = dataframes[df_keys[0]]['Time']
    x = dataframes[df_keys[0]]['value x']*vals[df_keys[0]] + dataframes[df_keys[1]]['value x']*vals[df_keys[1]]
    y = dataframes[df_keys[0]]['value y']*vals[df_keys[0]] + dataframes[df_keys[1]]['value y']*vals[df_keys[1]]
    dataframes2[area] = pd.concat([d.to_frame(name='Date'), t.to_frame(name='Time'), x.to_frame(name='value x'), y.to_frame(name='value y')], axis=1)
I avoided .Date and .Time since the datatype of the 'Date' and 'Time' values was not datetime.

Groupby and interpolate in Pandas

I have data that has a week number, account id, and several usage columns. I'd like to a) group by account ID, b) resample weekly data into daily, and c) interpolate daily data evenly (divide the weekly by 7), then bring it all back together. I've got most of it down, but Pandas groupby confuses me a little. It's also very slow, which makes me think this might not be the optimal solution.
Data looks like this:
Account Id year week views stats foo_col
31133 213 2017-03-05 4.0 2.0 11.0
10085 456 2017-03-12 1.0 6.0 3.0
49551 789 2017-03-26 1.0 6.0 27.0
Here's my code:
def interpolator(mini_df):
    mini_df = mini_df[cols_to_interpolate].set_index('year week')
    return mini_df.resample('D').ffill().interpolate() / 7
grp = df.groupby('Account Id')
example = list(grp)[0][1]
interpolator(example) # This works perfectly
df.groupby('Account Id').agg(interpolator) # doesn't work
df.groupby('Account Id').transform(interpolator) # doesn't work
for name,group in grp:
    group = group[cols_to_interpolate].set_index('year week')
    group = group.resample('D').ffill().interpolate() / 7 # doesn't work
for acc_id in df['Account Id'].unique():
    mask = df.loc[df['Account Id'] == acc_id]
    print(df[mask]) # doesn't work
I think your function can be chained onto the groupby object like this:
df = (df.set_index('year week')
        .groupby('Account Id')[cols_to_interpolate]
        .interpolate() / 7)
The solution from the comments is different: interpolator is applied to each group with apply:
df.groupby('Account Id').apply(interpolator)
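A self-contained sketch of the apply route, with invented numbers and a single usage column. The column names Account Id, year week and views come from the question; to_daily and daily are my names. The needed columns are selected before apply so the grouping column is not passed into the function.

```python
import pandas as pd

df = pd.DataFrame({
    'Account Id': [213, 213, 456, 456],
    'year week': pd.to_datetime(['2017-03-05', '2017-03-12',
                                 '2017-03-05', '2017-03-12']),
    'views': [7.0, 14.0, 21.0, 28.0],
})

def to_daily(group):
    # Repeat each weekly value over the following days, then split it by 7.
    return group.set_index('year week')[['views']].resample('D').ffill() / 7

daily = df.groupby('Account Id')[['year week', 'views']].apply(to_daily)
print(daily)
```

The result is indexed by (Account Id, date). Note that resample stops at the last timestamp of each group, so the final week contributes only one daily row; if the last week should also cover seven days, the index would have to be extended first.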

