Choosing values with df.quantile() for separate years and months - python

I have a large data set and I want to add values to a column based on the highest values in another column in my data set.
Easy, I can just use df.quantile() and access the appropriate values.
However, I want to check for each month in each year.
I solved it for looking at years only, see code below.
I'm sure I could do it for months with nested for loops but I'd rather avoid it if I can.
I guess the most pythonic way would be to not loop at all but use pandas in a smarter way.
Any suggestions?
Sample code:
import numpy as np
import pandas as pd

df = pd.read_excel(file)
df.index = df['date']
df = df.drop('date', axis=1)
df['new'] = 0
year = (2016, 2017, 2018, 2019, 2020)
for i in year:
    df['new'].loc[str(i)] = np.where(df['cost'].loc[str(i)] < df['cost'].loc[str(i)].quantile(0.5), 0, 1)
print(df)
Sample input
file =
cost
date
2016-11-01 30
2016-12-01 29
2017-11-01 40
2017-12-01 45
2018-11-30 240
2018-12-01 200
2019-11-30 220
2019-12-30 180
2020-11-30 150
2020-12-30 130
Output
cost new
date
2016-11-01 30 1
2016-12-01 29 0
2017-11-01 40 0
2017-12-01 45 1
2018-11-30 240 1
2018-12-01 200 0
2019-11-30 220 1
2019-12-30 180 0
2020-11-30 150 1
2020-12-30 130 0
Desired output (if quantile works like that on single values; shown just as an example):
cost new
date
2016-11-01 30 1
2016-12-01 29 1
2017-11-01 40 1
2017-12-01 45 1
2018-11-30 240 1
2018-12-01 200 1
2019-11-30 220 1
2019-12-30 180 1
2020-11-30 150 1
2020-12-30 130 1
Thank you!

An interesting question, it took me a little while to work out a solution!
import pandas as pd

df = pd.DataFrame(data={"cost": [30, 29, 40, 45, 240, 200, 220, 180, 150, 130],
                        "date": ["2016-11-01", "2016-12-01", "2017-11-01",
                                 "2017-12-01", "2018-11-30", "2018-12-01",
                                 "2019-11-30", "2019-12-30", "2020-11-30",
                                 "2020-12-30"]})
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
df["new"] = df.groupby([lambda x: x.year, lambda x: x.month]).transform(lambda x: (x >= x.quantile(0.5))*1)
#Out:
# cost new
#date
#2016-11-01 30 1
#2016-12-01 29 1
#2017-11-01 40 1
#2017-12-01 45 1
#2018-11-30 240 1
#2018-12-01 200 1
#2019-11-30 220 1
#2019-12-30 180 1
#2020-11-30 150 1
#2020-12-30 130 1
What the important line does:
Groups by the index year and month
For each item in the group, calculates whether it is greater than or equal to the 0.5 quantile (as bool)
Multiplying by 1 converts the boolean into an integer (1/0) instead of True/False
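An equivalent sketch (my own variant, same result) that passes the index's year and month arrays to groupby instead of the lambdas:
df["new"] = (
    df.groupby([df.index.year, df.index.month])["cost"]
      .transform(lambda x: (x >= x.quantile(0.5)).astype(int))
)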
The initial creation of the dataframe should be equivalent to your df = pd.read_excel(file)
Leaving out the , lambda x: x.month part of the groupby (by year only), the output is the same as your current output:
# cost new
#date
#2016-11-01 30 1
#2016-12-01 29 0
#2017-11-01 40 0
#2017-12-01 45 1
#2018-11-30 240 1
#2018-12-01 200 0
#2019-11-30 220 1
#2019-12-30 180 0
#2020-11-30 150 1
#2020-12-30 130 0

Related

Data Manipulation in multiple columns(absolute, percentage, and categorical) in pandas dataframe

I need to make a function which takes a dataframe and a dictionary {"Col_1": % change, "Col_2": absolute change, "Col_3": 0/1 (categorical)} as input, and it should make the changes to the dataframe.
I have a data frame like this:
Date        col_1  col_2  col_3
01/01/2022  90     100    0
01/02/2022  80     110    1
01/03/2022  92     120    0
01/04/2022  96     130    0
01/05/2022  99     150    1
01/06/2022  105    155    1
Now I pass the dictionary say,
{"Date":["01/01/2022","01/02/2022"],"col_1":[-10,-10],"col_2":10,"col_3":[1,0]}
for "col_1" I am passing -10,-10 percentage change to its previous values on specified date.
for "col_2" I am passing an absolute number that is 10 (it should replace previous values by 10)
specified date.
for "col_3" I am passing a binary number and it updated in dataframe on specified date.
Then my desired out would look like this
Date        col_1  col_2  col_3
01/01/2022  81     10     1
01/02/2022  72     10     0
01/03/2022  92     120    0
01/04/2022  96     120    0
01/05/2022  99     150    1
01/06/2022  105    155    1
I tried this code:
def per_change(df, cols, d):
    df[cols] = df[cols].add(df[cols].div(100).mul(pd.Series(d)), fill_value=0)
    return df
but it didn't work out. Please help!
You could use dic["Date"] as a boolean mask and update values in df using the values under the other keys in dic:
import numpy as np

msk = df['Date'].isin(dic['Date'])
df.loc[msk, 'col_1'] *= (1 + np.array(dic['col_1']) / 100)
df.loc[msk, 'col_2'] = dic['col_2']
df.loc[msk, 'col_3'] = dic['col_3']
Output:
Date col_1 col_2 col_3
0 01/01/2022 81.0 10 1
1 01/02/2022 72.0 10 0
2 01/03/2022 92.0 120 0
3 01/04/2022 96.0 130 0
4 01/05/2022 99.0 150 1
5 01/06/2022 105.0 155 1
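If you want this wrapped up as the function the question asks for, a hypothetical sketch (the name apply_changes and the fixed column keys are my own) could look like:
import numpy as np

def apply_changes(df, dic):
    msk = df['Date'].isin(dic['Date'])
    df.loc[msk, 'col_1'] *= (1 + np.array(dic['col_1']) / 100)  # percentage change
    df.loc[msk, 'col_2'] = dic['col_2']                          # absolute replacement
    df.loc[msk, 'col_3'] = dic['col_3']                          # categorical update
    return df

df = apply_changes(df, dic)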

comparing row wise each values in pandas data frame

My data frame looks like this (almost 10M rows) -
date value1 value2
01/02/2019 10 120
02/02/2019 21 130
03/02/2019 0 140
04/02/2019 24 150
05/02/2019 29 160
06/02/2019 32 160
07/02/2019 54 160
08/02/2019 32 180
01/02/2019 -3 188
My final output looks like -
date value1 value2 result
01/02/2019 10 120 1
02/02/2019 21 130 1
03/02/2019 0 140 0
04/02/2019 24 150 1
05/02/2019 29 160 1
06/02/2019 32 160 0
07/02/2019 54 160 0
08/02/2019 32 180 1
01/02/2019 -3 188 0
My logic: if value1 <= 0, or value2 is the same for 3 consecutive rows, then the result is 0; otherwise 1.
How can I do it in pandas?
You can try defining your own function that handles runs of consecutive values and checks whether value1 is above 0, then group by a custom series marking those consecutive runs, and finally apply the custom function:
import pandas as pd
from io import StringIO
s = '''date,value1,value2
01/02/2019,10,120
02/02/2019,21,130
03/02/2019,0,140
04/02/2019,24,150
05/02/2019,29,160
06/02/2019,32,160
07/02/2019,54,160
08/02/2019,32,180
01/02/2019,-3,188'''
df = pd.read_csv(StringIO(s), header=0, index_col=0)
def fun(group_df):
    if group_df.shape[0] >= 3:
        return pd.Series([0]*group_df.shape[0], index=group_df.index)
    else:
        return group_df.value1 > 0

consecutives = (df.value2 != df.value2.shift()).cumsum()
df['results'] = df.groupby(consecutives).apply(fun).reset_index(level=0, drop=True)
Here fun is applied per group: it returns 0 for every row when the run of consecutive equal value2 values is 3 or more, and otherwise checks whether value1 is greater than 0. The results are:
print(df)
# value1 value2 results
# date
# 01/02/2019 10 120 1
# 02/02/2019 21 130 1
# 03/02/2019 0 140 0
# 04/02/2019 24 150 1
# 05/02/2019 29 160 0
# 06/02/2019 32 160 0
# 07/02/2019 54 160 0
# 08/02/2019 32 180 1
# 01/02/2019 -3 188 0
Something like this
np.where((df.value1.le(0)) | (df.value2.diff().eq(0)), 0, 1)
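A vectorized sketch of the run-length idea from the first (groupby) answer, my own take; note it marks every row of a run of three or more equal value2 values as 0, so it reproduces that answer's output rather than the question's sample output:
import numpy as np

run_id = (df['value2'] != df['value2'].shift()).cumsum()   # label each run of equal value2
run_len = df.groupby(run_id)['value2'].transform('size')   # length of the run each row belongs to
df['result'] = np.where((df['value1'] <= 0) | (run_len >= 3), 0, 1)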

More efficient time delta calculation python 3

Sorry I'm new to python.
I have a dataframe of entities that log values once a month. For each unique entity in my dataframe, I locate the max value, then locate that max value's corresponding month. Using the max value's month, a time delta in days between each of the entity's other months and the max month can be calculated. This works for small dataframes.
I know my loop is not performant and can't scale to larger dataframes (e.g., 3M rows, 156+ MB). After weeks of research I've gathered that my loop is degenerate and feel there is a numpy solution or something more pythonic. Can someone see a more performant approach to calculating this time delta in days?
I've tried various value.shift(x) calculations in a lambda function, but the peak value is not consistent. I've also tried calculating more columns to minimize my loop calculations.
import pandas as pd

df = pd.DataFrame({'entity': ['A','A','A','A','B','B','B','C','C','C','C','C'],
                   'month': ['10/31/2018','11/30/2018','12/31/2018','1/31/2019',
                             '1/31/2009','2/28/2009','3/31/2009','8/31/2011',
                             '9/30/2011','10/31/2011','11/30/2011','12/31/2011'],
                   'value': ['80','600','500','400','150','300','100','200','250','300','200','175'],
                   'month_number': ['1','2','3','4','1','2','3','1','2','3','4','5']})
df['month'] = df['month'].apply(pd.to_datetime)

for entity in set(df['entity']):
    # set peak value
    peak_value = df.loc[df['entity'] == entity, 'value'].max()
    # set peak value date
    peak_date = df.loc[(df['entity'] == entity) & (df['value'] == peak_value), 'month'].min()
    # subtract peak date from current date
    delta = df.loc[df['entity'] == entity, 'month'] - peak_date
    # update days_delta with delta in days
    df.loc[df['entity'] == entity, 'days_delta'] = delta
RESULT:
entity month value month_number days_delta
A 2018-10-31 80 1 0 days
A 2018-11-30 600 2 30 days
A 2018-12-31 500 3 61 days
A 2019-01-31 400 4 92 days
B 2009-01-31 150 1 -28 days
B 2009-02-28 300 2 0 days
B 2009-03-31 100 3 31 days
C 2011-08-31 200 1 -61 days
C 2011-09-30 250 2 -31 days
C 2011-10-31 300 3 0 days
C 2011-11-30 200 4 30 days
C 2011-12-31 175 5 61 days
Setup
First let's also make sure value is numeric
df = pd.DataFrame({
    'entity': ['A','A','A','A','B','B','B','C','C','C','C','C'],
    'month': ['10/31/2018','11/30/2018','12/31/2018','1/31/2019',
              '1/31/2009','2/28/2009','3/31/2009','8/31/2011',
              '9/30/2011','10/31/2011','11/30/2011','12/31/2011'],
    'value': ['80','600','500','400','150','300','100','200','250','300','200','175'],
    'month_number': ['1','2','3','4','1','2','3','1','2','3','4','5']
})
df['month'] = df['month'].apply(pd.to_datetime)
df['value'] = pd.to_numeric(df['value'])
transform and idxmax
max_months = df.groupby('entity').value.transform('idxmax').map(df.month)
df.assign(days_delta=df.month - max_months)
entity month value month_number days_delta
0 A 2018-10-31 80 1 -30 days
1 A 2018-11-30 600 2 0 days
2 A 2018-12-31 500 3 31 days
3 A 2019-01-31 400 4 62 days
4 B 2009-01-31 150 1 -28 days
5 B 2009-02-28 300 2 0 days
6 B 2009-03-31 100 3 31 days
7 C 2011-08-31 200 1 -61 days
8 C 2011-09-30 250 2 -31 days
9 C 2011-10-31 300 3 0 days
10 C 2011-11-30 200 4 30 days
11 C 2011-12-31 175 5 61 days
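For readability, a step-by-step sketch of what that one-liner does (same df as in Setup):
peak_idx = df.groupby('entity')['value'].transform('idxmax')  # row label of each entity's max value
peak_month = peak_idx.map(df['month'])                         # the month at that row label
df['days_delta'] = df['month'] - peak_month                    # timedelta from each row to the peak month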

Cumulative Sum by date (Month)

I have a pandas dataframe and I need to work out the cumulative sum for each month.
Date Amount
2017/01/12 50
2017/01/12 30
2017/01/15 70
2017/01/23 80
2017/02/01 90
2017/02/01 10
2017/02/02 10
2017/02/03 10
2017/02/03 20
2017/02/04 60
2017/02/04 90
2017/02/04 100
The cumulative sum is the trailing sum for each day i.e 01-31. However, some days are missing. The data frame should look like
Date Sum_Amount
2017/01/12 80
2017/01/15 150
2017/01/23 203
2017/02/01 100
2017/02/02 110
2017/02/03 140
2017/02/04 390
If you only need the cumsum by month, you can use groupby with sum and then group by the values of the index converted to month:
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 140
6 2017-02-04 390
But if you need both months and years, you need to convert the index to a month period with to_period:
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
The difference is better seen in a changed df, with a different year added:
print (df)
Date Amount
0 2017/01/12 50
1 2017/01/12 30
2 2017/01/15 70
3 2017/01/23 80
4 2017/02/01 90
5 2017/02/01 10
6 2017/02/02 10
7 2017/02/03 10
8 2018/02/03 20
9 2018/02/04 60
10 2018/02/04 90
11 2018/02/04 100
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 140
7 2018-02-04 390
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 20
7 2018-02-04 270
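An equivalent sketch that keeps Date as a column instead of setting it as the index (my own variant; the year-and-month logic still comes from to_period):
df['Date'] = pd.to_datetime(df['Date'])
daily = df.groupby('Date', as_index=False)['Amount'].sum()           # one row per day
monthly_key = daily['Date'].dt.to_period('M')                        # year+month key
daily['Sum_Amount'] = daily.groupby(monthly_key)['Amount'].cumsum()  # cumulative within each month
out = daily.drop(columns='Amount')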

Resampling Within a Pandas MultiIndex

I have some hierarchical data which bottoms out into time series data which looks something like this:
df = pandas.DataFrame(
    {'value_a': values_a, 'value_b': values_b},
    index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 0 10
2012-01-02 1 11
2012-01-03 2 12
2012-01-04 3 13
Savanna 2012-01-01 4 14
2012-01-02 5 15
2012-01-03 6 16
2012-01-04 7 17
Alabama Mobile 2012-01-01 8 18
2012-01-02 9 19
2012-01-03 10 20
2012-01-04 11 21
Montgomery 2012-01-01 12 22
2012-01-02 13 23
2012-01-03 14 24
2012-01-04 15 25
I'd like to perform time resampling per city, so something like
df.resample("2D", how="sum")
would output
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
as is, df.resample('2D', how='sum') gets me
TypeError: Only valid with DatetimeIndex or PeriodIndex
Fair enough, but I'd sort of expect this to work:
>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex
at which point I'm really running out of ideas... is there some way stack and unstack might be able to help me?
pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:
df.groupby(pd.Grouper(freq='2D', level=-1))
The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex.
Moreover, you can use this in conjunction with other level values from the index:
level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0, 1]]
                     + [pd.Grouper(freq='2D', level=-1)]).sum())
It looks a bit awkward, but using_Grouper turns out to be much faster than my original
suggestion, using_reset_index:
import numpy as np
import pandas as pd
import datetime as DT
def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0, 1]]
                       + [pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State', 'City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0, 1])
              .resample('2D').sum()
              .stack(level=[2, 1])
              .swaplevel(2, 0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia']*8 + ['Alabama']*8
    cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
    dates = pd.DatetimeIndex([DT.date(2012, 1, 1) + DT.timedelta(days=i) for i in range(4)]*4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index=[states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates],
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index), 2)), index=index,
                      columns=['value_a', 'value_b'])
    return df
df = make_orig()
print(using_Grouper(df))
yields
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Here is a benchmark comparing using_Grouper, using_reset_index, using_stack on a 5000-row DataFrame:
In [30]: df = make_df(10)
In [34]: len(df)
Out[34]: 5000
In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop
In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop
In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
You need the groupby() method and provide it with a pd.Grouper for each level of your MultiIndex you wish to maintain in the resulting DataFrame. You can then apply an operation of choice.
To resample date or timestamp levels, you need to set the freq argument with the frequency of choice — a similar approach using pd.TimeGrouper() is deprecated in favour of pd.Grouper() with the freq argument set.
This should give you the DataFrame you need:
df.groupby([pd.Grouper(level='State'),
            pd.Grouper(level='City'),
            pd.Grouper(level='Date', freq='2D')]
           ).sum()
The Time Series Guide in the pandas documentation describes resample() as:
... a time-based groupby, followed by a reduction method on each of its groups.
Hence, using groupby() should technically be the same operation as using .resample() on a DataFrame with a single index.
The same paragraph points to the cookbook section on resampling for more advanced examples, where the 'Grouping using a MultiIndex' entry is highly relevant for this question. Hope that helps.
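For intuition, here is a minimal sketch of that equivalence on a plain DatetimeIndex (my own hypothetical frame, not from the question):
import pandas as pd

idx = pd.date_range('2012-01-01', periods=4)
single = pd.DataFrame({'value_a': range(4)}, index=idx)

by_resample = single.resample('2D').sum()
by_grouper = single.groupby(pd.Grouper(freq='2D')).sum()
assert by_resample.equals(by_grouper)  # identical 2-day bins and sums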
An alternative using stack/unstack
df.unstack(level=[0,1]).resample('2D', how='sum').stack(level=[2,1]).swaplevel(2,0)
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
Alabama Mobile 2012-01-01 17 37
Montgomery 2012-01-01 25 45
Georgia Savanna 2012-01-01 9 29
Atlanta 2012-01-03 5 25
Alabama Mobile 2012-01-03 21 41
Montgomery 2012-01-03 29 49
Georgia Savanna 2012-01-03 13 33
Notes:
No idea about performance comparison
Possible pandas bug - stack(level=[2,1]) worked, but stack(level=[1,2]) failed
This works:
df.groupby(level=[0,1]).apply(lambda x: x.set_index('Date').resample('2D', how='sum'))
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
If the Date column is strings, then convert to datetime beforehand:
df['Date'] = pd.to_datetime(df['Date'])
I had the same issue and was racking my brain for a while, but then I read the documentation of the .resample function in the 0.19.2 docs, and I saw there's a new kwarg called "level" that you can use to specify a level in a MultiIndex.
Edit: More details in the "What's New" section.
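A minimal sketch of that kwarg (per the 0.19 docs mentioned above); note it resamples on the Date level only and aggregates across the other index levels, so by itself it does not keep the per-State/City grouping from the question:
df.resample('2D', level='Date').sum()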
I know this question is a few years old, but I had the same problem and came to a simpler solution that requires 1 line:
>>> import pandas as pd
>>> ts = pd.read_pickle('time_series.pickle')
>>> ts
xxxxxx1 yyyyyyyyyyyyyyyyyyyyyy1 2012-07-01 1
2012-07-02 13
2012-07-03 1
2012-07-04 1
2012-07-05 10
2012-07-06 4
2012-07-07 47
2012-07-08 0
2012-07-09 3
2012-07-10 22
2012-07-11 3
2012-07-12 0
2012-07-13 22
2012-07-14 1
2012-07-15 2
2012-07-16 2
2012-07-17 8
2012-07-18 0
2012-07-19 1
2012-07-20 10
2012-07-21 0
2012-07-22 3
2012-07-23 0
2012-07-24 35
2012-07-25 6
2012-07-26 1
2012-07-27 0
2012-07-28 6
2012-07-29 23
2012-07-30 0
..
xxxxxxN yyyyyyyyyyyyyyyyyyyyyyN 2014-06-02 0
2014-06-03 1
2014-06-04 0
2014-06-05 0
2014-06-06 0
2014-06-07 0
2014-06-08 2
2014-06-09 0
2014-06-10 0
2014-06-11 0
2014-06-12 0
2014-06-13 0
2014-06-14 0
2014-06-15 0
2014-06-16 0
2014-06-17 0
2014-06-18 0
2014-06-19 0
2014-06-20 0
2014-06-21 0
2014-06-22 0
2014-06-23 0
2014-06-24 0
2014-06-25 4
2014-06-26 0
2014-06-27 1
2014-06-28 0
2014-06-29 0
2014-06-30 1
2014-07-01 0
dtype: int64
>>> ts.unstack().T.resample('W', how='sum').T.stack()
xxxxxx1 yyyyyyyyyyyyyyyyyyyyyy1 2012-06-25/2012-07-01 1
2012-07-02/2012-07-08 76
2012-07-09/2012-07-15 53
2012-07-16/2012-07-22 24
2012-07-23/2012-07-29 71
2012-07-30/2012-08-05 38
2012-08-06/2012-08-12 258
2012-08-13/2012-08-19 144
2012-08-20/2012-08-26 184
2012-08-27/2012-09-02 323
2012-09-03/2012-09-09 198
2012-09-10/2012-09-16 348
2012-09-17/2012-09-23 404
2012-09-24/2012-09-30 380
2012-10-01/2012-10-07 367
2012-10-08/2012-10-14 163
2012-10-15/2012-10-21 338
2012-10-22/2012-10-28 252
2012-10-29/2012-11-04 197
2012-11-05/2012-11-11 336
2012-11-12/2012-11-18 234
2012-11-19/2012-11-25 143
2012-11-26/2012-12-02 204
2012-12-03/2012-12-09 296
2012-12-10/2012-12-16 146
2012-12-17/2012-12-23 85
2012-12-24/2012-12-30 198
2012-12-31/2013-01-06 214
2013-01-07/2013-01-13 229
2013-01-14/2013-01-20 192
...
xxxxxxN yyyyyyyyyyyyyyyyyyyyyyN 2013-12-09/2013-12-15 3
2013-12-16/2013-12-22 0
2013-12-23/2013-12-29 0
2013-12-30/2014-01-05 1
2014-01-06/2014-01-12 3
2014-01-13/2014-01-19 6
2014-01-20/2014-01-26 11
2014-01-27/2014-02-02 0
2014-02-03/2014-02-09 1
2014-02-10/2014-02-16 4
2014-02-17/2014-02-23 3
2014-02-24/2014-03-02 1
2014-03-03/2014-03-09 4
2014-03-10/2014-03-16 0
2014-03-17/2014-03-23 0
2014-03-24/2014-03-30 9
2014-03-31/2014-04-06 1
2014-04-07/2014-04-13 1
2014-04-14/2014-04-20 1
2014-04-21/2014-04-27 2
2014-04-28/2014-05-04 8
2014-05-05/2014-05-11 7
2014-05-12/2014-05-18 5
2014-05-19/2014-05-25 2
2014-05-26/2014-06-01 8
2014-06-02/2014-06-08 3
2014-06-09/2014-06-15 0
2014-06-16/2014-06-22 0
2014-06-23/2014-06-29 5
2014-06-30/2014-07-06 1
dtype: int64
ts.unstack().T.resample('W', how='sum').T.stack() is all it took! Very easy and seems quite performant. The pickle I'm reading in is 331M, so this is a pretty beefy data structure; the resampling takes just a couple seconds on my MacBook Pro.
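Note that the how= argument has since been removed from resample; in current pandas the same one-liner would presumably read (a sketch, untested against the pickle above):
ts.unstack().T.resample('W').sum().T.stack()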
I haven't checked the efficiency of this, but my instinctual way of performing datetime operations on a multi-index was by a kind of manual "split-apply-combine" process using a dictionary comprehension.
Assuming your DataFrame is unindexed (you can do .reset_index() first), this works as follows:
Group by the non-date columns
Set "Date" as index and resample each chunk
Reassemble using pd.concat
The final code looks like:
pd.concat({g: x.set_index("Date").resample("2D").mean()
for g, x in house.groupby(["State", "City"])})
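As a sketch, the same pattern applied to the df from this question (my own adaptation; house above comes from the answerer's data):
# Move 'Date' out of the index, split per (State, City), resample each chunk, reassemble:
flat = df.reset_index('Date')
result = pd.concat({g: x.set_index('Date').resample('2D').sum()
                    for g, x in flat.groupby(['State', 'City'])})
result.index.names = ['State', 'City', 'Date']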
I have tried this on my own; it's pretty short and pretty simple too (I will only work with 2 index levels, but you will get the full idea):
Step 1: resample the date, but that gives you the date without the other index:
new=df.reset_index('City').groupby('crime', group_keys=False).resample('2d').sum().pad()
That would give you the date and its count.
Step 2: get the categorical index in the same order as the date:
col=df.reset_index('City').groupby('City', group_keys=False).resample('2D').pad()[['City']]
That would give you a new column with the city names and in the same order as the date.
Step 3: merge the dataframes together
new_df=pd.concat([new, col], axis=1)
It's pretty simple; you could make it even shorter though.
