Perform multiple operations in a single groupby call with pandas? - python

I'd like to produce a summary dataframe after grouping by date. I want a column that shows the mean of a given column as-is, and the mean of that same column after filtering for values greater than 0. I figured out how to do this (below), but it requires two separate groupby calls, renaming the columns, and then joining them back together. I feel like this should be possible in a single call. I tried using eval, but kept getting an error telling me to use apply instead, since eval can't be used on a groupby object.
Code which gets me what I want but doesn't seem very efficient:
import pandas as pd

# Sample data
data = pd.DataFrame(
    {"year": [2013, 2013, 2013, 2014, 2014, 2014],
     "month": [1, 2, 3, 1, 2, 3],
     "day": [1, 1, 1, 1, 1, 1],
     "delay": [0, -4, 50, -60, 9, 10]})

subset = (data
          .groupby(['year', 'month', 'day'])['delay']
          .mean()
          .reset_index()
          .rename(columns={'delay': 'avg_delay'})
          )
subset_1 = (data[data.delay > 0]
            .groupby(['year', 'month', 'day'])['delay']
            .mean()
            .reset_index()
            .rename(columns={'delay': 'avg_delay_pos'})
            )
combined = pd.merge(subset, subset_1, how='left', on=['year', 'month', 'day'])
combined
year month day avg_delay avg_delay_pos
0 2013 1 1 0 NaN
1 2013 2 1 -4 NaN
2 2013 3 1 50 50.0
3 2014 1 1 -60 NaN
4 2014 2 1 9 9.0
5 2014 3 1 10 10.0

IIUC, you could use the following code:
>>> data['avg_delay'] = data.pop('delay')
>>> data['avg_delay_pos'] = data.loc[data['avg_delay'].gt(0), 'avg_delay']
>>> data
day month year avg_delay avg_delay_pos
0 1 1 2013 0 NaN
1 1 2 2013 -4 NaN
2 1 3 2013 50 50.0
3 1 1 2014 -60 NaN
4 1 2 2014 9 9.0
5 1 3 2014 10 10.0
>>>
Explanation:
I first remove the delay column with pop and assign it under the new name avg_delay, which effectively renames delay to avg_delay.
Then I create a new column called avg_delay_pos: loc selects the values greater than zero, and since the index is preserved, those rows receive the corresponding avg_delay values while the remaining rows get no assignment and are therefore NaN, as you expected.
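Note the above works here because each (year, month, day) group holds exactly one row, so no aggregation is actually performed. For the general case, one sketch that stays within a single groupby call is named aggregation (available in pandas >= 0.25), with column names following the question's example:
combined = (data
            .groupby(['year', 'month', 'day'])
            .agg(avg_delay=('delay', 'mean'),
                 # mean of the positive delays only; NaN when a group has none
                 avg_delay_pos=('delay', lambda s: s[s > 0].mean()))
            .reset_index())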

The solution is specific to your problem, but you can do this using a single groupby call. To get "avg_delay_pos", you just have to remove negative (and zero) values.
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
(df.filter(like='delay')
   .groupby(pd.to_datetime(df[['year', 'month', 'day']]))
   .mean()
   .add_prefix('avg_'))
avg_delay avg_delay_pos
2013-01-01 0 NaN
2013-02-01 -4 NaN
2013-03-01 50 50.0
2014-01-01 -60 NaN
2014-02-01 9 9.0
2014-03-01 10 10.0
Breakdown
where is used to mask values that are not positive.
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
# df['delay'].where(df['delay'] > 0)
0 NaN
1 NaN
2 50.0
3 NaN
4 9.0
5 10.0
Name: delay, dtype: float64
Next, select the delay columns we want to aggregate,
df.filter(like='delay')
delay delay_pos
0 0 NaN
1 -4 NaN
2 50 50.0
3 -60 NaN
4 9 9.0
5 10 10.0
Then perform a groupby on the date,
_.groupby(pd.to_datetime(df[['year', 'month', 'day']])).mean()
delay delay_pos
2013-01-01 0 NaN
2013-02-01 -4 NaN
2013-03-01 50 50.0
2014-01-01 -60 NaN
2014-02-01 9 9.0
2014-03-01 10 10.0
Here pd.to_datetime converts the year/month/day columns into a single datetime column; it's more efficient to group on a single column than on several.
pd.to_datetime(df[['year', 'month', 'day']])
0 2013-01-01
1 2013-02-01
2 2013-03-01
3 2014-01-01
4 2014-02-01
5 2014-03-01
dtype: datetime64[ns]
The final .add_prefix('avg_') adds the prefix "avg_" to the column names of the result.
An alternative way to do this if you want separate year/month/day columns would be
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
df.groupby(['year', 'month', 'day']).mean().add_prefix('avg_').reset_index()
year month day avg_delay avg_delay_pos
0 2013 1 1 0 NaN
1 2013 2 1 -4 NaN
2 2013 3 1 50 50.0
3 2014 1 1 -60 NaN
4 2014 2 1 9 9.0
5 2014 3 1 10 10.0

Related

Finding historical seasonal average for given month in a monthly series in a dataframe time-series

I have a dataframe (snippet below) with an index in YYYYMM format and several columns of values, including one called "month", which holds the MM part extracted from the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.
Try defining a function and using the apply method:
# extract the year from the YYYYMM index
df['year'] = df['index'].astype(int) // 100

def get_stavg(df, year, month):
    # average 'st' for the same month over the five preceding years
    df_year_month = df.query('@year - 5 <= year < @year and month == @month')
    return df_year_month.st.mean()

df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
If you are looking for a pandas-only solution, you could do something like the following.
Dummy Data
Here we create a dummy dataset with 10 years of data and only two months (Jan and Feb).
import pandas as pd
df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
By definition, the result is NaN for the first four years within each month group.
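Note that rolling(5) includes the current row in each window. If, as the question asks, the average should cover only the five preceding years for a given month (excluding the current one), one sketch is to shift within each month group before rolling:
# exclude the current year by shifting each month group down one slot
df["n_mean_prev"] = df.groupby(df["date"].dt.month)["n"]\
                      .transform(lambda s: s.shift().rolling(5).mean())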
Update
For your particular case
import pandas as pd
index = [f"{y}01" for y in range(2010, 2020)] + \
        [f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index": index})
df["st"] = df.index + 1
# dates/index should be sorted
df = df.sort_values("index").reset_index(drop=True)
# extract month
df["month"] = df["index"].str[-2:]
df["st_mean"] = df.groupby("month")["st"]\
                  .rolling(5).mean()\
                  .reset_index(0, drop=True)

Faster way to construct a multiindex of dates in Pandas

I have a Pandas dataframe, df. Here are the first five rows:
Id StartDate EndDate
0 0 2015-08-11 2018-07-13
1 1 2014-02-15 2016-01-25
2 2 2014-12-20 NaT
3 3 2015-01-09 2015-01-14
4 4 2014-07-20 NaT
I want to construct a new dataframe, df2. df2 should have a row for each month between StartDate and EndDate, inclusive, for each Id in df. For example, since the first row of df has StartDate in August 2015 and EndDate in July 2018, df2 should have rows corresponding to August 2015, September 2015, ..., July 2018. If an Id in df has no EndDate, we will take it to be June 2019.
I would like df2 to use a multiindex with the first level being the corresponding Id in df, the second level being the year, and the third level being the month. For example, if the above five rows were all of df, then df2 should look like:
Id Year Month
0 2015 8
9
10
11
12
2016 1
2
3
4
5
6
7
8
9
10
11
12
2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
... ... ...
4 2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
2
3
4
5
6
7
8
9
10
11
12
2019 1
2
3
4
5
6
The following code does the trick, but takes about 20 seconds on my decent laptop for 10k Ids. Can I be more efficient somehow?
import numpy as np
import pandas as pd

def build_multiindex_for_id_(id_, enroll_month, enroll_year, cancel_month, cancel_year):
    # Given id_ and start/end dates,
    # returns a 2d array to be converted to a multiindex.
    # Each column of the returned array represents a month/year
    # between the enroll date and cancel date inclusive.
    year = enroll_year
    month = enroll_month
    multiindex_array = [[], [], []]
    while (month != cancel_month) or (year != cancel_year):
        multiindex_array[0].append(id_)
        multiindex_array[1].append(year)
        multiindex_array[2].append(month)
        month += 1
        if month == 13:
            month = 1
            year += 1
    multiindex_array[0].append(id_)
    multiindex_array[1].append(year)
    multiindex_array[2].append(month)
    return np.array(multiindex_array)
# Begin by constructing the array for the first id.
array_for_multiindex = build_multiindex_for_id_(0, 8, 2015, 7, 2018)
# Append the rest of the multiindices for the remaining ids.
for _, row in df.loc[1:].fillna(pd.to_datetime('2019-06-30')).iterrows():
    current_id_array = build_multiindex_for_id_(
        row['Id'],
        row['StartDate'].month,
        row['StartDate'].year,
        row['EndDate'].month,
        row['EndDate'].year)
    array_for_multiindex = np.append(array_for_multiindex, current_id_array, axis=1)
df2_index = pd.MultiIndex.from_arrays(array_for_multiindex).rename(['Id', 'Year', 'Month'])
df2 = pd.DataFrame(index=df2_index)
Here's my approach after some trial and error:
(df.melt(id_vars='Id')
   .fillna(pd.to_datetime('June 2019'))
   .set_index('value')
   .groupby('Id').apply(lambda x: x.asfreq('M').ffill())
   .reset_index('value')
   .assign(year=lambda x: x['value'].dt.year,
           month=lambda x: x['value'].dt.month)
   .set_index(['year', 'month'], append=True)
)
Output:
value Id variable
Id year month
0 2015 8 2015-08-31 NaN NaN
9 2015-09-30 NaN NaN
10 2015-10-31 NaN NaN
11 2015-11-30 NaN NaN
12 2015-12-31 NaN NaN
2016 1 2016-01-31 NaN NaN
2 2016-02-29 NaN NaN
3 2016-03-31 NaN NaN
4 2016-04-30 NaN NaN
5 2016-05-31 NaN NaN
6 2016-06-30 NaN NaN
7 2016-07-31 NaN NaN
8 2016-08-31 NaN NaN
9 2016-09-30 NaN NaN
10 2016-10-31 NaN NaN
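For comparison, a vectorized sketch (not from the answer above) that avoids the per-row np.append: build a monthly period range per Id with pd.period_range, explode it (explode needs pandas >= 0.25), and move the pieces into the index. It assumes df holds the question's Id/StartDate/EndDate columns:
# missing EndDate is taken to be June 2019, as in the question
end = df['EndDate'].fillna(pd.Timestamp('2019-06-30'))
# one monthly PeriodIndex per row, start to end inclusive
months = [pd.period_range(s, e, freq='M')
          for s, e in zip(df['StartDate'], end)]
df2 = df[['Id']].assign(period=months).explode('period')
periods = pd.PeriodIndex(df2['period'])
df2['Year'] = periods.year
df2['Month'] = periods.month
df2 = df2.drop(columns='period').set_index(['Id', 'Year', 'Month'])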

Pandas - Replace NaNs in a column with the mean of specific group

I am working with data like the following. The dataframe is sorted by the date:
category value Date
0 1 24/5/2019
1 NaN 24/5/2019
1 1 26/5/2019
2 2 1/6/2019
1 2 23/7/2019
2 NaN 18/8/2019
2 3 20/8/2019
7 3 1/9/2019
1 NaN 12/9/2019
2 NaN 13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis=1).transform(lambda x: x.fillna(x.mean()))
Neither of these gives exactly what I want. If someone could guide me on this it would be much appreciated!
You can fill the missing values with a new Series built from shift + expanding + mean. The first NaN in a group is not replaced, because that group has no previous values to average:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
s = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print (df)
   category  value       Date
0         0    1.0 2019-05-24
1         1    NaN 2019-05-24
2         1    1.0 2019-05-26
3         2    2.0 2019-06-01
4         1    2.0 2019-07-23
5         2    2.0 2019-08-18
6         2    3.0 2019-08-20
7         7    3.0 2019-09-01
8         1    1.5 2019-09-12
9         2    2.5 2019-09-13
You can use pandas.Series.fillna to replace NaN values:
df['value'] = df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
category value Date
0 0 1.0 24/5/2019
1 1 NaN 24/5/2019
2 1 1.0 26/5/2019
3 2 2.0 1/6/2019
4 1 2.0 23/7/2019
5 2 2.0 18/8/2019
6 2 3.0 20/8/2019
7 7 3.0 1/9/2019
8 1 1.5 12/9/2019
9 2 2.5 13/9/2019

Reindexing after a pivot in pandas

Consider the following dataset, a long-format table with Product_Code, Month, and Sales columns (the original post shows it as an image; reproduction code appears in the second answer below).
After running the code:
convert_dummy1 = convert_dummy.pivot(index='Product_Code', columns='Month', values='Sales').reset_index()
The data is in the right form, but my index column is named 'Month', and I cannot seem to remove this at all. I have tried code such as the line below, but it does nothing.
del convert_dummy1.index.name
I can save the dataset to a csv, delete the ID column, and then read the csv - but there must be a more efficient way.
Dataset after reset_index():
convert_dummy1
Month Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
convert_dummy1.index = pd.RangeIndex(len(convert_dummy1.index))
del convert_dummy1.columns.name
convert_dummy1
Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
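Side note: del convert_dummy1.columns.name (used above) can fail on newer pandas versions, where Index.name is a property that cannot be deleted. Assigning None is the version-proof way to drop the label, a minimal sketch:
# remove the stray 'Month' label from the columns axis
convert_dummy1.columns.name = None
# equivalent from pandas 0.24 onward:
# convert_dummy1 = convert_dummy1.rename_axis(None, axis=1)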
Since you pivot with columns="Month", each column in the output corresponds to a month. If you reset the index after the pivot, you can check the column names with convert_dummy1.columns.values, which in your case should return:
array(['Product_Code', 1, 2, 3, 4, 5], dtype=object)
while convert_dummy1.columns.names should return:
FrozenList(['Month'])
So to rename Month, use the rename_axis function:
convert_dummy1.rename_axis('index', axis=1)
Output:
index Product_Code 1 2 3 4 5
0 10133 NaN NaN NaN NaN 0.0
1 10234 NaN 0.0 NaN NaN NaN
2 10245 0.0 NaN NaN NaN NaN
3 10345 NaN NaN NaN 0.0 NaN
4 10987 NaN NaN 1.0 NaN NaN
If you wish to reproduce it, this is my code:
df1=pd.DataFrame({'Product_Code':[10133,10245,10234,10987,10345], 'Month': [1,2,3,4,5], 'Sales': [0,0,0,1,0]})
df2=df1.pivot_table(index='Product_Code', columns='Month', values='Sales').reset_index().rename_axis('index',axis=1)

(pandas) Fill NaN based on groupby and column condition

Using 'bfill' or 'ffill' on a groupby element is trivial, but what if you need to fill the NaN with a specific value from a second column, based on a condition in a third column?
For example:
>>> df=pd.DataFrame({'date':['01/10/2017', '02/09/2017', '02/10/2016','01/10/2017', '01/11/2017', '02/10/2016'], 'a':[1,1,1,2,2,2], 'b':[4,np.nan,6, 5, np.nan, 7]})
>>> df
a b date
0 1 4.0 01/10/2017
1 1 NaN 02/09/2017
2 1 6.0 02/10/2016
3 2 5.0 01/10/2017
4 2 NaN 01/11/2017
5 2 7.0 02/10/2016
I need to group by column 'a', and fill the NaN with the column 'b' value where the date for that row is closest to the date in the NaN row.
So the output should look like:
a b date
0 1 4.0 01/10/2017
1 1 6.0 02/09/2017
2 1 6.0 02/10/2016
3 2 5.0 01/10/2017
4 2 5.0 01/11/2017
5 2 7.0 02/10/2016
Assume there is a closest_date() function that takes the NaN date and the list of other dates in that group, and returns the closest date.
I'm trying to find a clean solution that doesn't have to iterate through rows, ideally able to use apply() with lambdas. Any ideas?
This should work:
df['closest_date_by_a'] = df.groupby('a')['date'].apply(closest_date)
df['b'] = df.groupby(['a', 'closest_date_by_a'])['b'].ffill().bfill()
Given a function (closest_date()), you need to apply it by group so it calculates the closest dates for rows within each group. Then you can group by both the main grouping column (a) and the closest-date column (closest_date_by_a) and perform your filling.
Make sure your date column actually holds datetimes:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'date': ['01/10/2017', '02/09/2017', '02/10/2016', '01/10/2017', '01/11/2017', '02/10/2016'],
     'a': [1, 1, 1, 2, 2, 2],
     'b': [4, np.nan, 6, 5, np.nan, 7]})
df.date = pd.to_datetime(df.date)
print(df)
a b date
0 1 4.0 2017-01-10
1 1 NaN 2017-02-09
2 1 6.0 2016-02-10
3 2 5.0 2017-01-10
4 2 NaN 2017-01-11
5 2 7.0 2016-02-10
Use reindex with method='nearest' after dropping the NaNs:
def fill_with_nearest(df):
    s = df.set_index('date').b
    s = s.dropna().reindex(s.index, method='nearest')
    s.index = df.index
    return s

df.loc[df.b.isnull(), 'b'] = df.groupby('a').apply(fill_with_nearest).reset_index(0, drop=True)
print(df)
a b date
0 1 4.0 2017-01-10
1 1 4.0 2017-02-09
2 1 6.0 2016-02-10
3 2 5.0 2017-01-10
4 2 5.0 2017-01-11
5 2 7.0 2016-02-10
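For reference, another sketch (not from the answers above) that avoids apply entirely: pd.merge_asof with direction='nearest' can match each NaN row to the nearest-dated valid row within its 'a' group. Both frames must be sorted on the merge key:
valid = df.dropna(subset=['b']).sort_values('date')
missing = df[df['b'].isna()].sort_values('date')
# by='a' restricts matching to rows in the same group;
# direction='nearest' picks the closest date on either side
nearest = pd.merge_asof(missing.drop(columns='b'), valid,
                        on='date', by='a', direction='nearest')
df.loc[missing.index, 'b'] = nearest['b'].to_numpy()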
