I am using Python 3.6. I am doing an aggregation that works correctly, but the resulting column names are not in the form I want.
import pandas as pd

df = pd.DataFrame({'ID':[1,1,2,2,2],
'revenue':[1,3,5,1,5],
'month':['2012-01-01','2012-01-01','2012-03-01','2014-01-01','2012-01-01']})
print(df)
ID month revenue
0 1 2012-01-01 1
1 1 2012-01-01 3
2 2 2012-03-01 5
3 2 2014-01-01 1
4 2 2012-01-01 5
I do the aggregation below:
df = df.groupby(['ID']).agg({'revenue':'sum','month':[('distinct_m','nunique'),('month_m','first')]}).reset_index()
print(df)
ID revenue month
sum distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
Desired output is:
ID revenue distinct_m month
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
The problem is that I am using a mixed form of expressions inside agg(). Had it been only agg({'revenue':'sum'}), I would have got a column named revenue in precisely the format I wanted, as shown below:
ID revenue
0 1 4
1 2 11
But since I am also creating 2 additional columns using the tuple form ('distinct_m','nunique'),('month_m','first'), the column names end up spread across two header rows.
Is there a way to get the desired output shown above in a single agg() call? I want to avoid using the tuple form for 'revenue':'sum', and I am not looking for multiple operations afterwards to fix the column names.
To avoid this problem, use named aggregation, available in pandas 0.25+, where it is possible to specify each output column's name:
df = (df.groupby(['ID']).agg(revenue=('revenue','sum'),
distinct_m=('month','nunique'),
month_m = ('month','first')
).reset_index())
print(df)
ID revenue distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
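If you prefer the explicit spelling, the same named aggregation can be written with pd.NamedAgg, which is just the documented long form of the keyword-tuple syntax above:
df = (df.groupby('ID').agg(revenue=pd.NamedAgg(column='revenue', aggfunc='sum'),
                           distinct_m=pd.NamedAgg(column='month', aggfunc='nunique'),
                           month_m=pd.NamedAgg(column='month', aggfunc='first'))
        .reset_index())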
For lower pandas versions, it is possible to flatten the MultiIndex columns and then rename:
df = df.groupby(['ID']).agg({'revenue':'sum',
'month':[('distinct_m','nunique'),('month_m','first')]})
df.columns = df.columns.map('_'.join)
df = df.rename(columns={'revenue_sum':'revenue',
'month_distinct_m':'distinct_m',
'month_month_m':'month_m'})
df = df.reset_index()
print(df)
ID revenue distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
I am looking for a way to identify each group's 'master' row. I define the master row as, for each group_id, the row with the minimum cust_hierarchy; if there is a tie, the row with the most recent date.
I have supplied a sample table below:
row_id  group_id  cust_hierarchy  most_recent_date  master (what I am looking for)
1       0         2               2020-01-03        1
2       0         7               2019-01-01        0
3       1         7               2019-05-01        0
4       1         6               2019-04-01        0
5       1         6               2019-04-03        1
I was thinking of ordering by the two columns (cust_hierarchy ascending, most_recent_date descending) and then adding a new column that places a 1 on the first row of each group_id; see the sketch just below.
Does anyone have any helpful code for this?
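As a side note, the sorting idea sketched in the question works as stated; a minimal sketch, assuming a dataframe df with the columns from the table above:
# sort so the master row comes first within each group:
# lowest cust_hierarchy first, ties broken by the most recent date
df = df.sort_values(['cust_hierarchy', 'most_recent_date'],
                    ascending=[True, False])
# flag the first row of each group_id in this order, then restore the original order
df['master'] = (df.groupby('group_id').cumcount() == 0).astype(int)
df = df.sort_index()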
You can basically do a groupby with idxmin(), plus a little bit of sorting beforehand to ensure the most recent date wins ties in the min operation:
import pandas as pd
import numpy as np
# example data
dates = ['2020-01-03','2019-01-01','2019-05-01',
'2019-04-01','2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id':[0,0,1,1,1],
'cust_hierarchy':[2,7,7,6,6,],
'most_recent_date':dates})
# solution
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
df before:
group_id cust_hierarchy most_recent_date
0 0 2 2020-01-03
1 0 7 2019-01-01
2 1 7 2019-05-01
3 1 6 2019-04-01
4 1 6 2019-04-03
df after:
group_id cust_hierarchy most_recent_date master
0 0 2 2020-01-03 True
1 0 7 2019-01-01 False
2 1 7 2019-05-01 False
3 1 6 2019-04-01 False
4 1 6 2019-04-03 True
Use duplicated on sort_values: after sorting so that each group's master row comes last, duplicated('group_id', keep='last') flags every other row, and 1 - ... flips that into the master indicator:
df['master'] = 1 - (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                   ascending=[False, True])
                      .duplicated('group_id', keep='last')
                      .astype(int))
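With the sample frame from the previous answer, this reproduces the master column from the question (as 1/0 rather than True/False):
   group_id  cust_hierarchy most_recent_date  master
0         0               2       2020-01-03       1
1         0               7       2019-01-01       0
2         1               7       2019-05-01       0
3         1               6       2019-04-01       0
4         1               6       2019-04-03       1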
I have a dataframe like as shown below
import pandas as pd

df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-03 13:39:00','2173-07-04 11:30:00','2173-04-04 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What I would like to do is drop subjects that don't have 4 or more unique days.
In my sample dataframe, subject_id = 1 has only 3 unique days (3, 4 and 5), so I would like to drop subject_id = 1 completely. But subject_id = 2 has 5 unique days (4, 9, 11, 13 and 14). Please note that the date values have timestamps, hence I extract the day from each datetime field and check for unique values.
This is what I tried
df.groupby(['subject_id','day']).transform('size')>4 # doesn't work
df[df.groupby(['subject_id','day'])['subject_id'].transform('size')>=4] # doesn't produce expected output
I expect my output to contain only the rows for subject_id = 2.
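For reference, the quantity the filter needs is unique days per subject, which can be inspected directly:
print(df.groupby('subject_id')['day'].nunique())
subject_id
1    3
2    5
Name: day, dtype: int64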
Change your function from size to DataFrameGroupBy.nunique and group only by the subject_id column; transform broadcasts the per-group count back to every row, producing a boolean mask aligned with df:
df = df[df.groupby('subject_id')['day'].transform('nunique')>=4]
Or alternatively you can use filtration (GroupBy.filter), but this should be slower on a larger dataframe or with many unique groups:
df = df.groupby('subject_id').filter(lambda x: x['day'].nunique()>=4)
print (df)
subject_id time_1 val day month
7 2 2173-04-04 16:00:00 5 4 4
8 2 2173-04-09 22:00:00 8 9 4
9 2 2173-04-11 04:00:00 3 11 4
10 2 2173-04-13 04:30:00 4 13 4
11 2 2173-04-14 08:00:00 6 14 4
I have two large Dataframes. The first one contains data, consisting of a date column and a location column, followed by several data column. The second DataFrame consists of a date column and a location column. I want to remove all the rows where the date and the location of df1 match df2.
I have tried a few ways to fix this, including drop statements, drop statements within for loops, and redefining the dataframe based on multiple conditions. None of them work.
import pandas as pd

date = pd.to_datetime(['2019-01-01','2019-01-01','2019-01-02','2019-01-02','2019-01-03','2019-01-03'],format='%Y-%m-%d')
location = [1,2,1,2,1,2]
nr = [8,10,15,2,20,38]
df1 = pd.DataFrame(columns=['date','location','nr'])
df1['date']=date
df1['location']=location
df1['nr']=nr
this results in the following dataframe:
date location nr
0 2019-01-01 1 8
1 2019-01-01 2 10
2 2019-01-02 1 15
3 2019-01-02 2 2
4 2019-01-03 1 20
5 2019-01-03 2 38
the second dataframe:
date2 = pd.to_datetime(['2019-01-01','2019-01-02'],format='%Y-%m-%d')
location2 = [2,1]
df2 = pd.DataFrame(columns=['date','location'])
df2['date']=date2
df2['location']=location2
resulting in the following dataframe:
date location
0 2019-01-01 2
1 2019-01-02 1
then the drop statement:
for i in range(len(df2)):
dayA = df2['date'].iloc[i]
placeA = df2['location'].iloc[i]
df1.drop(df1.loc[(df1['date']==dayA)& (df1['location']==placeA)],inplace=True)
which, for the example above, results in the error:
KeyError: "['date' 'location' 'nr'] not found in axis"
However, on my larger dataframe it results in the error:
TypeError: 'NoneType' object is not iterable
What I need, however, is:
date location nr
0 2019-01-01 1 8
3 2019-01-02 2 2
4 2019-01-03 1 20
5 2019-01-03 2 38
What am I doing wrong?
df1.loc[(df1['date']==dayA) & (df1['location']==placeA)] is a dataframe of the rows where the date and location match, but drop expects the index labels of those rows. So you need df1.loc[(df1['date']==dayA) & (df1['location']==placeA)].index. However, this is a very inefficient method; you can use merge instead, as the other answers discuss. Another option is df1 = df1.loc[~df1[['date','location']].apply(tuple, axis=1).isin(zip(df2.date, df2.location))].
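For completeness, the original loop with that one fix applied (passing .index to drop):
for i in range(len(df2)):
    dayA = df2['date'].iloc[i]
    placeA = df2['location'].iloc[i]
    # drop expects index labels, not a frame of matching rows
    df1.drop(df1.loc[(df1['date'] == dayA) & (df1['location'] == placeA)].index,
             inplace=True)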
I would use pandas merge and a little trick:
# mark every row of df2 with a sentinel value
df2['temp'] = 2
# after an outer merge, rows that matched df2 carry a non-NaN temp
df = pd.merge(df1, df2, how='outer', on=['date', 'location'])
# keep only the rows that did not match, then drop the helper column
df = df[pd.isna(df.temp)]
del df['temp']
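A variant of the same idea avoids the helper column by using merge's indicator parameter, which tags each row with the frame(s) it came from:
df = df1.merge(df2, on=['date', 'location'], how='left', indicator=True)
# keep only the df1 rows that had no match in df2
df = df[df['_merge'] == 'left_only'].drop(columns=['_merge'])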
The problem is with this line:
df1.drop(df1.loc[(df1['date']==dayA)& (df1['location']==placeA)],inplace=True)
You can achieve your target like this:
df1 = df1.loc[~((df1['date']==dayA) & (df1['location']==placeA))]
Basically, every time a row of df1 matches the current date/location pair, it is filtered out of df1.
Output:
date location nr
0 2019-01-01 1 8
3 2019-01-02 2 2
4 2019-01-03 1 20
5 2019-01-03 2 38
Use pandas merge. This should work:
df1['index_col'] = df1.index
# an inner merge keeps only the df1 rows whose date/location also appear in df2
matched = df1.merge(df2, on=['date','location'], how='inner')
# keep the df1 rows whose index is not among the matches
result_df = df1[~df1.index_col.isin(matched.index_col)].drop(columns=['index_col'])
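For the example data, result_df then matches the desired output:
        date  location  nr
0 2019-01-01         1   8
3 2019-01-02         2   2
4 2019-01-03         1  20
5 2019-01-03         2  38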
I want to write a transformation function accessing two columns from a DataFrame and pass it to transform().
Here is the DataFrame which I would like to modify:
print(df)
date increment
0 2012-06-01 0
1 2003-04-08 1
2 2009-04-22 3
3 2018-05-24 6
4 2006-09-25 2
5 2012-11-02 4
I would like to increment the year in column date by the number of years given in column increment. The proposed code (which does not work) is:
df.transform(lambda df: date(df.date.year + df.increment, 1, 1))
Is there a way to access individual columns in the function (here a lambda function) passed to transform()?
You can use pandas.to_timedelta. Note that unit='Y' adds an average year of 365.2425 days, which is why the output below carries a time-of-day component (and newer pandas versions no longer accept 'Y' as a to_timedelta unit):
# If necessary convert to date type first
# df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'] + pd.to_timedelta(df['increment'], unit='Y')
[out]
date increment
0 2012-06-01 00:00:00 0
1 2004-04-07 05:49:12 1
2 2012-04-21 17:27:36 3
3 2024-05-23 10:55:12 6
4 2008-09-24 11:38:24 2
5 2016-11-01 23:16:48 4
or alternatively:
df['date'] = pd.to_datetime({'year': df.date.dt.year.add(df.increment),
'month': df.date.dt.month,
'day': df.date.dt.day})
[out]
date increment
0 2012-06-01 0
1 2004-04-08 1
2 2012-04-22 3
3 2024-05-24 6
4 2008-09-25 2
5 2016-11-02 4
Your own solution could also be fixed by using the apply method with axis=1, so the function receives one row at a time:
from datetime import date

df['date'] = df.apply(lambda row: date(row.date.year + row.increment, 1, 1), axis=1)
I have the following time series dataset of the number of sales happening for a day as a pandas data frame.
date, sales
20161224,5
20161225,2
20161227,4
20161231,8
Now, if I have to include the missing data points here (i.e. the missing dates) with a constant value (zero) and make it look the following way, how can I do this efficiently (assuming the data frame is ~50 MB) using pandas?
date, sales
20161224,5
20161225,2
20161226,0**
20161227,4
20161228,0**
20161229,0**
20161231,8
**Missing rows which have been added to the data frame.
Any help will be appreciated.
You can first convert column date with to_datetime, then set_index, reindex by the min and max values of the index, reset_index, and if necessary change the format by strftime:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = df.set_index('date')
df = (df.reindex(pd.date_range(df.index.min(), df.index.max()), fill_value=0)
        .reset_index()
        .rename(columns={'index':'date'}))
print (df)
date sales
0 2016-12-24 5
1 2016-12-25 2
2 2016-12-26 0
3 2016-12-27 4
4 2016-12-28 0
5 2016-12-29 0
6 2016-12-30 0
7 2016-12-31 8
Last, if you need to change the format back:
df.date = df.date.dt.strftime('%Y%m%d')
print (df)
date sales
0 20161224 5
1 20161225 2
2 20161226 0
3 20161227 4
4 20161228 0
5 20161229 0
6 20161230 0
7 20161231 8
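As an alternative sketch for the same task: once the date column is parsed and set as a sorted index, asfreq can insert the missing daily rows and fill the zeros in one call:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = (df.set_index('date')
        .asfreq('D', fill_value=0)   # upsample to daily, new rows get 0
        .reset_index())              # the index name is kept, so 'date' comes back as a column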