Group by Year and Month in a Pandas Pivot Table - python

I have data like this:
Date LoanOfficer User_Name Loan_Number
0 2017-11-30 00:00:00 Mark Evans underwriterx 1100000293
1 2017-11-30 00:00:00 Kimberly White underwritery 1100004947
2 2017-11-30 00:00:00 DClair Phillips underwriterz 1100007224
I've created a pivot table from df like this:
pd.pivot_table(df, index=["User_Name", "LoanOfficer"],
               values=["Loan_Number"],
               aggfunc='count', fill_value=0,
               columns=["Date"])
However, I need the Date column to be grouped by year and month. I looked at other solutions that resample the dataframe and then apply the pivot, but they only handle months and days. Any help would be appreciated.

You can convert your Date column to %Y-%m, then do the pivot_table:
df.Date = pd.to_datetime(df.Date)       # parse the strings to datetimes first
df.Date = df.Date.dt.strftime('%Y-%m')  # then format as year-month strings
df
Out[143]:
Date LoanOfficer User_Name Loan_Number
0 2017-11 Mark Evans underwriterx 1100000293
1 2017-11 Kimberly White underwritery 1100004947
2 2017-11 DClair Phillips underwriterz 1100007224
pd.pivot_table(df, index=["User_Name", "LoanOfficer"],
               values=["Loan_Number"],
               aggfunc='count', fill_value=0,
               columns=["Date"])
Out[144]:
Loan_Number
Date 2017-11
User_Name LoanOfficer
underwriterx Mark Evans 1
underwritery Kimberly White 1
underwriterz DClair Phillips 1
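An alternative sketch, assuming the same df: a pd.Grouper in columns buckets the dates by month while keeping real datetimes instead of strings, so the resulting columns still sort chronologically:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# group the pivot columns into monthly buckets; no string conversion needed
pd.pivot_table(df, index=["User_Name", "LoanOfficer"],
               values=["Loan_Number"],
               aggfunc='count', fill_value=0,
               columns=pd.Grouper(key='Date', freq='M'))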

Related

Calculate Percent Change Between Rows in Pandas Grouped by Another Column

I am somewhat new to Pandas and I have been stuck on a problem.
Assume I have the following dataframe (df1):
Name  Day     Score
Al    Monday  75
Al    Friday  88
Bo    Monday  90
Bo    Friday  100
Cy    Monday  85
Cy    Friday  95
I would like to create another dataframe (df2) with each person's name and their percent improvement from Monday to Friday.
The result would be:
Name  Improvement
Al    17.33
Bo    11.11
Cy    11.76
For example, Al improved by 17.33% between Monday and Friday (((88 - 75) / 75) * 100).
If the rows for each Name are always ordered Monday then Friday, as in the sample data, the solution is GroupBy.pct_change:
df = (df[['Name']].join(df.groupby('Name')['Score'].pct_change().mul(100)
                          .rename('Improvement'))
                  .dropna())
print (df)
Name Improvement
1 Al 17.333333
3 Bo 11.111111
5 Cy 11.764706
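If the Monday/Friday order is not guaranteed, a small sketch (assuming only those two days appear) sorts by an ordered categorical first, so pct_change always runs Monday to Friday:
# make Day an ordered categorical so sorting puts Monday before Friday per Name
day_order = pd.CategoricalDtype(['Monday', 'Friday'], ordered=True)
df['Day'] = df['Day'].astype(day_order)
df = df.sort_values(['Name', 'Day'])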
Let us pivot to reshape, then calculate the percent change along the column axis:
s = df.pivot(index='Name', columns='Day', values='Score')
s = s.pct_change(-1, axis=1)['Friday'].reset_index(name='Improvement')
Result
Name Improvement
0 Al 0.173333
1 Bo 0.111111
2 Cy 0.117647
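The result here is a fraction rather than a percentage; a one-line follow-up (assuming s from above) scales it to match the expected output:
# scale the fractional change to a percentage and round to two decimals
s['Improvement'] = s['Improvement'].mul(100).round(2)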

dataframe pivot based on 3 columns

I have a data frame as shown below:
customer  organization  currency       volume  revenue  Duration
Peter     XYZ Ltd       CNY, INR       20      3,000    01-Oct-2022
John      abc Ltd       INR            7       184      01-Oct-2022
Mary      aaa Ltd       USD            3       43       03-Oct-2022
John      bbb Ltd       THB            17      2,300    04-Oct-2022
Dany      ccc Ltd       CNY, INR, KRW  45      15,100   04-Oct-2022
If I pivot as shown below:
df = pd.pivot_table(df, values=['runs', 'volume', 'revenue'],
                    index=['customer', 'organization', 'currency'],
                    columns=['Duration'],
                    aggfunc=sum,
                    fill_value=0)
then level 0 of the column MultiIndex holds the value columns (volume for all Durations, revenue for all Durations, runs for all Durations) and Duration is level 1.
I would like to pivot with Duration as level 0 and volume, revenue as level 1.
How do I achieve it? In other words, I would like to have the date as level 0 and volume, revenue and runs under it.
You can use swaplevel on your current pivot output, like below; try this:
df1 = df.pivot_table(index=['customer', 'organization', 'currency'],
                     columns=['Duration'],
                     aggfunc=sum,
                     fill_value=0).swaplevel(0, 1, axis=1).sort_index(axis=1)
Hope this helps.
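As a quick check (a sketch, assuming df1 from above), the top column level should now hold the Duration dates:
# the unique values of the top column level should be the Duration dates
print(df1.columns.get_level_values(0).unique())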

Pandas - Merge nearly duplicate rows filtering the last timestamp

I have two pandas dataframes with several rows that are near duplicates of each other, except for one value, a timestamp. My goal is to merge these dataframes into a single dataframe, and for these near-duplicate rows, keep the row with the latest timestamp.
Here is an example of what I'm working with:
DF1:
id name created_at
0 1 Cristiano Ronaldo 2020-01-20
1 2 Messi 2020-01-20
2 3 Juarez 2020-01-20
DF2:
id name created_at
0 1 Cristiano Ronaldo 2020-01-20
1 2 Messi 2020-01-20
2 3 Juarez 2020-02-20
And here is what I would like:
id name created_at
3 1 Cristiano Ronaldo 2020-01-20
4 2 Messi 2020-01-20
5 3 Juarez 2020-02-20
For the row Juarez I get the last "created_at".
Is that possible?
You can concatenate the second dataframe to the first one, sort the result by the timestamp and then drop duplicates.
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_merged = pd.concat([df1, df2], ignore_index=True)
df_merged = df_merged.sort_values('created_at')
# deduplicate on every column except the timestamp, keeping the latest row
df_columns = df_merged.columns.tolist()
df_columns.remove('created_at')
df_merged.drop_duplicates(inplace=True, keep='last', subset=df_columns)
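An equivalent sketch using groupby: parse created_at to datetime, then idxmax picks the row with the latest timestamp per id/name pair:
df_all = pd.concat([df1, df2], ignore_index=True)
df_all['created_at'] = pd.to_datetime(df_all['created_at'])
# keep the row holding the maximum created_at within each (id, name) group
df_last = df_all.loc[df_all.groupby(['id', 'name'])['created_at'].idxmax()]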

Transforming dataframe to track the changes

I have some student data and the subjects they have elected.
id name date from date to Subjectname note
1188 Cera 01-08-2016 30-09-2016 math approved
1188 Cera 01-10-2016 elec
1199 ron 01-06-2017 english app-true
1288 Snow 01-01-2017 tally
1433 sansa 25-01-2016 14-07-2016 tally
1433 sansa 15-07-2016 16-01-2017 tally relected
1844 amy 01-10-2016 10-11-2017 adv
1522 stark 01-01-2016 phy
1722 sid 01-06-2017 31-03-2018 history
1722 sid 01-04-2018 history as per request
1844 amy 01-01-2016 30-09-2016 science
2100 arya 01-08-2016 30-09-2016 english
2100 arya 01-10-2016 31-05-2017 math taken
2100 arya 01-06-2017 english
I am looking for output like:
id name from to subject from subject to
1188 Cera 01-08-2016 01-10-2016 math elec
1199 ron 01-06-2017 english
1288 Snow 01-01-2017 tally
1433 sansa 25-01-2016 16-01-2017 tally tally
1522 stark 01-01-2016 phy
1722 sid 01-06-2017 01-04-2018 history history
1844 amy 01-01-2016 10-11-2017 science adv
2100 arya 01-08-2016 31-05-2017 english math
2100 arya 01-06-2017 math english
Column 'from' has the minimum date value corresponding to the name.
Column 'to' has the maximum date value corresponding to the name.
Column 'subject from' has the 'Subjectname' value corresponding to the columns 'from' and 'name'.
Column 'subject to' has the 'Subjectname' value corresponding to the columns 'to' and 'name'.
I need to track the transactions made by each student and the subject they changed (subject from and subject to).
Please let me know how to achieve this, or whether there is an easier way to get an output containing the transaction details per student and the subject they changed.
Use DataFrameGroupBy.agg after a set_index on the Subjectname column, so idxmin and idxmax can return the subjects at the minimal and maximal datetimes per group:
df['date from'] = pd.to_datetime(df['date from'])
df['date to'] = pd.to_datetime(df['date to'])
# min/idxmin and max/idxmax return the dates plus the index labels
# (here: subjects) at which they occur
d = {'date from': ['min', 'idxmin'], 'date to': ['max', 'idxmax']}
df1 = df.set_index('Subjectname').groupby(['id', 'name']).agg(d)
df1.columns = df1.columns.map('_'.join)
d1 = {'date from_min': 'from', 'date to_max': 'to',
      'date from_idxmin': 'subject from', 'date to_idxmax': 'subject to'}
cols = ['from', 'to', 'subject from', 'subject to']
df1 = df1.rename(columns=d1).reindex(columns=cols).reset_index()
print(df1)
id name from to subject from subject to
0 1188 Cera 2016-01-08 2016-09-30 math math
1 1199 ron 2017-01-06 NaT english NaN
2 1288 Snow 2017-01-01 NaT tally NaN
3 1433 sansa 2016-01-25 2017-01-16 tally tally
4 1522 stark 2016-01-01 NaT phy NaN
5 1722 sid 2017-01-06 2018-03-31 history history
6 1844 amy 2016-01-01 2017-10-11 science adv
7 2100 arya 2016-01-08 2017-05-31 english math
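One hedged caveat: the sample dates look day-first (01-08-2016 meaning 1 August), while the output above parsed them month-first. If day-first is the intended format, pass dayfirst=True when parsing:
# assuming dd-mm-yyyy input, as the sample data suggests
df['date from'] = pd.to_datetime(df['date from'], dayfirst=True)
df['date to'] = pd.to_datetime(df['date to'], dayfirst=True)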
My df uses your first 3 rows; that should be enough to demo how to do this.
df:
id name date_from date_to subject_name note
0 1188 Cera 2016-01-08 30-09-2016 math approved
1 1188 Cera 2016-01-10 elec
2 1199 ron 2017-01-06 english app-true
Just the code here:
# stack 'date from' and 'date to' into one date column to get max and min dates
df1 = df[['id', 'name', 'date_from', 'subject_name', 'note']]
df2 = df[['id', 'name', 'date_to', 'subject_name', 'note']]
df1.columns = ['id', 'name', 'date', 'subject_name', 'note']
df2.columns = ['id', 'name', 'date', 'subject_name', 'note']
df3 = pd.concat([df1, df2])
df3['date'] = pd.to_datetime(df3['date'])
df3 = df3.dropna()
df3:
id name date subject_name note
0 1188 Cera 2016-01-08 math approved
1 1188 Cera 2016-01-10 elec
2 1199 ron 2017-01-06 english app-true
0 1188 Cera 2016-09-30 math approved
# here you get the from and to dates for each name
df4 = df3.groupby('name').agg({'date': [max, min]})
df4.columns = ['to', 'from']
df4 = df4.reset_index()
df4:
name to from
0 Cera 2016-09-30 2016-01-08
1 ron 2017-01-06 2017-01-06
# match "name" and "to" in df4 with "name" and "date" in df3, you got the earliest subject and latest
df_sub_from = pd.merge(df4,df3,how='left',left_on=['name','to'],right_on=['name','date'])
df_sub_from
df_sub_to = pd.merge(df4,df3,how='left',left_on=['name','to'],right_on=['name','date'])
df_sub_from = pd.merge(df4,df3,how='left',left_on=['name','from'],right_on=['name','date'])
# remove unneeded columns
df_sub_from = df_sub_from[['id', 'name', 'from', 'to', 'subject_name']]
df_sub_to = df_sub_to[['id', 'name', 'from', 'to', 'subject_name']]
# merge together and rename nicely
df_final = pd.merge(df_sub_from, df_sub_to, on=['id', 'name', 'from', 'to'])
df_final.columns = ['id', 'name', 'from', 'to', 'subject_from', 'subject_to']
here it is:
id name from to subject_from subject_to
0 1188 Cera 2016-01-08 2016-09-30 math math
1 1199 ron 2017-01-06 2017-01-06 english english

Combine two pandas DataFrames where the date fields are within two months of each other

I need to combine 2 pandas dataframes where df1.date falls within the 2 months before df2.date. I then want to calculate how many traders had traded the same stock during that period and count the total shares purchased.
I have tried the approach in the question linked below, but found it far too complicated. I believe there would be a smarter/simpler solution.
Pandas: how to merge two dataframes on offset dates?
A sample dataset is below:
DF1 (team_1):
date shares symbol trader
31/12/2013 154 FDX Max
30/06/2016 2367 GOOGL Max
21/07/2015 293 ORCL Max
18/07/2015 304 ORCL Sam
DF2 (team_2):
date shares symbol trader
23/08/2015 345 ORCL John
04/07/2014 567 FB John
06/12/2013 221 ACER Sally
30/11/2012 889 HP John
05/06/2010 445 ABBV Kate
Required output:
date shares symbol trader team_2_traders team_2_shares_bought
23/08/2015 345 ORCL John 2 597
04/07/2014 567 FB John 0 0
06/12/2013 221 ACER Sally 0 0
30/11/2012 889 HP John 0 0
05/06/2010 445 ABBV Kate 0 0
This adds 2 new columns...
'team_2_traders' = count of how many traders from team_1 traded the same stock during the previous 2 months from the date listed on DF2.
'team_2_shares_bought' = count of the total shares purchased by team_1 during the previous 2 months from the date listed on DF2.
If anyone is willing to give this a crack, please use the snippet below to setup the dataframes. Please keep in mind the actual dataset contains millions of rows and 6,000 company stocks.
team_1 = {'symbol': ['FDX', 'GOOGL', 'ORCL', 'ORCL'],
          'date': ['31/12/2013', '30/06/2016', '21/07/2015', '18/07/2015'],
          'shares': [154, 2367, 293, 304],
          'trader': ['Max', 'Max', 'Max', 'Sam']}
df1 = pd.DataFrame(team_1)
team_2 = {'symbol': ['ORCL', 'FB', 'ACER', 'HP', 'ABBV'],
          'date': ['23/08/2015', '04/07/2014', '06/12/2013', '30/11/2012', '05/06/2010'],
          'shares': [345, 567, 221, 889, 445],
          'trader': ['John', 'John', 'Sally', 'John', 'Kate']}
df2 = pd.DataFrame(team_2)
Appreciate the help - thank you.
Please check my solution.
from pandas.tseries.offsets import MonthEnd

df_ = df2.merge(df1, on=['symbol'])
df_['date_x'] = pd.to_datetime(df_['date_x'])
df_['date_y'] = pd.to_datetime(df_['date_y'])
df_2m = df_[df_['date_x'] < df_['date_y'] + MonthEnd(2)] \
    .loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
    .groupby('symbol')
df1_ = pd.concat([df_2m['shares_y'].sum(), df_2m['trader_y'].count()], axis=1)
print(df1_)
shares_y trader_y
symbol
ORCL 597 2
print(df2.merge(df1_.reset_index(), on='symbol', how='left').fillna(0))
date shares symbol trader shares_y trader_y
0 23/08/2015 345 ORCL John 597.0 2.0
1 04/07/2014 567 FB John 0.0 0.0
2 06/12/2013 221 ACER Sally 0.0 0.0
3 30/11/2012 889 HP John 0.0 0.0
4 05/06/2010 445 ABBV Kate 0.0 0.0
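A hedged caveat on the filter above: it only bounds one side of the window. If team_1 trades after the df2 date should be excluded as well, a two-sided mask (my assumption about the intended window, using pd.DateOffset) would be:
# two-sided window: keep df1 trades within the two months up to the df2 date
mask = (df_['date_y'] <= df_['date_x']) & \
       (df_['date_y'] >= df_['date_x'] - pd.DateOffset(months=2))
df_2m = df_[mask].loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
                 .groupby('symbol')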
