Transforming dataframe to track the changes - python

I have some students' data and the subjects they have elected.
id    name   date from   date to     Subjectname  note
1188  Cera   01-08-2016  30-09-2016  math         approved
1188  Cera   01-10-2016              elec
1199  ron    01-06-2017              english      app-true
1288  Snow   01-01-2017              tally
1433  sansa  25-01-2016  14-07-2016  tally
1433  sansa  15-07-2016  16-01-2017  tally        relected
1844  amy    01-10-2016  10-11-2017  adv
1522  stark  01-01-2016              phy
1722  sid    01-06-2017  31-03-2018  history
1722  sid    01-04-2018              history      as per request
1844  amy    01-01-2016  30-09-2016  science
2100  arya   01-08-2016  30-09-2016  english
2100  arya   01-10-2016  31-05-2017  math         taken
2100  arya   01-06-2017              english
I am looking for output like:
id    name   from        to          subject from  subject to
1188  Cera   01-08-2016  01-10-2016  math          elec
1199  ron    01-06-2017              english
1288  Snow   01-01-2017              tally
1433  sansa  25-01-2016  16-01-2017  tally         tally
1522  stark  01-01-2016              phy
1722  sid    01-06-2017  01-04-2018  history       history
1844  amy    01-01-2016  10-11-2017  science       adv
2100  arya   01-08-2016  31-05-2017  english       math
2100  arya   01-06-2017              math          english
Column 'from' has the minimum date value corresponding to the name.
Column 'to' has the maximum date value corresponding to the name.
Column 'subject from' has the 'Subjectname' value corresponding to the columns 'from' and 'name'.
Column 'subject to' has the 'Subjectname' value corresponding to the columns 'to' and 'name'.
I need to track the transactions made by each student and the subjects they changed ('subject from' and 'subject to').
Please let me know how to achieve this,
or let me know if there is an easier way to get an output which contains the transaction details per student and the subjects they changed.

Use DataFrameGroupBy.agg after a set_index on column Subjectname, so it is possible to use idxmin and
idxmax to get the subject for the minimal and maximal datetimes per group:
import pandas as pd

# convert both date columns to datetimes
df['date from'] = pd.to_datetime(df['date from'])
df['date to'] = pd.to_datetime(df['date to'])

# min/max date per group, plus idxmin/idxmax to recover the Subjectname
# (set as the index) belonging to those dates
d = {'date from':['min', 'idxmin'], 'date to':['max', 'idxmax']}
df1 = df.set_index('Subjectname').groupby(['id','name']).agg(d)

# flatten the MultiIndex columns and rename to the requested names
df1.columns = df1.columns.map('_'.join)
d1 = {'date from_min':'from','date to_max':'to',
      'date from_idxmin':'subject from','date to_idxmax':'subject to'}
cols = ['from','to','subject from','subject to']
df1 = df1.rename(columns=d1).reindex(columns=cols).reset_index()
print(df1)
id name from to subject from subject to
0 1188 Cera 2016-01-08 2016-09-30 math math
1 1199 ron 2017-01-06 NaT english NaN
2 1288 Snow 2017-01-01 NaT tally NaN
3 1433 sansa 2016-01-25 2017-01-16 tally tally
4 1522 stark 2016-01-01 NaT phy NaN
5 1722 sid 2017-01-06 2018-03-31 history history
6 1844 amy 2016-01-01 2017-10-11 science adv
7 2100 arya 2016-01-08 2017-05-31 english math
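Note that the sample dates look day-first (dd-mm-yyyy), while the default parser above read e.g. 01-08-2016 as January 8. If your data really is day-first (this is an assumption about your data, the rest of the answer stays the same), pass dayfirst=True when parsing and the min/max and subject columns above will change accordingly:
# force day-first parsing if the dates are written dd-mm-yyyy
df['date from'] = pd.to_datetime(df['date from'], dayfirst=True)
df['date to'] = pd.to_datetime(df['date to'], dayfirst=True)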

My df uses only your first 3 rows; it should be enough to demo how to do this.
df:
id name date_from date_to subject_name note
0 1188 Cera 2016-01-08 30-09-2016 math approved
1 1188 Cera 2016-01-10 elec
2 1199 ron 2017-01-06 english app-true
I'll just paste the code here.
# put date_from and date_to into one 'date' column so we can take the max and min
df1 = df[['id', 'name', 'date_from', 'subject_name', 'note']]
df2 = df[['id', 'name', 'date_to', 'subject_name', 'note']]
df1.columns = ['id', 'name', 'date', 'subject_name', 'note']
df2.columns = ['id', 'name', 'date', 'subject_name', 'note']
df3 = pd.concat([df1, df2])
df3['date'] = pd.to_datetime(df3['date'])
# keep only the rows that actually have a date
df3 = df3.dropna(subset=['date'])
df3:
id name date subject_name note
0 1188 Cera 2016-01-08 math approved
1 1188 Cera 2016-01-10 elec
2 1199 ron 2017-01-06 english app-true
0 1188 Cera 2016-09-30 math approved
# here you get the from and to dates for each name
df4 = df3.groupby('name').agg({'date': ['max', 'min']})
df4.columns = ['to', 'from']
df4 = df4.reset_index()
df4:
name to from
0 Cera 2016-09-30 2016-01-08
1 ron 2017-01-06 2017-01-06
# match "name" and "to" in df4 with "name" and "date" in df3, you got the earliest subject and latest
df_sub_from = pd.merge(df4,df3,how='left',left_on=['name','to'],right_on=['name','date'])
df_sub_from
df_sub_to = pd.merge(df4,df3,how='left',left_on=['name','to'],right_on=['name','date'])
df_sub_from = pd.merge(df4,df3,how='left',left_on=['name','from'],right_on=['name','date'])
# remove unneeded columns
df_sub_from = df_sub_from[['id','name','from','to','subject_name']]
df_sub_to = df_sub_to[['id','name','from','to','subject_name']]
# merge together and rename nicely
df_final = pd.merge(df_sub_from, df_sub_to, on=['id', 'name', 'from', 'to'])
df_final.columns = ['id', 'name', 'from', 'to', 'subject_from', 'subject_to']
here it is:
id name from to subject_from subject_to
0 1188 Cera 2016-01-08 2016-09-30 math math
1 1199 ron 2017-01-06 2017-01-06 english english


Make a wide dataframe long and add columns according to another column's name

I need to use some of the column names as part of the df. While keeping the first 3 columns identical, I need to create some other columns based on the content of each row.
Here I have some transactions from some customers:
cust_id cust_first cust_last au_zo au_zo_pay fi_gu fi_gu_pay wa wa_pay
0 1000 Andrew Jones 50.85 debit NaN NaN 69.12 debit
1 1001 Fatima Lee NaN NaN 18.16 debit NaN NaN
2 1002 Sophia Lewis NaN NaN NaN NaN 159.54. credit
3 1003 Edward Bush 45.29 credit 59.63 credit NaN NaN
4 1004 Mark Nunez 20.87 credit 20.87 credit 86.18 debit
First, I need to add a new column, 'city'. Since it is not in the database, it defaults to 'New York'. (That's easy!)
But here is where I am getting stuck:
Add a new column 'store' that holds values according to where a transaction took place: au_zo --> autozone, fi_gu --> five guys, wa --> walmart.
Add a new column 'classification' according to the store previously added: auto zone --> auto-repair, five guys --> food, walmart --> groceries.
Column 'amount' holds the value for that customer and store.
Column 'transaction_type' is the value of au_zo_pay, fi_gu_pay, wa_pay respectively.
So at the end it looks like this:
cust_id city cust_first cust_last store classification amount trans_type
0 1000 New York  Andrew Jones auto zone auto-repair 50.85 debit
1 1000 New York Andrew Jones walmart groceries 69.12 debit
2 1001 New York Fatima Lee five guys food 18.16 debit
3 1002 New York Sophia Solis walmart groceries 159.54 credit
4 1003 New York Edward Bush auto zone auto-repair 45.29 credit
5 1003 New York Edward Bush five guys food 59.63 credit
6 1004 New York Mark Nunez auto zone auto-repair 20.87 credit
7 1004 New York Mark Nunez five guys food 20.87 credit
8 1004 New York Mark Nunez walmart groceries 86.18 debit
I have tried using df.melt() but I don't get the expected results.
Is this something you want?
import pandas as pd
mp = {
'au_zo': 'auto-repair',
'wa':'groceries',
'fi_gu':'food'
}
### Read txt Data: get pandas df
# I copied and pasted your sample data to a txt file, you can ignore this part
with open(r"C:\Users\orf-haoj\Desktop\test.txt", 'r') as file:
head, *df = [row.split() for row in file.readlines()]
df = [row[1:] for row in df]
df = pd.DataFrame(df, columns=head)
### Here we conduct 2 melts to form melt_1 & melt_2 data
# this melt table is to melt cols 'au_zo','fi_gu', and 'wa'. & get amount as value
melt_1 = df.melt(id_vars=['cust_id', 'cust_first', 'cust_last'], value_vars=['au_zo','fi_gu','wa'], var_name='store', value_name='amount')
# this melt table is to melt cols ['au_zo_pay','fi_gu_pay','wa_pay']. & get trans_type cols
melt_2 = df.melt(id_vars=['cust_id', 'cust_first', 'cust_last'], value_vars=['au_zo_pay', 'fi_gu_pay', 'wa_pay'], var_name='store pay', value_name='trans_type')
# since I want to join these tables later, it is good to have one more key: store
melt_2['store'] = melt_2['store pay'].apply(lambda x: '_'.join(x.split("_")[:-1]))
### Remove NaN
# the sample data was read from text, so 'NaN' is a string here; switch to
# melt_1 = melt_1.loc[~melt_1['amount'].isnull()] (and likewise for melt_2) if you have actual NaN values
melt_1 = melt_1.loc[melt_1['amount'] != 'NaN']
melt_2 = melt_2.loc[melt_2['trans_type'] != 'NaN']
### Inner join data based on 4 keys (assuming your data will have one to one relationship based on these 4 keys)
full_df = melt_1.merge(melt_2, on=['cust_id', 'cust_first', 'cust_last', 'store'], how='inner')
full_df['city'] = 'New York'
full_df['classification'] = full_df['store'].apply(lambda x: mp[x])
In addition, this method has its limitations. For example, when the one-to-one relationship does not hold for those four keys, it will generate a wrong dataset.
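If you want the merge to fail loudly in that case instead of silently producing duplicates, pandas' merge can validate the relationship for you; a minimal sketch, assuming the same melt_1/melt_2 as above:
# raises pandas.errors.MergeError if the four keys are not one-to-one on both sides
full_df = melt_1.merge(melt_2,
                       on=['cust_id', 'cust_first', 'cust_last', 'store'],
                       how='inner',
                       validate='one_to_one')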
Try this
# assign city column and set index by customer demographic columns
df1 = df.assign(city='New York').set_index(['cust_id', 'city', 'cust_first', 'cust_last'])
# fix column names by completing the abbrs
df1.columns = df1.columns.to_series().replace({'au_zo': 'autozone', 'fi_gu': 'five guys', 'wa': 'walmart'}, regex=True)
# split column names for a multiindex column
df1.columns = pd.MultiIndex.from_tuples([c.split('_') if c.endswith('pay') else [c, 'amount'] for c in df1.columns], names=['store',''])
# stack df1 to make the wide df to a long df
df1 = df1.stack(0).reset_index()
# insert classification column
df1.insert(5, 'classification', df1.store.map({'autozone': 'auto-repair', 'five guys': 'food', 'walmart': 'groceries'}))
df1
One other way is as follows:
df1 is exactly df with renamed columns, i.e. the store columns get an 'amount:' prefix and the *_pay columns get a 'pay:' prefix:
import re

df1 = (df
       .rename(lambda x: re.sub('(.*)_pay', 'pay:\\1', x), axis=1)
       .rename(lambda x: re.sub('^(((?!cust|pay).)*)$', 'amount:\\1', x), axis=1))
Now pivot to longer using pd.wide_to_long and do the replacement.
df2 = (pd.wide_to_long(df1, stubnames=['amount', 'pay'],
                       i=df1.columns[:3], j='store', sep=':', suffix='\\w+')
       .reset_index().dropna())
store = {'au_zo':'auto zone', 'fi_gu':'five guys', 'wa':'walmart'}
classification = {'au_zo':'auto-repair', 'fi_gu':'food', 'wa':'groceries'}
df2['classification'] = df2['store'].replace(classification)
df2['store'] = df2['store'].replace(store)
cust_id cust_first cust_last store amount pay classification
0 1000 Andrew Jones auto zone 50.85 debit auto-repair
2 1000 Andrew Jones walmart 69.12 debit groceries
4 1001 Fatima Lee five guys 18.16 debit food
8 1002 Sophia Lewis walmart 159.54. credit groceries
9 1003 Edward Bush auto zone 45.29 credit auto-repair
10 1003 Edward Bush five guys 59.63 credit food
12 1004 Mark Nunez auto zone 20.87 credit auto-repair
13 1004 Mark Nunez five guys 20.87 credit food
14 1004 Mark Nunez walmart 86.18 debit groceries
NB: You could consider using pivot_longer from janitor.
One option for transforming to long form is pivot_longer from pyjanitor; it has a lot of options. For this particular use case, we pull out multiple values and multiple names (paired with the appropriate regex), before using other Pandas functions to rename and add new columns:
# pip install pyjanitor
import pandas as pd
import janitor
mapper = {'au_zo':'autozone',
'fi_gu':'five guys',
'wa':'walmart'}
store_mapper = {'autozone':'repair',
'five guys':'food',
'walmart':'groceries'}
(df
.assign(city = 'New York')
.pivot_longer(
index = 'c*',
names_to = ['ignore', 'store'],
values_to = ['trans_type', 'amount'],
names_pattern = ['.+pay$', '.+'],
sort_by_appearance=True)
.dropna()
.drop(columns='ignore')
.replace(mapper)
.assign(classification = lambda df: df.store.map(store_mapper))
)
cust_id cust_first cust_last city trans_type store amount classification
0 1000 Andrew Jones New York debit autozone 50.85 repair
2 1000 Andrew Jones New York debit walmart 69.12 groceries
4 1001 Fatima Lee New York debit five guys 18.16 food
8 1002 Sophia Lewis New York credit walmart 159.54. groceries
9 1003 Edward Bush New York credit autozone 45.29 repair
10 1003 Edward Bush New York credit five guys 59.63 food
12 1004 Mark Nunez New York credit autozone 20.87 repair
13 1004 Mark Nunez New York credit five guys 20.87 food
14 1004 Mark Nunez New York debit walmart 86.18 groceries

How do I Group By Date and Measure fields to calculate rank?

I have a data set with Student Names, the date of transaction and the amount.
Each student has made multiple transactions.
I want to calculate current month rank and previous month rank based on total amount for each student.
I am able to do a group by Student Name to calculate the total amount for each student using:
transactions['Totals'] = transactions.groupby('Student Name')['Sale Amount'].transform('sum')
How do I extend this to make two different columns that calculate previous month totals and current month totals for each student, so I can assign previous month and current month ranks to them?
The date is in the following format:
09/05/2015 04:18 PM
07/15/2019 09:50 AM
05/18/2018 02:34 PM
08/11/2018 06:29 PM
06/14/2018 07:42 AM
EDIT : Adding dataframe for reference:
Out[15]:
Date of Transaction Student Name Sale Amount
0 09/05/2015 04:18 PM Dan Kelly 4333
1 07/15/2019 09:50 AM Peter Dyer 8805
2 05/18/2018 02:34 PM Natalie Robertson 5640
3 08/11/2018 06:29 PM Sean Miller 6485
4 06/14/2018 07:42 AM Thomas Forsyth 6815
... ... ...
9977 03/15/2018 09:28 PM Grace Vance 6379
9978 08/07/2019 11:14 PM Alexandra Cameron 6688
9979 01/09/2015 10:53 AM Sebastian Vaughan 2262
9980 05/19/2019 10:00 PM Caroline Blake 6977
9981 01/11/2016 04:05 AM Austin Edmunds 3205
[9982 rows x 3 columns]
EDIT : Adding sample expected output:
I've created a dataframe with the minimal data you provided: 'Student Name', 'Sale Amount', 'Date'.
My dataframe:
df = pd.DataFrame([['12/05/2019 04:18 PM','Marisa',500],
['11/29/2019 04:18 PM','Marisa',500],
['11/20/2019 04:18 PM','Marisa',800],
['12/04/2019 04:18 PM','Peter',300],
['11/30/2019 04:18 PM','Peter',300],
['12/05/2019 04:18 PM','Debra',400],
['11/28/2019 04:18 PM','Debra',200],
['11/15/2019 04:18 PM','Debra',600],
['10/23/2019 04:18 PM','Debra',200]],columns=['Date','Student Name','Sale Amount']
)
Be sure Date is a datetime column.
df.Date = pd.to_datetime(df.Date)
This gives you the total amount per month per student in the original dataframe:
df['Total'] = df.groupby(['Student Name',pd.Grouper(key='Date', freq='1M')])['Sale Amount'].transform('sum')
Date Student Name Sale Amount Total
0 2019-12-05 16:18:00 Marisa 500 500
1 2019-11-29 16:18:00 Marisa 500 1300
2 2019-11-20 16:18:00 Marisa 800 1300
3 2019-12-04 16:18:00 Peter 300 300
4 2019-11-30 16:18:00 Peter 300 300
5 2019-12-05 16:18:00 Debra 400 400
6 2019-11-28 16:18:00 Debra 200 800
7 2019-11-15 16:18:00 Debra 600 800
8 2019-10-23 16:18:00 Debra 200 200
How to print only the selected results?
df is dnew now:
dnew = df  # note: this just creates a second name for the same object; use df.copy() for an independent copy
Let's strip datetime to keep months only:
#Strip date to month
dnew['Date'] = dnew['Date'].apply(lambda x:x.date().strftime('%m'))
Drop Sale Amount entries and group by Student Name and Date (new dataframe is "sales"):
#Drop Sale Amount
sales = dnew.drop(['Sale Amount'], axis=1).groupby(['Student Name','Date'])['Total'].max()
print(sales)
Student Name Date
Debra 10 200
11 800
12 400
Marisa 11 1300
12 500
Peter 11 300
12 300
Actually, "sales" is pandas.core.series.Series and it's important to know that
print(sales.index)
MultiIndex([( 'Debra', '10'),
( 'Debra', '11'),
( 'Debra', '12'),
('Marisa', '11'),
('Marisa', '12'),
( 'Peter', '11'),
( 'Peter', '12')],
names=['Student Name', 'Date'])
from datetime import datetime
curMonth = int(datetime.today().strftime('%m'))  # transform to integer so we can compute (curMonth-1)
#12
# months of interest
# note: the string comparison below only lines up for months 10-12 ('9' != '09'),
# and curMonth-1 gives 0 in January; pad/wrap the months if you need the general case
moi = sales.iloc[(sales.index.get_level_values('Date') == str(curMonth-1)) | (sales.index.get_level_values('Date') == str(curMonth))]
print(moi)
Student Name Date
Debra 11 800
12 400
Marisa 11 1300
12 500
Peter 11 300
12 300
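To go from these monthly totals to the ranks the question actually asks for, here is a minimal sketch. It assumes you start again from the original dataframe with 'Date' parsed via pd.to_datetime (i.e. before the month-stripping step above), and that "current"/"previous" are relative to today; the prev_month_total / cur_month_total column names are just illustrative:
import pandas as pd

cur = pd.Timestamp.today().to_period('M')   # current month as a Period
prev = cur - 1                              # previous month

# total Sale Amount per student per month, one column per month
monthly = (df.assign(month=df['Date'].dt.to_period('M'))
             .groupby(['Student Name', 'month'])['Sale Amount']
             .sum()
             .unstack('month')
             .reindex(columns=[prev, cur]))   # make sure both months exist

ranks = monthly.rename(columns={prev: 'prev_month_total', cur: 'cur_month_total'})
ranks['prev_month_rank'] = ranks['prev_month_total'].rank(ascending=False)
ranks['cur_month_rank'] = ranks['cur_month_total'].rank(ascending=False)
print(ranks)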

Group by Year and Month Panda Pivot Table

I have data like this
Date LoanOfficer User_Name Loan_Number
0 2017-11-30 00:00:00 Mark Evans underwriterx 1100000293
1 2017-11-30 00:00:00 Kimberly White underwritery 1100004947
2 2017-11-30 00:00:00 DClair Phillips underwriterz 1100007224
I've created a pivot table from df like this:
pd.pivot_table(df, index=["User_Name", "LoanOfficer"],
               values=["Loan_Number"],
               aggfunc='count', fill_value=0,
               columns=["Date"])
However, I need the Date column to be grouped by year and month. I was looking at other solutions that resample the dataframe and then apply the pivot, but that only works for months and days. Any help would be appreciated.
You can convert your Date column to %Y-%m, then do the pivot_table:
df.Date=pd.to_datetime(df.Date)
df.Date=df.Date.dt.strftime('%Y-%m')
df
Out[143]:
Date LoanOfficer User_Name Loan_Number
0 2017-11 Mark Evans underwriterx 1100000293
1 2017-11 Kimberly White underwritery 1100004947
2 2017-11 DClair Phillips underwriterz 1100007224
pd.pivot_table(df, index=["User_Name", "LoanOfficer"],
               values=["Loan_Number"],
               aggfunc='count', fill_value=0,
               columns=["Date"])
Out[144]:
Loan_Number
Date 2017-11
User_Name LoanOfficer
underwriterx Mark Evans 1
underwritery Kimberly White 1
underwriterz DClair Phillips 1
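Alternatively, if you want to keep Date as a real datetime instead of a %Y-%m string, pd.Grouper can bucket the dates by month directly inside pivot_table. A sketch (assuming Date has already been converted with pd.to_datetime):
pd.pivot_table(df, index=["User_Name", "LoanOfficer"],
               values=["Loan_Number"],
               aggfunc='count', fill_value=0,
               columns=pd.Grouper(key='Date', freq='M'))  # one column per month-end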

Cannot assign a value to certain columns in Pandas

Hi, I am trying to assign certain values to columns of a dataframe.
# Count the number of title counts
full.groupby(['Sex', 'Title']).Title.count()
Sex Title
female Dona 1
Dr 1
Lady 1
Miss 260
Mlle 2
Mme 1
Mrs 197
Ms 2
the Countess 1
male Capt 1
Col 4
Don 1
Dr 7
Jonkheer 1
Major 2
Master 61
Mr 757
Rev 8
Sir 1
Name: Title, dtype: int64
The tail of my dataframe looks as follows:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket Title
413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236 Mr
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758 Dona
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262 Mr
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309 Mr
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668 Master
My dataframe is named full and I want to change the values in Title.
Here is the code I wrote:
# Create a variable rare_title to modify the names of Title
rare_title = ['Dona', "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer"]
# Also reassign mlle, ms, and mme accordingly
full[full.Title == "Mlle"].Title = "Miss"
full[full.Title == "Ms"].Title = "Miss"
full[full.Title == "Mme"].Title = "Mrs"
full[full.Title.isin(rare_title)].Title = "Rare Title"
I also tried the following code in pandas:
full.loc[full['Title'] == "Mlle", ['Sex', 'Title']] = "Miss"
Still the dataframe is not changed. Any help is appreciated.
Use loc-based indexing and set the matching row values:
miss = ['Mlle', 'Ms', 'Mme']
rare_title = ['Dona', "Lady", ...]
df.loc[df.Title.isin(miss), 'Title'] = 'Miss'
df.loc[df.Title.isin(rare_title), 'Title'] = 'Rare Title'
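The reason the original attempts do nothing is that full[full.Title == "Mlle"].Title = ... is chained indexing: it assigns to a temporary copy, not to full (pandas typically emits a SettingWithCopyWarning here). As an alternative to the loc approach, a small sketch using Series.replace with a mapping built from the same lists (the title_map name is just illustrative):
# build one mapping: rare titles collapse to 'Rare Title', the rest are renamed directly
title_map = {t: 'Rare Title' for t in rare_title}
title_map.update({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

full['Title'] = full['Title'].replace(title_map)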

Combine two pandas DataFrames where the date fields are within two months of each other

I need to combine 2 pandas dataframes where df1.date falls within the 2 months before df2.date. I then want to calculate how many traders traded the same stock during that period and count the total shares purchased.
I have tried using the approach listed below, but found it far too complicated. I believe there is a smarter/simpler solution.
Pandas: how to merge two dataframes on offset dates?
A sample dataset is below:
DF1 (team_1):
date shares symbol trader
31/12/2013 154 FDX Max
30/06/2016 2367 GOOGL Max
21/07/2015 293 ORCL Max
18/07/2015 304 ORCL Sam
DF2 (team_2):
date shares symbol trader
23/08/2015 345 ORCL John
04/07/2014 567 FB John
06/12/2013 221 ACER Sally
31/11/2012 889 HP John
05/06/2010 445 ABBV Kate
Required output:
date shares symbol trader team_2_traders team_2_shares_bought
23/08/2015 345 ORCL John 2 597
04/07/2014 567 FB John 0 0
06/12/2013 221 ACER Sally 0 0
31/11/2012 889 HP John 0 0
05/06/2010 445 ABBV Kate 0 0
This adds 2 new columns...
'team_2_traders' = count of how many traders from team_1 traded the same stock during the previous 2 months from the date listed on DF2.
'team_2_shares_bought' = count of the total shares purchased by team_1 during the previous 2 months from the date listed on DF2.
If anyone is willing to give this a crack, please use the snippet below to set up the dataframes. Please keep in mind the actual dataset contains millions of rows and 6,000 company stocks.
team_1 = {'symbol':['FDX','GOOGL','ORCL','ORCL'],
'date':['31/12/2013','30/06/2016','21/07/2015','18/07/2015'],
'shares':[154,2367,293,304],
'trader':['Max','Max','Max','Sam']}
df1 = pd.DataFrame(team_1)
team_2 = {'symbol':['ORCL','FB','ACER','HP','ABBV'],
'date':['23/08/2015','04/07/2014','06/12/2013','31/11/2012','05/06/2010'],
'shares':[345,567,221,889,445],
'trader':['John','John','Sally','John','Kate']}
df2 = pd.DataFrame(team_2)
Appreciate the help - thank you.
Please check my solution.
from pandas.tseries.offsets import MonthEnd
df_ = df2.merge(df1, on=['symbol'])
df_['date_x'] = pd.to_datetime(df_['date_x'])
df_['date_y'] = pd.to_datetime(df_['date_y'])
df_2m = df_[df_['date_x'] < df_['date_y'] + MonthEnd(2)] \
    .loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
    .groupby('symbol')
df1_ = pd.concat([df_2m['shares_y'].sum(), df_2m['trader_y'].count()], axis=1)
print(df1_)
shares_y trader_y
symbol
ORCL 597 2
print(df2.merge(df1_.reset_index(), on='symbol', how='left').fillna(0))
date shares symbol trader shares_y trader_y
0 23/08/2015 345 ORCL John 597.0 2.0
1 04/07/2014 567 FB John 0.0 0.0
2 06/12/2013 221 ACER Sally 0.0 0.0
3 30/11/2012 889 HP John 0.0 0.0
4 05/06/2010 445 ABBV Kate 0.0 0.0
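Note that the filter above only bounds one side of the window, so team_1 trades dated after the team_2 trade would be counted too, and the dd/mm/yyyy strings are parsed with the default (month-first) settings. A sketch of a two-sided 2-month window, assuming day-first dates and the same df1/df2 as in the question (the _1/_2 suffixes and output column names are just illustrative):
import pandas as pd
from pandas.tseries.offsets import DateOffset

df_ = df2.merge(df1, on='symbol', suffixes=('_2', '_1'))
df_['date_2'] = pd.to_datetime(df_['date_2'], dayfirst=True, errors='coerce')
df_['date_1'] = pd.to_datetime(df_['date_1'], dayfirst=True, errors='coerce')

# keep team_1 trades that fall in the 2 months up to the team_2 trade date
window = (df_['date_1'] <= df_['date_2']) & \
         (df_['date_1'] >= df_['date_2'] - DateOffset(months=2))

stats = (df_[window]
         .groupby(['symbol', 'date_2'])
         .agg(team_2_traders=('trader_1', 'nunique'),
              team_2_shares_bought=('shares_1', 'sum')))
print(stats)
The result can then be merged back onto df2 on symbol (and on the date as well, if the same symbol appears on several df2 dates), just like in the answer above.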
