How do I Group By Date and Measure fields to calculate rank? - python

I have a data set with Student Names, the date of transaction and the amount.
Each student has made multiple transactions.
I want to calculate current month rank and previous month rank based on total amount for each student.
I am able to do a group by Student Name to calculate the total amount for each student using:
transactions['Totals'] = transactions.groupby('Student Name')['Sale Amount'].transform('sum')
How do I extend this to make two different columns that calculate previous month totals and current month totals for each student, so I can assign previous month and current month ranks to them?
The date is in the following format:
09/05/2015 04:18 PM
07/15/2019 09:50 AM
05/18/2018 02:34 PM
08/11/2018 06:29 PM
06/14/2018 07:42 AM
EDIT : Adding dataframe for reference:
Out[15]:
Date of Transaction Student Name Sale Amount
0 09/05/2015 04:18 PM Dan Kelly 4333
1 07/15/2019 09:50 AM Peter Dyer 8805
2 05/18/2018 02:34 PM Natalie Robertson 5640
3 08/11/2018 06:29 PM Sean Miller 6485
4 06/14/2018 07:42 AM Thomas Forsyth 6815
... ... ...
9977 03/15/2018 09:28 PM Grace Vance 6379
9978 08/07/2019 11:14 PM Alexandra Cameron 6688
9979 01/09/2015 10:53 AM Sebastian Vaughan 2262
9980 05/19/2019 10:00 PM Caroline Blake 6977
9981 01/11/2016 04:05 AM Austin Edmunds 3205
[9982 rows x 3 columns]
EDIT : Adding sample expected output:

I've created a dataframe with the minimal data you provided: 'Student Name', 'Sale Amount', 'Date'.
My dataframe:
import pandas as pd

df = pd.DataFrame([['12/05/2019 04:18 PM','Marisa',500],
                   ['11/29/2019 04:18 PM','Marisa',500],
                   ['11/20/2019 04:18 PM','Marisa',800],
                   ['12/04/2019 04:18 PM','Peter',300],
                   ['11/30/2019 04:18 PM','Peter',300],
                   ['12/05/2019 04:18 PM','Debra',400],
                   ['11/28/2019 04:18 PM','Debra',200],
                   ['11/15/2019 04:18 PM','Debra',600],
                   ['10/23/2019 04:18 PM','Debra',200]],
                  columns=['Date','Student Name','Sale Amount'])
Be sure the Date column is a datetime:
df.Date = pd.to_datetime(df.Date)
This gives you the total amount per month per student in the original dataframe:
df['Total'] = df.groupby(['Student Name',pd.Grouper(key='Date', freq='1M')])['Sale Amount'].transform('sum')
Date Student Name Sale Amount Total
0 2019-12-05 16:18:00 Marisa 500 500
1 2019-11-29 16:18:00 Marisa 500 1300
2 2019-11-20 16:18:00 Marisa 800 1300
3 2019-12-04 16:18:00 Peter 300 300
4 2019-11-30 16:18:00 Peter 300 300
5 2019-12-05 16:18:00 Debra 400 400
6 2019-11-28 16:18:00 Debra 200 800
7 2019-11-15 16:18:00 Debra 600 800
8 2019-10-23 16:18:00 Debra 200 200
How to print only the selected results?
Work on a copy of df called dnew:
dnew = df.copy()
Let's strip datetime to keep months only:
#Strip date to month
dnew['Date'] = dnew['Date'].apply(lambda x:x.date().strftime('%m'))
Drop the Sale Amount column and group by Student Name and Date (the result is "sales"):
#Drop Sale Amount
sales = dnew.drop(['Sale Amount'], axis=1).groupby(['Student Name','Date'])['Total'].max()
print(sales)
Student Name Date
Debra 10 200
11 800
12 400
Marisa 11 1300
12 500
Peter 11 300
12 300
Actually, "sales" is pandas.core.series.Series and it's important to know that
print(sales.index)
MultiIndex([( 'Debra', '10'),
( 'Debra', '11'),
( 'Debra', '12'),
('Marisa', '11'),
('Marisa', '12'),
( 'Peter', '11'),
( 'Peter', '12')],
names=['Student Name', 'Date'])
from datetime import datetime
curMonth = int(datetime.today().strftime('%m')) #transform to integer to perform (curMonth-1)
#12
#months of interest, zero-padded to match the '%m' strings in the index
#(note: this simple curMonth-1 does not wrap around for January)
moi = sales.iloc[(sales.index.get_level_values('Date') == str(curMonth-1).zfill(2)) | (sales.index.get_level_values('Date') == str(curMonth).zfill(2))]
print(moi)
Student Name Date
Debra 11 800
12 400
Marisa 11 1300
12 500
Peter 11 300
12 300
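From here, one possible way to turn these monthly totals into ranks (a sketch, not part of the original answer; it assumes moi holds exactly the previous and the current month):
# Pivot the months into columns and rank each month's totals (1 = highest total).
ranks = moi.unstack('Date').rank(ascending=False)
ranks.columns = ['prev_month_rank', 'cur_month_rank']  # assumes the unstacked columns come out as [previous, current]
print(ranks)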

Related

Pivot table in Python using Pandas

I have a data frame which has data in the following format:
I have to pivot up the Status column and pivot down the state columns to make the table look like:
I am trying to do it using pd.pivot_table but I am unable to get the desired results.
Here is what I am trying:
table = pd.pivot_table(data = covid19_df_latest, index = ['Date', 'Delhi', 'Maharashtra', 'Haryana'], values = ['Status'], aggfunc = np.max)
print(table)
I am getting the error "No numeric types to aggregate". Please suggest a fix.
Use DataFrame.melt with DataFrame.pivot_table:
df = (covid19_df_latest.melt(['Date','Status'], var_name='State')
                       .pivot_table(index=['Date','State'],
                                    columns='Status',
                                    values='value',
                                    aggfunc='max')
                       .reset_index()
                       .rename_axis(None, axis=1))
print(df)
Date State Deceased Identified Recovered
0 14/05/20 Delhi 1200 10000 2000
1 14/05/20 Haryana 1000 20000 800
2 14/05/20 Maharashtra 1000 15000 3700
Details: the solution first unpivots the DataFrame with melt:
print (covid19_df_latest.melt(['Date','Status'], var_name='State'))
Date Status State value
0 14/05/20 Identified Delhi 10000
1 14/05/20 Recovered Delhi 2000
2 14/05/20 Deceased Delhi 1200
3 14/05/20 Identified Maharashtra 15000
4 14/05/20 Recovered Maharashtra 3700
5 14/05/20 Deceased Maharashtra 1000
6 14/05/20 Identified Haryana 20000
7 14/05/20 Recovered Haryana 800
8 14/05/20 Deceased Haryana 1000
and then pivots with the max aggregate function.
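For reference, a small input frame reconstructed from the melt output above, so the snippet can be run end to end (this reconstruction is not part of the original answer):
import pandas as pd

covid19_df_latest = pd.DataFrame({
    'Date': ['14/05/20', '14/05/20', '14/05/20'],
    'Status': ['Identified', 'Recovered', 'Deceased'],
    'Delhi': [10000, 2000, 1200],
    'Maharashtra': [15000, 3700, 1000],
    'Haryana': [20000, 800, 1000],
})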

How to "groupby" year, column1 and calculate the average on column 2?

I have this DataFrame:
year vehicule number_of_passengers
2017-01-09 bus 100
2017-11-02 car 150
2018-08-01 car 180
2016-08-09 bus 100
...
I would like to have something like this (the average number of passengers per year and per vehicule):
year vehicule avg_number_of_passengers
2018 car 123.5
2018 bus 213.7
2017 ... ...
...
I've tried some groupby() calls but can't find the right command. Can you help me?
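A minimal sketch of one possible approach (not from the original thread), assuming the year column holds parseable dates:
import pandas as pd

df = pd.DataFrame({
    'year': pd.to_datetime(['2017-01-09', '2017-11-02', '2018-08-01', '2016-08-09']),
    'vehicule': ['bus', 'car', 'car', 'bus'],
    'number_of_passengers': [100, 150, 180, 100],
})

# Group by calendar year and vehicle, then average the passenger counts.
avg = (df.groupby([df['year'].dt.year, 'vehicule'])['number_of_passengers']
         .mean()
         .reset_index(name='avg_number_of_passengers'))
print(avg)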

In a pandas dataframe, count the number of times a condition occurs in one column?

Background
I have five years of NO2 measurement data, in CSV files, one file for every location and year. I have loaded all the files into pandas dataframes in the same format:
Date Hour Location NO2_Level
0 01/01/2016 00 Street 18
1 01/01/2016 01 Street 39
2 01/01/2016 02 Street 129
3 01/01/2016 03 Street 76
4 01/01/2016 04 Street 40
Goal
For each dataframe count the number of times NO2_Level is greater than 150 and output this.
So I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.
Problem
Whatever I've tried produces results I know on inspection are incorrect, e.g.:
- the count value for every location in a given year is the same (possible but unlikely)
- for a year where I know the count should be positive, every location returns 0
What I've tried
I have tried a lot of approaches to getting this value for each dataframe, such as making the column a series:
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
Using DataFrame.count():
count = df[df['NO2_Level'] >= 150].count()
These two approaches have gotten closest to what I want to output.
Example to test on
data = {'Date': ['01/01/2016','01/02/2016','01/03/2016','01/04/2016','01/05/2016'], 'Hour': ['00', '01', '02', '03', '04'], 'Location': ['Street','Street','Street','Street','Street'], 'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
count
Expected Outputs
So from this I'm trying to get it to output a single line for each dataframe that was made in the format Location, year, count (of condition):
Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158
So the above example would produce
Street, 2016, 1
Actual
Every location produces the same result in a given year, and for some years (2014) the count is zero even though on inspection it should not be:
Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
Hopefully this helps.
import pandas as pd
ddict = {
    'Date': ['2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-02'],
    'Hour': ['00','01','02','03','04','02'],
    'Location': ['Street','Street','Street','Street','Street','Street'],
    'N02_Level': [19,39,129,76,40,151],
}
df = pd.DataFrame(ddict)
# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Make a Year column
df['Year'] = df['Date'].apply(lambda x: x.strftime('%Y'))
# Group by Location and Year, count rows where N02_Level > 150
df1 = df[df['N02_Level'] > 150].groupby(['Location','Year']).size().reset_index(name='Count')
# Iterate the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')
### To not use f-strings
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))
Sample data:
Date Hour Location N02_Level
0 2016-01-01 00 Street 19
1 2016-01-01 01 Street 39
2 2016-01-01 02 Street 129
3 2016-01-01 03 Street 76
4 2016-01-01 04 Street 40
5 2016-01-02 02 Street 151
Output:
Street,2016,1
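To get one such line per CSV file, a hedged sketch of applying the same count inside the loading loop (the directory name, date format, and dayfirst parsing are assumptions based on the question):
import glob
import pandas as pd

for path in glob.glob('no2_data/*.csv'):  # hypothetical directory
    df = pd.read_csv(path)
    df['Year'] = pd.to_datetime(df['Date'], dayfirst=True).dt.year
    counts = (df[df['NO2_Level'] > 150]
              .groupby(['Location', 'Year'])
              .size()
              .reset_index(name='Count'))
    for _, row in counts.iterrows():
        print(f"{row['Location']},{row['Year']},{row['Count']}")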
Here is a solution with a randomly generated sample:
import numpy as np
import pandas as pd

def random_dates(start, end, n):
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
location = ['street', 'avenue', 'road', 'town', 'campaign']
df = pd.DataFrame({'Date': random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location': np.random.choice(location, 20),
                   'NOE_level': np.random.randint(low=130, high=200, size=20)})
#Keep only year for Date
df['Date'] = df['Date'].dt.strftime("%Y")
print(df)
df = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x>150).sum()).reset_index(name='count')
print(df)
Example df generated:
Date Location NOE_level
0 2018 town 191
1 2017 campaign 187
2 2017 town 137
3 2016 avenue 148
4 2017 campaign 195
5 2018 town 181
6 2018 road 187
7 2018 town 184
8 2016 town 155
9 2016 street 183
10 2018 road 136
11 2017 road 171
12 2018 street 165
13 2015 avenue 193
14 2016 campaign 170
15 2016 street 132
16 2016 campaign 165
17 2015 road 161
18 2018 road 161
19 2015 road 140
output:
Location Date count
0 avenue 2015 1
1 avenue 2016 0
2 campaign 2016 2
3 campaign 2017 2
4 road 2015 1
5 road 2017 1
6 road 2018 2
7 street 2016 1
8 street 2018 1
9 town 2016 1
10 town 2017 0
11 town 2018 3

Transforming dataframe to track the changes

I have some student data and the subjects they have elected.
id name date from date to Subjectname note
1188 Cera 01-08-2016 30-09-2016 math approved
1188 Cera 01-10-2016 elec
1199 ron 01-06-2017 english app-true
1288 Snow 01-01-2017 tally
1433 sansa 25-01-2016 14-07-2016 tally
1433 sansa 15-07-2016 16-01-2017 tally relected
1844 amy 01-10-2016 10-11-2017 adv
1522 stark 01-01-2016 phy
1722 sid 01-06-2017 31-03-2018 history
1722 sid 01-04-2018 history as per request
1844 amy 01-01-2016 30-09-2016 science
2100 arya 01-08-2016 30-09-2016 english
2100 arya 01-10-2016 31-05-2017 math taken
2100 arya 01-06-2017 english
I am looking for output like:
id name from to subject from subject to
1188 Cera 01-08-2016 01-10-2016 math elec
1199 ron 01-06-2017 english
1288 Snow 01-01-2017 tally
1433 sansa 25-01-2016 16-01-2017 tally tally
1522 stark 01-01-2016 phy
1722 sid 01-06-2017 01-04-2018 history history
1844 amy 01-01-2016 10-11-2017 science adv
2100 arya 01-08-2016 31-05-2017 english math
2100 arya 01-06-2017 math english
column 'from' has the minimum date value corresponding to the name.
column 'to' has the maximum date value corresponding to the name.
column 'subject from' has the 'Subjectname' value corresponding to the column 'from' and 'name'.
column 'subject to' has the 'Subjectname' value corresponding to the column 'to' and 'name'.
I need to track the transactions made by each student and the subject they changed (subject from and subject to).
Please let me know how to achieve this,
or let me know if there is an easier way to get an output that contains transaction details per student and the subjects they changed.
Use DataFrameGroupBy.agg after set_index on the Subjectname column, so it is possible to use idxmin and idxmax to get the subject for the minimal and maximal datetimes per group:
df['date from'] = pd.to_datetime(df['date from'])
df['date to'] = pd.to_datetime(df['date to'])
d = {'date from':['min', 'idxmin'], 'date to':['max', 'idxmax']}
df1 = df.set_index('Subjectname').groupby(['id','name']).agg(d)
df1.columns = df1.columns.map('_'.join)
d1 = {'date from_min':'from', 'date to_max':'to',
      'date from_idxmin':'subject from', 'date to_idxmax':'subject to'}
cols = ['from','to','subject from','subject to']
df1 = df1.rename(columns=d1).reindex(columns=cols).reset_index()
print (df1)
id name from to subject from subject to
0 1188 Cera 2016-01-08 2016-09-30 math math
1 1199 ron 2017-01-06 NaT english NaN
2 1288 Snow 2017-01-01 NaT tally NaN
3 1433 sansa 2016-01-25 2017-01-16 tally tally
4 1522 stark 2016-01-01 NaT phy NaN
5 1722 sid 2017-01-06 2018-03-31 history history
6 1844 amy 2016-01-01 2017-10-11 science adv
7 2100 arya 2016-01-08 2017-05-31 english math
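One caveat worth adding (not part of the original answer): the question's dates look day-first (e.g. 25-01-2016), so parsing with dayfirst=True avoids '01-08-2016' being read as January 8 instead of August 1:
# Parse day-first dates explicitly so month and day are not swapped.
df['date from'] = pd.to_datetime(df['date from'], dayfirst=True)
df['date to'] = pd.to_datetime(df['date to'], dayfirst=True)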
My df is built from your first 3 rows; it should be enough to demo how to do this.
df:
id name date_from date_to subject_name note
0 1188 Cera 2016-01-08 30-09-2016 math approved
1 1188 Cera 2016-01-10 elec
2 1199 ron 2017-01-06 english app-true
Just pasting the code here.
# put date_from and date_to into one column to get the max and min dates
df1 = df[['id', 'name', 'date_from', 'subject_name', 'note']]
df2 = df[['id', 'name', 'date_to', 'subject_name', 'note']]
df1.columns = ['id', 'name', 'date', 'subject_name', 'note']
df2.columns = ['id', 'name', 'date', 'subject_name', 'note']
df3 = pd.concat([df1, df2])
df3['date'] = pd.to_datetime(df3['date'])
df3 = df3.dropna()
df3:
id name date subject_name note
0 1188 Cera 2016-01-08 math approved
1 1188 Cera 2016-01-10 elec
2 1199 ron 2017-01-06 english app-true
0 1188 Cera 2016-09-30 math approved
#here you get from and to date for each name
df4 = df3.groupby('name').agg({'date':[max,min]})
df4.columns = ['to','from']
df4 = df4.reset_index()
df4:
name to from
0 Cera 2016-09-30 2016-01-08
1 ron 2017-01-06 2017-01-06
# match "name" and "to" in df4 with "name" and "date" in df3, you got the earliest subject and latest
df_sub_from = pd.merge(df4,df3,how='left',left_on=['name','to'],right_on=['name','date'])
df_sub_from
df_sub_to = pd.merge(df4,df3,how='left',left_on=['name','to'],right_on=['name','date'])
df_sub_from = pd.merge(df4,df3,how='left',left_on=['name','from'],right_on=['name','date'])
#remove unneeded columns
df_sub_from = df_sub_from[['id','name','from','to','subject_name']]
df_sub_to = df_sub_to[['id','name','from','to','subject_name']]
# merge together and rename nicely
df_final = pd.merge(df_sub_from,df_sub_to,left_on=['id','name','from','to'],right_on=['id','name','from','to'])
df_final.columns = ['id','name','from','to','subject_from','subject_to']
Here it is:
id name from to subject_from subject_to
0 1188 Cera 2016-01-08 2016-09-30 math math
1 1199 ron 2017-01-06 2017-01-06 english english

Combine two pandas DataFrames where the date fields are within two months of each other

I need to combine 2 pandas dataframes where df1.date falls within the 2 months preceding df2.date. I then want to calculate how many traders had traded the same stock during that period and count the total shares purchased.
I have tried the approach listed below, but found it far too complicated. I believe there is a smarter/simpler solution.
Pandas: how to merge two dataframes on offset dates?
A sample dataset is below:
DF1 (team_1):
date shares symbol trader
31/12/2013 154 FDX Max
30/06/2016 2367 GOOGL Max
21/07/2015 293 ORCL Max
18/07/2015 304 ORCL Sam
DF2 (team_2):
date shares symbol trader
23/08/2015 345 ORCL John
04/07/2014 567 FB John
06/12/2013 221 ACER Sally
30/11/2012 889 HP John
05/06/2010 445 ABBV Kate
Required output:
date shares symbol trader team_2_traders team_2_shares_bought
23/08/2015 345 ORCL John 2 597
04/07/2014 567 FB John 0 0
06/12/2013 221 ACER Sally 0 0
30/11/2012 889 HP John 0 0
05/06/2010 445 ABBV Kate 0 0
This adds 2 new columns...
'team_2_traders' = count of how many traders from team_1 traded the same stock during the previous 2 months from the date listed on DF2.
'team_2_shares_bought' = total shares purchased by team_1 during the previous 2 months from the date listed on DF2.
If anyone is willing to give this a crack, please use the snippet below to setup the dataframes. Please keep in mind the actual dataset contains millions of rows and 6,000 company stocks.
import pandas as pd

team_1 = {'symbol': ['FDX','GOOGL','ORCL','ORCL'],
          'date': ['31/12/2013','30/06/2016','21/07/2015','18/07/2015'],
          'shares': [154,2367,293,304],
          'trader': ['Max','Max','Max','Sam']}
df1 = pd.DataFrame(team_1)
team_2 = {'symbol': ['ORCL','FB','ACER','HP','ABBV'],
          'date': ['23/08/2015','04/07/2014','06/12/2013','30/11/2012','05/06/2010'],
          'shares': [345,567,221,889,445],
          'trader': ['John','John','Sally','John','Kate']}
df2 = pd.DataFrame(team_2)
Appreciate the help - thank you.
Please check my solution.
from pandas.tseries.offsets import MonthEnd
df_ = df2.merge(df1, on=['symbol'])
df_['date_x'] = pd.to_datetime(df_['date_x'])
df_['date_y'] = pd.to_datetime(df_['date_y'])
df_2m = df_[df_['date_x'] < df_['date_y'] + MonthEnd(2)] \
        .loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
        .groupby('symbol')
df1_ = pd.concat([df_2m['shares_y'].sum(), df_2m['trader_y'].count()], axis=1)
print(df1_)
shares_y trader_y
symbol
ORCL 597 2
print(df2.merge(df1_.reset_index(), on='symbol', how='left').fillna(0))
date shares symbol trader shares_y trader_y
0 23/08/2015 345 ORCL John 597.0 2.0
1 04/07/2014 567 FB John 0.0 0.0
2 06/12/2013 221 ACER Sally 0.0 0.0
3 30/11/2012 889 HP John 0.0 0.0
4 05/06/2010 445 ABBV Kate 0.0 0.0
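As a possible finishing touch (not shown in the original answer), the merged columns can be renamed to the names requested in the question and the filled values cast back to integers:
result = (df2.merge(df1_.reset_index(), on='symbol', how='left')
             .fillna(0)
             .rename(columns={'trader_y': 'team_2_traders',
                              'shares_y': 'team_2_shares_bought'})
             .astype({'team_2_traders': int, 'team_2_shares_bought': int}))
print(result)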
