Subtract values with groupby in Pandas dataframe (Python)

I have a dataframe like this:
Alliance_name    Company_name  TOAD   MBA   Class  EVE   TBD    Sur
Shinva group     HVC corp      8845   1135  0      12    12128  1
Shinva group     LDN corp      11     1243  133    121   113    1
Telegraph group  Freename LLC  5487   223   928    0     0      21
Telegraph group  Grt           0      7543  24     3213  15     21
Zero group       PetZoo crp    5574   0     2      0     6478   1
Zero group       Elephant      48324  0     32     118   4      1
I need to subtract the values in each column between the two rows that share the same Alliance_name.
(Ideally the last column, Sur, would not be subtracted, but that is not the main goal.)
I know that for addition we can make something like this:
df = df.groupby('Alliance_name').sum()
But I don't know how to do this with subtraction.
The result should be like this (if we don't subtract the last column):
Alliance_name    Company_name         TOAD    MBA    Class  EVE    TBD    Sur
Shinva group     HVC corp LDN corp    8834    -108   -133   -109   12015  1
Telegraph group  Freename LLC Grt     5487    -7320  904    -3213  -15    21
Zero group       PetZoo crp Elephant  -42750  0      -30    -118   6474   1
Thanks for your help!

You could invert the values to subtract, and then sum them.
df.loc[df.Alliance_name.duplicated(keep="first"), ["TOAD", "MBA", "Class", "EVE", "TBD", "Sur"]] *= -1
df.groupby("Alliance_name").sum()

The .first() and .last() groupby methods can be useful for such tasks.
You can organize the columns you want to skip/compute
>>> df.columns
Index(['Alliance_name', 'Company_name', 'TOAD', 'MBA', 'Class', 'EVE', 'TBD',
'Sur'],
dtype='object')
>>> alliance, company, *cols, sur = df.columns
>>> groups = df.groupby(alliance)
>>> company = groups.first()[[company]]
>>> sur = groups.first()[sur]
>>> groups = groups[cols]
And use .first() - .last() directly:
>>> groups.first() - groups.last()
TOAD MBA Class EVE TBD
Alliance_name
Shinva group 8834 -108 -133 -109 12015
Telegraph group 5487 -7320 904 -3213 -15
Zero group -42750 0 -30 -118 6474
Then .join() the other columns back in
>>> company.join(groups.first() - groups.last()).join(sur).reset_index()
Alliance_name Company_name TOAD MBA Class EVE TBD Sur
0 Shinva group HVC corp 8834 -108 -133 -109 12015 1
1 Telegraph group Freename LLC 5487 -7320 904 -3213 -15 21
2 Zero group PetZoo crp -42750 0 -30 -118 6474 1
Another approach:
>>> df - df.drop(columns=['Company_name', 'Sur']).groupby('Alliance_name').shift(-1)
Alliance_name Class Company_name EVE MBA Sur TBD TOAD
0 NaN -133.0 NaN -109.0 -108.0 NaN 12015.0 8834.0
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 904.0 NaN -3213.0 -7320.0 NaN -15.0 5487.0
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN -30.0 NaN -118.0 0.0 NaN 6474.0 -42750.0
5 NaN NaN NaN NaN NaN NaN NaN NaN
You can then drop the all-NaN rows and fill the remaining values from the original df.
>>> ((df - df.drop(columns=['Company_name', 'Sur']).groupby('Alliance_name').shift(-1))
...     .dropna(how='all')[df.columns].fillna(df))
Alliance_name Company_name TOAD MBA Class EVE TBD Sur
0 Shinva group HVC corp 8834 -108 -133 -109 12015 1
2 Telegraph group Freename LLC 5487 -7320 904 -3213 -15 21
4 Zero group PetZoo crp -42750 0 -30 -118 6474 1
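If you also want the company names joined together and Sur left alone, exactly as in the expected output above, a compact variant (a sketch building on the same idea, not taken from the answers) is a single agg call:
value_cols = ['TOAD', 'MBA', 'Class', 'EVE', 'TBD']
first_minus_last = lambda s: s.iloc[0] - s.iloc[-1]          # per-group subtraction
out = (df.groupby('Alliance_name')
         .agg({'Company_name': ' '.join, 'Sur': 'first',
               **{c: first_minus_last for c in value_cols}})
         [['Company_name'] + value_cols + ['Sur']]
         .reset_index())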

Related

Avoid aggregation error in groupby after merging dataframes in pandas

I have two dataframes:
emp10
empno ename job mgr hiredate sal comm deptno
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
emp_bonus
empno received type
0 7934 2005-03-17 1
1 7934 2005-02-15 2
2 7839 2005-02-15 3
3 7782 2005-02-15 1
And if I sum the salaries of all the employees from emp10, you can see that it is 8750 at the moment.
emp10['sal'].sum()
8750
Now, I want to join both dataframes to also calculate the bonus for all the employees.
emp10bonus = emp10.merge(emp_bonus, on='empno')
def get_bonus(row):
    if row['type'] == 1:
        bonus = row['sal'] * 0.1
    elif row['type'] == 2:
        bonus = row['sal'] * 0.2
    else:
        bonus = row['sal'] * 0.3
    return bonus

emp10bonus['bonus'] = emp10bonus.apply(get_bonus, axis=1)
empno ename job mgr hiredate sal comm deptno received type bonus
0 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10 2005-02-15 1 245.0
1 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10 2005-02-15 3 1500.0
2 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10 2005-03-17 1 130.0
3 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10 2005-02-15 2 260.0
Now, if I try to calculate the sum, I am getting the wrong result.
emp10bonus.groupby('empno')[['sal','bonus']].sum()
sal bonus
empno
7782 2450 245.0
7839 5000 1500.0
7934 2600 390.0
emp10bonus.groupby('empno')[['sal','bonus']].sum()['sal'].sum()
10050
The two bonuses for Miller in the emp_bonus table cause his salary to be double-counted when joining the tables.
How can I avoid this error?
Try this:
import numpy as np

emp10bonus = emp10.merge(emp_bonus, on='empno')
emp10bonus['multiplier'] = np.select([emp10bonus['type'] == 1, emp10bonus['type'] == 2], [.1, .2], .3)
emp10bonus = emp10bonus.eval('bonus = sal * multiplier')
emp10bonus.groupby('empno').agg({'sal': 'first', 'bonus': 'sum'})
Output:
sal bonus
empno
7782 2450 245.0
7839 5000 1500.0
7934 1300 390.0
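Another option (a sketch using the same column names, not from the answer above) is to sum the bonuses per employee first and only then merge that single number back onto emp10, so each salary row appears exactly once:
bonus = emp10.merge(emp_bonus, on='empno')
bonus['bonus'] = bonus['sal'] * bonus['type'].map({1: 0.1, 2: 0.2}).fillna(0.3)  # per-row bonus
per_emp = bonus.groupby('empno', as_index=False)['bonus'].sum()                  # one row per empno
result = emp10.merge(per_emp, on='empno', how='left')
result['sal'].sum()   # 8750 again, salaries are no longer double-counted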

Pandas conditional outer join based on timedelta (merge_asof)

I have multiple dataframes that I need to merge into a single dataset based on a unique identifier (uid), and on the timedelta between dates in each dataframe.
Here's a simplified example of the dataframes:
df1
uid tx_date last_name first_name meas_1
0 60 2004-01-11 John Smith 1.3
1 60 2016-12-24 John Smith 2.4
2 61 1994-05-05 Betty Jones 1.2
3 63 2006-07-19 James Wood NaN
4 63 2008-01-03 James Wood 2.9
5 65 1998-10-08 Tom Plant 4.2
6 66 2000-02-01 Helen Kerr 1.1
df2
uid rx_date last_name first_name meas_2
0 60 2004-01-14 John Smith A
1 60 2017-01-05 John Smith AB
2 60 2017-03-31 John Smith NaN
3 63 2006-07-21 James Wood A
4 64 2002-04-18 Bill Jackson B
5 65 1998-10-08 Tom Plant AA
6 65 2005-12-01 Tom Plant B
7 66 2013-12-14 Helen Kerr C
Basically I am trying to merge records for the same person from two separate sources, where the link between records for unique individuals is the 'uid', and the link between rows (where it exists) for each individual is a fuzzy relationship between 'tx_date' and 'rx_date' that can (usually) be accommodated by a specific time delta. There won't always be an exact or fuzzy match between dates, data could be missing from any column except 'uid', and each dataframe will contain a different but intersecting subset of 'uid's.
I need to be able to concatenate rows where the 'uid' columns match, and where the absolute time delta between 'tx_date' and 'rx_date' is within a given range (e.g. max delta of 14 days). Where the time delta is outside that range, or one of either 'tx_date' or 'rx_date' is missing, or where the 'uid' exists in only one of the dataframes, I still need to retain the data in that row. The end result should be something like:
uid tx_date rx_date first_name last_name meas_1 meas_2
0 60 2004-01-11 2004-01-14 John Smith 1.3 A
1 60 2016-12-24 2017-01-05 John Smith 2.4 AB
2 60 NaT 2017-03-31 John Smith NaN NaN
3 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood NaN NaN
6 64 2002-04-18 NaT Bill Jackson NaN B
7 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
8 65 NaT 2005-12-01 Tom Plant NaN B
9 66 2000-02-01 NaT Helen Kerr 1.1 NaN
10 66 NaT 2013-12-14 Helen Kerr NaN C
Seems like pandas.merge_asof should be useful here, but I've not been able to get it to do quite what I need.
Trying merge_asof on two of the real dataframes I have gave an error ValueError: left keys must be sorted
As per this question the problem there was actually due to there being NaT values in the 'date' column for some rows. I dropped the rows with NaT values, and sorted the 'date' columns in each dataframe, but the result still isn't quite what I need.
The code below shows the steps taken.
import pandas as pd
df1['date'] = df1['tx_date']
df1['date'] = pd.to_datetime(df1['date'])
df1['date'] = df1['date'].dropna()
df1 = df1.sort_values('date')
df2['date'] = df2['rx_date']
df2['date'] = pd.to_datetime(df2['date'])
df2['date'] = df2['date'].dropna()
df2 = df2.sort_values('date')
df_merged = (pd.merge_asof(df1, df2, on='date', by='uid', tolerance=pd.Timedelta('14 days'))).sort_values('uid')
Result:
uid tx_date rx_date last_name_x first_name_x meas_1 meas_2
3 60 2004-01-11 2004-01-14 John Smith 1.3 A
6 60 2016-12-24 2017-01-05 John Smith 2.4 AB
0 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood 2.9 NaN
1 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
2 66 2000-02-01 NaT Helen Kerr 1.1 NaN
It looks like a left join rather than a full outer join, so any row in df2 without a match on 'uid' and 'date' in df1 is lost (and it's not really clear from this simplified example, but I also need to add back the rows where the date was NaT).
Is there some way to achieve a lossless merge, either by somehow doing an outer join with merge_asof, or using some other approach?
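One way to approximate a lossless result (a sketch only, using the column names from the example and assuming NaT dates are handled first, since merge_asof has no outer mode) is to do the asof match and then concatenate back the df2 rows that never found a partner:
import pandas as pd

df1['date'] = pd.to_datetime(df1['tx_date'])
df2['date'] = pd.to_datetime(df2['rx_date'])

# asof-match within 14 days in either direction
matched = pd.merge_asof(
    df1.sort_values('date'), df2.sort_values('date'),
    on='date', by='uid', direction='nearest',
    tolerance=pd.Timedelta('14 days'), suffixes=('', '_df2'))

# bring back the df2 rows that were not used in any match
key = ['uid', 'rx_date']
leftovers = (df2.merge(matched[key].dropna().drop_duplicates(), on=key,
                       how='left', indicator=True)
                .query('_merge == "left_only"').drop(columns='_merge'))

result = pd.concat([matched, leftovers], ignore_index=True, sort=False).sort_values('uid')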

How to find the row with specified value in DataFrame

As a newbie to deeper DataFrame operations, I would like to ask how to find, e.g., the lowest campaign ID per customerid in this kind of DataFrame. As I learned, iterating over a DataFrame should be avoided.
orderid customerid campaignid orderdate city state zipcode paymenttype totalprice numorderlines numunits
0 1002854 45978 2141 2009-10-13 NEWTON MA 02459 VI 190.00 3 3
1 1002855 125381 2173 2009-10-13 NEW ROCHELLE NY 10804 VI 10.00 1 1
2 1002856 103122 2141 2011-06-02 MIAMI FL 33137 AE 35.22 2 2
3 1002857 130980 2173 2009-10-14 E RUTHERFORD NJ 07073 AE 10.00 1 1
4 1002886 48553 2141 2010-11-19 BALTIMORE MD 21218 VI 10.00 1 1
5 1002887 106150 2173 2009-10-15 ROWAYTON CT 06853 AE 10.00 1 1
6 1002888 27805 2173 2009-10-15 INDIANAPOLIS IN 46240 VI 10.00 1 1
7 1002889 24546 2173 2009-10-15 PLEASANTVILLE NY 10570 MC 10.00 1 1
8 1002890 43783 2173 2009-10-15 EAST STROUDSBURG PA 18301 DB 29.68 2 2
9 1003004 15688 2173 2009-10-15 ROUND LAKE PARK IL 60073 DB 19.68 1 1
10 1003044 130970 2141 2010-11-22 BLOOMFIELD NJ 07003 AE 10.00 1 1
11 1003045 40048 2173 2010-11-22 SPRINGFIELD IL 62704 MC 10.00 1 1
12 1003046 21927 2141 2010-11-22 WACO TX 76710 MC 17.50 1 1
13 1003075 130971 2141 2010-11-22 FAIRFIELD NJ 07004 MC 59.80 1 4
14 1003076 7117 2141 2010-11-22 BROOKLYN NY 11228 AE 22.50 1 1
Try the following
df.groupby('customerid')['campaignid'].min()
This groups the rows by customerid and then finds the minimum value per group for the chosen column via ['column_name'].min().
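If you also need the full rows holding those minimums (a sketch, assuming you want one row per customer), you can look up their positions with idxmin:
# one complete row per customerid, taken where its campaignid is smallest
lowest_rows = df.loc[df.groupby('customerid')['campaignid'].idxmin()]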

How do I extract entire table and store it in CSV file?

I am trying to scrape an entire table and want to store it in a .csv file.
While trying to scrape this data I get the error NO TABLES FOUND.
Here is my code.
from pandas.io.html import read_html
page = 'https://games.crossfit.com/leaderboard/open/2020?view=0&division=1&scaled=0&sort=0'
tables = read_html(page, attrs={"class":"desktop athletes"})
print ("Extracted {num} tables".format(num=len(tables)))
Any suggestions or guidance would be appreciated.
This page uses JavaScript to get the data from the server and generate the table.
But using DevTools in Chrome/Firefox you can see (in the Network tab) all requests from the browser to the server, and one of the XHR/AJAX requests gets all the data in JSON format. You can therefore use that URL to fetch it directly as JSON, convert it to Python data, and skip scraping the HTML altogether.
import requests

r = requests.get('https://games.crossfit.com/competitions/api/v1/competitions/open/2020/leaderboards?view=0&division=1&scaled=0&sort=0')
data = r.json()

for row in data['leaderboardRows']:
    print(row['entrant']['competitorName'], row['overallScore'], [(x['rank'], x['scoreDisplay']) for x in row['scores']])
Result
Patrick Vellner 64 [('13', '8:38'), ('19', '988 reps'), ('12', '6:29'), ('18', '16:29'), ('2', '10:09')]
Mathew Fraser 74 [('8', '8:28'), ('40', '959 reps'), ('3', '6:08'), ('2', '14:22'), ('21', '10:45')]
Lefteris Theofanidis 94 [('1', '8:05'), ('3', '1021 reps'), ('13', '6:32'), ('4', '15:00'), ('73', '11:11')]
# ... more ...
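If the end goal is just a flat CSV (a sketch on top of the same data object; the name/overall/event_1... column names are my own choice, not from the API), pandas can write it directly:
import pandas as pd

rows = [
    {'name': row['entrant']['competitorName'],
     'overall': row['overallScore'],
     **{'event_%d' % (i + 1): s['scoreDisplay'] for i, s in enumerate(row['scores'])}}
    for row in data['leaderboardRows']
]
pd.DataFrame(rows).to_csv('leaderboard.csv', index=False)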
As stated above, you can access the API to get the data. To save it as CSV, you'll need to work through the JSON format to get what you need (i.e. flatten out the nested data). There are two ways to do it: a) completely flatten it out so that there is one row per entrant, or b) have a separate row for each entrant for each of their ordinal scores.
The only difference is that with a) you'll have a really wide table (but no repeated data), and with b) you'll have a long table with repeated data.
Since it's not too big a file, I went with option b) so you can always groupby particular columns or filter:
import requests
import pandas as pd

r = requests.get('https://games.crossfit.com/competitions/api/v1/competitions/open/2020/leaderboards?view=0&division=1&scaled=0&sort=0')
data = r.json()

results = pd.DataFrame()
df = pd.DataFrame(data['leaderboardRows'])
for idx, row in df.iterrows():
    entrantData = pd.Series()
    scoresData = pd.DataFrame()
    entrantResults = pd.DataFrame()
    for idx2, each in row.iteritems():
        if type(each) == dict:
            temp = pd.DataFrame.from_dict(each, orient='index')
            entrantData = entrantData.append(temp)
        elif type(each) == list:
            temp2 = pd.DataFrame(each)
            scoresData = scoresData.append(temp2, sort=True).reset_index(drop=True)
        else:
            entrantData = entrantData.append(pd.Series(each, name=idx2))
    entrantResults = entrantResults.append(scoresData, sort=True).reset_index(drop=True)
    entrantResults = entrantResults.merge(pd.concat([entrantData.T] * 5, ignore_index=True), left_index=True, right_index=True)
    results = results.append(entrantResults, sort=True).reset_index(drop=True)

results.to_csv('file.csv', index=False)
Output: first 15 rows of 250
print (results.head(15).to_string())
affiliate affiliateId affiliateName age breakdown competitorId competitorName countryChampion countryOfOriginCode countryOfOriginName divisionId drawBlueHR firstName gender heat height highlight judge lane lastName mobileScoreDisplay nextStage ordinal overallRank overallScore postCompStatus profilePicS3key rank scaled score scoreDisplay scoreIdentifier status time video weight
0 CrossFit Nanaimo 1918 CrossFit Nanaimo 30 10 rounds 158264 Patrick Vellner False CA Canada 1 NaN Patrick M 71 in False Dallyn Giroux Vellner 1 1 64 d471c-P158264_7-184.jpg 13 0 11800382 8:38 9d3979222412df2842a1 ACT 518 0 195 lb
1 CrossFit Soul Miami 1918 CrossFit Nanaimo 30 29 rounds +\n2 thrusters\n 158264 Patrick Vellner False CA Canada 1 NaN Patrick M 71 in False Lamar Vernon Vellner 2 1 64 d471c-P158264_7-184.jpg 19 0 1009880000 988 reps 9bd66b00e8367cc7fd0c ACT NaN 0 195 lb
2 CrossFit Nanaimo 1918 CrossFit Nanaimo 30 165 reps 158264 Patrick Vellner False CA Canada 1 NaN Patrick M 71 in False Jason Lochhead Vellner 3 1 64 d471c-P158264_7-184.jpg 12 0 1001650151 6:29 2347b4cb7339f2a13e6c ACT 389 0 195 lb
3 CrossFit Nanaimo 1918 CrossFit Nanaimo 30 240 reps 158264 Patrick Vellner False CA Canada 1 NaN Patrick M 71 in False Dallyn Giroux Vellner 4 1 64 d471c-P158264_7-184.jpg 18 0 1002400211 16:29 bcfd3882df3fa2e99451 ACT 989 0 195 lb
4 CrossFit New England 1918 CrossFit Nanaimo 30 240 reps 158264 Patrick Vellner False CA Canada 1 NaN Patrick M 71 in False Matt O'Keefe Vellner 5 1 64 d471c-P158264_7-184.jpg 2 0 1002400591 10:09 4bb25bed5f71141da122 ACT 609 0 195 lb
5 CrossFit Mayhem 3220 CrossFit Mayhem 30 10 rounds 153604 Mathew Fraser True US United States 1 NaN Mathew M 67 in False darren hunsucker Fraser 1 2 74 9e218-P153604_4-184.jpg 8 0 11800392 8:28 18b5b2e137f00a2d9d7d ACT 508 0 195 lb
6 CrossFit Soul Miami 3220 CrossFit Mayhem 30 28 rounds +\n4 thrusters\n3 toes-to-bars\n 153604 Mathew Fraser True US United States 1 NaN Mathew M 67 in False Daniel Lopez Fraser 2 2 74 9e218-P153604_4-184.jpg 40 0 1009590000 959 reps b96bc1b7b58fa34a28a1 ACT NaN 0 195 lb
7 CrossFit Mayhem 3220 CrossFit Mayhem 30 165 reps 153604 Mathew Fraser True US United States 1 NaN Mathew M 67 in False Jason Fernandez Fraser 3 2 74 9e218-P153604_4-184.jpg 3 0 1001650172 6:08 4f4a994a045652c894c5 ACT 368 0 195 lb
8 CrossFit Mayhem 3220 CrossFit Mayhem 30 240 reps 153604 Mathew Fraser True US United States 1 NaN Mathew M 67 in False Tasia Percevecz Fraser 4 2 74 9e218-P153604_4-184.jpg 2 0 1002400338 14:22 1a4a7d8760e72bb12d68 ACT 862 0 195 lb
9 CrossFit Mayhem 3220 CrossFit Mayhem 30 240 reps 153604 Mathew Fraser True US United States 1 NaN Mathew M 67 in False Kelley Jackson Fraser 5 2 74 9e218-P153604_4-184.jpg 21 0 1002400555 10:45 b4a259e7049f47f65356 ACT 645 0 195 lb
10 NaN 0 30 10 rounds 514502 Lefteris Theofanidis True GR Greece 1 NaN Lefteris M 171 cm False NaN Theofanidis 1 3 94 931eb-P514502_2-184.jpg 1 0 11800415 8:05 c8907e02512f42ff3142 ACT 485 1 81 kg
11 NaN 0 30 30 rounds +\n1 thruster\n 514502 Lefteris Theofanidis True GR Greece 1 NaN Lefteris M 171 cm False NaN Theofanidis 2 3 94 931eb-P514502_2-184.jpg 3 0 1010210000 1021 reps 63add31b22606957701c ACT NaN 1 81 kg
12 NaN 0 30 165 reps 514502 Lefteris Theofanidis True GR Greece 1 NaN Lefteris M 171 cm False NaN Theofanidis 3 3 94 931eb-P514502_2-184.jpg 13 0 1001650148 6:32 46d7cdb691c25ea38dbe ACT 392 1 81 kg
13 NaN 0 30 240 reps 514502 Lefteris Theofanidis True GR Greece 1 NaN Lefteris M 171 cm False NaN Theofanidis 4 3 94 931eb-P514502_2-184.jpg 4 0 1002400300 15:00 d49e55a2af5840740071 ACT 900 1 81 kg
14 NaN 0 30 240 reps 514502 Lefteris Theofanidis True GR Greece 1 NaN Lefteris M 171 cm False NaN Theofanidis 5 3 94 931eb-P514502_2-184.jpg 73 0 1002400529 11:11 d35c9d687eb6b72c8e36 ACT 671 1 81 kg
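pandas.json_normalize can do much of this flattening in fewer lines (a sketch; the record and meta paths are read off the JSON structure shown above and are assumptions about its layout):
import pandas as pd

flat = pd.json_normalize(
    data['leaderboardRows'],                       # one dict per entrant
    record_path='scores',                          # explode the per-event scores into rows
    meta=[['entrant', 'competitorName'], 'overallRank', 'overallScore'],
    record_prefix='score.')
flat.to_csv('file.csv', index=False)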

Sub totals and grand totals in Python [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I was trying to make subtotals and grand totals for some data, but I got stuck somewhere and couldn't produce my desired output. Could you please assist with this?
data.groupby(['Column4', 'Column5'])['Column1'].count()
Current Output:
Column4 Column5
2018-05-19 Duplicate 220
Informative 3
2018-05-20 Actionable 5
Duplicate 270
Informative 859
Non-actionable 2
2018-05-21 Actionable 8
Duplicate 295
Informative 17
2018-05-22 Actionable 10
Duplicate 424
Informative 36
2018-05-23 Actionable 8
Duplicate 157
Informative 3
2018-05-24 Actionable 5
Duplicate 78
Informative 3
2018-05-25 Actionable 3
Duplicate 80
Expected Output:
Row Labels   Actionable  Duplicate  Informative  Non-actionable  Grand Total
5/19/2018                219        3                            222
5/20/2018    5           270        859          2               1136
5/21/2018    8           295        17                           320
5/22/2018    10          424        36                           470
5/23/2018    8           157        3                            168
5/24/2018    5           78         3                            86
5/25/2018    3           80                                      83
Grand Total  39          1523       921          2               2485
Here is some sample data. Could you please take a look at it together with my question above? I was getting minor errors; maybe I didn't provide the right data earlier.
Column1 Column2 Column3 Column4 Column5 Column6
BI Account Subject1 2:12 PM 5/19/2018 Duplicate Name1
PI Account Subject2 1:58 PM 5/19/2018 Actionable Name2
AI Account Subject3 5:01 PM 5/19/2018 Non-Actionable Name3
BI Account Subject4 5:57 PM 5/19/2018 Informative Name4
PI Account Subject5 6:59 PM 5/19/2018 Duplicate Name5
AI Account Subject6 8:07 PM 5/19/2018 Actionable Name1
You can use pivot to get from your current output to your desired output and then sum to calculate the totals you want.
import pandas as pd
df = df.reset_index().pivot(index='Column4', columns='Column5', values='Column1')
# Add grand total columns, summing across all other columns
df['Grand Total'] = df.sum(axis=1)
df.columns.name = None
df.index.name = None
# Add the grand total row, summing all values in a column
df.loc['Grand Total', :] = df.sum()
df is now:
Actionable Duplicate Informative Non-actionable Grand Total
2018-05-19 NaN 220.0 3.0 NaN 223.0
2018-05-20 5.0 270.0 859.0 2.0 1136.0
2018-05-21 8.0 295.0 17.0 NaN 320.0
2018-05-22 10.0 424.0 36.0 NaN 470.0
2018-05-23 8.0 157.0 3.0 NaN 168.0
2018-05-24 5.0 78.0 3.0 NaN 86.0
2018-05-25 3.0 80.0 NaN NaN 83.0
Grand Total 39.0 1524.0 921.0 2.0 2486.0
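If you prefer the NaNs shown as zeros and the counts as whole numbers, as in the expected output, a small cosmetic touch (a sketch) is:
df.fillna(0).astype(int)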
Just using crosstab
pd.crosstab(df['Column4'], df['Column5'], margins=True, margins_name='Grand Total')
Take a look at this:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html
You need to pivot your table:
df.reset_index().pivot(index='Column4', columns='Column5', values='Column1')
