Avoid aggregation error in groupby after merging dataframes in pandas - python

I have two dataframes:
emp10
empno ename job mgr hiredate sal comm deptno
6 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10
8 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10
13 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10
emp_bonus
empno received type
0 7934 2005-03-17 1
1 7934 2005-02-15 2
2 7839 2005-02-15 3
3 7782 2005-02-15 1
If I sum the salary of all the employees in emp10, you can see that it is 8750 at the moment.
emp10['sal'].sum()
8750
Now, I want to join both dataframes so I can also calculate the bonus for each employee.
emp10bonus = emp10.merge(emp_bonus, on='empno')
def get_bonus(row):
    if row['type'] == 1:
        bonus = row['sal'] * 0.1
    elif row['type'] == 2:
        bonus = row['sal'] * 0.2
    else:
        bonus = row['sal'] * 0.3
    return bonus
emp10bonus['bonus'] = emp10bonus.apply(get_bonus, axis=1)
empno ename job mgr hiredate sal comm deptno received type bonus
0 7782 CLARK MANAGER 7839.0 1981-06-09 2450 NaN 10 2005-02-15 1 245.0
1 7839 KING PRESIDENT NaN 1981-11-17 5000 NaN 10 2005-02-15 3 1500.0
2 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10 2005-03-17 1 130.0
3 7934 MILLER CLERK 7782.0 1982-01-23 1300 NaN 10 2005-02-15 2 260.0
Now, if I try to calculate the sum, I get the wrong result.
emp10bonus.groupby('empno')[['sal','bonus']].sum()
sal bonus
empno
7782 2450 245.0
7839 5000 1500.0
7934 2600 390.0
emp10bonus.groupby('empno')[['sal','bonus']].sum()['sal'].sum()
10050
The two bonus rows for Miller in the emp_bonus table cause his salary to be counted twice when the tables are joined.
How can I avoid this error?

Try this:
import numpy as np

emp10bonus = emp10.merge(emp_bonus, on='empno')
emp10bonus['multiplier'] = np.select([emp10bonus['type'] == 1, emp10bonus['type'] == 2], [0.1, 0.2], default=0.3)
emp10bonus = emp10bonus.eval('bonus = sal * multiplier')
emp10bonus.groupby('empno').agg({'sal': 'first', 'bonus': 'sum'})
Output:
sal bonus
empno
7782 2450 245.0
7839 5000 1500.0
7934 1300 390.0
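As a quick sanity check (a small sketch reusing the frames above), taking 'first' for sal means each employee's salary is counted only once, so the totals line up with the original data again:

totals = emp10bonus.groupby('empno').agg({'sal': 'first', 'bonus': 'sum'})
totals['sal'].sum()    # 8750, matching emp10['sal'].sum()
totals['bonus'].sum()  # 2135.0, the total bonus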

Related

Subtract values with groupby in Pandas dataframe Python

I have a dataframe like this:
Alliance_name    Company_name  TOAD   MBA   Class  EVE   TBD    Sur
Shinva group     HVC corp       8845  1135      0    12  12128    1
Shinva group     LDN corp         11  1243    133   121    113    1
Telegraph group  Freename LLC   5487   223    928     0      0   21
Telegraph group  Grt               0  7543     24  3213     15   21
Zero group       PetZoo crp     5574     0      2     0   6478    1
Zero group       Elephant      48324     0     32   118      4    1
I need to subtract the values in each column between rows that have the same Alliance_name.
(It would be perfect not to subtract the last column, Sur, but that is not the main goal.)
I know that for addition we can do something like this:
df = df.groupby('Alliance_name').sum()
But I don't know how to do this with subtraction.
The result should be like this (if we don't subtract the last column):
Alliance_name    Company_name         TOAD    MBA    Class  EVE    TBD    Sur
Shinva group     HVC corp LDN corp      8834   -108   -133   -109  12015    1
Telegraph group  Freename LLC Grt       5487  -7320    904  -3212    -15   21
Zero group       PetZoo crp Elephant  -42750      0    -30   -118   6474    1
Thanks for your help!
You could negate the values that should be subtracted, and then sum them.
df.loc[df.Alliance_name.duplicated(keep="first"), ["TOAD", "MBA", "Class", "EVE", "TBD", "Sur"]] *= -1
df.groupby("Alliance_name").sum()
The .first() and .last() groupby methods can be useful for such tasks.
You can organize the columns you want to skip and the ones you want to compute on:
>>> df.columns
Index(['Alliance_name', 'Company_name', 'TOAD', 'MBA', 'Class', 'EVE', 'TBD',
'Sur'],
dtype='object')
>>> alliance, company, *cols, sur = df.columns
>>> groups = df.groupby(alliance)
>>> company = groups.first()[[company]]
>>> sur = groups.first()[sur]
>>> groups = groups[cols]
And use .first() - .last() directly:
>>> groups.first() - groups.last()
TOAD MBA Class EVE TBD
Alliance_name
Shinva group 8834 -108 -133 -109 12015
Telegraph group 5487 -7320 904 -3213 -15
Zero group -42750 0 -30 -118 6474
Then .join() the other columns back in
>>> company.join(groups.first() - groups.last()).join(sur).reset_index()
Alliance_name Company_name TOAD MBA Class EVE TBD Sur
0 Shinva group HVC corp 8834 -108 -133 -109 12015 1
1 Telegraph group Freename LLC 5487 -7320 904 -3213 -15 21
2 Zero group PetZoo crp -42750 0 -30 -118 6474 1
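Note that first() - last() only looks at the first and last row of each group; that is enough here because every alliance has exactly two rows, but with more rows per group you would need a different reduction (for example the negate-and-sum approach above).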
Another approach:
>>> df - df.drop(columns=['Company_name', 'Sur']).groupby('Alliance_name').shift(-1)
Alliance_name Class Company_name EVE MBA Sur TBD TOAD
0 NaN -133.0 NaN -109.0 -108.0 NaN 12015.0 8834.0
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 904.0 NaN -3213.0 -7320.0 NaN -15.0 5487.0
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN -30.0 NaN -118.0 0.0 NaN 6474.0 -42750.0
5 NaN NaN NaN NaN NaN NaN NaN NaN
You can then drop the all-NaN rows and fill the remaining values from the original df.
>>> ((df - df.drop(columns=['Company_name', 'Sur'])
.groupby('Alliance_name').shift(-1)).dropna(how='all')[df.columns].fillna(df))
Alliance_name Company_name TOAD MBA Class EVE TBD Sur
0 Shinva group HVC corp 8834 -108 -133 -109 12015 1
2 Telegraph group Freename LLC 5487 -7320 904 -3213 -15 21
4 Zero group PetZoo crp -42750 0 -30 -118 6474 1

Copy values from one column to another column with different rows based on two conditions

My dataframe looks basically like this:
import datetime as dt
import numpy as np
import pandas as pd

data = [[11200, 33000, dt.datetime(1995,3,1), 10, np.nan], [11200, 33000, dt.datetime(1995,3,2), 11, np.nan], [11200, 33000, dt.datetime(1995,3,3), 9, np.nan],
        [23400, 45000, dt.datetime(1995,3,1), 50, np.nan], [23400, 45000, dt.datetime(1995,3,3), 49, np.nan], [33000, 55000, dt.datetime(1995,3,1), 60, np.nan], [33000, 55000, dt.datetime(1995,3,2), 61, np.nan]]
df = pd.DataFrame(data, columns=["Identifier", "Identifier2", "date", "price", "price2"])
Output looks like:
index Identifier1 Identifier2 date price1 price2
0 11200 33000 1995-03-01 10 nan
1 11200 33000 1995-03-02 11 nan
2 11200 33000 1995-03-03 9 nan
3 23400 45000 1995-03-01 50 nan
4 23400 45000 1995-03-03 49 nan
5 33000 55000 1995-03-01 60 nan
6 33000 55000 1995-03-02 61 nan
Please note that my real index is not sorted in ascending order like the one in my example df.
I would like to look up the number from column Identifier2 (I know the exact number I want to look up) in column Identifier1, and then copy the value of price1 into price2 for the matching dates, because some dates are missing.
My goal would look like this:
index Identifier1 Identifier2 date price1 price2
0 11200 33000 1995-03-01 10 60
1 11200 33000 1995-03-02 11 61
2 11200 33000 1995-03-03 9 nan
3 23400 45000 1995-03-01 50 nan
4 23400 45000 1995-03-03 49 nan
5 33000 55000 1995-03-01 60 nan
6 33000 55000 1995-03-02 61 nan
I'm sure this is not too difficult, but somehow I don't get it.
Thank you very much in advance for any help.
One way:
df['price2'] = df[['Identifier2', 'date']].apply(tuple, axis=1).map(df.set_index(['Identifier', 'date'])['price'].to_dict())
OUTPUT:
Identifier Identifier2 date price price2
0 11200 33000 1995-03-01 10 60.0
1 11200 33000 1995-03-02 11 61.0
2 11200 33000 1995-03-03 9 NaN
3 23400 45000 1995-03-01 50 NaN
4 23400 45000 1995-03-03 49 NaN
5 33000 55000 1995-03-01 60 NaN
6 33000 55000 1995-03-02 61 NaN
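If you prefer to avoid apply(..., axis=1), a rough equivalent (assuming each (Identifier, date) pair occurs at most once) is to build the lookup keys as a MultiIndex and reindex the price series with it:

lookup = df.set_index(['Identifier', 'date'])['price']
keys = pd.MultiIndex.from_frame(df[['Identifier2', 'date']])
# reindex returns NaN where no (Identifier, date) match exists
df['price2'] = lookup.reindex(keys).to_numpy()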
I don't know if it is the best way, but this works:
Using merge:
# Get copies as two separate dataframes
df1 = df.reset_index()[['index', 'Identifier', 'Identifier2', 'date', 'price']]
df2 = df[['Identifier', 'date', 'price']]
# Merge on the left
df3 = df1.merge(df2, how='left', left_on=['Identifier2', 'date'], right_on=['Identifier', 'date'], suffixes=('', 'R'))
# Drop the created IdentifierR column and rename priceR to price2
df4 = df3.drop('IdentifierR', axis=1).rename(columns={'priceR': 'price2'})

Pandas conditional outer join based on timedelta (merge_asof)

I have multiple dataframes that I need to merge into a single dataset based on a unique identifier (uid), and on the timedelta between dates in each dataframe.
Here's a simplified example of the dataframes:
df1
uid tx_date last_name first_name meas_1
0 60 2004-01-11 John Smith 1.3
1 60 2016-12-24 John Smith 2.4
2 61 1994-05-05 Betty Jones 1.2
3 63 2006-07-19 James Wood NaN
4 63 2008-01-03 James Wood 2.9
5 65 1998-10-08 Tom Plant 4.2
6 66 2000-02-01 Helen Kerr 1.1
df2
uid rx_date last_name first_name meas_2
0 60 2004-01-14 John Smith A
1 60 2017-01-05 John Smith AB
2 60 2017-03-31 John Smith NaN
3 63 2006-07-21 James Wood A
4 64 2002-04-18 Bill Jackson B
5 65 1998-10-08 Tom Plant AA
6 65 2005-12-01 Tom Plant B
7 66 2013-12-14 Helen Kerr C
Basically I am trying to merge records for the same person from two separate sources, where the link between records for unique individuals is the 'uid', and the link between rows (where it exists) for each individual is a fuzzy relationship between 'tx_date' and 'rx_date' that can (usually) be accommodated by a specific time delta. There won't always be an exact or fuzzy match between dates, data could be missing from any column except 'uid', and each dataframe will contain a different but intersecting subset of 'uid's.
I need to be able to concatenate rows where the 'uid' columns match, and where the absolute time delta between 'tx_date' and 'rx_date' is within a given range (e.g. max delta of 14 days). Where the time delta is outside that range, or one of either 'tx_date' or 'rx_date' is missing, or where the 'uid' exists in only one of the dataframes, I still need to retain the data in that row. The end result should be something like:
uid tx_date rx_date first_name last_name meas_1 meas_2
0 60 2004-01-11 2004-01-14 John Smith 1.3 A
1 60 2016-12-24 2017-01-05 John Smith 2.4 AB
2 60 NaT 2017-03-31 John Smith NaN NaN
3 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood NaN NaN
6 64 2002-04-18 NaT Bill Jackson NaN B
7 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
8 65 NaT 2005-12-01 Tom Plant NaN B
9 66 2000-02-01 NaT Helen Kerr 1.1 NaN
10 66 NaT 2013-12-14 Helen Kerr NaN C
Seems like pandas.merge_asof should be useful here, but I've not been able to get it to do quite what I need.
Trying merge_asof on two of the real dataframes I have gave an error ValueError: left keys must be sorted
As per this question the problem there was actually due to there being NaT values in the 'date' column for some rows. I dropped the rows with NaT values, and sorted the 'date' columns in each dataframe, but the result still isn't quite what I need.
The code below shows the steps taken.
import pandas as pd

df1['date'] = pd.to_datetime(df1['tx_date'])
df1 = df1.dropna(subset=['date']).sort_values('date')

df2['date'] = pd.to_datetime(df2['rx_date'])
df2 = df2.dropna(subset=['date']).sort_values('date')

df_merged = pd.merge_asof(df1, df2, on='date', by='uid', tolerance=pd.Timedelta('14 days')).sort_values('uid')
Result:
uid tx_date rx_date last_name_x first_name_x meas_1 meas_2
3 60 2004-01-11 2004-01-14 John Smith 1.3 A
6 60 2016-12-24 2017-01-05 John Smith 2.4 AB
0 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood 2.9 NaN
1 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
2 66 2000-02-01 NaT Helen Kerr 1.1 NaN
It looks like a left join rather than a full outer join, so any row in df2 without a match on 'uid' and 'date' in df1 is lost (and it's not really clear from this simplified example, but I also need to add back the rows where the date was NaT).
Is there some way to achieve a lossless merge, either by somehow doing an outer join with merge_asof, or using some other approach?
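One rough workaround, sketched here rather than a true outer merge_asof: run the asof merge as above, then append the df2 rows that never found a partner and re-sort. This assumes the df_merged and df2 from the code above; rows dropped earlier because of NaT dates would still have to be added back separately.

matched = df_merged.loc[df_merged['rx_date'].notna(), ['uid', 'rx_date']]
matched_keys = set(zip(matched['uid'], matched['rx_date']))
# keep only the df2 rows whose (uid, rx_date) pair never appeared in the merge result
leftover = df2[[(u, d) not in matched_keys for u, d in zip(df2['uid'], df2['rx_date'])]]
lossless = pd.concat([df_merged, leftover], ignore_index=True, sort=False).sort_values('uid')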

How to find the row with specified value in DataFrame

As I am a newbie to deeper DataFrame operations, I would like to ask how to find, e.g., the lowest campaign ID per customerid in this kind of DataFrame. As I have learned, iterating over a DataFrame should be avoided.
orderid customerid campaignid orderdate city state zipcode paymenttype totalprice numorderlines numunits
0 1002854 45978 2141 2009-10-13 NEWTON MA 02459 VI 190.00 3 3
1 1002855 125381 2173 2009-10-13 NEW ROCHELLE NY 10804 VI 10.00 1 1
2 1002856 103122 2141 2011-06-02 MIAMI FL 33137 AE 35.22 2 2
3 1002857 130980 2173 2009-10-14 E RUTHERFORD NJ 07073 AE 10.00 1 1
4 1002886 48553 2141 2010-11-19 BALTIMORE MD 21218 VI 10.00 1 1
5 1002887 106150 2173 2009-10-15 ROWAYTON CT 06853 AE 10.00 1 1
6 1002888 27805 2173 2009-10-15 INDIANAPOLIS IN 46240 VI 10.00 1 1
7 1002889 24546 2173 2009-10-15 PLEASANTVILLE NY 10570 MC 10.00 1 1
8 1002890 43783 2173 2009-10-15 EAST STROUDSBURG PA 18301 DB 29.68 2 2
9 1003004 15688 2173 2009-10-15 ROUND LAKE PARK IL 60073 DB 19.68 1 1
10 1003044 130970 2141 2010-11-22 BLOOMFIELD NJ 07003 AE 10.00 1 1
11 1003045 40048 2173 2010-11-22 SPRINGFIELD IL 62704 MC 10.00 1 1
12 1003046 21927 2141 2010-11-22 WACO TX 76710 MC 17.50 1 1
13 1003075 130971 2141 2010-11-22 FAIRFIELD NJ 07004 MC 59.80 1 4
14 1003076 7117 2141 2010-11-22 BROOKLYN NY 11228 AE 22.50 1 1
Try the following
df.groupby('customerid')['campaignid'].min()
You can group unique values of customerid and subsequently find the minimum value per group for a given column using ['column_name'].min()
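If you also need the other columns of the row that holds that minimum (not just the value itself), one common pattern is idxmin, sketched here:

# index labels of the row with the lowest campaignid per customer
idx = df.groupby('customerid')['campaignid'].idxmin()
lowest_campaign_rows = df.loc[idx]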

Sub totals and grand totals in Python [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I was trying to produce subtotals and grand totals for some data, but I got stuck somewhere and couldn't get the desired output. Could you please assist with this?
data.groupby(['Column4', 'Column5'])['Column1'].count()
Current Output:
Column4 Column5
2018-05-19 Duplicate 220
Informative 3
2018-05-20 Actionable 5
Duplicate 270
Informative 859
Non-actionable 2
2018-05-21 Actionable 8
Duplicate 295
Informative 17
2018-05-22 Actionable 10
Duplicate 424
Informative 36
2018-05-23 Actionable 8
Duplicate 157
Informative 3
2018-05-24 Actionable 5
Duplicate 78
Informative 3
2018-05-25 Actionable 3
Duplicate 80
Expected Output:
Row Labels Actionable Duplicate Informative Non-actionable Grand Total
5/19/2018 219 3 222
5/20/2018 5 270 859 2 1136
5/21/2018 8 295 17 320
5/22/2018 10 424 36 470
5/23/2018 8 157 3 168
5/24/2018 5 78 3 86
5/25/2018 3 80 83
Grand Total 39 1523 921 2 2485
Here is some sample data. Could you please take a look at it together with the question above? I was getting minor discrepancies; maybe I didn't provide the right data initially. Please kindly check.
Column1 Column2 Column3 Column4 Column5 Column6
BI Account Subject1 2:12 PM 5/19/2018 Duplicate Name1
PI Account Subject2 1:58 PM 5/19/2018 Actionable Name2
AI Account Subject3 5:01 PM 5/19/2018 Non-Actionable Name3
BI Account Subject4 5:57 PM 5/19/2018 Informative Name4
PI Account Subject5 6:59 PM 5/19/2018 Duplicate Name5
AI Account Subject6 8:07 PM 5/19/2018 Actionable Name1
You can use pivot to get from your current output to your desired output and then sum to calculate the totals you want.
import pandas as pd
df = df.reset_index().pivot(index='Column4', columns='Column5', values='Column1')
# Add grand total columns, summing across all other columns
df['Grand Total'] = df.sum(axis=1)
df.columns.name = None
df.index.name = None
# Add the grand total row, summing all values in a column
df.loc['Grand Total', :] = df.sum()
df is now:
Actionable Duplicate Informative Non-actionable Grand Total
2018-05-19 NaN 220.0 3.0 NaN 223.0
2018-05-20 5.0 270.0 859.0 2.0 1136.0
2018-05-21 8.0 295.0 17.0 NaN 320.0
2018-05-22 10.0 424.0 36.0 NaN 470.0
2018-05-23 8.0 157.0 3.0 NaN 168.0
2018-05-24 5.0 78.0 3.0 NaN 86.0
2018-05-25 3.0 80.0 NaN NaN 83.0
Grand Total 39.0 1524.0 921.0 2.0 2486.0
Just use crosstab:
pd.crosstab(df['Column4'], df['Column5'], margins=True, margins_name='Grand Total')
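If you would rather build the table straight from the raw data with explicit keyword arguments, pivot_table with margins is a rough equivalent (assuming the column names from the sample data):

pd.pivot_table(df, index='Column4', columns='Column5', values='Column1',
               aggfunc='count', margins=True, margins_name='Grand Total')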
Take a look at this:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html
You need to pivot your table:
df.reset_index().pivot(index='Column4', columns='Column5', values='Column1')
