How to merge two dataframes and sum the values of columns - python

I have two dataframes
df1
Name class value
Sri 1 5
Ram 2 8
viv 3 4
df2
Name class value
Sri 1 5
viv 4 4
My desired output is,
df,
Name class value
Sri 2 10
Ram 2 8
viv 7 8
Please help, thanks in advance!

I think you need set_index on both DataFrames, then add, and finally reset_index:
df = df1.set_index('Name').add(df2.set_index('Name'), fill_value=0).reset_index()
print (df)
Name class value
0 Ram 2.0 8.0
1 Sri 2.0 10.0
2 viv 7.0 8.0
If the values in Name are not unique, use groupby and aggregate with sum first:
df = df1.groupby('Name').sum().add(df2.groupby('Name').sum(), fill_value=0).reset_index()
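As a minimal sketch of the non-unique case, the frames below are hypothetical (Sri appears twice in df1, which is not in the question's data). Each frame is collapsed per Name before the aligned addition, and the final cast back to integers is an optional extra, since the aligned add with fill_value=0 returns floats.

```python
import pandas as pd

# Hypothetical data: 'Sri' appears twice in df1, so 'Name' is not unique.
df1 = pd.DataFrame({"Name": ["Sri", "Sri", "Ram"],
                    "class": [1, 1, 2],
                    "value": [5, 1, 8]})
df2 = pd.DataFrame({"Name": ["Sri", "viv"],
                    "class": [1, 4],
                    "value": [5, 4]})

# Collapse duplicates within each frame, leaving one row per Name.
left = df1.groupby("Name").sum()
right = df2.groupby("Name").sum()

# Add the frames aligned on Name; names missing on one side count as 0.
df = left.add(right, fill_value=0).reset_index()

# The aligned add yields floats; cast back if integers are wanted.
df = df.astype({"class": int, "value": int})
print(df)
```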

pd.concat + groupby + sum
You can concatenate your individual dataframes and then group by your key column:
df = pd.concat([df1, df2])\
.groupby('Name')[['class', 'value']]\
.sum().reset_index()
print(df)
Name class value
0 Ram 2 8
1 Sri 2 10
2 viv 7 8
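For reference, here is a self-contained sketch of the concat approach using the frames from the question. Passing sort=False is an addition that is not part of the answer above; it keeps the names in order of first appearance, which happens to match the order in the desired output.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Sri", "Ram", "viv"],
                    "class": [1, 2, 3],
                    "value": [5, 8, 4]})
df2 = pd.DataFrame({"Name": ["Sri", "viv"],
                    "class": [1, 4],
                    "value": [5, 4]})

# Stack the rows, then sum per Name.
# sort=False keeps groups in order of first appearance (Sri, Ram, viv).
df = (pd.concat([df1, df2])
        .groupby("Name", as_index=False, sort=False)[["class", "value"]]
        .sum())
print(df)
```

Because no alignment with missing rows takes place here, the integer columns stay integers.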

Related

Left join and sum results

I work with Python and I am trying to merge two tables, df_agg and df_total. I used how='left', expecting that all rows from the first table would be kept. Note that the first table contains duplicates in the join column id, while the second table does not have duplicates in id.
df_new = pd.merge(df_agg,df_total, on='id', how='left')
The merge command executes successfully, but the results are unexpected: instead of df_new['total'] having the same sum as df_agg['total'], the sum of df_new['total'] is greater.
Can anybody explain what causes this problem and suggest arguments to the function so that the sum is the same before and after merging?
It means id has duplicates in both DataFrames, so the new DataFrame has more rows than df_agg (for each duplicated id, the merge creates a 'product' of all combinations of the matching rows).
df_agg = pd.DataFrame( {"id": [1,1,2,3,3], 'a':range(5) })
df_total = pd.DataFrame( {"id": [1,1,1,3,4], 'b':range(10,15) })
df_new = pd.merge(df_agg,df_total, on='id', how='left')
print (df_new)
id a b
0 1 0 10.0
1 1 0 11.0
2 1 0 12.0
3 1 1 10.0
4 1 1 11.0
5 1 1 12.0
6 2 2 NaN
7 3 3 13.0
8 3 4 13.0
print (len(df_new), len(df_agg))
9 5
A possible solution is to remove the duplicates from the second DataFrame before merging:
df_new = pd.merge(df_agg,df_total.drop_duplicates('id'), on='id', how='left')
print (df_new)
id a b
0 1 0 10.0
1 1 1 10.0
2 2 2 NaN
3 3 3 13.0
4 3 4 13.0
print (len(df_new), len(df_agg))
5 5
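One way to catch this kind of row expansion early is sketched below, using the same example frames. The validate argument of pd.merge raises a MergeError when the right-hand keys are not unique. Summing b per id before merging is only one assumption about how duplicates should be resolved; drop_duplicates, as above, is another.

```python
import pandas as pd

df_agg = pd.DataFrame({"id": [1, 1, 2, 3, 3], "a": range(5)})
df_total = pd.DataFrame({"id": [1, 1, 1, 3, 4], "b": range(10, 15)})

# True means the right-hand key is not unique, so a left join can multiply rows.
has_dupes = df_total["id"].duplicated().any()

# validate="many_to_one" makes merge raise instead of silently expanding rows.
try:
    pd.merge(df_agg, df_total, on="id", how="left", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

# Assumption: duplicates on the right should be summed per id before the join.
right = df_total.groupby("id", as_index=False)["b"].sum()
df_new = pd.merge(df_agg, right, on="id", how="left", validate="many_to_one")

print(has_dupes, merge_ok)
print(len(df_new), len(df_agg))
```

Once the right-hand table has one row per id, the left join keeps the row count of df_agg, so sums over its columns are unchanged.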

Python merging data frames and renaming column values

In python, I have a df that looks like this
Name ID
Anna 1
Sarah 2
Max 3
And a df that looks like this
Name ID
Dan 1
Hallie 2
Cam 3
How can I merge the df’s so that the ID column looks like this
Name ID
Anna 1
Sarah 2
Max 3
Dan 4
Hallie 5
Cam 6
This is just a minimal reproducible example. My actual data set has 1000’s of values. I’m basically merging data frames and want the ID’s in numerical order (continuation of previous data frame) instead of repeating from one each time.
Use pd.concat:
out = pd.concat([df1, df2.assign(ID=df2['ID'] + df1['ID'].max())], ignore_index=True)
print(out)
# Output
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
Concatenate the two DataFrames, reset_index and use the new index to assign "ID"s
df_new = pd.concat((df1, df2)).reset_index(drop=True)
df_new['ID'] = df_new.index + 1
Output:
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
You can concat dataframes with ignore_index=True and then set ID column:
df = pd.concat([df1, df2], ignore_index=True)
df['ID'] = df.index + 1
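The approaches above give the same result for the sample data, but they can differ when the first frame's IDs have gaps. The sketch below uses a hypothetical df1 whose IDs are 1, 2, 5 (not the question's data) to show the difference: offsetting by the maximum keeps existing IDs, while renumbering from the index rewrites all of them.

```python
import pandas as pd

# Hypothetical: df1 has a gap in its IDs (no 3 or 4).
df1 = pd.DataFrame({"Name": ["Anna", "Sarah", "Max"], "ID": [1, 2, 5]})
df2 = pd.DataFrame({"Name": ["Dan", "Hallie", "Cam"], "ID": [1, 2, 3]})

# Offset approach: df1 keeps its IDs, df2 continues after df1's maximum.
offset = pd.concat(
    [df1, df2.assign(ID=df2["ID"] + df1["ID"].max())],
    ignore_index=True,
)

# Renumber approach: every row gets a fresh consecutive ID.
renumbered = pd.concat([df1, df2], ignore_index=True)
renumbered["ID"] = renumbered.index + 1

print(offset["ID"].tolist())
print(renumbered["ID"].tolist())
```

Which one fits depends on whether the original IDs are referenced elsewhere and must stay stable.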

Calculate Mean on Multiple Groups

I have a Table
Sex Value1 Value2 City
M 2 1 Berlin
W 3 5 Paris
W 1 3 Paris
M 2 5 Berlin
M 4 2 Paris
I want to calculate the average of Value1 and Value2 for different groups. In my original dataset I have 10 group variables (each with at most 5 categories, like 5 cities) that I have shortened to Sex and City (2 categories each) in this example. The result should look like this
AvgOverall AvgM AvgW AvgBerlin AvgParis
Value1 2,4 2,6 2 2 2,66
Value2 3,2 2,6 4 3 3,3
I am familiar with the group by and tried
df.groupby('City').mean()
But here we have the problem that Sex is also included in the calculation. Does anyone have an idea how to solve this? Thanks in advance!
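You can group by each of the two columns separately to get two DataFrames, then use concat to join them with the overall means of the numeric columns (numeric_only=True excludes the non-numeric columns, which pandas 2.0 and later no longer drop automatically):
df1 = df.groupby('City').mean(numeric_only=True).T
df2 = df.groupby('Sex').mean(numeric_only=True).T
df3 = pd.concat([df.mean(numeric_only=True).rename('Overall'), df2, df1], axis=1).add_prefix('Avg')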
print (df3)
AvgOverall AvgM AvgW AvgBerlin AvgParis
Value1 2.4 2.666667 2.0 2.0 2.666667
Value2 3.2 2.666667 4.0 3.0 3.333333
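Since the original dataset reportedly has 10 group variables, the same idea can be written as a loop. This is only a sketch: the names group_cols and value_cols are illustrative, and the value columns are selected explicitly so that no grouping column leaks into the means.

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["M", "W", "W", "M", "M"],
    "Value1": [2, 3, 1, 2, 4],
    "Value2": [1, 5, 3, 5, 2],
    "City": ["Berlin", "Paris", "Paris", "Berlin", "Paris"],
})

group_cols = ["Sex", "City"]       # extend with the other group variables
value_cols = ["Value1", "Value2"]  # columns to average

# Start with the overall means, then add one transposed block per group variable.
parts = [df[value_cols].mean().rename("Overall")]
for col in group_cols:
    parts.append(df.groupby(col)[value_cols].mean().T)

result = pd.concat(parts, axis=1).add_prefix("Avg")
print(result)
```

If two group variables share a category label, the resulting column names would collide, so a prefix per variable may be needed in that case.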

Want to join last row of two dataframe on condition

quantity:
a b c
3 1 nan
3 2 8
7 5 9
4 8 nan
price
34
I have two dataframes, quantity and price, and I want to join the last row of the quantity dataframe where c is not nan to price.
I wrote this query but didn't get the desired output:
price = pd.concat(price,quantity["a","b","c"].tail(1).isnotnull())
what I want is like:
price a b c
34 7 5 9
If your dfs are these:
df = pd.DataFrame([[3,1,np.nan], [3,2,8], [7,5,9], [4,8,np.nan]], columns=['a','b','c'])
df2 = pd.DataFrame([34], columns=['price'])
You can do in this way:
final_df = pd.concat([df.dropna(subset=['c']).tail(1).reset_index(drop=True), df2], axis=1)
Output:
a b c price
0 7 5 9.0 34
I believe you need to remove the missing values and then select the last row; the double [] in iloc[[-1]] returns a one-row DataFrame instead of a Series:
df=pd.concat([price.reset_index(drop=True),
quantity[["a","b","c"]].dropna(subset=['c']).iloc[[-1]].reset_index(drop=True)],
axis=1)
print (df)
price a b c
0 34 7 5 9.0
Detail:
print (quantity[["a","b","c"]].dropna().iloc[[-1]])
a b c
2 7 5 9.0
I would filter the df to the rows where c is not null, then simply add the price to it:
new_df = df[df['c'].notnull()].copy() # copy so the assignment below does not act on a slice of df
Where c is your column name.
new_df['price'] = 32 # or the price from your df
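Another way to pick the last row where c is present, sketched here as an alternative to dropna plus tail, is Series.last_valid_index, which returns the index label of the last non-missing value. The variable names last_label and last_row are illustrative.

```python
import numpy as np
import pandas as pd

quantity = pd.DataFrame(
    [[3, 1, np.nan], [3, 2, 8], [7, 5, 9], [4, 8, np.nan]],
    columns=["a", "b", "c"],
)
price = pd.DataFrame([34], columns=["price"])

# Label of the last row whose 'c' is not NaN (None if every value is missing).
last_label = quantity["c"].last_valid_index()

# Double brackets keep a one-row DataFrame; reset the index so it lines up with price.
last_row = quantity.loc[[last_label]].reset_index(drop=True)

result = pd.concat([price, last_row], axis=1)
print(result)
```

A real script would want to handle the None case before calling loc.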

Joining two dataframes in pandas using full outer join

I've two dataframes in pandas as shown below. EmpID is a primary key in both dataframes.
df_first = pd.DataFrame([[1, 'A',1000], [2, 'B',np.nan],[3,np.nan,3000],[4, 'D',8000],[5, 'E',6000]], columns=['EmpID', 'Name','Salary'])
df_second = pd.DataFrame([[1, 'A','HR','Delhi'], [8, 'B','Admin','Mumbai'],[3,'C','Finance',np.nan],[9, 'D','Ops','Banglore'],[5, 'E','Programming',np.nan],[10, 'K','Analytics','Mumbai']], columns=['EmpID', 'Name','Department','Location'])
I want to join these two dataframes with EmpID so that
Missing data in one dataframe can be filled with value from another table if exists and key matches
If there are observations with new keys then they should be appended in the resulting dataframe
I've used below code for achieving this.
merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])
But this code gives me duplicate columns which I don't want so I only used unique columns from both tables for merging.
ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
merged_df = pd.merge(df_first,df_second[ColNames],how='outer',on=['EmpID'])
Now I don't get duplicate columns but don't get value either in observations where key matches.
I'll really appreciate if someone can help me with this.
Regards,
Kailash Negi
It seems you need combine_first, with set_index on both DataFrames so that rows are matched on the index created from the EmpID column:
df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
EmpID Department Location Name Salary
0 1 HR Delhi A 1000.0
1 2 NaN NaN B NaN
2 3 Finance NaN C 3000.0
3 4 NaN NaN D 8000.0
4 5 Programming NaN E 6000.0
5 8 Admin Mumbai B NaN
6 9 Ops Banglore D NaN
7 10 Analytics Mumbai K NaN
EDIT:
For a specific order of columns, use reindex:
#concatenate all column names together and remove dupes
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')
df = (df_first.set_index('EmpID')
.combine_first(df_second.set_index('EmpID'))
.reset_index()
.reindex(columns=ColNames))
print (df)
EmpID Name Department Location Salary
0 1 A HR Delhi 1000.0
1 2 B NaN NaN NaN
2 3 C Finance NaN 3000.0
3 4 D NaN NaN 8000.0
4 5 E Programming NaN 6000.0
5 8 B Admin Mumbai NaN
6 9 D Ops Banglore NaN
7 10 K Analytics Mumbai NaN
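A self-contained sketch of the combine_first approach follows, with a few checks on the result. The variable name merged is illustrative. Note that combine_first gives priority to the calling frame: values from df_second are used only where df_first has a missing value or no row for that EmpID.

```python
import numpy as np
import pandas as pd

df_first = pd.DataFrame(
    [[1, "A", 1000], [2, "B", np.nan], [3, np.nan, 3000],
     [4, "D", 8000], [5, "E", 6000]],
    columns=["EmpID", "Name", "Salary"],
)
df_second = pd.DataFrame(
    [[1, "A", "HR", "Delhi"], [8, "B", "Admin", "Mumbai"],
     [3, "C", "Finance", np.nan], [9, "D", "Ops", "Banglore"],
     [5, "E", "Programming", np.nan], [10, "K", "Analytics", "Mumbai"]],
    columns=["EmpID", "Name", "Department", "Location"],
)

# Align both frames on EmpID; df_first wins wherever it has a non-missing value.
merged = (df_first.set_index("EmpID")
          .combine_first(df_second.set_index("EmpID"))
          .reset_index())

by_id = merged.set_index("EmpID")
print(by_id.loc[3, "Name"])    # missing in df_first, filled from df_second
print(by_id.loc[2, "Salary"])  # missing in both, stays NaN
```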
