Combine Pandas DataFrames if Values Match in a Column - python

I have 2 data frames that I want to merge/combine based on a condition.
Let's say these are two dfs.
df1
name tpye option store
a 2 8 0
b 4 9 8
c 3 6 2
g 3 2 7
k 1 6 2
m 3 6 5
df2
name red green yellow
a r g y
b r g y
m r g y
What I am trying to do: if a df2['name'] value exists in df1['name'], add the red, green and yellow columns to final_df.
So final_df would look like
name tpye option store red green yellow
a 2 8 0 r g y
b 4 9 8 r g y
c 3 6 2
g 3 2 7
k 1 6 2
m 3 6 5 r g y

Try this. It works because pandas can assign efficiently via index, especially when the index is unique within each dataframe.
df1 = df1.set_index('name')
df2 = df2.set_index('name')
df1[['red', 'green', 'yellow']] = df2[['red', 'green', 'yellow']]
Alternatively, pd.merge will work, as @PaulH mentioned:
df1.merge(df2, how='left', on='name')
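A minimal sketch of the left merge on the question's data, assuming the two frames are built as below (the `tpye` spelling is kept from the question). Rows of df1 whose name has no match in df2 get NaN in the added columns.

```python
import pandas as pd

# Frames rebuilt from the question's tables.
df1 = pd.DataFrame({
    'name': ['a', 'b', 'c', 'g', 'k', 'm'],
    'tpye': [2, 4, 3, 3, 1, 3],
    'option': [8, 9, 6, 2, 6, 6],
    'store': [0, 8, 2, 7, 2, 5],
})
df2 = pd.DataFrame({
    'name': ['a', 'b', 'm'],
    'red': ['r', 'r', 'r'],
    'green': ['g', 'g', 'g'],
    'yellow': ['y', 'y', 'y'],
})

# how='left' keeps every row of df1; unmatched names get NaN in red/green/yellow.
final_df = df1.merge(df2, how='left', on='name')
print(final_df)
```

Unlike the set_index approach, this leaves df1 and df2 unchanged and keeps name as an ordinary column.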

You can use the pandas join function. The first (left) dataframe should be the one whose rows you want to keep in full. For example:
import pandas as pd
d1 = pd.DataFrame({'col1': [1, 2 , 4], 'col2': [3, 4 , 5]})
d2 = pd.DataFrame({'col1': [1, 10], 'col3': [3, 4]})
joined = d1.set_index('col1').join(d2.set_index('col1'))
Which gives exactly what you want:
>>> joined
      col2  col3
col1
1        3   3.0
2        4   NaN
4        5   NaN

Related

Sum up multiple columns into one columns [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
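You can just call sum with axis=1 to sum across each row. Pass numeric_only=True so that the non-numeric column 'c' is skipped (older pandas versions dropped it silently; pandas 2.0 and later raise a TypeError without it):
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1, numeric_only=True)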
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1, numeric_only=True)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up (called list_name here):
df['total'] = df.loc[:, list_name].sum(axis=1)
If you want the sum for certain rows only, replace the ':' with a row selection.
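As a rough sketch of both cases, using the question's frame: `cols_to_sum` and `first_two_rows` are names made up for this illustration. Note that label slices in .loc include the end label.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

cols_to_sum = ['a', 'b', 'd']  # hypothetical name for the list of columns

# ':' selects every row, so each row gets a total.
df['total'] = df.loc[:, cols_to_sum].sum(axis=1)

# A label slice restricts the rows; 0:1 covers labels 0 and 1 inclusive.
first_two_rows = df.loc[0:1, cols_to_sum].sum(axis=1)
print(df)
print(first_two_rows)
```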
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
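One possible way around this, offered as a sketch and not as part of the answer above: NumPy's np.r_ helper turns a mix of slices and single integers into one flat array of positions, which iloc accepts. The name `cols` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

# np.r_ concatenates the range 0:2 and the single position 3 into [0, 1, 3].
cols = np.r_[0:2, 3]

# Positions 0, 1 and 3 are columns a, b and d.
df['i'] = df.iloc[:, cols].sum(axis=1)
print(df)
```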
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    # Cast the selected columns to float, sum across each row, store the result as a new column
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
The following syntax helped me when the columns to sum are next to each other:
awards_frame.values[:,1:4].sum(axis =1)
You can use the function aggregate or agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use eval. It returns a new DataFrame unless inplace=True is passed, so assign the result back:
df = df.eval('e = a + b + d')

How to append pandas dataframes with different shapes

I am trying to append two pandas DataFrames that are of different shape. Here are sample DataFrames for reproducibility:
import pandas as pd
df1 = pd.DataFrame({'id': [1,2,3,4],
'val': ['x','y','w','z'],
'val1': ['x1','y1','w1','z1']
})
df2 = pd.DataFrame({'id': [5,6,7,8],
'val2': ['x1','y1','w1','z1'],
'val': ['t','s','v','l'],
})
I'd like to append df2 to df1. Expected behavior: Non-matching columns, in this case val2 would just be dropped. Retain column ordering of df1 and reset index in appended DataFrame.
id val val1
0 1 x x1
1 2 y y1
2 3 w w1
3 4 z z1
4 5 t NaN
5 6 s NaN
6 7 v NaN
7 8 l NaN
To clarify, not looking for inner join. Need all the columns from df1 and additionally, also print the columns that didn't intersect between the dataframes.
You can use pd.concat with df.reindex (edited to match your expected output):
pd.concat([df1, df2], ignore_index=True).reindex(df1.columns, axis='columns')
output:
id val val1
0 1 x x1
1 2 y y1
2 3 w w1
3 4 z z1
4 5 t NaN
5 6 s NaN
6 7 v NaN
7 8 l NaN
For columns that didn't intersect, you can either use:
df1.columns.symmetric_difference(df2.columns).tolist()
To get the columns that didn't intersect fully or at all, output:
['val1', 'val2']
OR:
df2.columns.difference(df1.columns).tolist()
To get the columns that didn't intersect at all, output:
['val2']
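A variant of the same idea, sketched here as an assumption and not taken from the answer: drop df2's non-matching columns before concatenating, so no reindex is needed afterwards. The names `shared`, `result` and `dropped` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'val': ['x', 'y', 'w', 'z'],
                    'val1': ['x1', 'y1', 'w1', 'z1']})
df2 = pd.DataFrame({'id': [5, 6, 7, 8],
                    'val2': ['x1', 'y1', 'w1', 'z1'],
                    'val': ['t', 's', 'v', 'l']})

# Columns of df2 that also exist in df1, in df2's order.
shared = [c for c in df2.columns if c in df1.columns]

# Only the shared columns of df2 are appended; val1 is NaN for the new rows.
result = pd.concat([df1, df2[shared]], ignore_index=True)

# Columns of df2 that were left out.
dropped = df2.columns.difference(df1.columns).tolist()

print(result)
print(dropped)
```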

pandas most efficient way to execute arithmetic operations on multiple dataframe columns

my first post!
I'm running python 3.8.5 & pandas 1.1.0 on jupyter notebooks.
I want to divide several columns by the corresponding elements in another column of the same dataframe.
For example:
import pandas as pd
df = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 6, 8], 'c':[6, 9, 12]})
df
a b c
0 2 4 6
1 3 6 9
2 4 8 12
I'd like to divide columns 'b' & 'c' by the corresponding values in 'a' and substitute the values in 'b' and 'c' with the result of this division. So the above dataframe becomes:
a b c
0 2 2 3
1 3 2 3
2 4 2 3
I tried
df.iloc[: , 1:] = df.iloc[: , 1:] / df['a']
but this gives:
a b c
0 2 NaN NaN
1 3 NaN NaN
2 4 NaN NaN
I got it working by doing:
for colname in df.columns[1:]:
    df[colname] = df[colname] / df['a']
Is there a faster way of doing the above by avoiding the for loop?
thanks,
mk
Almost there, use div with axis=0:
df.iloc[:,1:] = df.iloc[:,1:].div(df.a, axis=0)
Alternatively, divide each column explicitly:
df.b = df.b / df.a
df.c = df.c / df.a
or
df[['b','c']]=df.apply(lambda x: x[['b','c']]/x.a ,axis=1)
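The NaN result in the question comes from label alignment: with the plain / operator, pandas matches the Series index (0, 1, 2) against the frame's column labels ('b', 'c'), finds nothing in common, and fills everything with NaN. The sketch below contrasts that with div(..., axis=0), which matches on the row index instead. The names `misaligned` and `aligned` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 6, 8], 'c': [6, 9, 12]})

# Default alignment: Series index vs. column labels, no overlap, so all NaN.
misaligned = df[['b', 'c']] / df['a']

# axis=0 aligns the Series with the row index, giving row-wise division.
aligned = df[['b', 'c']].div(df['a'], axis=0)

print(misaligned)
print(aligned)
```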

sum values in different rows and columns dataframe python

My Data Frame
A B C D
2 3 4 5
1 4 5 6
5 6 7 8
How do I add values of different rows and different columns
Column A Row 2 with Column B row 1
Column A Row 3 with Column B row 2
Similarly for all rows
If you only need do this with two columns (and I understand your question well), I think you can use the shift function.
Your data frame (pandas?) is something like:
d = {'A': [2, 1, 5], 'B': [3, 4, 6], 'C': [4, 5, 7], 'D':[5, 6, 8]}
df = pd.DataFrame(data=d)
So, it's possible to create a new Series with the B column shifted down by one row:
df2 = df['B'].shift(1)
which gives:
0 NaN
1 3.0
2 4.0
Name: B, dtype: float64
and then, merge this new data with the previous df and, for example, sum the values:
df = df.join(df2, rsuffix='shift')
df['out'] = df['A'] + df['Bshift']
The final output is in out column:
A B C D Bshift out
0 2 3 4 5 NaN NaN
1 1 4 5 6 3.0 4.0
2 5 6 7 8 4.0 9.0
But it's only an intuition, I'm not sure about your question!
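If the intermediate Bshift column is not needed, the same result can be sketched in one step, under the same reading of the question (each value of A added to the value of B from the previous row):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 5], 'B': [3, 4, 6], 'C': [4, 5, 7], 'D': [5, 6, 8]})

# shift(1) moves B down one row, so row i of A is paired with row i-1 of B.
# The first row has no previous B value and therefore becomes NaN.
df['out'] = df['A'] + df['B'].shift(1)
print(df)
```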
