How to append pandas dataframes with different shapes - python

I am trying to append two pandas DataFrames that are of different shape. Here are sample DataFrames for reproducibility:
import pandas as pd
df1 = pd.DataFrame({'id': [1,2,3,4],
'val': ['x','y','w','z'],
'val1': ['x1','y1','w1','z1']
})
df2 = pd.DataFrame({'id': [5,6,7,8],
'val2': ['x1','y1','w1','z1'],
'val': ['t','s','v','l'],
})
I'd like to append df2 to df1. Expected behavior: non-matching columns (in this case val2) are simply dropped, the column ordering of df1 is retained, and the index is reset in the appended DataFrame.
id val val1
0 1 x x1
1 2 y y1
2 3 w w1
3 4 z z1
4 5 t NaN
5 6 s NaN
6 7 v NaN
7 8 l NaN
To clarify, I am not looking for an inner join. I need all the columns from df1 and, additionally, want to print the columns that did not intersect between the two DataFrames.

You can use pd.concat with df.reindex (edited to match your expected output):
pd.concat([df1, df2], ignore_index=True).reindex(df1.columns, axis='columns')
output:
id val val1
0 1 x x1
1 2 y y1
2 3 w w1
3 4 z z1
4 5 t NaN
5 6 s NaN
6 7 v NaN
7 8 l NaN
For the columns that didn't intersect, you can either use:
df1.columns.symmetric_difference(df2.columns).tolist()
This returns the columns that appear in only one of the two DataFrames. Output:
['val1', 'val2']
OR:
df2.columns.difference(df1.columns).tolist()
This returns the columns of df2 that do not appear in df1 at all. Output:
['val2']
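The approach above can be wrapped in a small helper. This is only a sketch: the function name append_keep_left_columns is made up for illustration, and it assumes you always want the first frame's columns, in the first frame's order.

```python
import pandas as pd


def append_keep_left_columns(left, right):
    """Stack right under left, keeping only left's columns in left's order."""
    combined = pd.concat([left, right], ignore_index=True)
    # reindex drops columns that exist only in right (here: val2)
    return combined.reindex(columns=left.columns)


df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'val': ['x', 'y', 'w', 'z'],
                    'val1': ['x1', 'y1', 'w1', 'z1']})
df2 = pd.DataFrame({'id': [5, 6, 7, 8],
                    'val2': ['x1', 'y1', 'w1', 'z1'],
                    'val': ['t', 's', 'v', 'l']})

result = append_keep_left_columns(df1, df2)
# Columns of df2 that were dropped because df1 does not have them
dropped = df2.columns.difference(df1.columns).tolist()
print(result)
print(dropped)
```

Rows that come from df2 get NaN in val1, because df2 has no such column.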

Related

replace empty strings in a dataframe with values from another dataframe with a different index

two sample dataframes with different index values, but identical column names and order:
df1 = pd.DataFrame([[1, '', 3], ['', 2, '']], columns=['A', 'B', 'C'], index=[2,4])
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]], columns=['A', 'B', 'C'], index=[7,9])
df1
   A  B  C
2  1     3
4     2
df2
   A  B  C
7     4
9  5     6
I know how to concat the two dataframes, but that gives this:
   A  B  C
2  1     3
4     2
which omits the non-matching indexes from the other df.
The result I am trying to achieve is:
A B C
0 1 4 3
1 5 2 6
I want to combine the rows with the same index values from each df so that missing values in one df are replaced by the corresponding value in the other.
Concat and Merge are not up to the job I have found.
I assume I have to have identical indexes in each df which correspond to the values I want to merge into one row. But, so far, no luck getting it to come out correctly. Any pandas transformational wisdom is appreciated.
This merge attempt did not do the trick:
df1.merge(df2, on='A', how='outer')
The solutions below were all offered before I edited the question. My fault there, I neglected to point out that my actual data has different indexes in the two dataframes.
Let us try mask
out = df1.mask(df1=='',df2)
Out[428]:
A B C
0 1 4 3
1 5 2 6
for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        if df1.iloc[i, j] == "":
            df1.iloc[i, j] = df2.iloc[i, j]
print(df1)
A B C
0 1 4 3
1 5 2 6
Since the indexes of your two dataframes are different, it's easier to give both the same index first.
import numpy as np
index = list(range(len(df1)))
df1.index = index
df2.index = index
ddf = df1.replace('', np.nan).fillna(df2)
If df1 and df2 have different numbers of rows, this still works.
df1 = pd.DataFrame([[1, '', 3], ['', 2, ''],[7,8,9],[10,11,12]], columns=['A', 'B', 'C'],index=[7,8,9,10])
index1 = [i for i in range(len(df1))]
index2 = [i for i in range(len(df2))]
df1.index = index1
df2.index = index2
df1.replace('',np.nan).fillna(df2)
You can get
Out[17]:
A B C
0 1.0 5 3.0
1 4 2.0 6
2 7.0 8.0 9.0
3 10.0 11.0 12.0
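Putting the pieces together, here is a minimal sketch that leaves the original frames untouched. It assumes the rows should pair up by position (first row with first row, and so on), and the values of df2 are taken from the printed df2 in the question rather than from its constructor line.

```python
import pandas as pd

df1 = pd.DataFrame([[1, '', 3], ['', 2, '']],
                   columns=['A', 'B', 'C'], index=[2, 4])
# Values assumed from the printed df2 in the question
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]],
                   columns=['A', 'B', 'C'], index=[7, 9])

# Discard both indexes so that rows line up by position
left = df1.reset_index(drop=True)
right = df2.reset_index(drop=True)

# Wherever left holds an empty string, take the value from right
filled = left.mask(left == '', right)
print(filled)
```

Because reset_index returns a new DataFrame, df1 and df2 keep their original indexes.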

Sum up multiple columns into one column [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
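You can just call sum with axis=1 to sum across each row. Older pandas versions silently skip non-numeric columns; from pandas 2.0 on, pass numeric_only=True to get the same behavior:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1, numeric_only=True)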
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1, numeric_only=True)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of column names you want to add up.
df['total']=df.loc[:,list_name].sum(axis=1)
The ':' selects every row. If you want the sum for certain rows only, replace ':' with a row selection.
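As a rough illustration of the loc approach, the sketch below sums a list of columns for all rows and then for a subset of rows. The contents of list_name and the row condition df['a'] > 1 are assumptions chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

list_name = ['a', 'b', 'd']  # assumed list of columns to add up

# ':' selects every row
df['total'] = df.loc[:, list_name].sum(axis=1)

# A boolean mask in place of ':' restricts the sum to matching rows
subset_total = df.loc[df['a'] > 1, list_name].sum(axis=1)

print(df)
print(subset_total)
```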
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
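One way to combine a range with specific positions is NumPy's np.r_, which concatenates slices and integers into a single index array. This is a sketch using the same sample frame; np.r_ is not mentioned in the posts above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

# np.r_ expands the slice 0:2 and appends 3, giving array([0, 1, 3])
positions = np.r_[0:2, 3]
df['i'] = df.iloc[:, positions].sum(axis=1)
print(df)
```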
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
The following syntax helped me when the columns were in sequence:
awards_frame.values[:,1:4].sum(axis =1)
You can use the function aggregate or agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
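Note that eval returns a new DataFrame by default rather than modifying df, so the result has to be assigned back. A minimal sketch on the sample frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})

# eval returns a copy with the new column, so keep the result
df = df.eval('e = a + b + d')
print(df)
```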

Combine Pandas Data Frame if Values Match in a Column

I have 2 data frames that I want to merge/combine based on a condition.
Let's say these are two dfs.
df1
name tpye option store
a 2 8 0
b 4 9 8
c 3 6 2
g 3 2 7
k 1 6 2
m 3 6 5
df2
name red green yellow
a r g y
b r g y
m r g y
What I am trying to do: if a df2['name'] value exists in df1['name'], add the red, green and yellow columns to final_df.
So the final_df would look like
name tpye option store red green yellow
a 2 8 0 r g y
b 4 9 8 r g y
c 3 6 2
g 3 2 7
k 1 6 2
m 3 6 5 r g y
Try this. It works because pandas can assign efficiently via index, especially when the index is unique within each dataframe.
df1 = df1.set_index('name')
df2 = df2.set_index('name')
df1[['red', 'green', 'yellow']] = df2[['red', 'green', 'yellow']]
Alternatively, pd.merge will work, as @PaulH mentioned:
df1.merge(df2, how='left', on='name')
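For reference, here is a sketch of the left merge on the sample data from the question. The frames are rebuilt by hand from the printed tables, and replacing NaN with an empty string is an assumption made to match the blank cells in the desired output.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'g', 'k', 'm'],
                    'tpye': [2, 4, 3, 3, 1, 3],
                    'option': [8, 9, 6, 2, 6, 6],
                    'store': [0, 8, 2, 7, 2, 5]})
df2 = pd.DataFrame({'name': ['a', 'b', 'm'],
                    'red': ['r', 'r', 'r'],
                    'green': ['g', 'g', 'g'],
                    'yellow': ['y', 'y', 'y']})

# A left merge keeps every row of df1; names missing from df2 get NaN
final_df = df1.merge(df2, how='left', on='name')

# Optional: show blanks instead of NaN for the unmatched rows
colour_cols = ['red', 'green', 'yellow']
final_df[colour_cols] = final_df[colour_cols].fillna('')
print(final_df)
```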
You can use the pandas join function. The first dataframe should be the one whose rows you want to keep in full. For example:
import pandas as pd
d1 = pd.DataFrame({'col1': [1, 2 , 4], 'col2': [3, 4 , 5]})
d2 = pd.DataFrame({'col1': [1, 10], 'col3': [3, 4]})
joined = d1.set_index('col1').join(d2.set_index('col1'))
Which gives exactly what you want:
>>> joined
col2 col3
col1
1 3 3.0
2 4 NaN
4 5 NaN

Adding a new column to a pandas dataframe with different number of rows

I'm not sure if pandas is made to do this, but I'd like to add a new column to my dataframe with more rows than the existing columns.
Minimal example:
import pandas as pd
df = pd.DataFrame()
df ['a'] = [0,1]
df ['b'] = [0,1,2]
Could someone please explain if this is possible? I'm using a dataframe to store long lists of data and they all have different lengths that I don't necessarily know at the start.
Absolutely possible. Use pd.concat
Demonstration
df1 = pd.DataFrame([[1, 2, 3]])
df2 = pd.DataFrame([[4, 5, 6, 7, 8, 9]])
pd.concat([df1, df2])
df1 looks like
0 1 2
0 1 2 3
df2 looks like
0 1 2 3 4 5
0 4 5 6 7 8 9
The pd.concat result looks like
0 1 2 3 4 5
0 1 2 3 NaN NaN NaN
0 4 5 6 7.0 8.0 9.0
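The demonstration above stacks rows. For the original case of columns with different lengths, one possible sketch is to wrap each list in a pd.Series, so that pandas aligns them on the index and pads the shorter ones with NaN. The dictionary name data is made up for the example.

```python
import pandas as pd

# Lists of different lengths, keyed by the intended column name
data = {'a': [0, 1], 'b': [0, 1, 2]}

# Each Series carries its own index, so the shorter one is padded with NaN
df = pd.DataFrame({name: pd.Series(values) for name, values in data.items()})
print(df)
```

Column a becomes float because NaN cannot be stored in a plain integer column.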
