Merge dataframes of different sizes and simultaneously overwrite NaN values - python

I would like to combine two dataframes in Python of different sizes. These dataframes are loaded from Excel files. The first dataframe has many empty values containing NaN, and the second dataframe has the data to replace the NaN values in the first dataframe. The two dataframes are linked by the data in the first column, but are not in the same order.
I can successfully merge and organize the dataframes using merge(), but the resulting dataframe has extra columns because the NaN values were not overwritten. I can overwrite the NaN values with fillna(), but the resulting dataframe is out of order. Is there any way to perform this kind of merge that replaces NaN without separate operations that delete and reorder columns?
import pandas as pd
import numpy as np
df1=pd.DataFrame({'A':[1,2,3],'B':[np.nan,np.nan,np.nan],'C':['X','Y','Z']})
df1
A B C
0 1 NaN X
1 2 NaN Y
2 3 NaN Z
df2=pd.DataFrame({'A':[3,1,2],'B':['U','V','W'],'D':[7,8,9]})
df2
A B D
0 3 U 7
1 1 V 8
2 2 W 9
If I do:
df1.merge(df2,how='left',on='A',sort=True)
A B_x C B_y D
0 1 NaN X V 8
1 2 NaN Y W 9
2 3 NaN Z U 7
The data is in order but B has multiple instances.
If I do:
df1.fillna(df2)
A B C
0 1 U X
1 2 V Y
2 3 W Z
The data is out of order, but the NaN are replaced.
I want the output to be a dataframe which looks like this:
df3
A B C D
0 1 V X 8
1 2 W Y 9
2 3 U Z 7

You can use:
df3=pd.concat([df1['C'],df2[['A','B','D']].sort_values('A').reset_index(drop=True)],axis=1).reindex(columns=['A','B','C','D'])
Output:
df3
A B C D
0 1 V X 8
1 2 W Y 9
2 3 U Z 7
Explanation:
sort_values orders df2 by column A.
reset_index(drop=True) is needed so that the sorted rows of df2 line up with df1 by position when concatenating.
concat joins column 'C' of df1 with the sorted df2, and reindex then puts the columns of df3 in the desired order.
df2 itself is left unchanged, since inplace=True was not used.
Note that this relies on df1 already being sorted by A and on both dataframes containing the same values of A, because concat with axis=1 aligns rows on the index rather than on A.

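# Build a lookup from A to B using df2, then fill df1's B column from it.
d = dict(zip(df2.A,df2.B))
df1["B"] = df1["A"].map(d)
# Drop B from df2 so the merge does not produce B_x/B_y (this modifies df2 in place).
del df2["B"]
df1.merge(df2,how='left',on='A',sort=True)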

Related

Concat two data frame row wise which have different column names and different values in row

I have two dataframes that share a few columns and differ in others. Each dataframe has only 1 row and contains information about a different run. How can I combine them to create 1 dataframe with 2 rows?
Example:
df:
a b c
0 1 2 3
df2:
a y c
0 4 5 6
This is just an example with two dataframes, but I will be doing it for multiple dataframes with 1 row each.
IIUC, you want to combine the dataframes and keep the values of each column together in lists of some sort. For that you can do:
pd.concat([df,df2]).reset_index().groupby('index').agg(list).reset_index(drop=True)
a b c y
0 [1, 4] [2.0, nan] [3, 6] [nan, 5.0]
Or, if you just want to stack them as rows, pd.concat does that directly:
pd.concat([df,df2]).reset_index(drop=True)
a b c y
0 1 2.0 3 NaN
1 4 NaN 6 5.0
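For the many-dataframes case mentioned in the question, the same concat call takes a list of any length. A minimal sketch, where the list name runs is hypothetical and the two frames are the ones from the example:

```python
import pandas as pd

# Hypothetical list of single-row frames, one per run.
runs = [
    pd.DataFrame({'a': [1], 'b': [2], 'c': [3]}),
    pd.DataFrame({'a': [4], 'y': [5], 'c': [6]}),
]

# Columns missing from a given frame are filled with NaN;
# ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat(runs, ignore_index=True)
print(combined)
```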

Join all columns from multiple pandas dataframes into one dataframe with data and column names

I have N dataframes with different numbers of columns. I want to get one dataframe with 2 columns, x and y, where x is the data from the columns of the input dataframes and y is the column name itself. I have many such dataframes to concatenate (N is on the order of 10^2), so efficiency is a priority. A NumPy approach rather than a pandas one is also welcome.
For example,
df1:
one two
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
df2:
three four
0 NaN
1 None f
2 g
3 6 7
Final Output Dataframe:
x y
0 1 one
1 2 one
2 3 one
3 4 one
4 5 one
5 a two
6 b two
7 c two
8 d two
9 e two
10 6 three
11 f four
12 g four
13 7 four
Note: I'm ignoring empty strings, NaNs and Nones in the final dataframe.
IIUC, you can use melt() before concatenating:
final = (pd.concat([df1.melt(), df2.dropna().melt()])
         .rename(columns={'variable': 'y', 'value': 'x'})
         .reindex(['x', 'y'], axis=1))
print(final)
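One caveat with calling dropna() before melt() is that it removes whole rows, so a valid value sharing a row with a NaN is lost, and empty strings are not removed at all. A minimal sketch of filtering after melting instead is shown below. The contents of df2 are an assumed reconstruction of the example, since the table in the question does not make clear which cells are empty.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'one': [1, 2, 3, 4, 5], 'two': list('abcde')})
# Assumed contents: empty strings, None and NaN mixed with real values.
df2 = pd.DataFrame({'three': ['', None, '', 6], 'four': [np.nan, 'f', 'g', 7]})

frames = [df1, df2]

# Melt each frame to long form (y = column name, x = value), then stack them.
long_df = pd.concat(
    [f.melt(var_name='y', value_name='x') for f in frames],
    ignore_index=True,
)

# Drop NaN/None and empty strings cell by cell rather than row by row.
keep = long_df['x'].notna() & (long_df['x'] != '')
final = long_df.loc[keep, ['x', 'y']].reset_index(drop=True)
print(final)
```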

Append only matching columns to dataframe

I have a sort of 'master' dataframe that I'd like to append only matching columns from another dataframe to
df:
A B C
1 2 3
df_to_append:
A B C D E
6 7 8 9 0
The problem is that when I use df.append(), it also appends the unmatched columns to df.
df = df.append(df_to_append, ignore_index=True)
Out:
A B C D E
1 2 3 NaN NaN
6 7 8 9 0
But my desired output is to drop columns D and E since they are not a part of the original dataframe? Perhaps I need to use pd.concat? I don't think I can use pd.merge since I don't have anything unique to merge on.
Use concat with join='inner':
pd.concat([df,df_to_append],join='inner')
Out[162]:
A B C
0 1 2 3
0 6 7 8
Just select the columns common to both dfs:
df.append(df_to_append[df.columns], ignore_index=True)
The simplest way would be to get the list of columns common to both dataframes using df.columns, but if you don't know that all of the original columns are included in df_to_append, then you need to find the intersection of the two sets:
cols = list(set(df.columns) & set(df_to_append.columns))
df.append(df_to_append[cols], ignore_index=True)
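DataFrame.append was removed in pandas 2.0, so on current versions the same idea has to be written with pd.concat. A minimal sketch, using Index.intersection instead of a Python set so that the column order of df is kept (a set does not guarantee any order):

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
df_to_append = pd.DataFrame({'A': [6], 'B': [7], 'C': [8], 'D': [9], 'E': [0]})

# Columns present in both frames, in the order they appear in df.
common = df.columns.intersection(df_to_append.columns)

# Stack only the shared columns and renumber the rows.
result = pd.concat([df[common], df_to_append[common]], ignore_index=True)
print(result)
```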

Add pandas Series as new columns to a specific Dataframe row

Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
A B
0 0 2
1 1 3
Then I have a Series generated by some other function using inputs from the first row of the df but which has no overlap with the existing df
s = pd.Series ({'C':4,'D':6})
C 4
D 6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
A B C D
0 0 2 4 6
1 1 3 NaN NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column or just adding the Series as a new row at the end of the DataFrame but not updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0,list(['C','D'])] = [4,6] which was suggested in another answer but that only works if ['C','D'] are already existing columns in the Dataframe. df.assign(**s) works but then assigns the Series values to all rows.
Use join with the transposed Series:
df.join(pd.DataFrame(s).T)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
Or use concat:
pd.concat([df, pd.DataFrame(s).T], axis=1)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
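Both versions land on row 0 only because the transposed Series gets the default index label 0. To target a different row, the index of the one-row frame can be set before joining. A minimal sketch, where target_row is a hypothetical name for the row label to update:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': [2, 3]})
s = pd.Series({'C': 4, 'D': 6})

target_row = 1  # hypothetical: label of the row that should receive the values

# Turn the Series into a one-row frame and label that row with the target index.
row = s.to_frame().T
row.index = [target_row]

# join aligns on the index, so only the target row gets values; others get NaN.
result = df.join(row)
print(result)
```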

Create a new dataframe by aggregating repeated origin and destination values by a separate count column in a pandas dataframe

I am having trouble analysing origin-destination values in a pandas dataframe which contains origin/destination columns and a count column of the frequency of these. I want to transform this into a dataframe with the count of how many are leaving and entering:
Initial:
Origin Destination Count
A B 7
A C 1
B A 1
B C 4
C A 3
C B 10
For example this simplified dataframe has 7 leaving from A to B and 1 from A to C so overall leaving place A would be 8, and entering place A would be 4 (B - A is 1, C - A is 3) etc. The new dataframe would look something like this.
Goal:
Place Entering Leaving
A 4 8
B 17 5
C 5 13
I have tried several techniques such as .groupby() but have not yet created my intended dataframe. How can I handle the repeated values in the origin/destination columns and assign them to a new dataframe with aggregated values of just the count of leaving and entering?
Thank you!
Use double groupby + concat:
a = df.groupby('Destination')['Count'].sum()
b = df.groupby('Origin')['Count'].sum()
df = pd.concat([a,b], axis=1, keys=('Entering','Leaving')).rename_axis('Place').reset_index()
print (df)
Place Entering Leaving
0 A 4 8
1 B 17 5
2 C 5 13
Use pivot_table, then sum along each axis:
df=pd.pivot_table(df,index='Origin',columns='Destination',values='Count',aggfunc='sum')
pd.concat([df.sum(0),df.sum(1)],axis=1)
Out[428]:
0 1
A 4.0 8.0
B 17.0 5.0
C 5.0 13.0
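For reference, a self-contained sketch of the double groupby approach on the data from the question. The fillna(0) step is an addition not in the answers: it covers a place that appears only as an origin or only as a destination, which would otherwise produce NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'Origin':      ['A', 'A', 'B', 'B', 'C', 'C'],
    'Destination': ['B', 'C', 'A', 'C', 'A', 'B'],
    'Count':       [7, 1, 1, 4, 3, 10],
})

# Total arriving at each place and total departing from each place.
entering = df.groupby('Destination')['Count'].sum()
leaving = df.groupby('Origin')['Count'].sum()

# Align the two totals on the place name; missing totals become 0.
summary = (
    pd.concat([entering, leaving], axis=1, keys=['Entering', 'Leaving'])
      .fillna(0)
      .astype(int)
      .rename_axis('Place')
      .reset_index()
)
print(summary)
```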
