Append only matching columns to dataframe - python

I have a sort of 'master' dataframe to which I'd like to append only the matching columns from another dataframe.
df:
A B C
1 2 3
df_to_append:
A B C D E
6 7 8 9 0
The problem is that when I use df.append(), it also appends the unmatched columns to df.
df = df.append(df_to_append, ignore_index=True)
Out:
A B C D E
1 2 3 NaN NaN
6 7 8 9 0
But my desired output is to drop columns D and E, since they are not part of the original dataframe. Perhaps I need to use pd.concat? I don't think I can use pd.merge, since I don't have anything unique to merge on.

Using concat with join='inner':
pd.concat([df, df_to_append], join='inner')
Out[162]:
A B C
0 1 2 3
0 6 7 8
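Note that the result above keeps each frame's original row labels (both rows are labeled 0); passing ignore_index=True produces a fresh index instead. A runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
df_to_append = pd.DataFrame({"A": [6], "B": [7], "C": [8], "D": [9], "E": [0]})

# join='inner' keeps only the columns shared by both frames;
# ignore_index=True rebuilds a fresh 0..n-1 index.
result = pd.concat([df, df_to_append], join="inner", ignore_index=True)
print(result)
```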

Just select the columns common to both dfs:
df.append(df_to_append[df.columns], ignore_index=True)

The simplest way is to select the common columns with df.columns, but if you aren't sure that all of the original columns appear in df_to_append, take the intersection of the two column sets (Index.intersection keeps the original column order, unlike a plain set):
cols = df.columns.intersection(df_to_append.columns)
df.append(df_to_append[cols], ignore_index=True)
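A self-contained sketch of the same idea, using Index.intersection (which keeps the original column order) and pd.concat (DataFrame.append was removed in pandas 2.0):

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
df_to_append = pd.DataFrame({"A": [6], "B": [7], "C": [8], "D": [9], "E": [0]})

# Intersection of the two column sets, preserving df's column order
cols = df.columns.intersection(df_to_append.columns)
out = pd.concat([df, df_to_append[cols]], ignore_index=True)
print(out)
```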

Related

Dropping multiple columns in a pandas dataframe between two columns based on column names

A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number, I do not know them. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, given this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
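An equivalent positional sketch using Index.get_loc, assuming the column labels are unique:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 2, 1], "C": [3, 4, 5],
                   "D": [6, 9, 8], "E": [0, 1, 4]})

# Positions of the boundary columns (the end column is inclusive, hence +1)
start = df.columns.get_loc("B")
stop = df.columns.get_loc("D") + 1
out = df.drop(columns=df.columns[start:stop])
print(out)
```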

Python how to merge two dataframes with multiple columns while preserving row order in each column?

My data is contained within two dataframes. Within each dataframe, the entries are sorted in each column. I want to now merge the two dataframes while preserving row order. For example, suppose I have this:
The first dataframe "A1" looks like this:
index a b c
0 1 4 1
3 2 7 3
5 5 8 4
6 6 10 8
...
and the second dataframe "A2" looks like this (A1 and A2 are the same size):
index a b c
1 3 1 2
2 4 2 5
4 7 3 6
7 8 5 7
...
I want to merge both of these dataframes to get the final dataframe "data":
index a b c
0 1 4 1
1 3 1 2
2 4 2 5
3 2 7 3
...
Here is what I have tried:
data = A1.merge(A2, how='outer', left_index=True, right_index=True)
But I keep getting strange results: some of the entries become NaN, and I don't know how to fix it. I'm not even sure this works when there are multiple columns whose row order must be preserved. I also tried data.join(A1, A2), but the interpreter raised an error saying it couldn't join these two dataframes.
import pandas as pd
# Create DataFrames df and df1
df = pd.DataFrame({'a': [1,2,3,4], 'b': [5,6,7,8], 'c': [9,0,11,12]}, index=[0,3,5,6])
df1 = pd.DataFrame({'a': [13,14,15,16], 'b': [17,18,19,20], 'c': [21,22,23,24]}, index=[1,2,4,7])
# Append df1 to df, then sort by index to interleave the rows
df2 = df.append(df1)
print(df2.sort_index())
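Applied to the question's own A1/A2 data, and using pd.concat since DataFrame.append was removed in pandas 2.0:

```python
import pandas as pd

A1 = pd.DataFrame({"a": [1, 2, 5, 6], "b": [4, 7, 8, 10], "c": [1, 3, 4, 8]},
                  index=[0, 3, 5, 6])
A2 = pd.DataFrame({"a": [3, 4, 7, 8], "b": [1, 2, 3, 5], "c": [2, 5, 6, 7]},
                  index=[1, 2, 4, 7])

# Stack the two frames, then restore the interleaved row order via the index
data = pd.concat([A1, A2]).sort_index()
print(data)
```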

Compare columns in Pandas between two unequal size Dataframes for condition check

I have two pandas DF. Of unequal sizes. For example :
Df1
id value
a 2
b 3
c 22
d 5
Df2
id value
c 22
a 2
Now I want to extract from DF1 the rows that have the same id as in DF2. My first approach is to run two nested for loops, with something like:
x = []
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])
This works, but with two files of 400,000 lines and 5,000 lines respectively, I need an efficient Pythonic + pandas way.
import pandas as pd
data1 = {'id': ['a','b','c','d'],
         'value': [2,3,22,5]}
data2 = {'id': ['c','a'],
         'value': [22,2]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
finaldf = pd.concat([df1, df2], ignore_index=True)
Output after concat
id value
0 a 2
1 b 3
2 c 22
3 d 5
4 c 22
5 a 2
Final output after drop_duplicates() (note that this keeps all of df1's unique rows, not only those also present in df2):
finaldf.drop_duplicates()
id value
0 a 2
1 b 3
2 c 22
3 d 5
You can concat the dataframes, check which rows are duplicated, then drop_duplicates to keep just the first occurrence:
m = pd.concat((df1,df2))
m[m.duplicated('id',keep=False)].drop_duplicates()
id value
0 a 2
2 c 22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
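Since only the id column drives the match, a simpler sketch uses Series.isin directly:

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["a", "b", "c", "d"], "value": [2, 3, 22, 5]})
df2 = pd.DataFrame({"id": ["c", "a"], "value": [22, 2]})

# Boolean mask: True where df1's id also appears in df2
out = df1[df1["id"].isin(df2["id"])]
print(out)
```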

Merge dataframes of different sizes and simultaneously overwrite NaN values

I would like to combine two dataframes of different sizes in Python. These dataframes are loaded from Excel files. The first dataframe has many empty cells containing NaN, and the second has the data to replace those NaN values. The two dataframes are linked by the data in the first column, but are not in the same order.
I can successfully merge and organize the dataframes using merge(), but the resulting dataframe has extra columns because the NaN values were not overwritten. I can overwrite the NaN values with fillna(), but the resulting dataframe is out of order. Is there any way to perform this kind of merge that replaces NaN without separate operations that delete and reorder columns?
import pandas as pd
import numpy as np
df1=pd.DataFrame({'A':[1,2,3],'B':[np.nan,np.nan,np.nan],'C':['X','Y','Z']})
df1
A B C
0 1 NaN X
1 2 NaN Y
2 3 NaN Z
df2=pd.DataFrame({'A':[3,1,2],'B':['U','V','W'],'D':[7,8,9]})
df2
A B D
0 3 U 7
1 1 V 8
2 2 W 9
If I do:
df1.merge(df2,how='left',on='A',sort=True)
A B_x C B_y D
0 1 NaN X V 8
1 2 NaN Y W 9
2 3 NaN Z U 7
The data is in order but B has multiple instances.
If I do:
df1.fillna(df2)
A B C
0 1 U X
1 2 V Y
2 3 W Z
The data is out of order, but the NaN are replaced.
I want the output to be a dataframe which looks like this:
df3
A B C D
0 1 V X 8
1 2 W Y 9
2 3 U Z 7
You can use:
df3=pd.concat([df1['C'],df2[['A','B','D']].sort_values('A').reset_index(drop=True)],axis=1).reindex(columns=['A','B','C','D'])
Output:
df3
A B C D
0 1 V X 8
1 2 W Y 9
2 3 U Z 7
Explanation:
sort_values orders df2 according to column A.
reset_index(drop=True) is necessary so that the rows concatenate in the correct order.
concat joins column 'C' of df1 with df2, whose rows are now in the correct order. Finally, reindex repositions the columns of df3.
The original df2 is left unchanged, since sort_values was not called with inplace=True.
# Alternative: build an A -> B lookup from df2, map it onto df1,
# drop df2's B so the merge doesn't duplicate the column, then merge on A.
d = dict(zip(df2.A, df2.B))
df1["B"] = df1["A"].map(d)
del df2["B"]
df1.merge(df2, how='left', on='A', sort=True)
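Another sketch that sidesteps the duplicate-column problem entirely: drop df1's all-NaN B before merging, so B arrives only from df2:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [np.nan] * 3, "C": ["X", "Y", "Z"]})
df2 = pd.DataFrame({"A": [3, 1, 2], "B": ["U", "V", "W"], "D": [7, 8, 9]})

# Merge without df1's empty B column, then restore the column order
df3 = df1.drop(columns="B").merge(df2, on="A", sort=True)[["A", "B", "C", "D"]]
print(df3)
```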

Filling a column based on comparison of 2 other columns (pandas)

I am trying to do the following in pandas:
I have 2 DataFrames, both of which have a number of columns.
DataFrame 1 has a column A, that is of interest for my task;
DataFrame 2 has columns B and C, that are of interest.
What needs to be done: to go through the values in column A and see if the same values exists somewhere in column B. If it does, create a column D in Dataframe 1 and fill its respective cell with the value from C which is on the same row as the found value from B.
If the value from A does not exist in B, then fill the cell in D with a zero.
for i in range(len(df1)):
    if df1['A'].iloc[i] in df2.B.values:
        df1['D'].iloc[i] = df2['C'].iloc[i]
    else:
        df1['D'].iloc[i] = 0
This gives me an error: KeyError: 'D'. If I create column D in advance and fill it with, for example, 0's, then I get the warning: "A value is trying to be set on a copy of a slice from a DataFrame". How can I solve this? Or is there a better way to accomplish what I'm trying to do?
Thank you so much for your help!
If I understand correctly:
Given these 2 dataframes:
import pandas as pd
import numpy as np
np.random.seed(42)
df1=pd.DataFrame({'A':np.random.choice(list('abce'), 10)})
df2=pd.DataFrame({'B':list('abcd'), 'C':np.random.randn(4)})
>>> df1
A
0 c
1 e
2 a
3 c
4 c
5 e
6 a
7 a
8 c
9 b
>>> df2
B C
0 a 0.279041
1 b 1.010515
2 c -0.580878
3 d -0.525170
You can achieve what you want using a merge:
new_df = df1.merge(df2, left_on='A', right_on='B', how='left').fillna(0)[['A','C']]
And then just rename the columns:
new_df.columns=['A', 'D']
>>> new_df
A D
0 c -0.580878
1 e 0.000000
2 a 0.279041
3 c -0.580878
4 c -0.580878
5 e 0.000000
6 a 0.279041
7 a 0.279041
8 c -0.580878
9 b 1.010515
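The same lookup can be written without merge, using Series.map with a B -> C lookup built from df2 (a sketch on the same seeded data):

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df1 = pd.DataFrame({"A": np.random.choice(list("abce"), 10)})
df2 = pd.DataFrame({"B": list("abcd"), "C": np.random.randn(4)})

# Map each value in A to its matching C (via a B -> C lookup),
# filling 0 where A has no counterpart in B.
df1["D"] = df1["A"].map(df2.set_index("B")["C"]).fillna(0)
print(df1)
```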
