Merge Multiple Dataframes Pandas and drop duplicate row - python

Looking for a way to merge a list of Pandas dataframes for the following effect.
Multiple dataframes as shown below, df1 - df3, combined into new_df.
df1:

          0   1   2
Category  A   B   C
Lunch     17  11  6

df2:

          0   1   2
Category  A   B   C
Dinner    1   3   5

df3:

          0   1   2
Category  A   B   C
Snacks    11  1   6

new_df:

          0   1   2
Category  A   B   C
Lunch     17  11  6
Dinner    1   3   5
Snacks    11  1   6
I have tried:

pdList = [df1, df2, df3]
new_df = pd.concat(pdList)

But it doesn't merge the Category A B C entries, since this is plain concatenation: the Category A B C row is kept from every dataframe. It is required only once, as the figures relate to the category letter above them.
Is there a way to use merge on multiple dfs to get this effect?

So your problem here, my dude, is that you are concatenating the Category row from every dataframe, which is totally unnecessary since you have it in all of your dfs.
Just drop the Category row from the dfs you don't need it in, and keep it in at least one (note that drop(..., inplace=True) returns None, so don't append its result):

list_df = [df2, df3]  # dfs you want to drop the Category row from
pdList = []
for i in list_df:
    pdList.append(i.drop('Category', axis=0))
pdList.insert(0, df1)  # df you want to keep the Category row in

After this, a simple pd.concat(pdList) gives the desired new_df.
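Putting it together, here is a minimal end-to-end sketch, assuming each dataframe stores the Category header as an ordinary row labelled 'Category' in the index (as in the tables above):

import pandas as pd

# Hypothetical frames matching the question's layout: the 'Category'
# header row is the first row of each frame.
df1 = pd.DataFrame([['A', 'B', 'C'], [17, 11, 6]], index=['Category', 'Lunch'])
df2 = pd.DataFrame([['A', 'B', 'C'], [1, 3, 5]], index=['Category', 'Dinner'])
df3 = pd.DataFrame([['A', 'B', 'C'], [11, 1, 6]], index=['Category', 'Snacks'])

# Keep the Category row from df1 only; drop it from the others first.
new_df = pd.concat([df1] + [df.drop('Category', axis=0) for df in (df2, df3)])
print(new_df)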

Related

Merging many to many Dask

Say I have the following dataframes (suppose they are Dask dataframes), each a single column of sample ids:

df A = 1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5

df B = 1, 2, 2, 3, 3, 3, 4, 5, 5, 5

and I would like to merge the two so that the resulting dataframe has the most information among the two (so for instance in the case of observation 1 I would like to preserve the rows of df A, in the case of observation number 3 I would like to preserve the rows of df B, and so on).
In other words the resulting dataframe should be like this:

df C = 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 5, 5
Is there a way to do that in Dask?
Thank you
Notes:
There are various ways to merge Dask dataframes. Dask provides various built-in functions, such as: dask.dataframe.DataFrame.join, dask.dataframe.multi.concat, dask.dataframe.DataFrame.merge, dask.dataframe.multi.merge, and dask.dataframe.multi.merge_asof. Depending on one's requirements one might want to use a specific one.
This thread has really valuable information on merges. Even though its focus is on Pandas, it will allow one to understand left, right, outer, and inner merges.
If one wants to do it with Pandas dataframes, there are various ways to achieve that.
One approach would be creating a dataframe that records which source dataframe has the highest number of rows per sample_id, and then applying a custom-made function. Let's invest a bit more time in that approach.
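For the walkthrough below, assume df_a and df_b are single-column pandas dataframes reconstructed from the question's data (an assumption; the question does not show the construction code):

import pandas as pd

# The question's df A and df B as single-column frames.
df_a = pd.DataFrame({'sample_id': [1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5]})
df_b = pd.DataFrame({'sample_id': [1, 2, 2, 3, 3, 3, 4, 5, 5, 5]})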
We will first create a dataframe to store the number of rows that each dataframe has per sample_id as follows
df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()})
df_count['df_a'] = df_count['sample_id'].map(df_a.groupby('sample_id').size())
df_count['df_b'] = df_count['sample_id'].map(df_b.groupby('sample_id').size())
As it will be helpful, let us create a column df_max that will store the dataframe that has more rows per sample_id
df_count['df_max'] = df_count[['df_a', 'df_b']].idxmax(axis=1)
[Out]:

   sample_id  df_a  df_b df_max
0          1     2     1   df_a
1          2     4     2   df_a
2          3     1     3   df_b
3          4     1     1   df_a
4          5     6     3   df_a
A one-liner to create the desired df_count would look like the following
df_count = (
    pd.DataFrame({'sample_id': df_a['sample_id'].unique()})
    .assign(df_a=lambda x: x['sample_id'].map(df_a.groupby('sample_id').size()),
            df_b=lambda x: x['sample_id'].map(df_b.groupby('sample_id').size()),
            df_max=lambda x: x[['df_a', 'df_b']].idxmax(axis=1))
)
Now, given df_a, df_b, and df_count, one will want a function to merge the dataframes based on a specific condition:
If df_max is df_a, then take the rows from df_a.
If df_max is df_b, then take the rows from df_b.
One can create a function merge_df that takes df_a, df_b, and df_count and returns the merged dataframe
def merge_df(df_a, df_b, df_count):
    # Create a list to store the dataframes
    df_list = []
    # Iterate over the rows in df_count
    for index, row in df_count.iterrows():
        # If df_max is df_a, then take the rows from df_a
        if row['df_max'] == 'df_a':
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
        # If df_max is df_b, then take the rows from df_b
        elif row['df_max'] == 'df_b':
            df_list.append(df_b[df_b['sample_id'] == row['sample_id']])
        # If df_max is neither df_a nor df_b, then use the first dataframe
        else:
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
    # Concatenate the dataframes in df_list and return the result. Also, reset the index.
    return pd.concat(df_list).reset_index(drop=True)
Then one can apply the function
df_merged = merge_df(df_a, df_b, df_count)
[Out]:

    sample_id
0           1
1           1
2           2
3           2
4           2
5           2
6           3
7           3
8           3
9           4
10          5
11          5
12          5
13          5
14          5
15          5
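For larger frames, a loop-free variant of the same idea is also possible (a sketch, reusing the df_count built above; the winners name is my own):

# Map each sample_id to the frame that has more rows for it.
winners = df_count.set_index('sample_id')['df_max']

# Take rows from each frame only where it is the winner, then combine.
df_merged = pd.concat([
    df_a[df_a['sample_id'].map(winners).eq('df_a')],
    df_b[df_b['sample_id'].map(winners).eq('df_b')],
]).sort_values('sample_id').reset_index(drop=True)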

Combine (merge/join/concat) two dataframes by mask (leave only first matches) in pandas [python]

I have df1:
match
0 a
1 a
2 b
And I have df2:
match number
0 a 1
1 b 2
2 a 3
3 a 4
I want to combine these two dataframes so that only first matches remained, like this:
match_df1 match_df2 number
0 a a 1
1 a a 3
2 b b 2
I've tried different combinations of inner join, merge and pd.concat, but nothing gave me anything close to the desired output. Is there any pythonic way to make it without any loops, just with pandas methods?
Update:
For now I came up with this solution. Not sure if it's the most efficient. Your help would be appreciated!

df = pd.merge(df1, df2, on='match').drop_duplicates('number')
for match, count in df1['match'].value_counts().items():  # .iteritems() in pandas < 2.0
    df = df.drop(index=df[df['match'] == match][count:].index)
In your case you can do it with groupby and cumcount before the merge. Notice I do not keep two match columns, since they are the same:
df1['key'] = df1.groupby('match').cumcount()
df2['key'] = df2.groupby('match').cumcount()
out = df1.merge(df2)
Out[418]:
match key number
0 a 0 1
1 a 1 3
2 b 0 2
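To see why this works: cumcount numbers the repeated keys within each group 0, 1, 2, ..., so the k-th 'a' of df1 can only match the k-th 'a' of df2. A self-contained sketch, reconstructing df1 and df2 from the question:

import pandas as pd

df1 = pd.DataFrame({'match': ['a', 'a', 'b']})
df2 = pd.DataFrame({'match': ['a', 'b', 'a', 'a'], 'number': [1, 2, 3, 4]})

# Number the occurrences of each key: first 'a' -> 0, second 'a' -> 1, ...
df1['key'] = df1.groupby('match').cumcount()
df2['key'] = df2.groupby('match').cumcount()

# merge() defaults to the common columns ['match', 'key'], pairing each
# occurrence in df1 with the same occurrence in df2.
out = df1.merge(df2)
print(out)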

Dropping multiple columns in a pandas dataframe between two columns based on column names

A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number, as I do not know the numbers. I want to drop all the columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
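Applied to the question's own (hypothetical) column names, the same pattern would be:

foo = foo.drop(columns=foo.loc[:, 'columnWhatever233':'columnWhatever826'].columns)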

Python how to merge two dataframes with multiple columns while preserving row order in each column?

My data is contained within two dataframes. Within each dataframe, the entries are sorted in each column. I want to now merge the two dataframes while preserving row order. For example, suppose I have this:
The first dataframe "A1" looks like this:
index a b c
0 1 4 1
3 2 7 3
5 5 8 4
6 6 10 8
...
and the second dataframe "A2" looks like this (A1 and A2 are the same size):
index a b c
1 3 1 2
2 4 2 5
4 7 3 6
7 8 5 7
...
I want to merge both of these dataframes to get the final dataframe "data":
index a b c
0 1 4 1
1 3 1 2
2 4 2 5
3 2 7 3
...
Here is what I have tried:
data = A1.merge(A2, how='outer', left_index=True, right_index=True)
But I keep getting strange results. I don't even know if this works when you have multiple columns whose row order you need to preserve. I find that some of the entries become NaNs for some reason, and I don't know how to fix it. I also tried data.join(A1, A2), but the interpreter reported that it couldn't join these two dataframes.
import pandas as pd

# Create dataframes df and df1
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 0, 11, 12]}, index=[0, 3, 5, 6])
df1 = pd.DataFrame({'a': [13, 14, 15, 16], 'b': [17, 18, 19, 20], 'c': [21, 22, 23, 24]}, index=[1, 2, 4, 7])

# Concatenate df and df1 and sort by index
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it).
df2 = pd.concat([df, df1])
print(df2.sort_index())
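Applied to the question's A1 and A2 (a sketch; their index values are disjoint, so sorting by index interleaves the rows as desired):

data = pd.concat([A1, A2]).sort_index()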

Compare columns in Pandas between two unequal size Dataframes for condition check

I have two pandas DFs of unequal sizes. For example:
Df1
id value
a 2
b 3
c 22
d 5
Df2
id value
c 22
a 2
Now I want to extract from DF1 those rows which have the same id as in DF2. My first approach was to run two for loops, with something like:

x = []
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])

Now this is okay, but with two files of 400,000 lines in one and 5,000 in the other, I need an efficient pythonic + pandas way.
import pandas as pd
data1={'id':['a','b','c','d'],
'value':[2,3,22,5]}
data2={'id':['c','a'],
'value':[22,2]}
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
finaldf=pd.concat([df1,df2],ignore_index=True)
Output after concat
id value
0 a 2
1 b 3
2 c 22
3 d 5
4 c 22
5 a 2
Final Output
finaldf.drop_duplicates()
id value
0 a 2
1 b 3
2 c 22
3 d 5
You can concat the dataframes, then check whether the elements are duplicated or not, then drop_duplicates and keep just the first occurrence:
m = pd.concat((df1,df2))
m[m.duplicated('id',keep=False)].drop_duplicates()
id value
0 a 2
2 c 22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
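A simpler equivalent without building an index first (assuming id is a plain column, as in the question):

df = df1[df1['id'].isin(df2['id'])]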
