Say I have the following dataframes (suppose they are Dask DataFrames):
df A = [1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5]
df B = [1, 2, 2, 3, 3, 3, 4, 5, 5, 5]
and I would like to merge the two so that the resulting DataFrame keeps the most information from either of the two (so, for instance, for observation 1 I would like to preserve the info of df A; for observation 3, I would like to preserve the info of df B; and so on).
In other words the resulting DataFrame should be like this:
df C = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 5, 5]
Is there a way to do that in Dask?
Thank you
Notes:
There are various ways to merge Dask dataframes. Dask provides several built-in methods, such as dask.dataframe.DataFrame.join, dask.dataframe.multi.concat, dask.dataframe.DataFrame.merge, dask.dataframe.multi.merge, and dask.dataframe.multi.merge_asof. Depending on one's requirements, one might want to use a specific one.
This thread has really valuable information on merges. Even though its focus is on Pandas, it will allow one to understand left, right, outer, and inner merges.
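As a quick illustration of the call signature (a minimal sketch with hypothetical toy frames, not tied to the question's data), a Dask merge looks just like its Pandas counterpart:
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(pd.DataFrame({'sample_id': [1, 2, 3], 'x': [10, 20, 30]}), npartitions=1)
right = dd.from_pandas(pd.DataFrame({'sample_id': [2, 3, 4], 'y': [5, 6, 7]}), npartitions=1)

# An outer merge keeps every sample_id from both sides
merged = left.merge(right, on='sample_id', how='outer')
print(merged.compute())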
If one wants to do it with Pandas dataframes, there are various ways to achieve that.
One approach would be creating a dataframe that records which input dataframe has the highest number of rows per sample_id, and then applying a custom-made function. Let's invest a bit more time in that approach.
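For concreteness (so the snippets below run as written), here is the question's data as single-column dataframes:
import pandas as pd

df_a = pd.DataFrame({'sample_id': [1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5]})
df_b = pd.DataFrame({'sample_id': [1, 2, 2, 3, 3, 3, 4, 5, 5, 5]})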
We will first create a dataframe to store the number of rows that each dataframe has per sample_id as follows
df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()})
df_count['df_a'] = df_count['sample_id'].map(df_a.groupby('sample_id').size())
df_count['df_b'] = df_count['sample_id'].map(df_b.groupby('sample_id').size())
As it will be helpful later, let us create a column df_max that stores which dataframe has more rows per sample_id
df_count['df_max'] = df_count[['df_a', 'df_b']].idxmax(axis=1)
[Out]:
sample_id df_a df_b df_max
0 1 2 1 df_a
1 2 4 2 df_a
2 3 1 3 df_b
3 4 1 1 df_a
4 5 6 3 df_a
As a single chained expression, creating the desired df_count would look like the following
df_count = (
    pd.DataFrame({'sample_id': df_a['sample_id'].unique()})
    .assign(df_a=lambda x: x['sample_id'].map(df_a.groupby('sample_id').size()),
            df_b=lambda x: x['sample_id'].map(df_b.groupby('sample_id').size()),
            df_max=lambda x: x[['df_a', 'df_b']].idxmax(axis=1))
)
Now, given df_a, df_b, and df_count, one will want a function to merge the dataframes based on a specific condition:
If df_max is df_a, then take the rows from df_a.
If df_max is df_b, then take the rows from df_b.
One can create a function merge_df that takes df_a, df_b, and df_count and returns the merged dataframe
def merge_df(df_a, df_b, df_count):
    # Create a list to store the dataframes
    df_list = []
    # Iterate over the rows in df_count
    for index, row in df_count.iterrows():
        # If df_max is df_a, then take the rows from df_a
        if row['df_max'] == 'df_a':
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
        # If df_max is df_b, then take the rows from df_b
        elif row['df_max'] == 'df_b':
            df_list.append(df_b[df_b['sample_id'] == row['sample_id']])
        # If df_max is neither df_a nor df_b, then fall back to df_a
        else:
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
    # Concatenate the dataframes in df_list and return the result. Also, reset the index.
    return pd.concat(df_list).reset_index(drop=True)
Then one can apply the function
df_merged = merge_df(df_a, df_b, df_count)
[Out]:
sample_id
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 4
10 5
11 5
12 5
13 5
14 5
15 5
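As for doing this directly in Dask (the original question): since the per-id counts are tiny, one possible sketch, not a definitive implementation, is to compute them back to Pandas and then filter and concatenate the Dask frames:
import dask.dataframe as dd

ddf_a = dd.from_pandas(df_a, npartitions=2)
ddf_b = dd.from_pandas(df_b, npartitions=2)

# Per-id row counts are small, so materialize them in Pandas
counts_a = ddf_a.groupby('sample_id').size().compute()
counts_b = ddf_b.groupby('sample_id').size().compute()
counts = counts_a.to_frame('df_a').join(counts_b.rename('df_b'), how='outer').fillna(0)

# Ids where df_a has at least as many rows; take the rest from df_b
ids_a = list(counts.index[counts['df_a'] >= counts['df_b']])

df_c = dd.concat([
    ddf_a[ddf_a['sample_id'].isin(ids_a)],
    ddf_b[~ddf_b['sample_id'].isin(ids_a)],
])
print(df_c.compute().sort_values('sample_id').reset_index(drop=True))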
Related
Looking for a way to merge a list of Pandas dataframes for the following effect.
Multiple dataframes as shown below, df1 - df3, combined into new_df.
df1:
          0   1   2
Category  A   B   C
Lunch     17  11  6
df2:
          0  1  2
Category  A  B  C
Dinner    1  3  5
df3:
          0   1  2
Category  A   B  C
Snacks    11  1  6
new_df:
          0   1   2
Category  A   B   C
Lunch     17  11  6
Dinner    1   3   5
Snacks    11  1   6
I have tried:
pdList = [df1,df2,df3]
new_df = pd.concat(pdList)
But it doesn't merge the Category A B C entries, as it is concatenation.
Thus Category A B C is kept from every dataframe. It is required once, as the figures relate to the category letter above it.
Is there a way to use merge on multiple dfs and get this effect?
So your problem here, my dude, is that you are concatenating the Category row from every dataframe, which is totally unnecessary since you have it in all of your dfs.
Just drop the Category row from your dfs and keep it in at least one:
list_df = [df2, df3]  # dfs you want to drop the Category row in
pdList = [df1]        # df you want to keep the Category row
for i in list_df:
    # note: drop(..., inplace=True) returns None, so append the returned copy instead
    pdList.append(i.drop('Category', axis=0))
new_df = pd.concat(pdList)
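For reference, a minimal reproduction of the above, assuming the frames shown in the question:
import pandas as pd

df1 = pd.DataFrame([['A', 'B', 'C'], [17, 11, 6]], index=['Category', 'Lunch'])
df2 = pd.DataFrame([['A', 'B', 'C'], [1, 3, 5]], index=['Category', 'Dinner'])
df3 = pd.DataFrame([['A', 'B', 'C'], [11, 1, 6]], index=['Category', 'Snacks'])

# Keep the Category row once (from df1) and drop it from the others
new_df = pd.concat([df1, df2.drop('Category'), df3.drop('Category')])
print(new_df)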
Let's say I have 3 different columns
Column1 Column2 Column3
0 a 1 NaN
1 NaN 3 4
2 b 6 7
3 NaN NaN 7
and I want to create 1 final column that would take first value that isn't NA, resulting in:
Column1
0 a
1 3
2 b
3 7
I would usually do this with a custom apply function:
df.apply(lambda x: ...)
I need to do this for many different cases with millions of rows and this becomes very slow. Are there any operations that would take advantage of vectorization to make this faster?
Back-fill missing values along each row and select the first column, using [] for a one-column DataFrame or without it for a Series:
df1 = df.bfill(axis=1).iloc[:, [0]]
s = df.bfill(axis=1).iloc[:, 0]
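A quick sanity check with the frame from the question (a sketch, assuming the blanks are np.nan):
import pandas as pd
import numpy as np

df = pd.DataFrame({'Column1': ['a', np.nan, 'b', np.nan],
                   'Column2': [1, 3, 6, np.nan],
                   'Column3': [np.nan, 4, 7, 7]})

# Back-fill row-wise, then take the first column
print(df.bfill(axis=1).iloc[:, 0])  # a, 3, b, 7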
You can use .fillna() for this, as below:
df['Column1'].fillna(df['Column2']).fillna(df['Column3'])
output:
0 a
1 3
2 b
3 7
For more than 3 columns, this can be placed in a for loop as below, with new_col being your output:
new_col = df['Column1']
for col in df.columns:
    new_col = new_col.fillna(df[col])
A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number, I do not know them. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
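If you prefer working by column positions instead, an equivalent (hypothetical) variant, assuming unique column labels:
# Locate the positional range of the boundary columns, then drop that slice
start = df.columns.get_loc("B")
stop = df.columns.get_loc("D")
df = df.drop(columns=df.columns[start:stop + 1])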
My data is contained within two dataframes. Within each dataframe, the entries are sorted in each column. I want to now merge the two dataframes while preserving row order. For example, suppose I have this:
The first dataframe "A1" looks like this:
index a b c
0 1 4 1
3 2 7 3
5 5 8 4
6 6 10 8
...
and the second dataframe "A2" looks like this (A1 and A2 are the same size):
index a b c
1 3 1 2
2 4 2 5
4 7 3 6
7 8 5 7
...
I want to merge both of these dataframes to get the final dataframe "data":
index a b c
0 1 4 1
1 3 1 2
2 4 2 5
3 2 7 3
...
Here is what I have tried:
data = A1.merge(A2, how='outer', left_index=True, right_index=True)
But I keep getting strange results. I don't even know if this works if you have multiple columns whose row order you need to preserve. I find that some of the entries become NaNs for some reason, and I don't know how to fix it. I also tried data.join(A1, A2), but pandas raised an error saying it couldn't join these two dataframes.
import pandas as pd

# Create DataFrames df and df1
df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,0,11,12]}, index=[0,3,5,6])
df1 = pd.DataFrame({'a':[13,14,15,16],'b':[17,18,19,20],'c':[21,22,23,24]}, index=[1,2,4,7])

# Concatenate df and df1 and sort by index
# (DataFrame.append is deprecated and was removed in pandas 2.0, so use pd.concat)
df2 = pd.concat([df, df1])
print(df2.sort_index())
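Applied to the frames from the question, the same pattern would be (assuming A1 and A2 as shown there):
data = pd.concat([A1, A2]).sort_index()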
I have two datasets with the same attributes but in some of the rows the information is changed. I want to extract the rows in which the information has been changed.
pandas offers a rich API that you can use to manipulate data however you want, and the merge method is one of them. merge performs high-performance in-memory join operations that are idiomatically very similar to relational databases like SQL.
df1 = pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
print(df1)
A B
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'A':[1,10,3],'B':[4,5,6]})
print(df2)
A B
0 1 4
1 10 5
2 3 6
df3 = df1.merge(df2.drop_duplicates(),how='right', indicator=True)
print(df3)
A B _merge
0 1 4 both
1 3 6 both
2 10 5 right_only
As you can see, there is a new column named _merge that describes how each row was merged: both means that the row exists in both data frames, while right_only means that the row exists only in the right data frame, which in this case is df2.
If you want to get only the rows that have been changed, you can filter on the _merge column
df3 = df3[df3['_merge']=='right_only']
A B _merge
2 10 5 right_only
Note: you can do the merge using a left join by changing the how argument to 'left'; this will grab everything in the left data frame (df1), and if a row exists in df2 as well, the _merge column will show both. Take a look here for more details.
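For illustration, the left-join variant of the same call (a quick sketch using the frames above):
df4 = df1.merge(df2.drop_duplicates(), how='left', indicator=True)
# the row present only in df1 (A=2, B=5) shows up as 'left_only'
print(df4[df4['_merge'] == 'left_only'])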