Merge two DataFrames by combining duplicates and concatenating nonduplicates

I have two DataFrames:
df = pd.DataFrame({'A':[1,2],
'B':[3,4]})
A B
0 1 3
1 2 4
df2 = pd.DataFrame({'A':[3,2,1],
'C':[5,6,7]})
A C
0 3 5
1 2 6
2 1 7
and I want to merge them so that column 'A' holds every distinct value from both DataFrames, with rows that share a value of 'A' combined into a single row.
Desired output:
A B C
0 3 NaN 5
1 2 4 6
2 1 3 7

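You can use combine_first. It aligns rows on the index, not on column 'A', so set 'A' as the index in both DataFrames first:
df2 = df2.set_index('A').combine_first(df.set_index('A')).reset_index()
Output (rows come back sorted by 'A'):
A B C
0 1 3.0 7
1 2 4.0 6
2 3 NaN 5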
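
combine_first matches rows by index label rather than by the values in 'A'. As a hedged alternative that is not part of the answer above, a right merge on 'A' should keep df2's row order, which is the order shown in the desired output. This is a minimal sketch on the question's data; the name out is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [3, 2, 1], 'C': [5, 6, 7]})

# how='right' keeps every row of df2, in df2's order.
# B becomes float and is NaN where df has no matching value of A.
out = df.merge(df2, on='A', how='right')
print(out)
```

Because B has to hold a missing value, it is upcast to float; C keeps its integer dtype.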

Related

search for duplicated consecutive rows and put in additional column pandas

I have a df:
df1
a b c d
0 2 4 1
0 2 5 1
0 1 6 2
1 2 7 2
1 1 8 1
1 1 4 1
I need to group by a and b, and if two consecutive values of d within a group are both 1, I want the c value of the second row placed in a new column next to the first row, like:
df1
a b c d c1
0 2 4 1 5
0 1 6 2 nan
1 2 7 2 nan
1 1 8 1 4
Any ideas?
I tried
df1.groupby([df1.a, df1.b, df1.d.diff().ne(0)])
and then using loc to keep only the rows with 1s and merging the two DataFrames again, but the groupby step is not completely correct.

Sort dataframe by another on one column - pandas

Let's say I have two DataFrames, as shown below:
df=pd.DataFrame({'a':[1,4,3,2],'b':[1,2,3,4]})
df2=pd.DataFrame({'a':[1,2,3,4],'b':[1,2,3,4],'c':[34,56,7,55]})
I would like to sort df by the order of column 'a' in df2, so that df.a follows the order of df2.a and the rest of each row moves with it.
Desired output:
a b
0 1 1
1 2 4
2 3 3
3 4 2
(made it manually, and if there's any mistake with it, please tell me :D)
My own attempt:
df = df.set_index('a')
df = df.reindex(index=df2['a'])
df = df.reset_index()
print(df)
Works as expected!
But when I have longer DataFrames, like:
df=pd.DataFrame({'a':[1,4,3,2,3,4,5,3,5,6],'b':[1,2,3,4,5,5,5,6,6,7]})
df2=pd.DataFrame({'a':[1,2,3,4,3,4,5,6,4,5],'b':[1,2,4,3,4,5,6,7,4,3]})
It doesn't work as expected.
Note: I don't only want an explanation of why; I also need a solution that works for big DataFrames.

One possible solution is to create a helper column in both DataFrames that numbers the repeated values of 'a', because reindex cannot align on an index that contains duplicates:
df['g'] = df.groupby('a').cumcount()
df2['g'] = df2.groupby('a').cumcount()
df = df.set_index(['a','g']).reindex(index=df2.set_index(['a','g']).index)
print(df)
b
a g
1 0 1.0
2 0 4.0
3 0 3.0
4 0 2.0
3 1 5.0
4 1 5.0
5 0 5.0
6 0 7.0
4 2 NaN
5 1 6.0
Or maybe you need merge (applied to df with the helper column g, before the set_index call above):
df3 = df.merge(df2[['a','g']], on=['a','g'])
print(df3)
a b g
0 1 1 0
1 4 2 0
2 3 3 0
3 2 4 0
4 3 5 1
5 4 5 1
6 5 5 0
7 5 6 1
8 6 7 0
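
The merge shown above keeps the row order of df. If the goal is df2's order, one possible variation is to start the merge from df2 and use a left join. This is a sketch only: the names left, right and ordered are illustrative, and assign is used so that neither input DataFrame is modified.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 3, 2, 3, 4, 5, 3, 5, 6],
                   'b': [1, 2, 3, 4, 5, 5, 5, 6, 6, 7]})
df2 = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 5, 6, 4, 5],
                    'b': [1, 2, 4, 3, 4, 5, 6, 7, 4, 3]})

# g numbers the repeats of each value of 'a': 0 for the first, 1 for the second, ...
left = df2[['a']].assign(g=df2.groupby('a').cumcount())
right = df.assign(g=df.groupby('a').cumcount())

# A left merge keeps the rows of df2 in their original order.
# b is NaN where df has no matching (a, g) pair.
ordered = left.merge(right, on=['a', 'g'], how='left').drop(columns='g')
print(ordered)
```

The third occurrence of 4 in df2 has no counterpart in df, so its b is NaN, the same gap that the reindex output shows.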

Pandas create new column based on first unique values of existing column

I'm trying to add a new column to a dataframe with only unique values from an existing column. There will be fewer rows in the new column maybe with np.nan values where duplicates would have been.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3,4,5], 'b':[3,4,3,4,5]})
df
a b
0 1 3
1 2 4
2 3 3
3 4 4
4 5 5
Goal:
a b c
0 1 3 3
1 2 4 4
2 3 3 nan
3 4 4 nan
4 5 5 5
I've tried:
df['c'] = np.where(df['b'].unique(), df['b'], np.nan)
It throws: operands could not be broadcast together with shapes (3,) (5,) ()

mask + duplicated
You can use Pandas methods for masking a series:
df['c'] = df['b'].mask(df['b'].duplicated())
print(df)
a b c
0 1 3 3.0
1 2 4 4.0
2 3 3 NaN
3 4 4 NaN
4 5 5 5.0

Use duplicated with np.where:
df['c'] = np.where(df['b'].duplicated(),np.nan,df['b'])
Or:
df['c'] = df['b'].where(~df['b'].duplicated(),np.nan)
print(df)
a b c
0 1 3 3.0
1 2 4 4.0
2 3 3 NaN
3 4 4 NaN
4 5 5 5.0
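
As a small sketch of why the attempt in the question fails and why duplicated works: unique returns one entry per distinct value, while duplicated returns one boolean per row, so only the latter lines up with the column. The variable name dup is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [3, 4, 3, 4, 5]})

# unique() has 3 entries (distinct values), so it cannot broadcast against 5 rows.
print(df['b'].unique().shape)

# duplicated() has one flag per row; with the default keep='first',
# only the second and later occurrences of a value are flagged.
dup = df['b'].duplicated()
print(dup.shape)

# The row-aligned mask can be passed to np.where directly.
df['c'] = np.where(dup, np.nan, df['b'])
print(df)
```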

difference between two dataframes in Pandas

I am trying to find the difference between two DataFrames; the resulting df should contain only the rows that match the first DataFrame. Since ids 6 and 7 are not in df2, their count values stay as they are.
My Two Dataframes
Resulting Dataframe:

Use sub with set_index to align the DataFrames on the id column, then add reindex to keep only the ids from df1.id:
df = (df1.set_index('id')
.sub(df2.set_index('id'), fill_value=0)
.reindex(df1['id'])
.astype(int)
.reset_index())
print (df)
id count
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 9
6 7 4
Another solution uses merge with a left join, then subtracts with sub, extracting the count_ column with pop:
df = df1.merge(df2, on='id', how='left', suffixes=('','_'))
df['count'] = df['count'].sub(df.pop('count_'), fill_value=0).astype(int)
print (df)
id count
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 9
6 7 4
Setup:
df1 = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'count':[3,5,6,7,2,9,4]})
print (df1)
id count
0 1 3
1 2 5
2 3 6
3 4 7
4 5 2
5 6 9
6 7 4
df2 = pd.DataFrame({'id':[1,2,3,4,5,8,9],
'count':[3,5,6,7,2,4,2]})
print (df2)
id count
0 1 3
1 2 5
2 3 6
3 4 7
4 5 2
5 8 4
6 9 2

Use:
temp = pd.merge(df1, df2, how='left', on='id').fillna(0)
temp['count'] = temp['count_x'] - temp['count_y']
temp[['id', 'count']]
id count
0 1 0.0
1 2 0.0
2 3 0.0
3 4 0.0
4 5 0.0
5 6 9.0
6 7 4.0

Python, pandas, cumulative sum in new column on matching groups

If I have these columns in a dataframe:
a b
1 5
1 7
2 3
1,2 3
2 5
How do I create column c, where column b is summed using the groupings in column a (a string), while keeping the existing DataFrame? Some rows can belong to more than one group.
a b c
1 5 15
1 7 15
2 3 11
1,2 3 26
2 5 11
Is there an easy and efficient solution? The DataFrame I have is very large.

You first need to split column a and join the result back to the original DataFrame:
print (df.a.str.split(',', expand=True)
.stack()
.reset_index(level=1, drop=True)
.rename('a'))
0 1
1 1
2 2
3 1
3 2
4 2
Name: a, dtype: object
df1 = (df.drop('a', axis=1)
         .join(df.a.str.split(',', expand=True)
                   .stack()
                   .reset_index(level=1, drop=True)
                   .rename('a')))
print (df1)
b a
0 5 1
1 7 1
2 3 2
3 3 1
3 3 2
4 5 2
Then use transform to get the group sums without aggregating the rows:
df1['c'] = df1.groupby(['a'])['b'].transform('sum')
#cast to str so that ','.join works in the final aggregation
df1['a'] = df1.a.astype(str)
print (df1)
b a c
0 5 1 15
1 7 1 15
2 3 2 11
3 3 1 15
3 3 2 11
4 5 2 11
Finally, group by the index and aggregate the columns with agg:
print (df1.groupby(level=0)
.agg({'a':','.join,'b':'first' ,'c':'sum'})
[['a','b','c']] )
a b c
0 1 5 15
1 1 7 15
2 2 3 11
3 1,2 3 26
4 2 5 11
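
On pandas versions that provide DataFrame.explode (0.25 and later), the same three steps can be sketched more compactly. This is an alternative under that assumption, not the method used in the answer above; the name long_df is illustrative, and the DataFrame is rebuilt here from the question's table with column a as strings.

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '1', '2', '1,2', '2'],
                   'b': [5, 7, 3, 3, 5]})

# One row per (original row, group) pair; the original index label is repeated.
long_df = df.assign(a=df['a'].str.split(',')).explode('a')

# Total of b for each individual group, broadcast back to every row of the group.
long_df['c'] = long_df.groupby('a')['b'].transform('sum')

# Collapse back to one value per original row by adding the totals
# of every group that the row belongs to.
df['c'] = long_df.groupby(level=0)['c'].sum()
print(df)
```

The final assignment aligns on the index, so the original columns a and b are left untouched and no join of the group labels is needed.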
