Difference between two DataFrames in pandas - Python
I am trying to find the difference between two DataFrames, and the resulting df should return the rows matching the first DataFrame. Since ids 6 and 7 are not present in df2, their count values stay as they are.
My two DataFrames:
Resulting Dataframe:
Use sub with set_index to align both DataFrames by the id column, then reindex by df1['id'] to keep only (and order by) df1's ids:
df = (df1.set_index('id')
         .sub(df2.set_index('id'), fill_value=0)  # align by id, missing ids count as 0
         .reindex(df1['id'])                      # keep only df1's ids, in df1's order
         .astype(int)
         .reset_index())
print(df)
id count
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 9
6 7 4
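As a side note (not part of the original answer): if you also want to keep the ids that appear only in df2, a minimal sketch is to drop the reindex step, so the subtraction keeps the union of both id sets, and the df2-only ids come out negative:

# without reindex, sub keeps the union of both indexes; ids missing
# from df1 are treated as 0, so df2-only ids get negative counts
df_all = (df1.set_index('id')
             .sub(df2.set_index('id'), fill_value=0)
             .astype(int)
             .reset_index())
print(df_all)

   id  count
0   1      0
1   2      0
2   3      0
3   4      0
4   5      0
5   6      9
6   7      4
7   8     -4
8   9     -2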
Another solution with merge and a left join, then subtract with sub, extracting the helper column count_ with pop:
df = df1.merge(df2, on='id', how='left', suffixes=('', '_'))
# pop removes the helper column count_ and returns it, so no extra drop is needed
df['count'] = df['count'].sub(df.pop('count_'), fill_value=0).astype(int)
print(df)
id count
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 9
6 7 4
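A third, hedged variant (not from the original answers): map df2's counts onto df1 and subtract, which avoids the helper column entirely; it assumes id is unique in df2:

# Series.map looks up each df1 id in a df2 id -> count mapping; unmatched
# ids yield NaN, and fillna(0) turns that into "nothing to subtract"
out = df1.copy()
out['count'] = (out['count']
                - out['id'].map(df2.set_index('id')['count']).fillna(0)).astype(int)
print(out)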
Setup:
import pandas as pd

df1 = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                    'count':[3,5,6,7,2,9,4]})
print(df1)
id count
0 1 3
1 2 5
2 3 6
3 4 7
4 5 2
5 6 9
6 7 4
df2 = pd.DataFrame({'id':[1,2,3,4,5,8,9],
                    'count':[3,5,6,7,2,4,2]})
print(df2)
id count
0 1 3
1 2 5
2 3 6
3 4 7
4 5 2
5 8 4
6 9 2
Use:
temp = pd.merge(df1, df2, how='left', on='id').fillna(0)
temp['count'] = temp['count_x'] - temp['count_y']
temp[['id', 'count']]
id count
0 1 0.0
1 2 0.0
2 3 0.0
3 4 0.0
4 5 0.0
5 6 9.0
6 7 4.0
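The floats come from the left join: unmatched rows get NaN in count_y, which makes that column (and hence the difference) float even after fillna(0). If integer output is wanted, a small hedged tweak:

# cast back to int once the NaNs have been filled with 0
temp['count'] = (temp['count_x'] - temp['count_y']).astype(int)
print(temp[['id', 'count']])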
Related
Merge two DataFrames by combining duplicates and concatenating nonduplicates
I have two DataFrames:

df = pd.DataFrame({'A':[1,2], 'B':[3,4]})

   A  B
0  1  3
1  2  4

df2 = pd.DataFrame({'A':[3,2,1], 'C':[5,6,7]})

   A  C
0  3  5
1  2  6
2  1  7

and I want to merge them in a way that column 'A' adds the values that differ between the DataFrames but merges the duplicates. Desired output:

   A    B  C
0  3  NaN  5
1  2    4  6
2  1    3  7
You can use combine_first, which fills missing entries of df2 from df (alignment is by row index, and non-null values in df2 take precedence):

df2 = df2.combine_first(df)

Output:

   A    B  C
0  3  3.0  5
1  2  4.0  6
2  1  NaN  7
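If the goal is to pair B and C by the value of A rather than by row position (as the desired output suggests), a hedged alternative is an outer merge on 'A' (not from the original answer):

# align on column 'A' itself; values present in only one frame get NaN
out = df.merge(df2, on='A', how='outer')
print(out)

   A    B  C
0  1  3.0  7
1  2  4.0  6
2  3  NaN  5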
Group identical consecutive values in pandas DataFrame
I have the following pandas dataframe:

   a
0  0
1  0
2  1
3  2
4  2
5  2
6  3
7  2
8  2
9  1

I want to store the values in another dataframe such that every group of consecutive identical values makes a labeled group, like this:

   A  B
0  0  2
1  1  1
2  2  3
3  3  1
4  2  2
5  1  1

Column A represents the value of the group and B represents the number of occurrences. This is what I've done so far:

df = pd.DataFrame({'a':[0,0,1,2,2,2,3,2,2,1]})
df2 = pd.DataFrame()
for i, g in df.groupby([(df.a != df.a.shift()).cumsum()]):
    vc = g.a.value_counts()
    df2 = df2.append({'A': vc.index[0], 'B': vc.iloc[0]}, ignore_index=True).astype(int)

It works but it's a bit messy. Can you think of a shorter/better way of doing this?
Use GroupBy.agg (named aggregation) in pandas >= 0.25.0:

new_df = (df.groupby(df['a'].ne(df['a'].shift()).cumsum(), as_index=False)
            .agg(A=('a','first'), B=('a','count')))
print(new_df)

   A  B
0  0  2
1  1  1
2  2  3
3  3  1
4  2  2
5  1  1

pandas < 0.25.0:

new_df = (df.groupby(df['a'].ne(df['a'].shift()).cumsum(), as_index=False)
            .a
            .agg({'A':'first', 'B':'count'}))
I would try marking each run of identical values with a cumulative-sum block id and counting the group sizes (size is used here, since after grouping on both columns there is no remaining column for count to count):

df['blocks'] = df['a'].ne(df['a'].shift()).cumsum()
out = (df.groupby(['a', 'blocks'], sort=False)
         .size()
         .reset_index(name='B')
         .drop('blocks', axis=1))

Output:

   a  B
0  0  2
1  1  1
2  2  3
3  3  1
4  2  2
5  1  1
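For what it's worth, the same run-length idea has a minimal pure-Python sketch using itertools.groupby, which groups consecutive equal values (not from the original answers):

from itertools import groupby

import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 2, 2, 2, 3, 2, 2, 1]})

# each run of identical consecutive values becomes one (value, length) pair
runs = [(value, sum(1 for _ in group)) for value, group in groupby(df['a'])]
new_df = pd.DataFrame(runs, columns=['A', 'B'])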
Sort dataframe by another on one column - pandas
Let's say I have two data-frames, as shown below:

df = pd.DataFrame({'a':[1,4,3,2],'b':[1,2,3,4]})
df2 = pd.DataFrame({'a':[1,2,3,4],'b':[1,2,3,4],'c':[34,56,7,55]})

I would like to sort df by the order of df2's 'a' column, so that df.a follows the order of df2.a, and that ordering carries the whole data-frame with it. Desired output:

   a  b
0  1  1
1  2  4
2  3  3
3  4  2

(I made it manually, so if there's any mistake in it, please tell me :D)

My own attempt:

df = df.set_index('a')
df = df.reindex(index=df2['a'])
df = df.reset_index()
print(df)

Works as expected!!! But when I have longer data-frames, like:

df = pd.DataFrame({'a':[1,4,3,2,3,4,5,3,5,6],'b':[1,2,3,4,5,5,5,6,6,7]})
df2 = pd.DataFrame({'a':[1,2,3,4,3,4,5,6,4,5],'b':[1,2,4,3,4,5,6,7,4,3]})

it doesn't work as expected.

Note: I don't only want an explanation of why, I also need a solution that works for big data-frames.
One possible solution is to create helper columns in both DataFrames, because of the duplicated values:

df['g'] = df.groupby('a').cumcount()
df2['g'] = df2.groupby('a').cumcount()

df = df.set_index(['a','g']).reindex(index=df2.set_index(['a','g']).index)
print(df)

       b
a g
1 0  1.0
2 0  4.0
3 0  3.0
4 0  2.0
3 1  5.0
4 1  5.0
5 0  5.0
6 0  7.0
4 2  NaN
5 1  6.0

Or maybe you need merge (starting again from the original df):

df3 = df.merge(df2[['a','g']], on=['a','g'])
print(df3)

   a  b  g
0  1  1  0
1  4  2  0
2  3  3  0
3  2  4  0
4  3  5  1
5  4  5  1
6  5  5  0
7  5  6  1
8  6  7  0
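To answer the "why": after set_index('a') the longer df has duplicate index labels, and reindexing an index that contains duplicates is ambiguous, so pandas raises an error. A minimal sketch reproducing it (the exact message varies by pandas version):

df = pd.DataFrame({'a':[1,4,3,2,3,4,5,3,5,6],'b':[1,2,3,4,5,5,5,6,6,7]})
df2 = pd.DataFrame({'a':[1,2,3,4,3,4,5,6,4,5],'b':[1,2,4,3,4,5,6,7,4,3]})

try:
    df.set_index('a').reindex(index=df2['a'])
except ValueError as e:
    print(e)  # e.g. "cannot reindex from a duplicate axis"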
Append Pandas disjunction of 2 dataframes to first dataframe
Given 2 pandas tables, both with the 3 columns id, x and y coordinates, several rows of the same id represent a graph with its x-y values. How would I find paths that do not exist in the first table but do in the second, and append them to the 1st table? The key problem is that the order of the graphs can differ between the two tables.

Example:

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 'x':[1,1,1,1,1,5,4,4,10,10,9], 'y':[4,5,6,1,2,4,4,3,1,2,2]})

df1:            df2:            df1 after appending:
id  x  y        id  x  y        id   x  y
 1  1  1         1  1  4         1   1  1
 1  1  2         1  1  5         1   1  2
 2  5  4         1  1  6         2   5  4
 2  4  4         2  1  1         2   4  4
 2  4  3         2  1  2         2   4  3
 3  1  4         3  5  4         3   1  4
 3  1  5         3  4  4         3   1  5
 3  1  6         3  4  3         3   1  6
                 4 10  1         4  10  1
                 4 10  2         4  10  2
                 4  9  2         4   9  2

Should become:

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3,4,4,4], 'x':[1,1,5,4,4,1,1,1,10,10,9], 'y':[1,2,4,4,3,4,5,6,1,2,2]})

As you can see, up to id = 3, df1 and df2 have the same graphs, but their order differs from one table to the other; for example, df1's first graph is df2's second graph. Now df2 has a 4th path that is not in df1. In that case the 4th path should be detected and appended to df1. So I want the intersection of the two tables plus the disjunction of both appended to the first table, with the condition that the ids, i.e. the order of the paths, can differ between the tables.
Imports:

import pandas as pd

Set starting DataFrames:

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3],
                    'x':[1,1,5,4,4,1,1,1],
                    'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4],
                    'x':[1,1,1,1,1,5,4,4,10,10,9],
                    'y':[4,5,6,1,2,4,4,3,1,2,2]})

Outer merge:

df_merged = df1.merge(df2, on=['x', 'y'], how='outer')

produces:

    id_x   x  y  id_y
0    1.0   1  1     2
1    1.0   1  2     2
2    2.0   5  4     3
3    2.0   4  4     3
4    2.0   4  3     3
5    3.0   1  4     1
6    3.0   1  5     1
7    3.0   1  6     1
8    NaN  10  1     4
9    NaN  10  2     4
10   NaN   9  2     4

Note: id_x becomes float because the rows present only in df2 get NaN in id_x, and NaN cannot live in an integer column, so pandas upcasts it to float.

Fill NaN:

df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')

produces:

    id_x   x  y  id_y
0      1   1  1     2
1      1   1  2     2
2      2   5  4     3
3      2   4  4     3
4      2   4  3     3
5      3   1  4     1
6      3   1  5     1
7      3   1  6     1
8      4  10  1     4
9      4  10  2     4
10     4   9  2     4

Drop id_y:

df_merged = df_merged.drop(['id_y'], axis=1)

produces:

    id_x   x  y
0      1   1  1
1      1   1  2
2      2   5  4
3      2   4  4
4      2   4  3
5      3   1  4
6      3   1  5
7      3   1  6
8      4  10  1
9      4  10  2
10     4   9  2

Rename id_x to id:

df_merged = df_merged.rename(columns={'id_x': 'id'})

produces:

    id   x  y
0    1   1  1
1    1   1  2
2    2   5  4
3    2   4  4
4    2   4  3
5    3   1  4
6    3   1  5
7    3   1  6
8    4  10  1
9    4  10  2
10   4   9  2

The final program is 4 lines of code (after imports and setup):

import pandas as pd

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3],
                    'x':[1,1,5,4,4,1,1,1],
                    'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4],
                    'x':[1,1,1,1,1,5,4,4,10,10,9],
                    'y':[4,5,6,1,2,4,4,3,1,2,2]})

df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')
df_merged = df_merged.drop(['id_y'], axis=1)
df_merged = df_merged.rename(columns={'id_x': 'id'})

Please remember to put a check next to the selected answer.
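As a side note on the float question: with pandas >= 0.24 the nullable integer dtype avoids the int-to-float round-trip; a minimal hedged sketch (not part of the original answer):

# 'Int64' (capital I) is pandas' nullable integer dtype; it can hold missing
# values without upcasting the column to float
df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
df_merged['id_x'] = df_merged['id_x'].astype('Int64').fillna(df_merged['id_y'])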
Mauritius, try this code:

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3],
                    'x':[1,1,5,4,4,1,1,1],
                    'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4,5],
                    'x':[1,1,1,1,1,5,4,4,10,10,9,1],
                    'y':[4,5,6,1,2,4,4,3,1,2,2,2]})

# one set of (x, y) points per path in df1
df1_s = [{(x, y) for x, y in df1[['x','y']][df1.id==i].values} for i in df1.id.unique()]

def f(df2):
    # True if this path's point set is not already in df1
    data = {(x, y) for x, y in df2[['x','y']].values}
    return data not in df1_s

check = df2.groupby('id').apply(f).apply(pd.Series)
ids = check[check[0]].index.values
df2 = df2.set_index('id').loc[ids].reset_index()
df1 = df1.append(df2)

OUT:

   id   x  y
0   1   1  1
1   1   1  2
2   2   5  4
3   2   4  4
4   2   4  3
5   3   1  4
6   3   1  5
7   3   1  6
0   4  10  1
1   4  10  2
2   4   9  2
3   5   1  2

I think it can be done more simply and pythonically, but I've thought about it a lot and still don't know how =) Also, you should probably check that the ids in df1 and df2 don't collide before appending one df to the other (at the end); I might add this later. Does this code do what you want?
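A hedged compact variant of the same idea (not from the original answer): represent each path as a frozenset of its (x, y) points, so point order within a path does not matter, then append only the df2 paths whose point set is new:

import pandas as pd

# one frozenset of points per path in df1
paths1 = set(df1.groupby('id').apply(lambda g: frozenset(zip(g.x, g.y))))
# ids of df2 paths not already present in df1
new_ids = [i for i, g in df2.groupby('id')
           if frozenset(zip(g.x, g.y)) not in paths1]
df1 = pd.concat([df1, df2[df2.id.isin(new_ids)]], ignore_index=True)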
Python, pandas, cumulative sum in new column on matching groups
If I have these columns in a dataframe:

   a    b
   1    5
   1    7
   2    3
 1,2    3
   2    5

how do I create column c, where column b is summed using the groupings of column a (a string), while keeping the existing dataframe? Some rows can belong to more than one group.

   a    b   c
   1    5  15
   1    7  15
   2    3  11
 1,2    3  26
   2    5  11

Is there an easy and efficient solution, as the dataframe I have is very large?
You first need to split column a and join it back to the original DataFrame:

print(df.a.str.split(',', expand=True)
          .stack()
          .reset_index(level=1, drop=True)
          .rename('a'))

0    1
1    1
2    2
3    1
3    2
4    2
Name: a, dtype: object

df1 = (df.drop('a', axis=1)
         .join(df.a.str.split(',', expand=True)
                 .stack()
                 .reset_index(level=1, drop=True)
                 .rename('a')))
print(df1)

   b  a
0  5  1
1  7  1
2  3  2
3  3  1
3  3  2
4  5  2

Then use transform for a sum without aggregation:

df1['c'] = df1.groupby(['a'])['b'].transform('sum')
# cast to str so the ','.join aggregation below works
df1['a'] = df1.a.astype(str)
print(df1)

   b  a   c
0  5  1  15
1  7  1  15
2  3  2  11
3  3  1  15
3  3  2  11
4  5  2  11

Last, group by the index and aggregate the columns with agg:

print(df1.groupby(level=0)
         .agg({'a': ','.join, 'b': 'first', 'c': 'sum'})
         [['a','b','c']])

     a  b   c
0    1  5  15
1    1  7  15
2    2  3  11
3  1,2  3  26
4    2  5  11
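For what it's worth, on pandas >= 0.25 the same split-group-recombine idea can be written more compactly with Series.explode; a hedged sketch, not from the original answer:

import pandas as pd

df = pd.DataFrame({'a': ['1', '1', '2', '1,2', '2'],
                   'b': [5, 7, 3, 3, 5]})

# one row per (group, b) pair; explode keeps the original row index
exploded = df.assign(a=df['a'].str.split(',')).explode('a')
exploded['c'] = exploded.groupby('a')['b'].transform('sum')

# rows that belong to several groups contribute one exploded row per group,
# so summing c back over the original index adds the group totals together
df['c'] = exploded.groupby(level=0)['c'].sum()
print(df)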