Finding duplicates in two dataframes and removing the duplicates from one dataframe - python

Working in Python / pandas / dataframes
I have these two dataframes:
Dataframe one:
   1          2    3
1  Stockholm  100  250
2  Stockholm  150  376
3  Stockholm  105  235
4  Stockholm  109  104
5  Burnley    145  234
6  Burnley    100  250
Dataframe two:
   1          2    3
1  Stockholm  100  250
2  Stockholm  117  128
3  Stockholm  105  235
4  Stockholm  100  250
5  Burnley    145  234
6  Burnley    100  953
I would like to find the rows that appear in both Dataframe one and Dataframe two, and remove those duplicates from Dataframe one. Rows 1, 3 and 5 of Dataframe one also appear in Dataframe two, so removing them from Dataframe one creates the below:
   1          2    3
1  Stockholm  150  376
2  Stockholm  109  104
3  Burnley    100  250

Use:
df_merge = pd.merge(df1, df2, on=[1, 2, 3], how='inner')
df1 = pd.concat([df1, df_merge])                 # DataFrame.append was removed in pandas 2.0
df1['Duplicated'] = df1.duplicated(keep=False)   # keep=False marks every copy of a duplicated row with True
df_final = df1[~df1['Duplicated']]               # select only the rows that are not duplicated
df_final = df_final.drop(columns='Duplicated')   # drop the indicator column
The idea is as follows:
1. do an inner join on all the columns
2. append the output of the inner join to df1 (with pd.concat, since DataFrame.append was removed in pandas 2.0)
3. identify the duplicated rows in df1
4. select the rows of df1 that are not duplicated
Each numbered step corresponds to a line of code above.
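To see what keep=False does on its own, here is a tiny sketch (hypothetical data, just to show the mechanics):

import pandas as pd

s = pd.DataFrame({'x': [1, 2, 2, 3]})
print(s.duplicated(keep=False))
# 0    False
# 1     True
# 2     True
# 3    False
# dtype: bool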

A one-liner:
df1.merge(df2, indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1)
Output:
   1          2    3
2  Stockholm  150  376
4  Stockholm  109  104
6  Burnley    100  250
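For reference, a minimal end-to-end sketch of the one-liner with the question's sample data (the integer column labels 1, 2, 3 come from the question):

import pandas as pd

df1 = pd.DataFrame(
    [['Stockholm', 100, 250], ['Stockholm', 150, 376], ['Stockholm', 105, 235],
     ['Stockholm', 109, 104], ['Burnley', 145, 234], ['Burnley', 100, 250]],
    columns=[1, 2, 3])
df2 = pd.DataFrame(
    [['Stockholm', 100, 250], ['Stockholm', 117, 128], ['Stockholm', 105, 235],
     ['Stockholm', 100, 250], ['Burnley', 145, 234], ['Burnley', 100, 953]],
    columns=[1, 2, 3])

# An outer merge with indicator=True adds a _merge column saying whether each
# row came from the left frame only, the right frame only, or both; keeping
# only 'left_only' rows drops everything that also appears in df2.
result = (df1.merge(df2, indicator=True, how='outer')
             .query('_merge == "left_only"')
             .drop(columns='_merge'))
print(result)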

Related

Checking for a string in two different dataframes and copy the corresponding rows to calculate statistics in Pandas

I want to write some Python code where I have, for example, 2 different DataFrames (the number of dataframes can be more than 2), as follows:
df1 =
Index Name Age Height
0 Tom 20 166
1 Bill 27 170
2 Jacob 39 180
3 Vivian 26 155
df2 =
Index Name Age Height
0 Mary 20 166
1 Tom 27 170
2 Bill 39 180
3 Jack 26 155
I want to check the names in both dataframes and, where they match, add the corresponding entries in the columns, so that the final result looks like a third dataframe:
result =
Index Name Age Height
0 Tom 47 336
1 Bill 66 350
2 Jacob 39 180
3 Vivian 26 155
4 Mary 20 166
5 Jack 26 155
Tom and Bill have entries in both dataframes, so their Age and Height get added together; the others have a single entry, so the original number is displayed. Thank you in advance.
You can concatenate both DataFrames using pd.concat, then group by name using DataFrame.groupby and sum:
# Assuming `Index` is not a column. If it is a column,
# set it as the index first using `df.set_index("Index")`.
out = (
    pd.concat([df1, df2], ignore_index=True)
    .groupby("Name", as_index=False, sort=False)
    .sum()
)
out
# Name Age Height
# 0 Tom 47 336
# 1 Bill 66 350
# 2 Jacob 39 180
# 3 Vivian 26 155
# 4 Mary 20 166
# 5 Jack 26 155
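For completeness, a runnable sketch with the sample frames (assuming Index is the default RangeIndex, as the comment above notes):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Bill', 'Jacob', 'Vivian'],
                    'Age': [20, 27, 39, 26],
                    'Height': [166, 170, 180, 155]})
df2 = pd.DataFrame({'Name': ['Mary', 'Tom', 'Bill', 'Jack'],
                    'Age': [20, 27, 39, 26],
                    'Height': [166, 170, 180, 155]})

# sort=False keeps first-appearance order, so Tom and Bill stay at the top.
result = (pd.concat([df1, df2], ignore_index=True)
          .groupby('Name', as_index=False, sort=False)
          .sum())
print(result)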

groupby cumsum sorted dataframe

I want to group a dataframe by one column, then apply a cumsum over the other, ordered by the first column descending.
df1:
id PRICE DEMAND
0 120 10
1 232 2
2 120 3
3 232 8
4 323 5
5 323 6
6 323 2
df2:
id PRICE DEMAND
0 323 13
1 232 23
2 120 36
I do it in two instructions, but I feel it can be done with only one:
data = data.groupby('PRICE',as_index=False).agg({'DEMAND': 'sum'}).sort_values(by='PRICE', ascending=False)
data['DEMAND'] = data['DEMAND'].cumsum()
What you have seems perfectly fine to me. But if you want to chain everything together, first sort, then groupby with sort=False so it doesn't change the order. Then you can sum within each group and cumsum the resulting Series:
(df.sort_values('PRICE', ascending=False)
.groupby('PRICE', sort=False)['DEMAND'].sum()
.cumsum()
.reset_index())
PRICE DEMAND
0 323 13
1 232 23
2 120 36
Another option would be to sort then cumsum and then drop_duplicates:
(df.sort_values('PRICE', ascending=False)
.set_index('PRICE')
.DEMAND.cumsum()
.reset_index()
.drop_duplicates('PRICE', keep='last'))
PRICE DEMAND
2 323 13
4 232 23
6 120 36
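A runnable sketch that builds the sample frame and runs the chained version:

import pandas as pd

df = pd.DataFrame({'PRICE': [120, 232, 120, 232, 323, 323, 323],
                   'DEMAND': [10, 2, 3, 8, 5, 6, 2]})

out = (df.sort_values('PRICE', ascending=False)
         .groupby('PRICE', sort=False)['DEMAND'].sum()   # 323 -> 13, 232 -> 10, 120 -> 13
         .cumsum()                                       # 13, 23, 36
         .reset_index())
print(out)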

Sumifs excel formula in Pandas

I have seen a lot of SUMIFS questions answered here, but this one is very different from the ones I've found.
1st Trade data frame contains transaction id and C_ID
transaction C_ID
1 101
2 103
3 104
4 101
5 102
6 104
2nd Customer data frame contains C_ID, On/Off, Amount
C_ID On/Off Amount
102 On 320
101 On 400
101 On 200
103 On 60
104 Off 80
104 On 100
So I want to calculate the Amount based on the C_ID, with a condition on the 'On/Off' column in the Customer dataframe. The resulting trade dataframe should be:
transaction C_ID Amount
1 101 600
2 103 60
3 104 100
4 101 600
5 102 320
6 104 100
Here's the Excel formula showing how Amount is calculated:
=SUMIFS(Customer.Amount, Customer.C_ID = Trade.C_ID, Customer.On/Off = On)
I want to replicate this particular formula in Python using pandas.
You can use groupby() on the filtered data to compute the sums, then map to assign the new column to the transaction data:
s = df2[df2['On/Off']=='On'].groupby('C_ID')['Amount'].sum()
df1['Amount'] = df1['C_ID'].map(s)
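A runnable sketch of this approach with the sample data (map aligns the per-customer sums back onto each transaction row):

import pandas as pd

trade = pd.DataFrame({'transaction': [1, 2, 3, 4, 5, 6],
                      'C_ID': [101, 103, 104, 101, 102, 104]})
customer = pd.DataFrame({'C_ID': [102, 101, 101, 103, 104, 104],
                         'On/Off': ['On', 'On', 'On', 'On', 'Off', 'On'],
                         'Amount': [320, 400, 200, 60, 80, 100]})

# SUMIFS equivalent: sum Amount per C_ID where On/Off == 'On' ...
s = customer.loc[customer['On/Off'] == 'On'].groupby('C_ID')['Amount'].sum()
# ... then look each transaction's C_ID up in that Series.
trade['Amount'] = trade['C_ID'].map(s)
print(trade)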
Another option: filter, then groupby + sum, then reindex and assign:
df1['Amount']=df2.loc[df2['On/Off']=='On'].groupby(['C_ID']).Amount.sum().reindex(df1.C_ID).tolist()
df1
Out[340]:
transaction C_ID Amount
0 1 101 600
1 2 103 60
2 3 104 100
3 4 101 600
4 5 102 320
5 6 104 100

Pandas DataFrame Groupby two columns and add column for moving average

I have a dataframe that I want to group using multiple columns and then add a calculated column (mean) based on the grouping. Can someone give me a hand?
I have tried the grouping and it works fine, but adding the calculated (rolling mean) column is proving to be a hassle.
import pandas as pd
import numpy as np
df = pd.DataFrame([
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
    list('AAAAAAAABBBBBBBB'),
    ['RED', 'BLUE', 'GREEN', 'YELLOW'] * 4,
    ['1', '1', '1', '1', '2', '2', '2', '2',
     '1', '1', '1', '1', '2', '2', '2', '2'],
    [100, 112, 99, 120, 105, 114, 100, 150,
     200, 134, 167, 150, 134, 189, 172, 179],
]).T
df.columns = ['id','Station','Train','month_code','total']
df2 = df.groupby(['Station','Train','month_code','total']).size().reset_index().groupby(['Station','Train','month_code'])['total'].max()
I'm looking to get an outcome similar to the one below:
Station  Train   month_code  total  average
A        BLUE    1           112
                 2           114    113
         GREEN   1           99     106.5
                 2           100    99.5
         RED     1           100    100
                 2           105    102.5
         YELLOW  1           120    112.5
                 2           150    135
B        BLUE    1           134    142
                 2           189    161.5
         GREEN   1           167    178
                 2           172    169.5
         RED     1           200    186
                 2           134    167
         YELLOW  1           150    142
                 2           179    164.5
How about changing your initial groupby so it keeps the column name 'total':
df3 = df.groupby(['Station','Train','month_code']).sum()
>>> df3.head()
id total
Station Train month_code
A BLUE 1 2 112
2 6 114
GREEN 1 3 99
2 7 100
RED 1 1 100
Then do a rolling mean on the total column.
df3['average'] = df3['total'].rolling(2).mean()
>>> df3.head()
id total average
Station Train month_code
A BLUE 1 2 112 NaN
2 6 114 113.0
GREEN 1 3 99 106.5
2 7 100 99.5
RED 1 1 100 100.0
You can then still remove the id column if you don't want it.
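A runnable sketch of the whole thing. Note that rolling(2) runs straight down the summed column without resetting per group, which is what the expected output implies (e.g. GREEN/1's 106.5 is the mean of 114 and 99):

import pandas as pd

df = pd.DataFrame({
    'id': range(1, 17),
    'Station': list('AAAAAAAABBBBBBBB'),
    'Train': ['RED', 'BLUE', 'GREEN', 'YELLOW'] * 4,
    'month_code': ['1'] * 4 + ['2'] * 4 + ['1'] * 4 + ['2'] * 4,
    'total': [100, 112, 99, 120, 105, 114, 100, 150,
              200, 134, 167, 150, 134, 189, 172, 179],
})

df3 = df.groupby(['Station', 'Train', 'month_code']).sum()
df3['average'] = df3['total'].rolling(2).mean()
print(df3.head())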

selecting a column from pandas pivot table

I have the below pivot table which I created from a dataframe using the following code:
table = pd.pivot_table(df, values='count', index=['days'], columns=['movements'], aggfunc=np.sum)
movements 0 1 2 3 4 5 6 7
days
0 2777 51 2
1 6279 200 7 3
2 5609 110 32 4
3 4109 118 101 8 3
4 3034 129 109 6 2 2
5 2288 131 131 9 2 1
6 1918 139 109 13 1 1
7 1442 109 153 13 10 1
8 1085 76 111 13 7 1
9 845 81 86 8 8
10 646 70 83 1 2 1 1
As you can see, the pivot table has 8 columns, 0-7, and now I want to plot some specific columns instead of all of them. I could not manage to select the columns. Let's say I want to plot column 0 and column 2 against the index. What should I use for y to select those two columns?
plt.plot(x=table.index, y=??)
I tried y = table.value['0', '2'] and y = table['0','2'], but neither works.
You cannot pass both columns as a single array to y like that. If you need those two columns in a single plot, you can plot them one at a time:
plt.plot(table['0'])
plt.plot(table['2'])
If the column names are integers, then:
plt.plot(table[0])
plt.plot(table[2])
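Alternatively, since the pivot table is a DataFrame, you can select both columns by label and let pandas plot them in one call (a sketch, assuming integer column labels as above):

import matplotlib.pyplot as plt

# Plots columns 0 and 2 against the index (days), one line each.
table[[0, 2]].plot()
plt.xlabel('days')
plt.show()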
