I need to drop duplicated rows based on the combination of two string columns (person1 and person2).
For example, person1: ryan with person2: delta and person1: delta with person2: ryan are the same pair and hold the same value in the messages column, so one of those two rows needs to be dropped. The non-duplicated rows should be returned as well.
Code to recreate the df:
import pandas as pd

df = pd.DataFrame({"": [0, 1, 2, 3, 4, 5, 6],
                   "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
                   "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
                   "messages": [1, 1, 2, 3, 3, 9, 9]})
df
person1 person2 messages
0 0 ryan delta 1
1 1 delta ryan 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 bravo delta 3
5 5 alpha ryan 9
6 6 ryan alpha 9
Answer df should be:
finaldf
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
Try as follows:
res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
.reset_index(drop=True))
print(res)
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
Explanation
First, we use df.filter to select just the columns with person*.
For these columns only we use df.apply to turn each row (axis=1) into a frozenset. So, at this stage, we are looking at a pd.Series like this:
0 (ryan, delta)
1 (ryan, delta)
2 (alpha, delta)
3 (bravo, delta)
4 (bravo, delta)
5 (alpha, ryan)
6 (alpha, ryan)
dtype: object
Now, we mark the duplicate rows using Series.duplicated and prefix the resulting boolean series with ~ to invert it, which selects the non-duplicated rows from the original df.
Finally, we reset the index with df.reset_index.
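To make the intermediate step concrete, here's a minimal check (assuming the df defined above):
# build the frozenset key and inspect the duplicated mask:
# rows 1, 4 and 6 are flagged as duplicates
key = df.filter(like='person').apply(frozenset, axis=1)
print(key.duplicated().tolist())
# [False, True, False, False, True, False, True]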
Here's a less general approach than the one given by @ouroboros1; this only works for your two-column case:
# build a key: the row-wise min of person1/person2 concatenated with the row-wise max
sorted_p1p2 = (df[['person1', 'person2']].min(axis=1)
               + '_'
               + df[['person1', 'person2']].max(axis=1))
# keep only the rows whose key is not a duplicate
dedup_df = df[~sorted_p1p2.duplicated()]
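For reference, the keys this builds for the example df look like this:
# each key is '<row-wise min>_<row-wise max>', so order within the pair no longer matters
print(sorted_p1p2.tolist())
# ['delta_ryan', 'delta_ryan', 'alpha_delta', 'bravo_delta',
#  'bravo_delta', 'alpha_ryan', 'alpha_ryan']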
You can put the two person columns in order within each row, then drop duplicates.
import pandas as pd
df = pd.DataFrame({"": [0, 1, 2, 3, 4, 5, 6],
                   "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
                   "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
                   "messages": [1, 1, 2, 3, 3, 9, 9]})
print(df)
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values
df = df.drop_duplicates(subset=['person1', 'person2'])
print(df)
After the swap:
person1 person2 messages
0 0 ryan delta 1
1 1 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 delta bravo 3
5 5 ryan alpha 9
6 6 ryan alpha 9
After dropping duplicates:
person1 person2 messages
0 0 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
5 5 ryan alpha 9
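If you'd rather keep each pair in its original orientation (as in the desired output above), a variant of the same idea is to sort only a temporary key and leave the rows untouched. A minimal sketch, starting again from the original unswapped df:
import numpy as np

# sort each person pair into a throwaway key; the rows themselves stay as-is
key = pd.DataFrame(np.sort(df[['person1', 'person2']].to_numpy(), axis=1),
                   index=df.index)
out = df[~key.duplicated()]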
Given a DataFrame
df1 :
value mesh
0 10 2
1 12 3
2 5 2
obtain a new DataFrame df2 in which each row of df1 is expanded into mesh rows, each equal to the corresponding value of df1 divided by its mesh:
df2 :
value/mesh
0 5
1 5
2 4
3 4
4 4
5 2.5
6 2.5
More generally:
df1 :
value mesh_value other_value
0 10 2 0
1 12 3 1
2 5 2 2
obtain:
df2 :
value/mesh_value other_value
0 5 0
1 5 0
2 4 1
3 4 1
4 4 1
5 2.5 2
6 2.5 2
You can use map; it recovers the originating row index of df1 from each value/mesh ratio:
df2['new'] = df2['value/mesh'].map(dict(zip(df1.eval('value/mesh'),df1.index)))
Out[243]:
0 0
1 0
2 1
3 1
4 1
5 2
6 2
Name: value/mesh, dtype: int64
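Since the mapped values are the row indices of df1, chaining a second map pulls other_value across as well (a small sketch reusing the new column built above):
# map the recovered df1 row index to df1's other_value column
df2['other_value'] = df2['new'].map(df1['other_value'])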
Try as follows:
Use Series.div for value / mesh_value, and apply Series.reindex on np.repeat(df.index, df.mesh_value), i.e. np.repeat with df.mesh_value as the repeats argument.
Next, use pd.concat to combine the result with df.other_value along axis=1.
Finally, rename the column with result of value / mesh_value (its default name will be 0) using df.rename, and chain df.reset_index to reset to a standard index.
import numpy as np

df2 = (pd.concat([df.value.div(df.mesh_value)
                    .reindex(np.repeat(df.index, df.mesh_value)),
                  df.other_value], axis=1)
       .rename(columns={0: 'value_mesh_value'})
       .reset_index(drop=True))
print(df2)
value_mesh_value other_value
0 5.0 0
1 5.0 0
2 4.0 1
3 4.0 1
4 4.0 1
5 2.5 2
6 2.5 2
Or slightly different:
Use df.assign to add a column with the result of df.value.div(df.mesh_value), and reindex / rename in the same way as above.
Use df.drop to get rid of the columns you don't want (value, mesh_value) and use df.iloc to change the column order (we want ['value_mesh_value', 'other_value'] rather than the other way around, hence [1, 0]). And again, reset the index.
We wrap all of this in parentheses and assign it to df2.
df2 = (df.assign(tmp=df.value.div(df.mesh_value))
         .reindex(np.repeat(df.index, df.mesh_value))
         .rename(columns={'tmp': 'value_mesh_value'})
         .drop(columns=['value', 'mesh_value'])
         .iloc[:, [1, 0]]
         .reset_index(drop=True))
# same result
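An arguably simpler route to the same result (a sketch) is to expand the rows with Index.repeat first and divide afterwards:
# repeat each row mesh_value times, then compute the ratio per expanded row
rep = df.loc[df.index.repeat(df.mesh_value)]
df2 = (rep.assign(value_mesh_value=rep.value / rep.mesh_value)
          [['value_mesh_value', 'other_value']]
          .reset_index(drop=True))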
I have two dataframes with common columns. I would like to create a new column that contains the difference between two columns (one from each dataframe) based on a condition from a third column.
df_a:
Time Volume ID
1 5 1
2 6 2
3 7 3
df_b:
Time Volume ID
1 2 2
2 3 1
3 4 3
The output appends a new column to df_a with the difference between the Volume columns (df_a.Volume - df_b.Volume) where the two IDs are equal.
df_a:
Time Volume ID Diff
1 5 1 2
2 6 2 4
3 7 3 3
If ID is unique per row in each dataframe:
df_a['Diff'] = df_a['Volume'] - df_a['ID'].map(df_b.set_index('ID')['Volume'])
Output:
Time Volume ID Diff
0 1 5 1 2
1 2 6 2 4
2 3 7 3 3
An option is to merge the two dfs on ID and then calculate Diff:
df_a = df_a.merge(df_b.drop(['Time'], axis=1), on="ID", suffixes=['', '2'])
df_a['Diff'] = df_a['Volume'] - df_a['Volume2']
df:
Time Volume ID Volume2 Diff
0 1 5 1 3 2
1 2 6 2 2 4
2 3 7 3 4 3
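If the helper column isn't wanted in the final frame, it can be dropped afterwards (sketch):
df_a = df_a.drop(columns='Volume2')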
Merge the two dataframes on 'ID', then take the difference:
import pandas as pd
df_a = pd.DataFrame({'Time': [1,2,3], 'Volume': [5,6,7], 'ID':[1,2,3]})
df_b = pd.DataFrame({'Time': [1,2,3], 'Volume': [2,3,4], 'ID':[2,1,3]})
merged = pd.merge(df_a, df_b, on='ID')
df_a['Diff'] = merged['Volume_x'] - merged['Volume_y']
print(df_a)
#output:
Time Volume ID Diff
0 1 5 1 2
1 2 6 2 4
2 3 7 3 3
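Note that the last assignment relies on merged sharing df_a's row order and index. A sketch that aligns explicitly on ID instead, so it stays correct even if the merge reorders rows:
# subtract on an ID-aligned index, then map the result back onto df_a
diff = df_a.set_index('ID')['Volume'] - df_b.set_index('ID')['Volume']
df_a['Diff'] = df_a['ID'].map(diff)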
I have a dataset similar to this:
name group val1 val2
John A 3 2
Cici B 4 3
Ian C 2 2
Zhang D 2 1
Zhang E 1 2
Ian F 1 2
John B 2 1
Ian B 1 2
I did a pivot table and it now looks like this, using this piece of code:
df_pivot = pd.pivot_table(df, values=['val1', 'val2'], index=['name', 'group']).reset_index()
df_pivot
name group val1 val2
John A 3 2
John B 2 1
Ian C 2 2
Ian F 1 2
Ian B 1 2
Zhang D 2 1
Zhang E 1 2
Cici B 4 3
After the pivot table, I need to 1) group by name and 2) calculate the delta between groups. Take John as an example.
The output should be:
John  A-B  1  1
Ian   C-F  1  0
      F-B  0  0
      B-C  1  0  (the delta is -1, but we only take the absolute value)
How do I move forward from my pivot table?
Getting each combination to subtract (a-b, a-c, b-c) won't be directly possible with a simple groupby function. I suggest that you pivot your data and use a custom function to calculate each combination of possible differences:
import pandas as pd
import itertools

def combo_subtraction(df, level=0):
    unique_groups = df.columns.levels[level]
    combos = itertools.combinations(unique_groups, 2)
    pieces = {}
    for g1, g2 in combos:
        name = "{}-{}".format(g1, g2)
        pieces[name] = df.xs(g1, level=level, axis=1) - df.xs(g2, level=level, axis=1)
    return pd.concat(pieces)

out = (df.pivot(index="name", columns="group")  # convert data to wide format
       .pipe(combo_subtraction, level=1)        # apply our combination subtraction
       .dropna()                                # clean up the result
       .swaplevel()
       .sort_index())
print(out)
val1 val2
name
Ian A-B 0.0 0.0
A-C -1.0 0.0
B-C -1.0 0.0
John A-B 1.0 1.0
Zhang A-B 1.0 -1.0
The combo_subtraction function simply iterates over all possible pairs of the groups ("A", "B", and "C" in this output) and performs the subtraction operation. It then concatenates the results of these combinations back together, forming our result.
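Since the question asks for absolute deltas, one way (a sketch) is to chain .abs() onto the result:
out_abs = out.abs()  # |delta|, per the "we only take the absolute value" requirement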
df = pd.DataFrame([["Alpha", 3, 2, 4], ["Bravo", 2, 3, 1], ["Charlie", 4, 1, 3], ["Delta", 1, 4, 2]],
                  columns=["Company", "Running", "Combat", "Range"])
print(df)
Company Running Combat Range
0 Alpha 3 2 4
1 Bravo 2 3 1
2 Charlie 4 1 3
3 Delta 1 4 2
Hi, I am trying to sort the following dataframe so that the rows are arranged with the best performing company across the three columns at the top. In this case that would be Bravo company, as it is 2 in Running, 3 in Combat and 1 in Range.
Would this approach work if the list has a lot more companies and it is hard to know the exact "best performing company"?
I have tried:
df_sort = df.sort_values(['Running', 'Combat', 'Range'], ascending=[True, True, True])
current output:
Company Running Combat Range
3 Delta 1 4 2
1 Bravo 2 3 1
0 Alpha 3 2 4
2 Charlie 4 1 3
but it doesn't turn out how I wanted it to be. Can this be done through pandas?
I was expecting the output to be:
Company Running Combat Range
0 Bravo 2 3 1
1 Delta 1 4 2
2 Charlie 4 1 3
3 Alpha 3 2 4
If you want to sort by the mean of each row, first compute the row-wise mean, then use Series.argsort to get the positions of the sorted values, and finally reorder the rows with DataFrame.iloc:
df1 = df.iloc[df.mean(axis=1, numeric_only=True).argsort()]  # numeric_only skips the Company column
print(df1)
Company Running Combat Range
1 Bravo 2 3 1
3 Delta 1 4 2
2 Charlie 4 1 3
0 Alpha 3 2 4
EDIT: If you need to remove some columns first, use DataFrame.drop:
cols = ['Overall', 'Subordination']
df2 = text_df.iloc[text_df.drop(cols, axis=1).mean(axis=1, numeric_only=True).argsort()]
print(df2)
Company Running Combat Overall Subordination Range
1 Bravo 2 3 0.70 Poor 1
3 Delta 1 4 0.83 Good 2
2 Charlie 4 1 0.81 Good 3
0 Alpha 3 2 0.91 Excellent 4
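An equivalent sketch that makes the ranking explicit via a helper column, naming the score columns directly so non-numeric columns never enter the mean:
# add the row mean as a temporary column, sort by it, then drop it
df_sorted = (df.assign(avg=df[['Running', 'Combat', 'Range']].mean(axis=1))
               .sort_values('avg')
               .drop(columns='avg'))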
I have this Pandas dataframe which is a single year snapshot:
data = pd.DataFrame({'ID' : (1, 2),
'area': (2, 3),
'population' : (100, 200),
'demand' : (100, 200)})
I want to make this into a time series where population grows by 10% per year and demand grows by 20% per year. In this example I do this for two extra years.
This should be the output (note: it includes an added 'year' column):
output = pd.DataFrame({'ID': (1,2,1,2,1,2),
'year': (1,1,2,2,3,3),
'area': (2,3,2,3,2,3),
'population': (100,200,110,220,121,242),
'demand': (100,200,120,240,144,288)})
Setup variables:
k = 5     # number of years to forecast
a = 1.20  # demand growth
b = 1.10  # population growth
Forecast dataframe:
df_out = (data[['ID', 'area']]
          .merge(pd.concat([data[['demand', 'population']]
                            .mul([pow(a, i), pow(b, i)])
                            .assign(year=i + 1)
                            for i in range(k)]),
                 left_index=True, right_index=True)
          .sort_values(by='year'))
print(df_out)
Output:
ID area demand population year
0 1 2 100.00 100.00 1
1 2 3 200.00 200.00 1
0 1 2 120.00 110.00 2
1 2 3 240.00 220.00 2
0 1 2 144.00 121.00 3
1 2 3 288.00 242.00 3
0 1 2 172.80 133.10 4
1 2 3 345.60 266.20 4
0 1 2 207.36 146.41 5
1 2 3 414.72 292.82 5
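If you want the exact column order from the question, a final reordering (sketch):
df_out = df_out[['ID', 'year', 'area', 'population', 'demand']].reset_index(drop=True)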
The idea, step by step:
- create a numpy array with [1.2, 1.1] that I repeat and cumprod
- prepend a row of ones [1.0, 1.0] to account for the initial condition
- multiply by the values of a conveniently stacked pd.Series
- pass the result to the pd.DataFrame constructor
- clean up indices and whatnot
import numpy as np

k = 5
cols = ['ID', 'area']

# growth factors repeated k times and cumulatively multiplied, with a row of
# ones prepended for the initial year; each factor is then duplicated so it
# lines up with the two stacked rows of data
cum_ret = np.vstack(
    [np.ones((1, 2)),
     np.array([[1.2, 1.1]])[[0] * k].cumprod(0)]
)[:, [0, 0, 1, 1]]

# stack the snapshot into a Series keyed by (column, ID, area)
s = data.set_index(cols).unstack(cols)

pd.DataFrame(
    cum_ret * s.values,
    columns=s.index
).stack(cols).reset_index(cols).reset_index(drop=True)
ID area demand population
0 1 2 100.000 100.000
1 2 3 200.000 200.000
2 1 2 120.000 110.000
3 2 3 240.000 220.000
4 1 2 144.000 121.000
5 2 3 288.000 242.000
6 1 2 172.800 133.100
7 2 3 345.600 266.200
8 1 2 207.360 146.410
9 2 3 414.720 292.820
10 1 2 248.832 161.051
11 2 3 497.664 322.102
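For comparison, a plain year-by-year loop (a sketch under the stated growth rates, 10% for population and 20% for demand) builds the same kind of frame and is easy to audit:
import pandas as pd

k = 5
frames = []
pop, dem = data['population'], data['demand']
for year in range(1, k + 1):
    frames.append(data.assign(year=year, population=pop, demand=dem))
    pop, dem = pop * 1.10, dem * 1.20  # grow for the next year
forecast = pd.concat(frames, ignore_index=True)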