An example DataFrame:
import pandas as pd

df = pd.DataFrame({'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
                   'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
                   'value':  [2, 8, 1, 8, 7, 3, 1, 3, 2]})
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
3 Y X 8
4 Y Y 7
5 Y Z 3
6 Z X 1
7 Z Y 3
8 Z Z 2
I need to remove reversed duplicates, e.g. keep node_a = 'X', node_b = 'Y' but remove node_a = 'Y', node_b = 'X'.
Desired output:
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
4 Y Y 7
5 Y Z 3
8 Z Z 2
Please note I need a general solution, not one specific to this actual data.
Let's use np.sort along axis=1 to sort node_a and node_b in each row, assign the sorted values to helper columns, and then use drop_duplicates on those columns to drop the duplicate entries:
import numpy as np

df[['x', 'y']] = np.sort(df[['node_a', 'node_b']], axis=1)
out = df.drop_duplicates(['x', 'y']).drop(columns=['x', 'y'])
Result:
print(out)
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
4 Y Y 7
5 Y Z 3
8 Z Z 2
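The same idea also works without adding helper columns to df — a small variation of my own on the answer above:
# mark rows whose sorted (node_a, node_b) pair has been seen before
dup_mask = pd.DataFrame(np.sort(df[['node_a', 'node_b']], axis=1)).duplicated().to_numpy()
out = df[~dup_mask]
Here the sorted pairs live in a throwaway frame, and the boolean mask is applied to df positionally.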
You could do the following:
# duplicates regardless of order
un_dups = pd.Series([frozenset(row) for row in df[['node_a', 'node_b']].to_numpy()]).duplicated()
# duplicates with the same order
o_dups = df.duplicated(subset=['node_a', 'node_b'])
# keep only rows that are not reverse-order duplicates (xor of the two masks)
mask = ~(un_dups ^ o_dups)
print(df[mask])
Output
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
4 Y Y 7
5 Y Z 3
8 Z Z 2
The idea is to create a mask that is False for rows that are duplicates in reverse order.
To better understand the approach see the truth values:
node_a node_b value un_dups o_dups xor
0 X X 2 False False False
1 X Y 8 False False False
2 X Z 1 False False False
3 Y X 8 True False True
4 Y Y 7 False False False
5 Y Z 3 False False False
6 Z X 1 True False True
7 Z Y 3 True False True
8 Z Z 2 False False False
As you can see, the xor (exclusive or) is True whenever its inputs differ. Since a duplicate in the same order is necessarily also a duplicate when unordered, the xor is True exactly for the rows that are duplicates in reverse order only.
Finally, notice that the mask is the negation of the xor, i.e. it keeps the rows that are not reverse-order duplicates.
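For reference, here is a small sketch (mine, not part of the answer) that reproduces the truth table above, minus the value column:
check = df[['node_a', 'node_b']].copy()
# unordered duplicates: frozenset ignores the order within each pair
check['un_dups'] = pd.Series([frozenset(r) for r in check.to_numpy()]).duplicated().to_numpy()
# ordered duplicates
check['o_dups'] = df.duplicated(subset=['node_a', 'node_b']).to_numpy()
check['xor'] = check['un_dups'] ^ check['o_dups']
print(check)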
Here's one way to do it: create a temporary column holding the sorted pair of node_a and node_b for each row, then drop duplicates on it, keeping the first occurrence of each ordering:
df['sorted'] = df.apply(lambda x: ''.join(sorted([x['node_a'], x['node_b']])), axis=1)
# node_a node_b value sorted
# 0 X X 2 XX
# 1 X Y 8 XY
# 2 X Z 1 XZ
# 3 Y X 8 XY
# 4 Y Y 7 YY
# 5 Y Z 3 YZ
# 6 Z X 1 XZ
# 7 Z Y 3 YZ
# 8 Z Z 2 ZZ
df.drop_duplicates(subset='sorted').drop('sorted', axis=1)
# node_a node_b value
# 0 X X 2
# 1 X Y 8
# 2 X Z 1
# 4 Y Y 7
# 5 Y Z 3
# 8 Z Z 2
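One caveat not mentioned in the answer: ''.join only works if node_a and node_b hold strings. For arbitrary hashable values, a tuple of the sorted pair works just as well:
# tuples are hashable, so drop_duplicates can still deduplicate on them
df['sorted'] = df.apply(lambda x: tuple(sorted([x['node_a'], x['node_b']])), axis=1)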
I have a pandas DataFrame, say:
x  y  z
1  a  x
1  b  y
1  c  z
2  a  x
2  b  x
3  a  y
4  a  z
If I want the top 2 values by x, I mean the rows whose x value is among the 2 most frequent values in the x column, which gives:
x  y  z
1  a  x
1  b  y
1  c  z
2  a  x
2  b  x
Likewise, the top 2 values by y means the rows whose y value is among the 2 most frequent values in the y column, which gives:
x  y  z
1  a  x
1  b  y
2  a  x
2  b  x
3  a  y
4  a  z
How can I achieve this?
You can use:
>>> df[df['x'].isin(df['x'].value_counts().head(2).index)]
x y z
0 1 a x
1 1 b y
2 1 c z
3 2 a x
4 2 b x
>>> df[df['y'].isin(df['y'].value_counts().head(2).index)]
x y z
0 1 a x
1 1 b y
3 2 a x
4 2 b x
5 3 a y
6 4 a z
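If you need this repeatedly, the idiom generalizes into a small helper (the name top_k_by is my own, not a pandas function):
def top_k_by(df, col, k):
    # keep rows whose value in `col` is among the k most frequent values
    top = df[col].value_counts().head(k).index
    return df[df[col].isin(top)]
Calling top_k_by(df, 'x', 2) reproduces the first output above.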
def select_top_k(df, col, top_k):
    # group by `col`, take the first top_k group keys (in groupby's sorted
    # key order), and keep only the rows belonging to those groups
    grouping_df = df.groupby(col)
    gr_list = list(grouping_df.groups)[:top_k]
    temp = grouping_df.filter(lambda x: x[col].iloc[0] in gr_list)
    return temp
import pandas as pd

data = {'x': [1, 1, 1, 2, 2, 3, 4],
        'y': ['a', 'b', 'c', 'a', 'b', 'a', 'a'],
        'z': ['x', 'y', 'z', 'x', 'x', 'y', 'z']}
df = pd.DataFrame(data)

col = 'x'
top_k = 2
select_top_k(df, col, top_k)
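Note that select_top_k keeps the first top_k group keys in groupby's sorted key order, not the k most frequent values, so it only matches the value_counts approach when the most frequent values also happen to sort first (as they do here).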
I am new to Python. The problem statement is: we have the below data as a DataFrame
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10], 'value':[x,x,y,x,x,x,y,x,z,x,x,y,y,z]})
Diff value
1 x
1 x
2 y
3 x
4 x
4 x
5 y
6 x
7 z
7 x
8 x
9 y
9 y
10 z
we need to bin the Diff column in steps of 3 (let's say), like 0-3, 3-6, 6-9, >=9, and count the values in each bin
Expected output is like:
Diff  x  y  z
0-3   2  1
3-6   3  1
6-9   3     1
>=9      2  1
Example
The example code in the question is wrong (the value entries are unquoted names). Anyone who wants to follow along can use the following instead:
import pandas as pd

df = pd.DataFrame({'Diff': [1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9, 9, 10],
                   'value': 'x,x,y,x,x,x,y,x,z,x,x,y,y,z'.split(',')})
Code
labels = ['0-3', '3-6', '6-9', '>=9']
# right=False makes the bins left-closed: [0, 3), [3, 6), [6, 9), [9, inf)
grouper = pd.cut(df['Diff'], bins=[0, 3, 6, 9, float('inf')], right=False, labels=labels)
pd.crosstab(grouper, df['value'])
output:
value x y z
Diff
0-3 2 1 0
3-6 3 1 0
6-9 3 0 1
>=9 0 2 1
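For reference, the same table can be built without crosstab by reusing the grouper above — a sketch assuming the same df:
# observed=False keeps all four bin labels even if one were empty
out = df.groupby([grouper, 'value'], observed=False).size().unstack(fill_value=0)
print(out)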
I have the following code, where my dataframe contains three columns to be summed (plus one other column):
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
oneframe = pd.concat([df['toBeSummed'],df['toBeSummed2'],df['toBesummed3']], axis=1).reset_index()
temp = oneframe.groupby(['toBeSummed']).size().reset_index()
temp2 = oneframe.groupby(['toBeSummed2']).size().reset_index()
temp3 = oneframe.groupby(['toBesummed3']).size().reset_index()
temp.columns.values[0] = "SameName"
temp2.columns.values[0] = "SameName"
temp3.columns.values[0] = "SameName"
final = pd.concat([temp,temp2,temp3]).groupby(['SameName']).sum().reset_index()
final.columns.values[0] = "Letter"
final.columns.values[1] = "Sum"
The problem is that this code counts every occurrence of each value, so calling final results in
Letter Sum
0 X 3
1 Y 4
2 Z 5
However, I want a value counted only once per row, even if it appears there multiple times (i.e. the first row has two X's, so X should count just once for that row).
Meaning the desired output is
Letter Sum
0 X 2
1 Y 3
2 Z 3
I can update or add more comments if this is confusing.
Given df:
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
Doing:
sum_cols = ['toBeSummed', 'toBeSummed2', 'toBesummed3']
# note axis=1: take the unique letters within each row, so a letter repeated
# in the same row counts only once
out = df[sum_cols].apply(lambda x: x.unique(), axis=1).explode().value_counts()
print(out.to_frame('Sum'))
Output:
Sum
Y 3
Z 3
X 2
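The axis=1 is what makes this per-row: x.unique() sees one row at a time, so a letter repeated within a row contributes only once. With the default axis=0 you would get per-column uniques instead, which merely happens to produce the same counts on this particular data.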
I want to generate a unique id for each sequence in a pandas dataframe, where the start of each sequence is flagged in another column.
I have the X, Y, and BOOL columns and want to generate the NEW_ID column:
X Y BOOL NEW_ID
x y TRUE 1
x y FALSE 1
x y FALSE 1
x y TRUE 2
x y FALSE 2
x y FALSE 2
x y FALSE 2
x y TRUE 3
x y TRUE 4
x y FALSE 4
I am trying to find a solution without any for loops, as I have a large dataframe and looping takes too long.
Use cumsum on the BOOL column; each True marks the start of a new sequence, so the running sum yields the id:
df['New_ID'] = df.BOOL.cumsum()
df
Out[39]:
X Y BOOL NEW_ID New_ID
0 x y True 1 1
1 x y False 1 1
2 x y False 1 1
3 x y True 2 2
4 x y False 2 2
5 x y False 2 2
6 x y False 2 2
7 x y True 3 3
8 x y True 4 4
9 x y False 4 4
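If BOOL is actually stored as the strings 'TRUE'/'FALSE' (as the question's table suggests) rather than real booleans, convert it first — a sketch:
# True counts as 1 in the cumulative sum, so each 'TRUE' starts a new id
df['New_ID'] = df['BOOL'].eq('TRUE').cumsum()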
I have a large(ish) set of experimental data that contains pairs of values. Each pair is associated with a particular barcode. Ideally, each pair should have a unique barcode. Unfortunately, it turns out that I screwed something up during the experiment. Now several pairs share a single barcode. I need to exclude these pairs/barcodes from my analysis.
My data looks kind of like this:
The pairs are in columns 'A' and 'B' -- I just included 'X' to represent some arbitrary associated data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Barcode': ['AABBCC', 'AABBCC', 'BABACC', 'AABBCC', 'DABBAC', 'ABDABD', 'DABBAC'],
                   'A': ['v', 'v', 'x', 'y', 'z', 'h', 'z'],
                   'B': ['h', 'h', 'j', 'k', 'l', 'v', 'l'],
                   'X': np.random.randint(10, size=7)})
df = df[['Barcode', 'A', 'B', 'X']]
df
Barcode A B X
0 AABBCC v h 8
1 AABBCC v h 7
2 BABACC x j 2
3 AABBCC y k 3
4 DABBAC z l 8
5 ABDABD h v 0
6 DABBAC z l 4
I want to get rid of the rows described by barcode 'AABBCC', since this barcode is associated with two different pairs (rows 0 and 1 are both the same pair -- which is fine -- but, row 3 is a different pair).
df.loc[df.Barcode != 'AABBCC']
Barcode A B X
2 BABACC x j 6
4 DABBAC z l 0
5 ABDABD h v 7
6 DABBAC z l 5
My solution thus far:
def duplicates(bar):
    if len(df.loc[df.Barcode == bar].A.unique()) > 1 or len(df.loc[df.Barcode == bar].B.unique()) > 1:
        return 'collision'
    else:
        return 'single'

df['Barcode_collision'] = df.apply(lambda row: duplicates(row['Barcode']), axis=1)
df.loc[df.Barcode_collision == 'single']
Barcode A B X Barcode_collision
2 BABACC x j 6 single
4 DABBAC z l 0 single
5 ABDABD h v 7 single
6 DABBAC z l 5 single
Unfortunately, this is very slow with a large dataframe (~500,000 rows) using my delicate computer. I'm sure there must be a better/faster way. Maybe using the groupby function?
df.groupby(['Barcode', 'A', 'B']).count()
X
Barcode A B
AABBCC v h 2
y k 1
ABDABD h v 1
BABACC x j 1
DABBAC z l 2
Then filtering out rows that have more than one value in the second or third indexes? But my brain and my googling skills can't seem to get me further than this...
You can use filter, keeping only barcodes where both A and B are constant within the group (note the and — a barcode whose rows agree in A but differ in B still holds two different pairs, matching the negation of your duplicates function):
print(df.groupby('Barcode').filter(lambda x: ((x.A.nunique() == 1) and (x.B.nunique() == 1))))
Barcode A B X Barcode_collision
2 BABACC x j 4 single
4 DABBAC z l 9 single
5 ABDABD h v 3 single
6 DABBAC z l 9 single
Another solution with transform and boolean indexing, again requiring both columns to be constant per barcode:
g = df.groupby('Barcode')
A = g.A.transform('nunique')
B = g.B.transform('nunique')
print(df[(A == 1) & (B == 1)])
Barcode A B X Barcode_collision
2 BABACC x j 2 single
4 DABBAC z l 6 single
5 ABDABD h v 1 single
6 DABBAC z l 3 single
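A third variation (my own) that states the requirement most directly — keep only barcodes with exactly one distinct (A, B) pair:
out = df.groupby('Barcode').filter(lambda g: len(g[['A', 'B']].drop_duplicates()) == 1)
print(out)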