I have a pandas DataFrame and I am looking for a simple way to identify duplicate rows, i.e. rows that contain the same values irrespective of the order of the columns.
For example:
df = pd.DataFrame([[1, 3], [4, 2], [3, 1], [2, 3], [2, 4], [1, 3]], columns=["a", "b"])
print(df)
   a  b
0  1  3
1  4  2
2  3  1
3  2  3
4  2  4
5  1  3
The code should identify rows (0, 2, 5) and (1, 4) as the two groups of duplicates.
I can't think of anything more efficient than storing each row as a sorted pair in a set and then finding the duplicates. Can you suggest a better method? The data frame is quite big, so my approach is very inefficient.
You could do this by applying np.sort along axis=1 and then grouping:
import numpy as np

# sort each row so column order no longer matters, then group identical rows
u = pd.DataFrame(np.sort(df, axis=1), index=df.index)
[tuple(g.index) for _, g in u[u.duplicated(keep=False)].groupby(list(u.columns))]
[(0, 2, 5), (1, 4)]
Or similarly:
u[u.duplicated(keep=False)].groupby(list(u.columns)).groups.values()
Outputs:
dict_values([Int64Index([0, 2, 5], dtype='int64'), Int64Index([1, 4], dtype='int64')])
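Not part of the answer above, but if no row ever repeats a value within itself (note that frozenset would collapse a row like [3, 3] into {3}), a frozenset key avoids the sort entirely. A minimal sketch:

import pandas as pd

df = pd.DataFrame([[1, 3], [4, 2], [3, 1], [2, 3], [2, 4], [1, 3]],
                  columns=["a", "b"])

# Hash each row as an order-insensitive frozenset and group on it.
key = df.apply(frozenset, axis=1)
groups = [tuple(g.index) for _, g in df.groupby(key, sort=False) if len(g) > 1]
print(groups)  # [(0, 2, 5), (1, 4)]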
Related
With pandas, I'm trying to filter out rows whose list column contains a specified value. I tried using Python's in operator, but it isn't working. Is there a way to achieve this without looping?
import pandas as pd
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [[1, 2, 3], [2, 3], [3], [1, 2, 3]]})
df = 1 in df['B']
   A          B
0  1  [1, 2, 3]
1  2     [2, 3]
2  3        [3]
3  4  [1, 2, 3]
I'm trying to filter for rows where 1 is in column B, so the expected output is:
   A          B
0  1  [1, 2, 3]
3  4  [1, 2, 3]
but the output is always True.
Any help or explanation is welcome! Thank you.
You need to use a loop/list comprehension, because in applied to a Series tests membership in the index, not inside each cell's list:
out = df[[1 in l for l in df['B']]]
A pandas version would be more verbose and less efficient:
out = df[df['B'].explode().eq(1).groupby(level=0).any()]
Output:
   A          B
0  1  [1, 2, 3]
3  4  [1, 2, 3]
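To see this index-membership behavior concretely, a quick sketch (standard pandas behavior, easy to verify):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [[1, 2, 3], [2, 3], [3], [1, 2, 3]]})

# `in` on a Series checks the row labels, not the cell values:
print(1 in df['B'])   # True  -> 1 is one of the index labels 0..3
print(9 in df['B'])   # False -> 9 is not an index label

# To look inside each list, iterate over the values explicitly:
print(df[[1 in l for l in df['B']]])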
I have a large pandas Series in which each row is a list of numbers.
I want to detect rows that are subsets of other rows and delete them from the Series.
My current solution uses two for loops, but it is very slow. Can anyone suggest a faster approach?
For example, in the sample below we must delete rows 1 and 3 because they are subsets of rows 0 and 2 respectively.
import pandas as pd
cycles = pd.Series([[1, 2, 3, 4], [3, 4], [5, 6, 9, 7], [5, 9]])
First, you could sort the lists (since they contain numbers) and convert them to strings. Then, for every string, simply check whether it is a substring of any of the other rows; if so, it is a subset. Since everything is sorted, we can be sure the order of the numbers will not affect this step.
Finally, keep only the rows that were not identified as subsets.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'cycles': [[9, 5, 4, 3], [9, 5, 4], [2, 4, 3], [2, 3]],
    'members': [4, 3, 3, 2]
})
print(df)
         cycles  members
0  [9, 5, 4, 3]        4
1     [9, 5, 4]        3
2     [2, 4, 3]        3
3        [2, 3]        2
df['cycles'] = df['cycles'].map(np.sort)
df['cycles_str'] = [','.join(map(str, c)) for c in df['cycles']]
# Check for more than one match, because each string matches itself once!
df['is_subset'] = [df['cycles_str'].str.contains(c_str, regex=False).sum() > 1
                   for c_str in df['cycles_str']]
df = df.loc[df['is_subset'] == False]
df = df.drop(['cycles_str', 'is_subset'], axis=1)
         cycles  members
0  [3, 4, 5, 9]        4
2     [2, 3, 4]        3
Edit: the above doesn't work for cases like [1, 2, 4] and [1, 2, 3, 4], where the subset is not a contiguous substring of the superset's string form.
Rewrote the code. This version uses two loops and set.issubset inside a list comprehension:
# check for >1 True, as each row will match with itself once!
df['is_subset'] = [[set(y).issubset(set(x)) for x in df['cycles']].count(True) > 1
                   for y in df['cycles']]
df = df.loc[df['is_subset'] == False]
df = df.drop('is_subset', axis=1)
print(df)
         cycles  members
0  [9, 5, 4, 3]        4
2     [2, 4, 3]        3
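If the quadratic comparison is still too slow on a large Series, one possible refinement (a sketch, assuming only strict subsets should be dropped) is to visit rows from largest to smallest, so each row is compared only against sets that already survived:

import pandas as pd

cycles = pd.Series([[1, 2, 3, 4], [3, 4], [5, 6, 9, 7], [5, 9]])

sets_ = cycles.map(frozenset)
# Visit rows in order of decreasing size: a strict subset can only be
# contained in a strictly larger set, which has already been visited.
order = sets_.map(len).sort_values(ascending=False).index
keep, kept_sets = [], []
for idx in order:
    s = sets_[idx]
    if not any(s < k for k in kept_sets):  # strict-subset test
        keep.append(idx)
        kept_sets.append(s)

print(cycles[sorted(keep)])
# 0    [1, 2, 3, 4]
# 2    [5, 6, 9, 7]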
Say I have tables A and B where
A = [[1, 2],
     [3, 4]]
and
B = [[5, 6],
     [7, 8]]
What I would like to return is the Pandas dataframe
[[(1,5), (2,6)],
[(3,7), (4,8)]].
How can I do this? I want to zip the elements while preserving the shape of the DataFrame. Does anyone know how this is possible?
If both A and B are DataFrames:
pd.concat([A.stack(), B.stack()]).groupby(level=[0, 1]).agg(tuple).unstack()
Out[24]:
        0       1
0  (1, 5)  (2, 6)
1  (3, 7)  (4, 8)
You can zip it:
alist = [list(zip(A[i], B[i])) for i in range(len(A))]
and then convert it to a pandas DataFrame:
pd.DataFrame(alist)
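If you also want to keep A's index and column labels (assuming A and B are DataFrames of the same shape rather than plain lists), a small sketch:

import pandas as pd

A = pd.DataFrame([[1, 2], [3, 4]])
B = pd.DataFrame([[5, 6], [7, 8]])

# Pair up elements position-wise while keeping the original shape and labels.
paired = pd.DataFrame(
    [[(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A.values, B.values)],
    index=A.index, columns=A.columns,
)
print(paired)
#         0       1
# 0  (1, 5)  (2, 6)
# 1  (3, 7)  (4, 8)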
I have a data frame with multiple columns, going from A to Z, and I want to find duplicates across only some of them: which rows have the same values in columns A, D, F, K, L, and P?
I tried:
df = df[df.duplicated(keep=False)]
df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist()
However, this uses all of the columns.
I also tried
print(df[df.duplicated(['A', 'D', 'F', 'K', 'L', 'P'])])
This only returns the indices of the later occurrences. I want the indices of all rows that share the same values, including the first one.
Your final attempt is close. Instead of grouping by all columns, just use a list of the ones you want to consider:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [3, 3, 3, 4, 4, 5],
'C': [6, 7, 8, 9, 10, 11]})
res = df.groupby(['A', 'B']).apply(lambda x: x.index.tolist()).reset_index()
print(res)
#    A  B          0
# 0  1  3  [0, 1, 2]
# 1  2  4     [3, 4]
# 2  2  5        [5]
A different layout of groupby:
df.index.to_series().groupby([df['A'], df['B']]).apply(list)
Out[449]:
A  B
1  3    [0, 1, 2]
2  4       [3, 4]
   5          [5]
dtype: object
You can have .groupby return a dict with the keys being the group labels (tuples for multiple columns) and the values being the Index:
df.groupby(['A', 'B']).groups
#{(1, 3): Int64Index([0, 1, 2], dtype='int64'),
# (2, 4): Int64Index([3, 4], dtype='int64'),
# (2, 5): Int64Index([5], dtype='int64')}
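If you only want the groups that actually contain duplicates, a sketch combining the duplicated filter with .groups (the exact Index repr in the printed dict varies by pandas version):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [3, 3, 3, 4, 4, 5],
                   'C': [6, 7, 8, 9, 10, 11]})

cols = ['A', 'B']  # stand-in for the real subset of columns
# Drop singleton groups first, then collect the surviving indices.
dupes = df[df.duplicated(cols, keep=False)]
print(dupes.groupby(cols).groups)
# {(1, 3): [0, 1, 2], (2, 4): [3, 4]}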
I would like to be able to get the indices of all the instances of a duplicated row in a dataset without knowing the name and number of columns beforehand. So assume I have this:
   col
1    1
2    2
3    1
4    1
5    2
I'd like to be able to get [1, 3, 4] and [2, 5]. Is there any way to achieve this? It sounds really simple, but since I don't know the columns beforehand I can't do something like df[col == x...].
First filter all the duplicated rows, and then use groupby with apply, or convert the index to_series:
df = df[df.col.duplicated(keep=False)]
a = df.groupby('col').apply(lambda x: list(x.index))
print(a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object
a = df.index.to_series().groupby(df.col).apply(list)
print(a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object
And if you need nested lists:
L = df.groupby('col').apply(lambda x: list(x.index)).tolist()
print(L)
[[1, 3, 4], [2, 5]]
If you need to use only the first column, you can select it by position with iloc:
a = (df[df.iloc[:, 0].duplicated(keep=False)]
       .groupby(df.iloc[:, 0])
       .apply(lambda x: list(x.index)))
print(a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object
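And if the column names really are unknown in advance, the same idea works by grouping on every column at once, as in the first answer above. A sketch using the single-column sample data:

import pandas as pd

df = pd.DataFrame({'col': [1, 2, 1, 1, 2]}, index=[1, 2, 3, 4, 5])

# No column names needed: filter the duplicated rows, then group on
# all columns at once.
dupes = df[df.duplicated(keep=False)]
L = dupes.groupby(list(df.columns)).apply(lambda x: list(x.index)).tolist()
print(L)  # [[1, 3, 4], [2, 5]]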