Find all duplicate rows in a pandas dataframe

I would like to be able to get the indices of all the instances of a duplicated row in a dataset without knowing the name and number of columns beforehand. So assume I have this:
col
1 | 1
2 | 2
3 | 1
4 | 1
5 | 2
I'd like to be able to get [1, 3, 4] and [2, 5]. Is there any way to achieve this? It sounds really simple, but since I don't know the columns beforehand I can't do something like df[col == x...].

First keep only the duplicated rows with duplicated(keep=False), then either use groupby with apply, or convert the index with to_series and group that:
df = df[df.col.duplicated(keep=False)]
a = df.groupby('col').apply(lambda x: list(x.index))
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
a = df.index.to_series().groupby(df.col).apply(list)
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
If you need nested lists instead:
L = df.groupby('col').apply(lambda x: list(x.index)).tolist()
print (L)
[[1, 3, 4], [2, 5]]
If only the first column should be used, it can be selected by position with iloc:
a = (df[df.iloc[:, 0].duplicated(keep=False)]
     .groupby(df.iloc[:, 0])
     .apply(lambda x: list(x.index)))
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
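
Since the question asks for a solution that does not depend on knowing the column names, here is a minimal sketch that generalizes the to_series approach above to every column. The helper name duplicate_index_groups and the sample data are hypothetical, and the sketch assumes a unique index and no NaN values in the rows (groupby drops NaN keys by default).

```python
import pandas as pd


def duplicate_index_groups(df):
    """Return one list of index labels per group of fully identical rows."""
    # Keep every row that has at least one identical twin, across all columns.
    dup = df[df.duplicated(keep=False)]
    # Group the index labels by the values of every column, whatever they are.
    keys = [dup[name] for name in dup.columns]
    return dup.index.to_series().groupby(keys).apply(list).tolist()


sample = pd.DataFrame(
    {"a": [1, 2, 1, 1, 2, 3], "b": ["x", "y", "x", "x", "y", "z"]},
    index=[1, 2, 3, 4, 5, 6],
)
groups = duplicate_index_groups(sample)
print(groups)
```

Row 6 is unique, so it does not appear in any group.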

Related

Use a values of list to filter Pandas dataframe

With pandas, I'm trying to keep only the rows whose list contains a specified value. I tried Python's in operator, but it doesn't work. Is there a way to do this without looping?
import pandas as pd
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [[1, 2, 3], [2, 3], [3], [1, 2, 3]]})
df = 1 in df['B']
A B
0 1 [1, 2, 3]
1 2 [2, 3]
2 3 [3]
3 4 [1, 2, 3]
I'm trying to keep the rows where the list in column B contains 1, so the expected output is:
A B
0 1 [1, 2, 3]
3 4 [1, 2, 3]
but the result is always True.
Any help or explanation is welcome. Thank you!

You need to use a loop/list comprehension:
out = df[[1 in l for l in df['B']]]
A pandas version would be more verbose and less efficient:
out = df[df['B'].explode().eq(1).groupby(level=0).any()]
Output:
A B
0 1 [1, 2, 3]
3 4 [1, 2, 3]
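
As a side note on why the attempt in the question always gives True: the in operator applied to a Series tests the index labels, not the values. The sketch below, which reuses the question's sample data, contrasts that check with the per-row membership test; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [[1, 2, 3], [2, 3], [3], [1, 2, 3]]})

# `in` on a Series looks at the index labels (0, 1, 2, 3), so this is True
# no matter what the lists contain.
label_check = 1 in df["B"]

# Testing each list separately yields one boolean per row, usable as a mask.
mask = [1 in values for values in df["B"]]
out = df[mask]
print(label_check)
print(out)
```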

count number of elements in a list inside a dataframe

Assume we have a dataframe with columns that contain lists. How can I count the number of elements in each list? For example:
A B
(1,2,3) (1,2,3,4)
(1) (1,2,3)
I would like to create 2 new columns holding the element count for each column, something like the following:
A B C D
(1,2,3) (1,2,3,4) 3 4
(1) (1,2,3) 1 3
where C is the number of elements in the list in column A for that row, and D is the number of elements in the list in column B for that row.
I cannot just do
df['A'] = len(df['A'])
because that returns the length of my dataframe.

You can use the .apply method on the Series for the column df['A'].
>>> import pandas as pd
>>> pd.DataFrame({"column": [[1, 2], [1], [1, 2, 3]]})
column
0 [1, 2]
1 [1]
2 [1, 2, 3]
>>> df = pd.DataFrame({"column": [[1, 2], [1], [1, 2, 3]]})
>>> df["column"].apply
<bound method Series.apply of 0 [1, 2]
1 [1]
2 [1, 2, 3]
Name: column, dtype: object>
>>> df["column"].apply(len)
0 2
1 1
2 3
Name: column, dtype: int64
>>> df["column"] = df["column"].apply(len)
>>>
See Python Pandas, apply function for a more general discussion of apply.

You can use pandas' apply with the len function on each column, as below, to get what you are looking for:
# import the package
import pandas as pd
# create a sample dataframe
df = pd.DataFrame(
    {
        'A': [[1, 2, 3], [32, 4], [45, 67, 23, 54, 3], [], [0]],
        'B': [[2], [3], [2, 3], [5, 6, 1], [98, 44]]
    },
    index=['z', 'y', 'm', 'n', 'o']
)
# compute the length of the list in each row of each column
df['items_in_A'] = df['A'].apply(len)
df['items_in_B'] = df['B'].apply(len)
# check the output
print(df)
Output:
A B items_in_A items_in_B
z [1, 2, 3] [2] 3 1
y [32, 4] [3] 2 1
m [45, 67, 23, 54, 3] [2, 3] 5 2
n [] [5, 6, 1] 0 3
o [0] [98, 44] 1 2
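
As an alternative to apply(len), the .str.len() accessor also computes the length of each element when the elements are lists or tuples. The sketch below uses the question's A/B/C/D layout with lists standing in for the tuples; it assumes every cell holds a sequence (a missing value would produce NaN and a float column).

```python
import pandas as pd

df = pd.DataFrame({
    "A": [[1, 2, 3], [1]],
    "B": [[1, 2, 3, 4], [1, 2, 3]],
})

# .str.len() returns the length of each element, not the length of the column.
df["C"] = df["A"].str.len()
df["D"] = df["B"].str.len()
print(df)
```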

How to sort a Python DataFrame by second element of list

The title is a bit confusing, but essentially I have a DataFrame with two columns, one for the character ("c") and one for the character's coordinates ("loc"). I would like to sort the DataFrame by the Y coordinate. So far I have managed to sort it by the X coordinate using the sort_values() function:
df = pd.DataFrame({"c":["i", "a"," d","m"], "loc":[[1, 2], [3, 3], [4, 2], [3,5]]})
df.sort_values(by=["loc"], inplace=True)
which outputs:
c loc
0 i [1, 2]
1 a [3, 3]
3 m [3, 5]
2 d [4, 2]
The output I am aiming for is:
c loc
0 i [1, 2]
2 d [4, 2]
1 a [3, 3]
3 m [3, 5]
Looping through the DataFrame and swapping the x and y values is not an option, as the full DataFrame will be quite large. I think this should be possible, since newer versions of DataFrame.sort_values() accept a "key" argument, but I am not familiar enough with "key" to use it properly.

Use the key parameter of sort_values:
df.sort_values(by='loc', key=lambda x: x.str[1])
Output:
c loc
0 i [1, 2]
2 d [4, 2]
1 a [3, 3]
3 m [3, 5]
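
To make the role of key more explicit: sort_values passes the whole column to the key function and expects back a Series of the same length, whose values are then used for ordering. A minimal sketch with a named function follows; second_element is a hypothetical helper, and kind="stable" is an addition here so that rows with equal Y keep their original relative order.

```python
import pandas as pd

df = pd.DataFrame({
    "c": ["i", "a", "d", "m"],
    "loc": [[1, 2], [3, 3], [4, 2], [3, 5]],
})


def second_element(column):
    # Receives the whole "loc" Series; returns a Series holding each Y value.
    return column.map(lambda pair: pair[1])


# sort_values returns a new DataFrame; df itself is left unchanged.
by_y = df.sort_values(by="loc", key=second_element, kind="stable")
print(by_y)
```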

Python - pick a value from a list basing on another list

I've got a dataframe. Column A holds a list of integers and column B holds an integer. I want to pick the n-th value of the list in column A, where n is the number in column B. So if column A contains [1,5,6,3,4] and column B contains 2, I want to get 6.
I tried this:
result = [y[x] for y in df['A'] for x in df['B']]
But it doesn't work. Please help.

Use zip with list comprehension:
df['new'] = [y[x] for x, y in zip(df['B'], df['A'])]
print (df)
A B new
0 [1, 2, 3, 4, 5] 1 2
1 [1, 2, 3, 4] 2 3

You can also use apply along the rows, for example:
df = pd.DataFrame({'A':[[1,2,3,4,5],[1,2,3,4]],'B':[1,2]})
A B
0 [1, 2, 3, 4, 5] 1
1 [1, 2, 3, 4] 2
# df.apply(lambda x: np.array(x['A'])[x['B']], axis=1)
# You don't need np.array here; use it when column B is also a list.
df.apply(lambda x: x['A'][x['B']], axis=1)  # Thanks @Zero
0 2
1 3
dtype: int64
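
Both answers raise an IndexError if a value in B is not a valid position in the matching list. If that can happen in your data, a small guard is one option. This is a sketch only: pick is a hypothetical helper, the third row is made up to show the out-of-range case, and returning None for an invalid position is an assumption about the desired behaviour.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [[1, 5, 6, 3, 4], [1, 2, 3, 4], [7]],
    "B": [2, 2, 3],
})


def pick(values, position):
    # Return the element at `position`, or None when it is out of range.
    if -len(values) <= position < len(values):
        return values[position]
    return None


picked = [pick(values, position) for values, position in zip(df["A"], df["B"])]
df["new"] = picked
print(picked)
```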

python pandas filtering involving lists

I am currently using pandas in Python 2.7. My dataframe looks similar to this:
>>> df
0
1 [1, 2]
2 [2, 3]
3 [4, 5]
Is it possible to filter rows by the values in that column? For example, if my filter value is 2, the filter should return a dataframe containing the first two rows.
I have tried a couple of ways already. The best I can think of is a list comprehension that returns the index of the rows in which the value exists; I could then filter the dataframe with that list of indices. But this would be very slow if I want to filter multiple times with different values. Ideally, I would like something that uses the built-in pandas functions in order to speed up the process.

You can use boolean indexing:
import pandas as pd
df = pd.DataFrame({'0':[[1, 2],[2, 3], [4, 5]]})
print (df)
0
0 [1, 2]
1 [2, 3]
2 [4, 5]
print (df['0'].apply(lambda x: 2 in x))
0 True
1 True
2 False
Name: 0, dtype: bool
print (df[df['0'].apply(lambda x: 2 in x)])
0
0 [1, 2]
1 [2, 3]

You can also use boolean indexing with a list comprehension:
>>> df[[2 in row for row in df['0']]]
0
0 [1, 2]
1 [2, 3]
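
If the same column has to be filtered many times with different values, one possibility is to flatten the lists once and reuse the result for every lookup. The sketch below assumes a pandas version that provides Series.explode (0.25 or later, so not the Python 2.7 setup from the question) and a unique index; rows_containing is a hypothetical helper, and whether this is actually faster depends on the data.

```python
import pandas as pd

df = pd.DataFrame({"0": [[1, 2], [2, 3], [4, 5]]})

# Flatten once: one row per list element, each keeping its original index label.
exploded = df["0"].explode()


def rows_containing(value):
    # Index labels of the original rows whose list holds `value`.
    hits = exploded[exploded == value].index.unique()
    return df.loc[hits]


print(rows_containing(2))
print(rows_containing(4))
```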
