Removing every instance of a duplicate from a dataframe - python

Is there a way to remove every instance of a duplicated row in Pandas? I don't see an option for this in drop_duplicates(). Alternatively, is there a way to get the indices of the duplicates?

Here's one way:
In [11]: df = pd.DataFrame([[1, 2], [1, 2], [1, 2], [3, 4]])
In [12]: df[~(df.duplicated() | df.duplicated(keep='last'))]
Out[12]:
0 1
3 3 4
Perhaps there's a better way!
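A shorter route, sketched here on the same toy frame: both duplicated() and drop_duplicates() accept keep=False, which marks (or drops) every member of a duplicated group rather than all but one. The variable names below are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 2], [1, 2], [3, 4]])

# keep=False flags every row that has at least one identical twin
mask = df.duplicated(keep=False)

# Rows that occur exactly once
unique_rows = df[~mask]

# Same result in a single call
same_result = df.drop_duplicates(keep=False)

# Index labels of all duplicated rows
dup_index = df.index[mask].tolist()

print(unique_rows)
print(dup_index)
```

The mask also answers the second part of the question, since df.index[mask] gives the labels of the duplicated rows.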

Related

How can I merge a list into a dataframe?

I have the following dataframe and list:
df = [[[1,2,3],'a'],[[4,5],'b'],[[6,7,8],'c']]
list = [[1,2,3],[4,5]]
And I want to do an inner merge between them, so I keep only the items they have in common. This would be my result:
df = [[[1,2,3],'a'],[[4,5],'b']]
I have been thinking of converting both to strings, but even if I convert my list to a string, I haven't been able to merge them, because the merge function requires Series or DataFrames (not strings). Any help would be great!
Thanks
If I understand you correctly, you want to keep only the rows of the dataframe whose value in the column (a list) also appears in your list:
lst = [[1, 2, 3], [4, 5]]
print(df[df["col1"].isin(lst)])
Prints:
col1 col2
0 [1, 2, 3] a
1 [4, 5] b
DataFrame used:
col1 col2
0 [1, 2, 3] a
1 [4, 5] b
2 [6, 7, 8] c
Thanks for your answer!
This is what worked for me:
Convert my list to a series (my using DB):
match = pd.Series(','.join(map(str,match)))
Convert the list of my master DB into a string:
df_temp2['match_s'].loc[m] = ','.join(map(str,df_temp2['match'].loc[m]))
Applied an inner merge on both DB:
df_temp3 = df_temp2.merge(match.rename('match'), how='inner',
                          left_on='match_s', right_on='match')
Hope it also works for somebody else :)
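If isin complains about unhashable lists on your pandas version, one alternative that avoids the round trip through strings is to compare hashable tuples instead. This is only a sketch: the column names col1 and col2 follow the answer above, and the names wanted and mask are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [[1, 2, 3], [4, 5], [6, 7, 8]],
    "col2": ["a", "b", "c"],
})
lst = [[1, 2, 3], [4, 5]]

# Lists are unhashable, so build a set of tuples for fast membership tests
wanted = {tuple(item) for item in lst}

# True where the row's list (viewed as a tuple) is one of the wanted items
mask = df["col1"].map(lambda value: tuple(value) in wanted)

result = df[mask]
print(result)
```

The original lists in col1 are left untouched; only the comparison uses tuples.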

Finding duplicate row pairs irrespective of column order

I have a pandas data frame and I am looking for a simple way to identify rows where the values are the same (duplicate), irrespective of the order of the columns.
For example:
df = pd.DataFrame([[1, 3], [4, 2], [3, 1], [2, 3], [2, 4], [1, 3]], columns=["a", "b"])
print(df)
a b
0 1 3
1 4 2
2 3 1
3 2 3
4 2 4
5 1 3
The code should be able to identify the rows (0, 2, 5), and (1, 4) as the duplicate ones respectively.
The only approach I can think of is to store each pair as a set and then look for duplicates among the sets. Can you suggest a better method? The data frame is quite big, so that approach is very inefficient.
You could do this using np.sort on axis=1, then groupby
u = pd.DataFrame(np.sort(df,axis=1),index=df.index)
[tuple(g.index) for _,g in u[u.duplicated(keep=False)].groupby(list(u.columns))]
[(0, 2, 5), (1, 4)]
Or similarly:
u[u.duplicated(keep=False)].groupby(list(u.columns)).groups.values()
Outputs:
dict_values([Int64Index([0, 2, 5], dtype='int64'), Int64Index([1, 4], dtype='int64')])
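For reference, here is the sort-then-group idea as a self-contained script with its imports. It is a sketch on the question's sample data; the names dupes and groups are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 3], [4, 2], [3, 1], [2, 3], [2, 4], [1, 3]],
                  columns=["a", "b"])

# Sort the values inside each row so that (3, 1) and (1, 3) look identical
u = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)

# Keep only rows whose sorted form occurs more than once
dupes = u[u.duplicated(keep=False)]

# One tuple of original index labels per group of matching rows
groups = [tuple(g.index.tolist()) for _, g in dupes.groupby(list(u.columns))]

print(groups)
```

Row 3, (2, 3), has no partner and is therefore dropped before grouping.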

python pandas filtering involving lists

I am currently using Pandas in python 2.7. My dataframe looks similar to this:
>>> df
0
1 [1, 2]
2 [2, 3]
3 [4, 5]
Is it possible to filter rows by values in column 1? For example, if my filter value is 2, the filter should return a dataframe containing the first two rows.
I have tried a couple of ways already. The best thing I can think of is a list comprehension that returns the indices of the rows in which the value exists, and then filtering the dataframe with that list of indices. But this would be very slow if I want to filter multiple times with different values. Ideally, I would like something that uses the built-in Pandas functions to speed up the process.
You can use boolean indexing:
import pandas as pd
df = pd.DataFrame({'0':[[1, 2],[2, 3], [4, 5]]})
print (df)
0
0 [1, 2]
1 [2, 3]
2 [4, 5]
print (df['0'].apply(lambda x: 2 in x))
0 True
1 True
2 False
Name: 0, dtype: bool
print (df[df['0'].apply(lambda x: 2 in x)])
0
0 [1, 2]
1 [2, 3]
You can also use boolean indexing with a list comprehension:
>>> df[[2 in row for row in df['0']]]
0
0 [1, 2]
1 [2, 3]
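Since the question mentions filtering many times with different values, one option is to flatten the lists once with Series.explode (available since pandas 0.25) and reuse the flattened Series for each lookup. This is a sketch, not a benchmark; the helper name rows_containing is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'0': [[1, 2], [2, 3], [4, 5]]})

# One row per list element; the original index label is repeated
exploded = df['0'].explode()

def rows_containing(value):
    """Return the rows of df whose list contains value."""
    # Compare every element, then collapse back to one flag per original row
    mask = exploded.eq(value).groupby(level=0).any()
    return df[mask]

print(rows_containing(2))
print(rows_containing(5))
```

The explode step is paid once, so each additional filter value only costs a vectorized comparison and a groupby.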

Python pandas series access by index: Data must be 1 dimensional error

I have a pandas series (named "clusters") which somewhat looks like:
0 [[1, 2, 3], [4, 5, 6]]
1 [[1, 2, 3], [9, 10, 11]]
I get this series by converting: list > dataframe > as_matrix
After processing the matrix I get the series.
I want to access the series by the index which is 0 and 1 here.
But when I do clusters[0] or clusters[1], I get the error "Data must be 1 dimensional".
I don't know what the issue is here.
Alternatively, if I loop through this series, how do I access the index?
So if I say:
for k in clusters:
    print(k)
I get [[1, 2, 3], [4, 5, 6]]. But I want to get the index that this "[[1, 2, 3], [4, 5, 6]]" is linked to. How do I get that? I tried k.index but nothing works.
You can iterate through items, which iterates with the index label:
In [11]: for ind, k in clusters.items():
   ....:     print(ind)
0
1
I think there is something funky with your Series as you ought to be able to access by index:
In [12]: clusters[0]
Out[12]: [[1, 2, 3], [4, 5, 6]]
In [13]: clusters.loc[0]
Out[13]: [[1, 2, 3], [4, 5, 6]]
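As for getting the index: clusters.index is an attribute holding the labels, not a method, so clusters.index(k) raises a TypeError. To print the labels, loop over that attribute:
for ind in clusters.index:
    print(ind)
k.index will not help either, because k is a plain list and its index method only searches inside that list.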
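The iteration over labels and values can be sketched as a standalone script. The Series below is rebuilt by hand from the two rows shown in the question, so it may not reproduce whatever made the original Series raise the "1 dimensional" error.

```python
import pandas as pd

# Hand-built stand-in for the asker's Series of nested lists
clusters = pd.Series([
    [[1, 2, 3], [4, 5, 6]],
    [[1, 2, 3], [9, 10, 11]],
])

# items() yields (index label, value) pairs
labels = []
for ind, k in clusters.items():
    labels.append(ind)
    print(ind, k)

# Label-based access to a single element
first = clusters.loc[0]
```

Using .loc makes it explicit that 0 is an index label rather than a position.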

Accessing the Kth group in Pandas

Is this the only way to do it? It's quite verbose.
k = 0
grouped = df.groupby('A')
df.loc[grouped.groups[list(grouped.groups)[k]]]
Also, wouldn't list(grouped.groups) return keys in a meaningless order? (unordered dictionary)
Aside from the ordering of groups, is there a more concise way to get a group? I don't necessarily need to get the Kth one, although it would be nice to get them in the order they appear in the dataframe.
If you know the key, the concise way is get_group:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: g = df.groupby('A')
In [13]: g.get_group(1)
Out[13]:
A B
0 1 2
1 1 4
As mentioned, the group keys are not necessarily in the order they appear in the dataframe; you can access them (as is) with levels:
In [14]: g.grouper.levels
Out[14]: [Int64Index([1, 5], dtype='int64')]
If you are grouping by just one column, you can use unique to get the keys in the order they appear, i.e. not sorted:
In [15]: df.A.unique()
Out[15]: array([1, 5])
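Putting the two pieces together, the Kth group in order of first appearance can be fetched by combining unique with get_group. This is a sketch with a reordered toy frame so that appearance order and sorted order differ; the names keys and kth_group are illustrative.

```python
import pandas as pd

# Key 5 appears before key 1, so appearance order differs from sorted order
df = pd.DataFrame([[5, 6], [1, 2], [1, 4]], columns=['A', 'B'])

k = 1  # zero-based position of the group we want

# unique() preserves the order of first appearance
keys = df['A'].unique()

# get_group() looks up one group by its key
kth_group = df.groupby('A').get_group(keys[k])

print(kth_group)
```

With k = 0 the same code returns the single row whose key is 5, even though 5 sorts after 1.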
