pandas drop_duplicates using comparison function


Is it somehow possible to use pandas.drop_duplicates with a comparison operator which compares two objects in a particular column in order to identify duplicates? If not, what is the alternative?
Here is an example where it could be used:
I have a pandas DataFrame which has lists as values in a particular column and I would like to have duplicates removed based on column A
import pandas as pd
df = pd.DataFrame( {'A': [[1,2],[2,3],[1,2]]} )
print(df)
giving me
A
0 [1, 2]
1 [2, 3]
2 [1, 2]
Using pandas.drop_duplicates
df.drop_duplicates( 'A' )
gives me a TypeError
[...]
TypeError: type object argument after * must be a sequence, not itertools.imap
However, my desired result is
A
0 [1, 2]
1 [2, 3]
My comparison function would be here:
def cmp(x,y):
    return x==y
But in principle it could be something else, e.g.,
def cmp(x,y):
    return x==y and len(x)>1
How can I remove duplicates based on the comparison function in an efficient way?
Even more, what could I do if I had more columns to compare using a different comparison function, respectively?

Option 1
df[~pd.DataFrame(df.A.values.tolist()).duplicated()]
Option 2
df[~df.A.apply(pd.Series).duplicated()]
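As a rough sketch of what Option 1 does on the question's sample frame (the names expanded and result are only illustrative): each list is spread over its own columns, so duplicated() can compare rows element by element without having to hash the lists.

```python
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [2, 3], [1, 2]]})

# Spread every list over its own columns: one row per original row.
expanded = pd.DataFrame(df['A'].tolist())

# duplicated() marks rows that repeat an earlier row; keep the others.
result = df[~expanded.duplicated()]
print(result)
```

This only covers plain equality of the lists; an arbitrary comparison rule needs a key function, as the next answer describes.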

IIUC, your question is how to use an arbitrary function to determine what is a duplicate. To emphasize this, let's say that two lists are duplicates if the sum of the first item, plus the square of the second item, is the same in each case.
In [118]: df = pd.DataFrame( {'A': [[1,2],[4,1],[2,3]]} )
(Note that the first and second lists are equivalent, although not the same.)
Python typically prefers key functions to comparison functions, so here we need a function to say what is the key of a list; in this case, it is lambda l: l[0] + l[1]**2.
We can use groupby + first to group by the values of the key function, then take the first of each group:
In [119]: df.groupby(df.A.apply(lambda l: l[0] + l[1]**2)).first()
Out[119]:
A
A
5 [1, 2]
11 [2, 3]
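If the original row labels should survive (groupby replaces them with the key values), one possible variation, sketched here with an illustrative keys variable, is to compute the key per row and filter with duplicated():

```python
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [4, 1], [2, 3]]})

# Key function from the answer: first item plus the square of the second.
keys = df['A'].map(lambda l: l[0] + l[1] ** 2)

# Keep the first row for each key; the original index is preserved.
deduped = df[~keys.duplicated()]
print(deduped)
```

Here the keys are 5, 5 and 11, so the row holding [4, 1] is dropped as a duplicate of [1, 2].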
Edit
Following further edits in the question, here are a few more examples using
df = pd.DataFrame( {'A': [[1,2],[2,3],[1,2], [1], [1], [2]]} )
Then for
def cmp(x,y):
    return x==y
this could be
In [158]: df.groupby(df.A.apply(tuple)).first()
Out[158]:
A
A
(1,) [1]
(1, 2) [1, 2]
(2,) [2]
(2, 3) [2, 3]
for
def cmp(x,y):
    return x==y and len(x)>1
this could be
In [184]: class Key(object):
   .....:     def __init__(self):
   .....:         self._c = 0
   .....:     def __call__(self, l):
   .....:         if len(l) < 2:
   .....:             self._c += 1
   .....:             return self._c
   .....:         return tuple(l)
   .....:
In [187]: df.groupby(df.A.apply(Key())).first()
Out[187]:
A
A
1 [1]
2 [1]
3 [2]
(1, 2) [1, 2]
(2, 3) [2, 3]
Alternatively, this could also be done much more succinctly via
In [190]: df.groupby(df.A.apply(lambda l: np.random.rand() if len(l) < 2 else tuple(l))).first()
Out[190]:
A
A
0.112012068449 [2]
0.822889598152 [1]
0.842630848774 [1]
(1, 2) [1, 2]
(2, 3) [2, 3]
but some people don't like these Monte-Carlo things.

Lists are unhashable in nature. Try converting them to hashable types such as tuples and then you can continue to use drop_duplicates:
df['A'] = df['A'].map(tuple)
df.drop_duplicates('A').applymap(list)
One way of implementing it using a function would be based on computing value_counts of the series object, as duplicated values get aggregated and we are interested in only the index part (which by the way is unique) and not the actual count part.
def series_dups(col_name):
    ser = df[col_name].map(tuple).value_counts(sort=False)
    return (pd.Series(data=ser.index.values, name=col_name)).map(list)
series_dups('A')
0 [1, 2]
1 [2, 3]
Name: A, dtype: object
If you do not want to convert the values to tuple but rather process the values as they are, you could do:
Toy data:
df = pd.DataFrame({'A': [[1,2], [2,3], [1,2], [3,4]],
'B': [[10,11,12], [11,12], [11,12,13], [10,11,12]]})
df
def series_dups_hashable(frame, col_names):
    for col in col_names:
        ser, indx = np.unique(frame[col].values, return_index=True)
        frame[col] = pd.Series(data=ser, index=indx, name=col)
    return frame.dropna(how='all')
series_dups_hashable(df, ['A', 'B']) # Apply to subset/all columns you want to check
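For the last part of the question (several columns, each with its own rule), a minimal sketch is to give every column its own key function and look for duplicated rows of keys. The key_funcs mapping below is hypothetical: lists in A count as equal when their contents match, lists in B when their lengths match.

```python
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [2, 3], [1, 2], [3, 4]],
                   'B': [[10, 11, 12], [11, 12], [11, 12, 13], [10, 11, 12]]})

# Hypothetical per-column key functions (an assumption for illustration).
key_funcs = {'A': tuple, 'B': len}

# One hashable key per cell, in a frame aligned with df.
keys = pd.DataFrame({col: df[col].map(func) for col, func in key_funcs.items()})

# A row is a duplicate when all of its keys repeat an earlier row.
deduped = df[~keys.duplicated()]
print(deduped)
```

Row 2 is removed because its A list equals that of row 0 and its B list has the same length; the values themselves are left untouched.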

Related

Lists become pd.Series, then again lists with one dimension more

I have another problem with pandas; I will never get the hang of this library.
First, this is - I think - how zip() is supposed to work with lists:
import numpy as np
import pandas as pd
a = [1,2]
b = [3,4]
print(type(a))
print(type(b))
vv = zip([1,2], [3,4])
for i, v in enumerate(vv):
    print(f"{i}: {v}")
with output:
<class 'list'>
<class 'list'>
0: (1, 3)
1: (2, 4)
Problem. I create a dataframe, with list elements (in the actual code the lists come from grouping ops and I cannot change them, basically they contain all the values in a dataframe grouped by a column).
# create dataframe
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df)
x y
0 [1, 2, 3] [4, 5, 6]
However, the lists are now pd.Series:
print(type(df["x"]))
<class 'pandas.core.series.Series'>
If I do this:
col1 = df["x"].tolist()
col2 = df["y"].tolist()
print(f"col1 is of type {type(col1)}, with length {len(col1)}, first el is {col1[0]} of type {type(col1[0])}")
col1 is of type <class 'list'>, with length 1, first el is [1, 2, 3] of type <class 'list'>
Basically, the tolist() returned a list of list (why?):
Indeed:
print("ZIP AND ITER")
vv = zip(col1, col2)
for v in zip(col1, col2):
    print(v)
ZIP AND ITER
([1, 2, 3], [4, 5, 6])
I only need to compute this:
# this fails because x (y) is a list
# df['s'] = [np.sqrt(x**2 + y**2) for x, y in zip(df["x"], df["y"])]
I could use df["x"][0], but that does not seem very elegant.
Question:
How am I supposed to compute sqrt(x^2 + y^2) when x and y are in two columns df["x"] and df["y"]?

This should calculate df['s']
df['s'] = df.apply(lambda row: [np.sqrt(x**2 + y**2) for x, y in zip(row["x"], row["y"])], axis=1)
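A hedged alternative sketch, assuming every row holds two lists of equal length: np.hypot computes sqrt(x**2 + y**2) element-wise, so each pair of lists can be handled in one call rather than with an inner Python loop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{'x': [1, 2, 3], 'y': [4, 5, 6]}])

# Pair the two list cells of each row and let NumPy do the arithmetic.
# .tolist() turns the resulting array back into a plain list for the cell.
df['s'] = [np.hypot(x, y).tolist() for x, y in zip(df['x'], df['y'])]
print(df['s'][0])
```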

Basically, the tolist() returned a list of list (why?):
Because your dataframe has only 1 row, with two columns and both columns contain a list for its value. So, returning that column as a list of its values, it would return a list with 1 element (the list that is the value).
I think you wanted to create a dataframe like this:
values = {'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}
df = pd.DataFrame(values)
print(df)
x y
0 1 4
1 2 5
2 3 6

values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df) # yields
x y
0 [1, 2, 3] [4, 5, 6]
An elegant way to compute sqrt(x^2 + y^2) is to convert the dataframe as follows:
new_df = df.iloc[0,:].apply(pd.Series).T.reset_index(drop=True)
This yields the following output
x y
0 1 4
1 2 5
2 3 6
Now compute the sqrt(x^2 + y^2)
np.sqrt(new_df['x']**2 + new_df['y']**2)
This yields:
0 4.123106
1 5.385165
2 6.708204
dtype: float64

Scripting a simple counter

I wanted to create a simple script, which counts values in one column, that are higher in another column:
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
print(df)
a b
1 1 0
2 3 2
My function:
def diff(dataframe):
    a_counter=0
    b_counter=0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i>ii:
                a_counter+=1
            elif ii>i:
                b_counter+=1
    return a_counter, b_counter
However
diff(df)
returns (3, 1), instead of (2,0). I know the problem is that every single value of one column gets compared to every value of the other column (e.g. 1 gets compared to 0 and 2 of column b). There probably is a special function for my problem, but can you help me fix my script?

I would suggest adding some helper columns in an intuitive way to help compute the sum of each condition a > b and b > a
A working example based on your code :
import numpy as np
import pandas as pd
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
def diff(dataframe):
    dataframe['a>b'] = np.where(dataframe['a']>dataframe['b'], 1, 0)
    dataframe['b>a'] = np.where(dataframe['b']>dataframe['a'], 1, 0)
    return dataframe['a>b'].sum(), dataframe['b>a'].sum()
print(diff(df))
>>> (2, 0)
Basically what np.where() does, the way I used it, is that it produces 1 if the condition is met and 0 otherwise. You can then add those columns up using a simple sum() function applied on the desired columns.

Update
Maybe you can use:
>>> df['a'].gt(df['b']).sum(), df['b'].gt(df['a']).sum()
(2, 0)
IIUC, to fix your code:
def diff(dataframe):
    a_counter=0
    b_counter=0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i>ii:
                a_counter+=1
            elif ii>i:
                b_counter+=1
    # Subtract the minimum of counters
    m = min(a_counter, b_counter)
    return a_counter-m, b_counter-m
Output:
>>> diff(df)
(2, 0)
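The nested loops compare every value of a with every value of b, which is where (3, 1) comes from. A minimal sketch of a loop that stays row-wise, assuming both columns have the same length, pairs the values with zip instead (diff_rowwise is an illustrative name, not part of the original script):

```python
import pandas as pd

def diff_rowwise(dataframe):
    a_counter = 0
    b_counter = 0
    # zip pairs the values of the same row, so each row is compared once.
    for a_val, b_val in zip(dataframe["a"], dataframe["b"]):
        if a_val > b_val:
            a_counter += 1
        elif b_val > a_val:
            b_counter += 1
    return a_counter, b_counter

df = pd.DataFrame(data={'a': [1, 3], 'b': [0, 2]}, index=[1, 2])
print(diff_rowwise(df))
```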

IIUC, you can use the sign of the difference and count the values:
d = {1: 'a', -1: 'b', 0: 'equal'}
(np.sign(df['a'].sub(df['b']))
.map(d)
.value_counts()
.reindex(list(d.values()), fill_value=0)
)
output:
a 2
b 0
equal 0
dtype: int64

Row-wise difference in two list in pandas

I am using pandas to incrementally find out new elements i.e. for every row, I'd see whether values in list have been seen before. If they are, we will ignore them. If not, we will select them.
I was able to do this using row.iterrows(), but I have >1M rows, so I believe vectorized apply might be better.
Here's sample data and code. Once you run this code, you will get expected output:
from numpy import nan as NA
import collections.abc
import pandas as pd
df = pd.DataFrame({'ID':['A','B','C','A','B','A','A','A','D','E','E','E'],
                   'Value': [1,2,3,4,3,5,2,3,7,2,3,9]})
#wrap all elements by group in a list
Changed_df=df.groupby('ID')['Value'].apply(list).reset_index()
Changed_df=Changed_df.rename(columns={'Value' : 'Elements'})
Changed_df=Changed_df.reset_index(drop=True)
def flatten(l):
    for el in l:
        # collections.abc.Iterable: the bare collections.Iterable alias was removed in Python 3.10
        if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el
Changed_df["Elements_s"]=Changed_df['Elements'].shift()
#attempt 1: For loop
Changed_df["Diff"]=NA
Changed_df["count"]=0
Elements_so_far = []
#replace NA with empty list in columns that will go through list operations
for col in ["Elements","Elements_s","Diff"]:
    Changed_df[col] = Changed_df[col].apply(lambda d: d if isinstance(d, list) else [])
for idx,row in Changed_df.iterrows():
    diff = list(set(row['Elements']) - set(Elements_so_far))
    Changed_df.at[idx, "Diff"] = diff
    Elements_so_far.append(row['Elements'])
    Elements_so_far = flatten(Elements_so_far)
    Elements_so_far = list(set(Elements_so_far)) #keep unique elements
    Changed_df.loc[idx,"count"]=len(diff)
Commentary about the code:
I am not a fan of this code because it's clunky and inefficient.
I am saying inefficient because I have created Elements_s which holds shifted values. Another reason for inefficiency is for loop through rows.
Elements_so_far keeps track of all the elements we have discovered for every row. If there is a new element that shows up, we count that in Diff column.
We also keep track of the length of new elements discovered in count column.
I'd appreciate if an expert could help me with a vectorized version of the code.
I did try the vectorized version, but I couldn't go too far.
#attempt 2:
Changed_df.apply(lambda x: [i for i in x['Elements'] if i in x['Elements_s']], axis=1)
I was inspired by How to compare two columns both with list of strings and create a new column with unique items? to try the above, but I couldn't get it to work. The linked SO thread does a row-wise difference among columns.
I am using Python 3.6.7 by Anaconda. Pandas version is 0.23.4

You could sort and then use numpy to get the unique indexes and then construct your groupings, e.g.:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
df.iloc[i].groupby(df.ID).Value.apply(list)
Out[]:
ID
A [1, 2, 3, 4, 5]
D [7]
E [9]
Name: Value, dtype: object
Or to get close to your current output:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
s1 = df.groupby(df.ID).Value.apply(list).rename('Elements')
s2 = df.iloc[i].groupby(df.ID).Value.apply(list).rename('Diff').reindex(s1.index, fill_value=[])
pd.concat([s1, s2, s2.apply(len).rename('Count')], axis=1)
Out[]:
Elements Diff Count
ID
A [1, 4, 5, 2, 3] [1, 2, 3, 4, 5] 5
B [2, 3] [] 0
C [3] [] 0
D [7] [7] 1
E [2, 3, 9] [9] 1

One alternative using drop duplicates and groupby
# Groupby and apply list func.
df1 = df.groupby('ID')['Value'].apply(list).to_frame('Elements')
# Sort values , drop duplicates by Value column then use groupby.
df1['Diff'] = df.sort_values(['ID','Value']).drop_duplicates('Value').groupby('ID')['Value'].apply(list)
# Use str.len for count.
df1['Count'] = df1['Diff'].str.len().fillna(0).astype(int)
# To fill NaN with empty list
df1['Diff'] = df1.Diff.apply(lambda x: x if type(x)==list else [])
Elements Diff Count
ID
A [1, 4, 5, 2, 3] [1, 2, 3, 4, 5] 5
B [2, 3] [] 0
C [3] [] 0
D [7] [7] 1
E [2, 3, 9] [9] 1
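For comparison, here is a sketch that keeps the question's incremental logic but replaces iterrows, the shifted column and flatten with a single running set. It is not vectorized; it simply walks the grouped lists once. The names seen, diffs and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'A', 'D', 'E', 'E', 'E'],
                   'Value': [1, 2, 3, 4, 3, 5, 2, 3, 7, 2, 3, 9]})

# One list of values per ID, in order of ID.
elements = df.groupby('ID')['Value'].apply(list)

seen = set()   # every value met in earlier groups
diffs = []     # per group: the values not met before
for values in elements:
    # dict.fromkeys drops repeats inside one list while keeping order.
    fresh = [v for v in dict.fromkeys(values) if v not in seen]
    seen.update(fresh)
    diffs.append(fresh)

result = elements.to_frame('Elements')
result['Diff'] = diffs
result['Count'] = [len(d) for d in diffs]
print(result)
```

Unlike the sorted variants above, Diff keeps the values in the order in which they appear in Elements.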

How to get the position of certain columns in dataframe - Python [duplicate]

In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?

Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.

Here is a solution through list comprehension. cols is the list of columns to get index for:
[df.columns.get_loc(c) for c in cols if c in df]

DSM's solution works, but if you wanted a direct equivalent to which you could do (df.columns == name).nonzero()

For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels in the index, use get_indexer_for (the example uses the row index, because a dict cannot supply duplicate column names). It takes the same args as get_indexer:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing, for instance taking the nearest value within a tolerance for float values. If two index entries are equally distant from the specified label or are duplicates, the one with the larger index value is selected:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
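One detail worth spelling out: get_indexer reports labels it cannot find as -1 rather than raising. A small sketch, with the column order fixed explicitly and "kiwi" as a made-up missing label:

```python
import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
                  columns=["pear", "apple", "orange"])

# "kiwi" is not a column, so its position comes back as -1.
positions = df.columns.get_indexer(["pear", "kiwi", "orange"])

# Drop the -1 markers before using the positions with iloc.
found = positions[positions >= 0]
subset = df.iloc[:, found]
print(positions, list(subset.columns))
```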

When you might be looking to find multiple column matches, a vectorized solution using searchsorted method could be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
apple banana pear orange peach
0 8 3 4 4 2
1 4 4 3 0 1
2 1 2 6 8 1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])

Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs
In case you want the column name from the column location (the other way around to the OP question), you can use:
>>> df.columns.values[location]
Using @DSM's example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location] #(thanks to @roobie-nuby for pointing that out in comments.)

To modify DSM's answer a bit, get_loc has some weird properties depending on the type of index in the current version of Pandas (1.1.5) so depending on your Index type you might get back an index, a mask, or a slice. This is somewhat frustrating for me because I don't want to modify the entire columns just to extract one variable's index. Much simpler is to avoid the function altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.
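list(...).index(...) scans the columns on every call. If many names have to be looked up, one option (a sketch, assuming the column labels are unique) is to build a name-to-position dict once:

```python
import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
                  columns=["pear", "apple", "orange"])

# Map every column name to its integer position in a single pass.
position_of = {name: i for i, name in enumerate(df.columns)}

print(position_of["orange"])
```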

how about this:
df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]

When the column might or might not exist, the following variant of the above works.
try:
    ix = list(df.columns).index('Col_X')
except ValueError:
    ix = None
if ix is None:
    pass  # do something
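The same idea can be written without try/except by testing membership first. A minimal sketch, assuming unique column labels (so that get_loc returns a plain integer); column_position is a hypothetical helper name:

```python
import pandas as pd

def column_position(frame, name):
    """Return the integer position of a column, or None if it is absent."""
    if name in frame.columns:
        return frame.columns.get_loc(name)
    return None

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
                  columns=["pear", "apple", "orange"])

print(column_position(df, "apple"), column_position(df, "Col_X"))
```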

import random
import pandas as pd
def char_range(c1, c2): # question 7001144
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)
df = pd.DataFrame()
for c in char_range('a', 'z'):
    df[f'{c}'] = random.sample(range(10), 3) # Random Data
rearranged = random.sample(range(26), 26) # Random Order
df = df.iloc[:, rearranged]
print(df.iloc[:,:15]) # 15 Col View
for col in df.columns: # List of indices and columns
    print(str(df.columns.get_loc(col)) + '\t' + col)

