I am using pandas to find new elements incrementally: for every row, I check whether the values in its list have been seen in earlier rows. Values that have been seen are ignored; values that have not are selected.
I was able to do this with DataFrame.iterrows(), but I have more than 1M rows, so I believe a vectorized approach might be better.
Here is sample data and code. Running it produces the expected output:
import collections.abc
import pandas as pd
from numpy import nan as NA
df = pd.DataFrame({'ID':['A','B','C','A','B','A','A','A','D','E','E','E'],
'Value': [1,2,3,4,3,5,2,3,7,2,3,9]})
#wrap all elements by group in a list
Changed_df=df.groupby('ID')['Value'].apply(list).reset_index()
Changed_df=Changed_df.rename(columns={'Value' : 'Elements'})
Changed_df=Changed_df.reset_index(drop=True)
def flatten(l):
    # recursively yield the scalar items of arbitrarily nested iterables
    for el in l:
        if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el
Changed_df["Elements_s"]=Changed_df['Elements'].shift()
#attempt 1: For loop
Changed_df["Diff"]=NA
Changed_df["count"]=0
Elements_so_far = []
#replace NA with empty list in columns that will go through list operations
for col in ["Elements","Elements_s","Diff"]:
    Changed_df[col] = Changed_df[col].apply(lambda d: d if isinstance(d, list) else [])
for idx,row in Changed_df.iterrows():
    diff = list(set(row['Elements']) - set(Elements_so_far))
    Changed_df.at[idx, "Diff"] = diff
    Elements_so_far.append(row['Elements'])
    Elements_so_far = list(set(flatten(Elements_so_far))) #keep unique elements
    Changed_df.loc[idx,"count"]=len(diff)
Commentary about the code:
I am not a fan of this code because it is clunky and inefficient.
It is inefficient for two reasons: it creates Elements_s, which holds the shifted values, and it loops over the rows.
Elements_so_far keeps track of every element discovered up to the current row. Any element of the current row that is not in it yet is recorded in the Diff column.
The number of new elements is stored in the count column.
I'd appreciate it if an expert could help me with a vectorized version of the code.
I did try a vectorized version, but I couldn't get very far.
#attempt 2:
Changed_df.apply(lambda x: [i for i in x['Elements'] if i in x['Elements_s']], axis=1)
The attempt above was inspired by "How to compare two columns both with list of strings and create a new column with unique items?", but I couldn't get it to work. The linked SO thread computes a row-wise difference between columns.
I am using Python 3.6.7 by Anaconda. Pandas version is 0.23.4
You could sort by ID, use numpy to get the index of the first occurrence of each unique value, and then construct your groupings, e.g.:
In []:
import numpy as np
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
df.iloc[i].groupby(df.ID).Value.apply(list)
Out[]:
ID
A [1, 2, 3, 4, 5]
D [7]
E [9]
Name: Value, dtype: object
Or to get close to your current output:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
s1 = df.groupby(df.ID).Value.apply(list).rename('Elements')
s2 = df.iloc[i].groupby(df.ID).Value.apply(list).rename('Diff').reindex(s1.index, fill_value=[])
pd.concat([s1, s2, s2.apply(len).rename('Count')], axis=1)
Out[]:
           Elements             Diff  Count
ID
A   [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]      5
B            [2, 3]               []      0
C               [3]               []      0
D               [7]              [7]      1
E         [2, 3, 9]              [9]      1
One alternative uses drop_duplicates and groupby:
# Groupby and apply list func.
df1 = df.groupby('ID')['Value'].apply(list).to_frame('Elements')
# Sort values , drop duplicates by Value column then use groupby.
df1['Diff'] = df.sort_values(['ID','Value']).drop_duplicates('Value').groupby('ID')['Value'].apply(list)
# Use str.len for count.
df1['Count'] = df1['Diff'].str.len().fillna(0).astype(int)
# To fill NaN with empty list
df1['Diff'] = df1.Diff.apply(lambda x: x if type(x)==list else [])
           Elements             Diff  Count
ID
A   [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]      5
B            [2, 3]               []      0
C               [3]               []      0
D               [7]              [7]      1
E         [2, 3, 9]              [9]      1
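For comparison, the same idea can be sketched with Series.duplicated instead of np.unique or drop_duplicates. This is only a sketch under one assumption: groups are visited in sorted ID order, as in the loop over Changed_df from the question. The names ordered, first_seen and out are illustrative and do not come from the answers above.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'A', 'D', 'E', 'E', 'E'],
                   'Value': [1, 2, 3, 4, 3, 5, 2, 3, 7, 2, 3, 9]})

# Stable sort keeps the original row order inside each ID group.
ordered = df.sort_values('ID', kind='mergesort')

# True the first time a value shows up while walking the groups in ID order.
first_seen = ~ordered['Value'].duplicated()

elements = ordered.groupby('ID')['Value'].apply(list).rename('Elements')
diff = ordered[first_seen].groupby('ID')['Value'].apply(list).rename('Diff')

out = pd.concat([elements, diff], axis=1).sort_index()
# Groups that contribute nothing new come back as NaN; turn those into empty lists.
out['Diff'] = out['Diff'].apply(lambda x: x if isinstance(x, list) else [])
out['Count'] = out['Diff'].apply(len)
print(out)
```

Unlike the np.unique variant, Diff here keeps the values in order of appearance rather than sorted order, which may or may not matter for your use case.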
Related
I have two arrays of the same size. One, call it A, contains a series of repeated numbers; the other, B contains random numbers.
import numpy as np
A = np.array([1,1,1,2,2,2,0,0,0,3,3])
B = np.array([1,2,3,6,5,4,7,8,9,10,11])
I need to find the differences in B between the two extremes defined by the groups in A. More specifically, I need an output C such as
C = [2, -2, 2, 1]
where each term is the difference 3 - 1, 4 - 6, 9 - 7, and 11 - 10, i.e., the difference between the extremes in B identified by the groups of repeated numbers in A.
I tried to play around with itertools.groupby to isolate the groups in the first array, but it is not clear to me how to exploit the indexing to operate the differences in the second.
Edit: C is now sorted the same way as in the question
C = []
_, idx = np.unique(A, return_index=True)
for i in A[np.sort(idx)]:
    bs = B[A==i]
    C.append(bs[-1] - bs[0])
print(C) # [2, -2, 2, 1]
np.unique with return_index=True returns, for each unique value in A, the index of its first appearance.
i in A[np.sort(idx)] iterates over the unique values in the order of the indexes.
B[A==i] extracts the values from B at the same indexes as those values in A.
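If the groups in A are always contiguous runs, as in the example, the Python loop can probably be avoided altogether by locating the run boundaries. The following is a sketch under that assumption; starts and ends are illustrative names.

```python
import numpy as np

A = np.array([1, 1, 1, 2, 2, 2, 0, 0, 0, 3, 3])
B = np.array([1, 2, 3, 6, 5, 4, 7, 8, 9, 10, 11])

# A new run starts wherever the value of A changes from one element to the next.
starts = np.r_[0, np.flatnonzero(np.diff(A)) + 1]
# Each run ends just before the next one starts; the last run ends at the last element.
ends = np.r_[starts[1:] - 1, len(A) - 1]

C = B[ends] - B[starts]
print(C.tolist())  # [2, -2, 2, 1]
```

If the same value of A can reappear later in a separate run, this treats each run as its own group, which differs from the B[A==i] approach.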
This is easily achieved using pandas' groupby:
A = np.array([1,1,1,2,2,2,0,0,0,3,3])
B = np.array([1,2,3,6,5,4,7,8,9,10,11])
import pandas as pd
pd.Series(B).groupby(A, sort=False).agg(lambda g: g.iloc[-1]-g.iloc[0]).to_numpy()
output: array([ 2, -2, 2, 1])
using itertools.groupby:
from itertools import groupby
[(x:=list(g))[-1][1]-x[0][1] for k, g in groupby(zip(A,B), lambda x: x[0])]
output: [2, -2, 2, 1]
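NB. The two solutions behave differently when the same value of A appears in several non-consecutive groups: the pandas groupby merges all rows that share a value into one group, whereas itertools.groupby treats each consecutive run as a separate group.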
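To make that difference concrete, here is a small sketch with made-up data in which the value 1 appears in two separate runs. The arrays and the names by_label and by_run are illustrative only.

```python
from itertools import groupby

import numpy as np
import pandas as pd

A = np.array([1, 1, 2, 2, 1, 1])
B = np.array([10, 20, 30, 40, 50, 70])

# pandas groups by label, so both runs of 1 end up in the same group.
by_label = (pd.Series(B)
            .groupby(A, sort=False)
            .agg(lambda g: g.iloc[-1] - g.iloc[0])
            .tolist())

# itertools.groupby only groups consecutive equal keys, so each run is separate.
by_run = []
for _, run in groupby(zip(A, B), key=lambda pair: pair[0]):
    values = [b for _, b in run]
    by_run.append(int(values[-1] - values[0]))

print(by_label)  # [60, 10]
print(by_run)    # [10, 10, 20]
```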
In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?
Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.
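As a quick sanity check, the position returned by get_loc can be fed straight into iloc. This is a minimal sketch; note that recent pandas versions keep the dict's insertion order for the columns, so the position of "pear" may differ from the older output shown above.

```python
import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})

# Integer position of the column labelled "pear".
pos = df.columns.get_loc("pear")

# Positional access with that index gives the same column as access by name.
same_column = df.iloc[:, pos].equals(df["pear"])
print(pos, same_column)
```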
Here is a solution through list comprehension. cols is the list of columns to get index for:
[df.columns.get_loc(c) for c in cols if c in df]
DSM's solution works, but if you wanted a direct equivalent to R's which you could do (df.columns == name).nonzero()
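One case where the mask-based form is handy is a frame with repeated column labels, since it returns every matching position. A small sketch with a made-up frame (the duplicate label "a" is an assumption for illustration):

```python
import pandas as pd

# A frame whose label "a" occurs twice.
df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

# Comparing the Index with a scalar gives a boolean NumPy array.
mask = df.columns == "a"

# nonzero() returns a tuple with one array per dimension; take the first.
positions = mask.nonzero()[0]
print(positions.tolist())  # [0, 2]
```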
For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels, use get_indexer_for instead (the example below uses the row index, because a dict cannot supply duplicate column names). It takes the same args as get_indexer:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing, e.g. taking the nearest value within a tolerance for float labels. If two indices have the same distance to the specified label or are duplicates, the index with the larger index value is selected:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
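The snippet above uses exact matching, which is why the label 1 maps to -1. A sketch of the non-exact form, using the method and tolerance parameters of Index.get_indexer on a made-up monotonic float index:

```python
import pandas as pd

idx = pd.Index([0.0, 0.9, 2.0])

# Exact matching: labels that are not present map to -1.
exact = idx.get_indexer([0.0, 1.0])

# Nearest matching: each target maps to the closest label, but only if it
# lies within the tolerance; otherwise it still maps to -1.
nearest = idx.get_indexer([0.05, 1.0, 5.0], method="nearest", tolerance=0.2)

print(exact.tolist())    # [0, -1]
print(nearest.tolist())  # [0, 1, -1]
```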
When you might be looking to find multiple column matches, a vectorized solution using searchsorted method could be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
   apple  banana  pear  orange  peach
0      8       3     4       4      2
1      4       4     3       0      1
2      1       2     6       8      1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])
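One caveat with searchsorted is that it returns an insertion point even for names that are not columns, so a missing name can silently map to the wrong position. Below is a hedged sketch of a checked variant; column_index_checked is a hypothetical helper, not part of the answer above, and it assumes string column labels.

```python
import numpy as np
import pandas as pd


def column_index_checked(df, query_cols):
    """Positions of query_cols in df.columns; raises KeyError for unknown names."""
    cols = df.columns.to_numpy()
    query = np.asarray(query_cols, dtype=object)
    sidx = np.argsort(cols)
    pos = np.searchsorted(cols, query, sorter=sidx)
    # An insertion point past the end would be out of range, so clip it first.
    pos = np.clip(pos, 0, len(cols) - 1)
    found = sidx[pos]
    # A name is missing if the label at the located position is not the query itself.
    missing = cols[found] != query
    if missing.any():
        raise KeyError(list(query[missing]))
    return found


df = pd.DataFrame(np.zeros((3, 5), dtype=int),
                  columns=['apple', 'banana', 'pear', 'orange', 'peach'])
print(column_index_checked(df, ['peach', 'banana', 'apple']).tolist())  # [4, 1, 0]
```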
Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs
In case you want the column name from the column location (the other way around to the OP question), you can use:
>>> df.columns.values[location]
Using @DSM's example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location] #(thanks to @roobie-nuby for pointing that out in comments.)
To modify DSM's answer a bit, get_loc has some weird properties depending on the type of index in the current version of Pandas (1.1.5) so depending on your Index type you might get back an index, a mask, or a slice. This is somewhat frustrating for me because I don't want to modify the entire columns just to extract one variable's index. Much simpler is to avoid the function altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.
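The three return types mentioned above can be reproduced with small made-up indexes. This sketch only illustrates the behaviour and assumes string labels:

```python
import numpy as np
import pandas as pd

# Unique labels: get_loc returns a plain integer position.
single = pd.Index(["a", "b", "c"]).get_loc("b")

# Duplicates in a sorted (monotonic) index: a slice covering the run.
run = pd.Index(["a", "b", "b"]).get_loc("b")

# Duplicates in an unsorted index: a boolean mask over all positions.
mask = pd.Index(["b", "a", "b"]).get_loc("b")

print(single, run, mask)
```

list(df.columns).index('pear') always returns the first matching position as an int, which is simpler but hides later duplicates.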
How about this:
df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]
When the column might or might not exist, the following variant of the above works:
try:
    ix = list(df.columns).index('Col_X')
except ValueError:
    ix = None
if ix is None:
    pass # do something
import random
import pandas as pd
def char_range(c1, c2): # question 7001144
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)
df = pd.DataFrame()
for c in char_range('a', 'z'):
    df[f'{c}'] = random.sample(range(10), 3) # Random Data
rearranged = random.sample(range(26), 26) # Random Order
df = df.iloc[:, rearranged]
print(df.iloc[:,:15]) # 15 Col View
for col in df.columns: # List of indices and columns
    print(str(df.columns.get_loc(col)) + '\t' + col)
I want to create a new list that is the sum of the columns of the previous lists.
I have a lot of different lists and I would like to sum up all of the different lists in the most efficient way possible. Below is an example of the issue I am trying to solve:
list[0] = [2,4,1,6,7]
list[1] = [3,1,2,11,0]
list[2] = [2,4,2,2,1]
...
list[999] = [4,2,5,6,7]
The newlist would then look something like this:
Newlist = [1340,1525,675,1825,895]
What would be the best way to create the new list?
What do you think of this solution:
import numpy as np
a = list()
a.append([2, 4, 1, 6, 7]) # a[0]
a.append([3, 1, 2, 11, 0]) # a[1]
a.append([2, 4, 2, 2, 1]) # a[2]
# 1st solution
rslt_1 = a[0]
for i in range(1, len(a)):
    rslt_1 = np.add(rslt_1, a[i])
# 2nd solution
rslt_2 = np.sum(a, axis=0)
print("Rslt_1:", rslt_1)
print("Rslt_2:", rslt_2)
Returns:
Rslt_1: [ 7 9 5 19 8]
Rslt_2: [ 7 9 5 19 8]
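If pulling in NumPy is not desirable, a plain-Python sketch of the same column-wise sum could use zip to transpose the rows. It assumes all inner lists have the same length, since zip stops at the shortest one.

```python
a = [
    [2, 4, 1, 6, 7],
    [3, 1, 2, 11, 0],
    [2, 4, 2, 2, 1],
]

# zip(*a) yields one tuple per column; sum each of them.
column_sums = [sum(column) for column in zip(*a)]
print(column_sums)  # [7, 9, 5, 19, 8]
```

For a thousand short lists either approach should be fast; np.sum(a, axis=0) is likely to scale better for large inputs.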
Is it somehow possible to use pandas.drop_duplicates with a comparison operator which compares two objects in a particular column in order to identify duplicates? If not, what is the alternative?
Here is an example where it could be used:
I have a pandas DataFrame which has lists as values in a particular column and I would like to have duplicates removed based on column A
import pandas as pd
df = pd.DataFrame( {'A': [[1,2],[2,3],[1,2]]} )
print(df)
giving me
A
0 [1, 2]
1 [2, 3]
2 [1, 2]
Using pandas.drop_duplicates
df.drop_duplicates( 'A' )
gives me a TypeError
[...]
TypeError: type object argument after * must be a sequence, not itertools.imap
However, my desired result is
A
0 [1, 2]
1 [2, 3]
My comparison function would be here:
def cmp(x,y):
return x==y
But in principle it could be something else, e.g.,
def cmp(x,y):
return x==y and len(x)>1
How can I remove duplicates based on the comparison function in an efficient way?
Even more, what could I do if I had more columns to compare using a different comparison function, respectively?
Option 1
df[~pd.DataFrame(df.A.values.tolist()).duplicated()]
Option 2
df[~df.A.apply(pd.Series).duplicated()]
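Both options work by spreading each list over ordinary columns, where duplicated can compare rows. A sketch of Option 1 on the question's data, with the intermediate frame named expanded for illustration and the original index passed through so the mask stays aligned:

```python
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [2, 3], [1, 2]]})

# One column per list position; shorter lists would be padded with NaN.
expanded = pd.DataFrame(df['A'].tolist(), index=df.index)

# Keep only rows whose expanded values have not appeared before.
result = df[~expanded.duplicated()]
print(result)
```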
IIUC, your question is how to use an arbitrary function to determine what is a duplicate. To emphasize this, let's say that two lists are duplicates if the sum of the first item plus the square of the second item is the same in each case.
In [118]: df = pd.DataFrame( {'A': [[1,2],[4,1],[2,3]]} )
(Note that the first and second lists are equivalent, although not the same.)
Python typically prefers key functions to comparison functions, so here we need a function to say what is the key of a list; in this case, it is lambda l: l[0] + l[1]**2.
We can use groupby + first to group by the values of the key function, then take the first of each group:
In [119]: df.groupby(df.A.apply(lambda l: l[0] + l[1]**2)).first()
Out[119]:
A
A
5 [1, 2]
11 [2, 3]
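The groupby result is indexed by the key values. If the original row labels and order should be kept instead, one possible variation is to apply duplicated to the computed keys. This is a sketch using the same key function; key and deduped are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [4, 1], [2, 3]]})


def key(l):
    # Two lists count as duplicates when this value matches.
    return l[0] + l[1] ** 2


# Rows whose key has already been seen are flagged as duplicates.
deduped = df[~df['A'].map(key).duplicated()]
print(deduped)
```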
Edit
Following further edits in the question, here are a few more examples using
df = pd.DataFrame( {'A': [[1,2],[2,3],[1,2], [1], [1], [2]]} )
Then for
def cmp(x,y):
return x==y
this could be
In [158]: df.groupby(df.A.apply(tuple)).first()
Out[158]:
A
A
(1,) [1]
(1, 2) [1, 2]
(2,) [2]
(2, 3) [2, 3]
for
def cmp(x,y):
return x==y and len(x)>1
this could be
In [184]: class Key(object):
   .....:     def __init__(self):
   .....:         self._c = 0
   .....:     def __call__(self, l):
   .....:         if len(l) < 2:
   .....:             self._c += 1
   .....:             return self._c
   .....:         return tuple(l)
   .....:
In [187]: df.groupby(df.A.apply(Key())).first()
Out[187]:
A
A
1 [1]
2 [1]
3 [2]
(1, 2) [1, 2]
(2, 3) [2, 3]
Alternatively, this could also be done much more succinctly via
In [190]: df.groupby(df.A.apply(lambda l: np.random.rand() if len(l) < 2 else tuple(l))).first()
Out[190]:
A
A
0.112012068449 [2]
0.822889598152 [1]
0.842630848774 [1]
(1, 2) [1, 2]
(2, 3) [2, 3]
but some people don't like these Monte-Carlo things.
Lists are unhashable in nature. Try converting them to hashable types such as tuples and then you can continue to use drop_duplicates:
df['A'] = df['A'].map(tuple)
df.drop_duplicates('A').applymap(list)
One way of implementing it using a function would be based on computing value_counts of the series object, as duplicated values get aggregated and we are interested in only the index part (which by the way is unique) and not the actual count part.
def series_dups(col_name):
    ser = df[col_name].map(tuple).value_counts(sort=False)
    return (pd.Series(data=ser.index.values, name=col_name)).map(list)
series_dups('A')
0 [1, 2]
1 [2, 3]
Name: A, dtype: object
If you do not want to convert the values to tuple but rather process the values as they are, you could do:
Toy data:
df = pd.DataFrame({'A': [[1,2], [2,3], [1,2], [3,4]],
'B': [[10,11,12], [11,12], [11,12,13], [10,11,12]]})
df
def series_dups_hashable(frame, col_names):
    for col in col_names:
        ser, indx = np.unique(frame[col].values, return_index=True)
        frame[col] = pd.Series(data=ser, index=indx, name=col)
    return frame.dropna(how='all')
series_dups_hashable(df, ['A', 'B']) # Apply to subset/all columns you want to check
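For the part of the question about comparing several columns with a different function each, one possible sketch is to build a frame of hashable keys, one key function per column, and call duplicated on that. The key_funcs mapping is purely illustrative: here column A is compared by content and column B by length only.

```python
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [2, 3], [1, 2], [3, 4]],
                   'B': [[10, 11, 12], [11, 12], [11, 12, 13], [10, 11, 12]]})

# Hypothetical per-column key functions; each must return a hashable value.
key_funcs = {'A': tuple, 'B': len}

# One key column per original column, aligned on the original index.
keys = pd.DataFrame({col: df[col].map(fn) for col, fn in key_funcs.items()})

# A row is a duplicate only if all of its keys match an earlier row.
deduped = df[~keys.duplicated()]
print(deduped)
```

Unlike series_dups_hashable, this leaves the original frame unmodified and treats a row as a whole rather than deduplicating each column separately.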