Pandas: add indicator for duplicate on columns - python

Here is a pandas DF with columns A, B, C, D
A B C D
0 1 2 1.0 a
1 1 2 1.01 a
2 1 2 1.0 b
3 3 4 0 b
4 3 4 0 c
5 1 2 1 c
6 1 9 1 c
How can I add a column to show duplicates from other rows with constraints:
exact match for A, B
float tolerance with C (within 0.05)
must not match D
A B C D Dups
0 1 2 1.0 a 2,5
1 1 2 1.01 a 2,5
2 1 2 1.0 b 0,1,5
3 3 4 0 b 4
4 3 4 0 c 3
5 1 2 1 c 0,1,2
6 1 9 1 c null

My original answer required N**2 iterations for N rows. The answer by sammywemmy loops over permutations(..., 2), which is essentially a loop over N*(N-1) combinations. The answer by warped is more efficient because it starts with a quicker matching on the A and B columns, but there is still a slow search for the conditions on the C and D columns. The number of iterations is therefore N*M where M is the average number of rows sharing the same A and B values.
If you're willing to change the requirement of "C equal +/-0.05" to "C is equal when rounded to 1 decimal", it gets better, with N*K iterations where K is the average number of rows having the same A, B, and C values. Here is one implementation; you can also adapt warped's approach (a sketch of that follows after the output below).
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})

# alternative to "equal +/- 0.05"
df['C10'] = np.around(df['C']*10).astype('int')
# convert int64 tuples to int tuples
ituple = lambda tup: tuple(int(x) for x in tup)
# records: [(1, 2, 10), (1, 2, 10), (1, 2, 10), (3, 4, 0), ...]
records = [ituple(rec) for rec in df[['A', 'B', 'C10']].to_records(index=False)]
# dupd: dict with records as keys, lists of indices as values,
# e.g. {(1, 2, 10): [0, 1, 2, 5], ...}
dupd = {}  # key: ABC tuple; value: list of indices
# Build up dupd based on equal A, B, C columns.
for i, rec in enumerate(records):
    # each record is a tuple of plain ints, so it can be used as a dict key
    if rec in dupd:
        dupd[rec].append(i)
    else:
        dupd[rec] = [i]
# build the duplicate list for each row, dropping the ones with equal D
dups = []
D = df['D']
for i, rec in enumerate(records):
    dup = [j for j in dupd[rec] if i != j and D[i] != D[j]]
    dups.append(tuple(dup))
df.drop(columns=['C10'], inplace=True)
df['Dups'] = dups
print(df)
Output:
A B C D Dups
0 1 2 1.00 a (2, 5)
1 1 2 1.01 a (2, 5)
2 1 2 1.00 b (0, 1, 5)
3 3 4 0.00 b (4,)
4 3 4 0.00 c (3,)
5 1 2 1.00 c (0, 1, 2)
6 1 9 1.00 c ()
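As mentioned above, warped's groupby approach (shown further below) can also be adapted to the rounded-C variant; grouping on all three key columns removes the inner search over C entirely. A minimal sketch, assuming the same df, the C10 helper column from above, and a default RangeIndex:
# Sketch: group on A, B, and the rounded C, then filter on D within each group.
df['C10'] = np.around(df['C'] * 10).astype(int)
dups = [None] * len(df)
for _, grp in df.groupby(['A', 'B', 'C10']):
    for i in grp.index:
        dups[i] = tuple(j for j in grp.index
                        if j != i and grp.at[j, 'D'] != grp.at[i, 'D'])
df.drop(columns=['C10'], inplace=True)
df['Dups'] = dups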
Here is the original answer, which scales as O(N**2), but is easy to understand:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})

dups = []
for i, irow in df.iterrows():
    dup = []
    for j, jrow in df.iterrows():
        if (i != j and
                irow['A'] == jrow['A'] and
                irow['B'] == jrow['B'] and
                abs(irow['C'] - jrow['C']) < 0.05 and
                irow['D'] != jrow['D']):
            dup.append(j)
    dups.append(tuple(dup))
df['Dups'] = dups
print(df)

This is far from pretty, but it does get the job done:
tolerance = 0.05
dups = {}
for _, group in df.groupby(['A', 'B']):
    for i, row1 in group.iterrows():
        data = []
        for j, row2 in group.iterrows():
            if i != j:
                if abs(row1['C'] - row2['C']) <= tolerance:
                    if row1['D'] != row2['D']:
                        data.append(j)
        dups[i] = data
dups = [dups.get(a) for a in range(len(dups))]
df['dups'] = dups
df
A B C D dups
0 1 2 1.00 a [2, 5]
1 1 2 1.01 a [2, 5]
2 1 2 1.00 b [0, 1, 5]
3 3 4 0.00 b [4]
4 3 4 0.00 c [3]
5 1 2 1.00 c [0, 1, 2]
6 1 9 1.00 c []
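Side note: this answer (and the one above) leave the duplicates column as lists or tuples rather than the comma-separated strings with null shown in the question. If that exact presentation matters, a one-line post-processing step does it; a minimal sketch, assuming the dups column of lists as produced here and numpy imported as np:
# "2,5"-style strings; empty lists become NaN (rendered as null/NaN)
df['dups'] = [','.join(map(str, t)) if t else np.nan for t in df['dups']]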

Convert to a dictionary:
res = df.T.to_dict("list")
res
{0: [1, 2, 1.0, 'a'],
1: [1, 2, 1.01, 'a'],
2: [1, 2, 1.0, 'b'],
3: [3, 4, 0.0, 'b'],
4: [3, 4, 0.0, 'c'],
5: [1, 2, 1.0, 'c'],
6: [1, 9, 1.0, 'c']}
Pair each index with its row values in a tuple:
box = [(key,*value) for key, value in res.items()]
box
[(0, 1, 2, 1.0, 'a'),
(1, 1, 2, 1.01, 'a'),
(2, 1, 2, 1.0, 'b'),
(3, 3, 4, 0.0, 'b'),
(4, 3, 4, 0.0, 'c'),
(5, 1, 2, 1.0, 'c'),
(6, 1, 9, 1.0, 'c')]
Use itertools' permutations along with your conditions to filter out matches:
from itertools import permutations
phase1 = [(ind, (first, second), *_) for ind, first, second, *_ in box]
# can be refactored with something cleaner
phase2 = [((*first[1], *first[2:]), second[0])
          for first, second in permutations(phase1, 2)
          if first[1] == second[1]
          and abs(second[2] - first[2]) <= 0.05
          and first[-1] != second[-1]
          ]
phase2
[((1, 2, 1.0, 'a'), 2),
((1, 2, 1.0, 'a'), 5),
((1, 2, 1.01, 'a'), 2),
((1, 2, 1.01, 'a'), 5),
((1, 2, 1.0, 'b'), 0),
((1, 2, 1.0, 'b'), 1),
((1, 2, 1.0, 'b'), 5),
((3, 4, 0.0, 'b'), 4),
((3, 4, 0.0, 'c'), 3),
((1, 2, 1.0, 'c'), 0),
((1, 2, 1.0, 'c'), 1),
((1, 2, 1.0, 'c'), 2)]
Get the pairings via defaultdict:
from collections import defaultdict
d = defaultdict(list)
for k, v in phase2:
d[k].append(v)
d
defaultdict(list,
{(1, 2, 1.0, 'a'): [2, 5],
(1, 2, 1.01, 'a'): [2, 5],
(1, 2, 1.0, 'b'): [0, 1, 5],
(3, 4, 0.0, 'b'): [4],
(3, 4, 0.0, 'c'): [3],
(1, 2, 1.0, 'c'): [0, 1, 2]})
Combine the values in d into strings:
e = [(*k,",".join(str(ent) for ent in v)) for k,v in d.items()]
e
[(1, 2, 1.0, 'a', '2,5'),
(1, 2, 1.01, 'a', '2,5'),
(1, 2, 1.0, 'b', '0,1,5'),
(3, 4, 0.0, 'b', '4'),
(3, 4, 0.0, 'c', '3'),
(1, 2, 1.0, 'c', '0,1,2')]
Create a dataframe from the extract:
cols = df.columns.append(pd.Index(["Dups"]))
dups = pd.DataFrame(e, columns=cols)
Merge with the original dataframe:
result = df.merge(dups, how="left", on=["A", "B", "C", "D"])
result
A B C D Dups
0 1 2 1.00 a 2,5
1 1 2 1.01 a 2,5
2 1 2 1.00 b 0,1,5
3 3 4 0.00 b 4
4 3 4 0.00 c 3
5 1 2 1.00 c 0,1,2
6 1 9 1.00 c NaN

Related

Python: How to get the statistics of the position of each item in multiple lists?

I want to analyze the items in the sequences and the positions in the sequence where each item appears.
For example:
dataframe['sequence_list'][0] = ['a','b', 'f', 'e']
dataframe['sequence_list'][1] = ['a','c', 'd', 'e']
dataframe['sequence_list'][2] = ['a','d']
...
dataframe['sequence_list'][i] = ['a','b', 'c']
What I want to get is:
How many times does 'a' appear in position 0, 1, 2, 3 of the list?
How many times does 'b' appear in position 0, 1, 2, 3 of the list?
...
Output would be like:
output[1,'a'] = 4
output[2,'a'] = 0
output[3,'a'] = 0
output[4,'a'] = 0
output[1,'b'] = 2
...
The output format could be different. I just want to know whether there is any quick matrix-computing methodology to help me get these stats quickly.
Start by converting the lists into Series using one of the two statements:
df_ser = dataframe.sequence_list.apply(pd.Series)
df_ser = pd.DataFrame(dataframe.sequence_list.tolist()) # ~30% faster?
The columns of the new dataframe are item positions for each row:
# 0 1 2 3
#0 a b f e
#1 a c d e
#2 a d NaN NaN
#3 a b c NaN
Convert the column numbers into the second-level index, then the second-level index into a column of its own:
df_col = df_ser.stack().reset_index(level=1)
# level_1 0
#0 0 a
#0 1 b
#0 2 f
#....
Count the combinations. This is your answer:
output = df_col.groupby(['level_1', 0]).size()
#level_1  0
#0        a    4
#1        b    2
#         c    1
#         d    1
#2        c    1
#         d    1
#         f    1
#3        e    2
You can have it as dictionary:
output.to_dict()
#{(0, 'a'): 4, (1, 'b'): 2, (1, 'c'): 1, (1, 'd'): 1,
# (2, 'c'): 1, (2, 'd'): 1, (2, 'f'): 1, (3, 'e'): 2}
All in one line:
dataframe.sequence_list.apply(pd.Series)\
    .stack().reset_index(level=1)\
    .groupby(['level_1', 0]).size().to_dict()
Using the setup
df = pd.DataFrame({'col': [['a','b', 'f', 'e'], ['a','c', 'd', 'e'], ['a','d'], ['a','b', 'c']]})
col
0 [a, b, f, e]
1 [a, c, d, e]
2 [a, d]
3 [a, b, c]
You can use apply + Counter (from collections):
from collections import Counter
pd.DataFrame(df.col.tolist()).apply(Counter)
which yields
0 {'a': 4}
1 {'b': 2, 'c': 1, 'd': 1}
2 {'f': 1, 'd': 1, None: 1, 'c': 1}
3 {'e': 2, None: 2}
dtype: object
for each index.
You can then parse the result however you need, e.g. fill in the dicts to add the zeroes, or drop the Nones if that's the case.
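For instance, a minimal sketch of that post-processing, using the same df as in the setup above (cleaned and table are hypothetical names; value_counts skips the None padding automatically):
from collections import Counter
import pandas as pd
# Per-position Counters without the None padding from the shorter lists:
counters = pd.DataFrame(df.col.tolist()).apply(Counter)
cleaned = counters.apply(lambda c: {k: v for k, v in c.items() if k is not None})
# Or a zero-filled item-by-position table in one go:
table = pd.DataFrame(df.col.tolist()).apply(lambda s: s.value_counts()).fillna(0).astype(int)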

Creating a subset of array from another array : Python

I have a basic question regarding working with arrays:
a= ([ c b a a b b c a a b b c a a b a c b])
b= ([ 0 1 0 1 0 0 0 0 2 0 1 0 2 0 0 1 0 1])
I) Is there a short way to count the number of times 'c' in a corresponds to 0, 1, and 2 in b, the number of times 'b' in a corresponds to 0, 1, 2, and so on?
II) How do I create a new array c (subset of a) and d (subset of b) such that they only contain the elements for which the corresponding element in a is 'c'?
In [10]: p = ['a', 'b', 'c', 'a', 'c', 'a']
In [11]: q = [1, 2, 1, 3, 3, 1]
In [12]: z = list(zip(p, q))
In [13]: z
Out[13]: [('a', 1), ('b', 2), ('c', 1), ('a', 3), ('c', 3), ('a', 1)]
In [14]: counts = {}
In [15]: for pair in z:
    ...:     if pair in counts:
    ...:         counts[pair] += 1
    ...:     else:
    ...:         counts[pair] = 1
    ...:
In [16]: counts
Out[16]: {('a', 1): 2, ('a', 3): 1, ('b', 2): 1, ('c', 1): 1, ('c', 3): 1}
In [17]: sub_p = []
In [18]: sub_q = []
In [19]: for i, element in enumerate(p):
    ...:     if element == 'a':
    ...:         sub_p.append(element)
    ...:         sub_q.append(q[i])
In [20]: sub_p
Out[20]: ['a', 'a', 'a']
In [21]: sub_q
Out[21]: [1, 3, 1]
Explanation
zip takes two lists and runs a figurative zipper between them, resulting in a list of tuples (it is wrapped in list() above because zip returns a lazy iterator in Python 3)
I've used a simplistic approach: I'm just maintaining a map/dictionary that makes note of how many times it has seen each char-int pair
Then I make 2 sublists that you can modify to use the character in question and figure out what it maps to
Alternative methods
As abarnert suggested, you could use a Counter from collections instead.
Or you could just use the count method on z, e.g. z.count(('a', 1)). Or you can use a defaultdict instead.
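Minimal sketches of those alternatives, reusing the p and q lists from the transcript above:
from collections import Counter, defaultdict
z = list(zip(p, q))
# Counter does all the pair counting in one call:
counts = Counter(z)      # Counter({('a', 1): 2, ('b', 2): 1, ...})
# list.count works one pair at a time:
n = z.count(('a', 1))    # 2
# defaultdict(int) removes the explicit "key already there?" check:
dd = defaultdict(int)
for pair in z:
    dd[pair] += 1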
The questions are a bit vague, but here's a quick method (some would call it dirty) using Pandas, though something written without recourse to Pandas may be preferable.
import pandas as pd
#create OP's lists
a= [ 'c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b= [ 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]
#dump lists to a Pandas DataFrame
df = pd.DataFrame({'a':a, 'b':b})
Question 1
provided I interpreted it correctly, you can cross-tabulate the two arrays with
pd.crosstab(df.a, df.b).stack(). crosstab counts the number of times each number corresponds to a particular letter; .stack turns the crosstab output into a more legible format.
#question 1
pd.crosstab(df.a, df.b).stack()
Out[9]:
a b
a 0 3
1 2
2 2
b 0 4
1 3
2 0
c 0 4
1 0
2 0
dtype: int64
Question 2
Here, I use Pandas' boolean indexing to select only the elements in array a that correspond to the value 'c'. So df.a == 'c' will return True for every value in a that is 'c' and False otherwise, and df.loc[df.a == 'c', 'a'] will return the values from a for which the boolean statement was true.
c = df.loc[df.a == 'c', 'a']
d = df.loc[df.a == 'c', 'b']
In [15]: c
Out[15]:
0 c
6 c
11 c
16 c
Name: a, dtype: object
In [16]: d
Out[16]:
0 0
6 0
11 0
16 0
Name: b, dtype: int64
Python lists have a count method: https://www.tutorialspoint.com/python/python_lists.htm
I suggest you first zip both lists, as said in the comments, and then count occurrences of the tuple ('c', 0), of ('c', 1), and so on, and sum them up; that's what you need for (I), basically.
For (II), if I understood you correctly, you take the zipped lists and apply filter on them with a predicate like lambda x: x[0] == 'c', as sketched below.
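A minimal sketch of that suggestion, using the list versions of a and b from the Pandas answer above (the names n_c, c_pairs, etc. are illustrative):
pairs = list(zip(a, b))
# (I): count ('c', 0), ('c', 1), ('c', 2) and sum them up
n_c = sum(pairs.count(('c', v)) for v in (0, 1, 2))
# (II): keep only the pairs whose letter is 'c'
c_pairs = list(filter(lambda x: x[0] == 'c', pairs))
c = [letter for letter, _ in c_pairs]
d = [value for _, value in c_pairs]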

Insert column in data frame with duplicated axis

What would be the workaround (or the tidier way) to insert a column into a pandas dataframe where some indices are duplicated?
For example, having the following dataframe:
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df1 = df1.set_index([0])
df1
1 2
0
1 51 R
2 51 R
3 74 R
4 29 R
1 39 F
2 3 F
3 14 F
4 16 F
how can I insert the column foo from df2 (below) in df1?
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df2 = df2.set_index([0])
df2
foo 2
0
1 5 R
2 5 R
3 7 R
4 2 R
1 3 F
3 1 F
4 1 F
Note that the index 2 is missing from category F.
I would like the result to be something like:
1 foo 2
0
1 51 5 R
2 51 5 R
3 74 7 R
4 29 2 R
1 39 3 F
2 3 NaN F
3 14 1 F
4 16 1 F
I tried the DataFrame.insert method but am getting
df1.insert(2, 'FOO', df2['foo'])
ValueError: cannot reindex from a duplicate axis
The index and column 2 uniquely define a row on both data frames, so you can do a join on the two columns (after resetting the index):
df1.reset_index().merge(df2.reset_index(), how='left', on=[0,2]).set_index([0])
# 1 2 foo
#0
#1 51 R 5.0
#2 51 R 5.0
#3 74 R 7.0
#4 29 R 2.0
#1 39 F 3.0
#2 3 F NaN
#3 14 F 1.0
#4 16 F 1.0
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df1 = df1.set_index([0, 2])
df2 = df2.set_index([0, 2])
df1.join(df2, how='left').reset_index(level=2)
2 1 foo
0
1 R 51 5.0
2 R 51 5.0
3 R 74 7.0
4 R 29 2.0
1 F 39 3.0
2 F 3 NaN
3 F 14 1.0
4 F 16 1.0
You're very close...
As you already know based on your question, you can't do this for reasons clearly stated in the error, because you have a repeated index. If you must have column '0' as the index, then don't set it as the index before your merge, set it after:
df1 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 2, 3, 4),
1: (51, 51, 74, 29, 39, 3, 14, 16),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F', 'F']) })
df2 = pd.DataFrame({ 0: (1, 2, 3, 4, 1, 3, 4),
'foo': (5, 5, 7, 2, 3, 1, 1),
2: pd.Categorical(['R', 'R', 'R', 'R', 'F', 'F', 'F']) })
df = df1.merge(df2, how='left')
df = df.set_index([0])

groupby multiple value columns

I need to do a fuzzy groupby where a single record can be in one or more groups.
I have a DataFrame like this:
test = pd.DataFrame({'score1': pd.Series(['a', 'b', 'c', 'd', 'e']), 'score2': pd.Series(['b', 'a', 'k', 'n', 'c'])})
Output:
score1 score2
0 a b
1 b a
2 c k
3 d n
4 e c
I wish to have groups like this:
The group keys should be the union of the unique values between score1 and score2. Record 0 should be in groups a and b because it contains both score values. Similarly record 1 should be in groups b and a; record 2 should be in groups c and k and so on.
I've tried doing a groupby on two columns like this:
In [192]: score_groups = test.groupby(['score1', 'score2'])
However I get the group keys as tuples - ('a', 'b'), ('b', 'a'), ('c', 'k'), etc. - instead of unique group keys where records can be in multiple groups. The output is shown below:
In [192]: score_groups.groups
Out[192]: {('a', 'b'): [0],
('b', 'a'): [1],
('c', 'k'): [2],
('d', 'n'): [3],
('e', 'c'): [4]}
Also, I need the indexes preserved because I'm using them for another operation later.
Please help!
Combine the two columns into a single column using e.g. pd.concat():
s = pd.concat([test['score1'], test['score2'].rename('score1')]).reset_index()
s.columns = ['val', 'grp']
val grp
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
5 0 b
6 1 a
7 2 k
8 3 n
9 4 c
And then .groupby() on 'grp' and collect 'val' in a list:
s = s.groupby('grp').apply(lambda x: x.val.tolist())
a [0, 1]
b [1, 0]
c [2, 4]
d [3]
e [4]
k [2]
n [3]
or, if you prefer dict:
s.to_dict()
{'e': [4], 'd': [3], 'n': [3], 'k': [2], 'a': [0, 1], 'c': [2, 4], 'b': [1, 0]}
Or, to the same effect in a single step, skipping renaming the columns:
test.unstack().reset_index(-1).groupby(0).apply(lambda x: x.level_1.tolist())
a [0, 1]
b [1, 0]
c [2, 4]
d [3]
e [4]
k [2]
n [3]
Using Stefan's help, I solved it like this.
In [283]: frame1 = test[['score1']]
     ...: frame2 = test[['score2']].rename(columns={'score2': 'score1'})
     ...: test = pd.concat([frame1, frame2])
     ...: test
Out[283]:
score1
0 a
1 b
2 c
3 d
4 e
0 b
1 a
2 k
3 n
4 c
Notice the duplicate indices. The indexes have been preserved, which is what I wanted. Now, let's get down to business - the groupby operation.
In [283]: groups = test.groupby('score1')
     ...: groups.get_group('a')  # Get group with key a
Out[283]:
score1
0 a
1 a
In [283]: groups.get_group('b')  # Get group with key b
Out[283]:
score1
1 b
0 b
In [283]: groups.get_group('c')  # Get group with key c
Out[283]:
score1
2 c
4 c
In [283]: groups.get_group('k')  # Get group with key k
Out[283]:
score1
2 k
I'm baffled by how pandas retrieves rows with the correct index even though they are duplicated. As I understand, the group by operation uses an inverted index data structure to store the references (indexes) to rows. Any insights would be greatly appreciated. Anyone who answers this will have their answer accepted :)
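A short note on that question: groupby does not look rows up by index label at all; when the groups are formed it records the integer positions of each group's rows, so duplicated labels are harmless. Both views are visible on the grouped object (a sketch against the concatenated test frame above):
groups = test.groupby('score1')
# index *labels* per group (may contain duplicates):
groups.groups   # {'a': [0, 1], 'b': [1, 0], 'c': [2, 4], ...}
# integer *positions* per group, which is what row retrieval actually uses:
groups.indices  # {'a': array([0, 6]), 'b': array([1, 5]), ...}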
Start by reorganizing your data for ease of manipulation; having multiple value columns for the same data will always cause you headaches.
import pandas as pd
test = pd.DataFrame({'score1' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']), 'score2' : pd.Series([2, 1, 8, 9, 3], index=['a', 'b', 'c', 'd', 'e'])})
test['name'] = test.index
result = pd.melt(test, id_vars=['name'], value_vars=['score1', 'score2'])
name variable value
0 a score1 1
1 b score1 2
2 c score1 3
3 d score1 4
4 e score1 5
5 a score2 2
6 b score2 1
7 c score2 8
8 d score2 9
9 e score2 3
Now you have only one column for your value and it's easy to groupby score or select by your name column:
hey = result.groupby('value')
hey.groups
#below are the indices that you care about
{1: [0, 6], 2: [1, 5], 3: [2, 9], 4: [3], 5: [4], 8: [7], 9: [8]}

How to cross check a python list of dictionaries against a csr matrix

I have this csr matrix:
(0, 12114) 4
(0, 12001) 1
(0, 11998) 2
(0, 11132) 1
(0, 10412) 7
(1, 10096) 3
(1, 10085) 1
(1, 9105) 8
(1, 8925) 5
(1, 8660) 2
(2, 6577) 2
(2, 6491) 4
(3, 6178) 8
(3, 5286) 1
(3, 5147) 7
(3, 4466) 3
And this list of dictionaries:
[{11998: 0.27257158100079237, 12114: 0.27024630707640002},
{10085: 0.23909781233007368, 9105: 0.57533007741289421},
{6577: 0.45085059256989168, 6491: 0.5895717192325539},
{5286: 0.4482789582819417, 6178: 0.32295433881928487}]
I'd like to find a way to search each dictionary in the list against the corresponding row in the matrix (e.g. row 0 against first dictionary) and replace each value in the dictionary with the value from the matrix, according to the key...
So the result would be:
[{11998: 2, 12114: 4},
{10085: 1, 9105: 8},
{6577: 2, 6491: 4},
{5286: 1, 6178: 8}]
If X is your sparse matrix and
D = [{11998: 0.27257158100079237, 12114: 0.27024630707640002},
{10085: 0.23909781233007368, 9105: 0.57533007741289421},
{6577: 0.45085059256989168, 6491: 0.5895717192325539},
{5286: 0.4482789582819417, 6178: 0.32295433881928487}]
then
for i, d in enumerate(D):
    for j in d:
        d[j] = X[i, j]
gives the desired result:
>>> D
[{12114: 4.0, 11998: 2.0}, {9105: 8.0, 10085: 1.0}, {6577: 2.0, 6491: 4.0}, {6178: 8.0, 5286: 1.0}]
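As an aside, element access on a csr_matrix is comparatively slow. If D is large, converting once to dictionary-of-keys format keeps the lookups cheap; a minimal sketch with the same X and D:
Xd = X.todok()  # DOK format supports fast dict-like element access
for i, d in enumerate(D):
    for j in d:
        d[j] = Xd[i, j]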
