Here is a pandas DataFrame with columns A, B, C, D:
A B C D
0 1 2 1.0 a
1 1 2 1.01 a
2 1 2 1.0 b
3 3 4 0 b
4 3 4 0 c
5 1 2 1 c
6 1 9 1 c
How can I add a column to show duplicates from other rows, given these constraints:
exact match for A, B
float tolerance with C (within 0.05)
must not match D
A B C D Dups
0 1 2 1.0 a 2,5
1 1 2 1.01 a 2,5
2 1 2 1.0 b 0,1,5
3 3 4 0 b 4
4 3 4 0 c 3
5 1 2 1 c 0,1,2
6 1 9 1 c null
My original answer required N**2 iterations for N rows. The answer by sammywemmy loops over permutations(..., 2), which is essentially a loop over all N*(N-1) ordered pairs. The answer by warped is more efficient because it starts with a quicker match on the A and B columns, but it still does a slow search for the conditions on the C and D columns within each group. The number of iterations is therefore N*M, where M is the average number of rows sharing the same A and B values.
If you're willing to change the requirement of "C equal +/-0.05" to "C is equal when rounded to 1 decimal", it gets better, with N*K iterations where K is the average number of rows having the same A, B, and C values. Here is one implementation; you can also adapt warped's approach.
import numpy as np
import pandas as pd

df = pd.DataFrame(
{'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})
# alternative to "equal +/- 0.05"
df['C10'] = np.around(df['C']*10).astype('int')
# convert int64 tuples to int tuples
ituple = lambda tup: tuple(int(x) for x in tup)
# records: [(1, 2, 10), (1, 2, 10), (1, 2, 10), (3, 4, 0), ...]
records = [ituple(rec) for rec in df[['A', 'B', 'C10']].to_records(index=False)]
# dupd: dict with records as key, list of indices as values.
# e.g. {(1, 2, 10): [0, 1, 2, 5], ...}
dupd = {} # key: ABC tuples; value: list of indices
# Build up dupd based on equal A, B, C columns.
for i, rec in enumerate(records):
    # each record is a tuple of plain ints, so it can be used as a dict key
    if rec in dupd:
        dupd[rec].append(i)
    else:
        dupd[rec] = [i]
# build duplicates for each row, remove the ones with equal D
dups = []
D = df['D']
for i, rec in enumerate(records):
    dup = [j for j in dupd[rec] if i != j and D[i] != D[j]]
    dups.append(tuple(dup))
df.drop(columns=['C10'], inplace=True)
df['Dups'] = dups
print(df)
Output:
A B C D Dups
0 1 2 1.00 a (2, 5)
1 1 2 1.01 a (2, 5)
2 1 2 1.00 b (0, 1, 5)
3 3 4 0.00 b (4,)
4 3 4 0.00 c (3,)
5 1 2 1.00 c (0, 1, 2)
6 1 9 1.00 c ()
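If you need the exact "within 0.05" tolerance rather than the rounded-bucket approximation, here is a sketch along the same lines (not benchmarked): two C values within 0.05 of each other can only land in rounded keys that differ by at most 1, so it suffices to probe the neighboring buckets and re-check the exact condition.
Cvals = df['C'].to_numpy()
dups = []
for i, (a, b, c10) in enumerate(records):
    dup = []
    # a true match can only live in bucket c10 - 1, c10, or c10 + 1
    for key in ((a, b, c10 - 1), (a, b, c10), (a, b, c10 + 1)):
        for j in dupd.get(key, []):
            if i != j and D[i] != D[j] and abs(Cvals[i] - Cvals[j]) <= 0.05:
                dup.append(j)
    dups.append(tuple(sorted(dup)))
df['Dups'] = dups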
Here is the original answer, which scales as O(N**2), but is easy to understand:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})
dups = []
for i, irow in df.iterrows():
    dup = []
    for j, jrow in df.iterrows():
        if (i != j and
                irow['A'] == jrow['A'] and
                irow['B'] == jrow['B'] and
                abs(irow['C'] - jrow['C']) < 0.05 and
                irow['D'] != jrow['D']):
            dup.append(j)
    dups.append(tuple(dup))
df['Dups'] = dups
print(df)
This is far from pretty, but it does get the job done:
tolerance = 0.05
dups = {}
for _, group in df.groupby(['A', 'B']):
    for i, row1 in group.iterrows():
        data = []
        for j, row2 in group.iterrows():
            if i != j:
                if abs(row1['C'] - row2['C']) <= tolerance:
                    if row1['D'] != row2['D']:
                        data.append(j)
        dups[i] = data
dups = [dups.get(a) for a in range(len(dups.keys()))]
df['dups'] = dups
df
A B C D dups
0 1 2 1.00 a [2, 5]
1 1 2 1.01 a [2, 5]
2 1 2 1.00 b [0, 1, 5]
3 3 4 0.00 b [4]
4 3 4 0.00 c [3]
5 1 2 1.00 c [0, 1, 2]
6 1 9 1.00 c []
Convert to dictionary:
res = df.T.to_dict("list")
res
{0: [1, 2, 1.0, 'a'],
1: [1, 2, 1.01, 'a'],
2: [1, 2, 1.0, 'b'],
3: [3, 4, 0.0, 'b'],
4: [3, 4, 0.0, 'c'],
5: [1, 2, 1.0, 'c'],
6: [1, 9, 1.0, 'c']}
Pair each index with its row values:
box = [(key,*value) for key, value in res.items()]
box
[(0, 1, 2, 1.0, 'a'),
(1, 1, 2, 1.01, 'a'),
(2, 1, 2, 1.0, 'b'),
(3, 3, 4, 0.0, 'b'),
(4, 3, 4, 0.0, 'c'),
(5, 1, 2, 1.0, 'c'),
(6, 1, 9, 1.0, 'c')]
Use itertools.permutations along with your conditions to filter for matches:
from itertools import permutations
phase1 = [(ind, (first, second),*_) for ind, first, second, *_ in box]
# can be refactored into something cleaner
phase2 = [((*first[1], *first[2:]), second[0])
          for first, second in permutations(phase1, 2)
          if first[1] == second[1]
          and abs(second[2] - first[2]) <= 0.05  # abs() makes the tolerance symmetric
          and first[-1] != second[-1]
          ]
phase2
[((1, 2, 1.0, 'a'), 2),
((1, 2, 1.0, 'a'), 5),
((1, 2, 1.01, 'a'), 2),
((1, 2, 1.01, 'a'), 5),
((1, 2, 1.0, 'b'), 0),
((1, 2, 1.0, 'b'), 1),
((1, 2, 1.0, 'b'), 5),
((3, 4, 0.0, 'b'), 4),
((3, 4, 0.0, 'c'), 3),
((1, 2, 1.0, 'c'), 0),
((1, 2, 1.0, 'c'), 1),
((1, 2, 1.0, 'c'), 2)]
Group the pairings via defaultdict:
from collections import defaultdict
d = defaultdict(list)
for k, v in phase2:
d[k].append(v)
d
defaultdict(list,
{(1, 2, 1.0, 'a'): [2, 5],
(1, 2, 1.01, 'a'): [2, 5],
(1, 2, 1.0, 'b'): [0, 1, 5],
(3, 4, 0.0, 'b'): [4],
(3, 4, 0.0, 'c'): [3],
(1, 2, 1.0, 'c'): [0, 1, 2]})
Join the values in d into comma-separated strings:
e = [(*k,",".join(str(ent) for ent in v)) for k,v in d.items()]
e
[(1, 2, 1.0, 'a', '2,5'),
(1, 2, 1.01, 'a', '2,5'),
(1, 2, 1.0, 'b', '0,1,5'),
(3, 4, 0.0, 'b', '4'),
(3, 4, 0.0, 'c', '3'),
(1, 2, 1.0, 'c', '0,1,2')]
Create a dataframe from the extract:
cols = df.columns.append(pd.Index(["Dups"]))
dups = pd.DataFrame(e, columns=cols)
Merge with the original dataframe:
result = df.merge(dups, how="left", on=["A", "B", "C", "D"])
result
A B C D Dups
0 1 2 1.00 a 2,5
1 1 2 1.01 a 2,5
2 1 2 1.00 b 0,1,5
3 3 4 0.00 b 4
4 3 4 0.00 c 3
5 1 2 1.00 c 0,1,2
6 1 9 1.00 c NaN
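One caveat: the merge keys include the float column C, so the merge relies on exact float equality, which holds here only because dups was built from the same values. A sketch that avoids the merge by collecting row indices directly (same filter as phase2):
pairs = defaultdict(list)
for first, second in permutations(phase1, 2):
    if (first[1] == second[1]
            and abs(second[2] - first[2]) <= 0.05
            and first[-1] != second[-1]):
        pairs[first[0]].append(second[0])
# "" becomes None, mirroring the NaN for row 6 above
df["Dups"] = [",".join(str(j) for j in pairs[i]) or None for i in df.index]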
I have the following Python numpy array, arr:
array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 3, 3, 3, 2, 2, 2,
       2, 2, 2, 1, 1, 1, 1])
I can find the first occurrence of 1 like this:
np.where(arr.squeeze() == 1)[0]
How do I find the position of the last 1 before either a 0 or a 3?
Here's one approach using np.where and np.in1d -
# Get the indices of places with 0s or 3s and this
# decides the last index where we need to look for 1s later on
last_idx = np.where(np.in1d(arr,[0,3]))[0][-1]
# Get all indices of 1s within the range of last_idx and choose the last one
out = np.where(arr[:last_idx]==1)[0][-1]
Please note that when no indices are found, indexing with [0][-1] raises an error about an empty array, so error-checking code needs to be wrapped around those lines.
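For example, a guarded version might look like this (a sketch; what to return when nothing is found is up to you):
idx = np.where(np.in1d(arr, [0, 3]))[0]
if idx.size:
    ones = np.where(arr[:idx[-1]] == 1)[0]
    out = ones[-1] if ones.size else None  # no 1 before the last 0/3
else:
    out = None  # no 0 or 3 present at all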
Sample run -
In [118]: arr
Out[118]: array([1, 1, 3, 0, 3, 2, 0, 1, 2, 1, 0, 2, 2, 3, 2])
In [119]: last_idx = np.where(np.in1d(arr,[0,3]))[0][-1]
In [120]: np.where(arr[:last_idx]==1)[0][-1]
Out[120]: 9
You can use a rolling window and search that for the values you want:
import numpy as np
arr = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 3, 3, 3, 2, 2, 2,
                2, 2, 2, 1, 1, 1, 1])
def rolling_window(a, window):
    # View `a` as overlapping windows of length `window` without copying,
    # via stride tricks; result has shape (len(a) - window + 1, window).
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
match_1_0 = np.all(rolling_window(arr, 2) == [1, 0], axis=1)
match_1_3 = np.all(rolling_window(arr, 2) == [1, 3], axis=1)
all_matches = np.logical_or(match_1_0, match_1_3)
print(np.flatnonzero(all_matches)[-1])
Depending on your arrays, this might be good enough performance-wise. That said, a less flexible (but simpler) solution might perform better, even though it is the kind of explicit index loop you usually want to avoid with numpy:
def last_one_before_stop(arr):  # hypothetical wrapper so `return` is valid
    for ix in range(len(arr) - 2, -1, -1):  # xrange in Python 2
        if arr[ix] == 1 and (arr[ix + 1] == 0 or arr[ix + 1] == 3):
            return ix
You might even be able to do something like the following, which is probably a bit more flexible than the hard-coded solution above and (I would guess) still probably would outperform the rolling-window solution:
def rfind(haystack, needle):
    # scan backwards for the last occurrence of `needle` in `haystack`
    len_needle = len(needle)
    for ix in range(len(haystack) - len_needle, -1, -1):  # xrange in Python 2
        if (haystack[ix:ix + len_needle] == needle).all():
            return ix
Here, you'd do something like:
max(rfind(arr, np.array([1, 0])), rfind(arr, np.array([1, 3])))
And of course, with all of these answers, I haven't actually handled the case where the pattern you are searching for isn't present, since you didn't specify what you would want in that case...
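If you do want to handle it, one option (a sketch, borrowing the -1 "not found" sentinel from str.rfind) is to make rfind return -1 when nothing matches:
def rfind(haystack, needle):
    len_needle = len(needle)
    for ix in range(len(haystack) - len_needle, -1, -1):
        if (haystack[ix:ix + len_needle] == needle).all():
            return ix
    return -1  # sentinel: needle not found

pos = max(rfind(arr, np.array([1, 0])), rfind(arr, np.array([1, 3])))
if pos == -1:
    print("no 1 immediately followed by a 0 or 3")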
I would like to convert a list of Python dictionaries into a SciPy sparse matrix.
I know I can use sklearn.feature_extraction.DictVectorizer.fit_transform():
import sklearn.feature_extraction
feature_dictionary = [{"feat1": 1.5, "feat10": 0.5},
{"feat4": 2.1, "feat5": 0.3, "feat7": 0.1},
{"feat2": 7.5}]
v = sklearn.feature_extraction.DictVectorizer(sparse=True, dtype=float)
X = v.fit_transform(feature_dictionary)
print('X: \n{0}'.format(X))
which outputs:
X:
(0, 0) 1.5
(0, 1) 0.5
(1, 3) 2.1
(1, 4) 0.3
(1, 5) 0.1
(2, 2) 7.5
However, I'd like feat1 to be in column 1, feat10 in column 10, feat4 in column 4, and so on. How can I achieve that?
You could manually set sklearn.feature_extraction.DictVectorizer.vocabulary_ and sklearn.feature_extraction.DictVectorizer.feature_names_ instead of learning them through sklearn.feature_extraction.DictVectorizer.fit():
import sklearn.feature_extraction
feature_dictionary = [{"feat1": 1.5, "feat10": 0.5}, {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1}, {"feat2": 7.5}]
v = sklearn.feature_extraction.DictVectorizer(sparse=True, dtype=float)
v.vocabulary_ = {'feat0': 0, 'feat1': 1, 'feat2': 2, 'feat3': 3, 'feat4': 4, 'feat5': 5,
'feat6': 6, 'feat7': 7, 'feat8': 8, 'feat9': 9, 'feat10': 10}
v.feature_names_ = ['feat0', 'feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6', 'feat7',
'feat8', 'feat9', 'feat10']
X = v.transform(feature_dictionary)
print('v.vocabulary_ : {0} ; v.feature_names_: {1}'.format(v.vocabulary_, v.feature_names_))
print('X: \n{0}'.format(X))
outputs:
X:
(0, 1) 1.5
(0, 10) 0.5
(1, 4) 2.1
(1, 5) 0.3
(1, 7) 0.1
(2, 2) 7.5
Of course, you don't have to write vocabulary_ and feature_names_ out by hand; you can build them in a loop:
v.vocabulary_ = {}
v.feature_names_ = []
number_of_features = 11
for feature_number in range(number_of_features):
    feature_name = 'feat{0}'.format(feature_number)
    v.vocabulary_[feature_name] = feature_number
    v.feature_names_.append(feature_name)
print('v.vocabulary_ : {0} ; v.feature_names_: {1}'.format(v.vocabulary_, v.feature_names_))
outputs:
v.vocabulary_ : {'feat10': 10, 'feat9': 9, 'feat8': 8, 'feat5': 5, 'feat4': 4, 'feat7': 7,
'feat6': 6, 'feat1': 1, 'feat0': 0, 'feat3': 3, 'feat2': 2}
v.feature_names_: ['feat0', 'feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6', 'feat7',
'feat8', 'feat9', 'feat10']
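If you don't know the highest feature number ahead of time, you could also derive it from the names actually present in the data (a sketch; it assumes every key is 'feat' followed by an integer):
all_names = {name for d in feature_dictionary for name in d}
number_of_features = max(int(name[len('feat'):]) for name in all_names) + 1
v.feature_names_ = ['feat{0}'.format(i) for i in range(number_of_features)]
v.vocabulary_ = {name: i for i, name in enumerate(v.feature_names_)}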
I have this csr matrix:
(0, 12114) 4
(0, 12001) 1
(0, 11998) 2
(0, 11132) 1
(0, 10412) 7
(1, 10096) 3
(1, 10085) 1
(1, 9105) 8
(1, 8925) 5
(1, 8660) 2
(2, 6577) 2
(2, 6491) 4
(3, 6178) 8
(3, 5286) 1
(3, 5147) 7
(3, 4466) 3
And this list of dictionaries:
[{11998: 0.27257158100079237, 12114: 0.27024630707640002},
{10085: 0.23909781233007368, 9105: 0.57533007741289421},
{6577: 0.45085059256989168, 6491: 0.5895717192325539},
{5286: 0.4482789582819417, 6178: 0.32295433881928487}]
I'd like to find a way to look up each dictionary in the list against the corresponding row of the matrix (e.g., row 0 against the first dictionary) and replace each value in the dictionary with the matrix value for that key...
So the result would be:
[{11998: 2, 12114: 4},
{10085: 1, 9105: 8},
{6577: 2, 6491: 4},
{5286: 1, 6178: 8}]
If X is your sparse matrix and
D = [{11998: 0.27257158100079237, 12114: 0.27024630707640002},
{10085: 0.23909781233007368, 9105: 0.57533007741289421},
{6577: 0.45085059256989168, 6491: 0.5895717192325539},
{5286: 0.4482789582819417, 6178: 0.32295433881928487}]
then
for i, d in enumerate(D):
    for j in d:
        d[j] = X[i, j]
gives the desired result:
>>> D
[{12114: 4.0, 11998: 2.0}, {9105: 8.0, 10085: 1.0}, {6577: 2.0, 6491: 4.0}, {6178: 8.0, 5286: 1.0}]
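Note that element access like X[i, j] on a csr matrix is relatively slow. If the dictionaries are large, it may be faster (a sketch, not benchmarked) to expand each row's stored entries into a plain dict first:
for i, d in enumerate(D):
    row = X.getrow(i)                            # i-th row as a 1-row csr matrix
    nonzeros = dict(zip(row.indices, row.data))  # column index -> stored value
    for j in d:
        d[j] = nonzeros.get(j, 0.0)              # absent entries are implicit zeros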