R list matching with %in% seems weird; the same logic in Python works fine

The following code produces an unexpected result, which I find somewhat weird. First, define featfun():
featfun <- function(yi_1, yi, i) {
  all_fea <- list(c(1, 2, 2),
                  c(1, 2, 3),
                  c(1, 1, 2),
                  c(2, 1, 3),
                  c(2, 1, 2),
                  c(2, 2, 3),
                  c(1, 1),
                  c(2, 1),
                  c(2, 2),
                  c(1, 2),
                  c(1, 3),
                  c(2, 3))
  weights <- c(1, 1, 0.6, 1, 1, 0.2, 1, 0.5, 0.5, 0.8, 0.8, 0.5)
  idx1 <- 0; idx2 <- 0
  if (list(c(yi_1, yi, i)) %in% all_fea) {
    idx1 <- which(all_fea %in% list(c(yi_1, yi, i)))
  }
  if (list(c(yi, i)) %in% all_fea) {
    idx2 <- which(all_fea %in% list(c(yi, i)))
  }
  if (idx1 != 0 & idx2 != 0) {
    return(list(c(1, weights[idx1]), c(1, weights[idx2])))
  } else if (idx1 != 0 & idx2 == 0) {
    return(list(c(1, weights[idx1])))
  } else if (idx1 == 0 & idx2 != 0) {
    return(list(c(1, weights[idx2])))
  } else {
    return(NA)
  }
}
> featfun(1,1,2)
[[1]]
[1] 1.0 0.6
[[2]]
[1] 1.0 0.8
Then I call featfun() inside nested for loops:
> for (k in seq(2, 3)) {
+   cat("k=", k, "\n")
+   for (i in seq(1, 2)) {
+     cat("i=", i, "\n")
+     print(featfun(1, i, k))
+   }
+ }
k= 2
i= 1
[[1]]
[1] 1.0 0.6
i= 2
[[1]]
[1] 1 1
[[2]]
[1] 1.0 0.5
k= 3
i= 1
[[1]]
[1] 1.0 0.8
i= 2
[[1]]
[1] 1 1
As we can see, when k = 2 and i = 1, only the first element "[1] 1.0 0.6" is returned; the second element is missing, which does not agree with the result of featfun(1,1,2) above.
To double-check, I rewrote the code in Python:
def featfun(yi_1, yi, i):
    all_fea = [[1, 2, 2],
               [1, 2, 3],
               [1, 1, 2],
               [2, 1, 3],
               [2, 1, 2],
               [2, 2, 3],
               [1, 1],
               [2, 1],
               [2, 2],
               [1, 2],
               [1, 3],
               [2, 3]]
    weights = [1, 1, 0.6, 1, 1, 0.2, 1, 0.5, 0.5, 0.8, 0.8, 0.5]
    idx1 = 999
    idx2 = 999
    if [yi_1, yi, i] in all_fea:
        idx1 = all_fea.index([yi_1, yi, i])
    if [yi, i] in all_fea:
        idx2 = all_fea.index([yi, i])
    if idx1 != 999 and idx2 != 999:
        return [[1, weights[idx1]], [1, weights[idx2]]]
    elif idx1 != 999 and idx2 == 999:
        return [1, weights[idx1]]
    elif idx1 == 999 and idx2 != 999:
        return [1, weights[idx2]]
    else:
        return None
featfun(1,1,2) returns [[1, 0.6], [1, 0.8]].
Then I combine the Python featfun with for loops again:
for k in [2, 3]:
    for i in [1, 2]:
        print(featfun(1, i, k))
Following are the printed results, which are correct, i.e. the same as the answer in the textbook:
[[1, 0.6], [1, 0.8]]
[[1, 1], [1, 0.5]]
[1, 0.8]
[[1, 1], [1, 0.5]]
What is happening with my R code? Or is something actually wrong in R? I hope someone can help me. Thanks!

This is not a numerical precision issue; it is a type issue in how %in% matches list elements. %in% is built on match(), and per ?match, list arguments are converted to character vectors before matching, i.e. each list element is deparsed to a string. Your all_fea entries are doubles, so c(1, 2) becomes the string "c(1, 2)". But seq(1, 2) and seq(2, 3) produce integers, and an integer vector that happens to form a consecutive run is deparsed in colon notation: c(1L, 2L) becomes "1:2", which never equals "c(1, 2)". That is why only some lookups fail in your loop: c(1L, 2L) and c(2L, 3L) turn into "1:2" and "2:3" and miss, while c(2L, 2L) and c(1L, 3L) still deparse to "c(2, 2)" and "c(1, 3)" and match. If you make the all_fea list items integers, the deparsed strings agree again and the loop works:
all_fea <- list(c(1L, 2L, 2L),
                c(1L, 2L, 3L),
                c(1L, 1L, 2L),
                c(2L, 1L, 3L),
                c(2L, 1L, 2L),
                c(2L, 2L, 3L),
                c(1L, 1L),
                c(2L, 1L),
                c(2L, 2L),
                c(1L, 2L),
                c(1L, 3L),
                c(2L, 3L))
The above is the manual way. Alternatively, you could leave the list as-is and add the line all_fea <- lapply(all_fea, as.integer) right after it is defined. Either way, after that change your loop works as expected.
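For contrast, this also shows why your Python version is immune: Python's list membership compares elements by numeric value, and int == float succeeds whenever the values are equal, so mixed-type lookups still match. A quick check:
print([1, 2] == [1.0, 2.0])           # True: element-wise numeric equality
print([1.0, 2.0] in [[1, 2], [3]])    # True: membership uses the same ==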

Output of my python list / array is repeating the values. How do I remove this?

I want my code to store the x results in a list. However, it is repeating the x values. How do I fix this?
list = [1, 3, 2, 0, 4, 0, 12]
non_zeros = [list for list in list if list != 0]
set_x = []
num_x = 0
for i, (prev_coeff, next_coeff) in enumerate(zip(non_zeros, non_zeros[1:])):
    x = (next_coeff*3) / (prev_coeff*2)
    print(f"x{i + 1} = {x}")
    num_x += 1
    for j in range(num_x):
        set_x.append(x)
print(set_x)
My result is:
x1 = 4.5
x2 = 1.0
x3 = 3.0
x4 = 4.5
[4.5, 1.0, 1.0, 3.0, 3.0, 3.0, 4.5, 4.5, 4.5, 4.5]
How do I set the list to be:
[4.5, 1.0, 3.0, 4.5]
The problem is the nested loop: on every pass through the outer loop, for j in range(num_x) re-appends the current x num_x times, which is where the repeats come from (moving that inner loop outside the outer loop would instead append only the final x four times). Just append x once per iteration; renaming list to list1 also avoids shadowing the built-in:
list1 = [1, 3, 2, 0, 4, 0, 12]
non_zeros = [t for t in list1 if t != 0]
set_x = []
for i, (prev_coeff, next_coeff) in enumerate(zip(non_zeros, non_zeros[1:])):
    x = (next_coeff*3) / (prev_coeff*2)
    print(f"x{i + 1} = {x}")
    set_x.append(x)  # one append per computed x
print(set_x)
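Equivalently, since each x depends only on the adjacent pair, the whole list can be built in one comprehension (a compact variant of the same arithmetic):
set_x = [(b * 3) / (a * 2) for a, b in zip(non_zeros, non_zeros[1:])]
print(set_x)  # [4.5, 1.0, 3.0, 4.5]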

Pandas: add indicator for duplicate on columns

Here is a pandas DF with columns A, B, C, D
A B C D
0 1 2 1.0 a
1 1 2 1.01 a
2 1 2 1.0 b
3 3 4 0 b
4 3 4 0 c
5 1 2 1 c
6 1 9 1 c
How can I add a column to show duplicates from other rows with constraints:
exact match for A, B
float tolerance with C (within 0.05)
must not match D
A B C D Dups
0 1 2 1.0 a 2,5
1 1 2 1.01 a 2,5
2 1 2 1.0 b 0,1,5
3 3 4 0 b 4
4 3 4 0 c 3
5 1 2 1 c 0,1,2
6 1 9 1 c null
My original answer required N**2 iterations for N rows. The answer by sammywemmy loops over permutations(..., 2), which is essentially a loop over N*(N-1) combinations. The answer by warped is more efficient because it starts with a quicker matching on the A and B columns, but there is still a slow search for the conditions on the C and D columns. The number of iterations is therefore N*M where M is the average number of rows sharing the same A and B values.
If you're willing to change the requirement of "C equal +/-0.05" to "C is equal when rounded to 1 decimal", it gets better, with N*K iterations where K is the average number of rows having the same A, B, and C values. Here is one implementation; you can also adapt warped's approach.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})

# alternative to "equal +/- 0.05"
df['C10'] = np.around(df['C']*10).astype('int')
# convert int64 tuples to int tuples
ituple = lambda tup: tuple(int(x) for x in tup)
# records: [(1, 2, 10), (1, 2, 10), (1, 2, 10), (3, 4, 0), ...]
records = [ituple(rec) for rec in df[['A', 'B', 'C10']].to_records(index=False)]
# dupd: dict with records as keys, lists of indices as values,
# e.g. {(1, 2, 10): [0, 1, 2, 5], ...}
dupd = {}
# Build up dupd based on equal A, B, C columns.
for i, rec in enumerate(records):
    # each record is a tuple of ints, so it can be used as a dict key
    if rec in dupd:
        dupd[rec].append(i)
    else:
        dupd[rec] = [i]
# build duplicates for each row, removing the ones with equal D
dups = []
D = df['D']
for i, rec in enumerate(records):
    dup = [j for j in dupd[rec] if i != j and D[i] != D[j]]
    dups.append(tuple(dup))
df.drop(columns=['C10'], inplace=True)
df['Dups'] = dups
print(df)
Output:
A B C D Dups
0 1 2 1.00 a (2, 5)
1 1 2 1.01 a (2, 5)
2 1 2 1.00 b (0, 1, 5)
3 3 4 0.00 b (4,)
4 3 4 0.00 c (3,)
5 1 2 1.00 c (0, 1, 2)
6 1 9 1.00 c ()
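Adapting warped's groupby approach to the rounded column, as mentioned above, would look something like this (a sketch reusing the same df and the C10 construction; not part of either original answer):
# group on the rounded C as well, so each group holds only candidate duplicates
df['C10'] = np.around(df['C'] * 10).astype('int')
dups = {}
for _, group in df.groupby(['A', 'B', 'C10']):
    for i, row1 in group.iterrows():
        dups[i] = tuple(j for j, row2 in group.iterrows()
                        if i != j and row1['D'] != row2['D'])
df['Dups'] = [dups[i] for i in range(len(df))]
df.drop(columns=['C10'], inplace=True)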
Here is the original answer, which scales as O(N**2), but is easy to understand:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})

dups = []
for i, irow in df.iterrows():
    dup = []
    for j, jrow in df.iterrows():
        if (i != j and
                irow['A'] == jrow['A'] and
                irow['B'] == jrow['B'] and
                abs(irow['C'] - jrow['C']) < 0.05 and
                irow['D'] != jrow['D']):
            dup.append(j)
    dups.append(tuple(dup))
df['Dups'] = dups
print(df)
This is far from pretty, but it does get the job done:
tolerance = 0.05
dups = {}
for _, group in df.groupby(['A', 'B']):
    for i, row1 in group.iterrows():
        data = []
        for j, row2 in group.iterrows():
            if i != j:
                if abs(row1['C'] - row2['C']) <= tolerance:
                    if row1['D'] != row2['D']:
                        data.append(j)
        dups[i] = data
dups = [dups.get(a) for a in range(len(dups.keys()))]
df['dups'] = dups
df
A B C D dups
0 1 2 1.00 a [2, 5]
1 1 2 1.01 a [2, 5]
2 1 2 1.00 b [0, 1, 5]
3 3 4 0.00 b [4]
4 3 4 0.00 c [3]
5 1 2 1.00 c [0, 1, 2]
6 1 9 1.00 c []
Convert to a dictionary:
res = df.T.to_dict("list")
res
{0: [1, 2, 1.0, 'a'],
1: [1, 2, 1.01, 'a'],
2: [1, 2, 1.0, 'b'],
3: [3, 4, 0.0, 'b'],
4: [3, 4, 0.0, 'c'],
5: [1, 2, 1.0, 'c'],
6: [1, 9, 1.0, 'c']}
Pair each index with its values in a tuple:
box = [(key,*value) for key, value in res.items()]
box
[(0, 1, 2, 1.0, 'a'),
(1, 1, 2, 1.01, 'a'),
(2, 1, 2, 1.0, 'b'),
(3, 3, 4, 0.0, 'b'),
(4, 3, 4, 0.0, 'c'),
(5, 1, 2, 1.0, 'c'),
(6, 1, 9, 1.0, 'c')]
Use itertools.permutations along with your conditions to filter out matches:
from itertools import permutations
phase1 = [(ind, (first, second), *rest) for ind, first, second, *rest in box]
# can be refactored with something cleaner
phase2 = [((*first[1], *first[2:]), second[0])
          for first, second in permutations(phase1, 2)
          if first[1] == second[1]
          and abs(second[2] - first[2]) <= 0.05
          and first[-1] != second[-1]]
phase2
[((1, 2, 1.0, 'a'), 2),
((1, 2, 1.0, 'a'), 5),
((1, 2, 1.01, 'a'), 2),
((1, 2, 1.01, 'a'), 5),
((1, 2, 1.0, 'b'), 0),
((1, 2, 1.0, 'b'), 1),
((1, 2, 1.0, 'b'), 5),
((3, 4, 0.0, 'b'), 4),
((3, 4, 0.0, 'c'), 3),
((1, 2, 1.0, 'c'), 0),
((1, 2, 1.0, 'c'), 1),
((1, 2, 1.0, 'c'), 2)]
Get the pairings via a defaultdict:
from collections import defaultdict
d = defaultdict(list)
for k, v in phase2:
    d[k].append(v)
d
defaultdict(list,
{(1, 2, 1.0, 'a'): [2, 5],
(1, 2, 1.01, 'a'): [2, 5],
(1, 2, 1.0, 'b'): [0, 1, 5],
(3, 4, 0.0, 'b'): [4],
(3, 4, 0.0, 'c'): [3],
(1, 2, 1.0, 'c'): [0, 1, 2]})
Combine the values in d into strings:
e = [(*k, ",".join(str(ent) for ent in v)) for k, v in d.items()]
e
[(1, 2, 1.0, 'a', '2,5'),
(1, 2, 1.01, 'a', '2,5'),
(1, 2, 1.0, 'b', '0,1,5'),
(3, 4, 0.0, 'b', '4'),
(3, 4, 0.0, 'c', '3'),
(1, 2, 1.0, 'c', '0,1,2')]
Create a dataframe from the extract:
cols = df.columns.append(pd.Index(["Dups"]))
dups = pd.DataFrame(e, columns=cols)
Merge with the original dataframe:
result = df.merge(dups, how="left", on=["A", "B", "C", "D"])
result
A B C D Dups
0 1 2 1.00 a 2,5
1 1 2 1.01 a 2,5
2 1 2 1.00 b 0,1,5
3 3 4 0.00 b 4
4 3 4 0.00 c 3
5 1 2 1.00 c 0,1,2
6 1 9 1.00 c NaN

find position of last element before specific value in python numpy

I have the following numpy array, arr:
array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 3, 3, 3, 2, 2, 2,
       2, 2, 2, 1, 1, 1, 1])
I can find the first occurrence of 1 like this:
np.where(arr.squeeze() == 1)[0]
How do I find the position of the last 1 before either a 0 or a 3?
Here's one approach using np.where and np.in1d -
# Get the indices of places with 0s or 3s and this
# decides the last index where we need to look for 1s later on
last_idx = np.where(np.in1d(arr,[0,3]))[0][-1]
# Get all indices of 1s within the range of last_idx and choose the last one
out = np.where(arr[:last_idx]==1)[0][-1]
Please note that for cases where no indices are found, indexing like [0][-1] will raise an error about having no elements, so error-checking code needs to be wrapped around those lines.
Sample run -
In [118]: arr
Out[118]: array([1, 1, 3, 0, 3, 2, 0, 1, 2, 1, 0, 2, 2, 3, 2])
In [119]: last_idx = np.where(np.in1d(arr,[0,3]))[0][-1]
In [120]: np.where(arr[:last_idx]==1)[0][-1]
Out[120]: 9
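For example, a guarded variant might look like this (a sketch; the function name and returning None for the not-found case are my own choices, not part of the original answer):
import numpy as np

def last_one_before(arr, stops=(0, 3)):
    # indices of stop values; bail out if there are none
    stop_idx = np.flatnonzero(np.isin(arr, stops))
    if stop_idx.size == 0:
        return None
    # indices of 1s strictly before the last stop value
    ones = np.flatnonzero(arr[:stop_idx[-1]] == 1)
    return int(ones[-1]) if ones.size else None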
You can use a rolling window and search that for the values you want:
import numpy as np

arr = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 3, 3, 3, 2, 2, 2,
                2, 2, 2, 1, 1, 1, 1])

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

match_1_0 = np.all(rolling_window(arr, 2) == [1, 0], axis=1)
match_1_3 = np.all(rolling_window(arr, 2) == [1, 3], axis=1)
all_matches = np.logical_or(match_1_0, match_1_3)
print(np.flatnonzero(all_matches)[-1])
Depending on your arrays, this might be good enough performance-wise. With that said, a less flexible (but simpler) solution might perform better even if it is a loop over indices that you usually want to avoid with numpy...:
def last_match(arr):
    for ix in range(len(arr) - 2, -1, -1):
        if arr[ix] == 1 and (arr[ix + 1] == 0 or arr[ix + 1] == 3):
            return ix
You might even be able to do something like the following, which is probably a bit more flexible than the hard-coded solution above and (I would guess) still probably would out-perform the rolling-window solution:
def rfind(haystack, needle):
    len_needle = len(needle)
    for ix in range(len(haystack) - len_needle, -1, -1):
        if (haystack[ix:ix + len_needle] == needle).all():
            return ix
Here, you'd do something like:
max(rfind(arr, np.array([1, 0])), rfind(arr, np.array([1, 3])))
And of course, none of these answers actually handles the case where the thing you are searching for isn't present, since you didn't specify what you would want in that case...
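One way to make rfind total, assuming (my assumption, not anything stated in the question) that -1 is an acceptable sentinel in the spirit of str.rfind:
def rfind(haystack, needle):
    len_needle = len(needle)
    for ix in range(len(haystack) - len_needle, -1, -1):
        if (haystack[ix:ix + len_needle] == needle).all():
            return ix
    return -1  # sentinel: the pattern never occurs

# max() then yields -1 only when neither pattern is present
result = max(rfind(arr, np.array([1, 0])), rfind(arr, np.array([1, 3])))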

Converting a list of Python dictionaries into a SciPy sparse matrix

I would like to convert a list of Python dictionaries into a SciPy sparse matrix.
I know I can use sklearn.feature_extraction.DictVectorizer.fit_transform():
import sklearn.feature_extraction

feature_dictionary = [{"feat1": 1.5, "feat10": 0.5},
                      {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1},
                      {"feat2": 7.5}]
v = sklearn.feature_extraction.DictVectorizer(sparse=True, dtype=float)
X = v.fit_transform(feature_dictionary)
print('X: \n{0}'.format(X))
which outputs:
X:
(0, 0) 1.5
(0, 1) 0.5
(1, 3) 2.1
(1, 4) 0.3
(1, 5) 0.1
(2, 2) 7.5
However, I'd like feat1 to be in column 1, feat10 in column 10, feat4 in column 4, and so on. How can I achieve that?
You could manually set sklearn.feature_extraction.DictVectorizer.vocabulary_ and sklearn.feature_extraction.DictVectorizer.feature_names_ instead of learning them through sklearn.feature_extraction.DictVectorizer.fit():
import sklearn.feature_extraction

feature_dictionary = [{"feat1": 1.5, "feat10": 0.5}, {"feat4": 2.1, "feat5": 0.3, "feat7": 0.1}, {"feat2": 7.5}]
v = sklearn.feature_extraction.DictVectorizer(sparse=True, dtype=float)
v.vocabulary_ = {'feat0': 0, 'feat1': 1, 'feat2': 2, 'feat3': 3, 'feat4': 4, 'feat5': 5,
                 'feat6': 6, 'feat7': 7, 'feat8': 8, 'feat9': 9, 'feat10': 10}
v.feature_names_ = ['feat0', 'feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6', 'feat7',
                    'feat8', 'feat9', 'feat10']
X = v.transform(feature_dictionary)
print('v.vocabulary_ : {0} ; v.feature_names_: {1}'.format(v.vocabulary_, v.feature_names_))
print('X: \n{0}'.format(X))
outputs:
X:
(0, 1) 1.5
(0, 10) 0.5
(1, 4) 2.1
(1, 5) 0.3
(1, 7) 0.1
(2, 2) 7.5
Obviously you don't have to define vocabulary_ and feature_names_ manually:
v.vocabulary_ = {}
v.feature_names_ = []
number_of_features = 11
for feature_number in range(number_of_features):
    feature_name = 'feat{0}'.format(feature_number)
    v.vocabulary_[feature_name] = feature_number
    v.feature_names_.append(feature_name)
print('v.vocabulary_ : {0} ; v.feature_names_: {1}'.format(v.vocabulary_, v.feature_names_))
outputs:
v.vocabulary_ : {'feat10': 10, 'feat9': 9, 'feat8': 8, 'feat5': 5, 'feat4': 4, 'feat7': 7,
'feat6': 6, 'feat1': 1, 'feat0': 0, 'feat3': 3, 'feat2': 2}
v.feature_names_: ['feat0', 'feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6', 'feat7',
'feat8', 'feat9', 'feat10']
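The same setup can also be written with comprehensions (an equivalent sketch, not part of the original answer; number_of_features is assumed known in advance):
number_of_features = 11
v.feature_names_ = ['feat{0}'.format(i) for i in range(number_of_features)]
v.vocabulary_ = {name: i for i, name in enumerate(v.feature_names_)}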

How to cross check a python list of dictionaries against a csr matrix

I have this csr matrix:
(0, 12114) 4
(0, 12001) 1
(0, 11998) 2
(0, 11132) 1
(0, 10412) 7
(1, 10096) 3
(1, 10085) 1
(1, 9105) 8
(1, 8925) 5
(1, 8660) 2
(2, 6577) 2
(2, 6491) 4
(3, 6178) 8
(3, 5286) 1
(3, 5147) 7
(3, 4466) 3
And this list of dictionaries:
[{11998: 0.27257158100079237, 12114: 0.27024630707640002},
{10085: 0.23909781233007368, 9105: 0.57533007741289421},
{6577: 0.45085059256989168, 6491: 0.5895717192325539},
{5286: 0.4482789582819417, 6178: 0.32295433881928487}]
I'd like to find a way to check each dictionary in the list against the corresponding row in the matrix (e.g. row 0 against the first dictionary) and replace each value in the dictionary with the value from the matrix, according to the key.
So the result would be:
[{11998: 2, 12114: 4},
{10085: 1, 9105: 8},
{6577: 2, 6491: 4},
{5286: 1, 6178: 8}]
If X is your sparse matrix and
D = [{11998: 0.27257158100079237, 12114: 0.27024630707640002},
     {10085: 0.23909781233007368, 9105: 0.57533007741289421},
     {6577: 0.45085059256989168, 6491: 0.5895717192325539},
     {5286: 0.4482789582819417, 6178: 0.32295433881928487}]
then
for i, d in enumerate(D):
    for j in d:
        d[j] = X[i, j]
gives the desired result:
>>> D
[{12114: 4.0, 11998: 2.0}, {9105: 8.0, 10085: 1.0}, {6577: 2.0, 6491: 4.0}, {6178: 8.0, 5286: 1.0}]
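Note that this overwrites the values in D in place. If you'd rather leave D untouched, a small variant with the same lookups builds a new list instead:
result = [{j: X[i, j] for j in d} for i, d in enumerate(D)]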
