Construct an assignment matrix - Python

I have two lists of elements:
a = [1,2,3,2,3,1,1,1,1,1]
b = [3,1,2,1,2,3,3,3,3,3]
and I am trying to uniquely match each element of a to an element of b. My expected result is:
1: 3
2: 1
3: 2
So I tried to construct an assignment matrix and then use scipy.optimize.linear_sum_assignment:
import numpy as np

a = [1,2,3,2,3,1,1,1,1,1]
b = [3,1,2,1,2,3,3,3,3,3]
total_true = np.unique(a)
total_pred = np.unique(b)
matrix = np.zeros(shape=(len(total_pred), len(total_true)))
for n, i in enumerate(total_true):
    for m, j in enumerate(total_pred):
        matrix[n, m] = sum(1 for item in b if item==(i))
I expected the matrix to be:
   1  2  3
1  0  2  0
2  0  0  2
3  6  0  0
But the output is:
[[2. 2. 2.]
 [2. 2. 2.]
 [6. 6. 6.]]
What mistake did I make here? Thank you very much.

You don't even need Pandas for this; try zip and dict:
In [42]: a = [1,2,3,2,3,1,1,1,1,1]
...: b = [3,1,2,1,2,3,3,3,3,3]
...:
In [43]: c =zip(a,b)
In [44]: dict(c)
Out[44]: {1: 3, 2: 1, 3: 2}
UPDATE: as the OP said, if we need to store all the values that share the same key, we can use defaultdict:
In [58]: from collections import defaultdict
In [59]: d = defaultdict(list)
In [60]: for k, v in zip(a, b):  # zip again: in Python 3, c was already consumed by dict(c)
    ...:     d[k].append(v)
    ...:
In [61]: d
Out[61]: defaultdict(list, {1: [3, 3, 3, 3, 3, 3], 2: [1, 1], 3: [2, 2]})
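Note that dict(zip(a, b)) keeps only the last value seen for each key, which happens to be fine here because every key is always paired with the same value. If the pairs were noisy, a majority vote per key might be closer to what is wanted. Here is a minimal sketch of that idea; the names pair_counts and mapping are only illustrative, and nothing in it forces the chosen values to be unique across keys:

```python
from collections import Counter, defaultdict

a = [1, 2, 3, 2, 3, 1, 1, 1, 1, 1]
b = [3, 1, 2, 1, 2, 3, 3, 3, 3, 3]

# For every value in a, count how often each value of b is paired with it.
pair_counts = defaultdict(Counter)
for x, y in zip(a, b):
    pair_counts[x][y] += 1

# Keep the most frequent partner for each key.
mapping = {x: counts.most_common(1)[0][0] for x, counts in pair_counts.items()}
print(mapping)  # {1: 3, 2: 1, 3: 2}
```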

This line:
matrix[n, m] = sum(1 for item in b if item==(i))
counts the occurrences of i in b and saves the result to matrix[n, m]. Each cell of the matrix will therefore contain either the number of 1's in b (i.e. 2), the number of 2's in b (i.e. 2) or the number of 3's in b (i.e. 6). Notice that this value is completely independent of j, which means that all the values in one row will always be the same.
To take j into consideration, replace the line with:
matrix[n, m] = sum(1 for x, y in zip(a, b) if (x, y) == (j, i))
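For reference, here is a minimal self-contained sketch of the whole construction with that line in place. The names labels_a and labels_b are my own (they are not in the question); rows are indexed by the values of b and columns by the values of a, which is the layout of the expected matrix in the question:

```python
import numpy as np

a = [1, 2, 3, 2, 3, 1, 1, 1, 1, 1]
b = [3, 1, 2, 1, 2, 3, 3, 3, 3, 3]

labels_a = np.unique(a)  # column labels (values of a)
labels_b = np.unique(b)  # row labels (values of b)

matrix = np.zeros((len(labels_b), len(labels_a)), dtype=int)
for n, i in enumerate(labels_b):        # row: value taken from b
    for m, j in enumerate(labels_a):    # column: value taken from a
        # Count the positions where a == j and b == i at the same time.
        matrix[n, m] = sum(1 for x, y in zip(a, b) if (x, y) == (j, i))

print(matrix)
# [[0 2 0]
#  [0 0 2]
#  [6 0 0]]
```

A matrix like this can then be handed to scipy.optimize.linear_sum_assignment. Since the entries are match counts rather than costs, you would probably want to maximize instead of minimize; recent SciPy versions accept maximize=True for that.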

As for your expected output: a matrix entry a(i, j) is addressed by its row index i and its column index j. Looking at a(3, 1) in your matrix, the value is 6, which means the combination (3, 1) occurs 6 times, where 3 comes from b and 1 comes from a. We can collect all the pairs from the two lists:
matches = [(x, y) for x, y in zip(b, a)]
Then we can count how often a specific combination occurs, for example a(3, 1):
result = matches.count((3, 1))
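If you need the count for every combination rather than a single one, calling count repeatedly rescans the list each time. A possible alternative, sketched here with collections.Counter (the name pair_counts is just illustrative), tallies all pairs in one pass:

```python
from collections import Counter

a = [1, 2, 3, 2, 3, 1, 1, 1, 1, 1]
b = [3, 1, 2, 1, 2, 3, 3, 3, 3, 3]

# Keys are (value from b, value from a), i.e. (row, column) of the matrix.
pair_counts = Counter(zip(b, a))

print(pair_counts[(3, 1)])  # 6
print(pair_counts[(1, 1)])  # 0, pairs that never occur count as zero
```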

Related

Fastest way to find the maximum minimum value of two 'connected' matrices

I want to maximize the following function:
f(i, j, k) = min(A(i, j), B(j, k))
Where A and B are matrices and i, j and k are indices that range up to the respective dimensions of the matrices. I would like to find (i, j, k) such that f(i, j, k) is maximized. I am currently doing that as follows:
import numpy as np
import itertools
shape_a = (100 , 150)
shape_b = (shape_a[1], 200)
A = np.random.rand(shape_a[0], shape_a[1])
B = np.random.rand(shape_b[0], shape_b[1])
# All the different i,j,k
combinations = itertools.product(np.arange(shape_a[0]), np.arange(shape_a[1]), np.arange(shape_b[1]))
combinations = np.asarray(list(combinations))
A_vals = A[combinations[:, 0], combinations[:, 1]]
B_vals = B[combinations[:, 1], combinations[:, 2]]
f = np.min([A_vals, B_vals], axis=0)
best_indices = combinations[np.argmax(f)]
print(best_indices)
[ 49 14 136]
This is faster than iterating over all (i, j, k), but most of the time is spent constructing the A_vals and B_vals arrays. This is unfortunate, because they contain many duplicate values, as the same i, j and k appear multiple times. Is there a way to do this where (1) the speed of numpy's matrix computation is preserved and (2) I don't have to construct the memory-intensive A_vals and B_vals arrays?
In other languages you could maybe construct the matrices so that they contain pointers to A and B, but I do not see how to achieve this in Python.
Perhaps you could re-evaluate how you look at the problem in the context of what min and max actually do. Say you have the following concrete example:
>>> np.random.seed(1)
>>> print(A := np.random.randint(10, size=(4, 5)))
[[5 8 9 5 0]
 [0 1 7 6 9]
 [2 4 5 2 4]
 [2 4 7 7 9]]
>>> print(B := np.random.randint(10, size=(5, 3)))
[[1 7 0]
 [6 9 9]
 [7 6 9]
 [1 0 1]
 [8 8 3]]
You are looking for a pair of numbers, one in A and one in B, such that the column index in A equals the row index in B and the smaller of the two numbers is as large as possible.
For any set of numbers, the largest pairwise minimum happens when you take the two largest numbers. You are therefore looking for the max in each column of A and in each row of B, then the minimum of each of those pairs, and finally the maximum of those minima. Here is a relatively simple formulation of the solution:
candidate_i = A.argmax(axis=0)
candidate_k = B.argmax(axis=1)
j = np.minimum(A[candidate_i, np.arange(A.shape[1])], B[np.arange(B.shape[0]), candidate_k]).argmax()
i = candidate_i[j]
k = candidate_k[j]
And indeed, you see that
>>> i, j, k
(0, 2, 2)
>>> A[i, j]
9
>>> B[j, k]
9
If there are ties, argmax will always pick the first occurrence.
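If you want to convince yourself that this shortcut agrees with an exhaustive search, one option is a broadcasting-based brute force on small inputs. The sketch below is only a cross-check under my own assumptions (array sizes, the seed and the names f, bi, bj, bk are arbitrary). It compares the maximum values rather than the indices, because several (i, j, k) triples can reach the same maximum:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
A = rng.random((20, 30))
B = rng.random((30, 40))

# Shortcut: for each j, pair the column max of A with the row max of B.
candidate_i = A.argmax(axis=0)
candidate_k = B.argmax(axis=1)
j = np.minimum(A.max(axis=0), B.max(axis=1)).argmax()
i, k = candidate_i[j], candidate_k[j]

# Brute force: broadcast to a cube of shape (rows of A, shared j, columns of B).
f = np.minimum(A[:, :, None], B[None, :, :])
bi, bj, bk = np.unravel_index(f.argmax(), f.shape)

print(min(A[i, j], B[j, k]), f[bi, bj, bk])  # the two values should match
```

The cube still holds one value per (i, j, k), so for the 100 x 150 x 200 case in the question it would need about 3 million floats (roughly 24 MB); it avoids the index arrays but not the full enumeration.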
Your values i, j, k are determined by the index of the maximum value from the set {A,B}. You can simply use np.argmax().
if np.max(A) < np.max(B):
    ind = np.unravel_index(np.argmax(A),A.shape)
else:
    ind = np.unravel_index(np.argmax(B),B.shape)
This returns only two indices: i, j when the index is taken from A, or j, k when it is taken from B. If, for example, you get i, j, then k can be any index that fits the shape of B, so pick one of them at random.
If you also need to maximize the other value then:
if np.max(A) < np.max(B):
    ind = np.unravel_index(np.argmax(A),A.shape)
    ind = ind + (np.argmax(B[ind[1],:]),)
else:
    ind = np.unravel_index(np.argmax(B),B.shape)
    ind = (np.argmax(A[:,ind[0]]),) + ind

How to get ranks from a sample in a list of values?

I'm new to Python and have a problem that is quite simple on paper but difficult for me in Python.
I have two samples of values (which are lists):
X = [2, 2, 4, 6]
Y = [1, 3, 4, 5]
I have a concatenated list which is sorted as
Z     = [1, 2,   2,   3, 4,   4,   5, 6]
# rank:  1  2.5  2.5  4  5.5  5.5  7  8
I would like to get the sum of the ranks of the X values in Z. For this example, the ranks of 2, 2, 4 and 6 in Z give 2.5 + 2.5 + 5.5 + 8 = 18.5
(the ranks of the Y values in Z give 1 + 4 + 5.5 + 7 = 17.5)
Here is what I've done, but it doesn't work with these lists X and Y (it only works if each value appears exactly once):
def funct(X, Z):
    rank = []
    for i in range(len(Z)):
        for j in range(len(X)):
            if Z[i] == X[j]:
                rank = rank + [(i+1)]
    print(sum(rank))
    return
I would like to solve my problem without overly complicated functions (only loops and fairly simple constructs).
You can use a dictionary to keep track of the rank sums and counts once you've sorted the combined list.
X = [2, 2, 4, 6]
Y = [1, 3, 4, 5]
Z = sorted(X + Y)
ranksum = {}
counts = {}
for i, v in enumerate(Z):
    ranksum[v] = ranksum.get(v, 0) + (i + 1)  # Add the 1-based rank
    counts[v] = counts.get(v, 0) + 1  # Increment count
Then, when you want to look up the rank of an element, you need ranksum[v] / counts[v].
r = [ranksum[x] / counts[x] for x in X]
print(r)
# Out: [2.5, 2.5, 5.5, 8.0]
Here's a solution for how to build the list of ranks:
X = ...
Y = ...
Z = sorted(X + Y)
rank = [1]
z = Z[:1]
for i, e in enumerate(Z[1:], start=2):
    if e == z[-1]:
        rank[-1] += 0.5
    else:
        rank.append(i)
        z.append(e)
Now you can convert that into a dictionary:
ranks = dict(zip(z, rank))
That will make lookup easier:
sum(ranks[e] for e in X)
Here's another option where you build a dictionary of the rank indexes and then create a rank dictionary from there:
from collections import defaultdict
X = [2, 2, 4, 6]
Y = [1, 3, 4, 5]
Z = sorted(X + Y)
rank_indexes = defaultdict(list)
for i,v in enumerate(Z):
    rank_indexes[v].append(i+1)
ranks = {k:(sum(v)/len(v)) for (k,v) in rank_indexes.items()}
print("Sum of X ranks:", sum([ranks[v] for v in X]))
print("Sum of Y ranks:", sum([ranks[v] for v in Y]))
Output:
Sum of X ranks: 18.5
Sum of Y ranks: 17.5
You can do the same thing without defaultdict, but it's slightly slower and I'd argue less Pythonic:
rank_indexes = {}
for i,v in enumerate(Z):
    rank_indexes.setdefault(v, []).append(i+1)
ranks = {k:(sum(v)/len(v)) for (k,v) in rank_indexes.items()}
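Since the question asked for something built from plain loops, here is a minimal sketch in that spirit. The function name rank_sum is my own, and it assumes every value of X actually occurs in Z (true whenever Z is built from X + Y); otherwise the division would fail:

```python
def rank_sum(X, Z):
    """Sum of the average ranks of the values of X within the sorted list Z."""
    total = 0
    for x in X:
        positions = []  # 1-based positions in Z that hold the value x
        for i in range(len(Z)):
            if Z[i] == x:
                positions.append(i + 1)
        # Tied values share the average of the positions they occupy.
        total += sum(positions) / len(positions)
    return total


X = [2, 2, 4, 6]
Y = [1, 3, 4, 5]
Z = sorted(X + Y)
print(rank_sum(X, Z))  # 18.5
print(rank_sum(Y, Z))  # 17.5
```

This rescans Z for every element of X, so it is slower than the dictionary versions on long lists, but it stays close to the structure of the original funct.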

Indexing a numpy array based on order with repetition

I have a numpy array as follows,
arr = np.array([0.166667, 0., 0., 0.333333, 0., 0.166667, 0.166667, np.nan])
I want to rank this array in descending order so that the highest value gets rank 1 and np.nan gets the last rank, but without incrementing the rank for repeated values.
Expectation:
ranks = [2, 3, 3, 1, 3, 2, 2, 4]
i.e.
>>>>
1 0.333333
2 0.166667
2 0.166667
2 0.166667
3 0.0
3 0.0
3 0.0
4 -inf
Here is what I have so far.
I used np.argsort twice and replaced the np.nan value with the lowest possible float, but the ranks still increment even for equal values:
# The Logic
arr = np.nan_to_num(arr, nan=float('-inf'))
ranks = list(np.argsort(np.argsort(arr)[::-1]) + 1)
# Pretty Print
sorted_ = sorted([(r, a) for a, r in zip(arr, ranks)], key=lambda v: v[0])
for r, a in sorted_:
    print(r, a)
>>>>
1 0.333333
2 0.166667
3 0.166667
4 0.166667
5 0.0
6 0.0
7 0.0
8 -inf
Any idea on how to manage the ranks without increments?
https://repl.it/#MilindDalvi/MidnightblueUnselfishCategories
Here's a pandas approach using Series.rank with method="min" and na_option='bottom':
s = pd.Series(arr).rank(method="min", na_option ='bottom', ascending=False)
u = np.sort(s.unique())
s.map(dict(zip(u, range(len(u))))).add(1).values
# array([2, 3, 3, 1, 3, 2, 2, 4], dtype=int64)
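Try something like this before the final printing loop:
k = 1
dense = [(k, sorted_[0][1])]  # tuples are immutable, so build a new list
for i in range(1, len(sorted_)):
    if sorted_[i][1] != sorted_[i - 1][1]:
        k = k + 1  # value changed, move on to the next rank
    dense.append((k, sorted_[i][1]))
sorted_ = dense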
Not necessarily a better way, just another way of approaching this issue:
arr = sorted(np.array([0.166667, 0., 0., 0.333333, 0., 0.166667, 0.166667, np.nan]), reverse=True)
count = 1
mydict = {}
for a in arr:
    if a not in mydict:
        mydict[a] = count
        count += 1
for i in arr:
    print(mydict[i], i)
Here's one approach:
v = sorted(set(arr[~np.isnan(arr)]), reverse=True)  # unique non-NaN values, largest first
k = len(v) + 1  # NaN is ranked after every real value
print([v.index(i) + 1 if not np.isnan(i) else k for i in arr])
Output
[2, 3, 3, 1, 3, 2, 2, 4]
numpy.unique returns the unique values sorted in ascending order (with NaN placed last), so applying it to -arr gives you the correct descending order. The inverse index that reconstructs -arr from the unique values is exactly your rank (minus one).
arr_u, inv = np.unique(-arr, return_inverse=True)
rank = inv + 1
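As a self-contained check of this approach on the array from the question (with the original NaN left in place, not replaced by -inf):

```python
import numpy as np

arr = np.array([0.166667, 0., 0., 0.333333, 0., 0.166667, 0.166667, np.nan])

# np.unique sorts ascending and places NaN last, so negating gives descending order.
arr_u, inv = np.unique(-arr, return_inverse=True)
rank = inv + 1  # the inverse indices are 0-based

print(rank.tolist())  # [2, 3, 3, 1, 3, 2, 2, 4]
```

One caveat I would hedge on: if the array holds several NaNs, whether they collapse into a single rank depends on the NumPy version, so it may be safer to test that case on your own installation.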

reduce lists given single value of 2d lists

I have 2 lists:
edges = [[0,1],[0,2],[0,3],[1,2],[1,3]]
weight = [10,8,7,3,7]
edges is the list of edges, each connecting 2 nodes, and weight holds the corresponding weights.
For each starting node, i.e. edges[i][0], I want to choose the closest connected node according to the weight, so in this case the result would look like:
connect = [[0,3],[1,2]]
weight = [7,3]
Because out of all the nodes connected to 0, node 3 is the closest one, and for 1, node 2 is the closest one.
I am not sure how to formulate the problem; any help is appreciated!
edges = [[0,1],[0,2],[0,3],[1,2],[1,3]]
weight = [10,8,7,3,7]
connect = []
wght = []
In [8]: for i in set(e[0] for e in edges):
   ...:     temp = [(a, b) for a, b in zip(edges, weight) if a[0] == i]
   ...:     temp = min(temp, key=lambda x: x[1])
   ...:     connect += [temp[0]]
   ...:     wght += [temp[1]]
In [9]: connect
Out[9]: [[0, 3], [1, 2]]
In [10]: wght
Out[10]: [7, 3]
In case you are into one-liners:
In [20]: [min([(a, b) for a, b in zip(edges, weight) if a[0] == i], key=lambda x: x[1])
    ...:  for i in set([e[0] for e in edges])]
Out[20]: [([0, 3], 7), ([1, 2], 3)]
Another solution using Pandas:
df = pd.DataFrame(edges, columns=['start','end'])
df['weight'] = weight
df.loc[df.groupby('start')['weight'].idxmin()]
With the result being:
start  end  weight
    0    3       7
    1    2       3
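Both approaches above rescan or regroup the edge list. If that matters, a single pass with a dictionary is another option. This is only a sketch; the name best is my own, and on equal weights it keeps the first edge it saw:

```python
edges = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3]]
weight = [10, 8, 7, 3, 7]

best = {}  # start node -> (edge, weight) of the lightest edge seen so far
for edge, w in zip(edges, weight):
    start = edge[0]
    # Strict '<' keeps the earlier edge when two weights tie.
    if start not in best or w < best[start][1]:
        best[start] = (edge, w)

connect = [edge for edge, _ in best.values()]
wght = [w for _, w in best.values()]
print(connect)  # [[0, 3], [1, 2]]
print(wght)     # [7, 3]
```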

how to extract index in factor vector in Rpy2

I have a factor vector sv='ababbc' and an integer vector fv=[1,1,1,1,1,1]. The elements of fv correspond to those of sv.
import rpy2.robjects as robjects
sv=robjects.StrVector('ababbc')
fac=robjects.FactorVector(sv)
fv=robjects.r['rep'](1,6)
I want to set to 2 the elements of fv whose index corresponds to the letter "a",
giving fv=[2,1,2,1,1,1]
How can I do that? Thank you.
To get the index when true:
In [54]:
import numpy as np
np.argwhere(np.array(sv) == 'a')
Out[54]:
array([[0],
       [2]])
The 1st and 3rd positions have the letter 'a'.
You can't do that with fac, as it is already factorized and contains only the levels, 1, 2, 3..., not the original 'a', 'b', 'c'... anymore.
In [55]:
np.argwhere(np.array(fac) == 'a')
Out[55]:
array([], shape=(0, 1), dtype=int64)
In [56]:
np.array(fac)
Out[56]:
array([1, 2, 1, 2, 2, 3], dtype=int32)
Or it can be done on the R side:
In [51]:
robjects.reval('result1 <- which(sv %in% c("a"))')
print(robjects.r.result1)
[1] 1 3
To systematically assign a given value to each level, I suggest using the factor function in R:
In [53]:
robjects.r.assign('sv', sv)
robjects.reval('result3 <- factor(sv, levels=c("a","b","c"), labels=c(10,2,3))')
print(robjects.r.result3)
[1] 10 2 10 2 2 3
Levels: 10 2 3
So a gets 10, b gets 2, c gets 3 and so on.
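To get back to the fv from the question, the positions found above just need to be written into it. Here is a minimal sketch that uses plain Python lists as stand-ins for the rpy2 vectors, so it runs without R. The assumption is that iterating over sv yields one string per element, as it does for this list:

```python
# Plain Python stand-ins for the rpy2 vectors (assumption: sv iterates
# element by element, fv starts as all ones like rep(1, 6)).
sv = list('ababbc')
fv = [1] * len(sv)

# Set fv to 2 wherever the corresponding element of sv is 'a'.
fv = [2 if s == 'a' else f for s, f in zip(sv, fv)]
print(fv)  # [2, 1, 2, 1, 1, 1]
```

If an R vector is needed afterwards, wrapping the list with robjects.IntVector(fv) should convert it back.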
