Best way to calculate similarity between rows of a matrix - python

I'm a Python noob, so sorry if the question sounds too basic for anyone reading.
I have this matrix of type numpy.matrix (I omit all the code that generates it):
[[0. 0.2342598 0. 0. 0. 0.31308172
0. 0. 0. 0. 0.31308172 0.
0. 0.86549525 0. ]
[0. 0.2342598 0. 0. 0. 0.31308172
0. 0. 0. 0. 0.31308172 0.
0. 0.86549525 0. ]
[0.22575551 0.72375361 0. 0.19345532 0.22575551 0.19345532
0. 0.38691064 0.19345532 0.19345532 0.19345532 0.19345532
0. 0. 0. ]
[0.22575551 0.72375361 0. 0.19345532 0.22575551 0.19345532
0. 0.38691064 0.19345532 0.19345532 0.19345532 0.19345532
0. 0. 0. ]
[0. 0.64936739 0. 0.28928716 0. 0.
0.39985833 0.28928716 0.28928716 0.28928716 0. 0.28928716
0. 0. 0. ]
[0.26302218 0.50593649 0.37991833 0.22539002 0.26302218 0.
0.31153847 0.11269501 0.11269501 0.45078005 0. 0.11269501
0.18995916 0. 0.18995916]]
By using sklearn.metrics.pairwise.cosine_similarity I easily get the similarity between two rows.
For example, by coding cosine_similarity(X[0], X[1]) I get a numpy.ndarray that contains only one element: a float value between 0.0 and 1.0 that represents the level of similarity between X[0] and X[1]. I finally get the value inside the array like this: cosine_similarity(X[0], X[1])[0][0].item().
Problem is, I don't need to compare only two rows for similarity: I need to compare X[0] to every other row and find the one most similar to X[0].
What's the best (most Pythonic, performant, elegant, practical...) way to do it?
Any help is appreciated.
Update: sorry, I forgot to mention what actually works for me:
def calculate():
    h = 0.0  # highest similarity seen so far
    e = -1   # index of the most similar row
    for i in range(1, len(m)):
        s = cosine_similarity(m[0], m[i])[0][0].item()
        if s >= h:
            h = s
            e = i
    return e
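A vectorized sketch of the same search (my addition, assuming m is the matrix from the question): sklearn's cosine_similarity accepts whole matrices, so all row similarities can be computed in a single call, with no Python loop:

from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(m[0], m)[0]  # similarities of row 0 against every row
e = sims[1:].argmax() + 1             # skip row 0 itself (always 1.0)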

Related

Is there a python function for assigning values to several elements in a list?

I used the code shown below to create a list of lists.
Code:
import numpy as np

num = 782
sol = 4
pop_size = [sol, num]
initial_population_1 = np.random.uniform(low=0.0, high=0.0, size=pop_size)
The list of lists is shown below:
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
How can I randomly assign five values that are greater than 0 but less than 10 to five elements in each inner list?
Thank you very much!
So, you have a list of lists, specifically a list of 4 lists, each of them containing 782 elements, all 0.0, and you want to set 5 elements of each, at random, to values greater than 0 and less than 10.
I'd like to mention that, as you are using NumPy, there is np.zeros(shape) that provides you with a zero-filled array, but whatever…
From your question it's not clear whether you want to avoid using the same location twice, but let's assume that you want to assign a random value to exactly 5 distinct entries in each row:
for row in initial_population_1:
    locations_used_in_this_row = 0
    while locations_used_in_this_row != 5:
        column = np.random.randint(num)
        if row[column] == 0.0:  # only overwrite a location that hasn't been picked yet
            row[column] = np.random.rand() * 10
            locations_used_in_this_row += 1
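A shorter alternative (my sketch, not part of the original answer): np.random.choice with replace=False draws 5 distinct column indices per row in one call:

for row in initial_population_1:
    cols = np.random.choice(num, size=5, replace=False)  # 5 distinct columns
    row[cols] = np.random.uniform(low=0.0, high=10.0, size=5)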

How to create a rectangular grid with custom start point and step value

I'm working on a project where I need to calibrate two cameras. As you know, one needs to define planar grid points in the 3D world and find their correspondences on the image plane. Therefore, the first camera has the following 3D grid points:
import cv2 as cv
import numpy as np
WPoints_cam1 = np.zeros((9*3,3), np.float64)
WPoints_cam1[:,:2] = np.mgrid[0:9,0:3].T.reshape(-1,2)*0.4
print(WPoints_cam1)
[[0. 0. 0. ]# world coordinate center
[0.4 0. 0. ]
[0.8 0. 0. ]
[1.2 0. 0. ]
[1.6 0. 0. ]
[2. 0. 0. ]
[2.4 0. 0. ]
[2.8 0. 0. ]
[3.2 0. 0. ]
[0. 0.4 0. ]
[0.4 0.4 0. ]
[0.8 0.4 0. ]
[1.2 0.4 0. ]
[1.6 0.4 0. ]
[2. 0.4 0. ]
[2.4 0.4 0. ]
[2.8 0.4 0. ]
[3.2 0.4 0. ]
[0. 0.8 0. ]
[0.4 0.8 0. ]
[0.8 0.8 0. ]
[1.2 0.8 0. ]
[1.6 0.8 0. ]
[2. 0.8 0. ]
[2.4 0.8 0. ]
[2.8 0.8 0. ]
[3.2 0.8 0. ]]
As seen above, the first grid (for the first camera) starts from the defined reference 3D point (0, 0, 0) and ends at the point (3.2, 0.8, 0), with a constant step of 0.4 and a 9x3 dimension.
Note that all Z coordinates were set to Z=0 (Zhengyou Zhang's calibration method).
Now my question: since I need to define a second grid (for the second camera) that also refers to the defined 3D coordinate center (0, 0, 0), I need a grid that starts from (3.6, 0, 0) and ends at (6.8, 0.8, 0), with the same step 0.4 and the same 9x3 dimension.
I believe this is easy to do, but I can't see how due to my beginner level of experience.
I would appreciate some help; thanks in advance.
You can scale each column and then translate, like this:
np.mgrid[0:9, 0:3].T.reshape(-1, 2) * np.array([(6.8 - 3.6) / 8, 0.4]) + np.array([3.6, 0])
or combine it into a scaling matrix like this (and then add on a vector for the translation):
np.mgrid[0:9, 0:3].T.reshape(-1, 2) @ np.array([[(6.8 - 3.6) / 8, 0], [0, 0.4]]).T + np.array([3.6, 0])
Regarding where (6.8 - 3.6) / 8 comes from: the numerator is the x extent you want to cover (3.6 to 6.8). The denominator is the index extent of the range: with 0:9 the max index is 8 and the min is 0, so the denominator becomes 8 - 0, and the step works out to exactly the 0.4 you asked for.
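Putting it together in the same style as the question's code (my sketch; WPoints_cam2 is a name I'm introducing for the second grid):

import numpy as np

WPoints_cam2 = np.zeros((9*3, 3), np.float64)
# same 0.4-spaced 9x3 grid as the first camera, translated by 3.6 along x
WPoints_cam2[:, :2] = np.mgrid[0:9, 0:3].T.reshape(-1, 2) * 0.4 + np.array([3.6, 0.0])
print(WPoints_cam2[0])   # [3.6 0.  0. ]
print(WPoints_cam2[-1])  # [6.8 0.8 0. ]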

Odd behavior of using += with numpy.array and numpy.ma.array

Can anyone explain the following result to me?
I know it is not as one would usually do this operation, but I found this result odd.
import numpy as np
a = np.ma.masked_where(np.arange(20)>10,np.arange(20))
b = np.ma.masked_where(np.arange(20)>-1,np.arange(20))
c = np.zeros(a.shape)
d = np.zeros(a.shape)
c[~a.mask] += b[~a.mask]
print(b[~a.mask])
#masked_array(data=[--, --, --, --, --, --, --, --,--, --, --],
# mask=[ True, True, True, True, True, True, True, True, True, True, True],
# fill_value=999999,
# dtype=int64)
print(c)
#[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
d[~a.mask] = d[~a.mask] + b[~a.mask]
print(d)
#[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
I expected c not to change, but I guess there is something related to objects in memory going on here. Also, += keeps the original object, while = and + create a new d.
I just don't really understand where the data that's added to c comes from.
I will start with a simpler example for better understanding:
b = np.ma.masked_where(np.arange(20)>-1,np.arange(20))
#b: [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --]
#b.data: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
c = np.zeros(b.shape)
#c: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
d = np.zeros(b.shape)
#d: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
c += b
#c: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
d = d + b
#d: [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --]
#d.data: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
The first operation, c += b, is an in-place operation. In other words, it is equivalent to c = type(c).__iadd__(c, b), which does the addition according to the type of c. Since c is not a masked array, the data of b is used as if it were unmasked.
On the other hand, d = d + b is equivalent to d = np.ma.MaskedArray.__add__(d, b) (to be more precise, since masked arrays are a subclass of ndarray, it uses __radd__) and is NOT an in-place assignment. This means it creates a new object and uses the wider type of the right-hand side when adding, so it converts d (which is an unmasked array) to a masked array (because b is a masked array). Therefore the addition uses valid values only, of which in this case there are none, since ALL elements of b are masked and invalid. This results in a masked array d with the same mask as b, while the data of d remains unchanged.
This difference in behavior is not NumPy specific and applies to Python itself too. The case mentioned in the question by the OP behaves similarly, and as @alaniwi mentioned in the comments, the boolean indexing with mask a is not fundamental to the behavior. Using a to mask elements of b, c, and d only limits the assignment to the elements selected by a (rather than all elements of the arrays) and nothing more.
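Plain Python lists show the same in-place vs. new-object split (a minimal sketch of my own, not from the original discussion):

x = [1, 2]
y = x
y += [3]     # list.__iadd__ mutates the existing list in place; x sees the change
print(x)     # [1, 2, 3]
y = y + [4]  # list.__add__ builds a brand-new list and rebinds y; x is untouched
print(x)     # [1, 2, 3]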
To make things a bit more interesting, and in fact clearer, let's switch the places of b and d on the right-hand side:
e = np.zeros(b.shape)
#e: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
e = b + e
#e: [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --]
#e.data: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
Note that, similar to d = d + b, the right-hand side uses the masked array's __add__ function, so the output is a masked array; but since you are adding e to b (i.e. e = np.ma.MaskedArray.__add__(b, e)), the masked data of b is returned, while in d = d + b you are adding b to d, so the data of d is returned.

How to calculate formula for every value in an array?

I'm trying to understand how to use NumPy to evaluate a formula at different times. The way the code is written, it gives all the values where y is greater than 0. I am experimenting with how to get the values for all y's.
Can someone explain the part ft = t * [y >= 0.0] to me? How do I use the parts within the brackets?
from numpy import *
g = 10.0
h0 = 10.0
t = arange(0, 10.1 ,0.1)
y = h0 - 0.5*g*t*t
ft = t * [y >= 0.0 ]
print(ft)
This is the output, but I would like to see all the values calculated. So I experimented a bit, but I could not figure out how to do it or how the [y >= 0.0] part exactly works.
[[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. 1.1 1.2 1.3 1.4 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]
If I use [y] instead of [y >= 0.0], I get the following:
[[ 0.000000e+00 9.950000e-01 1.960000e+00 2.865000e+00 3.680000e+00
4.375000e+00 4.920000e+00 5.285000e+00 5.440000e+00 5.355000e+00
5.000000e+00 4.345000e+00 3.360000e+00 2.015000e+00 2.800000e-01
-1.875000e+00 -4.480000e+00 -7.565000e+00 -1.116000e+01 -1.529500e+01
-2.000000e+01 -2.530500e+01 -3.124000e+01 -3.783500e+01 -4.512000e+01
-5.312500e+01 -6.188000e+01 -7.141500e+01 -8.176000e+01 -9.294500e+01
-1.050000e+02 -1.179550e+02 -1.318400e+02 -1.466850e+02 -1.625200e+02
-1.793750e+02 -1.972800e+02 -2.162650e+02 -2.363600e+02 -2.575950e+02
-2.800000e+02 -3.036050e+02 -3.284400e+02 -3.545350e+02 -3.819200e+02
-4.106250e+02 -4.406800e+02 -4.721150e+02 -5.049600e+02 -5.392450e+02
-5.750000e+02 -6.122550e+02 -6.510400e+02 -6.913850e+02 -7.333200e+02
-7.768750e+02 -8.220800e+02 -8.689650e+02 -9.175600e+02 -9.678950e+02
-1.020000e+03 -1.073905e+03 -1.129640e+03 -1.187235e+03 -1.246720e+03
-1.308125e+03 -1.371480e+03 -1.436815e+03 -1.504160e+03 -1.573545e+03
-1.645000e+03 -1.718555e+03 -1.794240e+03 -1.872085e+03 -1.952120e+03
-2.034375e+03 -2.118880e+03 -2.205665e+03 -2.294760e+03 -2.386195e+03
-2.480000e+03 -2.576205e+03 -2.674840e+03 -2.775935e+03 -2.879520e+03
-2.985625e+03 -3.094280e+03 -3.205515e+03 -3.319360e+03 -3.435845e+03
-3.555000e+03 -3.676855e+03 -3.801440e+03 -3.928785e+03 -4.058920e+03
-4.191875e+03 -4.327680e+03 -4.466365e+03 -4.607960e+03 -4.752495e+03
-4.900000e+03]]
I would like to know how I can use NumPy to calculate all the outcomes of a formula for different time intervals at once.
Thanks,
y >= 0.0 gives you an array of Booleans which contains True/False depending on whether the condition y >= 0.0 is fulfilled. When you enclose it within [] as [y >= 0.0], you get a list which contains a single array of Booleans, as pointed out by @nicola in the comments below.
[array([ True, True, True, True, True, False, False, False,...
... False, False, False, False])]
Now you multiply this with your arange array, which gives you 0 wherever the right-hand side of the * operator is False and the actual value from the arange wherever it is True.
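If the goal is to get t wherever y is non-negative, without the list-wrapping detour, np.where is a clearer equivalent (a sketch of my own, not from the original answers):

import numpy as np

g = 10.0
h0 = 10.0
t = np.arange(0, 10.1, 0.1)
y = h0 - 0.5 * g * t * t         # y evaluated for every t at once
ft = np.where(y >= 0.0, t, 0.0)  # t where y >= 0, else 0; same values as t * (y >= 0.0)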
The array [y >= 0.0] produces an array of Booleans, i.e. 1 if y >= 0 and 0 if not. That array of 1's and 0's is then multiplied by t.
It is not clear to me from your question however, what you are trying to do with it.

Theano: how to efficiently undo/reverse max-pooling

I'm using Theano 0.7 to create a convolutional neural net which uses max-pooling (i.e. shrinking a matrix down by keeping only the local maxima).
In order to "undo" or "reverse" the max-pooling step, one method is to store the locations of the maxima as auxiliary data, then simply recreate the un-pooled data by making a big array of zeros and using those auxiliary locations to place the maxima in their appropriate locations.
Here's how I'm currently doing it:
import numpy as np
import theano
import theano.tensor as T
minibatchsize = 2
numfilters = 3
numsamples = 4
upsampfactor = 5
# HERE is the function that I hope could be improved
def upsamplecode(encoded, auxpos):
    shp = encoded.shape
    upsampled = T.zeros((shp[0], shp[1], shp[2] * upsampfactor))
    for whichitem in range(minibatchsize):
        for whichfilt in range(numfilters):
            upsampled = T.set_subtensor(
                upsampled[whichitem, whichfilt, auxpos[whichitem, whichfilt, :]],
                encoded[whichitem, whichfilt, :])
    return upsampled
totalitems = minibatchsize * numfilters * numsamples
code = theano.shared(np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples)))
auxpos = np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples)) % upsampfactor # arbitrary positions within a bin
auxpos += (np.arange(4) * 5).reshape((1,1,-1)) # shifted to the actual temporal bin location
auxpos = theano.shared(auxpos.astype(np.int))
print "code:"
print code.get_value()
print "locations:"
print auxpos.get_value()
get_upsampled = theano.function([], upsamplecode(code, auxpos))
print "the un-pooled data:"
print get_upsampled()
(By the way, in this case I have a 3D tensor, and it's only the third axis that gets max-pooled. People who work with image data might expect to see two dimensions getting max-pooled.)
The output is:
code:
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
locations:
[[[ 0 6 12 18]
[ 4 5 11 17]
[ 3 9 10 16]]
[[ 2 8 14 15]
[ 1 7 13 19]
[ 0 6 12 18]]]
the un-pooled data:
[[[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 2. 0.
0. 0. 0. 0. 3. 0.]
[ 0. 0. 0. 0. 4. 5. 0. 0. 0. 0. 0. 6. 0. 0.
0. 0. 0. 7. 0. 0.]
[ 0. 0. 0. 8. 0. 0. 0. 0. 0. 9. 10. 0. 0. 0.
0. 0. 11. 0. 0. 0.]]
[[ 0. 0. 12. 0. 0. 0. 0. 0. 13. 0. 0. 0. 0. 0.
14. 15. 0. 0. 0. 0.]
[ 0. 16. 0. 0. 0. 0. 0. 17. 0. 0. 0. 0. 0. 18.
0. 0. 0. 0. 0. 19.]
[ 20. 0. 0. 0. 0. 0. 21. 0. 0. 0. 0. 0. 22. 0.
0. 0. 0. 0. 23. 0.]]]
This method works but it's a bottleneck, taking most of my computer's time (I think the set_subtensor calls might imply cpu<->gpu data copying). So: can this be implemented more efficiently?
I suspect there's a way to express this as a single set_subtensor() call which may be faster, but I don't see how to get the tensor indexing to broadcast properly.
UPDATE: I thought of a way of doing it in one call, by working on the flattened tensors:
def upsamplecode2(encoded, auxpos):
    shp = encoded.shape
    upsampled = T.zeros((shp[0], shp[1], shp[2] * upsampfactor))
    add_to_flattened_indices = theano.shared(
        np.array([[[(y + z * numfilters) * numsamples * upsampfactor
                    for x in range(numsamples)]
                   for y in range(numfilters)]
                  for z in range(minibatchsize)],
                 dtype=theano.config.floatX).flatten(),
        name="add_to_flattened_indices")
    upsampled = T.set_subtensor(
        upsampled.flatten()[T.cast(auxpos.flatten() + add_to_flattened_indices, 'int32')],
        encoded.flatten()).reshape(upsampled.shape)
    return upsampled
get_upsampled2 = theano.function([], upsamplecode2(code, auxpos))
print "the un-pooled data v2:"
ups2 = get_upsampled2()
print ups2
However, this is still not good efficiency-wise because when I run this (added on to the end of the above script) I find out that the Cuda libraries can't currently do the integer index manipulation efficiently:
ERROR (theano.gof.opt): Optimization failure due to: local_gpu_advanced_incsubtensor1
ERROR (theano.gof.opt): TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/theano/gof/opt.py", line 1493, in process_node
    replacements = lopt.transform(node)
  File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/opt.py", line 952, in local_gpu_advanced_incsubtensor1
    gpu_y = gpu_from_host(y)
  File "/usr/local/lib/python2.7/dist-packages/theano/gof/op.py", line 507, in __call__
    node = self.make_node(*inputs, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/basic_ops.py", line 133, in make_node
    dtype=x.dtype)()])
  File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/type.py", line 69, in __init__
    (self.__class__.__name__, dtype, name))
TypeError: CudaNdarrayType only supports dtype float32 for now. Tried using dtype int64 for variable None
I don't know whether this is faster, but it may be a little more concise. See if it is useful for your case.
import numpy as np
import theano
import theano.tensor as T
minibatchsize = 2
numfilters = 3
numsamples = 4
upsampfactor = 5
totalitems = minibatchsize * numfilters * numsamples
code = np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples))
auxpos = np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples)) % upsampfactor
auxpos += (np.arange(4) * 5).reshape((1,1,-1))
# first in numpy
shp = code.shape
upsampled_np = np.zeros((shp[0], shp[1], shp[2] * upsampfactor))
upsampled_np[np.arange(shp[0]).reshape(-1, 1, 1), np.arange(shp[1]).reshape(1, -1, 1), auxpos] = code
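# the three index arrays broadcast together to the full (minibatch, filter, sample)
# shape, so each element of `code` is scattered to its own (item, filter, auxpos) slot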
print "numpy output:"
print upsampled_np
# now the same idea in theano
encoded = T.tensor3()
positions = T.tensor3(dtype='int64')
shp = encoded.shape
upsampled = T.zeros((shp[0], shp[1], shp[2] * upsampfactor))
upsampled = T.set_subtensor(upsampled[T.arange(shp[0]).reshape((-1, 1, 1)), T.arange(shp[1]).reshape((1, -1, 1)), positions], encoded)
print "theano output:"
print upsampled.eval({encoded: code, positions: auxpos})
