Python - sparse vectors/distance calculation

I'm looking for dynamically growing vectors in Python, since I don't know their length in advance. In addition, I would like to calculate distances between these sparse vectors, preferably using the distance functions in scipy.spatial.distance (although any other suggestions are welcome). Any ideas how to do this? (Initially, it doesn't need to be efficient.)
Thanks a lot in advance!

You can use regular python lists (which are dynamic) as vectors. Trivial example follows.
from scipy.spatial.distance import sqeuclidean
a = [1,2,3]
b = [0,0,0]
print(sqeuclidean(a, b))  # 14
As per aganders3's suggestion, do note that you can also use numpy arrays if needed:
import numpy
a = numpy.array([1,2,3])
If the sparse part of your question is crucial, I'd use scipy for that - it has support for sparse matrices. You can define a 1xn matrix and use it as a vector. This works (the parameter is the size of the matrix, filled with zeroes by default):
import scipy.sparse
sqeuclidean(scipy.sparse.coo_matrix((1,3)), scipy.sparse.coo_matrix((1,3)))  # 0
There are many kinds of sparse matrices, some dictionary-based (see comment). You can define a sparse row matrix from a list like this:
scipy.sparse.csr_matrix([1,2,3])
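If the distance functions complain about sparse input, a simple (if inefficient) workaround is to densify first. A minimal sketch:
import scipy.sparse
from scipy.spatial.distance import sqeuclidean
a = scipy.sparse.csr_matrix([1, 2, 3])
b = scipy.sparse.csr_matrix((1, 3))  # all-zero 1x3 sparse vector
# densify to 1-D arrays before handing them to scipy.spatial.distance
print(sqeuclidean(a.toarray().ravel(), b.toarray().ravel()))  # 14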

Here is how you can do it in numpy:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([0, 0, 0])
c = np.sum((a - b) ** 2)  # 14
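Equivalently, the same squared distance can be written as a dot product of the difference with itself:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([0, 0, 0])
d = a - b
c = d.dot(d)  # 14, same as np.sum((a - b) ** 2)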

Related

Python (NumPy): Memory efficient array multiplication with fancy indexing

I'm looking to do fast matrix multiplication in Python, preferably with NumPy, of an array A with another array B of repeated matrices, by using a third array I of indices. This can be accomplished using fancy indexing and matrix multiplication:
from numpy.random import rand, randint
A = rand(1000,5,5)
B = rand(40000000,5,1)
I = randint(low=0, high=1000, size=40000000)
A[I] @ B
However, this creates the intermediate array A[I] of shape (40000000, 5, 5), which overflows the memory. It seems highly inefficient to have to repeat a small set of matrices for multiplication, and this is essentially a more general version of broadcasting such as A[0:1] @ B, which has no issues.
Are there any alternatives?
I have looked at NumPy's einsum function but have not seen any support for utilizing an index vector in the call.
If you're open to another package, you could wrap it up with dask.
from numpy.random import rand, randint
from dask import array as da
A = da.from_array(rand(1000,5,5))
B = da.from_array(rand(40000000,5,1))
I = da.from_array(randint(low=0, high=1000, size=40000000))
fancy = A[I] @ B
Once you're finished manipulating it, bring the result into memory using fancy.compute().
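If you'd rather stay in pure NumPy, one possible alternative (a sketch; the chunk size is an arbitrary knob, not a recommendation) is to chunk the fancy index yourself so that only a slice of A[I] is ever materialized:
import numpy as np
from numpy.random import rand, randint
A = rand(1000, 5, 5)
B = rand(4000000, 5, 1)  # reduced from 40000000 so the demo fits in memory
I = randint(low=0, high=1000, size=4000000)
out = np.empty_like(B)
chunk = 500000  # tune to available memory
for start in range(0, len(I), chunk):
    sl = slice(start, start + chunk)
    out[sl] = A[I[sl]] @ B[sl]  # only a chunk-sized A[I] is materialized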

How to use a sparse matrix in numpy.linalg.solve

I want to solve the following linear system for x
Ax = b
where A is sparse and b is just a regular column vector. However, when I plug it into the usual np.linalg.solve(A, b) routine it gives me an error, whereas np.linalg.solve(A.todense(), b) works fine.
Question.
How can I use this linear solver while still preserving the sparseness of A? The reason is that A is quite large, about 150 x 150, and there are about 50 such matrices, so keeping them sparse for as long as possible is the way I'd prefer it.
I hope my question makes sense. How should I go about achieving this?
Use scipy instead to work on sparse matrices. You can do that using scipy.sparse.linalg.spsolve. For further details, read its documentation: spsolve.
np.linalg.solve only works for array-like objects. For example it would work on a np.ndarray or np.matrix (Example from the numpy documentation):
import numpy as np
a = np.array([[3,1], [1,2]])
b = np.array([9,8])
x = np.linalg.solve(a, b)
or
import numpy as np
a = np.matrix([[3,1], [1,2]])
b = np.array([9,8])
x = np.linalg.solve(a, b)
or on A.todense() where A=scipy.sparse.csr_matrix(np.matrix([[3,1], [1,2]])) as this returns a np.matrix object.
To work with a sparse matrix, you have to use scipy.sparse.linalg.spsolve (as already pointed out by rakesh)
import numpy as np
import scipy.sparse
import scipy.sparse.linalg
a = scipy.sparse.csr_matrix(np.matrix([[3,1], [1,2]]))
b = np.array([9,8])
x = scipy.sparse.linalg.spsolve(a, b)
Note that x is still a np.ndarray and not a sparse matrix. A sparse matrix will only be returned if you solve Ax=b, with b being a matrix and not a vector.
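For instance, a minimal sketch of that case: passing b as a sparse column matrix yields a sparse result.
import numpy as np
import scipy.sparse
import scipy.sparse.linalg
a = scipy.sparse.csr_matrix(np.array([[3, 1], [1, 2]]))
b = scipy.sparse.csc_matrix(np.array([[9], [8]]))  # b as a sparse 2x1 matrix
x = scipy.sparse.linalg.spsolve(a, b)  # x is returned as a sparse matrix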

python numpy vector math

What is the numpy equivalent to euclid's 2D vector classes/operations? (like: euclid.Vector2)
So far I have this. Create two vectors:
import numpy as np
loc = np.array([100., 100.])
vel = np.array([30., 10])
loc += vel
# resetting speed to a default value, maintaining direction
vel.normalize()  # pseudocode: numpy arrays have no normalize() method
vel *= 200
loc += vel
You can just use numpy arrays. Look at the numpy for matlab users page for a detailed overview of the pros and cons of arrays w.r.t. matrices.
As I mentioned in the comment, having to use the dot() function or method for multiplication of vectors is the biggest pitfall. But then again, numpy arrays are consistent. All operations are element-wise. So adding or subtracting arrays and multiplication with a scalar all work as expected of vectors.
Edit 2: Starting with Python 3.5 and numpy 1.10 you can use the @ infix operator for matrix multiplication, thanks to PEP 465.
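A quick illustration of the element-wise behaviour versus the dot product (a minimal sketch):
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a * b)     # element-wise: [ 4 10 18]
print(a.dot(b))  # dot product: 32
print(a @ b)     # same as dot(), Python 3.5+ / numpy 1.10+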
Edit: Regarding your comment:
Yes. The whole of numpy is based on arrays.
Yes. linalg.norm(v) is a good way to get the length of a vector. But what you get depends on the possible second argument to norm! Read the docs.
To normalize a vector, just divide it by the length you calculated in (2). Division of arrays by a scalar is also element-wise.
An example in ipython:
In [1]: import math
In [2]: import numpy as np
In [3]: a = np.array([4,2,7])
In [4]: np.linalg.norm(a)
Out[4]: 8.3066238629180749
In [5]: math.sqrt(sum([n**2 for n in a]))
Out[5]: 8.306623862918075
In [6]: b = a/np.linalg.norm(a)
In [7]: np.linalg.norm(b)
Out[7]: 1.0
Note that In [5] is an alternative way to calculate the length. In [6] shows normalizing the vector.
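Applied to the question's snippet (and noting again that numpy arrays have no normalize() method), the resetting step could be written by hand like this:
import numpy as np
loc = np.array([100., 100.])
vel = np.array([30., 10.])
loc += vel
# reset speed to a default value, maintaining direction:
# divide by the length, then scale
vel = vel / np.linalg.norm(vel) * 200
loc += vel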

Efficient two dimensional numpy array statistics

I have many 100x100 grids, is there an efficient way using numpy to calculate the median for every grid point and return just one 100x100 grid with the median values? Presently, I'm using a for loop to run through each grid point, calculating the median and then combining them into one grid at the end. I'm sure there's a better way to do this using numpy. Any help would be appreciated! Thanks!
Create a 100x100xN array (or stack the grids together if that's not possible) and use np.median with the correct axis to do it in one go:
import numpy as np
a = np.random.rand(100,100)
b = np.random.rand(100,100)
c = np.random.rand(100,100)
d = np.dstack((a,b,c))
result = np.median(d,axis=2)
How many grids are there?
One option would be to create a 3D array that is 100x100xnumGrids and compute the median across the 3rd dimension.
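For example, a sketch assuming the grids are collected in a Python list:
import numpy as np
grids = [np.random.rand(100, 100) for _ in range(50)]  # stand-in data
stacked = np.dstack(grids)            # shape (100, 100, numGrids)
medians = np.median(stacked, axis=2)  # one 100x100 grid of medians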
Use the axis parameter of median:
import numpy as np
data = np.random.rand(100, 5, 5)
print(np.median(data, axis=0))
print(np.median(data[:, 0, 0]))
print(np.median(data[:, 1, 0]))

pointwise operations on scipy.sparse matrices

Is it possible to apply for example numpy.exp or similar pointwise operators to all elements in a scipy.sparse.lil_matrix or another sparse matrix format?
import numpy
from scipy.sparse import lil_matrix
x = numpy.ones((10,10))
y = numpy.exp(x)
x = lil_matrix(numpy.ones((10,10)))
# y = ????
numpy.exp(x) or scipy.exp(x) yields an AttributeError, and numpy.exp(x.data) yields the same.
thanks!
I do not know the full details, but converting to another type works, at least when applying the function to the array of non-zero elements:
xcsc = x.tocsc()
numpy.exp(xcsc.data) # works
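To get the result back as a sparse matrix rather than just a transformed data array, one option (a sketch) is to rebuild a CSC matrix from the exponentiated data. Note this only touches the stored non-zeros: implicit zeros stay zero, whereas a true pointwise exp would map them to 1.
import numpy as np
from scipy.sparse import lil_matrix, csc_matrix
x = lil_matrix(np.ones((10, 10)))
xcsc = x.tocsc()
# new sparse matrix whose stored entries are exp() of the originals
y = csc_matrix((np.exp(xcsc.data), xcsc.indices, xcsc.indptr), shape=xcsc.shape)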
