I have two sparse matrices built with Python's sparse package. See below:
import sparse
total_coords1 = [(0,1,1,2), (0,0,2,3), (0,1,2,2)]
data1 = [1,1,1,1]
s1 = sparse.COO(total_coords1, data1, shape=(7, 5, 12))
total_coords2 = [(0,1,2,3), (0,1,1,2), (0,1,2,2)]
data2 = [2,2,2,2]
s2 = sparse.COO(total_coords2, data2, shape=(7, 5, 15))
I want to combine these two sparse matrices into a single sparse matrix along the last axis (axis=2). Something like:
s3 = sparse.COO(s1, s2)
Concatenation along axis=2 works here; in fact, given the shapes (7, 5, 12) and (7, 5, 15), it is the only axis along which the two arrays can be concatenated.
You can use the concatenate function to get a single sparse array of shape (7, 5, 27):
s3 = sparse.concatenate([s1,s2], axis=2)
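As a quick sanity check (a sketch), the result has the expected shape and keeps every stored value from both inputs:
s3 = sparse.concatenate([s1, s2], axis=2)
print(s3.shape)                    # (7, 5, 27)
print(s3.nnz == s1.nnz + s2.nnz)   # True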
Related
I want to perform an element-wise multiplication of two (scipy) sparse matrices: A.shape = B.shape = (m,n). However, matrix B consists of a smaller matrix B_base which is stacked horizontally. Obviously, this is not memory-efficient. Thus, the question: how can I efficiently multiply A and B_base element-wise without stacking?
Below is an MWE using sparse.hstack:
from scipy import sparse
A = sparse.random(m=1000, n=10000, density=0.1, format="csc")
B = sparse.random(m=1000, n=1000, density=0.1, format="csc")
factor_matrix = sparse.hstack([B for i in range(10)], format="csc")
result = A.multiply(factor_matrix)
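One way to avoid materializing the stacked matrix (a sketch, assuming A's column count is an exact multiple of B's width; it spares the 10-fold copy of B, though the result is of course still full size) is to multiply A against the single copy of B block by block and stitch the pieces back together:
n_base = B.shape[1]
n_blocks = A.shape[1] // n_base
# multiply each column block of A by B, then reassemble the blocks
result_blocks = sparse.hstack(
    [A[:, i * n_base:(i + 1) * n_base].multiply(B) for i in range(n_blocks)],
    format="csc",
)
print(abs(result_blocks - result).max())   # 0.0 -- same values as the stacked-matrix result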
Let's suppose I have these two variables
matrices = np.random.rand(4,3,3)
vectors = np.random.rand(4,3,1)
What I would like to perform is the following:
dot_products = [matrix @ vector for (matrix,vector) in zip(matrices,vectors)]
Therefore, I've tried using the np.tensordot method, which at first seemed to make sense, but this happened when testing
>>> np.tensordot(matrices,vectors,axes=([-2,-1],[-2,-1]))
...
ValueError: shape-mismatch for sum
>>> np.tensordot(matrices,vectors,axes=([-2,-1]))
...
ValueError: shape-mismatch for sum
Is it possible to achieve these multiple dot products with the mentioned Numpy method? If not, is there another way that I can accomplish this using Numpy?
The documentation for @ is found at np.matmul. It is specifically designed for this kind of 'batch' processing:
In [76]: matrices = np.random.rand(4,3,3)
...: vectors = np.random.rand(4,3,1)
In [77]: dot_products = [matrix @ vector for (matrix,vector) in zip(matrices,vectors)]
In [79]: np.array(dot_products).shape
Out[79]: (4, 3, 1)
In [80]: (matrices @ vectors).shape
Out[80]: (4, 3, 1)
In [81]: np.allclose(np.array(dot_products), matrices @ vectors)
Out[81]: True
A couple of problems with tensordot. The axes parameter specifies which dimensions are summed ("dotted"); in your case that would be the last axis of matrices and the second-to-last axis of vectors. That's the standard dot pairing.
In [82]: np.dot(matrices, vectors).shape
Out[82]: (4, 3, 4, 1)
In [84]: np.tensordot(matrices, vectors, (-1,-2)).shape
Out[84]: (4, 3, 4, 1)
You tried to specify 2 pairs of axes for summing. Also, dot/tensordot does a kind of outer product on the remaining dimensions, so you would have to take the "diagonal" on the 4's. tensordot is not what you want for this operation (see the check after the einsum example below).
We can be more explicit about the dimensions with einsum:
In [83]: np.einsum('ijk,ikl->ijl',matrices, vectors).shape
Out[83]: (4, 3, 1)
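To make the "diagonal" remark above concrete, here is a small check (a sketch): selecting the matching batch indices out of the tensordot result recovers the batched product, which is exactly why matmul or einsum is the simpler tool here:
td = np.tensordot(matrices, vectors, (-1, -2))   # shape (4, 3, 4, 1)
diag = td[np.arange(4), :, np.arange(4), :]      # pair batch i with batch i -> (4, 3, 1)
print(np.allclose(diag, matrices @ vectors))     # True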
I have a problem in Python where I would like to merge several sparse matrices into one. The sparse matrices are of csr_matrix type and all have the same number of rows. When I use hstack to stack them together I obtain an array of matrices, but I would like a single matrix whose number of rows is the same as every input matrix and whose number of columns is the sum of the column counts of all the matrices.
Thanks for the support.
You can do this using scipy.sparse.hstack. For example:
import numpy as np
from scipy import sparse
x = sparse.csr_matrix(np.random.randint(0, 2, size=(10, 10)))
y = sparse.csr_matrix(np.random.randint(0, 2, size=(10, 10)))
xy = sparse.hstack([x, y])
print(xy.shape)
# (10, 20)
print(type(xy))
# <class 'scipy.sparse.coo.coo_matrix'>
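Note that hstack returns a COO matrix by default. Since the question starts from CSR matrices, you can ask for CSR directly via the format argument (or call .tocsr() on the result):
xy = sparse.hstack([x, y], format="csr")
print(xy.shape)
# (10, 20)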
I have an array y with shape (n,). I want to compute the inner product matrix, which is an n * n matrix.
However, when I tried to do it in Python
np.dot(y , y)
I got the answer n, which is not what I am looking for.
I have also tried:
np.dot(np.transpose(y),y)
np.dot(y, np.transpose(y))
I always get the same answer n
I think you are looking for:
np.multiply.outer(y,y)
or equally:
y = y[None,:]
y.T @ y
example:
y = np.array([1,2,3])[None,:]
print(y.T @ y)
output:
#[[1 2 3]
# [2 4 6]
# [3 6 9]]
You can try to reshape y from shape (70,) to (70,1) before multiplying the 2 matrices.
# Reshape
y = y.reshape(70,1)
# Either below code would work
y*y.T
np.matmul(y,y.T)
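As a tiny illustration of why this works (the element-wise product of a column vector and a row vector broadcasts to the full outer product):
y = np.array([1, 2, 3]).reshape(3, 1)   # column vector
print(y * y.T)                           # broadcasts to the 3x3 outer product
# [[1 2 3]
#  [2 4 6]
#  [3 6 9]]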
One-liner?
np.dot(a[:, None], a[None, :])
transpose doesn't work on 1-D arrays, because you need at least two axes to 'swap' them. This solution adds a new axis to the array: in the first argument it looks like a column vector and has two axes; in the second argument it still looks like a row vector but now has two axes.
Looks like what you need is the @ matrix multiplication operator. The dot method is only for computing the dot product between vectors; what you want is matrix multiplication.
>>> a = np.random.rand(70, 1)
>>> (a @ a.T).shape
(70, 70)
UPDATE:
The above claim is incorrect: dot does the same thing if the arrays are 2-D. See the docs here.
np.dot computes the dot product of two arrays. Specifically,
If both a and b are 1-D arrays, it is inner product of vectors (without complex conjugation).
If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a # b is preferred.
The simplest way to do what you want is to convert the vector to a matrix first using np.matrix and then use @. Although dot can also be used, @ is better because conventionally dot is used for vectors and @ for matrices.
>>> a = np.random.rand(70)
>>> a.shape
(70,)
>>> a = np.matrix(a).T
>>> a.shape
(70, 1)
>>> (a @ a.T).shape
(70, 70)
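As an aside (not part of the original answer): np.matrix is discouraged in current NumPy, and the same result can be had with a plain array by adding an axis:
a = np.random.rand(70)[:, None]   # shape (70, 1)
print((a @ a.T).shape)            # (70, 70)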
I have a numpy array which has 100 rows and 16026 columns. I have to find the median of every column. So median for every column will be calculated from 100 observations (100 rows in this case). I am using the following code to achieve this:
for category in categories:
    indices = np.random.randint(0, len(os.listdir(filepath + category)) - 1, 100)
    tempArray = X_train[indices, ]
    medArray = np.median(tempArray, axis=0)
    print(medArray.shape)
And here's the output that I get:
(100, 16026)
(100, 16026)
(100, 16026)
(100, 16026)
My question is - why is the shape of medArray 100*16026 and not 1*16026? Because I am calculating the median of every column, I would expect only one row with 16026 columns. What am I missing here?
Please note that X_train is a sparse matrix.
X_train.shape
output:
(2034, 16026)
Any help in this regard is much appreciated.
Edit:
The above problem has been solved by the toarray() function:
tempArray = X_train[indices, ].toarray()
I also figured that I was being stupid and including all the zeroes in my median calculation, and that's why I was getting 0 as the median all the time. Is there an easy way of calculating the median by removing/ignoring the zeroes across all the columns?
That's really strange; I think you should get (16026,). Are we missing something here?
In [241]:
X_train=np.random.random((1000,16026)) #1000 can be any int.
indices = np.random.randint(0, 60, 100) #60 can be any int.
tempArray = X_train[indices, ]
medArray = np.median(tempArray, axis=0)
print(medArray.shape)
(16026,)
And the only way you can get a 2d array result is:
In [243]:
X_train=np.random.random((100,2,16026))
indices = np.random.randint(0, 60, 100)
tempArray = X_train[indices, ]
medArray = np.median(tempArray, axis=0)
print(medArray.shape)
(2, 16026)
When you have a 3d array input.
When it is a sparse array, a dumb way to get around this might be:
In [319]:
X_train = sparse.rand(112, 16026, 0.5, 'csr') #just make up a random sparse array
indices = np.random.randint(0, 60, 100)
tempArray = X_train[indices, ]
medArray = np.median(tempArray.toarray(), axis=0)
print(medArray.shape)
(16026,)
.toarray() could also go on the 3rd line instead. Either way, this means the 0's are also counted, as @zhangxaochen pointed out.
I'm out of ideas; there may be better explanations for it.
The problem is that NumPy doesn't recognize sparse matrices as arrays or array-like objects. For example, calling asanyarray on a sparse matrix returns a 0D array whose one element is the original sparse matrix:
In [8]: numpy.asanyarray(scipy.sparse.csc_matrix([[1,2,3],[4,5,6]]))
Out[8]:
array(<2x3 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Column format>, dtype=object)
Like most of NumPy, numpy.median relies on having an array or array-like object as input. The routines it relies on, particularly sort, won't understand what they're looking at if you give them a sparse matrix.
I was finally able to solve this. I used masked arrays and the following code:
from random import randrange

sample = []
sample_size = 50
idx = np.flatnonzero(newsgroups_train.target == i)   # rows belonging to class i
random_index = []
for j in range(sample_size):
    random_index.append(randrange(0, len(idx) - 1))
sample.append(idx[random_index])   # presumably how sample[0] below gets its row indices
# mask the zeros so the median ignores them
y = np.ma.masked_where(X_train[sample[0]].toarray() == 0, X_train[sample[0]].toarray())
medArray = np.ma.median(y, axis=0).filled(0)
print('============median ' + newsgroups_train.target_names[i] + '=============')
for k, word in enumerate(np.array(vectorizer.get_feature_names())[np.argsort(medArray)[::-1][0:10]]):
    print(word + ':' + str(np.sort(medArray)[::-1][k]))
This gave me the median ignoring zeros.
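If you would rather skip the masked-array machinery, one possibly simpler alternative (a sketch: it densifies the sampled rows and falls back to 0 for columns that are entirely zero) is to turn the zeros into NaN and use np.nanmedian, which ignores NaNs column by column:
dense = X_train[sample[0]].toarray().astype(float)
dense[dense == 0] = np.nan                                 # hide the zeros from the median
med_nonzero = np.nan_to_num(np.nanmedian(dense, axis=0))   # all-NaN columns become 0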