Thanks for taking the time to look at my question.
I'm trying to create a program that can add an array of values to each "column" in a sparse matrix, like so:
A = [[1,1,0],
     [0,0,1],
     [1,0,1]]
B = [[1],
     [0],
     [1]]
A + B = [[2,2,1],
         [0,0,1],
         [2,1,2]]
Represented in the coordinate format for sparse matrices it would look as follows:
A = [[0,0,1],
     [2,0,1],
     [0,1,1],
     [1,2,1],
     [2,2,1]]
B = [[0,0,1],
     [2,0,1]]
A + B = [[0,0,2],
         [2,0,2],
         [0,1,2],
         [2,1,1],
         [0,2,1],
         [1,2,1],
         [2,2,2]]
I am dealing with large matrices that must be represented in a sparse manner due to their size. I need to be able to add a column of values to each column in a matrix, but with an algorithm that deals exclusively with the sparse triplets.
I have spent all day on this, literally 10 hours straight, and I am genuinely stunned I haven't been able to find a good answer for this anywhere. Performing this operation with multiplication is simple and highly efficient, but there doesn't seem to be any time- and space-efficient solution built into scipy or numpy to do this (or if there is, it will kill me when I find out). I attempted to implement a solution, but it ended up being horribly inefficient.
Essentially, my solution, which does technically work but just with terrible time efficiency, follows these steps:
Check for shared values on rows between A and B, add the relevant triplet values together
Add the unique values from A
Add B_row_i, x_i, B_value_i for i in the columns of the matrix, checking to see that we're not adding a verbatim value from our A triplets.
At least, I think that's what it does... I'm totally burnt out now and I started to phase out while coding. If anyone could suggest any fast solutions, that would be highly appreciated!
from scipy.sparse import coo_matrix
from tqdm import tqdm

class SparseCoordinates:
    def __init__(self, coo_a, coo_b):
        self.shape = coo_a.shape
        self.coo_a_rows = coo_a.row
        self.coo_a_cols = coo_a.col
        self.coo_a_data = coo_a.data
        self.coo_b_rows = coo_b.row
        self.coo_b_cols = coo_b.col
        self.coo_b_data = coo_b.data
        self.coo_c_rows = []
        self.coo_c_cols = []
        self.coo_c_data = []

    def __check(self, a, b, c, lr, lc, lv):
        # Linear scan for a verbatim (row, col, value) triplet
        for i in range(len(lr)):
            if lr[i] == a and lc[i] == b and lv[i] == c:
                return True
        return False

    def __check_shared_rows(self):
        # Step 1: where A and B share a row, add the data values together
        for i in tqdm(range(len(self.coo_a_rows))):
            for j in range(len(self.coo_b_rows)):
                if self.coo_a_rows[i] == self.coo_b_rows[j]:
                    self.coo_c_rows.append(self.coo_a_rows[i])
                    self.coo_c_cols.append(self.coo_a_cols[i])
                    self.coo_c_data.append(self.coo_a_data[i] + self.coo_b_data[j])

    def __add_unique_from_a(self):
        # Step 2: copy over A's triplets on rows that B never touches
        a_unique = set(self.coo_a_rows) - set(self.coo_b_rows)
        for i in tqdm(range(len(self.coo_a_rows))):
            if self.coo_a_rows[i] in a_unique:
                self.coo_c_rows.append(self.coo_a_rows[i])
                self.coo_c_cols.append(self.coo_a_cols[i])
                self.coo_c_data.append(self.coo_a_data[i])

    def __add_all_remaining_from_b(self):
        # Step 3: spread each B value across every column, skipping
        # positions already covered by a verbatim triplet of A
        for i in tqdm(range(len(self.coo_b_rows))):
            for j in range(self.shape[1]):
                if not self.__check(self.coo_b_rows[i], j, self.coo_b_data[i],
                                    self.coo_a_rows, self.coo_a_cols, self.coo_a_data):
                    self.coo_c_rows.append(self.coo_b_rows[i])
                    self.coo_c_cols.append(j)
                    self.coo_c_data.append(self.coo_b_data[i])

    def add(self):
        self.__check_shared_rows()
        self.__add_unique_from_a()
        self.__add_all_remaining_from_b()
        return coo_matrix((self.coo_c_data, (self.coo_c_rows, self.coo_c_cols)),
                          shape=self.shape)
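For reference, here is how the class above can be exercised on the small example from the top of the post (this snippet just shows the intended usage):

import numpy as np

coo_a = coo_matrix(np.array([[1, 1, 0], [0, 0, 1], [1, 0, 1]]))
coo_b = coo_matrix(np.array([[1], [0], [1]]))
result = SparseCoordinates(coo_a, coo_b).add()
print(result.toarray())
# [[2 2 1]
#  [0 0 1]
#  [2 1 2]]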
For numpy arrays, A+B does the job because of broadcasting; broadcasting is implemented at the core iteration level, taking advantage of strides. scipy.sparse does not implement broadcasting. But if B is expanded to a (3,3) matrix to match A, addition does work:
In [76]: A = np.array([[1,1,0],
...: [0,0,1],
...: [1,0,1]])
...:
...: B = np.array([[1],
...: [0],
...: [1]])
In [77]: B.shape
Out[77]: (3, 1)
In [78]: A+B
Out[78]:
array([[2, 2, 1],
[0, 0, 1],
[2, 1, 2]])
Sparse:
In [79]: from scipy import sparse
In [81]: M=sparse.csr_matrix(A);N=sparse.csr_matrix(B)
Matrix multiplication is well developed for sparse matrices:
In [82]: M@N
Out[82]:
<3x1 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [84]: N.T@M
Out[84]:
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Column format>
Matrix multiplication is used for row or column indexing, and for summing.
Define a helper:
In [86]: O=sparse.csr_matrix([1,1,1])
In [87]: O
Out[87]:
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [88]: N@O
Out[88]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
Use that in the sum:
In [89]: M+N@O
Out[89]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [90]: _.A
Out[90]:
array([[2, 2, 1],
[0, 0, 1],
[2, 1, 2]])
This matrix is less sparse than M (it has fewer 0's), which reduces the benefits of using sparse matrices.
coo_matrix format is used to create matrices and to build new ones, as with sparse.bmat, sparse.hstack and sparse.vstack, but csr_matrix is used for most math; math using the coo attributes directly is generally not a good idea. Sometimes it is possible to do custom math by iterating on 'rows' via the indptr array, and there have been various SO answers that do that.
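For instance, iterating the rows of a csr_matrix M via indptr looks roughly like this (just a sketch of the pattern those answers use):

for r in range(M.shape[0]):
    # M.indptr[r]:M.indptr[r+1] bounds row r's slice of indices/data
    row_cols = M.indices[M.indptr[r]:M.indptr[r + 1]]
    row_vals = M.data[M.indptr[r]:M.indptr[r + 1]]
    # custom per-row math goes here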
However, because coo duplicates are summed when converted to csr, I could probably set up a reasonable summation of the coo formats of M and N.
In [91]: Mo=M.tocoo(); No=N.tocoo()
In [95]: Oo=O.tocoo()
In [98]: rows = np.concatenate((Mo.row, Oo.row))
In [99]: cols = np.concatenate((Mo.col, Oo.col))
In [100]: data = np.concatenate((Mo.data, Oo.data))
In [101]: sparse.coo_matrix((data,(rows,cols)),M.shape)
Out[101]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 8 stored elements in COOrdinate format>
In [102]: _.A
Out[102]:
array([[2, 2, 1],
[0, 0, 1],
[1, 0, 1]])
Here I joined the attributes of Mo and Oo, which gives M+O (only row 0 gets incremented, as the display shows). To reproduce the full broadcast sum I would instead need the attributes of No expanded across every column; since I already had Oo on hand, I only demonstrated with it. I'll leave that last step up to you.
In [103]: Oo.row,Oo.col,Oo.data
Out[103]:
(array([0, 0, 0], dtype=int32),
array([0, 1, 2], dtype=int32),
array([1, 1, 1]))
In [104]: No.row,No.col,No.data
Out[104]: (array([0, 2], dtype=int32), array([0, 0], dtype=int32), array([1, 1]))
I don't know if this coo approach is worth the effort.
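If you do want to pursue it, a minimal sketch of the remaining step: expand No's triplets across every column and rely on duplicate (row, col) entries being summed on conversion:

import numpy as np
from scipy import sparse

M = sparse.csr_matrix([[1, 1, 0], [0, 0, 1], [1, 0, 1]])
N = sparse.csr_matrix([[1], [0], [1]])
Mo, No = M.tocoo(), N.tocoo()
ncols = M.shape[1]

# each stored entry of No contributes one triplet to every column
rows = np.concatenate((Mo.row, np.repeat(No.row, ncols)))
cols = np.concatenate((Mo.col, np.tile(np.arange(ncols), len(No.row))))
data = np.concatenate((Mo.data, np.repeat(No.data, ncols)))

# duplicate coordinates are summed when the matrix is converted/densified
C = sparse.coo_matrix((data, (rows, cols)), shape=M.shape)
print(C.toarray())
# [[2 2 1]
#  [0 0 1]
#  [2 1 2]]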
Since you're already using scipy sparse arrays and numpy, you can do this quite efficiently. The actual implementation of efficient sparse operations can look a bit convoluted until you're comfortable working with the indexing of the sparse arrays, but numpy has some good tools for it.
You can do this in just a few lines, whether it's in COO, CSR or CSC format. As a quick overview, the compressed sparse matrix formats store their data in three 1-dimensional numpy arrays. The first, m.indptr, has an entry for every row (for CSR) or column (for CSC), plus one: m.indptr[i] gives the index into the other two arrays where row i begins, so m.indptr[i]:m.indptr[i+1] slices out all the indices belonging to that row. The second array, m.indices, gives the index of each nonzero value along the uncompressed axis (columns for CSR, rows for CSC). The third, m.data, holds the actual values at those positions. You can very often take advantage of this format to greatly simplify code and save time. For example, if you wanted to add 1 to each nonzero value in column i:
convert to CSC format
find the indices of the nonzero values in column i with the indptr array
slice into the data array with those indices, and add in place (i.e. += 1)
m.data[m.indptr[i]:m.indptr[i+1]] += 1
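Putting those steps together in a small runnable sketch (the matrix here is just an illustration):

import numpy as np
from scipy import sparse

m = sparse.csc_matrix(np.array([[1, 1, 0],
                                [0, 0, 1],
                                [1, 0, 1]]))
i = 0  # the column whose nonzero values we want to bump
# indptr[i]:indptr[i+1] bounds all stored entries of column i
m.data[m.indptr[i]:m.indptr[i + 1]] += 1
print(m.toarray())
# [[2 1 0]
#  [0 0 1]
#  [2 0 1]]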
That's it! Now, we potentially need to add a different value to each row or column, and doing that step in a for loop is not practical for large problems. A helpful general technique is to construct an array the same size as the data array, initially filled with zeros, use some combination of slicing and numpy functions like take, diff, and repeat (together with the indptr and indices arrays) to prepare the changes, and finally update the data array once, in place.
When you're adding a vector to each column of the matrix, the modification to each value depends on its row index only, so if we can figure out a way to efficiently add the correct value to each row at once, we're good. If we have a CSR matrix, we know exactly how many nonzero elements are in each row, and that they're stored together in the data array. np.diff(m.indptr) very conveniently gives the number of nonzero elements in each row, so np.repeat(vec, np.diff(m.indptr)) builds an array the same size as the data array, with vec[0] repeated once for every element in the first row, vec[1] once for every element in the second row, and so on; m.data += np.repeat(vec, np.diff(m.indptr)) then adds it all in place. This requires only O(nnz) time and memory, which is optimal, and makes full use of numpy's vectorization with no additional effort.
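Concretely, a minimal sketch using the small matrices from the question (note this updates only the explicitly stored entries; zeros of A stay zero, whereas a full broadcast add would densify the result):

import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1, 1, 0],
                                [0, 0, 1],
                                [1, 0, 1]]))
vec = np.array([1, 0, 1])  # one value to add per row

# vec[r] is repeated once for every stored entry in row r
A.data += np.repeat(vec, np.diff(A.indptr))
print(A.toarray())
# [[2 2 0]
#  [0 0 1]
#  [2 0 2]]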
If some other constraint forces you to alter values based on indices along the uncompressed axis, a slightly less efficient solution follows the exact same idea, but instead of repeat and the indptr array you use np.take(vec, m.indices). This creates an array whose element i is taken from vec at the index given by m.indices[i]; in other words, it takes the element of vec that corresponds to the row/column index of each nonzero element. This accesses memory less sequentially when pulling values from vec, but in practice, since vec is typically much smaller than the data array and the data update itself is sequential, it is still very fast.
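A sketch of the same idea with take, here adding an illustrative value per column to a CSR matrix's stored entries:

import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1, 1, 0],
                                [0, 0, 1],
                                [1, 0, 1]]))
col_vec = np.array([10, 20, 30])  # one value to add per column

# for CSR, A.indices holds each stored entry's column index
A.data += np.take(col_vec, A.indices)
print(A.toarray())
# [[11 21  0]
#  [ 0  0 31]
#  [11  0 31]]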
Edited to add: the sparsefuncs module in scikit-learn has some very instructive code, along with the similar sparsefuncs_fast Cython module. You might notice that the methods I presented are very similar to the code used in the row and column scaling functions- that's where I learned it from.
I have a numpy array which has 100 rows and 16026 columns. I have to find the median of every column. So median for every column will be calculated from 100 observations (100 rows in this case). I am using the following code to achieve this:
for category in categories:
    indices = np.random.randint(0, len(os.listdir(filepath + category)) - 1, 100)
    tempArray = X_train[indices, ]
    medArray = np.median(tempArray, axis=0)
    print(medArray.shape)
And here's the output that I get:
(100, 16026)
(100, 16026)
(100, 16026)
(100, 16026)
My question is - why is the shape of medArray 100*16026 and not 1*16026? Because I am calculating the median of every column, I would expect only one row with 16026 columns. What am I missing here?
Please note that X_train is a sparse matrix.
X_train.shape
output:
(2034, 16026)
Any help in this regard is much appreciated.
Edit:
The above problem has been solved by the toarray() function.
tempArray = X_train[indices, ].toarray()
I also figured that I was being stupid and also including all the zeroes in my median calculation and that's why I was getting 0 as the median all the time. Is there an easy way of calculating the median by removing/ignoring the zeroes across all the columns?
That's really strange; I think you should get (16026,). Are we missing something here?
In [241]:
X_train=np.random.random((1000,16026)) #1000 can be any int.
indices = np.random.randint(0, 60, 100) #60 can be any int.
tempArray = X_train[indices, ]
medArray = np.median(tempArray, axis=0)
print(medArray.shape)
(16026,)
And the only way you can get a 2d array result is:
In [243]:
X_train=np.random.random((100,2,16026))
indices = np.random.randint(0, 60, 100)
tempArray = X_train[indices, ]
medArray = np.median(tempArray, axis=0)
print(medArray.shape)
(2, 16026)
That is, when you have a 3d array as input.
When it is a sparse array, a dumb way to get around this might be:
In [319]:
X_train = sparse.rand(112, 16026, 0.5, 'csr') #just make up a random sparse array
indices = np.random.randint(0, 60, 100)
tempArray = X_train[indices, ]
medArray = np.median(tempArray.toarray(), axis=0)
print(medArray.shape)
(16026,)
.toarray() might also go on the 3rd line instead. But either way, this means the 0's are also counted, as @zhangxaochen pointed out. I'm out of ideas beyond that; there may be better explanations for it.
The problem is that NumPy doesn't recognize sparse matrices as arrays or array-like objects. For example, calling asanyarray on a sparse matrix returns a 0D array whose one element is the original sparse matrix:
In [8]: numpy.asanyarray(scipy.sparse.csc_matrix([[1,2,3],[4,5,6]]))
Out[8]:
array(<2x3 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Column format>, dtype=object)
Like most of NumPy, numpy.median relies on having an array or array-like object as input. The routines it relies on, particularly sort, won't understand what they're looking at if you give it a sparse matrix.
I was finally able to solve this. I used masked arrays and the following code:
import numpy as np
from random import randrange

sample = []
sample_size = 50
# indices of the training documents that belong to class i
idx = np.flatnonzero(newsgroups_train.target == i)
random_index = []
for j in range(sample_size):
    random_index.append(randrange(0, len(idx) - 1))
sample.append(idx[random_index])

# mask the zeros so np.ma.median ignores them
dense = X_train[sample[0]].toarray()
y = np.ma.masked_where(dense == 0, dense)
medArray = np.ma.median(y, axis=0).filled(0)

print('============median ' + newsgroups_train.target_names[i] + '=============')
for k, word in enumerate(np.array(vectorizer.get_feature_names())[np.argsort(medArray)[::-1][0:10]]):
    print(word + ':' + str(np.sort(medArray)[::-1][k]))
This gave me the median ignoring zeros.
How can you iterate over all 2^(n^2) binary n-by-n matrices (or 2d arrays) in numpy? I would like something like:
for M in ....:
Do you have to use itertools.product([0,1], repeat = n**2) and then convert to a 2d numpy array?
This code will give me a random 2d binary matrix but that isn't what I need.
np.random.randint(2, size=(n,n))
Note that 2**(n**2) is a big number even for relatively small n, so your loop might run indefinitely long.
That being said, one possible way to iterate over the matrices you need is, for example:
import numpy as np

nxn = np.arange(n**2).reshape(n, -1)  # bit position assigned to each cell
for i in range(2**(n**2)):
    arr = (i >> nxn) % 2  # spread the bits of i into an n-by-n 0/1 array
    # do something with arr
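For instance, with n = 2 this enumerates all 2**(n**2) == 16 matrices; a quick check of the first few:

import numpy as np

n = 2
nxn = np.arange(n**2).reshape(n, -1)
for i in range(4):
    print((i >> nxn) % 2)
# [[0 0]    [[1 0]    [[0 1]    [[1 1]
#  [0 0]]    [0 0]]    [0 0]]    [0 0]]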
np.array(list(itertools.product([0,1], repeat = n**2))).reshape(-1,n,n)
produces a (2^(n^2),n,n) array.
There may be some numpy 'grid' function that does the same, but my recollection from other discussions is that itertools.product is pretty fast.
g=(np.array(x).reshape(n,n) for x in itertools.product([0,1], repeat = n**2))
is a generator that produces the nxn arrays one at time:
next(g)
# array([[0, 0],[0, 0]])
Or, starting from a fresh generator, to produce the same 3d array:
np.array(list(g))