numpy rearranging when mixing integer and list indices [duplicate] - python

Original question
I am getting a very odd error message when I try to assign some of the elements of an array. I am using a combination of a slice and a set of indices. See the following simple example.
import scipy as sp
a = sp.zeros((3, 4, 5))
b = sp.ones((4, 5))
I = sp.array([0, 1, 3])
b[:, I] = a[0, :, I]
This code raises the following ValueError:
ValueError: shape mismatch: value array of shape (3,4) could not be broadcast to indexing result of shape (4,3)
--
Follow up
Be careful when using a combination of a slice and a sequence of integers. As pointed out on GitHub:
x = np.random.rand(3, 5, 7)
print(x[0, :, [0,1]].shape)
# (2, 5)
print(x[0][:, [0,1]].shape)
# (5, 2)
This is how numpy is designed to work, but it is nevertheless a bit confusing that x[0][:, I] is not the same as x[0, :, I]. Since this is the behavior I want, I chose to use x[0][:, I] in my code.
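For reference, here is a minimal sketch of the working assignment with the arrays from the original question; both sides end up with shape (4, 3):

import numpy as np

a = np.zeros((3, 4, 5))
b = np.ones((4, 5))
I = np.array([0, 1, 3])

b[:, I] = a[0][:, I]      # both sides have shape (4, 3)
# the x[0, :, I] form needs a transpose instead:
# b[:, I] = a[0, :, I].T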

Looks like there are some errors in copying your code into the question.
But I suspect there's a known problem with indexing:
In [73]: a=np.zeros((2,3,4)); b=np.ones((3,4)); I=np.array([0,1])
Make I 2 elements. Indexing b gives the expected (3,2) shape: 3 rows from the slice, 2 columns from the I indexing.
In [74]: b[:,I].shape
Out[74]: (3, 2)
But with the 3d array a we get the transpose.
In [75]: a[0,:,I].shape
Out[75]: (2, 3)
and assignment would produce an error
In [76]: b[:,I]=a[0,:,I]
...
ValueError: array is not broadcastable to correct shape
It's putting the 2-element dimension defined by I first, and the 3-element one from : second. It's a case of mixed advanced indexing that has been discussed earlier, and there's a bug issue as well (see the links at the end).
You are probably using a newer numpy (or scipy) and getting a different error message.
It's documented that indexing with two arrays or lists, with a slice in the middle, puts the slice's dimension at the end, e.g.
In [86]: a[[[0],[0],[1],[1]],:,[0,1]].shape
Out[86]: (4, 2, 3)
The same thing is happening with a[0,:,[0,1]]. But there's a good argument that it shouldn't be this way.
As for a fix, you could transpose the value, or change the indexing:
In [88]: b[:,I]=a[0:1,:,I]
In [90]: b[:,I]=a[0,:,I].T
In [91]: b
Out[91]:
array([[ 0.,  0.,  1.,  1.],
       [ 0.,  0.,  1.,  1.],
       [ 0.,  0.,  1.,  1.]])
In [92]: b[:,I]=a[0][:,I]
https://github.com/numpy/numpy/issues/7030
https://github.com/numpy/numpy/pull/6256

First of all, it looks like you're missing a comma on line 6:
I = sp.array([0,1,4])
Secondly, I would expect the value 4 in the array I to raise an IndexError, since the indexed axes of a and b only have size 4. I suspect you might want:
I = sp.array([0,1,3])
Making these changes, the program ran for me, and I got b as:
[[ 0.  0.  1.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  1.  0.]]
Which I suspect is what you want.

Here I get this error with indices [0,1,4]:
IndexError: index 4 is out of bounds for axis 2 with size 4
It's suggesting the value 4 is being used as an index, while the SIZE 4 implies the max index would be 3.
EDIT: now that you changed it to [0, 1, 3], it's running fine here.
EDIT: with your current code, I get the same error, but when I print the arrays themselves, they have transposed shapes:
print b[:, I]
print a[0, :, I]
[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]

Related

Assigning values from a numpy array to a zeros array

I've been trying to assign values from one array to another, specifically from an array with values to a zeros array. The position of these values in the zeros array is essential. This is also a small piece of a bigger program; the bigger picture is to be able to import values from an Excel spreadsheet into a zeros matrix. This is my problem:
import numpy as np
x = np.zeros((2,3))
P = np.asarray([1,2,3,4,5,6])
for i in range(0,2):
    for j in range(0,3):
        x[i,j] = P[(i-1)*3+j]  # 3 is the counter in x direction, nx
x
With this code, the output is (which is what I want):
array([[4., 5., 6.],
       [1., 2., 3.]])
However, if I try to expand the array, as such:
import numpy as np
x = np.zeros((3,3))
P = np.asarray([1,2,3,4,5,6,7,8,9])
for i in range(0,3):
    for j in range(0,3):
        x[i,j] = P[(i-1)*3+j]  # 3 is the counter in x direction, nx
x
The output is:
array([[7., 8., 9.],
       [1., 2., 3.],
       [4., 5., 6.]])
I expect the output to be:
array([[7., 8., 9.],
       [4., 5., 6.],
       [1., 2., 3.]])
Is there a reason why the output is changing with the expansion of the array?
You don't need to iterate:
In [323]: P=np.arange(1,10).reshape(3,3)[::-1,:]
In [324]: P
Out[324]:
array([[7, 8, 9],
       [4, 5, 6],
       [1, 2, 3]])
As for your loop, look at the i,j's:
In [325]: for i in range(3):
     ...:     for j in range(3):
     ...:         print(i,j,(i-1)*3+j)
     ...:
0 0 -3
0 1 -2
0 2 -1
1 0 0
1 1 1
1 2 2
2 0 3
2 1 4
2 2 5
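For completeness, here is a sketch of the loop with the index computed explicitly (my fix, not from the answer above): the original (i-1)*3+j only produced the wanted reversal for two rows because of negative-index wraparound, so compute the reversed row directly:

import numpy as np

m, n = 3, 3
P = np.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9])
x = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        x[i, j] = P[(m - 1 - i) * n + j]  # row m-1-i of P, column j
# x is now [[7., 8., 9.], [4., 5., 6.], [1., 2., 3.]]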
You don't need to use a loop; just use flip() with reshape().
import numpy as np
m = 3  # number of rows you want
n = 3  # number of columns you want
P = np.asarray([1,2,3,4,5,6,7,8,9])
P = np.flip(P.reshape(m,n), axis=0)
print(P)
[[7 8 9]
 [4 5 6]
 [1 2 3]]
If you want to assign it into a zeros matrix, you can index the target rows directly. For example, let's say you have a much bigger zeros matrix and you want to fill rows x, y, z with the matrix generated above.
zero = np.zeros((10, 3))
print(zero.shape)
zero[[2, 5, 7], :] = P  # assign P's rows to rows 2, 5 and 7 of the zeros matrix
print(zero)
(10, 3)
[[0. 0. 0.]
 [0. 0. 0.]
 [7. 8. 9.]
 [0. 0. 0.]
 [0. 0. 0.]
 [4. 5. 6.]
 [0. 0. 0.]
 [1. 2. 3.]
 [0. 0. 0.]
 [0. 0. 0.]]
You can also loop through:
for i in range(3):
    zero[i,:] = P[i,:]

Insert zero rows and columns at the same time at specific indices instead of at the end

I have a 2D array (a confusion matrix), for example (3,3). The number in the array refers to the index into a set of labels.
I know that this array should actually be (5,5) instead of (3,3), for the 5 row and column labels. I can find the labels that have been "hit":
import numpy as np
x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
labels = ["a", "b", "c", "d", "e"]
missing_idxs = np.setdiff1d(np.arange(len(labels)), x)  # array([1, 4])
I know that the row and column for the missed index is all zero, so the output I want is this:
y = np.array([[3, 0, 0, 3, 0],
              [0, 0, 0, 0, 0],  # <- inserted row at index 1, all zeros
              [0, 0, 2, 0, 0],
              [2, 0, 3, 3, 0],
              [0, 0, 0, 0, 0]]) # <- inserted row at index 4, all zeros
#                 ^        ^
#                 |        |
#     inserted columns at index 1 and 4, all zeros
I can do that with multiple calls to np.insert in a loop over all missing indices:
def insert_rows_columns_at_slow(arr, indices):
    result = arr.copy()
    for idx in indices:
        result = np.insert(result, idx, np.zeros(result.shape[1]), 0)
        result = np.insert(result, idx, np.zeros(result.shape[0]), 1)
    return result
However, my real array is much bigger, and there may be many more missing indices. Since np.insert re-allocates every time, this is not very efficient.
How can I achieve the same result, but in a more efficient, vectorized way? Bonus points if it works in more than 2 dimensions.
Just another option:
Instead of using the missing indices, use the non-missing indices:
non_missing_idxs = np.intersect1d(np.arange(len(labels)), x)  # array([0, 2, 3])
y = np.zeros((5,5))
y[non_missing_idxs[:,None], non_missing_idxs] = x
output:
array([[3., 0., 0., 3., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 2., 0., 0.],
       [2., 0., 3., 3., 0.],
       [0., 0., 0., 0., 0.]])
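Note that the non_missing_idxs[:,None], non_missing_idxs pair forms an outer (mesh) index through broadcasting; an equivalent spelling uses np.ix_:

y[np.ix_(non_missing_idxs, non_missing_idxs)] = x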
You can do this by pre-allocating the full resulting array and filling the rows and columns with the old array. This works even in multiple dimensions, and the dimension sizes don't have to match:
def insert_at(arr, output_size, indices):
    """
    Insert zeros at specific indices over whole dimensions, e.g. rows and/or columns and/or channels.
    You need to specify indices for each dimension, or leave a dimension untouched by specifying
    `...` for it. The following assertion should hold:
    `assert len(output_size) == len(indices) == len(arr.shape)`

    :param arr: The array to insert zeros into
    :param output_size: The size of the array after insertion is completed
    :param indices: The indices where zeros should be inserted, per dimension. For each dimension, you can
        specify: - an int
                 - a tuple of ints
                 - a generator yielding ints (such as `range`)
                 - Ellipsis (=...)
    :return: An array of shape `output_size` with the content of arr and zeros inserted at the given indices.
    """
    # assert len(output_size) == len(indices) == len(arr.shape)
    result = np.zeros(output_size)
    existing_indices = [np.setdiff1d(np.arange(axis_size), axis_indices, assume_unique=True)
                        for axis_size, axis_indices in zip(output_size, indices)]
    result[np.ix_(*existing_indices)] = arr
    return result
For your use-case, you can use it like this:
def fill_by_label(arr, labels):
    # If this is your only use-case, you can make it more efficient
    # by not computing the missing indices first, just to compute
    # the existing indices again
    missing_idxs = np.setdiff1d(np.arange(len(labels)), arr)
    return insert_at(arr, output_size=(len(labels), len(labels)),
                     indices=(missing_idxs, missing_idxs))
x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
labels = ["a", "b", "c", "d", "e"]
missing_idxs = np.setdiff1d(np.arange(len(labels)), x)
print(fill_by_label(x, labels))
>> [[3. 0. 0. 3. 0.]
    [0. 0. 0. 0. 0.]
    [0. 0. 2. 0. 0.]
    [2. 0. 3. 3. 0.]
    [0. 0. 0. 0. 0.]]
But this is very flexible. You can use it for zero padding:
def zero_pad(arr):
    out_size = np.array(arr.shape) + 2
    indices = (0, out_size[0] - 1), (0, out_size[1] - 1)
    return insert_at(arr, output_size=out_size,
                     indices=indices)
print(zero_pad(x))
>> [[0. 0. 0. 0. 0.]
    [0. 3. 0. 3. 0.]
    [0. 0. 2. 0. 0.]
    [0. 2. 3. 3. 0.]
    [0. 0. 0. 0. 0.]]
It also works with non-quadratic inputs and outputs:
x = np.ones((3, 4))
print(insert_at(x, (4, 5), (2, 3)))
>> [[1. 1. 1. 0. 1.]
    [1. 1. 1. 0. 1.]
    [0. 0. 0. 0. 0.]
    [1. 1. 1. 0. 1.]]
With different number of insertions per dimension:
x = np.ones((3, 4))
print(insert_at(x, (4, 6), (1, (2, 4))))
>> [[1. 1. 0. 1. 0. 1.]
    [0. 0. 0. 0. 0. 0.]
    [1. 1. 0. 1. 0. 1.]
    [1. 1. 0. 1. 0. 1.]]
You can use range (or other generators) instead of enumerating every index:
x = np.ones((3, 4))
print(insert_at(x, (4, 6), (1, range(2, 4))))
>> [[1. 1. 0. 0. 1. 1.]
    [0. 0. 0. 0. 0. 0.]
    [1. 1. 0. 0. 1. 1.]
    [1. 1. 0. 0. 1. 1.]]
It works with arbitrary dimensions (as long as you specify indices for every dimension)[1]:
x = np.ones((2, 2, 2))
print(insert_at(x, (3, 3, 3), (0, 0, 0)))
>>> [[[0. 0. 0.]
      [0. 0. 0.]
      [0. 0. 0.]]

     [[0. 0. 0.]
      [0. 1. 1.]
      [0. 1. 1.]]

     [[0. 0. 0.]
      [0. 1. 1.]
      [0. 1. 1.]]]
You can use Ellipsis (=...) to indicate that you don't want to change a dimension[1][2]:
x = np.ones((2, 2))
print(insert_at(x, (2, 4), (..., (0, 1))))
>> [[0. 0. 1. 1.]
    [0. 0. 1. 1.]]
[1]: You could automatically detect this based on arr.shape and output_size, and fill it with ... as needed, but I'll leave that up to you if you need it. If you wanted to, you could probably get rid of the output_size parameter instead, but then it gets trickier with passing in generators.
[2]: This is somewhat different from the normal numpy ... semantics, as you need to specify ... for every dimension that you want to keep, i.e. the following does NOT work:
x = np.ones((2, 2, 2))
print(insert_at(x, (2, 2, 3), (..., 0)))
For timing, I ran the insertion of 10 rows and columns into a 90x90 array 100000 times; these are the results:
x = np.random.random(size=(90, 90))
indices = np.arange(10) * 10
def measure_time_fast():
    insert_at(x, (100, 100), (indices, indices))

def measure_time_slow():
    insert_rows_columns_at_slow(x, indices)

if __name__ == '__main__':
    import timeit
    for speed in ("fast", "slow"):
        times = timeit.repeat(f"measure_time_{speed}()", setup=f"from __main__ import measure_time_{speed}", repeat=10, number=10000)
        print(f"Min: {np.min(times) / 10000}, Max: {np.max(times) / 10000}, Mean: {np.mean(times) / 10000} seconds per call")
For the fast version:
Min: 7.336409069976071e-05, Max: 7.7440657400075e-05, Mean: 7.520040466995852e-05 seconds per call
That is about 75 microseconds.
For your slow version:
Min: 0.00028272533010022016, Max: 0.0002923079213000165, Mean: 0.00028581595062998535 seconds per call
That is about 300 microseconds.
The difference will be greater, the bigger the arrays get. E.g. for inserting 100 rows and columns into a 900x900 array, these are the results (ran only 1000 times):
Fast version:
Min: 0.00022916630539984907, Max: 0.0022916630539984908, Mean: 0.0022916630539984908 seconds per call
Slow version:
Min: 0.013766934227399906, Max: 0.13766934227399907, Mean: 0.13766934227399907 seconds per call

Efficiently index 2d numpy array using two 1d arrays

I have a large 2d numpy array and two 1d arrays that represent x/y indexes within the 2d array. I want to use these 1d arrays to perform an operation on the 2d array.
I can do this with a for loop, but it's very slow when working on a large array. Is there a faster way? I tried using the 1d arrays simply as indexes but that didn't work. See this example:
import numpy as np
# Two example 2d arrays
cnt_a = np.zeros((4,4))
cnt_b = np.zeros((4,4))
# 1d arrays holding x and y indices
xpos = [0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3]
ypos = [3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0]
# This method works, but is very slow for a large array
for i in range(0,len(xpos)):
    cnt_a[xpos[i],ypos[i]] = cnt_a[xpos[i],ypos[i]] + 1
# This method is fast, but gives incorrect answer
cnt_b[xpos,ypos] = cnt_b[xpos,ypos]+1
# Print the results
print 'Good:'
print cnt_a
print ''
print 'Bad:'
print cnt_b
The output from this is:
Good:
[[ 2.  1.  2.  1.]
 [ 0.  3.  1.  2.]
 [ 1.  1.  1.  1.]
 [ 1.  0.  0.  0.]]
Bad:
[[ 1.  1.  1.  1.]
 [ 0.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  0.  0.  0.]]
For the cnt_b array, numpy is obviously not accumulating correctly, but I'm unsure how to fix this without resorting to the (very inefficient) for loop used to calculate cnt_a.
Another approach, using 1D indexing (suggested by @Shai), extended to answer the actual question:
>>> out = np.zeros((4, 4))
>>> idx = np.ravel_multi_index((xpos, ypos), out.shape) # extract 1D indexes
>>> x = np.bincount(idx, minlength=out.size)
>>> out.flat += x
np.bincount calculates how many times each index is present in (xpos, ypos) and stores the counts in x.
Or, as suggested by @Divakar:
>>> out.flat += np.bincount(idx, minlength=out.size)
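Putting the pieces together as a self-contained sketch, using the xpos/ypos from the question:

import numpy as np

xpos = np.array([0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3])
ypos = np.array([3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0])

out = np.zeros((4, 4))
idx = np.ravel_multi_index((xpos, ypos), out.shape)  # 1D linear indices
out.flat += np.bincount(idx, minlength=out.size)     # counts repeated pairs correctly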
We could compute the linear indices, then accumulate into zeros-initialized output array with np.add.at. Thus, with xpos and ypos as arrays, here's one implementation -
m,n = xpos.max()+1, ypos.max()+1
out = np.zeros((m,n),dtype=int)
np.add.at(out.ravel(), xpos*n+ypos, 1)
Sample run -
In [95]: # 1d arrays holding x and y indices
    ...: xpos = np.array([0,0,1,2,1,2,1,0,0,0,0,1,1,1,2,2,3])
    ...: ypos = np.array([3,2,1,1,3,0,1,0,0,1,2,1,2,3,3,2,0])
    ...:
In [96]: cnt_a = np.zeros((4,4))
In [97]: # This method works, but is very slow for a large array
    ...: for i in range(0,len(xpos)):
    ...:     cnt_a[xpos[i],ypos[i]] = cnt_a[xpos[i],ypos[i]] + 1
    ...:
In [98]: m,n = xpos.max()+1, ypos.max()+1
    ...: out = np.zeros((m,n),dtype=int)
    ...: np.add.at(out.ravel(), xpos*n+ypos, 1)
    ...:
In [99]: cnt_a
Out[99]:
array([[ 2.,  1.,  2.,  1.],
       [ 0.,  3.,  1.,  2.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  0.,  0.,  0.]])
In [100]: out
Out[100]:
array([[2, 1, 2, 1],
       [0, 3, 1, 2],
       [1, 1, 1, 1],
       [1, 0, 0, 0]])
You can iterate over both lists at once and increment for each pair (if you are not used to it, zip can combine lists):
for x, y in zip(xpos, ypos):
    cnt_b[x][y] += 1
But this will be about the same speed as your solution A.
If your lists xpos/ypos are of length n, I don't see how you can update your matrix in less than O(n), since you'll have to check each pair one way or another.
Other solution: you could count (possibly with collections.Counter) the identical index pairs (e.g. (0, 3), etc.) and update the matrix with the count values. But I doubt it would be much faster, since the time gained on updating the matrix would be lost on counting multiple occurrences.
Maybe I am totally wrong though, in which case I'd be curious to see a sub-O(n) answer.
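For what it's worth, a minimal sketch of that Counter idea (my addition, using the question's xpos/ypos lists):

from collections import Counter
import numpy as np

cnt_c = np.zeros((4, 4))
for (x, y), n in Counter(zip(xpos, ypos)).items():
    cnt_c[x, y] = n  # each distinct (x, y) pair is written once, with its count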
I think you are looking for the ravel_multi_index function:
lidx = np.ravel_multi_index((xpos, ypos), cnt_a.shape)
converts the (xpos, ypos) pairs to "flattened" 1D indices into cnt_a and cnt_b:
np.add.at(cnt_b.ravel(), lidx, 1)  # ravel() is a view, so cnt_b is updated in place

numpy classification comparison with 3d array

I'm trying to do some basic classification of numpy arrays...
I want to compare a 2d array against a 3d array, along the 3rd dimension, and make a classification based on the corresponding z-axis values.
so given 3 arrays that are stacked into a 3d array:
import numpy as np
a1 = np.array([[1,1,1],[1,1,1],[1,1,1]])
a2 = np.array([[3,3,3],[3,3,3],[3,3,3]])
a3 = np.array([[5,5,5],[5,5,5],[5,5,5]])
a3d = np.dstack((a1,a2,a3))
and another 2d array
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
I want to be able to compare a2d against a3d, and return a 2d array of which level of a3d is closest. (or I suppose any custom function that can compare each value along the z-axis, and return a value base on that comparison.)
EDIT
I modified my arrays to more closely match my data. a1 would be the minimum values, a2 the average values, and a3 the maximum values. So I want to output whether each a2d value is closer to a1 (classed "1"), a2 (classed "2"), or a3 (classed "3"). I'm doing this as a 3d array because in the real data, it won't be a simple 3-array choice, but for SO purposes it helps to keep it simple. We can assume that in the case of a tie, we'll take the lower, so 2 would be classed as level "1", 4 as level "2".
You can use the following list comprehension:
>>> [sum(sum(abs(i-j)) for i,j in z) for z in [zip(i,a2d) for i in a3d]]
[30.0, 22.5, 30.0]
In the preceding code I create the following list with zip, pairing each sub-array of your 3d array with a2d; then all you need is to take the sum of the absolute differences for each pair, and then sum those again:
>>> [zip(i,a2d) for i in a3d]
[[(array([ 1., 3., 1.]), array([1, 2, 1])), (array([ 2., 2., 1.]), array([5, 5, 4])), (array([ 3., 1., 1.]), array([9, 8, 8]))], [(array([ 4., 6., 4.]), array([1, 2, 1])), (array([ 5. , 6.5, 4. ]), array([5, 5, 4])), (array([ 6., 4., 4.]), array([9, 8, 8]))], [(array([ 7., 9., 7.]), array([1, 2, 1])), (array([ 8., 8., 7.]), array([5, 5, 4])), (array([ 9., 7., 7.]), array([9, 8, 8]))]]
Then for all of your sub-arrays you'll have the following list:
[30.0, 22.5, 30.0]
which shows, for each sub-array, its level of difference from the 2d array. Then you can get the closest sub-array from a3d like the following:
>>> a3d[l.index(min(l))]
array([[ 4. ,  6. ,  4. ],
       [ 5. ,  6.5,  4. ],
       [ 6. ,  4. ,  4. ]])
Also you can put it in a function:
>>> def find_nearest(sub, main):
...     l = [sum(sum(abs(i-j)) for i,j in z) for z in [zip(i,sub) for i in main]]
...     return main[l.index(min(l))]
...
>>> find_nearest(a2d,a3d)
array([[ 4. ,  6. ,  4. ],
       [ 5. ,  6.5,  4. ],
       [ 6. ,  4. ,  4. ]])
You might consider a different approach using numpy.vectorize, which lets you conveniently apply a python function to each element of your array.
In this case, your python function could just classify each pixel with whatever breaks you define:
import numpy as np
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
def classify(x):
    if x >= 4:
        return 3
    elif x >= 2:
        return 2
    elif x > 0:
        return 1
    else:
        return 0
vclassify = np.vectorize(classify)
result = vclassify(a2d)
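For the sample a2d above, result should come out as:

print(result)
# [[1 2 3]
#  [3 3 2]
#  [2 2 2]]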
Thanks to @perrygeo and @Kasra - they got me thinking in a good direction.
Since I want a classification of the closest 3d array's z value, I couldn't do simple math - I needed the (z)index of the closest value.
I did it by enumerating both axes of the 2d array, and doing a proximity comparison against the corresponding z-stack of the 3d array.
There might be a way to do this without iterating the 2d array, but at least I'm avoiding iterating the 3d.
import numpy as np
a1 = np.array([[1,1,1],[1,1,1],[1,1,1]])
a2 = np.array([[3,3,3],[3,3,3],[3,3,3]])
a3 = np.array([[5,5,5],[5,5,5],[5,5,5]])
a3d = np.dstack((a1,a2,a3))
a2d = np.array([[1,2,4],[5,5,2],[2,3,3]])
classOut = np.empty_like(a2d)
def find_nearest_idx(array, value):
    idx = (np.abs(array - value)).argmin()
    return idx

# enumerate to get indices
for i, a in enumerate(a2d):
    for ii, v in enumerate(a):
        valStack = a3d[i, ii]
        nearest = find_nearest_idx(valStack, v)
        classOut[i, ii] = nearest
print classOut
which gets me
[[0 0 1]
 [2 2 0]
 [0 1 1]]
This tells me that (for example) a2d[0,0] is closest to the 0-index of a3d[0,0], which in my case means it is closest to the min value for that 2d position. a2d[1,1] is closest to the 2-index, which in my case means closer to the max value for that 2d position.
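As an aside, here is a fully vectorized sketch (my addition, not from the answers above) that avoids iterating the 2d array as well. np.argmin resolves ties toward the lower index, which matches the tie rule in the question:

# broadcast a2d along a new z-axis and take the index of the nearest level
classOut = np.abs(a3d - a2d[:, :, None]).argmin(axis=2)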

Scipy Sparse Matrix special subtraction

I'm doing a project and I'm doing a lot of matrix computation in it.
I'm looking for a smart way to speed up my code. In my project, I'm dealing with a sparse matrix of size 100Mx1M with around 10M non-zero values. The example below is just to see my point.
Let's say I have:
A vector v of size (2)
A vector c of size (3)
A sparse matrix X of size (2,3)
import numpy as np
import scipy.sparse
from scipy.sparse import coo_matrix

v = np.asarray([10, 20])
c = np.asarray([ 2, 3, 4])
data = np.array([1, 1, 1, 1])
row = np.array([0, 0, 1, 1])
col = np.array([1, 2, 0, 2])
X = coo_matrix((data,(row,col)), shape=(2,3))
X.todense()
# matrix([[0, 1, 1],
# [1, 0, 1]])
Currently I'm doing:
result = np.zeros_like(v)
d = scipy.sparse.lil_matrix((v.shape[0], v.shape[0]))
d.setdiag(v)
tmp = d * X
print tmp.todense()
# matrix([[  0.,  10.,  10.],
#         [ 20.,   0.,  20.]])
# At this point tmp is a csr sparse matrix
for i in range(tmp.shape[0]):
    x_i = tmp.getrow(i)
    result += x_i.data * (c[x_i.indices] - x_i.data)
    # I only want to do the subtraction on non-zero elements
print result
# array([-430, -380])
And my problem is the for loop and especially the subtraction.
I would like to find a way to vectorize this operation by subtracting only on the non-zero elements.
Something to directly get the sparse matrix of the subtraction:
matrix([[  0.,  -7.,  -6.],
        [-18.,   0., -16.]])
Is there a way to do this smartly ?
You don't need to loop over the rows to do what you are already doing. And you can use a similar trick to perform the multiplication of the rows by the first vector:
import scipy.sparse as sps

# make sure X is in CSR format so that X.indptr is available
X = X.tocsr()
# number of nonzero entries per row of X
nnz_per_row = np.diff(X.indptr)
# multiply every row by the corresponding entry of v
# You could do this in-place as:
# X.data *= np.repeat(v, nnz_per_row)
Y = sps.csr_matrix((X.data * np.repeat(v, nnz_per_row), X.indices, X.indptr),
                   shape=X.shape)
# subtract from the non-zero entries the corresponding column value in c...
Y.data -= np.take(c, Y.indices)
# ...and multiply by -1 to get the value you are after
Y.data *= -1
To see that it works, set up some dummy data
rows, cols = 3, 5
v = np.random.rand(rows)
c = np.random.rand(cols)
X = sps.rand(rows, cols, density=0.5, format='csr')
and after running the code above:
>>> x = X.toarray()
>>> mask = x == 0
>>> x *= v[:, np.newaxis]
>>> x = c - x
>>> x[mask] = 0
>>> x
array([[ 0.79935123,  0.        ,  0.        , -0.0097763 ,  0.59901243],
       [ 0.7522559 ,  0.        ,  0.67510109,  0.        ,  0.36240006],
       [ 0.        ,  0.        ,  0.72370725,  0.        ,  0.        ]])
>>> Y.toarray()
array([[ 0.79935123,  0.        ,  0.        , -0.0097763 ,  0.59901243],
       [ 0.7522559 ,  0.        ,  0.67510109,  0.        ,  0.36240006],
       [ 0.        ,  0.        ,  0.72370725,  0.        ,  0.        ]])
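As an aside, the row scaling can also be written with a sparse diagonal matrix, much like the question's setdiag approach; a sketch assuming scipy.sparse.diags is available in your scipy version:

# equivalent to the explicit csr_matrix construction above
Y2 = (sps.diags(v) * X).tocsr()
Y2.data = np.take(c, Y2.indices) - Y2.data  # c_j - v_i*x_ij on the non-zeros only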
The way you are accumulating your result requires that there are the same number of non-zero entries in every row, which seems a pretty weird thing to do. Are you sure that is what you are after? If that's really what you want, you could get that value with something like:
result = np.sum(Y.data.reshape(Y.shape[0], -1), axis=0)
but I have trouble believing that is really what you are after...
