Efficiently applying a threshold function to SciPy sparse csr_matrix - python

I have a SciPy csr_matrix (a vector in this case) of 1 column and x rows. In it are float values which I need to convert to the discrete class labels -1, 0 and 1. This should be done with a threshold function which maps the float values to one of these 3 class labels.
Is there no way other than iterating over the elements as described in Iterating through a scipy.sparse vector (or matrix)? I would love to have some elegant way to just somehow map(thresholdfunc()) on all elements.
Note that while it is of type csr_matrix, it isn't actually sparse as it's just the return of another function where a sparse matrix was involved.

If you have an array, you can discretize based on some condition with the np.where function. e.g.:
>>> import numpy as np
>>> x = np.arange(10)
>>> np.where(x < 5, 0, 1)
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
The syntax is np.where(BOOLEAN_ARRAY, VALUE_IF_TRUE, VALUE_IF_FALSE).
You can chain together two where statements to have multiple conditions:
>>> np.where(x < 3, -1, np.where(x > 6, 0, 1))
array([-1, -1, -1, 1, 1, 1, 1, 0, 0, 0])
To apply this to your data in the CSR or CSC sparse matrix, you can use the .data attribute, which gives you access to the internal array containing all the nonzero entries in the sparse matrix. For example:
>>> from scipy import sparse
>>> mat = sparse.csr_matrix(x.reshape(10, 1))
>>> mat.data = np.where(mat.data < 3, -1, np.where(mat.data > 6, 0, 1))
>>> mat.toarray()
array([[ 0],
       [-1],
       [-1],
       [ 1],
       [ 1],
       [ 1],
       [ 1],
       [ 0],
       [ 0],
       [ 0]])
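One caveat worth noting: assigning through .data only touches explicitly stored entries, so implicit zeros keep the value 0 — which happens to coincide with the label you want for them here. A minimal sketch wrapping the idea in a reusable helper (threshold_csr and the cutoff parameters are hypothetical names, not part of the original answer):

```python
import numpy as np
from scipy import sparse

def threshold_csr(mat, low, high):
    """Map the stored values of a CSR matrix to class labels -1/0/1.

    Values < low become -1, values > high become 0, the rest become 1.
    Implicit (unstored) zeros are untouched and therefore read as 0.
    """
    mat = mat.copy()
    mat.data = np.where(mat.data < low, -1,
                        np.where(mat.data > high, 0, 1))
    return mat

x = np.arange(10)
mat = threshold_csr(sparse.csr_matrix(x.reshape(10, 1)), low=3, high=6)
print(mat.toarray().ravel())   # [ 0 -1 -1  1  1  1  1  0  0  0]
```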


Subtraction of different sizes numpy arrays

I have asked a previous question, but I think my example was not clear. I am still trying to subtract two different sizes of numpy arrays from a list of numpy arrays. For example:
####Data####
### For same size numpy arrays the subtraction works fine!!!!###
easy_data= [[1,2,3],[2,2,2]],[[1,2,3],[1,2,5]]
d = [np.array(i) for i in easy_data] # List of numpy arrays
res = d[1] - d[0]
array([[ 0,  0,  0],
       [-1,  0,  3]])
##### Current Issue ####
data = [[1,2,3],[2,2,2]],[[1,2,3],[1,2,5],[1,1,1]]
d = [np.array(i) for i in data]
res = d[1] - d[0] #### As the sizes are different I can't subtract them ###
Desired Output:
array([[ 0,  0,  0],
       [-1,  0,  3],
       [ 1,  1,  1]])
I am still getting up to speed with numpy arrays, but I can't figure out how to make this work. Can anybody help me?
It's easiest to operate on a slice. If you do not want to overwrite the original array, work on a copy:
>>> res=d[1].copy()
>>> res[:d[0].shape[0]]-=d[0]
>>> res
array([[ 0,  0,  0],
       [-1,  0,  3],
       [ 1,  1,  1]])
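If you prefer not to mutate through a slice, another option — a sketch, assuming the arrays only differ in row count and the smaller one comes first — is to zero-pad the smaller array with np.pad and subtract whole arrays at once:

```python
import numpy as np

d = [np.array([[1, 2, 3], [2, 2, 2]]),
     np.array([[1, 2, 3], [1, 2, 5], [1, 1, 1]])]

# Zero-pad the smaller array so both have the same row count,
# then a single whole-array subtraction works.
pad_rows = d[1].shape[0] - d[0].shape[0]
padded = np.pad(d[0], ((0, pad_rows), (0, 0)), mode='constant')
res = d[1] - padded
print(res)
# [[ 0  0  0]
#  [-1  0  3]
#  [ 1  1  1]]
```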

Normalize scipy sparse matrix with number of nonzero elements

I want to divide each row of the csr_matrix by the number of non zero entries in that row.
For example : Consider a csr_matrix A:
A = [[6, 0, 0, 4, 0], [3, 18, 0, 9, 0]]
Result = [[3, 0, 0, 2, 0], [1, 6, 0, 3, 0]]
What's the shortest and efficient way to do it ?
Get the per-row nonzero counts with the getnnz method, then replicate them and divide in place into the flattened view of the stored values exposed by the data attribute:
s = A.getnnz(axis=1)
A.data /= np.repeat(s, s)
Inspired by Approach #2 from the answer to Row Division in Scipy Sparse Matrix.
Sample run -
In [15]: from scipy.sparse import csr_matrix
In [16]: A = csr_matrix([[6, 0, 0, 4, 0], [3, 18, 0, 9, 0]])
In [18]: s = A.getnnz(axis=1)
...: A.data /= np.repeat(s, s)
In [19]: A.toarray()
Out[19]:
array([[3, 0, 0, 2, 0],
       [1, 6, 0, 3, 0]])
Note: with an integer matrix under Python 3, in-place true division raises an error because the float result cannot be cast back to the integer dtype; use floor division instead (or convert the matrix to float first):
A.data //= np.repeat(s, s)
Divakar's answer works in place; the approach below creates a new array instead.
from scipy import sparse
A = sparse.csr_matrix([[6, 0, 0, 4, 0], [3, 18, 0, 9, 0]])
A.multiply(1.0/(A != 0).sum(axis=1))
This multiplies each row by the reciprocal of its count of nonzero entries. Note that you may want to guard against division by zero for rows that are entirely zero.
As Divakar pointed out, 1.0 rather than 1 is needed in A.multiply(1.0/...) to avoid integer division under Python 2.
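Another option that builds on the same getnnz counts is to scale the rows with a sparse diagonal matrix. A sketch, assuming no row is entirely zero (an all-zero row would make the reciprocal undefined):

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix([[6., 0., 0., 4., 0.],
                       [3., 18., 0., 9., 0.]])

# Left-multiplying by a diagonal matrix of reciprocal nonzero
# counts scales row i by 1/nnz(row i).
s = A.getnnz(axis=1)
D = sparse.diags(1.0 / s)
result = (D @ A).toarray()
print(result)
# [[3. 0. 0. 2. 0.]
#  [1. 6. 0. 3. 0.]]
```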

Doubling the matrix in numpy

Let's say I have a matrix in of size m x n.
I am trying to create a matrix out of size 2m x 2n such that
the out matrix contains essentially the same elements as the in matrix,
except that the values are alternated with zeros.
For example:
in = [[1, 2, 3],
      [4, 5, 6]]

out = [[1, 0, 2, 0, 3, 0],
       [0, 0, 0, 0, 0, 0],
       [4, 0, 5, 0, 6, 0],
       [0, 0, 0, 0, 0, 0]]
Is there a vectorized way to achieve this?
Use NumPy:
import numpy as np
Your data:
a = np.array([[1, 2, 3],
              [4, 5, 6]])
Create an array twice the size along both dimensions:
b = np.zeros([x * 2 for x in a.shape], dtype=a.dtype)
Assign the values of a to every second position of b, along both dimensions:
b[::2,::2] = a
The result:
>>> b
array([[1, 0, 2, 0, 3, 0],
       [0, 0, 0, 0, 0, 0],
       [4, 0, 5, 0, 6, 0],
       [0, 0, 0, 0, 0, 0]])
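An equivalent one-liner, for what it's worth, uses np.kron: the Kronecker product with a 2x2 block holding a single 1 in the top-left corner places each element of a in the corner of its own zero block:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Each element a[i, j] is multiplied into its own 2x2 block,
# leaving zeros everywhere except the top-left position.
out = np.kron(a, np.array([[1, 0], [0, 0]]))
print(out)
# [[1 0 2 0 3 0]
#  [0 0 0 0 0 0]
#  [4 0 5 0 6 0]
#  [0 0 0 0 0 0]]
```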

numpy - Sample repeatedly from matrix using np.random.choice

I have a 2D array, where each row is a direction:
directions = np.array([[ 1,  0],
                       [-1,  0],
                       [ 0,  1],
                       [ 0, -1]])
I want to sample several rows from this, and then do a cumsum (to simulate a random walk). The best approach would be to use np.random.choice. For instance, to sample 10 steps, do this:
np.random.choice(directions, size=(10,1))
# intended result: a 2D array of shape (10, 2), where each row
# is randomly sampled from the rows of directions
When I run this, I get the error:
ValueError: a must be 1-dimensional
Now, I realize I have a 2D array, but shouldn't it act like a 1D-array of 1D arrays in this context? Isn't this how the broadcasting rules work?
So, my question is: how do I make this 2D array act as a 1D array of 1D arrays (i.e., of the 2-element rows)?
The easiest thing would probably be to use indexing. The first argument to choice is described as follows:
If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if a was np.arange(n)
You can do this:
directions = np.array([[ 1,  0],
                       [-1,  0],
                       [ 0,  1],
                       [ 0, -1]])
sampleInd = np.random.choice(directions.shape[0], size=(10,))
sample = directions[sampleInd]
Note that if you want the result to be a 2D array, specify the choice output as a 1D (10,) vector rather than (10, 1), which is 2D.
Now the final destination of your random walk is
destination = np.sum(sample, axis = 0)
The argument axis = 0 is necessary because otherwise sum will add up all the elements in the 2D sample array rather than adding each column separately.
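Putting the pieces together, the full random walk the question describes might look like this (the step count of 10 is arbitrary):

```python
import numpy as np

directions = np.array([[ 1,  0],
                       [-1,  0],
                       [ 0,  1],
                       [ 0, -1]])

# Sample 10 row indices, index into directions, then take a
# running sum to get the position after each step.
steps = directions[np.random.choice(directions.shape[0], size=10)]
walk = np.cumsum(steps, axis=0)   # shape (10, 2)
print(walk[-1])                   # final position, same as steps.sum(axis=0)
```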
An alternative to numpy.random.choice is to use random.choice in the standard library.
In [1]: import numpy as np
In [2]: directions = np.array([[1,0],[-1,0],[0,1],[0,-1]])
In [3]: directions
Out[3]:
array([[ 1,  0],
       [-1,  0],
       [ 0,  1],
       [ 0, -1]])
In [4]: from numpy.random import choice
In [5]: choice(directions)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-dd768952d6d1> in <module>()
----> 1 choice(directions)
mtrand.pyx in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:10365)()
ValueError: a must be 1-dimensional
In [6]: import random
In [7]: random.choice(directions)
Out[7]: array([ 0, -1])
In [8]: choices = []
In [9]: for i in range(10):
...: choices.append(random.choice(directions))
...:
In [10]: choices
Out[10]:
[array([1, 0]),
array([ 0, -1]),
array([ 0, -1]),
array([-1, 0]),
array([1, 0]),
array([ 0, -1]),
array([ 0, -1]),
array([ 0, -1]),
array([-1, 0]),
array([1, 0])]
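On NumPy 1.17+ the newer Generator API sidesteps the problem entirely: Generator.choice accepts a multi-dimensional array and samples along a given axis, so rows can be drawn directly:

```python
import numpy as np

directions = np.array([[ 1,  0],
                       [-1,  0],
                       [ 0,  1],
                       [ 0, -1]])

rng = np.random.default_rng()
# Generator.choice samples along axis=0 by default, so each of
# the 10 draws is a whole row of directions.
sample = rng.choice(directions, size=10, axis=0)
print(sample.shape)   # (10, 2)
```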

Joining two 2D numpy arrays into a single 2D array of 2-tuples

I have two 2D numpy arrays like this, representing the x/y distances between three points. I need the x/y distances as tuples in a single array.
So from:
x_dists = array([[ 0, -1, -2],
                 [ 1,  0, -1],
                 [ 2,  1,  0]])
y_dists = array([[ 0, -1, -2],
                 [ 1,  0, -1],
                 [ 2,  1,  0]])
I need:
dists = array([[[ 0,  0], [-1, -1], [-2, -2]],
               [[ 1,  1], [ 0,  0], [-1, -1]],
               [[ 2,  2], [ 1,  1], [ 0,  0]]])
I've tried using various permutations of dstack/hstack/vstack/concatenate, but none of them seem to do what I want. The actual arrays in code are liable to be gigantic, so iterating over the elements in python and doing the rearrangement "manually" isn't an option speed-wise.
Edit:
This is what I came up with in the end: https://gist.github.com/807656
import numpy as np
dists = np.vstack(([x_dists.T], [y_dists.T])).T
This returns dists as you wanted. Strictly speaking the result is not "a single 2D array of 2-tuples" but an ordinary 3D array whose third axis stacks the two original arrays:
dists.shape # (3, 3, 2)
If you really need an array of 2-tuples rather than a 3D array, a record array with named fields does that:
numpy.rec.fromarrays([x_dists, y_dists], names='x,y')
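A quick sketch of the record-array route, showing the named-field access it buys you (the sample data is the same as in the question):

```python
import numpy as np

x_dists = np.array([[ 0, -1, -2],
                    [ 1,  0, -1],
                    [ 2,  1,  0]])
y_dists = x_dists.copy()

# The record array has shape (3, 3); each element is an (x, y)
# record, and each component is also reachable by field name.
dists = np.rec.fromarrays([x_dists, y_dists], names='x,y')
print(dists.shape)     # (3, 3)
print(tuple(dists[0, 1]))   # (-1, -1)
print(dists.x[2, 0])   # 2
```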
