list of numpy vectors to sparse array - python

I have a list of numpy vectors of the format:
[array([[-0.36314615, 0.80562619, -0.82777381, ..., 2.00876354,2.08571887, -1.24526026]]),
array([[ 0.9766923 , -0.05725135, -0.38505339, ..., 0.12187988,-0.83129255, 0.32003683]]),
array([[-0.59539878, 2.27166874, 0.39192573, ..., -0.73741573,1.49082653, 1.42466276]])]
here, only 3 vectors in the list are shown. I have 100s..
The maximum number of elements in one vector is around 10 million
All the arrays in the list have unequal number of elements but the maximum number of elements is fixed.
Is it possible to create a sparse matrix using these vectors in python such that I have zeros in place of elements for the vectors which are smaller than the maximum size?

Try this:
from scipy import sparse
M = sparse.lil_matrix((num_of_vectors, max_vector_size))
for i,v in enumerate(vectors):
M[i, :v.size] = v
Then take a look at this page: http://docs.scipy.org/doc/scipy/reference/sparse.html
The lil_matrix format is good for constructing the matrix, but you'll want to convert it to a different format like csr_matrix before operating on them.

In this approach you replace the elements below your thresold by 0 and then create a sparse matrix out of them. I am suggesting the coo_matrix since it is the fastest to convert to the other types according to your purposes. Then you can scipy.sparse.vstack() them to build your matrix accounting all elements in the list:
import scipy.sparse as ss
import numpy as np
old_list = [np.random.random(100000) for i in range(5)]
threshold = 0.01
for a in old_list:
a[np.absolute(a) < threshold] = 0
old_list = [ss.coo_matrix(a) for a in old_list]
m = ss.vstack( old_list )

A little convoluted, but I would probably do it like this:
>>> import scipy.sparse as sps
>>> a = [np.arange(5), np.arange(7), np.arange(3)]
>>> lens = [len(j) for j in a]
>>> cols = np.concatenate([np.arange(j) for j in lens])
>>> rows = np.concatenate([np.repeat(j, len_) for j, len_ in enumerate(lens)])
>>> data = np.concatenate(a)
>>> b = sps.coo_matrix((data,(rows, cols)))
>>> b.toarray()
array([[0, 1, 2, 3, 4, 0, 0],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 0, 0, 0, 0]])

Related

Finding an index numpy python

Consider a NumPy array of shape (8, 8).
My Question: What is the index (x,y) of the 50th element?
Note: For counting the elements go row-wise.
Example, in array A, where A = [[1, 5, 9], [3, 0, 2]] the 5th element would be '0'.
Can someone explain how to find the general solution for this and, what would be the solution for this specific problem?
You can use unravel_index to find the coordinates corresponding to the index of the flattened array. Usually np.arrays start with index 0, you have to adjust for this.
import numpy as np
a = np.arange(64).reshape(8,8)
np.unravel_index(50-1, a.shape)
Out:
(6, 1)
In a NumPy array a of shape (r, c) (just like a list of lists), the n-th element is
a[(n-1) // c][(n-1) % c],
assuming that n starts from 1 as in your example.
It has nothing to do with r. Thus, when r = c = 8 and n = 50, the above formula is exactly
a[6][1].
Let me show more using your example:
from numpy import *
a = array([[1, 5, 9], [3, 0, 2]])
r = len(a)
c = len(a[0])
print(f'(r, c) = ({r}, {c})')
print(f'Shape: {a.shape}')
for n in range(1, r * c + 1):
print(f'Element {n}: {a[(n-1) // c][(n-1) % c]}')
Below is the result:
(r, c) = (2, 3)
Shape: (2, 3)
Element 1: 1
Element 2: 5
Element 3: 9
Element 4: 3
Element 5: 0
Element 6: 2
numpy.ndarray.faltten(a) returns a copy of the array a collapsed into one dimension. And please note that the counting starts from 0, therefore, in your example 0 is the 4th element and 1 is the 0th.
import numpy as np
arr = np.array([[1, 5, 9], [3, 0, 2]])
fourth_element = np.ndarray.flatten(arr)[4]
or
fourth_element = arr.flatten()[4]
the same for 8x8 matrix.
First need to create a 88 order 2d numpy array using np.array and range.Reshape created array as 88
In the output you check index of 50th element is [6,1]
import numpy as np
arr = np.array(range(1,(8*8)+1)).reshape(8,8)
print(arr[6,1])
output will be 50
or you can do it in generic way as well by the help of numpy where method.
import numpy as np
def getElementIndex(array: np.array, element):
elementIndex = np.where(array==element)
return f'[{elementIndex[0][0]},{elementIndex[1][0]}]'
def getXYOrderNumberArray(x:int, y:int):
return np.array(range(1,(x*y)+1)).reshape(x,y)
arr = getXYOrderNumberArray(8,8)
print(getElementIndex(arr,50))

What is a best way to intersect multiple arrays with numpy array?

Suppose I have an example of numpy array:
import numpy as np
X = np.array([2,5,0,4,3,1])
And I also have a list of arrays, like:
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]
I want to leave only these items of each list that are also in X. I expect also to do it in a most efficient/common way.
Solution I have tried so far:
Sort X using X.sort().
Find locations of items of each array in X using:
locations = [np.searchsorted(X, n) for n in A]
Leave only proper ones:
masks = [X[locations[i]] == A[i] for i in range(len(A))]
result = [A[i][masks[i]] for i in range(len(A))]
But it doesn't work because locations of third array is out of bounds:
locations = [array([0, 0, 2], dtype=int64), array([0, 1, 2, 3, 4, 5], dtype=int64), array([2, 5, 4, 6], dtype=int64)]
How to solve this issue?
Update
I ended up with idx[idx==len(Xs)] = 0 solution. I've also noticed two different approaches posted between the answers: transforming X into set vs np.sort. Both of them has plusses and minuses: set operations uses iterations which is quite slow in compare with numpy methods; however np.searchsorted speed increases logarithmically unlike acceses of set items which is instant. That why I decided to compare performance using data with huge sizes, especially 1 million items for X, A[0], A[1], A[2].
One idea would be less compute and minimal work when looping. So, here's one with those in mind -
a = np.concatenate(A)
m = np.isin(a,X)
l = np.array(list(map(len,A)))
a_m = a[m]
cut_idx = np.r_[0,l.cumsum()]
l_m = np.add.reduceat(m,cut_idx[:-1])
cl_m = np.r_[0,l_m.cumsum()]
out = [a_m[i:j] for (i,j) in zip(cl_m[:-1],cl_m[1:])]
Alternative #1 :
We can also use np.searchsorted to get the isin mask, like so -
Xs = np.sort(X)
idx = np.searchsorted(Xs,a)
idx[idx==len(Xs)] = 0
m = Xs[idx]==a
Another way with np.intersect1d
If you are looking for the most common/elegant one, think it would be with np.intersect1d -
In [43]: [np.intersect1d(X,A_i) for A_i in A]
Out[43]: [array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 4, 5])]
Solving your issue
You can also solve your out-of-bounds issue, with a simple fix -
for l in locations:
l[l==len(X)]=0
How about this, very simple and efficent:
import numpy as np
X = np.array([2,5,0,4,3,1])
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]
X_set = set(X)
A = [np.array([a for a in arr if a in X_set]) for arr in A]
#[array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 5, 4])]
According to the docs, set operations all have O(1) complexity, therefore the overall is O(N)

Aggregate elements based on position vector

I'm trying to vectorize a very simple operation but can't seem to figure out how.
Given a very large numerical vector (over 1M positions) and another array of size n with a given set of positions, I would like to get back a vector of size n with elements being the average of the values of the first vector as specified by the second
a = np.array([1,2,3,4,5,6,7])
b = np.array([[0,1],[2],[3,5],[4,6]])
c = [1.5,3,5,6]
I need to repeat this operation many times so performance is an issue.
Vanilla python solution:
import numpy as np
import time
a = np.array([1,2,3,4,5,6,7])
b = np.array([[0,1],[2],[3,5],[4,6]])
begin = time.time()
for i in range(100000):
c = []
for d in b:
c.append(np.mean(a[d]))
print(time.time() - begin, c)
# 3.7529971599578857 [1.5, 3.0, 5.0, 6.0]
I'm not sure if this is necessarily faster but you may as well try:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7])
b = np.array([[0, 1], [2], [3, 5], [4, 6]])
# Get the length of each subset of indices
lens = np.fromiter((len(bi) for bi in b), count=len(b), dtype=np.int32)
# Compute reduction indices
reduce_idx = np.roll(np.cumsum(lens), 1)
reduce_idx[0] = 0
# Make flattened array of index lists
idx = np.fromiter((i for bi in b for i in bi), count=lens.sum(), dtype=np.int32)
# Reorder according to indices
a2 = a[idx]
# Sum reordered array at reduction indices and divide by number of indices
c = np.add.reduceat(a2, reduce_idx) / lens
print(c)
# [1.5 3. 5. 6. ]

Sort invariant for numpy.argsort with multiple dimensions

numpy.argsort docs state
Returns:
index_array : ndarray, int
Array of indices that sort a along the specified axis. If a is one-dimensional, a[index_array] yields a sorted a.
How can I apply the result of numpy.argsort for a multidimensional array to get back a sorted array? (NOT just a 1-D or 2-D array; it could be an N-dimensional array where N is known only at runtime)
>>> import numpy as np
>>> np.random.seed(123)
>>> A = np.random.randn(3,2)
>>> A
array([[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471],
[-0.57860025, 1.65143654]])
>>> i=np.argsort(A,axis=-1)
>>> A[i]
array([[[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471]],
[[ 0.2829785 , -1.50629471],
[-1.0856306 , 0.99734545]],
[[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471]]])
For me it's not just a matter of using sort() instead; I have another array B and I want to order B using the results of np.argsort(A) along the appropriate axis. Consider the following example:
>>> A = np.array([[3,2,1],[4,0,6]])
>>> B = np.array([[3,1,4],[1,5,9]])
>>> i = np.argsort(A,axis=-1)
>>> BsortA = ???
# should result in [[4,1,3],[5,1,9]]
# so that corresponding elements of B and sort(A) stay together
It looks like this functionality is already an enhancement request in numpy.
The numpy issue #8708 has a sample implementation of take_along_axis that does what I need; I'm not sure if it's efficient for large arrays but it seems to work.
def take_along_axis(arr, ind, axis):
"""
... here means a "pack" of dimensions, possibly empty
arr: array_like of shape (A..., M, B...)
source array
ind: array_like of shape (A..., K..., B...)
indices to take along each 1d slice of `arr`
axis: int
index of the axis with dimension M
out: array_like of shape (A..., K..., B...)
out[a..., k..., b...] = arr[a..., inds[a..., k..., b...], b...]
"""
if axis < 0:
if axis >= -arr.ndim:
axis += arr.ndim
else:
raise IndexError('axis out of range')
ind_shape = (1,) * ind.ndim
ins_ndim = ind.ndim - (arr.ndim - 1) #inserted dimensions
dest_dims = list(range(axis)) + [None] + list(range(axis+ins_ndim, ind.ndim))
# could also call np.ix_ here with some dummy arguments, then throw those results away
inds = []
for dim, n in zip(dest_dims, arr.shape):
if dim is None:
inds.append(ind)
else:
ind_shape_dim = ind_shape[:dim] + (-1,) + ind_shape[dim+1:]
inds.append(np.arange(n).reshape(ind_shape_dim))
return arr[tuple(inds)]
which yields
>>> A = np.array([[3,2,1],[4,0,6]])
>>> B = np.array([[3,1,4],[1,5,9]])
>>> i = A.argsort(axis=-1)
>>> take_along_axis(A,i,axis=-1)
array([[1, 2, 3],
[0, 4, 6]])
>>> take_along_axis(B,i,axis=-1)
array([[4, 1, 3],
[5, 1, 9]])
This argsort produces a (3,2) array
In [453]: idx=np.argsort(A,axis=-1)
In [454]: idx
Out[454]:
array([[0, 1],
[1, 0],
[0, 1]], dtype=int32)
As you note applying this to A to get the equivalent of np.sort(A, axis=-1) isn't obvious. The iterative solution is sort each row (a 1d case) with:
In [459]: np.array([x[i] for i,x in zip(idx,A)])
Out[459]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
While probably not the fastest, it is probably the clearest solution, and a good starting point for conceptualizing a better solution.
The tuple(inds) from the take solution is:
(array([[0],
[1],
[2]]),
array([[0, 1],
[1, 0],
[0, 1]], dtype=int32))
In [470]: A[_]
Out[470]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
In other words:
In [472]: A[np.arange(3)[:,None], idx]
Out[472]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
The first part is what np.ix_ would construct, but it does not 'like' the 2d idx.
Looks like I explored this topic a couple of years ago
argsort for a multidimensional ndarray
a[np.arange(np.shape(a)[0])[:,np.newaxis], np.argsort(a)]
I tried to explain what is going on. The take function does the same sort of thing, but constructs the indexing tuple for a more general case (dimensions and axis). Generalizing to more dimensions, but still with axis=-1 should be easy.
For the first axis, A[np.argsort(A,axis=0),np.arange(2)] works.
We just need to use advanced-indexing to index along all axes with those indices array. We can use np.ogrid to create open grids of range arrays along all axes and then replace only for the input axis with the input indices. Finally, index into data array with those indices for the desired output. Thus, essentially, we would have -
# Inputs : arr, ind, axis
idx = np.ogrid[tuple(map(slice, ind.shape))]
idx[axis] = ind
out = arr[tuple(idx)]
Just to make it functional and do error checks, let's create two functions - One to get those indices and second one to feed in the data array and simply index. The idea with the first function is to get the indices that could be re-used for indexing into any arbitrary array which would support the necessary number of dimensions and lengths along each axis.
Hence, the implementations would be -
def advindex_allaxes(ind, axis):
axis = np.core.multiarray.normalize_axis_index(axis,ind.ndim)
idx = np.ogrid[tuple(map(slice, ind.shape))]
idx[axis] = ind
return tuple(idx)
def take_along_axis(arr, ind, axis):
return arr[advindex_allaxes(ind, axis)]
Sample runs -
In [161]: A = np.array([[3,2,1],[4,0,6]])
In [162]: B = np.array([[3,1,4],[1,5,9]])
In [163]: i = A.argsort(axis=-1)
In [164]: take_along_axis(A,i,axis=-1)
Out[164]:
array([[1, 2, 3],
[0, 4, 6]])
In [165]: take_along_axis(B,i,axis=-1)
Out[165]:
array([[4, 1, 3],
[5, 1, 9]])
Relevant one.

Decrease array size by averaging adjacent values with numpy

I have a large array of thousands of vals in numpy. I want to decrease its size by averaging adjacent values.
For example:
a = [2,3,4,8,9,10]
#average down to 2 values here
a = [3,9]
#it averaged 2,3,4 and 8,9,10 together
So, basically, I have n number of elements in array, and I want to tell it to average down to X number of values, and it averages like above.
Is there some way to do that with numpy (already using it for other things, so I'd like to stick with it).
Using reshape and mean, you can average every m adjacent values of an 1D-array of size N*m, with N being any positive integer number. For example:
import numpy as np
m = 3
a = np.array([2, 3, 4, 8, 9, 10])
b = a.reshape(-1, m).mean(axis=1)
#array([3., 9.])
1)a.reshape(-1, m) will create a 2D image of the array without copying data:
array([[ 2, 3, 4],
[ 8, 9, 10]])
2)taking the mean in the second axis (axis=1) will then calculate the mean value of each row, resulting in:
array([3., 9.])
Try this:
n_averaged_elements = 3
averaged_array = []
a = np.array([ 2, 3, 4, 8, 9, 10])
for i in range(0, len(a), n_averaged_elements):
slice_from_index = i
slice_to_index = slice_from_index + n_averaged_elements
averaged_array.append(np.mean(a[slice_from_index:slice_to_index]))
>>>> averaged_array
>>>> [3.0, 9.0]
Looks like a simple non-overlapping moving window average to me, how about:
In [3]:
import numpy as np
a = np.array([2,3,4,8,9,10])
window_sz = 3
a[:len(a)/window_sz*window_sz].reshape(-1,window_sz).mean(1)
#you want to be sure your array can be reshaped properly, so the [:len(a)/window_sz*window_sz] part
Out[3]:
array([ 3., 9.])
In this example, I presume that a is the 1D numpy array that needs to be averaged. In the method that I give below, we first find the factors of the length of this array a. And, then we choose the an appropriate factor as the step size to average the array with.
Here is the code.
import numpy as np
from functools import reduce
''' Function to find factors of a given number 'n' '''
def factors(n):
return list(set(reduce(list.__add__,
([i, n//i] for i in range(1, int(n**0.5) + 1) if n % i == 0))))
a = [2,3,4,8,9,10] #Given array.
'''fac: list of factors of length of a.
In this example, len(a) = 6. So, fac = [1, 2, 3, 6] '''
fac = factors(len(a))
'''step: choose an appropriate step size from the list 'fac'.
In this example, we choose one of the middle numbers in fac
(3). '''
step = fac[int( len(fac)/3 )+1]
'''avg: initialize an empty array. '''
avg = np.array([])
for i in range(0, len(a), step):
avg = np.append( avg, np.mean(a[i:i+step]) ) #append averaged values to `avg`
print avg #Prints the final result
[3.0, 9.0]

Categories

Resources