What is the best way to intersect multiple arrays with a numpy array? - python

Suppose I have an example of numpy array:
import numpy as np
X = np.array([2,5,0,4,3,1])
And I also have a list of arrays, like:
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]
I want to keep only those items of each array that are also in X. I would also like to do it in the most efficient/common way.
Solution I have tried so far:
Sort X using X.sort().
Find locations of items of each array in X using:
locations = [np.searchsorted(X, n) for n in A]
Keep only the matching items:
masks = [X[locations[i]] == A[i] for i in range(len(A))]
result = [A[i][masks[i]] for i in range(len(A))]
But it doesn't work, because the locations of the third array are out of bounds:
locations = [array([0, 0, 2], dtype=int64), array([0, 1, 2, 3, 4, 5], dtype=int64), array([2, 5, 4, 6], dtype=int64)]
How to solve this issue?
Update
I ended up with the idx[idx==len(Xs)] = 0 solution. I've also noticed two different approaches posted among the answers: transforming X into a set vs. np.sort. Both have pluses and minuses: set operations rely on Python-level iteration, which is quite slow compared with vectorized numpy methods; on the other hand, the cost of np.searchsorted grows logarithmically with the size of X, whereas a set lookup is effectively constant time. That's why I decided to compare performance using data of huge sizes, specifically 1 million items each for X, A[0], A[1], A[2].
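For reference, a minimal sketch of how such a comparison could be set up (the sizes, data, and function names here are illustrative, not the exact benchmark; timings vary by machine):
import timeit
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 10**7, 10**6)
A = [rng.integers(0, 10**7, 10**6) for _ in range(3)]

def with_set():
    X_set = set(X.tolist())
    return [np.array([x for x in a.tolist() if x in X_set]) for a in A]

def with_searchsorted():
    Xs = np.sort(X)
    out = []
    for a in A:
        idx = np.searchsorted(Xs, a)
        idx[idx == len(Xs)] = 0  # redirect out-of-bounds hits
        out.append(a[Xs[idx] == a])
    return out

print(timeit.timeit(with_set, number=3))
print(timeit.timeit(with_searchsorted, number=3))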

One idea would be to use less compute and do minimal work when looping. So, here's an approach with those goals in mind -
a = np.concatenate(A)                    # all arrays flattened into one
m = np.isin(a, X)                        # membership mask against X
l = np.array(list(map(len, A)))          # original lengths of the arrays
a_m = a[m]                               # kept elements, still concatenated
cut_idx = np.r_[0, l.cumsum()]           # boundaries of each original array
l_m = np.add.reduceat(m, cut_idx[:-1])   # number of kept elements per array
cl_m = np.r_[0, l_m.cumsum()]            # boundaries after filtering
out = [a_m[i:j] for (i,j) in zip(cl_m[:-1], cl_m[1:])]
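On the sample X and A from the question, this gives:
out
# [array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 5, 4])]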
Alternative #1 :
We can also use np.searchsorted to get the isin mask, like so -
Xs = np.sort(X)
idx = np.searchsorted(Xs, a)
idx[idx == len(Xs)] = 0  # redirect out-of-bounds indices to a valid slot
m = Xs[idx] == a
Another way with np.intersect1d
If you are looking for the most common/elegant one, I think it would be with np.intersect1d -
In [43]: [np.intersect1d(X,A_i) for A_i in A]
Out[43]: [array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 4, 5])]
Solving your issue
You can also solve your out-of-bounds issue with a simple fix -
for l in locations:
    l[l == len(X)] = 0
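Putting the fix together with the original steps, a complete sketch on the sample data:
import numpy as np

X = np.array([2, 5, 0, 4, 3, 1])
A = [np.array([-2, 0, 2]), np.array([0, 1, 2, 3, 4, 5]), np.array([2, 5, 4, 6])]

X.sort()
locations = [np.searchsorted(X, a) for a in A]
for l in locations:
    l[l == len(X)] = 0  # any value that searchsorts to len(X) exceeds every element
                        # of X, so comparing it against X[0] below is guaranteed False
masks = [X[locations[i]] == A[i] for i in range(len(A))]
result = [A[i][masks[i]] for i in range(len(A))]
# result -> [array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 5, 4])]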

How about this? Very simple and efficient:
import numpy as np
X = np.array([2,5,0,4,3,1])
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]
X_set = set(X)
A = [np.array([a for a in arr if a in X_set]) for arr in A]
#[array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 5, 4])]
According to the docs, set membership tests have O(1) average complexity, so the overall approach is O(N).

Related

Faster way than numpy.less?

I was wondering if there is a faster way than the np.less functionality. At the moment I'm comparing two arrays with np.less and get something like [True, True, False]; after that I check with a = np.where(x == True) to get the positions of the True entries, and then I fetch values[a]. In my opinion there has to be a much faster way, but I currently can't find one. Sorting the array is no option.
The code looks like this:
a = np.array([1, 2, 3])
b = np.array([2, 1, 8])
over = np.less(a, b)
# over = [True, False, True]
pos = np.where(over == True)[0]
# pos = [0, 2]
x = b[pos]
# x = [2,8]
As Ali_Sh pointed out, just use the condition to fetch the values.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([2, 1, 8])
b[np.less(a, b)]
# array([2, 8])
list(b[np.less(a, b)]) # for a list result instead of np arr
# [2, 8]
# or just
b[a < b]
# array([2, 8])
The issue is not the speed of np.less but the extra steps you've got, all of which are nearly C-speed fast; you just need to remove the unnecessary ones.
And even if you wanted to save the True/False results in over:
over = np.less(a, b)
b[over]
# array([2, 8])
over = a < b
b[over]
# array([2, 8])
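A rough way to confirm this (a sketch; absolute timings are machine-dependent):
import timeit
import numpy as np

a = np.random.rand(10**6)
b = np.random.rand(10**6)

t_mask  = timeit.timeit(lambda: b[a < b], number=100)
t_where = timeit.timeit(lambda: b[np.where(np.less(a, b) == True)[0]], number=100)
print(t_mask, t_where)  # the boolean mask skips the extra np.where/indexing round trip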

Returning a numpy array of arrays based on a list of parameters

I have a very simple piece of code to compute vertical movement. I have set some initial conditions (in this case v0s). Instead of running a for loop over each of the v0s, is there any way to "apply" each v0 to the x linspace and get an array of numpy arrays?
import numpy as np
v0s = [1, 2, 3]
g = 9.81
def VerticalPosition(v0,g,t):
    return(v0*t - 0.5 * g * t**2)
def Solution(v0,g):
    return(2*v0/g)
def Apex(v0,g):
    return(VerticalPosition(v0,g,v0/g))
x=np.linspace(0,Solution(max(v0s),g),101)
y=[]
for v0 in v0s:
    y.append(VerticalPosition(v0,g,x))
While #pekapa's answer (which returns a 2d array of floats) is what most would recommend, here is a method that produces an array of arrays.
y = np.frompyfunc(lambda a, b: VerticalPosition(a, b, x), 2, 1)(v0s, g)
Arrays of arrays are useful when the inner arrays have different shapes. (Not the case in the present example).
Re the use of x in the above expression: it is taken from the enclosing (not necessarily global) scope, but with a bit of care that can be managed. The easiest way is to just pack it in a function and make it explicit. Since the inner function is evaluated immediately and then discarded, x being mutable poses no problem here.
def capsule(v0s, g, x):
    return np.frompyfunc(lambda a, b: VerticalPosition(a, b, x), 2, 1)(v0s, g)
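Usage would then be, e.g.:
y = capsule(v0s, g, x)  # object array holding one result array per v0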
Here is an example that essentially only works with an array of arrays:
a,b = np.ogrid[1:4, 5:9:2]
np.frompyfunc(np.arange, 2, 1)(a, b)
# array([[array([1, 2, 3, 4]), array([1, 2, 3, 4, 5, 6])],
# [array([2, 3, 4]), array([2, 3, 4, 5, 6])],
# [array([3, 4]), array([3, 4, 5, 6])]], dtype=object)
You just need to use all vectors, and, in your case, that's quite simple.
Try having v0s as a vector with:
v0s = np.array([[1], [2], [3]])
Note that it's a 3x1 vector; v0s.shape should be (3, 1).
Your x linspace is already a vector; x.shape is (101,).
Now you can just multiply them, or call VerticalPosition directly with your new v0s vector, i.e.
y = VerticalPosition(v0s, g, x)
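A quick shape check, reusing VerticalPosition, g and x from the question:
v0s = np.array([[1], [2], [3]])  # shape (3, 1)
y = VerticalPosition(v0s, g, x)  # broadcasts (3, 1) against (101,)
y.shape
# (3, 101)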

list of numpy vectors to sparse array

I have a list of numpy vectors of the format:
[array([[-0.36314615, 0.80562619, -0.82777381, ..., 2.00876354, 2.08571887, -1.24526026]]),
array([[ 0.9766923 , -0.05725135, -0.38505339, ..., 0.12187988, -0.83129255, 0.32003683]]),
array([[-0.59539878, 2.27166874, 0.39192573, ..., -0.73741573, 1.49082653, 1.42466276]])]
Here, only 3 vectors of the list are shown; I have hundreds.
The maximum number of elements in one vector is around 10 million.
All the arrays in the list have unequal numbers of elements, but the maximum number of elements is fixed.
Is it possible to create a sparse matrix from these vectors in Python, such that vectors shorter than the maximum size are padded with zeros?
Try this:
from scipy import sparse
M = sparse.lil_matrix((num_of_vectors, max_vector_size))
for i, v in enumerate(vectors):
    M[i, :v.size] = v
Then take a look at this page: http://docs.scipy.org/doc/scipy/reference/sparse.html
The lil_matrix format is good for constructing the matrix, but you'll want to convert it to a different format like csr_matrix before operating on it.
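For example, the conversion is a one-liner (using M from the snippet above):
M_csr = M.tocsr()  # csr_matrix supports fast arithmetic and row slicing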
In this approach you replace the elements below your threshold by 0 and then create a sparse matrix out of them. I am suggesting the coo_matrix since it is the fastest to convert to the other types, depending on your purposes. Then you can scipy.sparse.vstack() them to build your matrix accounting for all elements in the list:
import scipy.sparse as ss
import numpy as np
old_list = [np.random.random(100000) for i in range(5)]
threshold = 0.01
for a in old_list:
    a[np.absolute(a) < threshold] = 0
old_list = [ss.coo_matrix(a) for a in old_list]
m = ss.vstack( old_list )
A little convoluted, but I would probably do it like this:
>>> import scipy.sparse as sps
>>> a = [np.arange(5), np.arange(7), np.arange(3)]
>>> lens = [len(j) for j in a]
>>> cols = np.concatenate([np.arange(j) for j in lens])
>>> rows = np.concatenate([np.repeat(j, len_) for j, len_ in enumerate(lens)])
>>> data = np.concatenate(a)
>>> b = sps.coo_matrix((data,(rows, cols)))
>>> b.toarray()
array([[0, 1, 2, 3, 4, 0, 0],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 0, 0, 0, 0]])

first order differences along a given axis in NumPy array

#compute first differences of 1d array
from numpy import *
x = arange(10)
y = zeros(len(x))
for i in range(1, len(x)):
    y[i] = x[i] - x[i-1]
print(y)
The above code works but there must be at least one easy, pythonesque way to do this without having to use a for loop. Any suggestions?
What about:
diff(x)
# array([1, 1, 1, 1, 1, 1, 1, 1, 1])
Yes, this is exactly the kind of loop that numpy elementwise operations are designed for. You just need to learn to take the right slices of the arrays.
x = numpy.arange(10)
y = numpy.zeros(x.shape)
y[1:] = x[1:] - x[:-1]
print(y)
Several NumPy builtins will do the job--in particular, diff, ediff1d, and gradient.
I suspect ediff1d is the better choice for the specific case described in the OP--unlike the other two, ediff1d is actually directed/limited to this specific use case--i.e., first-order differences along a single axis (or the only axis of a 1D array).
>>> import numpy as NP
>>> x = NP.random.randint(1, 10, 10)
>>> x
array([4, 6, 6, 8, 1, 2, 1, 1, 5, 4])
>>> NP.ediff1d(x)
array([ 2, 0, 2, -7, 1, -1, 0, 4, -1])
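Since the question title mentions a given axis: np.diff also accepts an axis argument. A small illustrative example -
>>> a = NP.array([[1, 2, 4], [4, 6, 9]])
>>> NP.diff(a, axis=1)  # differences within each row
array([[1, 2],
       [2, 3]])
>>> NP.diff(a, axis=0)  # differences between rows
array([[3, 4, 5]])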
Here's a pattern I used a lot for a while:
d = [a - b for a, b in zip(x[1:], x[:-1])]  # in Python 2 this was itertools.izip
y = [item - x[i - 1] for i, item in enumerate(x[1:], 1)]
If you need to access the index of an item while looping over it, enumerate() is the Pythonic way (note the start value of 1, so that x[i - 1] refers to the preceding element). Also, a list comprehension is, in this case, more readable.
Moreover, you should never use wildcard imports (from numpy import *). They always import more than you need and lead to unnecessary ambiguity. Rather, just import numpy or import what you need, e.g.
from numpy import arange, zeros

How to make a multidimensional numpy array with a varying row size?

I would like to create a two dimensional numpy array of arrays that has a different number of elements on each row.
Trying
cells = numpy.array([[0,1,2,3], [2,3,4]])
gives an error
ValueError: setting an array element with a sequence.
We are now almost 7 years after the question was asked, and your code
cells = numpy.array([[0,1,2,3], [2,3,4]])
executed with numpy 1.12.0 and python 3.5, doesn't produce any error, and
cells contains:
array([[0, 1, 2, 3], [2, 3, 4]], dtype=object)
You access the elements of cells as cells[0][2] (= 2).
An alternative to tom10's solution if you want to build your list of numpy arrays on the fly as new elements (i.e. arrays) become available is to use append:
import numpy as np
d = [] # initialize an empty list
a = np.arange(3) # array([0, 1, 2])
d.append(a) # [array([0, 1, 2])]
b = np.arange(3,-1,-1) #array([3, 2, 1, 0])
d.append(b) #[array([0, 1, 2]), array([3, 2, 1, 0])]
While Numpy knows about arrays of arbitrary objects, it's optimized for homogeneous arrays of numbers with fixed dimensions. If you really need arrays of arrays, better use a nested list. But depending on the intended use of your data, different data structures might be even better, e.g. a masked array if you have some invalid data points.
If you really want flexible Numpy arrays, use something like this:
numpy.array([[0,1,2,3], [2,3,4]], dtype=object)
However this will create a one-dimensional array that stores references to lists, which means that you will lose most of the benefits of Numpy (vector processing, locality, slicing, etc.).
This isn't well supported in Numpy (by definition, almost everywhere, a "two dimensional array" has all rows of equal length). A Python list of Numpy arrays may be a good solution for you, as this way you'll get the advantages of Numpy where you can use them:
cells = [numpy.array(a) for a in [[0,1,2,3], [2,3,4]]]
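Each element of cells then behaves as a normal numpy array, for example:
[c.sum() for c in cells]
# [6, 9]
cells[1] * 2
# array([4, 6, 8])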
Another option would be to store your arrays as one contiguous array and also store their sizes or offsets. This takes a little more conceptual thought around how to operate on your arrays, but a surprisingly large number of operations can be made to work as if you had a two dimensional array with different sizes. In the cases where they can't, np.split can be used to create the list that calocedrus recommends. The easiest operations are ufuncs, because they require almost no modification. Here are some examples:
cells_flat = numpy.array([0, 1, 2, 3, 2, 3, 4])
# One of these is required, it's pretty easy to convert between them,
# but having both makes the examples easy
cell_lengths = numpy.array([4, 3])
cell_starts = numpy.insert(cell_lengths[:-1].cumsum(), 0, 0)
cell_lengths2 = numpy.diff(numpy.append(cell_starts, cells_flat.size))
assert numpy.all(cell_lengths == cell_lengths2)
# Copy prevents shared memory
cells = numpy.split(cells_flat.copy(), cell_starts[1:])
# [array([0, 1, 2, 3]), array([2, 3, 4])]
numpy.array([x.sum() for x in cells])
# array([6, 9])
numpy.add.reduceat(cells_flat, cell_starts)
# array([6, 9])
[a + v for a, v in zip(cells, [1, 3])]
# [array([1, 2, 3, 4]), array([5, 6, 7])]
cells_flat + numpy.repeat([1, 3], cell_lengths)
# array([1, 2, 3, 4, 5, 6, 7])
[a.astype(float) / a.sum() for a in cells]
# [array([ 0. , 0.16666667, 0.33333333, 0.5 ]),
# array([ 0.22222222, 0.33333333, 0.44444444])]
cells_flat.astype(float) / numpy.add.reduceat(cells_flat, cell_starts).repeat(cell_lengths)
# array([ 0. , 0.16666667, 0.33333333, 0.5 , 0.22222222,
# 0.33333333, 0.44444444])
def complex_modify(array):
    """Some complicated function that modifies array
    pretend this is more complex than it is"""
    array *= 3
for arr in cells:
    complex_modify(arr)
cells
# [array([0, 3, 6, 9]), array([ 6, 9, 12])]
for arr in numpy.split(cells_flat, cell_starts[1:]):
    complex_modify(arr)
cells_flat
# array([ 0, 3, 6, 9, 6, 9, 12])
In numpy 1.14.3, using append:
import numpy as np
d = [] # initialize an empty list
a = np.arange(3) # array([0, 1, 2])
d.append(a) # [array([0, 1, 2])]
b = np.arange(3,-1,-1) #array([3, 2, 1, 0])
d.append(b) #[array([0, 1, 2]), array([3, 2, 1, 0])]
What you get is a list of arrays (that can be of different lengths), and you can do operations like d[0].mean(). On the other hand,
cells = numpy.array([[0,1,2,3], [2,3,4]])
results in an array of lists.
You may want to do this:
a1 = np.array([1,2,3])
a2 = np.array([3,4])
a3 = np.array([a1,a2])
a3 # array([array([1, 2, 3]), array([3, 4])], dtype=object)
type(a3) # numpy.ndarray
type(a2) # numpy.ndarray
Slightly off-topic, but not as much as one would think because of eager mode which is now the default:
If you are using Tensorflow, you can do:
a = tf.ragged.constant([[0, 1, 2, 3]])
b = tf.ragged.constant([[2, 3, 4]])
c = tf.concat([a, b], axis=0)
And you can then do all the mathematical operations still, like tf.math.reduce_mean, etc.
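For instance, a small sketch (assuming TensorFlow 2.x; float entries avoid integer truncation in the mean):
import tensorflow as tf
a = tf.ragged.constant([[0., 1., 2., 3.]])
b = tf.ragged.constant([[2., 3., 4.]])
c = tf.concat([a, b], axis=0)
tf.math.reduce_mean(c, axis=1)
# <tf.Tensor: shape=(2,), dtype=float32, numpy=array([1.5, 3. ], dtype=float32)>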
np.array([[0,1,2,3], [2,3,4]], dtype=object) returns an "array" of lists.
a = np.array([np.array([0,1,2,3]), np.array([2,3,4])], dtype=object) returns an array of arrays. It already allows for operations such as a+1.
Building on this, the functionality can be enhanced by subclassing.
import numpy as np
class Arrays(np.ndarray):
    def __new__(cls, input_array, dims=None):
        obj = np.array(list(map(np.array, input_array))).view(cls)
        return obj
    def __getitem__(self, ij):
        if isinstance(ij, tuple) and len(ij) > 1:
            # handle twodimensional slicing
            if isinstance(ij[0], slice) or hasattr(ij[0], '__iter__'):
                # [1:4,:] or [[1,2,3],[1,2]]
                return Arrays(arr[ij[1]] for arr in self[ij[0]])
            return self[ij[0]][ij[1]]  # [1,:] np.array
        return super(Arrays, self).__getitem__(ij)
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        axis = kwargs.pop('axis', None)
        dimk = [len(arg) if hasattr(arg, '__iter__') else 1 for arg in inputs]
        dim = max(dimk)
        pad_inputs = [([i]*dim if (d < dim) else i) for d, i in zip(dimk, inputs)]
        result = [np.ndarray.__array_ufunc__(self, ufunc, method, *x, **kwargs) for x in zip(*pad_inputs)]
        if method == 'reduce':
            # handle sum, min, max, etc.
            if axis == 1:
                return np.array(result)
            else:
                # repeat over remaining axis
                return np.ndarray.__array_ufunc__(self, ufunc, method, result, **kwargs)
        return Arrays(result)
Now this works:
a = Arrays([[0,1,2,3], [2,3,4]])
a[0:1,0:-1]
# Arrays([[0, 1, 2]])
np.sin(a)
# Arrays([array([0. , 0.84147098, 0.90929743, 0.14112001]),
# array([ 0.90929743, 0.14112001, -0.7568025 ])], dtype=object)
a + 2*a
# Arrays([array([0, 3, 6, 9]), array([ 6, 9, 12])], dtype=object)
To get the nanfunctions working, this can be done:
# patch for nanfunction that cannot handle the object-ndarrays along with second axis=-1
def nanpatch(func):
    def wrapper(a, axis=None, **kwargs):
        if isinstance(a, Arrays):
            rowresult = [func(x, **kwargs) for x in a]
            if axis == 1:
                return np.array(rowresult)
            else:
                # repeat over remaining axis
                return func(rowresult)
        # otherwise keep the original version
        return func(a, axis=axis, **kwargs)
    return wrapper
np.nanmean = nanpatch(np.nanmean)
np.nansum = nanpatch(np.nansum)
np.nanmin = nanpatch(np.nanmin)
np.nanmax = nanpatch(np.nanmax)
np.nansum(a)
# 15
np.nansum(a, axis=1)
# array([6, 9])
