Understanding the runtime of numpy.where and equivalent alternatives - python

According to http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html, if x and y are given and input arrays are 1-D, where is equivalent to [xv if c else yv for (c,xv, yv) in zip(x!=0, 1/x, x)]. When doing runtime benchmarks, however, they have significantly different speeds:
x = np.array(range(-500, 500))
%timeit np.where(x != 0, 1/x, x)
10000 loops, best of 3: 23.9 µs per loop
%timeit [xv if c else yv for (c,xv, yv) in zip(x!=0, 1/x, x)]
1000 loops, best of 3: 232 µs per loop
Is there a way I can rewrite the second form so that it has a similar runtime to the first? The reason I ask is because I'd like to use a slightly modified version of the second case to avoid division by zero errors:
[1 / xv if c else xv for (c,xv) in zip(x!=0, x)]
Another question: the first case returns a numpy array while the second case returns a list. Is the most efficient way to have the second case return an array to first make a list and then convert the list to an array?
np.array([xv if c else yv for (c,xv, yv) in zip(x!=0, 1/x, x)])
Thanks!

You just asked about 'delaying' the 'where':
numpy.where : how to delay evaluating parameters?
and someone else just asked about divide by zero:
Replace all elements of a matrix by their inverses
When people say that where is similar to the list comprehension, they attempt to describe the action, not the actual implementation.
np.where called with just one argument is the same as np.nonzero. This quickly (in compiled code) loops through the argument, and collects the indices of all non-zero values.
np.where when called with 3 arguments returns a new array, collecting values from the 2nd and 3rd arguments based on where the condition is nonzero. But it's important to realize that those arguments must be other arrays. They are not functions that it evaluates element by element.
So the where is more like:
m1 = 1/x
m2 = x
[v1 if c else v2 for (c, v1, v2) in zip(x != 0, m1, m2)]
It's easy to run this iteration in compiled code because it just involves 3 arrays of matching size (matching via broadcasting).
np.array([...]) is a reasonable way of converting a list (or list comprehension) into an array. It may be a little slower than some alternatives because np.array is a powerful general-purpose function. np.fromiter([], dtype) may be faster in some cases, because it isn't as general (you have to specify dtype, and it only works with 1d).
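A minimal sketch of the fromiter route, assuming a float result is acceptable (the generator below is my own illustration, not from the answer):
import numpy as np

x = np.arange(-500, 500)
vals = (1/v if v != 0 else 0.0 for v in x)          # generator: no intermediate list
arr = np.fromiter(vals, dtype=float, count=len(x))  # count lets numpy preallocate the output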
There are 2 time proven strategies for getting more speed in element-by-element calculations:
use packages like numba and cython to rewrite the problem as c code
rework your calculations to use existing numpy methods. The use of masking to avoid divide by zero is a good example of this.
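For instance, the ufunc where/out arguments do the masked division entirely in compiled code (a minimal sketch, not taken from the answer):
import numpy as np

x = np.arange(-500, 500)
# divide only where x != 0; elsewhere the pre-filled 0.0 in `out` is kept
result = np.divide(1, x, out=np.zeros(x.shape, dtype=float), where=(x != 0))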
=====================
np.ma.where, the version for masked arrays, is written in Python. Its code might be instructive. Note in particular this piece:
# Construct an empty array and fill it
d = np.empty(fc.shape, dtype=ndtype).view(MaskedArray)
np.copyto(d._data, xv.astype(ndtype), where=fc)
np.copyto(d._data, yv.astype(ndtype), where=notfc)
It makes a target, and then selectively copies values from the 2 inputs arrays, based on the condition array.
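A sketch of that same pattern with plain ndarrays (my own illustration of the idea, not the masked-array code itself):
import numpy as np

x = np.arange(-500, 500)
cond = x != 0

result = np.empty(x.shape, dtype=float)
with np.errstate(divide='ignore'):
    np.copyto(result, 1 / x, where=cond)   # take 1/x where x != 0
np.copyto(result, x, where=~cond)          # take x itself where x == 0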

You can avoid division by zero while maintaining performance by using advanced indexing:
x = np.arange(-500, 500)
result = np.empty(x.shape, dtype=float) # set the dtype to whatever is appropriate
nonzero = x != 0
result[nonzero] = 1/x[nonzero]
result[~nonzero] = 0

If you for some reason want to suppress the error in numpy it might be worth looking into the errstate context manager:
x = np.array(range(-500, 500))
with np.errstate(divide='ignore'):  # suppress the division-by-zero warning
    x = 1/x
x[~np.isfinite(x)] = 0  # convert inf and NaN to 0

Consider changing the array in place by using np.put():
In [56]: x = np.linspace(-1, 1, 5)
In [57]: x
Out[57]: array([-1. , -0.5, 0. , 0.5, 1. ])
In [58]: indices = np.argwhere(x != 0)
In [59]: indices
Out[59]:
array([[0],
       [1],
       [3],
       [4]], dtype=int64)
In [60]: np.put(x, indices, 1/x[indices])
In [61]: x
Out[61]: array([-1., -2., 0., 2., 1.])
The approach above does not create a new array, which could be very convenient if x is a large array.
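A boolean mask gives the same in-place update without building an explicit index array (a minimal sketch, not from the answer above):
import numpy as np

x = np.linspace(-1, 1, 5)
mask = x != 0
x[mask] = 1 / x[mask]   # fancy assignment writes the inverses back into x
print(x)                # [-1. -2.  0.  2.  1.]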

Related

Query an array where two other arrays align

I have 3 arrays, x, y, and q. Arrays x and y have the same length, q is a query array. Assume all values in x and q are unique. For each value of q, I would like to find the index of the corresponding value in x. I would then like to query that index in y. If a value from q does not appear in x, I would like to return np.nan.
As a concrete example, consider the following arrays:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
q = np.array([2, 0])
Since only the value 2 occurs in x, the correct return value would be:
out = np.array([5, np.nan])
With for loops, this can be done like so:
out = []
for i in range(len(q)):
    for j in range(len(x)):
        if np.allclose(q[i], x[j]):
            out.append(y[j])
            break
    else:
        out.append(np.nan)
output = np.array(out)
Obviously this is quite slow. Is there a simpler way to do this with numpy builtins like np.argwhere? Or would it be easier to use pandas?
Numpy broadcasting should work.
# a mask that flags any matches
m = q == x[:, None]
# replace any value in q without any match in x by np.nan
res = np.where(m.any(0), y[:, None] * m, np.nan).sum(0)
res
# array([ 5., nan])
I should note that this only works if x has no duplicates.
Because it relies on building a len(x) x len(q) array, if q is large, the above solution will run into memory issues. Another pandas solution will work much more efficiently in that case:
# map q to y via x
res = pd.Series(q).map(pd.Series(y, index=x)).values
If x and q are 2D, it's better to convert the Series.map() solution into a DataFrame.merge() one:
res = pd.DataFrame(q).merge(pd.DataFrame(x).assign(y=y), on=[0,1], how='left')['y'].values
Numpy broadcasting will blow up (will require 3D array) and will not be efficient for large arrays. Numba might do well though.
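Another vectorized route (not from the answers above) that avoids the (len(x), len(q)) intermediate is np.searchsorted on a sorted copy of x; a sketch, again assuming x has no duplicates:
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
q = np.array([2, 0])

order = np.argsort(x)                    # sort x so searchsorted can locate candidates
xs = x[order]
pos = np.clip(np.searchsorted(xs, q), 0, len(xs) - 1)
found = xs[pos] == q                     # True only where q really occurs in x
res = np.where(found, y[order][pos], np.nan)
# array([ 5., nan])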
I think you could solve this in one line using a single for and some broadcasting:
out = [y[bl].item() if bl.any() else None for bl in x[None,:] == q[:,None]]
It seems to me an elegant solution, though a little confusing to read. I will go part by part.
x[None,:] == q[:,None] compares every value in q with every value in x and returns a (len(q), len(x)) array of booleans (in this case it will be [[False, True, False], [False, False, False]]).
You can index y with a boolean array of the same length as y, so you could call y[[False, True, False]] to get the value of y[1].
If the boolean array contains all False values then you have to put a None, which is why the if-else is used.
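Run on the example arrays from the question, the one-liner gives:
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
q = np.array([2, 0])

out = [y[bl].item() if bl.any() else None for bl in x[None, :] == q[:, None]]
print(out)  # [5, None]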
Here is how to do it with np.argwhere too; use whichever is more comfortable, Pandas or numpy.
out_idx = [y[np.argwhere(x == value).reshape(-1)] for value in q]
out = [v[0] if len(v) else np.nan for v in out_idx]
Here's a way to do what your question asks:
query_results = pd.DataFrame(index=q).join(pd.DataFrame({'y':y}, index=x)).T.to_numpy()[0]
Output:
[ 5. nan]
If performance is the main aim of this question, you can accelerate your looping code with the numba library and jitting, which will be very fast:
import numba as nb
import numpy as np

x = np.random.permutation(2000)[:1100]
y = np.random.permutation(2000)[:1100]
q = np.random.permutation(3000)[:500]
print((q > 2000).sum())  # q values greater than 2000 can never appear in x

@nb.njit
def numba_(x, y, q):
    out = []
    for i in range(len(q)):
        for j in range(len(x)):
            if q[i] == x[j]:
                out.append(y[j])
                break
        else:
            out.append(np.nan)
    return np.array(out)
or in parallel mode:
@nb.njit(parallel=True)
def numba_p(x, y, q):
    out = np.empty(q.shape[0])
    out.fill(np.nan)
    for i in nb.prange(q.shape[0]):
        for j in range(x.shape[0]):
            if q[i] == x[j]:
                out[i] = y[j]
                break
    return out
On large arrays it was much faster than the not a robot answer (np.where) and the constantstranger answer, and nearly the same as the not a robot answer (Pandas):
100 loops, best of 5: 4.4 ms per loop <-- not a robot (np.where)
100 loops, best of 5: 337 µs per loop <-- not a robot (Pandas)
100 loops, best of 5: 350 µs per loop <-- numba
100 loops, best of 5: 341 µs per loop <-- numba_p
100 loops, best of 5: 2.18 ms per loop <-- constantstranger (Pandas)
Note: np.where will be much improved in terms of performance in the new release, which can help the not a robot answer beat the constantstranger answer on larger arrays.
Update: the not a robot answer (Pandas) was much faster (the fastest) in my new test on much larger arrays.

Simplifying looped np.tensordot expression

Currently, my script looks as follows:
import numpy as np
a = np.random.rand(10,5,2)
b = np.random.rand(10,5,50)
c = np.random.rand(10,2,50)
for i in range(a.shape[0]):
    c[i] = np.tensordot(a[i], b[i], axes=(0,0))
I want to replicate the same behaviour without using a for loop, since it can be done in parallel. However, I have not found a neat way yet to do this with the tensordot function. Is there any way to create a one-liner for this operation?
You can use the numpy.einsum function; in this case
c = np.einsum('ijk,ijl->ikl', a, b)
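A quick check against the question's loop (a sketch using the same shapes, not part of the original answer) confirms the two agree:
import numpy as np

a = np.random.rand(10, 5, 2)
b = np.random.rand(10, 5, 50)

c_einsum = np.einsum('ijk,ijl->ikl', a, b)

c_loop = np.empty((10, 2, 50))
for i in range(a.shape[0]):
    c_loop[i] = np.tensordot(a[i], b[i], axes=(0, 0))

print(np.allclose(c_einsum, c_loop))  # True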
An alternative to einsum is matmul/@. The first array has to be transposed so the sum-of-products dimension is last:
In [162]: a = np.random.rand(10,5,2)
...: b = np.random.rand(10,5,50)
In [163]: c = a.transpose(0,2,1) @ b
In [164]: c.shape
Out[164]: (10, 2, 50)
In [165]: c1 = np.random.rand(10,2,50)
     ...:
     ...: for i in range(a.shape[0]):
     ...:     c1[i] = np.tensordot(a[i], b[i], axes=(0,0))
     ...:
In [166]: np.allclose(c,c1)
Out[166]: True
tensordot reshapes and transposes the arguments, reducing the task to a simple dot. So while it's fine for switching which axes get the sum-of-products, it doesn't handle batches any better than dot. That's a big part of why matmul was added. np.einsum gives the same power (and more), but performance usually isn't quite as good (unless it's been "optimized" to the equivalent matmul).
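For completeness, a sketch of that optimization hint: passing optimize=True asks einsum to search for a cheaper contraction path, which in simple cases like this one may dispatch to the same BLAS-backed routines as matmul (my own illustration, not from the answer above):
import numpy as np

a = np.random.rand(10, 5, 2)
b = np.random.rand(10, 5, 50)

c = np.einsum('ijk,ijl->ikl', a, b, optimize=True)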

NumPy: Dot product of diagonal elements

Consider x, an n x 3 vector.
Is it possible, using built-in methods of numpy or tensorflow, or any Python library, to get a vector of the order n x 1 such that each row is a vector of the order 3 x 1? That is, if x is [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]^T, can a vector of the form [[1, 2, 3]^T, [4, 5, 6]^T, [7, 8, 9]^T, [10, 11, 12]^T]^T be obtained without for loops or introducing new axes like, say, np.newaxis?
The motive behind this is to get only the diagonal elements of the dot product of x and its transpose. We could, of course, do something like np.diag(x.dot(x.T)). But, if n is significantly large, say, 202933, one can hear the CPU's fan suffering from wheezing. How can one avoid computing the full dot product and get only the diagonal elements of that phantom product, without iteration?
Let's take a look at the formula for each element in the result of multiplying x by its own transpose. I don't feel like trying to coerce the Stack Overflow UI into allowing me to use tensor notation, so we'll look conceptually.
Each element at row i, column j of the result is the dot product of row i in x and column j in x.T. Now column j in x.T is just row j in x, and the diagonal is where i and j are the same. So what you want is a sum across the rows of the squared elements of x:
d = (x * x).sum(axis=1)
To address the first part of your question, the transpose operation in numpy rarely makes a copy of your data, so x.T or np.transpose(x) are constant-time operations for even the largest arrays. The reason is that numpy arrays are stored as a block of data along with some meta-data like dimensions, strides between elements in each dimension, and data size. Transposing an array only requires you to modify a small amount of meta-data in the array object, like sizes along each dimension and strides, not copy the whole data set.
The time consuming part is performing the multiplication. Simply having the objects x and x.T costs almost nothing: they both use the same data buffer.
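A quick way to see that no data is copied (a small check of my own, not from the answer):
import numpy as np

x = np.random.rand(202933, 3)
xt = x.T

print(xt.base is x)            # True: the transpose is a view of the same buffer
print(x.strides, xt.strides)   # (24, 8) vs (8, 24): only the stride metadata changes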
This function is likely one of the most efficient ways to handle this. (Taken from trimesh: https://github.com/mikedh/trimesh/blob/main/trimesh/util.py#L589)
def diagonal_dot(a, b):
    """
    Dot product by row of a and b.
    There are a lot of ways to do this though
    performance varies very widely. This method
    uses a dot product to sum the row and avoids
    function calls if at all possible.
    Parameters
    ------------
    a : (m, d) float
      First array
    b : (m, d) float
      Second array
    Returns
    -------------
    result : (m,) float
      Dot product of each row
    """
    # make sure `a` is numpy array
    # doing it for `a` will force the multiplication to
    # convert `b` if necessary and avoid function call otherwise
    a = np.asanyarray(a)
    # 3x faster than (a * b).sum(axis=1)
    # avoiding np.ones saves 5-10% sometimes
    return np.dot(a * b, [1.0] * a.shape[1])
Comparing performance of some equivalent versions:
In [1]: import numpy as np; import trimesh
In [2]: a = np.random.random((10000, 3))
In [3]: b = np.random.random((10000, 3))
In [4]: %timeit (a * b).sum(axis=1)
1000 loops, best of 3: 181 us per loop
In [5]: %timeit np.einsum('ij,ij->i', a, b)
10000 loops, best of 3: 62.7 us per loop
In [6]: %timeit np.diag(np.dot(a, b.T))
1 loop, best of 3: 429 ms per loop
In [7]: %timeit np.dot(a * b, np.ones(a.shape[1]))
10000 loops, best of 3: 61.3 us per loop
In [8]: %timeit trimesh.util.diagonal_dot(a, b)
10000 loops, best of 3: 55.2 us per loop

Numpy array is much slower than list

Given two matrices X1 (N,3136) and X2 (M,3136) (where every element in every row is a binary number), I am trying to calculate the Hamming distance so that each row in X1 is compared to all of the rows from X2, such that the result matrix is (N,M).
I have written two functions for it (the first one with the help of numpy and the other one without):
def hamming_distance(X, X_train):
    array = np.array([np.sum(np.logical_xor(x, X_train), axis=1) for x in X])
    return array

def hamming_distance2(X, X_train):
    a = len(X[:,0])
    b = len(X_train[:,0])
    hamming_distance = np.zeros(shape=(a, b))
    for i in range(0, a):
        for j in range(0, b):
            hamming_distance[i,j] = np.count_nonzero(X[i,:] != X_train[j,:])
    return hamming_distance
My problem is that the upper function is much slower than the lower one, where I use two for loops. Is it possible to improve the first function so that I use only one loop?
PS. Sorry for my english, it isn't my first language, although I was trying to do my best!
Numpy only makes your code much faster if you use it to vectorize your work. In your case you can make use of array broadcasting to vectorize your problem: compare your two arrays and create an auxiliary array of shape (N,M,K) which you can sum along its third dimension:
hamming_distance = (X[:,None,:] != X_train).sum(axis=-1)
We inject a singleton dimension into the first array to make it of shape (N,1,K), the second array is implicitly compatible with shape (1,M,K), so the operation can be performed.
In the comments @ayhan noted that this will create a huge auxiliary array for large M and N, which is quite true. This is the price of vectorization: you gain CPU time at the cost of memory. If you have enough memory for the above to work, it will be very fast. If you don't, you have to reduce the scope of your vectorization, and loop in either M or N (or both; this would be your current approach). But this doesn't concern numpy itself, it is about striking a balance between available resources and performance.
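A sketch of that middle ground, processing X in blocks of rows so the temporary (chunk, M, K) array stays small (the block size of 256 is an arbitrary choice of mine):
import numpy as np

def hamming_distance_chunked(X, X_train, chunk=256):
    out = np.empty((X.shape[0], X_train.shape[0]), dtype=np.intp)
    for start in range(0, X.shape[0], chunk):
        stop = start + chunk
        # broadcasted comparison over only a block of rows of X
        out[start:stop] = (X[start:stop, None, :] != X_train).sum(axis=-1)
    return out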
What you are doing is very similar to dot product. Consider these two binary arrays:
1 0 1 0 1 1 0 0
0 0 1 1 0 1 0 1
We are trying to find the number of different pairs. If you directly take the dot product, it gives you the number of (1, 1) pairs. However, if you negate one of them, it will count the different ones. For example, a1.dot(1-a2) counts (1, 0) pairs. Since we also need the number of (0, 1) pairs, we will add a2.dot(1-a1) to that. The good thing about dot product is that it is pretty fast. However, you will need to convert your arrays to floats first, as Divakar pointed out.
Here's a demo:
prng = np.random.RandomState(0)
arr1 = prng.binomial(1, 0.3, (1000, 3136))
arr2 = prng.binomial(1, 0.3, (2000, 3136))
res1 = hamming_distance2(arr1, arr2)
arr1 = arr1.astype('float32'); arr2 = arr2.astype('float32')
res2 = (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
np.allclose(res1, res2)
Out: True
And timings:
%timeit hamming_distance(arr1, arr2)
1 loop, best of 3: 13.9 s per loop
%timeit hamming_distance2(arr1, arr2)
1 loop, best of 3: 5.01 s per loop
%timeit (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
10 loops, best of 3: 93.1 ms per loop

how to speed up loop in numpy?

I would like to speed up this code :
import numpy as np
import pandas as pd
a = pd.read_csv(path)
closep = a['Clsprc']
delta = np.array(closep.diff())
upgain = np.where(delta >= 0, delta, 0)
downloss = np.where(delta <= 0, -delta, 0)
up = sum(upgain[0:14]) / 14
down = sum(downloss[0:14]) / 14
u = []
d = []
for x in np.nditer(upgain[14:]):
    u1 = 13 * up + x
    u.append(u1)
    up = u1
for y in np.nditer(downloss[14:]):
    d1 = 13 * down + y
    d.append(d1)
    down = d1
The data below:
0 49.00
1 48.76
2 48.52
3 48.28
...
36785758 13.88
36785759 14.65
36785760 13.19
Name: Clsprc, Length: 36785759, dtype: float64
The for loop is too slow, what can I do to speed up this code? Can I vectorize the entire operation?
It looks like you're trying to calculate an exponential moving average (rolling mean), but forgot the division. If that's the case then you may want to see this SO question. Meanwhile, here's a fast simple moving average using the cumsum() function taken from the referenced link.
def moving_average(a, n=14):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
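A quick sanity check on a tiny array (my own, not from the answer) shows it produces the expected windowed means:
import numpy as np

a = np.arange(1, 8, dtype=float)   # 1 .. 7
print(moving_average(a, n=3))      # [2. 3. 4. 5. 6.], the 3-period means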
If this is not the case, and you really want the function described, you can increase the iteration speed by using the external_loop flag in your iteration. From the numpy documentation:
The nditer will try to provide chunks that are as large as possible to
the inner loop. By forcing ‘C’ and ‘F’ order, we get different
external loop sizes. This mode is enabled by specifying an iterator
flag.
Observe that with the default of keeping native memory order, the
iterator is able to provide a single one-dimensional chunk, whereas
when forcing Fortran order, it has to provide three chunks of two
elements each.
for x in np.nditer(upgain[14:], flags=['external_loop'], order='F'):
# x now has x[0],x[1], x[2], x[3], x[4], x[5] elements.
In simplified terms, I think this is what the loops are doing:
upgain = np.array([.1, .2, .3, .4])
u = []
up = 1
for x in upgain:
    u1 = 10*up + x
    u.append(u1)
    up = u1
producing:
[10.1, 101.2, 1012.3, 10123.4]
np.cumprod([10,10,10,10]) is there, plus a modified cumsum for the [.1,.2,.3,.4] terms. But I can't offhand think of a way of combining these with compiled numpy functions. We could write a custom ufunc, and use its accumulate. Or we could write it in cython (or another c interface).
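For what it's worth, the recurrence u1 = k*up + x does have a closed form built from such cumulative operations (a sketch of mine; note the powers overflow very quickly for long inputs and a factor like 13, which is why the accumulate approach below is the practical one):
import numpy as np

upgain = np.array([.1, .2, .3, .4])
up0 = 1.0
k = 10.0

scale = k ** np.arange(1, len(upgain) + 1)      # k^1 ... k^n
u = scale * (up0 + np.cumsum(upgain / scale))   # u_j = k^j * up0 + sum_i k^(j-i) * x_i
# array([   10.1,   101.2,  1012.3, 10123.4]) -- matches the loop above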
https://stackoverflow.com/a/27912352 suggests that frompyfunc is a way of writing a generalized accumulate. I don't expect big time savings, maybe 2x.
To use frompyfunc, define:
def foo(x, y): return 10*x + y
The loop application (above) would be
def loopfoo(upgain, u, u1):
    for x in upgain:
        u1 = foo(u1, x)
        u.append(u1)
    return u
The 'vectorized' version would be:
vfoo=np.frompyfunc(foo,2,1) # 2 in arg, 1 out
vfoo.accumulate(upgain,dtype=object).astype(float)
The dtype=object requirement was noted in the prior SO, and https://github.com/numpy/numpy/issues/4155
In [1195]: loopfoo([1,.1,.2,.3,.4],[],0)
Out[1195]: [1, 10.1, 101.2, 1012.3, 10123.4]
In [1196]: vfoo.accumulate([1,.1,.2,.3,.4],dtype=object)
Out[1196]: array([1.0, 10.1, 101.2, 1012.3, 10123.4], dtype=object)
For this small list, loopfoo is faster (3µs v 21µs)
For a 100 element array, e.g. biggain=np.linspace(.1,1,100), the vfoo.accumulate is faster:
In [1199]: timeit loopfoo(biggain,[],0)
1000 loops, best of 3: 281 µs per loop
In [1200]: timeit vfoo.accumulate(biggain,dtype=object)
10000 loops, best of 3: 57.4 µs per loop
For an even larger biggain=np.linspace(.001,.01,1000) (smaller number to avoid overflow), the 5x speed ratio remains.
