Question:
How can I quickly prepend 0's to a large array (~600 000 entries) so that its length becomes the next power of two (2^n)? Is there a faster solution than np.concatenate()?
What I've already tried:
Calling np.concatenate(([0], arr)) repeatedly until the length of the array equals the next power of two. The code works, but it takes a very long time.
Here's the pad left function:
def PadLeft(arr):
    nextPower = NextPowerOfTwo(len(arr))
    deficit = int(math.pow(2, nextPower) - len(arr))
    #for x in range(1, int(deficit)):
    for x in range(0, deficit):
        arr = np.concatenate(([0], arr))
    return arr
Here's the next power of two function:
def NextPowerOfTwo(number):
    # Returns next power of two following 'number'
    return math.ceil(math.log(number,2))
My implementation:
arr = np.ones(())
a = PadLeft(arr)
Thanks!
Rather than extending the old array in a for loop with a single element, why not add the entire set of zeroes at once?
arr = np.concatenate((np.zeros(deficit, dtype=arr.dtype), arr))
So don't use the for-loop. That's where your code is running slowly, as it is making a new array every iteration, which is far less efficient than making the required size array once and then filling it as needed, which can be done in several ways. This is just one, one that's close to your own solution.
The reason dtype=arr.dtype is added is that np.zeros returns an array of dtype float64 by default. If the dtype of arr is "narrower" than that (in a casting sense), for example an integer type, the concatenated result is silently cast to the "broader" dtype, float64, which is usually not what you want.
This valid point was made by Divakar in the comments below.
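Putting these pieces together, a loop-free version of the question's PadLeft might look like the sketch below. The name pad_left_to_pow2 and the use of int.bit_length() to find the target length are illustrative choices rather than part of the original code, and the sketch assumes a non-empty 1-D array.

```python
import numpy as np

def pad_left_to_pow2(arr):
    """Left-pad a 1-D array with zeros up to the next power of two."""
    n = len(arr)
    # Smallest power of two that is >= n (assumes n >= 1).
    target = 1 << (n - 1).bit_length()
    deficit = target - n
    # The zeros are created once, with the same dtype as arr; no Python loop.
    zeros = np.zeros(deficit, dtype=arr.dtype)
    return np.concatenate((zeros, arr))

arr = np.ones(600000, dtype=np.int64)
padded = pad_left_to_pow2(arr)
print(len(padded), padded.dtype)   # 1048576 int64
```

Like the math.ceil(math.log(n, 2)) approach in the question, this leaves an array untouched when its length is already a power of two.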
There is numpy.pad which does exactly that.
For a 1D array:
arr = np.pad(arr, (deficit,0), mode='constant')
It reads as (left, right) padding.
For a 2D array:
arr = np.pad(arr, ((0,0), (deficit,0)), mode='constant')
The second parameter reads as ((top, bottom), (left, right)), so this pads every row with deficit zeros on the left.
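As a small illustration of how the pad widths are read, here is np.pad applied to tiny arrays. The arrays a and m and the pad width of 2 are made up for the example.

```python
import numpy as np

a = np.array([1, 2, 3])
# (left, right) padding for a 1-D array
print(np.pad(a, (2, 0), mode='constant'))   # [0 0 1 2 3]

m = np.array([[1, 2],
              [3, 4]])
# ((top, bottom), (left, right)) padding for a 2-D array
padded = np.pad(m, ((0, 0), (2, 0)), mode='constant')
print(padded)
# [[0 0 1 2]
#  [0 0 3 4]]
```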
Making use of NumPy entirely, here's an approach with initialization -
def NextPowerOfTwo(number):
    # Returns next power of two following 'number'
    return np.ceil(np.log2(number))
def PadLeft_with_initialization(arr):
    nextPower = NextPowerOfTwo(len(arr))
    deficit = int(np.power(2, nextPower) - len(arr))
    out = np.zeros(deficit+len(arr),dtype=arr.dtype)
    out[deficit:] = arr
    return out
Runtime test
Let's time the proposed solution in this post and np.concatenate based one as listed in Oliver W.'s solution :
def PadLeft_with_concatente(arr): # Oliver W.'s solution
    nextPower = NextPowerOfTwo(len(arr))
    deficit = int(np.power(2, nextPower) - len(arr))
    return np.concatenate((np.zeros(deficit,dtype=arr.dtype), arr))
Timings -
In [226]: arr = np.random.randint(0,9,(600000))
In [227]: %timeit PadLeft_with_concatente(arr)
100 loops, best of 3: 5.21 ms per loop
In [228]: %timeit PadLeft_with_initialization(arr)
100 loops, best of 3: 6.75 ms per loop
Being cleaner and faster, I think Oliver W.'s solution with np.concatenate would be the way to go.
Related
I have a list of complex numbers for which I want to find the closest value in another list of complex numbers.
My current approach with numpy:
import numpy as np
refArray = np.random.random(16)
myArray = np.random.random(1000)
def find_nearest(array, value):
    idx = (np.abs(array-value)).argmin()
    return idx
for value in np.nditer(myArray):
    index = find_nearest(refArray, value)
    print(index)
Unfortunately, this takes ages for a large amount of values.
Is there a faster or more "pythonian" way of matching each value in myArray to the closest value in refArray?
FYI: I don't necessarily need numpy in my script.
Important: the order of both myArray as well as refArray is important and should not be changed. If sorting is to be applied, the original index should be retained in some way.
Here's one vectorized approach with np.searchsorted based on this post -
def closest_argmin(A, B):
    L = B.size
    sidx_B = B.argsort()
    sorted_B = B[sidx_B]
    sorted_idx = np.searchsorted(sorted_B, A)
    sorted_idx[sorted_idx==L] = L-1
    mask = (sorted_idx > 0) & \
    ((np.abs(A - sorted_B[sorted_idx-1]) < np.abs(A - sorted_B[sorted_idx])) )
    return sidx_B[sorted_idx-mask]
Brief explanation :
Get the sorted indices for the left positions. We do this with - np.searchsorted(arr1, arr2, side='left') or just np.searchsorted(arr1, arr2). Now, searchsorted expects sorted array as the first input, so we need some preparatory work there.
Compare the values at those left positions with the values at their immediate right positions (left + 1) and see which one is closest. We do this at the step that computes mask.
Based on whether the left ones or their immediate right ones are closest, choose the respective ones. This is done by subtracting the mask from the indices: the boolean mask values are converted to ints and act as offsets.
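The steps above can be traced on a tiny input. The arrays A and B below are made-up values chosen so that every intermediate result is easy to check by hand; the code follows the same sequence of operations as closest_argmin.

```python
import numpy as np

B = np.array([0.9, 0.1, 0.5])          # reference values (unsorted)
A = np.array([0.45, 0.95, 0.0, 0.2])   # values to match

sidx_B = B.argsort()                   # [1 2 0]
sorted_B = B[sidx_B]                   # [0.1 0.5 0.9]

# Insertion positions of A in sorted_B, clipped to the last valid index.
sorted_idx = np.searchsorted(sorted_B, A)      # [1 3 0 1]
sorted_idx[sorted_idx == B.size] = B.size - 1  # [1 2 0 1]

# True where the neighbour one slot to the left is strictly closer.
mask = (sorted_idx > 0) & (
    np.abs(A - sorted_B[sorted_idx - 1]) < np.abs(A - sorted_B[sorted_idx])
)                                      # [False False False  True]

# Step back by one where the mask is set, then map to original indices of B.
result = sidx_B[sorted_idx - mask]     # [2 0 1 1]
print(result)
```

Because the result is mapped back through sidx_B, the indices refer to the original, unsorted order of B.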
Benchmarking
Original approach -
def org_app(myArray, refArray):
    out1 = np.empty(myArray.size, dtype=int)
    for i, value in enumerate(myArray):
        # find_nearest from posted question
        index = find_nearest(refArray, value)
        out1[i] = index
    return out1
Timings and verification -
In [188]: refArray = np.random.random(16)
...: myArray = np.random.random(1000)
...:
In [189]: %timeit org_app(myArray, refArray)
100 loops, best of 3: 1.95 ms per loop
In [190]: %timeit closest_argmin(myArray, refArray)
10000 loops, best of 3: 36.6 µs per loop
In [191]: np.allclose(closest_argmin(myArray, refArray), org_app(myArray, refArray))
Out[191]: True
50x+ speedup for the posted sample and hopefully more for larger datasets!
An answer that is much shorter than that of @Divakar, also using broadcasting and even slightly faster:
abs(myArray[:, None] - refArray[None, :]).argmin(axis=-1)
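The question mentions complex numbers although its sample code uses real ones. The broadcasting one-liner should carry over unchanged, since np.abs of a complex difference is the Euclidean distance in the complex plane. The sketch below uses made-up random complex data (the generator seed and array contents are assumptions) and also shows the shape of the intermediate distance array.

```python
import numpy as np

rng = np.random.default_rng(0)
refArray = rng.random(16) + 1j * rng.random(16)
myArray = rng.random(1000) + 1j * rng.random(1000)

# Pairwise distances: shape (1000, 16), one row per value in myArray.
dist = np.abs(myArray[:, None] - refArray[None, :])
nearest = dist.argmin(axis=-1)   # index into refArray for each value

print(dist.shape, nearest.shape)
```

The intermediate array holds len(myArray) * len(refArray) values, so memory use grows with the product of the two lengths.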
Given two matrices X1 (N,3136) and X2 (M,3136), where every element in every row is a binary number, I am trying to calculate the Hamming distance so that each row of X1 is compared to all of the rows of X2, giving a result matrix of shape (N,M).
I have written two functions for it (the first one with the help of numpy and the other one without numpy):
def hamming_distance(X, X_train):
    array = np.array([np.sum(np.logical_xor(x, X_train), axis=1) for x in X])
    return array
def hamming_distance2(X, X_train):
    a = len(X[:,0])
    b = len(X_train[:,0])
    hamming_distance = np.zeros(shape=(a, b))
    for i in range(0, a):
        for j in range(0, b):
            hamming_distance[i,j] = np.count_nonzero(X[i,:] != X_train[j,:])
    return hamming_distance
My problem is that the upper function is much slower than the lower one, where I use two for loops. Is it possible to improve the first function so that I use only one loop?
PS. Sorry for my english, it isn't my first language, although I was trying to do my best!
Numpy only makes your code much faster if you use it to vectorize your work. In your case you can make use of array broadcasting to vectorize your problem: compare your two arrays and create an auxiliary array of shape (N,M,K) which you can sum along its third dimension:
hamming_distance = (X[:,None,:] != X_train).sum(axis=-1)
We inject a singleton dimension into the first array to make it of shape (N,1,K), the second array is implicitly compatible with shape (1,M,K), so the operation can be performed.
In the comments @ayhan noted that this will create a huge auxiliary array for large M and N, which is quite true. This is the price of vectorization: you gain CPU time at the cost of memory. If you have enough memory for the above to work, it will be very fast. If you don't, you have to reduce the scope of your vectorization, and loop in either M or N (or both; this would be your current approach). But this doesn't concern numpy itself, this is about striking a balance between available resources and performance.
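One way to reduce the scope of the vectorization is to loop over blocks of rows of the first array and broadcast only within each block. The following is a minimal sketch of that idea; the function name hamming_chunked, the chunk parameter and the test data are all illustrative assumptions, and the best chunk size depends on the available memory.

```python
import numpy as np

def hamming_chunked(X, X_train, chunk=64):
    """Pairwise Hamming distances, broadcasting over at most `chunk` rows of X at a time."""
    n, m = X.shape[0], X_train.shape[0]
    out = np.empty((n, m), dtype=np.int64)
    for start in range(0, n, chunk):
        block = X[start:start + chunk]
        # Temporary boolean array of shape (len(block), M, K) instead of (N, M, K).
        out[start:start + chunk] = (block[:, None, :] != X_train[None, :, :]).sum(axis=-1)
    return out

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 32))
X_train = rng.integers(0, 2, size=(150, 32))

full = (X[:, None, :] != X_train).sum(axis=-1)   # fully vectorized reference
print(np.array_equal(hamming_chunked(X, X_train), full))
```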
What you are doing is very similar to dot product. Consider these two binary arrays:
1 0 1 0 1 1 0 0
0 0 1 1 0 1 0 1
We are trying to find the number of different pairs. If you directly take the dot product, it gives you the number of (1, 1) pairs. However, if you negate one of them, it will count the different ones. For example, a1.dot(1-a2) counts (1, 0) pairs. Since we also need the number of (0, 1) pairs, we will add a2.dot(1-a1) to that. The good thing about dot product is that it is pretty fast. However, you will need to convert your arrays to floats first, as Divakar pointed out.
Here's a demo:
prng = np.random.RandomState(0)
arr1 = prng.binomial(1, 0.3, (1000, 3136))
arr2 = prng.binomial(1, 0.3, (2000, 3136))
res1 = hamming_distance2(arr1, arr2)
arr1 = arr1.astype('float32'); arr2 = arr2.astype('float32')
res2 = (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
np.allclose(res1, res2)
Out: True
And timings:
%timeit hamming_distance(arr1, arr2)
1 loop, best of 3: 13.9 s per loop
%timeit hamming_distance2(arr1, arr2)
1 loop, best of 3: 5.01 s per loop
%timeit (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
10 loops, best of 3: 93.1 ms per loop
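To see the counting argument at work, the two short binary arrays shown earlier in this answer can be plugged in directly. The variable names ones_zeros and zeros_ones are just labels for this sketch.

```python
import numpy as np

a1 = np.array([1, 0, 1, 0, 1, 1, 0, 0])
a2 = np.array([0, 0, 1, 1, 0, 1, 0, 1])

ones_zeros = a1.dot(1 - a2)   # number of (1, 0) pairs -> 2
zeros_ones = a2.dot(1 - a1)   # number of (0, 1) pairs -> 2

print(ones_zeros + zeros_ones)        # 4
print(np.count_nonzero(a1 != a2))     # 4, the Hamming distance computed directly
```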
I would like to speed up this code :
import numpy as np
import pandas as pd
a = pd.read_csv(path)
closep = a['Clsprc']
delta = np.array(closep.diff())
upgain = np.where(delta >= 0, delta, 0)
downloss = np.where(delta <= 0, -delta, 0)
up = sum(upgain[0:14]) / 14
down = sum(downloss[0:14]) / 14
u = []
d = []
for x in np.nditer(upgain[14:]):
    u1 = 13 * up + x
    u.append(u1)
    up = u1
for y in np.nditer(downloss[14:]):
    d1 = 13 * down + y
    d.append(d1)
    down = d1
The data below:
0 49.00
1 48.76
2 48.52
3 48.28
...
36785758 13.88
36785759 14.65
36785760 13.19
Name: Clsprc, Length: 36785759, dtype: float64
The for loop is too slow, what can I do to speed up this code? Can I vectorize the entire operation?
It looks like you're trying to calculate an exponential moving average (rolling mean), but forgot the division. If that's the case then you may want to see this SO question. Meanwhile, here's a fast, simple moving average using the cumsum() function, taken from the referenced link.
def moving_average(a, n=14) :
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
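A quick check on a short, made-up input shows what the cumsum trick returns: one average per full window of n consecutive values. The prices array and the window size of 2 are chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def moving_average(a, n=14):
    ret = np.cumsum(a, dtype=float)
    # After this line, ret[i] (for i >= n) holds the sum of the last n values.
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

prices = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(moving_average(prices, n=2))   # [1.5 2.5 3.5 4.5]
```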
If this is not the case, and you really want the function described, you can increase the iteration speed by using the external_loop flag in your iteration. From the numpy documentation:
The nditer will try to provide chunks that are as large as possible to
the inner loop. By forcing ‘C’ and ‘F’ order, we get different
external loop sizes. This mode is enabled by specifying an iterator
flag.
Observe that with the default of keeping native memory order, the
iterator is able to provide a single one-dimensional chunk, whereas
when forcing Fortran order, it has to provide three chunks of two
elements each.
for x in np.nditer(upgain[14:], flags=['external_loop'], order='F'):
    # x now has x[0],x[1], x[2], x[3], x[4], x[5] elements.
    pass  # process the chunk x here
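The chunking behaviour described in the quoted documentation can be reproduced with a small 2-D array. The sketch also applies the flag to a 1-D slice like the one in the question; the arrays a and one_d are made-up stand-ins.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Native (C) memory order: the iterator can hand over everything at once.
chunks_c = [chunk.copy() for chunk in np.nditer(a, flags=['external_loop'])]

# Forcing Fortran order on a C-ordered array: three chunks of two elements.
chunks_f = [chunk.copy() for chunk in np.nditer(a, flags=['external_loop'], order='F')]

# A contiguous 1-D slice arrives as a single chunk.
one_d = np.arange(20.0)[14:]
chunks_1d = [chunk.copy() for chunk in np.nditer(one_d, flags=['external_loop'])]

print(len(chunks_c), len(chunks_f), len(chunks_1d))
```

For a 1-D array the loop body therefore runs once with the whole slice, so any speedup has to come from array operations applied to that chunk.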
In simplified terms, I think this is what the loops are doing:
upgain=np.array([.1,.2,.3,.4])
u=[]
up=1
for x in upgain:
    u1=10*up+x
    u.append(u1)
    up=u1
producing:
[10.1, 101.2, 1012.3, 10123.4]
np.cumprod([10,10,10,10]) is there, plus a modified cumsum for the [.1,.2,.3,.4] terms. But I can't off hand think of a way of combining these with compiled numpy functions. We could write a custom ufunc, and use its accumulate. Or we could write it in cython (or other c interface).
https://stackoverflow.com/a/27912352 suggests that frompyfunc is a way of writing a generalized accumulate. I don't expect big time savings, maybe 2x.
To use frompyfunc, define:
def foo(x,y):return 10*x+y
The loop application (above) would be
def loopfoo(upgain,u,u1):
    for x in upgain:
        u1=foo(u1,x)
        u.append(u1)
    return u
The 'vectorized' version would be:
vfoo=np.frompyfunc(foo,2,1) # 2 in arg, 1 out
vfoo.accumulate(upgain,dtype=object).astype(float)
The dtype=object requirement was noted in the prior SO, and https://github.com/numpy/numpy/issues/4155
In [1195]: loopfoo([1,.1,.2,.3,.4],[],0)
Out[1195]: [1, 10.1, 101.2, 1012.3, 10123.4]
In [1196]: vfoo.accumulate([1,.1,.2,.3,.4],dtype=object)
Out[1196]: array([1.0, 10.1, 101.2, 1012.3, 10123.4], dtype=object)
For this small list, loopfoo is faster (3µs v 21µs)
For a 100 element array, e.g. biggain=np.linspace(.1,1,100), the vfoo.accumulate is faster:
In [1199]: timeit loopfoo(biggain,[],0)
1000 loops, best of 3: 281 µs per loop
In [1200]: timeit vfoo.accumulate(biggain,dtype=object)
10000 loops, best of 3: 57.4 µs per loop
For an even larger biggain=np.linspace(.001,.01,1000) (smaller number to avoid overflow), the 5x speed ratio remains.
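Applied to the recurrence from the question (u1 = 13 * up + x with a starting value up), the same accumulate approach might look like the sketch below. The names step, vstep, up0 and the sample values are placeholders, not the asker's data. Because accumulate passes the first element through unchanged, the starting value is prepended and then dropped from the result.

```python
import numpy as np

def step(prev, x):
    # Same update as in the question's loop: u1 = 13 * up + x
    return 13 * prev + x

vstep = np.frompyfunc(step, 2, 1)   # 2 inputs, 1 output

upgain = np.array([0.5, 0.0, 1.5, 0.25])   # stand-in for upgain[14:]
up0 = 2.0                                   # stand-in for the initial 'up'

# accumulate() keeps the first element unchanged, so seed it with up0
# and drop that seed from the result.
seeded = np.concatenate(([up0], upgain))
u_vec = vstep.accumulate(seeded, dtype=object).astype(float)[1:]

# Reference: the explicit loop from the question.
u_loop, up = [], up0
for x in upgain:
    up = 13 * up + x
    u_loop.append(up)

print(np.allclose(u_vec, u_loop))
```

Without the division mentioned in the other answer, the values grow by a factor of about 13 per step, so a long series will overflow regardless of how the loop is written.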
I'm looking for the most memory-efficient way to compute the absolute squared value of a complex numpy ndarray
arr = np.empty((250000, 150), dtype='complex128') # common size
I haven't found a ufunc that would do exactly np.abs()**2.
As an array of that size and type takes up around half a GB, I'm looking for a primarily memory-efficient way.
I would also like it to be portable, so ideally some combination of ufuncs.
So far my understanding is that this should be about the best
result = np.abs(arr)
result **= 2
It will needlessly compute (**0.5)**2, but should compute **2 in-place. Altogether the peak memory requirement is only the original array size + result array size, which should be 1.5 * original array size as the result is real.
If I wanted to get rid of the useless **2 call I'd have to do something like this
result = arr.real**2
result += arr.imag**2
but if I'm not mistaken, this means I'll have to allocate memory for both the real and imaginary part calculation, so the peak memory usage would be 2.0 * original array size. The arr.real properties also return a non-contiguous array (but that is of lesser concern).
Is there anything I'm missing? Are there any better ways to do this?
EDIT 1:
I'm sorry for not making it clear, I don't want to overwrite arr, so I can't use it as out.
Thanks to numba.vectorize in recent versions of numba, creating a numpy universal function for the task is very easy:
import numba
@numba.vectorize([numba.float64(numba.complex128),numba.float32(numba.complex64)])
def abs2(x):
    return x.real**2 + x.imag**2
On my machine, I find a threefold speedup compared to a pure-numpy version that creates intermediate arrays:
>>> x = np.random.randn(10000).view('c16')
>>> y = abs2(x)
>>> np.all(y == x.real**2 + x.imag**2) # exactly equal, being the same operation
True
>>> %timeit np.abs(x)**2
10000 loops, best of 3: 81.4 µs per loop
>>> %timeit x.real**2 + x.imag**2
100000 loops, best of 3: 12.7 µs per loop
>>> %timeit abs2(x)
100000 loops, best of 3: 4.6 µs per loop
EDIT: this solution has twice the minimum memory requirement, and is just marginally faster. The discussion in the comments is good for reference however.
Here's a faster solution, with the result stored in res:
import numpy as np
res = arr.conjugate()
np.multiply(arr,res,out=res)
where we exploited the property of the abs of a complex number, i.e. abs(z) = sqrt(z*z.conjugate), so that abs(z)**2 = z*z.conjugate
If your primary goal is to conserve memory, NumPy's ufuncs take an optional out parameter that lets you direct the output to an array of your choosing. It can be useful when you want to perform operations in place.
If you make this minor modification to your first method, then you can perform the operation on arr completely in place:
np.abs(arr, out=arr)
arr **= 2
One convoluted way that only uses a little extra memory could be to modify arr in place, compute the new array of real values and then restore arr.
This means storing information about the signs (unless you know that your complex numbers all have positive real and imaginary parts). Only a single bit is needed for the sign of each real or imaginary value, so this uses 1/16 + 1/16 == 1/8 the memory of arr (in addition to the new array of floats you create).
>>> signs_real = np.signbit(arr.real) # store information about the signs
>>> signs_imag = np.signbit(arr.imag)
>>> arr.real **= 2 # square the real and imaginary values
>>> arr.imag **= 2
>>> result = arr.real + arr.imag
>>> arr.real **= 0.5 # positive square roots of real and imaginary values
>>> arr.imag **= 0.5
>>> arr.real[signs_real] *= -1 # restore the signs of the real and imaginary values
>>> arr.imag[signs_imag] *= -1
At the expense of storing signbits, arr is unchanged and result holds the values we want.
arr.real and arr.imag are only views into the complex array. So no additional memory is allocated.
If you don't want sqrt (which should be much heavier than multiply), then don't use abs.
If you don't want double memory, then don't use real**2 + imag**2.
Then you might try this (using an indexing trick):
N0 = 23
np0 = (np.random.randn(N0) + 1j*np.random.randn(N0)).astype(np.complex128)
ret_ = np.abs(np0)**2
tmp0 = np0.view(np.float64)
ret0 = np.matmul(tmp0.reshape(N0,1,2), tmp0.reshape(N0,2,1)).reshape(N0)
assert np.abs(ret_-ret0).max()<1e-7
Anyway, I prefer the numba solution
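A variant of the view trick above that works on arrays of any shape is to let np.einsum sum the squared real and imaginary parts. This is only a sketch: it assumes arr is a C-contiguous complex128 array, uses a smaller made-up array than the one in the question, and I have not measured its speed or peak memory against the other answers.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal((1000, 150)) + 1j * rng.standard_normal((1000, 150))

# Reinterpret each complex128 as a pair of float64 values (no copy);
# this requires arr to be C-contiguous.
pairs = arr.view(np.float64).reshape(arr.shape + (2,))

# Sum of squares over the (real, imag) pair -> real array with arr's shape.
result = np.einsum('...k,...k->...', pairs, pairs)

print(result.shape, result.dtype)
```

arr itself is not modified, and the only new full-size array is the float64 result.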
I'm completely new to numpy and unable to find a solution.
I have a 2d list of floating point numbers in python like:
list1[0..8][0..2]
Where e.g.:
print(list1[0][0])
> 0.1122233784
Now I want to find min and max values:
b1 = numpy.array(list1)
list1MinX, list1MinY, list1MinZ = b1.min(axis=0)
list1MaxX, list1MaxY, list1MaxZ = b1.max(axis=0)
I need to do this about a million times in a loop.
It works correctly, but it's about 3x slower than my previous native python approach.
(1:15 min[numpy] vs 0:25 min[native])
What am I doing wrong?
I've read that the list conversion could be the problem, but I don't know how to do it better.
EDIT
As requested, here is some non-pseudo code, although in my script the list is created in another way.
import numpy
import random
def moonPositionNow():
    #assume we read like from a file, line by line
    #nextChunk = readNextLine()
    #the file is build like this
    #x-coord
    #y-coord
    #z-coord
    #x-coord
    #...
    #but we don't have that data here, so as a **placeholder** we return a random number
    nextChunk = random.random()
    return nextChunk
for w in range(1000000):
    list1 = [[moonPositionNow() for i in range(3)] for j in range(9)]
    b1 = numpy.array(list1)
    list1MinX, list1MinY, list1MinZ = b1.min(axis=0)
    list1MaxX, list1MaxY, list1MaxZ = b1.max(axis=0)
    #Print out results
Although the list creation may be a bottleneck here, I guarantee that in the original code it's not the problem.
EDIT2:
Updated the example code to clarify, I don't need a numpy array of random numbers.
Since your data is available as a Python list it seems reasonable to me that a native implementation (which likely calls some optimized C code) could be faster than converting to numpy first and then calling optimized C code.
You basically loop over your data twice: once for converting the python objects to numpy arrays, and once for computing the maximum or minimum.
The native implementation (I assume it is something like calling min/max on the Python list) only needs to loop over the data once.
Furthermore, it seems that numpy's min/max functions are surprisingly slow: https://stackoverflow.com/a/12200671/3005167
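The asker's native code is not shown, so the following is only a guess at what a pure-Python version could look like for the 9x3 list in the question. It transposes the rows with zip and calls the built-in min and max on each column, without creating a numpy array.

```python
import random

# Placeholder data in the question's layout: 9 rows of (x, y, z).
list1 = [[random.random() for _ in range(3)] for _ in range(9)]

# zip(*list1) transposes rows into columns without building an array.
list1MinX, list1MinY, list1MinZ = (min(col) for col in zip(*list1))
list1MaxX, list1MaxY, list1MaxZ = (max(col) for col in zip(*list1))

print(list1MinX, list1MaxX)
```

For an input this small, the cost of converting to a numpy array can easily outweigh the gain from the faster reduction.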
The problem arises because you are passing a python list to a numpy function. The numpy function is significantly faster if you pass a numpy array as the argument.
#Create numpy numbers
nptest = np.random.uniform(size=(10000, 10))
#Create a native python list
listtest = list(nptest)
#Compare performance
%timeit np.min(nptest, axis=0)
%timeit np.min(listtest, axis=0)
Output
1000 loops, best of 3: 394 µs per loop
100 loops, best of 3: 20 ms per loop
EDIT: Added example on how to evaluate a cost function over a grid.
The following evaluates a quadratic cost function over a grid and then takes the minimum along the first axis. In particular, np.meshgrid is your friend.
def cost_function(x, y):
    return x ** 2 + y ** 2
x = np.linspace(-1, 1)
y = np.linspace(-1, 1)
def eval_python(x, y):
    matrix = [cost_function(_x, _y) for _x in x for _y in y]
    return np.min(matrix, axis=0)
def eval_numpy(x, y):
    xx, yy = np.meshgrid(x, y)
    matrix = cost_function(xx, yy)
    return np.min(matrix, axis=0)
%timeit eval_python(x, y)
%timeit eval_numpy(x, y)
Output
100 loops, best of 3: 13.9 ms per loop
10000 loops, best of 3: 136 µs per loop
Finally, if you cannot cast your problem in this form, you can preallocate the memory and then fill in each element.
matrix = np.empty((num_x, num_y))
for i in range(num_x):
    for j in range(num_y):
        matrix[i, j] = cost_function(i, j)