Numba and Numpy Random Number interaction - python

I am trying to generate a bunch of random numbers quickly to do a MCMC.
I have the following benchmarks:
from numba import njit, prange
import numpy as np

@njit
def getRandos(n):
    for i in prange(n):
        a = np.random.rand()
%timeit np.random.rand(1000000000)
13.1 s ± 287 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit getRandos(1000000000)
1.97 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Clearly the parallelization improves my runtime. However, I don't know how the seeding of the random number generation works. How can I ensure that these numbers are truly random? Do I have to randomly choose a seed somehow?

You don't have an apples-to-apples comparison. The first call, np.random.rand(1000000000), spends a ton of time allocating space and storing the random numbers, while the second call, getRandos(1000000000), just generates values and drops them.
Here is the apples-to-apples comparison (which is about the same speed):
from numba import prange, njit
import numpy as np
@njit
def getRandos(n):
    a = np.zeros(n)
    for i in prange(n):
        a[i] = np.random.rand()
    return a
%timeit -n 100 getRandos(100000)
%timeit -n 100 np.random.rand(100000)
To answer your question, however: see the Numba documentation on supported NumPy random functions.
Numba doesn't let you create individual RandomState instances, but you can set the seed inside the jitted function.
@njit
def getRandos(n):
    np.random.seed(1111)
    a = np.zeros(n)
    for i in prange(n):
        a[i] = np.random.rand()
    return a
values = getRandos(100000)
values2 = getRandos(100000)
print(all(values == values2)) # True
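If you instead want different draws on every call, one option (my addition, not from the original answer) is to pass a fresh seed in as an argument and draw that seed outside the jitted function, e.g. from NumPy's global RNG, which is itself seeded from OS entropy at interpreter start:
@njit
def getRandosSeeded(n, seed):
    np.random.seed(seed)  # seeds Numba's internal RNG for this call
    a = np.zeros(n)
    for i in prange(n):
        a[i] = np.random.rand()
    return a

values = getRandosSeeded(100000, np.random.randint(0, 2**31 - 1))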

Can this matrix calculation be implemented or approximated without an intermediate 3D matrix?

Given an NxN matrix W, I'm looking to calculate an NxN matrix C given by the equation in this link: https://i.stack.imgur.com/dY7rY.png, or in LaTeX
$$C_{ij} = \max_k \bigg\{ \sum_l \bigg( W_{ik}W_{kl}W_{lj} - W_{ik}W_{kj} \bigg) \bigg\}.$$
I have tried to implement this in PyTorch, but I have either run into memory problems, by constructing an intermediate NxNxN 3D tensor which for large N exhausts my GPU memory, or used a for-loop over k, which is very slow. I can't work out how to get around these. How might I implement this calculation, or an approximation of it, without a large intermediate matrix like this?
Suggestions, pseudocode in any language or an implementation in any of Python/Numpy/PyTorch would be much appreciated.
The formula can be simplified as
C_ij = max_k ( W_ik M_kj )
where
M = W·W - N·W
with N the size of the matrix W and W·W the usual matrix product. This holds because sum_l W_ik W_kl W_lj = W_ik (W·W)_kj, while the second term, sum_l W_ik W_kj = N W_ik W_kj, does not depend on l.
Then, in the formula above, for every i, j there is an independent maximum to be computed. Without knowing further properties of W, it is in general not possible to simplify the problem further. So, after computing the matrix M, you can loop over i and j and compute the maximum.
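Since the question asks for a NumPy option, here is a minimal pure-NumPy sketch of this idea (my addition; the chunk size is a tuning parameter I introduce). It computes M once and takes the maximum over k in row chunks, so the intermediate is N x chunk x N instead of N x N x N:
import numpy as np

def calc_numpy_chunked(W, chunk=16):
    N = W.shape[0]
    M = W @ W - N * W
    C = np.full((N, N), -np.inf)
    for k0 in range(0, N, chunk):
        k1 = min(k0 + chunk, N)
        # block[i, k - k0, j] = W[i, k] * M[k, j] for k in [k0, k1)
        block = W[:, k0:k1, None] * M[None, k0:k1, :]
        np.maximum(C, block.max(axis=1), out=C)
    return C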
A first solution using Numba (you could do the same using Cython or plain C) is to formulate the problem with simple loops.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def calc_1(W):
    C = np.empty_like(W)
    N = W.shape[0]
    for i in nb.prange(N):
        TMP = np.empty(N, dtype=W.dtype)
        for j in range(N):
            for k in range(N):
                acc = 0
                for l in range(N):
                    acc += W[i,k]*W[k,l]*W[l,j] - W[i,k]*W[k,j]
                TMP[k] = acc
            C[i,j] = np.max(TMP)
    return C
Francesco provided a simplification which scales far better for larger array sizes. This leads to the following, where I also optimized away a small temporary array.
@nb.njit(fastmath=True, parallel=True)
def calc_2(W):
    C = np.empty_like(W)
    N = W.shape[0]
    M = np.dot(W, W) - N * W
    for i in nb.prange(N):
        for j in range(N):
            val = W[i,0] * M[0,j]
            for k in range(1, N):
                TMP = W[i,k] * M[k,j]
                if TMP > val:
                    val = TMP
            C[i,j] = val
    return C
This can be optimized further by partial loop unrolling and optimizing the array access. Some compilers may do this automatically.
@nb.njit(fastmath=True, parallel=True)
def calc_3(W):
    C = np.empty_like(W)
    N = W.shape[0]
    W = np.ascontiguousarray(W)
    # M is stored transposed here, so the inner loops read M[j, k] = M_kj contiguously
    M = np.dot(W.T, W.T) - W.shape[0] * W.T
    for i in nb.prange(N//4):
        for j in range(N):
            val_1 = W[i*4+0, 0] * M[j, 0]
            val_2 = W[i*4+1, 0] * M[j, 0]
            val_3 = W[i*4+2, 0] * M[j, 0]
            val_4 = W[i*4+3, 0] * M[j, 0]
            for k in range(1, N):
                TMP_1 = W[i*4+0, k] * M[j, k]
                TMP_2 = W[i*4+1, k] * M[j, k]
                TMP_3 = W[i*4+2, k] * M[j, k]
                TMP_4 = W[i*4+3, k] * M[j, k]
                if TMP_1 > val_1:
                    val_1 = TMP_1
                if TMP_2 > val_2:
                    val_2 = TMP_2
                if TMP_3 > val_3:
                    val_3 = TMP_3
                if TMP_4 > val_4:
                    val_4 = TMP_4
            C[i*4+0, j] = val_1
            C[i*4+1, j] = val_2
            C[i*4+2, j] = val_3
            C[i*4+3, j] = val_4
    # Remainder
    for i in range(N//4*4, N):
        for j in range(N):
            val = W[i, 0] * M[j, 0]
            for k in range(1, N):
                TMP = W[i, k] * M[j, k]
                if TMP > val:
                    val = TMP
            C[i, j] = val
    return C
Timings
W=np.random.rand(100,100)
%timeit calc_1(W)
#16.8 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_2(W)
#449 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_3(W)
#259 µs ± 47.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
W=np.random.rand(2000,2000)
#Temporary array would be 64GB in this case
%timeit calc_2(W)
#5.37 s ± 174 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_3(W)
#596 ms ± 30.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
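As a quick consistency check (my addition, not from the original answer), the three variants can be compared on a small input; with fastmath enabled, tiny rounding differences are possible, which np.allclose tolerates:
W = np.random.rand(100, 100)
print(np.allclose(calc_1(W), calc_2(W)))  # expect True
print(np.allclose(calc_2(W), calc_3(W)))  # expect True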

Python: Fastest way of packing a 2d array of binary values into UINT64 array

I have a 2D UINT8 numpy array of size (149797, 64). Each of the elements is either 0 or 1. I want to pack the binary values in each row into a UINT64 value, so that I get a UINT64 array of shape (149797,) as a result. I tried the following code using the numpy packbits function.
test = np.random.randint(0, 2, (149797, 64),dtype=np.uint8)
col_pack=np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
The packbits function takes about 10 ms to execute. A simple reshaping of this array by itself seems to take around 7 ms. I also tried iterating over the 2D numpy array using shifting operations to achieve the same result, but there was no speed improvement.
Finally, I also want to compile it with Numba for CPU.
@njit
def shifting(bitlist):
    x = np.zeros(149797, dtype=np.uint64)
    rows, cols = bitlist.shape
    for i in range(rows):
        out = 0
        for bit in range(cols):
            out = (out << 1) | bitlist[i][bit]  # if the bitlist access is commented out, time = 190 µs
        x[i] = np.uint64(out)  # commenting out this line reduces the njit time to microseconds
    return x
It takes about 6 ms using njit.
Here is the parallel njit version
@njit(parallel=True)
def shifting(bitlist):
    rows, cols = 149797, 64
    z = np.zeros(rows, dtype=np.uint64)
    for i in prange(rows):
        for bit in range(cols):
            z[i] = (z[i] * 2) + bitlist[i, bit]  # time becomes ~100 µs if a scalar 'out' is accumulated instead of z[i]
    return z
It's slightly better, with a 3.24 ms execution time (Google Colab, dual core 2.2 GHz).
Currently, the Python solution with the byteswap (Paul's) method seems to be the best one, i.e. 1.74 ms.
How can we further speed up this conversion? Is there scope for using any vectorization (or parallelization), bitarrays etc. to achieve a speedup?
Ref: numpy packbits pack to uint16 array
On a 12 core machine (Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz):
Paul's method: 1595.0 microseconds (it does not use multiple cores, I suppose)
Numba code: 146.0 microseconds (the aforementioned parallel Numba version)
i.e. around 10x speedup!!!
You can get a sizeable speedup by using byteswap instead of reshaping etc.:
from timeit import timeit
import numpy as np

test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)

np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#         ...,                  1943876987915462704, 14189483758685514703,
#         12753669247696755125], dtype=uint64)

np.packbits(test).view(np.uint64).byteswap()
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#         ...,                  1943876987915462704, 14189483758685514703,
#         12753669247696755125], dtype=uint64)

timeit(lambda: np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64), number=100)
# 1.1054180909413844

timeit(lambda: np.packbits(test).view(np.uint64).byteswap(), number=100)
# 0.18370431219227612
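To see why the byteswap is needed, here is a small sanity check of my own (assuming a little-endian machine, as on x86): packbits fills each byte MSB-first, so the 8 bytes of a 64-bit row come out in big-endian order and must be swapped before the little-endian uint64 view is correct.
row = np.array([1] + [0] * 63, dtype=np.uint8)      # only the first bit set
print(np.packbits(row).view(np.uint64))             # [128] -> wrong without the swap
print(np.packbits(row).view(np.uint64).byteswap())  # [9223372036854775808] == 2**63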
A slightly faster Numba solution (Numba version 0.46, on Windows).
Code
import numpy as np
import numba as nb

# with memory allocation
@nb.njit(parallel=True)
def shifting(bitlist):
    assert bitlist.shape[1] == 64
    x = np.empty(bitlist.shape[0], dtype=np.uint64)
    for i in nb.prange(bitlist.shape[0]):
        out = np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i, bit]
        x[i] = out
    return x

# without memory allocation
@nb.njit(parallel=True)
def shifting_2(bitlist, x):
    assert bitlist.shape[1] == 64
    for i in nb.prange(bitlist.shape[0]):
        out = np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i, bit]
        x[i] = out
    return x
Timings
test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)

# If you call this function multiple times, allocating the memory
# only once may be enough
x = np.empty(test.shape[0], dtype=np.uint64)

# Warmup (the first call takes significantly longer)
res = shifting(test)
res = shifting_2(test, x)
%timeit res=shifting(test)
#976 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res=shifting_2(test,x)
#764 µs ± 63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.packbits(test).view(np.uint64).byteswap()
#8.07 ms ± 52.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
#17.9 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Discrepancy in performance between log division and log subtraction using numba

I am trying to optimize some code that uses logs (the mathematical kind, not the timestamp record kind :)) and I found something strange that I haven't been able to find any answers for online. We have log(a/b) = log(a) - log(b), so I have written some code to compare the performance of the two methods.
import numpy as np
import numba as nb
# create some large random walk data
x = np.random.normal(0, 0.1, int(1e7))
x = abs(x.min()) + 100 + x # make all values >= 100
@nb.njit
def subtract_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    for t in range(tau, arr.shape[0]):
        a = np.log(arr[t]) - np.log(arr[t - tau])
    return None

@nb.njit
def divide_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    for t in range(tau, arr.shape[0]):
        a = np.log(arr[t] / arr[t - tau])
    return None
%timeit subtract_log(x, 100)
>>> 252 ns ± 0.319 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit divide_log(x, 100)
>>> 5.57 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So we see that subtracting logs is ~20,000 times faster than taking the log of the quotient. I find this strange, because I would have thought that in subtracting logs, the log series approximation would have to be calculated twice. But perhaps it's something to do with how numpy broadcasts operations?
The above example is trivial as we don't do anything with the result of the calculation. Below is a more realistic example where we return the result of the calculation.
@nb.njit
def subtract_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    out = np.empty(arr.shape[0] - tau)
    for t in range(tau, arr.shape[0]):
        f = t - tau
        out[f] = np.log(arr[t]) - np.log(arr[f])
    return out

@nb.njit
def divide_log(arr, tau):
    """arr is a numpy array, tau is an int"""
    out = np.empty(arr.shape[0] - tau)
    for t in range(tau, arr.shape[0]):
        f = t - tau
        out[f] = np.log(arr[t] / arr[f])
    return out
out1 = subtract_log(x, 100)
out2 = divide_log(x, 100)
np.testing.assert_allclose(out1, out2, atol=1e-8) # True
%timeit subtract_log(x, 100)
>>> 129 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit divide_log(x, 100)
>>> 93.4 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now we see the times are the same order of magnitude, but subtracting logs is some 40% slower than dividing.
Can anyone explain these discrepancies?
Why is subtracting logs so much faster than dividing logs for the trivial case?
Why is subtracting logs some 40% slower than dividing when we store the values in an array? I know there is a significant setup cost in allocating an array with np.empty(): merely allocating the array in subtract_log() in the trivial case, without storing any values in it, brings the time up from 252 ns to 311 µs.
Don't measure "useless" things, a compiler may optimize it completely away
If you turn of division by zero check (error_model="numpy"), both functions take about 280ns. Not because of fast calculation, but because they are actually doing nothing.
Optimizing away useless calculations is expected, but sometimes LLVM can't detect all of it.
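For reference, a minimal sketch of the option mentioned above (my formulation, reusing the trivial function from the question):
@nb.njit(error_model="numpy")
def divide_log_nocheck(arr, tau):
    # with error_model="numpy", Numba skips the Python-style
    # ZeroDivisionError check, so LLVM can remove the dead loop entirely
    for t in range(tau, arr.shape[0]):
        a = np.log(arr[t] / arr[t - tau])
    return None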
In the second case you are comparing the runtime of two logarithms to that of one logarithm and one division (the subtractions/additions, as well as multiplications, are a lot faster). There can be differences in calculation time depending on the log implementation and the processor. But also have a look at the results: they are not exactly the same.
At least for a float64 division (FDIV), you can have a look at the instruction tables from Agner Fog.
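To see the "not exactly the same" point concretely, here is a quick check (my addition) using the array-returning versions from the question; the two expressions differ by rounding in the last bits:
out1 = subtract_log(x, 100)
out2 = divide_log(x, 100)
print(np.max(np.abs(out1 - out2)))  # small but typically nonzero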

Distance matrix for custom distance

From what I understand, the scipy function scipy.spatial.distance_matrix returns the Minkowski distance for any pair of vectors from the provided matrices of vectors. Is there a way to get the same result for a different distance? Something that would look like distance_matrix(X, Y, distance_function) ?
I assume that scipy does some sort of optimization under the hood. Since I am dealing with very large vectors, I would rather not lose the benefit of these optimizations by implementing my own distance_matrix function.
It is quite straightforward to implement it yourself. The performance will also very likely be better than the distance functions already implemented in scipy.
Most distance functions apply one function to all pairs of components and sum the results, e.g. (A_ik - B_jk)**n for the Minkowski distance, and at the end apply some other function to the sum, e.g. acc**(1/n).
Template function
You don't have to change anything here to implement various distance functions.
import numpy as np
import numba as nb

def gen_cust_dist_func(kernel_inner, kernel_outer, parallel=True):
    kernel_inner_nb = nb.njit(kernel_inner, fastmath=True, inline='always')
    kernel_outer_nb = nb.njit(kernel_outer, fastmath=True, inline='always')

    def cust_dot_T(A, B):
        assert B.shape[1] == A.shape[1]
        out = np.empty((A.shape[0], B.shape[0]), dtype=A.dtype)
        for i in nb.prange(A.shape[0]):
            for j in range(B.shape[0]):
                acc = 0
                for k in range(A.shape[1]):
                    acc += kernel_inner_nb(A[i,k], B[j,k])
                out[i,j] = kernel_outer_nb(acc)
        return out

    if parallel:
        return nb.njit(cust_dot_T, fastmath=True, parallel=True)
    else:
        return nb.njit(cust_dot_T, fastmath=True, parallel=False)
Examples and Timings
# Implement, for example, a Minkowski distance and a Euclidean distance

# Minkowski distance, p=20 (p is even here, so abs() is not needed in the kernel)
inner = lambda A, B: (A - B)**20
outer = lambda acc: acc**(1./20)
my_minkowski_dist = gen_cust_dist_func(inner, outer, parallel=True)

# Euclidean distance
inner = lambda A, B: (A - B)**2
outer = lambda acc: np.sqrt(acc)
my_euclidian_dist = gen_cust_dist_func(inner, outer, parallel=True)
from scipy.spatial.distance import cdist

A = np.random.rand(1000, 50)
B = np.random.rand(1000, 50)

# Minkowski p=20
%timeit res_1 = cdist(A, B, 'm', p=20)
# 1.44 s ± 8.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2 = my_minkowski_dist(A, B)
# 10.8 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

res_1 = cdist(A, B, 'm', p=20)
res_2 = my_minkowski_dist(A, B)
print(np.allclose(res_1, res_2))
# True

# Euclidean
%timeit res_1 = cdist(A, B, 'euclidean')
# 39.3 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2 = my_euclidian_dist(A, B)
# 3.61 ms ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

res_1 = cdist(A, B, 'euclidean')
res_2 = my_euclidian_dist(A, B)
print(np.allclose(res_1, res_2))
# True
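As a further usage sketch (my addition), the cityblock/Manhattan distance also fits the inner/outer pattern, with the identity as the outer kernel:
inner = lambda A, B: abs(A - B)
outer = lambda acc: acc
my_manhattan_dist = gen_cust_dist_func(inner, outer, parallel=True)

res_1 = cdist(A, B, 'cityblock')
res_2 = my_manhattan_dist(A, B)
print(np.allclose(res_1, res_2))  # expect True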

A faster numpy.polynomial?

I have a very simple problem: in my python toolbox, I have to compute the values of polynomials (usually degree 3 or 2, seldom others, always integer degree) from a large vector (size >> 10^6). Storing the result in a buffer is not an option because I have several of these vectors so I would quickly run out of memory, and I usually have to compute it only once in any case. The performance of numpy.polyval is actually quite good, but still this is my bottleneck. Can I somehow make the evaluation of the polynomial faster?
Addendum
I think that the pure-numpy solution of Joe Kington is good for me, in particular because it avoids potential installation issues with other libraries or Cython. For those who asked: the numbers in the vector are large (of order 10^4), so I don't think the suggested approximations would work.
You actually can speed it up slightly by doing the operations in-place (or using numexpr or numba which will automatically do what I'm doing manually below).
numpy.polyval is a very short function. Leaving out a few type checks, etc, it amounts to:
def polyval(p, x):
    y = np.zeros_like(x)
    for i in range(len(p)):
        y = x * y + p[i]
    return y
The downside to this approach is that a temporary array will be created inside the loop as opposed to doing the operation in-place.
What I'm about to do is a micro-optimization and is only worthwhile for very large x inputs. Furthermore, we'll have to assume floating-point output instead of letting the upcasting rules determine the output's dtype. However, it will speed this up slightly and make it use less memory:
def faster_polyval(p, x):
    y = np.zeros(x.shape, dtype=float)
    for i, v in enumerate(p):
        y *= x
        y += v
    return y
As an example, let's say we have the following input:
# Third order polynomial
p = [4.5, 9.8, -9.2, 1.2]
# One-million element array (np.linspace needs an integer count)
x = np.linspace(-10, 10, int(1e6))
The results are identical:
In [3]: np_result = np.polyval(p, x)
In [4]: new_result = faster_polyval(p, x)
In [5]: np.allclose(np_result, new_result)
Out[5]: True
And we get a modest 2-3x speedup (which is mostly independent of array size, as it relates to memory allocation, not number of operations):
In [6]: %timeit np.polyval(p, x)
10 loops, best of 3: 20.7 ms per loop
In [7]: %timeit faster_polyval(p, x)
100 loops, best of 3: 7.46 ms per loop
For really huge inputs, the memory usage difference will matter more than the speed differences. The "bare" numpy version will use ~2x more memory at peak usage than the faster_polyval version.
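Since the answer mentions numba as an option, here is a hedged sketch of the same Horner loop in a jitted function (my addition; it assumes Numba is installed and that the coefficients are passed as a NumPy array, highest power first):
import numba as nb

@nb.njit(fastmath=True)
def numba_polyval(p, x):
    y = np.empty_like(x)
    for i in range(x.shape[0]):
        acc = p[0]
        for j in range(1, p.shape[0]):
            acc = acc * x[i] + p[j]  # Horner's scheme, no temporary arrays
        y[i] = acc
    return y

y_nb = numba_polyval(np.asarray(p), x)  # coefficients as an array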
I ended up here when I wanted to know whether np.polyval or np.polynomial.polynomial.polyval is faster.
And it is interesting to see that simple implementations are faster, as @Joe Kington shows. (I had hoped for some optimisation by numpy.)
So here is my comparison with np.polynomial.polynomial.polyval, plus a slightly faster version.
def fastest_polyval(x, a):
    y = a[-1]
    for ai in a[-2::-1]:
        y *= x
        y += ai
    return y
It avoids the initial zero array and needs one loop less.
y_np = np.polyval(p, x)
y_faster = faster_polyval(p, x)
prev = 1 * p[::-1] # reverse coefficients
y_np2 = np.polynomial.polynomial.polyval(x, prev)
y_fastest = fastest_polyval(x, prev)
np.allclose(y_np, y_faster), np.allclose(y_np, y_np2), np.allclose(y_np, y_fastest)
# (True, True, True)
%timeit np.polyval(p, x)
%timeit faster_polyval(p, x)
%timeit np.polynomial.polynomial.polyval(x, prev)
%timeit fastest_polyval(x, prev)
# 6.51 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3.69 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6.28 ms ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2.65 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
