I am currently writing an app in python that needs to generate large amount of random numbers, FAST. Currently I have a scheme going that uses numpy to generate all of the numbers in a giant batch (about ~500,000 at a time). While this seems to be faster than python's implementation. I still need it to go faster. Any ideas? I'm open to writing it in C and embedding it in the program or doing w/e it takes.
Constraints on the random numbers:
A Set of 7 numbers that can all have different bounds:
eg: [0-X1, 0-X2, 0-X3, 0-X4, 0-X5, 0-X6, 0-X7]
Currently I am generating a list of 7 numbers with random values from [0-1) then multiplying by [X1..X7]
A Set of 13 numbers that all add up to 1
Currently just generating 13 numbers then dividing by their sum
Any ideas? Would pre calculating these numbers and storing them in a file make this faster?
Thanks!
You can speed things up a bit from what mtrw posted above just by doing what you initially described (generating a bunch of random numbers and multiplying and dividing accordingly)...
Also, you probably already know this, but be sure to do the operations in-place (*=, /=, +=, etc) when working with large-ish numpy arrays. It makes a huge difference in memory usage with large arrays, and will give a considerable speed increase, too.
In [53]: def rand_row_doubles(row_limits, num):
....: ncols = len(row_limits)
....: x = np.random.random((num, ncols))
....: x *= row_limits
....: return x
....:
In [59]: %timeit rand_row_doubles(np.arange(7) + 1, 1000000)
10 loops, best of 3: 187 ms per loop
As compared to:
In [66]: %timeit ManyRandDoubles(np.arange(7) + 1, 1000000)
1 loops, best of 3: 222 ms per loop
It's not a huge difference, but if you're really worried about speed, it's something.
Just to show that it's correct:
In [68]: x.max(0)
Out[68]:
array([ 0.99999991, 1.99999971, 2.99999737, 3.99999569, 4.99999836,
5.99999114, 6.99999738])
In [69]: x.min(0)
Out[69]:
array([ 4.02099599e-07, 4.41729377e-07, 4.33480302e-08,
7.43497138e-06, 1.28446819e-05, 4.27614385e-07,
1.34106753e-05])
Likewise, for your "rows sum to one" part...
In [70]: def rand_rows_sum_to_one(nrows, ncols):
....: x = np.random.random((ncols, nrows))
....: y = x.sum(axis=0)
....: x /= y
....: return x.T
....:
In [71]: %timeit rand_rows_sum_to_one(1000000, 13)
1 loops, best of 3: 455 ms per loop
In [72]: x = rand_rows_sum_to_one(1000000, 13)
In [73]: x.sum(axis=1)
Out[73]: array([ 1., 1., 1., ..., 1., 1., 1.])
Honestly, even if you re-implement things in C, I'm not sure you'll be able to beat numpy by much on this one... I could be very wrong, though!
EDIT Created functions that return the full set of numbers, not just one row at a time.
EDIT 2 Make the functions more pythonic (and faster), add solution for second question
For the first set of numbers, you might consider numpy.random.randint or numpy.random.uniform, which take low and high parameters. Generating an array of 7 x 1,000,000 numbers in a specified range seems to take < 0.7 second on my 2 GHz machine:
def LimitedRandInts(XLim, N):
rowlen = (1,N)
return [np.random.randint(low=0,high=lim,size=rowlen) for lim in XLim]
def LimitedRandDoubles(XLim, N):
rowlen = (1,N)
return [np.random.uniform(low=0,high=lim,size=rowlen) for lim in XLim]
>>> import numpy as np
>>> N = 1000000 #number of randoms in each range
>>> xLim = [x*500 for x in range(1,8)] #convenient limit generation
>>> fLim = [x/7.0 for x in range(1,8)]
>>> aa = LimitedRandInts(xLim, N)
>>> ff = LimitedRandDoubles(fLim, N)
This returns integers in [0,xLim-1] or floats in [0,fLim). The integer version took ~0.3 seconds, the double ~0.66, on my 2 GHz single-core machine.
For the second set, I used #Joe Kingston's suggestion.
def SumToOneRands(NumToSum, N):
aa = np.random.uniform(low=0,high=1.0,size=(NumToSum,N)) #13 rows by 1000000 columns, for instance
s = np.reciprocal(aa.sum(0))
aa *= s
return aa.T #get back to column major order, so aa[k] is the kth set of 13 numbers
>>> ll = SumToOneRands(13, N)
This takes ~1.6 seconds.
In all cases, result[k] gives you the kth set of data.
Try r = 1664525*r + 1013904223
from "an even quicker generator"
in "Numerical Recipes in C" 2nd edition, Press et al., isbn 0521431085, p. 284.
np.random is certainly "more random"; see
Linear congruential generator .
In python, use np.uint32 like this:
python -mtimeit -s '
import numpy as np
r = 1
r = np.array([r], np.uint32)[0] # 316 py -> 16 us np
# python longs can be arbitrarily long, so slow
' '
r = r*1664525 + 1013904223 # NR2 p. 284
'
To generate big blocks at a time:
# initialize --
np.random.seed( ... )
R = np.random.randint( 0, np.iinfo( np.uint32 ).max, size, dtype=np.uint32 )
...
R *= 1664525
R += 1013904223
Making your code run in parallel certainly couldn't hurt. Try adapting it for SMP with Parallel Python
As others have already pointed out, numpy is a very good start, fast and easy to use.
If you need random numbers on a massive scale, consider eas-ecb or rc4. Both can be parallelised, you should reach performance in several GB/s.
achievable numbers posted here
If you have access to multiple cores, the computations can be done in parallel with dask.array:
import dask.array as da
x = da.random.random(size=(rows, cols)).compute()
# .compute is not necessary here, because calculations
# can continue in a lazy form and .compute is used
# on the final result
import random
for i in range(1000000):
print(random.randint(1, 1000000))
Here's a code in Python that you can use to generate one million random numbers, one per line!
Just a quick example of numpy in action:
data = numpy.random.rand(1000000)
No need for loop, you can pass in how many numbers you want to generate.
Related
Given two matrices X1 (N,3136) and X2 (M,3136) (where every element in every row is an binary number) i am trying to calculate hamming distance so that each element in X1 is compared to all of the rows from X2, such that result matrix is (N,M).
I have written two function for it (first one with help of numpy and the other one without numpy):
def hamming_distance(X, X_train):
array = np.array([np.sum(np.logical_xor(x, X_train), axis=1) for x in X])
return array
def hamming_distance2(X, X_train):
a = len(X[:,0])
b = len(X_train[:,0])
hamming_distance = np.zeros(shape=(a, b))
for i in range(0, a):
for j in range(0, b):
hamming_distance[i,j] = np.count_nonzero(X[i,:] != X_train[j,:])
return hamming_distance
My problem is that upper function is much slower than lower one where I use two for loops. Is it possible to improve on first function so that I use only one loop?
PS. Sorry for my english, it isn't my first language, although I was trying to do my best!
Numpy only makes your code much faster if you use it to vectorize your work. In your case you can make use of array broadcasting to vectorize your problem: compare your two arrays and create an auxiliary array of shape (N,M,K) which you can sum along its third dimension:
hamming_distance = (X[:,None,:] != X_train).sum(axis=-1)
We inject a singleton dimension into the first array to make it of shape (N,1,K), the second array is implicitly compatible with shape (1,M,K), so the operation can be performed.
In the comments #ayhan noted that this will create a huge auxiliary array for large M and N, which is quite true. This is the price of vectorization: you gain CPU time at the cost of memory. If you have enough memory for the above to work, it will be very fast. If you don't, you have to reduce the scope of your vectorization, and loop in either M or N (or both; this would be your current approach). But this doesn't concern numpy itself, this is about striking a balance between available resources and performance.
What you are doing is very similar to dot product. Consider these two binary arrays:
1 0 1 0 1 1 0 0
0 0 1 1 0 1 0 1
We are trying to find the number of different pairs. If you directly take the dot product, it gives you the number of (1, 1) pairs. However, if you negate one of them, it will count the different ones. For example, a1.dot(1-a2) counts (1, 0) pairs. Since we also need the number of (0, 1) pairs, we will add a2.dot(1-a1) to that. The good thing about dot product is that it is pretty fast. However, you will need to convert your arrays to floats first, as Divakar pointed out.
Here's a demo:
prng = np.random.RandomState(0)
arr1 = prng.binomial(1, 0.3, (1000, 3136))
arr2 = prng.binomial(1, 0.3, (2000, 3136))
res1 = hamming_distance2(arr1, arr2)
arr1 = arr1.astype('float32'); arr2 = arr2.astype('float32')
res2 = (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
np.allclose(res1, res2)
Out: True
And timings:
%timeit hamming_distance(arr1, arr2)
1 loop, best of 3: 13.9 s per loop
%timeit hamming_distance2(arr1, arr2)
1 loop, best of 3: 5.01 s per loop
%timeit (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
10 loops, best of 3: 93.1 ms per loop
I have a numpy array embed_vec of length tot_vec in which each entry is a 3d vector:
[[ 0.52483319 0.78015841 0.71117216]
[ 0.53041481 0.79462171 0.67234534]
[ 0.53645428 0.80896727 0.63119403]
...,
[ 0.72283509 0.40070804 0.15220522]
[ 0.71277758 0.38498613 0.16141834]
[ 0.70221445 0.36918032 0.17370776]]
For each of the elements in this array, I want to find out the number of other entries which are "close" to that entry. By close, I mean that the distance between two vectors is less than a specified value R. For this, I must compare all the possible pairs in this array with each other and then find out the number of close vectors for each of the vectors in the array. So I am doing this:
p = np.zeros(tot_vec) # This contains the number of close vectors
for i in range(tot_vec-1):
for j in range(i+1, tot_vec):
if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
p[i] += 1
However, this is extremely inefficient because I have two nested python loops and for larger array sizes, this takes forever. If this were in C++ or Fortran, it wouldn't have been a great issue. My question is, can one achieve the same thing using numpy efficiently using some vectorization method? As a side note, I don't mind a solution using Pandas also.
Approach #1 : Vectorized approach -
def vectorized_app(embed_vec, R):
tot_vec = embed_vec.shape[0]
r,c = np.triu_indices(tot_vec,1)
subs = embed_vec[r] - embed_vec[c]
dists = np.einsum('ij,ij->i',subs,subs)
return np.bincount(r,dists<R**2,minlength=tot_vec)
Approach #2 : With less loop complexity (for very large arrays) -
def loopy_less_app(embed_vec, R):
tot_vec = embed_vec.shape[0]
Rsq = R**2
out = np.zeros(tot_vec,dtype=int)
for i in range(tot_vec):
subs = embed_vec[i] - embed_vec[i+1:tot_vec]
dists = np.einsum('ij,ij->i',subs,subs)
out[i] = np.count_nonzero(dists < Rsq)
return out
Benchmarking
Original approach -
def loopy_app(embed_vec, R):
tot_vec = embed_vec.shape[0]
p = np.zeros(tot_vec) # This contains the number of close vectors
for i in range(tot_vec-1):
for j in range(i+1, tot_vec):
if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
p[i] += 1
return p
Timings -
In [76]: # Sample random array
...: embed_vec = np.random.rand(3000,3)
...: R = 0.5
...:
In [77]: %timeit loopy_app(embed_vec, R)
1 loops, best of 3: 50.5 s per loop
In [78]: %timeit loopy_less_app(embed_vec, R)
10 loops, best of 3: 143 ms per loop
350x+ speedup there!
Going with much bigger array with the proposed loopy_less_app -
In [81]: # Sample random array
...: embed_vec = np.random.rand(20000,3)
...: R = 0.5
...:
In [82]: %timeit loopy_less_app(embed_vec, R)
1 loops, best of 3: 4.47 s per loop
I am intrigued by that question and attempted to solve it efficintly using scipy's cKDTree. However, this approach may run out of memory because internally a list of all pairs with distance <= R is maintained. If your R and tot_vec are small enough it will work:
import numpy as np
from scipy.spatial import cKDTree as KDTree
tot_vec = 60000
embed_vec = np.random.randn(tot_vec, 3)
R = 0.1
tree = KDTree(embed_vec, leafsize=100)
p = np.zeros(tot_vec)
for pair in tree.query_pairs(R):
p[pair[0]] += 1
p[pair[1]] += 1
In case memory is an issue, with some effort it is possible to rewrite query_pairs as a generator function in Python at the cost of C performance.
first broadcast the difference:
disp_vecs=tot_vec[:,None,:]-tot_vec[None,:,:]
Now, depending on how big your dataset is, you may want to do a fist pass without all the math. If the distance is less than r, all the components should be less than r
first_mask=np.max(disp_vec, axis=-1)<r
Then do the actual calculation
disps=np.linlg.norm(disp_vec[first_mask],axis=-1)
second_mask=disps<r
Now reassign
disps=disps[second_mask]
first_mask[first_mask]=second_mask
disps are now the good values, and first_mask is a boolean mask of where they go. You can process from there.
I would like to speed up this code :
import numpy as np
import pandas as pd
a = pd.read_csv(path)
closep = a['Clsprc']
delta = np.array(closep.diff())
upgain = np.where(delta >= 0, delta, 0)
downloss = np.where(delta <= 0, -delta, 0)
up = sum(upgain[0:14]) / 14
down = sum(downloss[0:14]) / 14
u = []
d = []
for x in np.nditer(upgain[14:]):
u1 = 13 * up + x
u.append(u1)
up = u1
for y in np.nditer(downloss[14:]):
d1 = 13 * down + y
d.append(d1)
down = d1
The data below:
0 49.00
1 48.76
2 48.52
3 48.28
...
36785758 13.88
36785759 14.65
36785760 13.19
Name: Clsprc, Length: 36785759, dtype: float64
The for loop is too slow, what can I do to speed up this code? Can I vectorize the entire operation?
It looks like you're trying to calculate an exponential moving average (rolling mean), but forgot the division. If that's the case then you may want to see this SO question. Meanwhile, here's a fast a simple moving average using the cumsum() function taken from the referenced link.
def moving_average(a, n=14) :
ret = np.cumsum(a, dtype=float)
ret[n:] = ret[n:] - ret[:-n]
return ret[n - 1:] / n
If this is not the case, and you really want the function described, you can increase the iteration speed by getting using the external_loop flag in your iteration. From the numpy documentation:
The nditer will try to provide chunks that are as large as possible to
the inner loop. By forcing ‘C’ and ‘F’ order, we get different
external loop sizes. This mode is enabled by specifying an iterator
flag.
Observe that with the default of keeping native memory order, the
iterator is able to provide a single one-dimensional chunk, whereas
when forcing Fortran order, it has to provide three chunks of two
elements each.
for x in np.nditer(upgain[14:], flags=['external_loop'], order='F'):
# x now has x[0],x[1], x[2], x[3], x[4], x[5] elements.
In simplified terms, I think this is what the loops are doing:
upgain=np.array([.1,.2,.3,.4])
u=[]
up=1
for x in upgain:
u1=10*up+x
u.append(u1)
up=u1
producing:
[10.1, 101.2, 1012.3, 10123.4]
np.cumprod([10,10,10,10]) is there, plus a modified cumsum for the [.1,.2,.3,.4] terms. But I can't off hand think of a way of combining these with compiled numpy functions. We could write a custom ufunc, and use its accumulate. Or we could write it in cython (or other c interface).
https://stackoverflow.com/a/27912352 suggests that frompyfunc is a way of writing a generalized accumulate. I don't expect big time savings, maybe 2x.
To use frompyfunc, define:
def foo(x,y):return 10*x+y
The loop application (above) would be
def loopfoo(upgain,u,u1):
for x in upgain:
u1=foo(u1,x)
u.append(u1)
return u
The 'vectorized' version would be:
vfoo=np.frompyfunc(foo,2,1) # 2 in arg, 1 out
vfoo.accumulate(upgain,dtype=object).astype(float)
The dtype=object requirement was noted in the prior SO, and https://github.com/numpy/numpy/issues/4155
In [1195]: loopfoo([1,.1,.2,.3,.4],[],0)
Out[1195]: [1, 10.1, 101.2, 1012.3, 10123.4]
In [1196]: vfoo.accumulate([1,.1,.2,.3,.4],dtype=object)
Out[1196]: array([1.0, 10.1, 101.2, 1012.3, 10123.4], dtype=object)
For this small list, loopfoo is faster (3µs v 21µs)
For a 100 element array, e.g. biggain=np.linspace(.1,1,100), the vfoo.accumulate is faster:
In [1199]: timeit loopfoo(biggain,[],0)
1000 loops, best of 3: 281 µs per loop
In [1200]: timeit vfoo.accumulate(biggain,dtype=object)
10000 loops, best of 3: 57.4 µs per loop
For an even larger biggain=np.linspace(.001,.01,1000) (smaller number to avoid overflow), the 5x speed ratio remains.
I have a function
def getSamples():
p = lambda x : mlab.normpdf(x,3,2) + mlab.normpdf(x,-5,1)
q = lambda x : mlab.normpdf(x,5,14)
k=30
goodSamples = []
rightCount = 0
totalCount = 0
while(rightCount < 100000):
z0 = np.random.normal(5, 14)
u0 = np.random.uniform(0,k*q(z0))
if(p(z0) > u0):
goodSamples.append(z0)
rightCount += 1
totalCount += 1
return np.array(goodSamples)
My implementation to generate 100000 samples is taking much long. How can I make it fast with itertools or something similar?
I would say that the secret to making this code faster does not lie in changing the loop syntax. Here are a few points:
np.random.normal has an additional parameter size that lets you get many values at once. I would suggest using an array of say 1E09 elements and then checking your condition on that for how many are good. You can then estimate how likely that is.
To create your uniform samples, why not use sympy for symbolic evaluation of the pdf? (I don't know if this is faster but it could be since you already know the mean and variance.)
Again, for p could you use a symbolic function?
In general, performance problems are caused by doing things the "wrong way". Numpy can be very fast when used as it is designed to be used, that is by exploiting its vector processing where these vectorized operations are handed off to compiled code. Two bad practices that come from other programing languages/approaches are
Loops: Whenever you think you need a loop stop and think. Most of the time you do not and in fact do not even want one. It is much faster both to write and run code without loops.
Memory allocation: Whenever you know the size of an object, preallocate space for it. Growing memory, particularly in Python lists, is very slow compared to the alternatives.
In this case it is easy to get (approximately) two orders of magnitude speedup; the tradeoff is more memory usage.
Below is some representative code, it is not meant to be blindly used. I have not even verified it produces the correct results. It is more or less a direct translation of your routine. It appears you are drawing random numbers from a probability distribution using the rejection method. There may be more efficient algorithms to do this for your probability distribution.
def getSamples2() :
p = lambda x : mlab.normpdf(x,3,2) + mlab.normpdf(x,-5,1)
q = lambda x : mlab.normpdf(x,5,14)
k=30
N = 100000 # Total number of samples we want
Ngood = 0 # Current number of good samples
goodSamples = np.zeros(N) # Storage for the good samples
while Ngood < N : # Unfortunately a loop, ....
z0 = np.random.normal(5, 14, size=N)
u0 = np.random.uniform(size=N)*k*q(z0)
ind, = np.where(p(z0) > u0)
n = min(len(ind), N-Ngood)
goodSamples[Ngood:Ngood+n] = z0[ind[:n]]
Ngood += n
return goodSamples
This generates random numbers in chunks and saves the good ones. I have not tried to optimize the chunk size (here I just use N, the total number we want, in principle this could/should be different and could even be adjusted based on the number we have left to generate). This still uses a loop, unfortunately, but now this will be run "tens" of times instead of 100,000 times. This also uses the where function and array slicing; these are good general tools to be comfortable with.
In one test with %timeit on my machine I found
In [27]: %timeit getSamples() # Original routine
1 loops, best of 3: 49.3 s per loop
In [28]: %timeit getSamples2()
1 loops, best of 3: 505 ms per loop
Here is kinda itertools "magic", but I'm not sure it can help. Probably it's much better for perfomance to prepare an numpy array (using zeros) and fill it without creating python auto-growing list. Here is both itertools and zero-preparations. (Excuse me in advance for untested code)
from itertools import count, ifilter, imap, takewhile
import operator
def getSamples():
p = lambda x : mlab.normpdf(x, 3, 2) + mlab.normpdf(x, -5, 1)
q = lambda x : mlab.normpdf(x, 5, 14)
k = 30
n = 100000
samples_iter = imap(
operator.itemgetter(1),
takewhile(
lambda i, s: i < n,
enumerate(
ifilter(lambda z: p(z) > np.random.uniform(0,k*q(z)),
(np.random.normal(5, 14) for _ in count()))
)))
goodSamples = numpy.zeros(n)
# set values from iterator, probably there is a better way for that
for i, sample in enumerate(samples_iter):
goodSamples[i] = sample
return goodSamples
Input data
Produce n matrices of a given size (here, 3x2). I also chose n=25, but I let n to lay the emphasis on the fact that what we have is a bunch of matrices.
import numpy as np
n = 25
data = np.random.rand(n, 3, 2)
This is just a format example : I can't change it. Or if I do, one must take into account the computational cost of this change.
Current implementation
What I want to achieve atomically is:
output = []
for datum in data: # This outputs on (3x2) matrix after the other
d0 = datum[0]
dr = datum[1:]
output.append(dr-d0)
or, in a faster fashion:
output = [dr-d0 for (dr, d0) in zip(datum[:,0], datum[:,1:])]
Problem
This is too slow and:
output = datum[:,1:] - datum[:,0]
does not work since the behavior of the subtraction operation is not well defined in that case. Plus, this kind of slicing is not very efficient.
Cython/Nuitka/PyPy and the likes are possible solutions, but I'd like to stick with raw Numpy for now, if possible. Maybe some kind of function that can be applied on elements of the outer loop of a numpy array very quickly without the overhead of python stuff...
The np.vectorize function doesn't work on:
def get_diff(mat):
return mat[1:] - mat[0]
So I invoke ye, High Priests of Numpy, servants of Python to enlighten my poor soul!
EDIT:
XY Problem
(I didn't know it had a name)
What I actually want to do is to determine the content (read "volume") of a lot of simplices (read "tetrahedra"). The easiest and most efficient way to do it, AFAIK is to calculate:
np.linalg.det(mat[:1]-mat[0])
Then let me rephrase my question: how can I efficiently compute the content of any ensemble of simplices of dimension k using plain python and numpy?
I suggest data[:,1:] - data[:,0,None]. The None creates a new axis (officially you're supposed to use np.newaxis, which makes it very clear what you're doing), and then the subtraction will behave the way you want it to.
Correcting what I think are errors in your list comprehension:
def loop(data):
output = []
for datum in data: # This outputs on (3x2) matrix after the other
d0 = datum[0]
dr = datum[1:]
output.append(dr-d0)
return output
def listcomp(data):
output = [dr-d0 for (d0, dr) in zip(data[:,0], data[:,1:])]
return output
def sub(data):
output = data[:,1:] - data[:,0,None]
return output
we have
>>> import numpy as np
>>> n = 25
>>> data = np.random.rand(n, 3, 2)
>>> res_loop = loop(data)
>>> res_listcomp = listcomp(data)
>>> res_sub = sub(data)
>>> np.allclose(res_loop, res_listcomp)
True
>>> np.allclose(res_loop, res_sub)
True
>>>
>>> %timeit loop(data)
10000 loops, best of 3: 184 µs per loop
>>> %timeit listcomp(data)
10000 loops, best of 3: 158 µs per loop
>>> %timeit sub(data)
100000 loops, best of 3: 12.8 µs per loop