fast python matrix creation and iteration

fast python matrix creation and iteration - python

I need to create a matrix starting from the values of a weight matrix. Which is the best structure to hold the matrix in term of speed both when creating and iterating over it? I was thinking about a list of lists or a numpy 2D array but they both seem slow to me.
What I need:
numpy array
A = np.zeros((dim, dim))
for r in range(A.shape[0]):
for c in range(A.shape[0]):
if(r==c):
A.itemset(node_degree[r])
else:
A.itemset(arc_weight[r,c])
or
list of lists
l = []
for r in range(dim):
l.append([])
for c in range(dim):
if(i==j):
l[i].append(node_degree[r])
else:
l[i].append(arc_weight[r,c])
where dim can be also 20000 , node_degree is a vector and arc_weight is another matrix. I wrote it in c++, it takes less less than 0.5 seconds while the others two in python more than 20 seconds. I know python is not c++ but I need to be as fast as possible.
Thank you all.

One thing is you shouldn't be appending to the list if you already know it's size.
Preallocate the memory first using list comprehension and generate the r, c values using xrange() instead of range() since you are using Python < 3.x (see here):
l = [[0 for c in xrange(dim)] for r in xrange(dim)]
Better yet, you can build what you need in one shot using:
l = [[node_degree[r] if r == c else arc_weight[r,c]
for c in xrange(dim)] for r in xrange(dim)]
Compared to your original implementation, this should use less memory (because of the xrange() generators), and less time because you remove the need to reallocating memory by specifying the dimensions up front.

Numpy matrices are generally faster as they know their dimensions and entry type.
In your particular situation, since you already have the arc_weight and node_degree matrices created so you can create your matrix directly from arc_weight and then replace the diagonal:
A = np.matrix(arc_matrix)
np.fill_diagonal(A, node_degree)
Another option is to replace the double loop with a function that puts the right element in each position and create a matrix from the function:
def fill_matrix(r, c):
return arc_weight[r,c] if r != c else node_degree[r]
A = np.fromfunction(fill_matrix, (dim, dim))
As a rule of thumb, with numpy you must avoid loops at all costs. First method should be faster but you should profile both to see what works for you. You should also take into account that you seem to be duplicating your data set in memory, so if it is really huge you might get in trouble. Best idea would be to create your matrix directly avoiding arc_weight and node_degree altogether.
Edit: Some simple time comparisons between list comprehension and numpy matrix creation. Since I don't know how your arc_weight and node_degree are defined, I just made up two random functions. It seems that numpy.fromfunction complains a bit if the function has a conditional on it, so I construct the matrix in two steps.
import numpy as np
def arc_weight(a,b):
return a+b
def node_degree(a):
return a*a
def create_as_list(N):
return [[arc_weight(c,r) if c!=r else node_degree(c) for c in xrange(N)] for r in xrange(N)]
def create_as_numpy(N):
A = np.fromfunction(arc_weight, (N,N))
np.fill_diagonal(A, node_degree(np.arange(N)))
return A
And here the timings for N=2000:
time A = create_as_list(2000)
CPU times: user 839 ms, sys: 16.5 ms, total: 856 ms
Wall time: 845 ms
time A = create_as_numpy(2000)
CPU times: user 83.1 ms, sys: 12.9 ms, total: 96 ms
Wall time: 95.3 ms

Make a copy of arc_weight and fill the diagonal with values from node_degree. For a 20000-by-20000 output, it takes about 1.6 seconds on my machine:
>>> import numpy
>>> dim = 20000
>>> arc_weight = numpy.arange(dim**2).reshape([dim, dim])
>>> node_degree = numpy.arange(dim)
>>> import timeit
>>> timeit.timeit('''
... A = arc_weight.copy()
... A.flat[::dim+1] = node_degree
... ''', '''
... from __main__ import dim, arc_weight, node_degree''',
... number=1)
1.6081738501125764
Once you have your array, try not to iterate over it. Compared to broadcasted operators and NumPy built-in functions, Python-level loops are a performance disaster.

Related

how to speed up loop in numpy?

I would like to speed up this code :
import numpy as np
import pandas as pd
a = pd.read_csv(path)
closep = a['Clsprc']
delta = np.array(closep.diff())
upgain = np.where(delta >= 0, delta, 0)
downloss = np.where(delta <= 0, -delta, 0)
up = sum(upgain[0:14]) / 14
down = sum(downloss[0:14]) / 14
u = []
d = []
for x in np.nditer(upgain[14:]):
u1 = 13 * up + x
u.append(u1)
up = u1
for y in np.nditer(downloss[14:]):
d1 = 13 * down + y
d.append(d1)
down = d1
The data below:
0 49.00
1 48.76
2 48.52
3 48.28
...
36785758 13.88
36785759 14.65
36785760 13.19
Name: Clsprc, Length: 36785759, dtype: float64
The for loop is too slow, what can I do to speed up this code? Can I vectorize the entire operation?

It looks like you're trying to calculate an exponential moving average (rolling mean), but forgot the division. If that's the case then you may want to see this SO question. Meanwhile, here's a fast a simple moving average using the cumsum() function taken from the referenced link.
def moving_average(a, n=14) :
ret = np.cumsum(a, dtype=float)
ret[n:] = ret[n:] - ret[:-n]
return ret[n - 1:] / n
If this is not the case, and you really want the function described, you can increase the iteration speed by getting using the external_loop flag in your iteration. From the numpy documentation:
The nditer will try to provide chunks that are as large as possible to
the inner loop. By forcing ‘C’ and ‘F’ order, we get different
external loop sizes. This mode is enabled by specifying an iterator
flag.
Observe that with the default of keeping native memory order, the
iterator is able to provide a single one-dimensional chunk, whereas
when forcing Fortran order, it has to provide three chunks of two
elements each.
for x in np.nditer(upgain[14:], flags=['external_loop'], order='F'):
# x now has x[0],x[1], x[2], x[3], x[4], x[5] elements.

In simplified terms, I think this is what the loops are doing:
upgain=np.array([.1,.2,.3,.4])
u=[]
up=1
for x in upgain:
u1=10*up+x
u.append(u1)
up=u1
producing:
[10.1, 101.2, 1012.3, 10123.4]
np.cumprod([10,10,10,10]) is there, plus a modified cumsum for the [.1,.2,.3,.4] terms. But I can't off hand think of a way of combining these with compiled numpy functions. We could write a custom ufunc, and use its accumulate. Or we could write it in cython (or other c interface).
https://stackoverflow.com/a/27912352 suggests that frompyfunc is a way of writing a generalized accumulate. I don't expect big time savings, maybe 2x.
To use frompyfunc, define:
def foo(x,y):return 10*x+y
The loop application (above) would be
def loopfoo(upgain,u,u1):
for x in upgain:
u1=foo(u1,x)
u.append(u1)
return u
The 'vectorized' version would be:
vfoo=np.frompyfunc(foo,2,1) # 2 in arg, 1 out
vfoo.accumulate(upgain,dtype=object).astype(float)
The dtype=object requirement was noted in the prior SO, and https://github.com/numpy/numpy/issues/4155
In [1195]: loopfoo([1,.1,.2,.3,.4],[],0)
Out[1195]: [1, 10.1, 101.2, 1012.3, 10123.4]
In [1196]: vfoo.accumulate([1,.1,.2,.3,.4],dtype=object)
Out[1196]: array([1.0, 10.1, 101.2, 1012.3, 10123.4], dtype=object)
For this small list, loopfoo is faster (3µs v 21µs)
For a 100 element array, e.g. biggain=np.linspace(.1,1,100), the vfoo.accumulate is faster:
In [1199]: timeit loopfoo(biggain,[],0)
1000 loops, best of 3: 281 µs per loop
In [1200]: timeit vfoo.accumulate(biggain,dtype=object)
10000 loops, best of 3: 57.4 µs per loop
For an even larger biggain=np.linspace(.001,.01,1000) (smaller number to avoid overflow), the 5x speed ratio remains.

Most memory-efficient way to compute abs()**2 of complex numpy ndarray

I'm looking for the most memory-efficient way to compute the absolute squared value of a complex numpy ndarray
arr = np.empty((250000, 150), dtype='complex128') # common size
I haven't found a ufunc that would do exactly np.abs()**2.
As an array of that size and type takes up around half a GB, I'm looking for a primarily memory-efficient way.
I would also like it to be portable, so ideally some combination of ufuncs.
So far my understanding is that this should be about the best
result = np.abs(arr)
result **= 2
It will needlessly compute (**0.5)**2, but should compute **2 in-place. Altogether the peak memory requirement is only the original array size + result array size, which should be 1.5 * original array size as the result is real.
If I wanted to get rid of the useless **2 call I'd have to do something like this
result = arr.real**2
result += arr.imag**2
but if I'm not mistaken, this means I'll have to allocate memory for both the real and imaginary part calculation, so the peak memory usage would be 2.0 * original array size. The arr.real properties also return a non-contiguous array (but that is of lesser concern).
Is there anything I'm missing? Are there any better ways to do this?
EDIT 1:
I'm sorry for not making it clear, I don't want to overwrite arr, so I can't use it as out.

Thanks to numba.vectorize in recent versions of numba, creating a numpy universal function for the task is very easy:
#numba.vectorize([numba.float64(numba.complex128),numba.float32(numba.complex64)])
def abs2(x):
return x.real**2 + x.imag**2
On my machine, I find a threefold speedup compared to a pure-numpy version that creates intermediate arrays:
>>> x = np.random.randn(10000).view('c16')
>>> y = abs2(x)
>>> np.all(y == x.real**2 + x.imag**2) # exactly equal, being the same operation
True
>>> %timeit np.abs(x)**2
10000 loops, best of 3: 81.4 µs per loop
>>> %timeit x.real**2 + x.imag**2
100000 loops, best of 3: 12.7 µs per loop
>>> %timeit abs2(x)
100000 loops, best of 3: 4.6 µs per loop

EDIT: this solution has twice the minimum memory requirement, and is just marginally faster. The discussion in the comments is good for reference however.
Here's a faster solution, with the result stored in res:
import numpy as np
res = arr.conjugate()
np.multiply(arr,res,out=res)
where we exploited the property of the abs of a complex number, i.e. abs(z) = sqrt(z*z.conjugate), so that abs(z)**2 = z*z.conjugate

If your primary goal is to conserve memory, NumPy's ufuncs take an optional out parameter that lets you direct the output to an array of your choosing. It can be useful when you want to perform operations in place.
If you make this minor modification to your first method, then you can perform the operation on arr completely in place:
np.abs(arr, out=arr)
arr **= 2
One convoluted way that only uses a little extra memory could be to modify arr in place, compute the new array of real values and then restore arr.
This means storing information about the signs (unless you know that your complex numbers all have positive real and imaginary parts). Only a single bit is needed for the sign of each real or imaginary value, so this uses 1/16 + 1/16 == 1/8 the memory of arr (in addition to the new array of floats you create).
>>> signs_real = np.signbit(arr.real) # store information about the signs
>>> signs_imag = np.signbit(arr.imag)
>>> arr.real **= 2 # square the real and imaginary values
>>> arr.imag **= 2
>>> result = arr.real + arr.imag
>>> arr.real **= 0.5 # positive square roots of real and imaginary values
>>> arr.imag **= 0.5
>>> arr.real[signs_real] *= -1 # restore the signs of the real and imagary values
>>> arr.imag[signs_imag] *= -1
At the expense of storing signbits, arr is unchanged and result holds the values we want.

arr.real and arr.imag are only views into the complex array. So no additional memory is allocated.

If you don't want sqrt (what should be much heavier than multiply), then no abs.
If you don't want double memory, then no real**2 + imag**2
Then you might try this (use indexing trick)
N0 = 23
np0 = (np.random.randn(N0) + 1j*np.random.randn(N0)).astype(np.complex128)
ret_ = np.abs(np0)**2
tmp0 = np0.view(np.float64)
ret0 = np.matmul(tmp0.reshape(N0,1,2), tmp0.reshape(N0,2,1)).reshape(N0)
assert np.abs(ret_-ret0).max()<1e-7
Anyway, I prefer the numba solution

Faster looping with itertools

I have a function
def getSamples():
p = lambda x : mlab.normpdf(x,3,2) + mlab.normpdf(x,-5,1)
q = lambda x : mlab.normpdf(x,5,14)
k=30
goodSamples = []
rightCount = 0
totalCount = 0
while(rightCount < 100000):
z0 = np.random.normal(5, 14)
u0 = np.random.uniform(0,k*q(z0))
if(p(z0) > u0):
goodSamples.append(z0)
rightCount += 1
totalCount += 1
return np.array(goodSamples)
My implementation to generate 100000 samples is taking much long. How can I make it fast with itertools or something similar?

I would say that the secret to making this code faster does not lie in changing the loop syntax. Here are a few points:
np.random.normal has an additional parameter size that lets you get many values at once. I would suggest using an array of say 1E09 elements and then checking your condition on that for how many are good. You can then estimate how likely that is.
To create your uniform samples, why not use sympy for symbolic evaluation of the pdf? (I don't know if this is faster but it could be since you already know the mean and variance.)
Again, for p could you use a symbolic function?

In general, performance problems are caused by doing things the "wrong way". Numpy can be very fast when used as it is designed to be used, that is by exploiting its vector processing where these vectorized operations are handed off to compiled code. Two bad practices that come from other programing languages/approaches are
Loops: Whenever you think you need a loop stop and think. Most of the time you do not and in fact do not even want one. It is much faster both to write and run code without loops.
Memory allocation: Whenever you know the size of an object, preallocate space for it. Growing memory, particularly in Python lists, is very slow compared to the alternatives.
In this case it is easy to get (approximately) two orders of magnitude speedup; the tradeoff is more memory usage.
Below is some representative code, it is not meant to be blindly used. I have not even verified it produces the correct results. It is more or less a direct translation of your routine. It appears you are drawing random numbers from a probability distribution using the rejection method. There may be more efficient algorithms to do this for your probability distribution.
def getSamples2() :
p = lambda x : mlab.normpdf(x,3,2) + mlab.normpdf(x,-5,1)
q = lambda x : mlab.normpdf(x,5,14)
k=30
N = 100000 # Total number of samples we want
Ngood = 0 # Current number of good samples
goodSamples = np.zeros(N) # Storage for the good samples
while Ngood < N : # Unfortunately a loop, ....
z0 = np.random.normal(5, 14, size=N)
u0 = np.random.uniform(size=N)*k*q(z0)
ind, = np.where(p(z0) > u0)
n = min(len(ind), N-Ngood)
goodSamples[Ngood:Ngood+n] = z0[ind[:n]]
Ngood += n
return goodSamples
This generates random numbers in chunks and saves the good ones. I have not tried to optimize the chunk size (here I just use N, the total number we want, in principle this could/should be different and could even be adjusted based on the number we have left to generate). This still uses a loop, unfortunately, but now this will be run "tens" of times instead of 100,000 times. This also uses the where function and array slicing; these are good general tools to be comfortable with.
In one test with %timeit on my machine I found
In [27]: %timeit getSamples() # Original routine
1 loops, best of 3: 49.3 s per loop
In [28]: %timeit getSamples2()
1 loops, best of 3: 505 ms per loop

Here is kinda itertools "magic", but I'm not sure it can help. Probably it's much better for perfomance to prepare an numpy array (using zeros) and fill it without creating python auto-growing list. Here is both itertools and zero-preparations. (Excuse me in advance for untested code)
from itertools import count, ifilter, imap, takewhile
import operator
def getSamples():
p = lambda x : mlab.normpdf(x, 3, 2) + mlab.normpdf(x, -5, 1)
q = lambda x : mlab.normpdf(x, 5, 14)
k = 30
n = 100000
samples_iter = imap(
operator.itemgetter(1),
takewhile(
lambda i, s: i < n,
enumerate(
ifilter(lambda z: p(z) > np.random.uniform(0,k*q(z)),
(np.random.normal(5, 14) for _ in count()))
)))
goodSamples = numpy.zeros(n)
# set values from iterator, probably there is a better way for that
for i, sample in enumerate(samples_iter):
goodSamples[i] = sample
return goodSamples

mongodb to python sparse matrix, how to make it faster?

I have n documents in MongoDB containing a scipy sparse vector, stored as a pickle object and initially created with scipy.sparse.lil. The vectors are all of the same size, say p x 1.
What I need to do is to put all these vectors into a sparse n x p matrix back in python. I am using mongoengine and thus defined a property to load each pickle vector:
class MyClass(Document):
vector_text = StringField()
#property
def vector(self):
return cPickle.loads(self.vector_text)
Here's what I'm doing now, with n = 4700 and p = 67:
items = MyClass.objects()
M = items[0].vector
for item in items[1:]:
to_add = item.vector
M = scipy.sparse.hstack((M, to_add))
The loading part (i.e. calling n times the property) takes about 1.3s. The stacking part about 2.7s. Since in the future n is going to seriously increase (possibly more than a few hundred thousands), I sense that this is not optimal :)
Any idea to speed the whole thing up? If you know how to fasten the "loading" or the "stacking" only, I'm happy to hear it. For instance maybe the solution is to store the entire matrix in mongoDB? Thanks !

First, what you describe you want to do would require you using vstack, not hstack. In any case, your choice of sparse format is part of your performance problem. Try the following:
n, p = 4700, 67
csr_vecs = [sps.rand(1, p, density=0.5, format='csr') for j in xrange(n)]
lil_vecs = [vec.tolil() for vec in csr_vecs]
%timeit sps.vstack(csr_vecs, format='csr')
1 loops, best of 3: 722 ms per loop
%timeit sps.vstack(lil_vecs, format='lil')
1 loops, best of 3: 1.34 s per loop
So there's already a 2x improvement simply from swithcing to CSR. Furthermore, the stacking functions of scipy.sparse do not seem to be very optimized, definitely not for sparse vectors. The following two functions stack a list of CSR or LIL vectors, returning a CSR sparse matrix:
def csr_stack(vectors):
data = np.concatenate([vec.data for vec in vectors])
indices = np.concatenate([vec.indices for vec in vectors])
indptr = np.cumsum([0] + [vec.nnz for vec in vectors])
return sps.csr_matrix((data, indices, indptr), shape=(len(vectors),
vectors[0].shape[1]))
import itertools as it
def lil_stack(vectors):
indptr = np.cumsum([0] + [vec.nnz for vec in vectors])
data = np.fromiter(it.chain(*(vec.data[0] for vec in vectors)),
dtype=vectors[0].dtype, count=indptr[-1])
indices = np.fromiter(it.chain(*(vec.rows[0] for vec in vectors)),
dtype=np.intp, count=indptr[-1])
return sps.csr_matrix((data, indices, indptr), shape=(len(vectors),
vectors[0].shape[1]))
It works:
>>> np.allclose(sps.vstack(csr_vecs).A, csr_stack(csr_vecs).A)
True
>>> np.allclose(csr_stack(csr_vecs).A, lil_stack(lil_vecs).A)
True
And is substantially faster:
%timeit csr_stack(csr_vecs)
100 loops, best of 3: 11.7 ms per loop
%timeit lil_stack(lil_vecs)
10 loops, best of 3: 37.6 ms per loop
%timeit lil_stack(lil_vecs).tolil()
10 loops, best of 3: 53.6 ms per loop
So, by switching to CSR, you can improve performance by over 100x. If you stick with LIL, your performance improvement will be only around 30x, more if you can live with CSR in the combined matrix, less if you insist on LIL.

I think, you should try to use ListField, which is essentially a python list representation of BSON array, to store your vectors. In that situation, you won't need to unpickle them every time.
class MyClass(Document):
vector = ListField()
items = MyClass.objects()
M = items[0].vector
The only problem I can see in that solution, is that you have to convert python lists to scipy sparse vector type, but I believe, that should be faster.

Fastest Way to generate 1,000,000+ random numbers in python

I am currently writing an app in python that needs to generate large amount of random numbers, FAST. Currently I have a scheme going that uses numpy to generate all of the numbers in a giant batch (about ~500,000 at a time). While this seems to be faster than python's implementation. I still need it to go faster. Any ideas? I'm open to writing it in C and embedding it in the program or doing w/e it takes.
Constraints on the random numbers:
A Set of 7 numbers that can all have different bounds:
eg: [0-X1, 0-X2, 0-X3, 0-X4, 0-X5, 0-X6, 0-X7]
Currently I am generating a list of 7 numbers with random values from [0-1) then multiplying by [X1..X7]
A Set of 13 numbers that all add up to 1
Currently just generating 13 numbers then dividing by their sum
Any ideas? Would pre calculating these numbers and storing them in a file make this faster?
Thanks!

You can speed things up a bit from what mtrw posted above just by doing what you initially described (generating a bunch of random numbers and multiplying and dividing accordingly)...
Also, you probably already know this, but be sure to do the operations in-place (*=, /=, +=, etc) when working with large-ish numpy arrays. It makes a huge difference in memory usage with large arrays, and will give a considerable speed increase, too.
In [53]: def rand_row_doubles(row_limits, num):
....: ncols = len(row_limits)
....: x = np.random.random((num, ncols))
....: x *= row_limits
....: return x
....:
In [59]: %timeit rand_row_doubles(np.arange(7) + 1, 1000000)
10 loops, best of 3: 187 ms per loop
As compared to:
In [66]: %timeit ManyRandDoubles(np.arange(7) + 1, 1000000)
1 loops, best of 3: 222 ms per loop
It's not a huge difference, but if you're really worried about speed, it's something.
Just to show that it's correct:
In [68]: x.max(0)
Out[68]:
array([ 0.99999991, 1.99999971, 2.99999737, 3.99999569, 4.99999836,
5.99999114, 6.99999738])
In [69]: x.min(0)
Out[69]:
array([ 4.02099599e-07, 4.41729377e-07, 4.33480302e-08,
7.43497138e-06, 1.28446819e-05, 4.27614385e-07,
1.34106753e-05])
Likewise, for your "rows sum to one" part...
In [70]: def rand_rows_sum_to_one(nrows, ncols):
....: x = np.random.random((ncols, nrows))
....: y = x.sum(axis=0)
....: x /= y
....: return x.T
....:
In [71]: %timeit rand_rows_sum_to_one(1000000, 13)
1 loops, best of 3: 455 ms per loop
In [72]: x = rand_rows_sum_to_one(1000000, 13)
In [73]: x.sum(axis=1)
Out[73]: array([ 1., 1., 1., ..., 1., 1., 1.])
Honestly, even if you re-implement things in C, I'm not sure you'll be able to beat numpy by much on this one... I could be very wrong, though!

EDIT Created functions that return the full set of numbers, not just one row at a time.
EDIT 2 Make the functions more pythonic (and faster), add solution for second question
For the first set of numbers, you might consider numpy.random.randint or numpy.random.uniform, which take low and high parameters. Generating an array of 7 x 1,000,000 numbers in a specified range seems to take < 0.7 second on my 2 GHz machine:
def LimitedRandInts(XLim, N):
rowlen = (1,N)
return [np.random.randint(low=0,high=lim,size=rowlen) for lim in XLim]
def LimitedRandDoubles(XLim, N):
rowlen = (1,N)
return [np.random.uniform(low=0,high=lim,size=rowlen) for lim in XLim]
>>> import numpy as np
>>> N = 1000000 #number of randoms in each range
>>> xLim = [x*500 for x in range(1,8)] #convenient limit generation
>>> fLim = [x/7.0 for x in range(1,8)]
>>> aa = LimitedRandInts(xLim, N)
>>> ff = LimitedRandDoubles(fLim, N)
This returns integers in [0,xLim-1] or floats in [0,fLim). The integer version took ~0.3 seconds, the double ~0.66, on my 2 GHz single-core machine.
For the second set, I used #Joe Kingston's suggestion.
def SumToOneRands(NumToSum, N):
aa = np.random.uniform(low=0,high=1.0,size=(NumToSum,N)) #13 rows by 1000000 columns, for instance
s = np.reciprocal(aa.sum(0))
aa *= s
return aa.T #get back to column major order, so aa[k] is the kth set of 13 numbers
>>> ll = SumToOneRands(13, N)
This takes ~1.6 seconds.
In all cases, result[k] gives you the kth set of data.

Try r = 1664525*r + 1013904223
from "an even quicker generator"
in "Numerical Recipes in C" 2nd edition, Press et al., isbn 0521431085, p. 284.
np.random is certainly "more random"; see
Linear congruential generator .
In python, use np.uint32 like this:
python -mtimeit -s '
import numpy as np
r = 1
r = np.array([r], np.uint32)[0] # 316 py -> 16 us np
# python longs can be arbitrarily long, so slow
' '
r = r*1664525 + 1013904223 # NR2 p. 284
'
To generate big blocks at a time:
# initialize --
np.random.seed( ... )
R = np.random.randint( 0, np.iinfo( np.uint32 ).max, size, dtype=np.uint32 )
...
R *= 1664525
R += 1013904223

Making your code run in parallel certainly couldn't hurt. Try adapting it for SMP with Parallel Python

As others have already pointed out, numpy is a very good start, fast and easy to use.
If you need random numbers on a massive scale, consider eas-ecb or rc4. Both can be parallelised, you should reach performance in several GB/s.
achievable numbers posted here

If you have access to multiple cores, the computations can be done in parallel with dask.array:
import dask.array as da
x = da.random.random(size=(rows, cols)).compute()
# .compute is not necessary here, because calculations
# can continue in a lazy form and .compute is used
# on the final result

import random
for i in range(1000000):
print(random.randint(1, 1000000))
Here's a code in Python that you can use to generate one million random numbers, one per line!

Just a quick example of numpy in action:
data = numpy.random.rand(1000000)
No need for loop, you can pass in how many numbers you want to generate.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

fast python matrix creation and iteration - python

Related

how to speed up loop in numpy?

Most memory-efficient way to compute abs()**2 of complex numpy ndarray

Faster looping with itertools

mongodb to python sparse matrix, how to make it faster?

Fastest Way to generate 1,000,000+ random numbers in python

Categories

Resources