I have a program that takes some large NumPy arrays and, based on some outside data, grows them by adding one to randomly selected cells until the array's sum is equal to the outside data. A simplified and smaller version looks like:
import numpy as np
my_array = np.random.random_integers(0, 100, [100, 100])
## Just creating a sample version of the array, then getting its sum:
np.sum(my_array)
499097
So, supposing I want to grow the array until its sum is 1,000,000, and that I want to do so by repeatedly selecting a random cell and adding 1 to it until we hit that sum, I'm doing something like:
import random

diff = 1000000 - np.sum(my_array)
counter = 0
while counter < diff:
    row = random.randrange(0,99)
    col = random.randrange(0,99)
    coordinate = [row, col]
    my_array[coordinate] += 1
    counter += 1
Here row/col pick a random cell in the array, and that cell is grown by 1. This repeats until the number of times 1 has been added to a random cell equals the difference between the original array's sum and the target sum (1,000,000).
However, when I check the result after running this, the sum is always off. In this case, after running it with the same numbers as above:
np.sum(my_array)
99667203
I can't figure out what is accounting for this massive difference. And is there a more pythonic way to go about this?
my_array[coordinate] does not do what you expect. It is selecting multiple rows and adding 1 to all of those entries. You could simply use my_array[row, col] instead.
You could simply write something like:
for _ in range(1000000 - np.sum(my_array)):
    my_array[random.randrange(0, 99), random.randrange(0, 99)] += 1
(or xrange instead of range if using Python 2.x)
Replace my_array[coordinate] with my_array[row][col]. Your method chose two random integers and added 1 to every entry in the rows corresponding to both integers.
Basically you had a minor misunderstanding of how numpy indexes arrays.
Edit: To make this clearer.
The code posted chose two numbers, say 30 and 45, and added 1 to all 100 entries of row 30 and all 100 entries of row 45.
From this you would expect the total sum to be 100,679,697 = 200*(1,000,000 - 499,097) + 499,097
However, when the random integers are identical (say, 45 and 45), only 1 is added to every entry of row 45, not 2, so in that case the sum only jumps by 100.
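A minimal demonstration of the difference, using a hypothetical 3x3 array:
import numpy as np

demo = np.zeros((3, 3), dtype=int)
demo[[0, 2]] += 1   # list index: selects whole rows 0 and 2, so all 6 of their entries grow by 1
demo[[1, 1]] += 1   # repeated index: row 1 is only incremented once, not twice
demo[0, 2] += 1     # row, col pair: increments just the single cell at (0, 2)
print(demo.sum())   # 10, i.e. 6 + 3 + 1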
The problem with your original approach is that you are indexing your array with a list, which is interpreted as a sequence of indices into the row dimension, rather than as separate indices into the row/column dimensions (see here).
Try passing a tuple instead of a list:
coord = row, col
my_array[coord] += 1
A much faster approach would be to find the difference between the sum over the input array and the target value, then generate an array containing the same number of random indices into the array and increment them all in one go, thus avoiding looping in Python:
import numpy as np
def grow_to_target(A, target=1000000, inplace=False):
    if not inplace:
        A = A.copy()
    # how many times do we need to increment A?
    n = target - A.sum()
    # pick n random indices into the flattened array
    idx = np.random.random_integers(0, A.size - 1, n)
    # how many times did we sample each unique index?
    uidx, counts = np.unique(idx, return_counts=True)
    # increment the array counts times at each unique index
    A.flat[uidx] += counts
    return A
For example:
a = np.zeros((100, 100), dtype=int)
b = grow_to_target(a)
print(b.sum())
# 1000000
%timeit grow_to_target(a)
# 10 loops, best of 3: 91.5 ms per loop
I want to generate N arrays of fixed length n of random numbers with numpy, but the arrays must have numbers drawn from different ranges.
So for example, I want to generate N=100 arrays of size n=5 and each array must have its numbers between:
First number between 0 and 10
Second number between 20 and 100
and so on...
First idea that comes to my mind is doing something like:
first=np.random.randint(0,11, 100)
second=np.random.randint(20,101, 100)
...
And then I should nest them. Is there a more efficient way?
I would just put them inside another array and iterate over them by index:
from numpy.random import randint

array_holder = [[] for i in range(N)]  # Get N arrays in a holder
ab_holder = [[a1, b1], [a2, b2]]       # One [low, high] pair per array
for i in range(len(array_holder)):  # You iterate over each array
    [a, b] = [ab_holder[i][0], ab_holder[i][1]]
    for j in range(size):  # size is the amount of elements you want in each array
        array_holder[i].append(randint(a, b))  # Where a is your base and b ends the range
Another possibility. Setting ranges indicates both what the ranges of the individual parts of each array must be and how many there are. size is the number of values to sample in each individual part of an array. N is the size of the Monte-Carlo sample. arrays is the result.
import numpy as np
ranges = [ (0, 10), (20, 100) ]
size = 5
N = 100
arrays = [ ]
for n in range(N):
    one_array = []
    for r in ranges:
        chunk = np.random.randint(*r, size=size)
        one_array.append(chunk)
    arrays.append(one_array)
It might make an appreciable difference to use numpy's append in place of Python's, but I've written it this way to make it easier to read (and to write :)).
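If you want to avoid Python-level loops entirely, np.random.randint also accepts array-valued low and high bounds (NumPy 1.11+), so the whole N x n sample can be drawn in a single call. A minimal sketch, where the (low, high) pairs beyond the first two are made-up placeholders:
import numpy as np

lows = np.array([0, 20, 0, 0, 0])        # per-position lower bounds (last three are placeholders)
highs = np.array([11, 101, 10, 10, 10])  # per-position upper bounds, exclusive
samples = np.random.randint(lows, highs, size=(100, 5))  # 100 arrays of length 5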
So I have a screen of 600 pixels. From pixel 0 to 300, a descending list must contain numbers from 10**6 down to 1. I did it this way:
number = 10**6
numlist = [number]
for i in range(1, 299):
    number -= 10**6/299
    numlist.insert(i, number)
number = 1
numlist.insert(300, number)
For the next 300 pixels it should descend from 1 to 10**-5.
I can't figure out the right way of making this list.
Since you will need floats anyway, you could use numpy.linspace. You only need to specify the first element, the last one and how many elements there should be in the array:
import numpy as np
print(np.linspace(10**6, 1, num=300))
print(np.linspace(1, 10**-5, num=300))
And since you're working with exponents, you might want an exponential distribution:
print(10**np.linspace(6, -5, num=601))
It outputs:
[ 1.00000000e+06 9.58664547e+05 9.19037713e+05 8.81048873e+05
8.44630319e+05 8.09717142e+05 7.76247117e+05 7.44160590e+05
7.13400375e+05 6.83911647e+05 6.55641849e+05 6.28540596e+05
...
1.40173742e-05 1.34379597e-05 1.28824955e-05 1.23499917e-05
1.18394992e-05 1.13501082e-05 1.08809463e-05 1.04311774e-05
1.00000000e-05]
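If you need all 600 values in a single array for the screen, the two linear halves can simply be joined; a minimal sketch (note that the value 1 sits at the seam of both halves):
import numpy as np

top = np.linspace(10**6, 1, num=300)      # first 300 pixels: 10**6 down to 1
bottom = np.linspace(1, 10**-5, num=300)  # next 300 pixels: 1 down to 10**-5
pixels = np.concatenate([top, bottom])    # 600 values, descending throughout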
I am building a function that creates a random 20x20 array consisting of the values 0, 1 and 2. I would then like to iterate through the array and keep a count of how many of each number are in the array. Here is my code:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import random
def my_array():
    rand_array = np.random.randint(0,3,(20,20))
    zeros = 0
    ones = 0
    twos = 0
    for element in rand_array:
        if element == 0:
            zeros += 1
        elif element == 1:
            ones += 1
        else:
            twos += 1
    return rand_array,zeros,ones,twos
print(my_array())
When I eliminate the for loop that tries to iterate through the array, it works fine and prints the array; however, as is, the code gives this error message:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
When you iterate on a multi-dimensional numpy array, you're only iterating over the first dimension. In your example, your element values will be 1-dimensional arrays too!
You could solve the issue with another for loop over the values of the 1-dimensional array, but in numpy code, using for loops is very often a bad idea. You usually want to be using vector operations and operations broadcast across the whole array instead.
In your example, you could do:
rand_array = np.random.randint(0,3,(20,20))
# no loop needed
zeros = np.sum(rand_array == 0)
ones = np.sum(rand_array == 1)
twos = np.sum(rand_array == 2)
The == operator is broadcast over the whole array, producing a boolean array. Then the sum adds up the True values (True is equal to 1 in Python) to get a count.
As already pointed out, you are iterating over the rows, not the elements. And numpy just refuses to evaluate the truth of an array unless the array contains only one element.
Iteration over all elements
If you want to iterate over each element I would suggest using np.nditer. That way you access every element regardless of how many dimensions your array has. You just need to alter this line:
for element in np.nditer(rand_array):
# instead of "for element in rand_array:"
An alternative using a histogram
But I think there is an even better approach: if you have an array containing discrete values (like integers) you could use np.histogram to get your counts.
You need to set up the bins so that every integer will have its own bin:
bins = np.arange(np.min(rand_array)-0.5, np.max(rand_array)+1.5)
# in your case this will give an array containing [-0.5, 0.5, 1.5, 2.5]
This way the histogram will fill the first bin with every value between -0.5 and 0.5 (so every 0 of your array), the second bin with all values between 0.5 and 1.5 (every 1), and so on. Then you call the histogram function to get the counts:
counts, _ = np.histogram(rand_array, bins=bins)
print(counts) # [130 145 125] # So 130 zeros, 145 ones, 125 twos
This approach has the advantage that you don't need to hardcode your values (because the bins are calculated from the array's minimum and maximum).
As indicated in the comments, you don't need to set up the bins as floats. You could use simple integer bins:
bins = np.arange(np.min(rand_array), np.max(rand_array)+2)
# [0 1 2 3]
counts, _ = np.histogram(rand_array, bins=bins)
print(counts) # [130 145 125]
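np.unique can produce the same counts without setting up any bins at all; a minimal sketch:
values, counts = np.unique(rand_array, return_counts=True)
print(values)  # [0 1 2]
print(counts)  # e.g. [130 145 125], the same counts as above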
The for loop iterates through the rows, so you have to insert another loop for every row:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import random
def my_array():
    rand_array = np.random.randint(0,3,(20,20))
    zeros = 0
    ones = 0
    twos = 0
    for element in rand_array:
        for el in element:
            if el == 0:
                zeros += 1
            elif el == 1:
                ones += 1
            else:
                twos += 1
    return rand_array,zeros,ones,twos
print(my_array())
I have a block of code that I need to optimize as much as possible since I have to run it several thousand times.
For a random float, it finds the closest value in one sub-list of a given array and stores the corresponding float (i.e., the one with the same index) from another sub-list of that array. It repeats the process until the sum of the floats stored reaches a certain limit.
Here's the MWE to make it clearer:
import numpy as np
# Define array with two sub-lists.
a = [np.random.uniform(0., 100., 10000), np.random.random(10000)]
# Initialize empty final list.
b = []
# Run until the condition is met.
while (sum(b) < 10000):
    # Draw random [0,1) value.
    u = np.random.random()
    # Find closest value in sub-list a[1].
    idx = np.argmin(np.abs(u - a[1]))
    # Store value located in sub-list a[0].
    b.append(a[0][idx])
The code is reasonably simple but I haven't found a way to speed it up. I tried to adapt the great (and very fast) answer given to a similar question I asked some time ago, to no avail.
OK, here's a slightly left-field suggestion. As I understand it, you are just trying to sample uniformly from the elements in a[0] until you have a list whose sum exceeds some limit.
Although it will be more costly memory-wise, I think you'll probably find it's much faster to generate a large random sample from a[0] first, then take the cumsum and find where it first exceeds your limit.
For example:
import numpy as np
# array of reference float values, equivalent to a[0]
refs = np.random.uniform(0, 100, 10000)
def fast_samp_1(refs, lim=10000, blocksize=10000):
    # sample uniformly from refs
    samp = np.random.choice(refs, size=blocksize, replace=True)
    samp_sum = np.cumsum(samp)
    # find where the cumsum first exceeds your limit
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]

    # # if it's ok to be just under lim rather than just over then this might
    # # be quicker
    # return samp[samp_sum <= lim]
Of course, if the sum of the sample of blocksize elements is < lim then this will fail to give you a sample whose sum is >= lim. You could check whether this is the case, and append to your sample in a loop if necessary.
def fast_samp_2(refs, lim=10000, blocksize=10000):
    samp = np.random.choice(refs, size=blocksize, replace=True)
    samp_sum = np.cumsum(samp)
    # is the sum of our current block of samples >= lim?
    while samp_sum[-1] < lim:
        # if not, we'll sample another block and try again until it is
        newsamp = np.random.choice(refs, size=blocksize, replace=True)
        samp = np.hstack((samp, newsamp))
        samp_sum = np.hstack((samp_sum, np.cumsum(newsamp) + samp_sum[-1]))
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
Note that concatenating arrays is pretty slow, so it would probably be better to make blocksize large enough to be reasonably sure that the sum of a single block will be >= your limit, without being excessively large.
Update
I've adapted your original function a little bit so that its syntax more closely resembles mine.
def orig_samp(refs, lim=10000):
    # Initialize empty final list.
    b = []
    a1 = np.random.random(10000)
    # Run until the condition is met.
    while (sum(b) < lim):
        # Draw random [0,1) value.
        u = np.random.random()
        # Find closest value in sub-list a[1].
        idx = np.argmin(np.abs(u - a1))
        # Store value located in sub-list a[0].
        b.append(refs[idx])
    return b
Here's some benchmarking data.
%timeit orig_samp(refs, lim=10000)
# 100 loops, best of 3: 11 ms per loop
%timeit fast_samp_2(refs, lim=10000, blocksize=1000)
# 10000 loops, best of 3: 62.9 µs per loop
That's a good two orders of magnitude faster. You can do a bit better by reducing the blocksize a fraction - you basically want it to be comfortably larger than the length of the arrays you're getting out. In this case, you know that on average the output will be about 200 elements long, since the mean of a uniform draw between 0 and 100 is 50, and 10000 / 50 = 200.
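As a rough sketch of that rule of thumb, using the names from the functions above, you could derive blocksize from the expected output length and pad it generously:
# expected output length is roughly lim / refs.mean() (about 10000 / 50 = 200 here)
blocksize = int(4 * lim / refs.mean())  # padded by a factor of 4 so one block usually suffices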
Update 2
It's easy to get a weighted sample rather than a uniform sample - you can just pass the p= parameter to np.random.choice:
def weighted_fast_samp(refs, weights=None, lim=10000, blocksize=10000):
    samp = np.random.choice(refs, size=blocksize, replace=True, p=weights)
    samp_sum = np.cumsum(samp)
    # is the sum of our current block of samples >= lim?
    while samp_sum[-1] < lim:
        # if not, we'll sample another block and try again until it is
        newsamp = np.random.choice(refs, size=blocksize, replace=True,
                                   p=weights)
        samp = np.hstack((samp, newsamp))
        samp_sum = np.hstack((samp_sum, np.cumsum(newsamp) + samp_sum[-1]))
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
Write it in Cython. That's going to get you a lot more for a high-iteration operation.
http://cython.org/
One obvious optimization: don't re-calculate the sum on each iteration; accumulate it instead:
b_sum = 0
while b_sum < 10000:
    ....
    idx = np.argmin(np.abs(u - a[1]))
    add_val = a[0][idx]
    b.append(add_val)
    b_sum += add_val
EDIT:
I think some minor improvement (check it out if you feel like it) may be achieved by pre-referencing the sublists before the loop:
a_0 = a[0]
a_1 = a[1]
...
while ...:
    ....
    idx = np.argmin(np.abs(u - a_1))
    b.append(a_0[idx])
It may save some on run time - though I don't believe it will matter that much.
Sort your reference array.
That allows log(n) lookups instead of needing to browse the whole list. (using bisect for example to find the closest elements)
For starters, I reverse a[0] and a[1] to simplify the sort, reordering both sub-lists together so their pairing is preserved:
a = [np.random.random(10000), np.random.uniform(0., 100., 10000)]
a = np.array(a)[:, np.argsort(a[0])]  # sort both rows together, by order of a[0]
Now, a is sorted by order of a[0], meaning that if you are looking for the closest value to an arbitrary number, you can start with a bisect:
import bisect

while (sum(b) < 10000):
    # Draw random [0,1) value.
    u = np.random.random()
    # Find closest value in sub-list a[0].
    idx = bisect.bisect(a[0], u)
    idx = min(idx, len(a[0]) - 1)  # guard against u falling above the largest value
    # now, the closest value is at either idx or idx-1
    if idx != 0 and np.abs(a[0][idx] - u) > np.abs(a[0][idx - 1] - u):
        idx = idx - 1
    # Store value located in sub-list a[1].
    b.append(a[1][idx])
I need to sample uniformly at random a number from a set with fixed size, do some calculation, and put the new number back into the set. (The number of samples needed is very large.)
I've tried to store the numbers in a list and use random.choice() to pick an element, remove it, and then append the new element. But that's way too slow!
I'm thinking of storing the numbers in a numpy array, sampling a list of indices, and for each index performing the calculation.
Is there any faster way of doing this process?
Python lists are implemented internally as arrays (like Java ArrayLists, C++ std::vectors, etc.), so removing an element from the middle is relatively slow: all subsequent elements have to be reindexed. (See http://www.laurentluce.com/posts/python-list-implementation/ for more on this.) Since the order of elements doesn't seem to be relevant to you, I'd recommend you just use random.randint(0, len(L) - 1) to choose an index i, then use L[i] = calculation(L[i]) to update the ith element.
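A minimal sketch of that approach, with calculation() standing in for whatever your update actually is:
import random

L = [random.random() for _ in range(10000)]  # the fixed-size collection, stored as a list

def calculation(x):
    return x + 1  # placeholder for the real computation

for _ in range(1000000):               # however many samples you need
    i = random.randint(0, len(L) - 1)  # choose a random index
    L[i] = calculation(L[i])           # update the ith element in place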
I need to sample uniformly at random a number from a set with fixed size, do some calculation, and put the new number back into the set.
from random import randrange

s = list(someset)  # store the set as a list
while 1:
    i = randrange(len(s))  # choose a random element
    x = s[i]
    y = your_calculation(x)  # do some calculation
    s[i] = y  # put the new number back into the set
random.sample( a set or list or Numpy array, Nsample ) is very fast, but it's not clear to me if you want anything like this:
import random
Setsize = 10000
Samplesize = 100
Max = 1 << 20
bigset = set( random.sample( xrange(Max), Setsize )) # initial subset of 0 .. Max
def calc( aset ):
    return set( x + 1 for x in aset )  # << your code here

# sample, calc a new subset of bigset, add it --
for iter in range(3):
    asample = random.sample( bigset, Samplesize )
    newset = calc( asample )  # new subset of 0 .. Max
    bigset |= newset
You could use Numpy arrays or bitarray instead of set, but I'd expect the time in calc() to dominate.
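For example, a rough NumPy-array sketch of the same loop, reusing Max, Setsize and Samplesize from above (note that duplicate indices in the sample would only be updated once here):
import numpy as np

values = np.random.randint(0, Max, size=Setsize)       # the "set", stored as a flat array
idx = np.random.randint(0, Setsize, size=Samplesize)   # random positions to update
values[idx] = values[idx] + 1                          # the x + 1 calc from above, applied in place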
What are your Setsize and Samplesize, roughly?