seed for different numpy functions - python

a = (np.random.rand(10) > 0.1).astype(int)
b = np.random.binomial(1, 0.9, 10)
c = np.random.choice([0, 1], 10, p=[0.1, 0.9])
There are at least three different ways in numpy to get an array of 0s and 1s, where the 1s appear with a certain probability p (p = 0.9 in the example). When I use np.random.seed(1), any single method always returns the same array. However, the methods above all create different arrays from one another, even with the same seed. Is this happening because they use different PRNG algorithms, or are some of them simply not affected by np.random.seed(1)?

The different approaches use the stream of pseudo random numbers in different ways, which is why they do not result in the same samples, even if you seed the generator the same way prior to each sample.
This may be clearer by considering an additional approach (used for variable b below).
np.random.seed(1)
a = (np.random.rand(10) > 0.1).astype(int)
np.random.seed(1)
b = (np.random.rand(10) < 0.9).astype(int)
Both these approaches will generate ones with probability 0.9, and they draw the same underlying numbers when calling rand (since they're seeded the same). If for example, 0.02 is sampled from the call to rand, then the corresponding element in a will be 0 and the corresponding element in b will be 1. This is analogous to the differences you're observing when using other approaches to generate the samples.
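For completeness, a minimal sketch (mine, not from the original post) showing that each method is individually reproducible under the same seed, even though the three methods disagree with one another:
import numpy as np
np.random.seed(1)
a = (np.random.rand(10) > 0.1).astype(int)
np.random.seed(1)
b = np.random.binomial(1, 0.9, 10)
np.random.seed(1)
c = np.random.choice([0, 1], 10, p=[0.1, 0.9])
print(a)  # the same array on every run with seed 1
print(b)  # also reproducible, but generally different from a
print(c)  # also reproducible, but generally different from a and b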

Related

Proper way to use Python Unittest on a random 2D array

Suppose I have a function that returns a 3-by-3 2D array with random entries within given bounds:
import random

def random3by3Matrix(smallest_num, largest_num):
    matrix = [[0 for x in range(3)] for y in range(3)]
    for i in range(3):
        for j in range(3):
            matrix[i][j] = (random.randrange(int(smallest_num), int(largest_num + 1))
                            if smallest_num != largest_num else smallest_num)
    return matrix

print(random3by3Matrix(-10, 10))
The code above returns something like this:
[[-6, 10, -4], [-10, -9, 8], [10, 1, 1]]
How would I write a unittest for a function like this? I thought of using a helper function:
import unittest

def isEveryEntryGreaterEqual(list1, list2):
    for i in range(len(list1)):
        for j in range(len(list1[0])):
            if not (list1[i][j] <= list2[i][j]):
                return False
    return True

class TestFunction(unittest.TestCase):
    def test_random3by3Matrix(self):
        lower_bound = [[-10 for x in range(3)] for y in range(3)]
        upper_bound = [[10 for x in range(3)] for y in range(3)]
        self.assertEqual(True, isEveryEntryGreaterEqual(lower_bound, random3by3Matrix(-10, 10)))
        self.assertEqual(True, isEveryEntryGreaterEqual(random3by3Matrix(-10, 10), upper_bound))
But is there a cleaner way to do this?
Furthermore, how would you test that all of your values are not only between the boundaries, but also distributed randomly?
Test matrix bounds
It looks like you want to test whether every single element in the matrix is greater than some value, independently of where in the matrix the element sits. You can make this code shorter and more readable by extracting all the elements from the matrix and checking them in one go, instead of using the double for loop. You can flatten any nesting of lists or arrays into a 1D array with numpy.ravel(), and then test the resulting 1D array in one go with Python's built-in all() function. This way, you avoid looping over all the elements yourself:
import unittest

import numpy as np

def is_matrix_in_bounds(matrix, low, high):
    flat_list = np.ravel(matrix)  # flatten the nested lists into a 1D array
    # Each element of in_bounds is a boolean: True if the value is within bounds
    in_bounds = [low <= e <= high for e in flat_list]
    # all() returns True only if every element of in_bounds is True,
    # and False as soon as a single element is False
    return all(in_bounds)

class TestFunction(unittest.TestCase):
    def test_random3by3Matrix(self):
        lower_bound = -10
        upper_bound = 10
        matrix = random3by3Matrix(-10, 10)
        self.assertEqual(True, is_matrix_in_bounds(matrix, lower_bound, upper_bound))
If you will be using things like the matrix and the bounds in multiple tests, it may be beneficial to make them class attributes, so you don't have to define them in each test function.
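As a rough sketch (using the random3by3Matrix and is_matrix_in_bounds functions defined above), such shared fixtures could look like this:
import unittest

class TestFunction(unittest.TestCase):
    lower_bound = -10  # class attributes, shared by every test in the class
    upper_bound = 10

    def setUp(self):
        # a fresh random matrix is generated before each test method
        self.matrix = random3by3Matrix(self.lower_bound, self.upper_bound)

    def test_in_bounds(self):
        self.assertTrue(is_matrix_in_bounds(self.matrix, self.lower_bound, self.upper_bound))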
Test matrix randomness
Testing if some matrix is truly randomly distributed is a bit harder, since it will involve a statistical test to check if the variables are randomly distributed or not. The best you can do here is calculate the odds that they are indeed randomly distributed, and put a threshold on how low these odds are allowed to be. Since the matrix is random and the values in the matrix do not depend on each other, you're in luck, because you can again test them as if they were a 1D distribution.
To test this, you should create a second random uniform distribution and test the goodness of fit between your matrix and the new distribution with a Kolmogorov-Smirnov test. This treats the two distributions as random samples and tests how likely it is that they were drawn from the same underlying distribution: in your case, a random uniform distribution.
If the distributions are vastly different, the test returns a very low p-value (i.e. the odds of the two samples coming from the same underlying distribution are low). If they are similar, the p-value will be high. You want a random matrix, so you want a high p-value. The usual cutoff is 0.05 (which means that 1 in 20 genuinely random matrices will be flagged as non-random, because they look kinda non-random by happenstance).
Python provides such a test in the scipy module. You can either pass two samples (a two-sample KS test), or pass the name of some distribution and specify its parameters (a one-sample KS test). In the latter case, the distribution name should be the name of a distribution in scipy.stats, and you pass its parameters via the keyword argument args=().
import unittest

import numpy as np
from scipy import stats

class TestFunction(unittest.TestCase):
    def test_random3by3Matrix_randomness(self):
        lower_bound = -10
        upper_bound = 10
        matrix = random3by3Matrix(lower_bound, upper_bound)
        flat = np.ravel(matrix)
        # two-sample test against a freshly drawn uniform sample
        random_dist = np.random.randint(low=lower_bound, high=upper_bound + 1, size=3*3)
        statistic, p_value = stats.kstest(flat, random_dist)
        # one-sample test, equivalent but neater: no second sample needed
        # (scipy.stats.randint excludes its upper bound, hence upper_bound + 1)
        statistic, p_value = stats.kstest(flat, "randint", args=(lower_bound, upper_bound + 1))
        self.assertTrue(p_value > 0.05)
Note that unittests with a random aspect will sometimes fail. Such is the nature of randomness.
see:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.randint.html#scipy.stats.randint
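Since these tests rely on randomness, one mitigation (a sketch, not part of the original answer; the seed value is arbitrary) is to seed the generators in setUp, so that any failure can at least be reproduced exactly:
import random
import unittest

import numpy as np

class TestFunction(unittest.TestCase):
    def setUp(self):
        # fix both RNGs so a failing run can be replayed with identical data
        random.seed(42)
        np.random.seed(42)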

Sample with replacement in python with limit on number of samples from each class

Python has random.choices, which lets you sample from a population with replacement, assigning different weights to different members of the population. Similarly, numpy has np.random.choice which is similar (though it annoyingly forces you to normalize your weights into a probability distribution). But neither function allows you to restrict how many times each member of the population can occur in the sample. This is necessary if the members represent subpopulations and you are doing hierarchical sampling where you want to sample the members of each subpopulation without replacement since otherwise you might choose a subpopulation more times than it has members. Is there some other library function that can do this?
I don't know of a function specifically for that; the requirement is a little unusual. You can use NumPy to repeat each member of the population according to its limit, building a new population that is then sampled without replacement:
import numpy as np
x = [1, 2, 3, 4]
weights = [.2, .5, .2, .1]
replace_limits = [8, 2, 4, 10]
repeated_x = np.repeat(x, replace_limits)
new_weights = np.repeat(weights, replace_limits)
new_weights /= np.sum(new_weights)
sample = np.random.choice(repeated_x, size=10, replace=False, p=new_weights)
This isn't the most memory-efficient option, but that would require writing the algorithm in a lower-level language.
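To sanity-check the result, one could verify afterwards that no member exceeds its limit (a sketch continuing the snippet above, using the x, replace_limits, and sample arrays already defined):
values, counts = np.unique(sample, return_counts=True)  # occurrences of each drawn value
limit_of = dict(zip(x, replace_limits))
assert all(counts[i] <= limit_of[v] for i, v in enumerate(values))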

In Python, how can I distribute a set of numbers randomly onto a grid/matrix?

I have the following problem:
I want to generate a 100x100 grid (numpy.ndarray) by filling it with numbers from a given list ([-1, 0, 1, 2]), distributed randomly over the grid. The numbers must also maintain the following ratios: the number 0 must occupy 10% of the grid, while the remaining numbers have a 30% share each, so their sum equals 100%. Using np.random.choice() I was able to generate random numbers, each drawn with the associated probability. However, I run into problems because I have to make sure that the number 0 makes up exactly 10% of the entire grid, and the non-zero numbers exactly 30% each. With np.random.choice() this is not always the case (especially if the sample size is small), because I have only assigned probabilities, not exact ratios:
import numpy as np
numbers = np.random.choice([-1,0,1,2],(100,100),p=[0.3,0.1,0.3,0.3])
print(np.count_nonzero(numbers == 0) / numbers.size)  # must be exactly 0.1, but usually isn't
Another idea I had was to initially set the entire matrix with np.zeros((100,100)) and then fill only 90% of it with non-zero elements. However, I don't know how to approach this so that the numbers end up at random locations/indices on the grid.
Edit: The ratio of each individual non-zero number in the grid depends only on how many cells I want to be empty (0 in this case); all non-zero elements must have the same ratio. For example, if I want 20% of the grid to be zeros, each of the remaining numbers will have a ratio of (1 - ratio_of_zero) / number_of_non-zero_values.
This should do what you want it to (suggested by RemcoGerlich), though I don't know how efficient this method is:
import numpy as np
# Constants
SHAPE = (100, 100)
LENGTH = SHAPE[0] * SHAPE[1]
REST = [-1, 1, 2]
ZERO_PROB = 10
BASE_PROB = (100 - ZERO_PROB) // len(REST)
NUM_ZERO = round(LENGTH * (ZERO_PROB / 100))
NUM_REST = round(LENGTH * (BASE_PROB / 100))
# Make base 1D array
base_arr = [0 for _ in range(NUM_ZERO)]
for i in REST:
    base_arr += [i for _ in range(NUM_REST)]
base_arr = np.array(base_arr)
# Give it a random order
np.random.shuffle(base_arr)
# Finally, reshape the array
result_arr = base_arr.reshape(SHAPE)
Looking at your comment: how much flexibility you need depends on how many of the numbers are to have different probabilities, I suppose. You could just have a for loop that goes through and builds an array of the right length for each value to add to base_arr. Also, this can of course be a function you pass variables into rather than a script with hard-coded constants like this; a sketch of such a function follows below.
Edited based on comment.
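For reference, a sketch of the function form suggested above (the name random_grid and the zero_ratio argument are my own, not from the original answer); any cells left over after rounding are filled with extra zeros:
import numpy as np

def random_grid(shape, zero_ratio, rest=(-1, 1, 2)):
    """Grid with an exact share of zeros; the remaining values split evenly."""
    length = shape[0] * shape[1]
    num_zero = round(length * zero_ratio)
    num_each = (length - num_zero) // len(rest)
    values = [0] * num_zero
    for v in rest:
        values += [v] * num_each
    values += [0] * (length - len(values))  # pad any rounding remainder with zeros
    arr = np.array(values)
    np.random.shuffle(arr)  # random placement
    return arr.reshape(shape)

grid = random_grid((100, 100), zero_ratio=0.2)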

building PyMC3 model incorporating different measurements

I am trying to incorporate different types and replicates of measurements into one model in PyMC3.
Consider the following model: P(t) = P0 * exp(-k*B*t), where P(t), P0, and B are concentrations and k is a rate. We measure P(t) at different times and B once, all through counting of particles. k is the parameter of interest we are trying to infer.
My question has two parts:
(1) How to incorporate measurements on P(t) and B into one model?
(2) How to use a variable number of replicate experiments to inform on the value of k?
I think I can answer part (1), but am unsure about whether it is right or done in the right flavour. I failed to generalise the code to include a variable number of replicates.
For one experiment (one replicate):
ts = np.asarray([time0, time1, ...])
counts = np.asarray([countforB, countforPattime0, countforPattime1, ...])
basic_model = pm.Model()
with basic_model:
    k = pm.Uniform('k', 0, 20)
    B = pm.Uniform('B', 0, 1000)
    P = pm.Uniform('P', 0, 1000)
    exprate = pm.Deterministic('exprate', k*B)
    modelmu = pm.math.concatenate((B*np.asarray([1.0]), P*pm.math.exp(-exprate*ts)))
    Y_obs = pm.Poisson('Y_obs', mu=modelmu, observed=counts)
I tried to include different replicates along the lines of the above, but to no avail:
...
k=pm.Uniform('k',0,20) # same within replicates
B=pm.Uniform('B',0,1000,shape=numrepl) # can vary between expts.
P=pm.Uniform('P',0,1000,shape=numrepl) # can vary between expts.
exprate=???
modelmu=???
Multiple Observables
PyMC3 supports multiple observables, that is, you can add multiple RandomVariable objects to the graph with the observed argument set.
Single Trial
In your first case, this would lend some clarity to the model:
counts = [countforPattime0, countforPattime1, ...]
with pm.Model() as single_trial:
    # priors
    k = pm.Uniform('k', 0, 20)
    B = pm.Uniform('B', 0, 1000)
    P = pm.Uniform('P', 0, 1000)
    # transformed RVs
    rate = pm.Deterministic('exprate', k*B)
    mu = P*pm.math.exp(-rate*ts)
    # observations
    B_obs = pm.Poisson('B_obs', mu=B, observed=countforB)
    Y_obs = pm.Poisson('Y_obs', mu=mu, observed=counts)
Multiple Trials
With this additional flexibility, the transition to multiple trials hopefully becomes more obvious. It should go something like:
B_cts = np.array(...)  # shape (N, 1)
Y_cts = np.array(...)  # shape (N, M)
ts = np.array(...)     # shape (1, M)
with pm.Model() as multi_trial:
    # priors
    k = pm.Uniform('k', 0, 20)
    B = pm.Uniform('B', 0, 1000, shape=B_cts.shape)
    P = pm.Uniform('P', 0, 1000, shape=B_cts.shape)
    # transformed RVs
    rate = pm.Deterministic('exprate', k*B)
    mu = P*pm.math.exp(-rate*ts)
    # observations
    B_obs = pm.Poisson('B_obs', mu=B, observed=B_cts)
    Y_obs = pm.Poisson('Y_obs', mu=mu, observed=Y_cts)
There might be some extra syntax stuff to get the matrices multiplying correctly, but this at least includes the correct shapes.
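The shape bookkeeping relies on ordinary NumPy broadcasting; a quick sanity check with made-up sizes (a sketch, independent of PyMC3):
import numpy as np
N, M = 4, 6            # replicates, time points (illustrative sizes only)
B = np.ones((N, 1))    # one value per replicate
ts = np.ones((1, M))   # one row of time points
print((B * ts).shape)  # (4, 6): each replicate gets its own row across all time points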
Priors
Once you get that setup working, it would be in your interest to reconsider the priors. I suspect you have more information about the typical values for those than is currently included, especially since this seems like a chemical or physical model.
For instance, right now the model says,
We believe the true value of B remains fixed for the duration of a trial, but across trials is a completely arbitrary value between 0 and 1000, and measuring it repeatedly within a trial would be Poisson distributed.
Typically, one should avoid truncations unless they are excluding meaningless values. Hence, a lower bound of 0 is fine, but the upper bounds are arbitrary. I'd recommend having a look at the Stan Wiki on choosing priors.
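As one concrete illustration (a sketch only; the specific distributions and scales are assumptions on my part, not values from the question, and the parameter names follow recent PyMC3 releases), the multi-trial model could swap the hard truncations for positive-support priors:
with pm.Model() as multi_trial_informative:
    # positive-support priors instead of arbitrary upper truncations;
    # the scales below are placeholders and should come from domain knowledge
    k = pm.HalfNormal('k', sigma=10)
    B = pm.Gamma('B', mu=200, sigma=100, shape=B_cts.shape)
    P = pm.Gamma('P', mu=200, sigma=100, shape=B_cts.shape)
    rate = pm.Deterministic('exprate', k*B)
    mu = P*pm.math.exp(-rate*ts)
    B_obs = pm.Poisson('B_obs', mu=B, observed=B_cts)
    Y_obs = pm.Poisson('Y_obs', mu=mu, observed=Y_cts)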

Vector sum of multidimensional arrays in numpy

If I have an N^3 array of triplets in a numpy array, how do I do a vector sum over all of the triplets in the array? For some reason I just can't wrap my brain around the summation indices. Here is what I tried, but it doesn't seem to work:
a = np.random.random((5,5,5,3)) - 0.5
s = a.sum((0,1,2))
np.linalg.norm(s)
I would expect that, as N gets large, the norm of the sum should converge to 0 if the sum is working correctly, but it just keeps getting bigger. The sum gives me a vector of the correct shape (3,), but obviously I must be doing something wrong. I know this should be easy, but I'm just not getting it.
Thanks in advance!
It is easier to understand your problem analytically if, instead of uniform random numbers, we use standard normal numbers; the qualitative conclusions carry over to your particular case:
>>> a = np.random.normal(0, 1, size=(5, 5, 5, 3))
>>> s = a.sum(axis=(0, 1, 2))
So now each of the three entries of s is the sum of 125 numbers, each drawn from a standard normal distribution. It is a well-established fact that adding two normally distributed variables gives another normally distributed variable, with mean the sum of the means and variance the sum of the variances. So each of the three values in s is a draw from a normal distribution with mean 0 and standard deviation sqrt(125) ≈ 11.18.
The fact that the variance of the distribution grows means that, even though if you run your code many times, you will see an average value of 0 for each of those numbers, on any given run you are more likely to see larger offsets from 0.
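A quick empirical check of that standard deviation (my own sketch, not from the original answer):
>>> samples = np.array([np.random.normal(0, 1, size=(5, 5, 5, 3)).sum(axis=(0, 1, 2))
...                     for _ in range(10000)])
>>> samples.std(axis=0)  # each entry comes out close to sqrt(125) ~ 11.18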
Furthermore, you then go and compute the norm of those three values. Squaring three independent standard normal variables and adding them together gives a chi-squared distribution with three degrees of freedom; taking the square root gives a chi distribution. The former is easier to work with, and since each component of s has variance 125, it predicts that the average value of the squared norm of your three values will be 3 * 125 = 375. And it most certainly seems to be:
>>> mean_norm_sq = 0
>>> for n in range(1000):
...     a = np.random.normal(0, 1, size=(5, 5, 5, 3))
...     s = a.sum(axis=(0, 1, 2))
...     mean_norm_sq += np.sum(s**2)
...
>>> mean_norm_sq / 1000
374.47629802482447
As the comments note, there is no reason why the norm of the sum should approach zero. From the description, an array of N three-dimensional vectors sounds like it should have shape (N, 3), not (N, N, N, 3), but I may be misunderstanding it. Either way, it is simple to observe what happens in the two cases:
import numpy as np

avg_sum = []
sq_sum = []
N_val = 2**np.arange(15)
for N in N_val:
    A = np.random.random((N, 3)) - 0.5
    avg_sum.append(A.sum(axis=1).mean())
    sq_sum.append((A**2).sum(axis=1).mean())

import pylab as plt
plt.plot(N_val, avg_sum, label="Average sum")
plt.plot(N_val, sq_sum, label="Squared sum")
plt.legend(loc="best")
plt.show()
The average sum goes to zero as your intuition expects.
