Is there any method to generate non-repeating random floats in a range, with a given size and standard deviation?
I generate e.g. 1000 random floats between a min and max value:
import numpy as np

randSize = 1000
randValues = np.random.uniform(low=myMinVal, high=myMaxVal, size=(randSize,))
But I want to generate only numbers that have less than 0.2 SD in that range.
Your question is a bit unclear. My understanding is that you want to draw 1000 floats from a normal distribution with mean = 0.0 and sigma = 0.2. To me the easiest way is:
import numpy as np

mu, sigma = 0, 0.2  # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)
See the numpy.random.normal documentation.
Now, as you know, the probability of obtaining the same float twice is very low, but if uniqueness is a requirement, an easy way of tackling it is:
dim = 1000
original_list = list(set(np.random.normal(mu, sigma, 2*dim).tolist()))[:dim]
Explanation: I create an array of floats double the required size, convert it to a list and then to a set. By definition a set contains unique values, so potential duplicates are deleted. Then I convert back to a list and cut it to the size you want: dim.
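A numpy-only variant of the same idea (an addition of mine, not part of the original answer): np.unique also drops duplicates, but it sorts the result, so shuffle before slicing to avoid keeping only the smallest values:

samples = np.unique(np.random.normal(mu, sigma, 2 * dim))
np.random.shuffle(samples)  # np.unique sorts, so restore a random order
samples = samples[:dim]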
I have the following problem:
I want to generate a 100x100 grid (numpy.ndarray) by filling it with numbers drawn from a given list ([-1, 0, 1, 2]), distributed randomly over the grid. The numbers must also maintain the following ratios: the number 0 must occupy 10% of the grid, while the remaining numbers have a 30% ratio each, so their sum equals 100%. Using np.random.choice() I was able to generate random numbers, each drawn with the associated probability. However, I run into problems because I have to make sure that the number 0 makes up exactly 10% of the entire grid and the non-zero numbers exactly 30% each. With np.random.choice() this is not always the case (especially if the sample size is small), because I have only assigned probabilities, not ratios:
import numpy as np
numbers = np.random.choice([-1, 0, 1, 2], (100, 100), p=[0.3, 0.1, 0.3, 0.3])
print(np.count_nonzero(numbers == 0) / numbers.size)  # must always be 0.1!
Another idea I had was to initially set the entire matrix with np.zeros((100,100)) and then fill up only 90% of it with non-zero elements; however, I don't know how to approach this so that the numbers are distributed randomly on the grid, i.e. at random locations/indices.
Edit: The ratio of each individual non-zero number in the grid depends only on how many cells I want to be empty (0 in this case). All other non-zero elements must have the same ratio. For example, if I want 20% of the grid to be zeros, the remaining numbers will each have a ratio of (1 - ratio_of_zero) / number_of_non-zero_elements.
This should do what you want (as suggested by RemcoGerlich), though I don't know how efficient this method is:
import numpy as np
# Constants
SHAPE = (100, 100)
LENGTH = SHAPE[0] * SHAPE[1]
REST = [-1, 1, 2]
ZERO_PROB = 10
BASE_PROB = (100 - ZERO_PROB) // len(REST)
NUM_ZERO = round(LENGTH * (ZERO_PROB / 100))
NUM_REST = round(LENGTH * (BASE_PROB / 100))
# Make base 1D array
base_arr = [0 for _ in range(NUM_ZERO)]
for i in REST:
    base_arr += [i for _ in range(NUM_REST)]
base_arr = np.array(base_arr)
# Give it a random order
np.random.shuffle(base_arr)
# Finally, reshape the array
result_arr = base_arr.reshape(SHAPE)
Looking at your comment: how flexible this needs to be depends on how many of the numbers are to have different probabilities, I suppose. You could just have a for loop which goes through and builds an array of the right length for each value to add to base_arr. This can also, of course, be a function you pass variables into rather than a script with hard-coded constants like this; a sketch of that follows below.
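A minimal sketch of that function form, assuming the same even split among the non-zero values (the name random_grid and its parameters are illustrative, not from the original answer):

import numpy as np

def random_grid(shape=(100, 100), zero_ratio=0.1, values=(-1, 1, 2)):
    # exact counts instead of probabilities
    length = shape[0] * shape[1]
    num_zero = round(length * zero_ratio)
    num_each = (length - num_zero) // len(values)
    base = [0] * num_zero
    for v in values:
        base += [v] * num_each
    base += [0] * (length - len(base))  # pad if integer division left a gap
    arr = np.array(base)
    np.random.shuffle(arr)              # random placement on the grid
    return arr.reshape(shape)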
For some vector m (of length N) of numbers in R we can write
rnorm(N, mean = m, sd = 1)
and this will give a vector of length N where each element will be a sample for a normal distribution centred at the different elements of m. My question is, is it possible to do the same easily with numpy? As far as I can tell numpy.random.normal() requires the loc to be the same for all the elements. The point is that I want a random vector with different means.
Also, while writing this: would it work to sample from a standard normal distribution and transform that sample? That would be easier.
One way to do it is to sample centered at 0 and then shift the sample by the means:
import numpy as np

m, N = np.array([1, 2, 3]), 1000
np.random.seed(42)
# broadcasting adds m[j] to column j, so column j ~ N(m[j], 1)
samples = np.random.randn(N, len(m)) + m
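For what it's worth (an addition of mine, not part of the original answer), np.random.normal itself broadcasts an array-valued loc against size, so the shift is not strictly required:

# each column j is drawn directly with mean m[j] and standard deviation 1
samples = np.random.normal(loc=m, scale=1.0, size=(N, len(m)))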
I want to sample from the binomial distribution B(n,p), but with the additional constraint that the sampled value lies in the range [a,b] (instead of the usual 0 to n range). In other words, I have to sample a value from the binomial distribution given that it lies in the range [a,b]. Mathematically, I can write the pmf of this distribution, f(x), in terms of the binomial pmf bin(x) = (nCx)*(p)^x*(1-p)^(n-x) as
sum = 0
for i in range(a, b+1):
    sum += bin(i)
f(x) = bin(x)/sum
One way of sampling from this distribution is to draw a uniformly distributed number and apply the inverse of the CDF (obtained from the pmf). However, I don't think this is a good idea, as the pmf calculation would easily become very time-consuming.
The values of n, x, a, b are quite large in my case, and this way of computing the pmf and then using a uniform random variable to generate the sample seems extremely inefficient due to the factorial terms in nCx.
What's a nice/efficient way to achieve this?
This is a way to collect all the values of bin in a pretty short time:
from scipy.special import comb
import numpy as np
def distribution(n, p=0.5):
    x = np.arange(n+1)
    return comb(n, x, exact=False) * p ** x * (1 - p) ** (n - x)
It can be done in a fraction of a millisecond for n=1000.
Sample run:
>>> distribution(4)
array([0.0625, 0.25 , 0.375 , 0.25 , 0.0625])
You can sum specific parts of this array like so:
>>> np.sum(distribution(4)[2:4])
0.625
Remark: for n>1000 the middle values of this distribution require multiplying extremely large numbers, so a RuntimeWarning is raised.
Bugfix
You can use scipy.stats.binom equivalently:
import numpy as np
from scipy.stats import binom

def distribution(n, p):
    return binom.pmf(np.arange(n+1), n, p)
This does the same as the above-mentioned method quite efficiently (n=1000000 in about a third of a second). Alternatively, you can use binom.cdf(np.arange(n+1), n, p), which calculates the cumulative sum of binom.pmf; subtracting the ath item of that array from the bth then gives an output very close to what you expect.
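To turn the pmf into actual samples (my addition, not part of the original answer): renormalize it over [a, b] and pass it to np.random.choice. This is practical as long as b - a is small enough to enumerate; the helper name below is made up:

import numpy as np
from scipy.stats import binom

def sample_truncated_binomial(n, p, a, b, size=1):
    support = np.arange(a, b + 1)
    pmf = binom.pmf(support, n, p)
    pmf /= pmf.sum()                 # renormalize over [a, b]
    return np.random.choice(support, size=size, p=pmf)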
Another way would be to use the CDF and its inverse, something like:
from scipy import stats
dist = stats.binom(100, 0.5)
# limit ourselves to [60, 100]; use cdf(59) so that 60 itself can be drawn
lo, hi = dist.cdf([59, 100])
# draw a sample
x = dist.ppf(stats.uniform(lo, hi-lo).rvs())
This should give us values in the range. Note that due to floating-point precision this might occasionally give you values outside of what you want; it gets worse above the mean of the distribution.
Note also that for large values you might as well use the normal approximation.
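As a hedged sketch of that approximation (the numbers below are made up for illustration): draw from N(np, np(1-p)), round to integers, and reject anything outside [a, b]:

import numpy as np

n, p, a, b = 10**6, 0.5, 499_000, 501_000
mu, sd = n * p, np.sqrt(n * p * (1 - p))
x = np.round(np.random.normal(mu, sd, size=10_000))
x = x[(x >= a) & (x <= b)]  # rejection step: discard out-of-range draws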
I am trying to understand what is the difference, if any, between these functions:
numpy.random.rand()
numpy.random.random()
numpy.random.uniform()
It seems that they produce a random sample from a uniform distribution. So, without any parameter in the function, is there any difference?
numpy.random.uniform(low=0.0, high=1.0, size=None) - uniform samples from an arbitrary range
Draw samples from a uniform distribution.
Samples are uniformly distributed over the half-open interval [low, high) (includes low, but excludes high). In other words, any value within the given interval is equally likely to be drawn by uniform.
numpy.random.random(size=None) - uniform distribution between 0 and 1
Return random floats in the half-open interval [0.0, 1.0).
Results are from the “continuous uniform” distribution over the stated interval. To sample Unif[a, b), b > a, multiply the output of random_sample by (b - a) and add a:
(b - a) * random_sample() + a
numpy.random.rand(d0, d1, ..., dn) - Samples from a uniform distribution to populate an array of a given shape
Random values in a given shape.
Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).
To answer your other question, given all default parameters all of the functions numpy.random.uniform, numpy.random.random, and numpy.random.rand are identical.
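A quick illustration of that equivalence (my own example, not from the original answer):

import numpy as np

# With default arguments, each call returns a single float in [0.0, 1.0):
np.random.rand()       # dimensions passed as separate integer arguments
np.random.random()     # shape passed as a tuple (or omitted)
np.random.uniform()    # low and high default to 0.0 and 1.0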
Short answer
Without parameters, the three functions are equivalent, producing a random float in the range [0.0,1.0).
Details
numpy.random.rand is a convenience function that accepts an arbitrary number of integer parameters as dimensions. It differs from the other numpy.random functions (and also from numpy.zeros and numpy.ones) in that those accept shapes, i.e. N-tuples (specified as Python lists or tuples). The following two lines produce identically distributed results (the random seed notwithstanding):
import numpy as np
x = np.random.random_sample((1,2,3)) # a single tuple as parameter
x = np.random.rand(1,2,3) # integers as parameters
numpy.random.random is an alias for numpy.random.random_sample.
numpy.random.uniform allows you to specify the limits of the distribution, with the low and high keyword parameters, instead of using the default [0.0,1.0).
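For example (the bounds here are my own illustrative choice):

import numpy as np

# 2x3 array of uniform samples drawn from [-5.0, 5.0)
y = np.random.uniform(low=-5.0, high=5.0, size=(2, 3))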
I want to specify the probability density function of a distribution and then pick up N random numbers from that distribution in Python. How do I go about doing that?
In general, you want to have the inverse cumulative distribution function. Once you have that, generating the random numbers along the distribution is simple:
import random
def sample(n):
    return [icdf(random.random()) for _ in range(n)]
Or, if you use NumPy:
import numpy as np
def sample(n):
    return icdf(np.random.random(n))
In both cases icdf is the inverse cumulative distribution function which accepts a value between 0 and 1 and outputs the corresponding value from the distribution.
To illustrate the nature of icdf, we'll take a simple uniform distribution between the values 10 and 12 as an example:
- the probability density function is 0.5 between 10 and 12, zero elsewhere
- the cumulative distribution function is 0 below 10 (no samples below 10), 1 above 12 (no samples above 12), and increases linearly between those values (it is the integral of the PDF)
- the inverse cumulative distribution function is only defined between 0 and 1. At 0 it is 10, at 1 it is 12, and it changes linearly in between
Of course, the difficult part is obtaining the inverse cumulative distribution function. It really depends on your distribution: sometimes you have an analytical expression, sometimes you want to resort to interpolation. Numerical methods are useful here, as numerical integration can be used to build the CDF and interpolation can be used to invert it; see the sketch below.
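A minimal numerical sketch of that route for the uniform example above (my addition; it builds the CDF by integration and inverts it by interpolation):

import numpy as np

xs = np.linspace(10, 12, 1001)
pdf = np.full_like(xs, 0.5)             # the example PDF: 0.5 on [10, 12]
cdf = np.cumsum(pdf) * (xs[1] - xs[0])  # numerical integration of the PDF
cdf /= cdf[-1]                          # normalize so the CDF ends at 1

def icdf(u):
    return np.interp(u, cdf, xs)        # invert the CDF by interpolation

samples = icdf(np.random.random(1000))  # 1000 draws from the distribution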
This is my function to retrieve a single random number distributed according to a given probability density function. It uses a rejection-sampling (Monte Carlo like) approach. Of course, n random numbers can be generated by calling this function n times.
"""
Draws a random number from given probability density function.
Parameters
----------
pdf -- the function pointer to a probability density function of form P = pdf(x)
interval -- the resulting random number is restricted to this interval
pdfmax -- the maximum of the probability density function
integers -- boolean, indicating if the result is desired as integer
max_iterations -- maximum number of 'tries' to find a combination of random numbers (rand_x, rand_y) located below the function value calc_y = pdf(rand_x).
returns a single random number according the pdf distribution.
"""
def draw_random_number_from_pdf(pdf, interval, pdfmax = 1, integers = False, max_iterations = 10000):
for i in range(max_iterations):
if integers == True:
rand_x = np.random.randint(interval[0], interval[1])
else:
rand_x = (interval[1] - interval[0]) * np.random.random(1) + interval[0] #(b - a) * random_sample() + a
rand_y = pdfmax * np.random.random(1)
calc_y = pdf(rand_x)
if(rand_y <= calc_y ):
return rand_x
raise Exception("Could not find a matching random number within pdf in " + max_iterations + " iterations.")
In my opinion this solution performs better than the others if you do not have to retrieve a very large number of random values. Another benefit is that you only need the PDF and avoid calculating the CDF, the inverse CDF, or weights.
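For example (a usage sketch of my own; the standard normal PDF peaks at about 0.4, so pdfmax=0.4 is a safe bound):

import numpy as np

pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
samples = [draw_random_number_from_pdf(pdf, (-4, 4), pdfmax=0.4) for _ in range(1000)]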