Randomly generate integers with a distribution that prefers low ones - python

I have a list ordered by some quality function from which I'd like to take elements, preferring the good elements at the beginning of the list.
Currently, my function to generate the random indices looks essentially as follows:
import itertools
import random

def pick():
    p = 0.2
    for i in itertools.count():
        if random.random() < p:
            break
    return i
It does a good job, but I wonder:
What's the name of the generated random distribution?
Is there a built-in function in Python for that distribution?

What you are describing sounds a lot like the exponential distribution. It already exists in the random module.
Here is some code that takes just the integer part of samples from an exponential distribution with mean 100 (i.e. rate parameter 1/100).
import random
import numpy as np
import matplotlib.pyplot as plt

d = [int(random.expovariate(1/100)) for i in range(10000)]
h, b = np.histogram(d, bins=np.arange(0, max(d)))
plt.bar(b[:-1], h, ec='none', width=1)
plt.show()

You could simulate it via the exponential, but this is like making a square peg fit a round hole. As Mark said, it is the geometric distribution - discrete, and shifted by 1. And it is right there in NumPy:
import numpy as np
import random
import itertools
import matplotlib.pyplot as plt

p = 0.2

def pick():
    for i in itertools.count():
        if random.random() < p:
            break
    return i

q = np.random.geometric(p, size=100000) - 1   # NumPy's geometric starts at 1
z = [pick() for i in range(100000)]
bins = np.linspace(-0.5, 30.5, 32)
plt.hist(q, bins, alpha=0.2, label='geom')
plt.hist(z, bins, alpha=0.2, label='pick')
plt.legend(loc='upper right')
plt.show()
Output: overlaid histograms of the two samples (they coincide closely).
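The two approaches connect: flooring an exponential with rate λ = -ln(1 - p) gives exactly the geometric distribution (shifted to start at 0) that pick() samples, so the stdlib alone can reproduce it without NumPy. A minimal sketch:

```python
import math
import random

def pick_geometric(p):
    # floor of an exponential with rate -ln(1-p) is geometric on {0, 1, 2, ...}:
    # P(k) = (1-p)^k * p, the same distribution pick() samples
    return int(random.expovariate(-math.log(1.0 - p)))

p = 0.2
samples = [pick_geometric(p) for _ in range(100_000)]
mean = sum(samples) / len(samples)
# the mean of a geometric starting at 0 is (1-p)/p = 4.0 for p = 0.2
```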

random.random() draws from a uniform distribution, but there are other methods within random that would also work. For your given use case, I would suggest random.expovariate(2) (Documentation, Wikipedia). This is an exponential distribution that will heavily prefer lower values. If you look up the other methods listed in the documentation, you can find some other built-in distributions.
Edit: Be sure to play around with the argument value for expovariate. Also note that it doesn't guarantee a result less than 1, so you may need to discard or clamp values that fall outside your range.
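One way to keep expovariate draws inside a list of length n is simply to resample out-of-range values, which preserves the shape of the truncated distribution. A minimal sketch (the rate 0.25 and the helper name pick_index are arbitrary choices, not from the answer above):

```python
import random

def pick_index(n, lambd=0.25):
    # lambd tunes the bias: larger lambd -> stronger preference for index 0
    # resample whenever the draw falls outside the list
    while True:
        i = int(random.expovariate(lambd))
        if i < n:
            return i

draws = [pick_index(10) for _ in range(10_000)]
```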

Related

Select a non-uniformly distributed random element from an array

I'm trying to pick numbers from an array at random.
I can easily do this by picking an element using np.random.randint(len(myArray)) - but that gives a uniform distribution.
For my needs I need to pick a random number with higher probability of picking a number near the beginning of the array - so I think something like an exponential probability function would suit better.
Is there a way for me to generate a random integer in a range (1, 1000) using an exponential (or other, non-uniform distribution) to use as an array index?
You can pass an exponential probability vector to NumPy's choice function. The probability vector must sum to 1, so you normalize it by the sum of all weights.
import numpy as np
from numpy.random import choice
arr = np.arange(0, 1001)
prob = np.exp(-arr/1000) # dividing by 1000 avoids overflow; the minus sign biases toward low indices
rand_draw = choice(arr, 1, p=prob/sum(prob))
To make sure the random distribution follows exponential behavior, you can plot it for 100000 random draws between 0 and 1000.
import matplotlib.pyplot as plt
# above code here
rand_draw = choice(arr, 100000, p=prob/sum(prob))
plt.hist(rand_draw, bins=100)
plt.show()
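One detail worth checking: with weights np.exp(arr/1000) the bias would go toward the end of the array; to prefer the beginning, the exponent must be negative. A quick self-contained check (the cutoff at 500 is arbitrary):

```python
import numpy as np

arr = np.arange(0, 1001)
# decaying weights so low indices are preferred (note the minus sign)
prob = np.exp(-arr / 1000)
prob = prob / prob.sum()

draws = np.random.choice(arr, size=100_000, p=prob)
low = np.count_nonzero(draws < 500)
high = np.count_nonzero(draws >= 500)
# with this decay, roughly 62% of draws fall in the lower half
```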

Numpy Trapezoidal Distribution for Age Distribution

I'm trying to create a rough model of US population distribution to generate random ages for a sample population, with the following image as a source, of sorts.
I feel that this could be most simply modeled by a trapezoidal distribution that remains uniform until dropping off at around the age of 50. However it seems that numpy does not offer the ability to utilize this distribution function. Because of this, I was wondering if it is possible to "combine" two distribution functions (in this case, a uniform distribution function with a maximum value of 50, and a triangular distribution function with a minimum of 51 and a maximum of 100). Is this possible, and is there a way to directly express a trapezoidal distribution function in python?
Yes, you can combine the samples arbitrarily. Just use np.concatenate:
import numpy as np
import matplotlib.pyplot as p
%matplotlib inline

def agedistro(turn, end, size):
    totarea = turn + (end - turn)/2   # e.g. 50 + (90-50)/2
    areauptoturn = turn               # say 50
    areasloped = (end - turn)/2       # (90-50)/2
    size1 = int(size * areauptoturn / totarea)
    size2 = size - size1
    s1 = np.random.uniform(low=0, high=turn, size=size1)
    # mode: the value where the peak of the distribution occurs,
    # with left <= mode <= right
    s2 = np.random.triangular(left=turn, mode=turn, right=end, size=size2)
    s3 = np.concatenate((s1, s2))  # don't use +, that would add the samples elementwise
    return s3

s3 = agedistro(turn=50, end=90, size=1000000)
p.hist(s3, bins=50)
p.show()
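As for expressing a trapezoidal distribution directly: scipy.stats does ship one, called trapezoid in SciPy ≥ 1.6 (trapz in older releases), so the two-piece construction above can also be written in one line. A sketch for the same flat-to-50, slope-to-90 shape (the shape parameters c and d give the flat region's endpoints as fractions of scale):

```python
import numpy as np
from scipy import stats

# flat from 0 to 50, then linear falloff to 90:
# c=0 means no rising edge, d=50/90 marks where the flat top ends
dist = stats.trapezoid(c=0.0, d=50/90, loc=0, scale=90)
samples = dist.rvs(size=100_000)
sample_mean = samples.mean()   # analytically about 35.95 for this shape
```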

Generate bins with even number of samples in each

I understand that I can generate bins for arrays with numpy using numpy.histogram or numpy.digitize, and have in the past. However, the analysis I need to do requires the same number of samples in each bin, even though the data is not uniformly distributed.
Say I have approximately normally distributed data in an array, A = numpy.random.randn(1000). How can I bin this data (either by creating an index or by finding the values that define the extents of each bin) so that there is an equal number of samples in each bin?
I know this can be treated as an optimization problem, and can solve it as such, i.e.:
import numpy as np
from scipy.optimize import fmin

def generate_even_bins(A, n):
    x0 = np.linspace(A.min(), A.max(), n)
    def bin_counts(x, A):
        if np.any(np.diff(x) <= 0):  # penalize non-monotonic edges
            return 1e6
        binned = np.digitize(A, x)
        return np.abs(np.diff(np.bincount(binned))).sum()
    return fmin(bin_counts, x0, args=(A,))
... but is there something already available, perhaps in numpy or scipy.stats that implements this? If not shouldn't there be?
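Not as a single built-in, but quantile-based edges get there directly with no optimizer: placing the bin edges at evenly spaced quantiles of the data puts (almost exactly) the same number of samples in every bin, whatever the data's distribution. A sketch (the helper name equal_count_edges is mine, not a library function):

```python
import numpy as np

def equal_count_edges(a, n_bins):
    # edges at evenly spaced quantiles -> roughly equal counts per bin,
    # regardless of how the data is distributed
    return np.quantile(a, np.linspace(0, 1, n_bins + 1))

rng = np.random.default_rng(0)
A = rng.normal(size=1000)
edges = equal_count_edges(A, 10)
counts, _ = np.histogram(A, bins=edges)
# each of the 10 bins ends up with ~100 of the 1000 samples
```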

Random numbers that follow a linear drop distribution in Python

I'd like to generate random numbers that follow a dropping linear frequency distribution; take p(x) = 1 - x as an example.
The numpy library however seems to offer only more complex distributions.
So, it turns out you can totally use random.triangular(0,1,0) for this. See documentation here: https://docs.python.org/2/library/random.html
random.triangular(low, high, mode)
Return a random floating point number N such that low <= N <= high and with the specified mode between those bounds.
Histogram made with matplotlib:
import matplotlib.pyplot as plt
import random

bins = [0.1 * i for i in range(12)]
plt.hist([random.triangular(0, 1, 0) for i in range(2500)], bins)
plt.show()
For the unnormalized PDF with density 1 - x on the range [0, 1), the normalizing integral is 1/2, so the normalized PDF is 2*(1-x) and the CDF is 2x - x^2. Sampling by inverting the CDF is then a one-liner:
r = 1.0 - math.sqrt(random.random())
This sample program produces pretty much the same plot:
import math
import random
import matplotlib.pyplot as plt
bins = [0.1 * i for i in range(12)]
plt.hist([(1.0 - math.sqrt(random.random())) for k in range(10000)], bins)
plt.show()
UPDATE
Let S_a^b denote the definite integral from a to b. So:
Unnormalized PDF(x) = 1 - x
Normalization:
N = S_0^1 (1-x) dx = 1/2
Thus, the normalized PDF is
PDF(x) = 2*(1-x)
Computing the CDF:
CDF(x) = S_0^x PDF(t) dt = 2x - x*x
Check: CDF(0) = 0, CDF(1) = 1
Sampling uses the inverse CDF method: solve for x in
CDF(x) = U(0,1)
where U(0,1) is a uniform random number in [0,1).
This is a simple quadratic equation with solution
x = 1 - sqrt(1 - U(0,1)) = 1 - sqrt(U(0,1))
(the last step uses the fact that 1 - U(0,1) is itself uniform),
which translates directly into the Python code above.

Plot function with large binomial coefficients

I would like to plot a function which involves binomial coefficients. The code I have is
#!/usr/bin/python
from __future__ import division
from scipy.special import binom
import matplotlib.pyplot as plt
import math
max = 500
ycoords = [sum([binom(n,w)*sum([binom(w,k)*(binom(w,k)/2**w)**(4*n/math.log(n)) for k in xrange(w+1)]) for w in xrange(1,n+1)]) for n in xrange(2,max)]
xcoords = range(2,max)
plt.plot(xcoords, ycoords)
plt.show()
Unfortunately this never terminates. If you reduce max to 40 say it works fine. Is there some way to plot this function?
I am also worried that scipy.special.binom might not be giving accurate answers as it works in floating point it seems.
You can get significant speedup by using numpy to compute the inner loop. First change max to N (since max is a builtin) and break up your function into smaller, more manageable chunks:
import numpy as np
from scipy.special import binom

N = 500
X = np.arange(2, N)

def k_loop(w, n):
    K = np.arange(0, w + 1)
    # exponent is 4n/log(n), as in the original expression
    return (binom(w, K) * (binom(w, K) / 2**w)**(4.0 * n / np.log(n))).sum()

def w_loop(n):
    return sum(binom(n, w) * k_loop(w, n) for w in range(1, n + 1))

Y = [w_loop(n) for n in X]
Using N=300 as a test it takes 3.932s with the numpy code, but 81.645s using your old code. I didn't even time the N=500 case since your old code took so long!
It's worth pointing out that your function exhibits essentially exponential growth and can be approximated as such. You can see this in a semilogx plot of Y.
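Regarding the floating-point accuracy worry: on Python 3 (no xrange) the same sum can be written with math.comb, which computes the binomial coefficients exactly as integers before the floating-point power is applied. A sketch of the straightforward port (much slower than the vectorized version above for large n, so only small n are evaluated here):

```python
import math

def f(n):
    # Python 3 port of the original double sum, using exact integer binomials
    e = 4 * n / math.log(n)
    return sum(
        math.comb(n, w) * sum(
            math.comb(w, k) * (math.comb(w, k) / 2**w) ** e
            for k in range(w + 1)
        )
        for w in range(1, n + 1)
    )

values = [f(n) for n in range(2, 30)]
```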
