How can I generate uniformly distributed data in [-1, 1]^d in Python? E.g. d is a dimension like 10.
I know how to generate uniformly distributed data like np.random.randn(N), but the dimension thing confuses me a lot.
Assuming independence of the individual coordinates, the following will generate a random point in [-1, 1)^d:
np.random.random(d) * 2 - 1
The following will generate n observations, where each row is an observation
np.random.random((n, d)) * 2 - 1
As has been pointed out, randn produces normally distributed numbers (aka Gaussian). To get uniformly distributed numbers you should use "uniform".
If you just want a single sample at a time of 10 uniformly distributed numbers you can use:
import numpy as np
x = np.random.uniform(low=-1, high=1, size=10)
OR if you'd like to generate lots (e.g. 100) of them at once then you can do:
import numpy as np
X = np.random.uniform(low=-1, high=1, size=(100, 10))
Now X[0], X[1], ... each has length 10.
You can import the random module and call random.random to get a random sample from [0, 1). You can double that and subtract 1 to get a sample from [-1, 1).
Draw d values this way and the tuple will be a uniform draw from the cube [-1, 1)^d.
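A minimal sketch of that approach:
import random
d = 10
point = tuple(2 * random.random() - 1 for _ in range(d))  # one uniform draw from [-1, 1)^d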
Without numpy:
import random
[random.uniform(-1, 1) for _ in range(N)]
There may be reasons to use numpy's internal mechanisms, or to call random() manually, etc. But those are implementation details, related to how the random number generator rations out the bits of entropy that the operating system provides.
Related
For some vector m (of length N) of numbers in R we can write
rnorm(N, mean = m, sd = 1)
and this will give a vector of length N where each element will be a sample for a normal distribution centred at the different elements of m. My question is, is it possible to do the same easily with numpy? As far as I can tell numpy.random.normal() requires the loc to be the same for all the elements. The point is that I want a random vector with different means.
Also while writing this, would it work to sample from a standard normal distribution and transform this sample? That would be easier.
One way to do this is to sample centered at 0 and then shift the samples:
m, N = np.array([1, 2, 3]), 1000
np.random.seed(42)
samples = np.random.randn(N, len(m)) + m  # broadcasting shifts column i by m[i]
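As a side note, numpy.random.normal does accept an array-like loc, so broadcasting can do this directly; a minimal sketch:
samples = np.random.normal(loc=m, scale=1, size=(N, len(m)))  # column i is drawn from N(m[i], 1)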
I'm trying to pick numbers from an array at random.
I can easily do this by picking an element using np.random.randint(len(myArray)) - but that gives a uniform distribution.
For my needs I need to pick a random number with higher probability of picking a number near the beginning of the array - so I think something like an exponential probability function would suit better.
Is there a way for me to generate a random integer in a range (1, 1000) using an exponential (or other, non-uniform distribution) to use as an array index?
You can pass an exponential probability vector to NumPy's choice function. The probability vector must add up to 1, so you normalize it by the sum of all the probabilities.
import numpy as np
from numpy.random import choice
arr = np.arange(0, 1001)
prob = np.exp(-arr/1000) # decaying weights favor the start of the array; dividing by 1000 sets a gentle decay scale
rand_draw = choice(arr, 1, p=prob/sum(prob))
To make sure the random distribution follows exponential behavior, you can plot it for 100000 random draws between 0 and 1000.
import matplotlib.pyplot as plt
# above code here
rand_draw = choice(arr, 100000, p=prob/sum(prob))
plt.hist(rand_draw, bins=100)
plt.show()
If I have an N^3 array of triplets in a numpy array, how do I do a vector sum of all the triplets in the array? For some reason I just can't wrap my brain around the summation indices. Here is what I tried, but it doesn't seem to work:
a = np.random.random((5,5,5,3)) - 0.5
s = a.sum((0,1,2))
np.linalg.norm(s)
I would expect that as N gets large, if the sum is working correctly, it should converge to 0, but it just keeps getting bigger. The sum gives me a vector of the correct shape, (3,), but obviously I must be doing something wrong. I know this should be easy, but I'm just not getting it.
Thanks in advance!
It is easier to understand your problem analytically if, instead of uniform random numbers, we use standard normal numbers; the qualitative results can be applied to your particular case:
>>> a = np.random.normal(0, 1, size=(5, 5, 5, 3))
>>> s = a.sum(axis=(0, 1, 2))
So now each of the three items of s is the sum of 125 numbers, each drawn from a standard normal distribution. It is a well established fact that adding up two normal distributions gives you another normal distribution with mean the sum of the means, and variance the sum of the variances. So each of the three values in s will be distributed as a random sample from a normal distribution with mean 0 and standard deviation sqrt(125) = 11.18.
The fact that the variance of the distribution grows means that, even though you will see an average value of 0 for each of those numbers over many runs, on any given run you are more likely to see larger offsets from 0.
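A quick empirical check of the sqrt(125) ≈ 11.18 prediction (a sketch, repeating the experiment 1000 times):
>>> runs = np.array([np.random.normal(0, 1, size=(5, 5, 5, 3)).sum(axis=(0, 1, 2))
...                  for _ in range(1000)])
>>> runs.std(axis=0)  # each of the three entries should be close to sqrt(125) ~ 11.18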
Furthermore, you then go and compute the norm of those three values. Squaring three standard normal variables and adding them together gives you a chi-squared distribution; taking the square root then gives a chi distribution. The former is easier to deal with, and since each of your three values is sqrt(125) times a standard normal, it predicts that the average value of the squared norm of your three values will be 3 * 125 = 375. And it most certainly seems to be:
>>> mean_norm_sq = 0
>>> for n in range(1000):
... a = np.random.normal(0, 1, size=(5, 5, 5, 3))
... s = a.sum(axis=(0, 1, 2))
... mean_norm_sq += np.sum(s**2)
...
>>> mean_norm_sq / 1000
374.47629802482447
As the comments note, there is no reason why the squared sum should approach zero. By the description, an array of N three-dimensional vectors sounds like it should have the shape of (N,3) not (N,N,N,3), but I may be misunderstanding it. Either way, it is simple to observe what happens in the two cases:
import numpy as np
avg_sum = []
sq_sum = []
N_val = 2**np.arange(15)
for N in N_val:
    A = np.random.random((N, 3)) - 0.5
    avg_sum.append(A.sum(axis=1).mean())
    sq_sum.append((A**2).sum(axis=1).mean())
import matplotlib.pyplot as plt
plt.plot(N_val, avg_sum, label="Average sum")
plt.plot(N_val, sq_sum, label="Squared sum")
plt.legend(loc="best")
plt.show()
The average sum goes to zero as your intuition expects.
There is a result of a physical experiment, which can be represented as a histogram [i, amount_of(i)]. I suppose that the result can be estimated by a mixture of 4 to 6 Gaussian functions.
Is there a package in Python which takes a histogram as an input and returns the mean and variance of each Gaussian distribution in the mixture distribution?
Original data, for example:
This is a mixture of Gaussians, and can be estimated using an expectation maximization approach (basically, it finds the centers and spreads of the distributions at the same time as it estimates how they are mixed together).
This is implemented in the PyMix package. Below I generate an example of a mixture of normals, and use PyMix to fit a mixture model to them, including figuring out what you're interested in, which is the size of subpopulations:
# requires numpy and PyMix (matplotlib is just for making a histogram)
import random
import numpy as np
from matplotlib import pyplot as plt
import mixture
random.seed(0o10713)     # seed the stdlib RNG, in case mixture uses it internally
np.random.seed(0o10713)  # seed NumPy's RNG, which np.random.normal below actually uses
# create a mixture of normals:
# 1000 from N(0, 1)
# 2000 from N(6, 2)
mix = np.concatenate([np.random.normal(0, 1, [1000]),
np.random.normal(6, 2, [2000])])
# histogram:
plt.hist(mix, bins=20)
plt.savefig("mixture.pdf")
All the above code does is generate and plot the mixture. It looks like this:
Now to actually use PyMix to figure out what the percentages are:
data = mixture.DataSet()
data.fromArray(mix)
# start them off with something arbitrary (probably based on a guess from the figure)
n1 = mixture.NormalDistribution(-1,1)
n2 = mixture.NormalDistribution(1,1)
m = mixture.MixtureModel(2,[0.5,0.5], [n1,n2])
# perform expectation maximization
m.EM(data, 40, .1)
print(m)
The output model of this is:
G = 2
p = 1
pi =[ 0.33307859 0.66692141]
compFix = [0, 0]
Component 0:
ProductDist:
Normal: [0.0360178848449, 1.03018725918]
Component 1:
ProductDist:
Normal: [5.86848468319, 2.0158608802]
Notice it found the two normals quite correctly (one N(0, 1) and one N(6, 2), approximately). It also estimated pi, which is the fraction in each of the two distributions (you mention in the comments that's what you're most interested in). We had 1000 in the first distribution and 2000 in the second distribution, and it gets the division almost exactly right: [ 0.33307859 0.66692141]. If you want to get this value directly, do m.pi.
A few notes:
This approach takes a vector of values, not a histogram. It should be easy to convert your data into a 1D vector (that is, turn [(1.4, 2), (2.6, 3)] into [1.4, 1.4, 2.6, 2.6, 2.6]); see the sketch after these notes.
We had to guess the number of Gaussian distributions in advance (it won't figure out a mix of 4 if you ask for a mix of 2).
We had to put in some initial estimates for the distributions. If you make even remotely reasonable guesses it should converge to the correct estimates.
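PyMix is quite old; if it is hard to install, scikit-learn's GaussianMixture (this assumes scikit-learn is available to you) fits the same kind of model by expectation maximization. A rough sketch, reusing the mix array from above and showing the histogram-to-vector conversion from the first note:
import numpy as np
from sklearn.mixture import GaussianMixture
# histogram-to-vector conversion: [(1.4, 2), (2.6, 3)] -> [1.4, 1.4, 2.6, 2.6, 2.6]
hist = [(1.4, 2), (2.6, 3)]
values = np.repeat([v for v, _ in hist], [c for _, c in hist])
# fit a two-component mixture to the 1D data; sklearn expects a 2D array
gm = GaussianMixture(n_components=2).fit(mix.reshape(-1, 1))
print(gm.weights_)                       # analogous to pi above
print(gm.means_.ravel())                 # component means
print(np.sqrt(gm.covariances_.ravel()))  # component standard deviations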
I need to generate a vector of random floats in [0,1] such that their sum equals 1 and that are distributed non-uniformly.
Is there any Python function that generates such a vector?
Best wishes
The distribution you are probably looking for is called the Dirichlet distribution. There's no built-in function in Python for drawing random numbers from a Dirichlet distribution, but NumPy contains one:
>>> from numpy.random import dirichlet
>>> print(dirichlet([1] * n))
This will give you n numbers that sum up to 1, and the probability of each such combination will be equal.
Alternatively, if you don't have NumPy, you can make use of the fact that a random sample drawn from an n-dimensional Dirichlet distribution can be generated by drawing n independent samples from a gamma distribution with shape and scale parameters equal to 1 and then dividing each sample by their sum:
>>> from random import gammavariate
>>> def dirichlet(n):
...     samples = [gammavariate(1, 1) for _ in range(n)]
...     sum_samples = sum(samples)
...     return [x / sum_samples for x in samples]
The reason why you need a Dirichlet distribution is because if you simply draw random numbers uniformly from some interval and then divide them by the sum of them, the resulting distribution will be biased towards samples consisting of roughly equal numbers. See Luc Devroye's book for more on this topic.
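To see the bias concretely, you can compare normalized uniforms against a flat Dirichlet; a quick sketch:
>>> import numpy as np
>>> n = 3
>>> naive = np.random.random((100000, n))
>>> naive /= naive.sum(axis=1, keepdims=True)    # uniforms divided by their sum
>>> flat = np.random.dirichlet([1] * n, 100000)  # uniform over the simplex
>>> naive[:, 0].std(), flat[:, 0].std()          # naive coordinates cluster more tightly around 1/n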
There is a nicer example on the Wikipedia page for the Dirichlet distribution.
The code below generates a k-dimensional sample:
import random
params = [a1, a2, ..., ak]  # k concentration parameters (placeholders; e.g. [1, 2, 3])
sample = [random.gammavariate(a, 1) for a in params]
total = sum(sample)
sample = [v / total for v in sample]