If I have an N^3 array of triplets in a numpy array, how do I do a vector sum on all of the triplets in the array? For some reason I just can't wrap my brain around the summation indices. Here is what I tried, but it doesn't seem to work:
a = np.random.random((5,5,5,3)) - 0.5
s = a.sum((0,1,2))
np.linalg.norm(s)
I would expect that as N gets large, if the sum is working correctly the norm should converge to 0, but it just keeps getting bigger. The sum gives me a vector of the correct shape (3 components), but obviously I must be doing something wrong. I know this should be easy, but I'm just not getting it.
Thanks in advance!
It is easier to understand your problem analytically if, instead of uniform random numbers, we use standard normal numbers; the qualitative results carry over to your particular case:
>>> a = np.random.normal(0, 1, size=(5, 5, 5, 3))
>>> s = a.sum(axis=(0, 1, 2))
So now each of the three entries of s is the sum of 125 numbers, each drawn from a standard normal distribution. It is a well-established fact that adding two independent normal random variables gives another normal random variable whose mean is the sum of the means and whose variance is the sum of the variances. So each of the three values in s will be distributed as a random sample from a normal distribution with mean 0 and standard deviation sqrt(125) = 11.18.
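As a quick check of that claim (a small simulation sketch, not part of the original answer):
>>> sums = np.random.normal(0, 1, size=(100000, 125)).sum(axis=1)
>>> sums.std()   # should come out close to sqrt(125) = 11.18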
The fact that the variance of the distribution grows means that, even though running your code many times would give an average value of 0 for each of those numbers, on any single run you are likely to see larger offsets from 0 as the array gets bigger.
Furthermore, you then go and compute the norm of those three values. Squaring three standard normal variables and adding them together gives you a chi-squared distribution with three degrees of freedom; here each value has variance 125, so the squared norm is 125 times such a chi-squared variable. If you then take the square root, you get a (scaled) chi distribution. The former is easier to deal with, and it predicts that the average value of the squared norm of your three values will be 3 * 125 = 375. And it most certainly seems to be:
>>> mean_norm_sq = 0
>>> for n in range(1000):
...     a = np.random.normal(0, 1, size=(5, 5, 5, 3))
...     s = a.sum(axis=(0, 1, 2))
...     mean_norm_sq += np.sum(s**2)
...
>>> mean_norm_sq / 1000
374.47629802482447
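For reference, the theoretical value that loop is estimating can be computed exactly: the squared norm is 125 times a chi-squared variable with 3 degrees of freedom, so its mean is 125 * 3 (a tiny sketch using scipy, which the answer itself does not need):
>>> from scipy import stats
>>> 125 * stats.chi2(3).mean()
375.0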
As the comments note, there is no reason why the squared sum should approach zero. From the description, an array of N three-dimensional vectors sounds like it should have the shape (N, 3), not (N, N, N, 3), but I may be misunderstanding it. Either way, it is simple to observe what happens in the two cases:
import numpy as np
avg_sum = []
sq_sum = []
N_val = 2**np.arange(15)
for N in N_val:
    A = np.random.random((N, 3)) - 0.5
    avg_sum.append(A.sum(axis=1).mean())
    sq_sum.append((A**2).sum(axis=1).mean())
import pylab as plt
plt.plot(N_val, avg_sum, label="Average sum")
plt.plot(N_val, sq_sum, label="Squared sum")
plt.legend(loc="best")
plt.show()
The average sum goes to zero as your intuition expects, while the average squared sum settles near a constant (about 3/12 = 0.25 for uniform values in [-0.5, 0.5]).
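To tie this back to the original question: it is the norm of the mean triplet, not of the raw sum, that converges to zero. A minimal sketch of my own illustrating this on the (N, N, N, 3) shape:
import numpy as np
for N in (5, 10, 20, 40):
    a = np.random.random((N, N, N, 3)) - 0.5
    s = a.sum(axis=(0, 1, 2))
    # the raw sum's norm grows roughly like sqrt(N**3),
    # while the mean triplet's norm shrinks toward zero
    print(N, np.linalg.norm(s), np.linalg.norm(s / N**3))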
I have the following problem:
I want to generate a 100x100 grid (numpy.ndarray) by filling it with numbers drawn from a given list ([-1, 0, 1, 2]), distributed randomly over the grid. The numbers must also maintain the following ratios: the number 0 must occupy 10% of the grid, while the remaining numbers have a ratio of 30% each, so that the ratios sum to 100%. Using np.random.choice() I was able to generate random numbers, each drawn with the associated probability. However, I run into problems because I have to make sure that the number 0 makes up exactly 10% of the entire grid and each non-zero number exactly 30%. With np.random.choice() this is not always the case (especially if the sample size is small), because I have only assigned probabilities, not exact ratios:
import numpy as np
numbers = np.random.choice([-1,0,1,2],(100,100),p=[0.3,0.1,0.3,0.3])
print(np.count_nonzero(numbers == 0) / numbers.size)  # must be exactly 0.1 every time!
Another idea I had was to initially set the entire matrix with np.zeros((100,100)) and then fill up only 90% of it with non-zero elements; however, I don't know how to approach this so that the numbers end up at random locations/indices on the grid.
Edit: The ratio of each individual non-zero number in the grid depends only on how many cells I want to be empty (0 in this case). All non-zero elements must have the same ratio. For example, if I want 20% of the grid to be zeros, each of the remaining numbers will have a ratio of (1 - ratio_of_zero) / number_of_non-zero_values.
This should do what you want (as suggested by RemcoGerlich), though I don't know how efficient this method is:
import numpy as np
# Constants
SHAPE = (100, 100)
LENGTH = SHAPE[0] * SHAPE[1]
REST = [-1, 1, 2]
ZERO_PROB = 10
BASE_PROB = (100 - ZERO_PROB) // len(REST)
NUM_ZERO = round(LENGTH * (ZERO_PROB / 100))
NUM_REST = round(LENGTH * (BASE_PROB / 100))
# Make base 1D array
base_arr = [0 for _ in range(NUM_ZERO)]
for i in REST:
    base_arr += [i for _ in range(NUM_REST)]
base_arr = np.array(base_arr)
# Give it a random order
np.random.shuffle(base_arr)
# Finally, reshape the array
result_arr = base_arr.reshape(SHAPE)
Looking at your comment: how much flexibility you need depends on how many of the numbers are supposed to have different probabilities. You could have a for loop that goes through each value and builds an array of the right length to add to base_arr. Also, this can of course be a function you pass variables into rather than a script with hard-coded constants like this; a rough sketch of such a function is given below.
Edited based on comment.
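As mentioned above, here is a rough sketch of what a function version might look like; the name fill_grid and the way leftover cells are split among the non-zero values are my own choices, not part of the answer:
import numpy as np

def fill_grid(shape, zero_frac=0.1, values=(-1, 1, 2)):
    # fill a grid with exact counts of each value, placed at random positions
    length = shape[0] * shape[1]
    num_zero = round(length * zero_frac)
    # split the remaining cells as evenly as possible among the other values
    num_each, leftover = divmod(length - num_zero, len(values))
    counts = [num_each + (1 if i < leftover else 0) for i in range(len(values))]
    flat = np.concatenate([np.zeros(num_zero, dtype=int),
                           np.repeat(values, counts)])
    np.random.shuffle(flat)
    return flat.reshape(shape)

grid = fill_grid((100, 100), zero_frac=0.1)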
I want to sample from the binomial distribution B(n, p), but with the additional constraint that the sampled value lies in the range [a, b] (instead of the usual 0 to n range). In other words, I have to sample a value from a binomial distribution given that it lies in [a, b]. Mathematically, I can write the pmf of this distribution, f(x), in terms of the binomial pmf, bin(x) = C(n, x) * p^x * (1-p)^(n-x), as:
sum = 0
for i in range(a, b+1):
    sum += bin(i)
f(x) = bin(x) / sum
One way of sampling from this distribution is to draw a uniformly distributed number and apply the inverse of the CDF (obtained from the pmf). However, I don't think this is a good idea, as computing the pmf would easily become very time-consuming.
The values of n, x, a, b are quite large in my case, and this way of computing the pmf and then using a uniform random variable to generate the sample seems extremely inefficient due to the factorial terms in nCx.
What's a nice/efficient way to achieve this?
This is a way to collect all the values of bin in a pretty short time:
from scipy.special import comb
import numpy as np
def distribution(n, p=0.5):
    x = np.arange(n+1)
    return comb(n, x, exact=False) * p ** x * (1 - p) ** (n - x)
It can be done in a quarter of a microsecond for n=1000.
Sample run:
>>> distribution(4)
array([0.0625, 0.25 , 0.375 , 0.25 , 0.0625])
You can sum specific parts of this array like so:
>>> np.sum(distribution(4)[2:4])
0.625
Remark: for n > 1000 the middle values of this distribution involve extremely large numbers in the multiplication, so a RuntimeWarning (overflow) is raised.
As a fix for that, you can use scipy.stats.binom equivalently:
import numpy as np
from scipy.stats import binom

def distribution(n, p):
    return binom.pmf(np.arange(n+1), n, p)
This does the same as the method above quite efficiently (n=1000000 in about a third of a second). Alternatively, you can use binom.cdf(np.arange(n+1), n, p), which computes the cumulative sum of binom.pmf; subtracting the a-th entry of this array from the b-th then gives an output very close to the normalizing sum you need.
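Putting those pieces together, a minimal sketch of drawing samples from the truncated distribution; the function name and the example numbers are mine, and it materializes the whole support [a, b], so it only makes sense when b - a is of manageable size:
import numpy as np
from scipy.stats import binom

def sample_truncated_binom(n, p, a, b, size=1):
    # pmf restricted to [a, b] and renormalized, as in the question's f(x)
    support = np.arange(a, b + 1)
    pmf = binom.pmf(support, n, p)
    pmf /= pmf.sum()
    return np.random.choice(support, size=size, p=pmf)

samples = sample_truncated_binom(n=1000, p=0.5, a=480, b=520, size=5)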
Another way would be to use the CDF and its inverse, something like:
from scipy import stats
dist = stats.binom(100, 0.5)
# limit ourselves to [60, 100]
lo, hi = dist.cdf([60, 100])
# draw a sample
x = dist.ppf(stats.uniform(lo, hi-lo).rvs())
This should give us values in the desired range. Note that, due to floating-point precision, it might occasionally give values just outside of what you want; it gets worse above the mean of the distribution.
Note that for large values you might as well use the normal approximation.
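A rough sketch of that normal approximation (my own illustration; the numbers are made up, and rounding plus truncation makes it only an approximation to the exact truncated binomial):
import numpy as np
from scipy import stats

n, p, a, b = 10**6, 0.5, 499000, 501000    # example values, not from the question
mu = n * p
sigma = np.sqrt(n * p * (1 - p))
# truncnorm takes its bounds in standard-deviation units around loc
lo, hi = (a - mu) / sigma, (b - mu) / sigma
x = np.rint(stats.truncnorm(lo, hi, loc=mu, scale=sigma).rvs(size=10)).astype(int)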
I'm doing matrix inversion in Python, and I found it very weird that the result differs depending on the scale of the data.
In the code below, it is expected that A_inv/B_inv = B/A. However, the difference between A_inv/B_inv and B/A becomes larger and larger as the data scale grows... Is this because Python cannot compute the matrix inverse precisely for matrices with large values?
Also, I checked the condition number of B, which is a constant (~3.016) no matter what the scale is.
Thanks!!!
import numpy as np
from matplotlib import pyplot as plt
D = 30
N = 300
np.random.seed(10)
original_data = np.random.sample([D, N])
A = np.cov(original_data)
A_inv = np.linalg.inv(A)
B_cond = []
diff = []
for k in range(1, 10):
    B = A * np.power(10, k)
    B_cond.append(np.linalg.cond(B))
    B_inv = np.linalg.inv(B)
    ### Two measurements of difference are used
    diff.append(np.log(np.linalg.norm(A_inv/B_inv - B/A)))
    #diff.append(np.max(np.abs(A_inv/B_inv - B/A)))
# print B_cond
plt.figure()
plt.plot(range(1, 10), diff)
plt.xlabel('data(B) / data(A)')
plt.ylabel('log(||A_inv/B_inv - B/A||)')
plt.savefig('Inversion for large matrix')
I may be wrong, but I think it comes from how numbers are represented in the machine.
When you are dealing with very large numbers, your inverse matrix is going to have entries that are very small in magnitude (close to zero). And close to zero, the floating-point representation is not precise enough, I guess...
https://en.wikipedia.org/wiki/Floating-point_arithmetic
There is no reason that you should expect np.linalg.norm(A_inv/B_inv - B/A) to be equal to anything special. Instead, you can check the quality of the inverse calculation by multiplying the original matrix by its inverse and checking the determinant, np.linalg.det(A.dot(A_inv)), which should be equal to 1.
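For example, with A, A_inv and D as in the question's snippet, that check could look like this (a short sketch; the np.allclose comparison against the identity is an extra check beyond what the answer mentions):
identity_approx = A.dot(A_inv)
print(np.linalg.det(identity_approx))           # should be very close to 1
print(np.allclose(identity_approx, np.eye(D)))  # element-wise check against the identity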
There is a result of a physical experiment that can be represented as a histogram [i, amount_of(i)]. I suppose that the result can be estimated by a mixture of 4 to 6 Gaussian functions.
Is there a package in Python which takes a histogram as an input and returns the mean and variance of each Gaussian distribution in the mixture distribution?
Original data, for example: [histogram figure omitted]
This is a mixture of Gaussians, and it can be estimated using an expectation-maximization approach: basically, it finds the centers and widths of the component distributions at the same time as it estimates how they are mixed together.
This is implemented in the PyMix package. Below I generate an example of a mixture of normals and use PyMix to fit a mixture model to it, including figuring out what you're interested in, which is the size of the subpopulations:
# requires numpy and PyMix (matplotlib is just for making a histogram)
import random
import numpy as np
from matplotlib import pyplot as plt
import mixture
random.seed(010713) # to make it reproducible
# create a mixture of normals:
# 1000 from N(0, 1)
# 2000 from N(6, 2)
mix = np.concatenate([np.random.normal(0, 1, [1000]),
                      np.random.normal(6, 2, [2000])])
# histogram:
plt.hist(mix, bins=20)
plt.savefig("mixture.pdf")
All the above code does is generate the mixture and plot its histogram. [histogram figure omitted]
Now to actually use PyMix to figure out what the percentages are:
data = mixture.DataSet()
data.fromArray(mix)
# start them off with something arbitrary (probably based on a guess from the figure)
n1 = mixture.NormalDistribution(-1,1)
n2 = mixture.NormalDistribution(1,1)
m = mixture.MixtureModel(2,[0.5,0.5], [n1,n2])
# perform expectation maximization
m.EM(data, 40, .1)
print m
The output model of this is:
G = 2
p = 1
pi =[ 0.33307859 0.66692141]
compFix = [0, 0]
Component 0:
ProductDist:
Normal: [0.0360178848449, 1.03018725918]
Component 1:
ProductDist:
Normal: [5.86848468319, 2.0158608802]
Notice it found the two normals quite correctly (approximately one N(0, 1) and one N(6, 2)). It also estimated pi, the fraction of samples in each of the two distributions (which you mention in the comments is what you're most interested in). We had 1000 in the first distribution and 2000 in the second, and it gets the split almost exactly right: [ 0.33307859 0.66692141]. If you want this value directly, use m.pi.
A few notes:
This approach takes a vector of values, not a histogram. It should be easy to convert your data into a 1D vector (that is, turn [(1.4, 2), (2.6, 3)] into [1.4, 1.4, 2.6, 2.6, 2.6]); see the sketch after these notes.
We had to guess the number of gaussian distributions in advance (it won't figure out a mix of 4 if you ask for a mix of 2).
We had to put in some initial estimates for the distributions. If you make even remotely reasonable guesses it should converge to the correct estimates.
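For the first note, a minimal sketch of that histogram-to-vector conversion (the bin values and counts are just the toy example from the note):
import numpy as np

values = np.array([1.4, 2.6])         # bin values i
counts = np.array([2, 3])             # amount_of(i)
samples = np.repeat(values, counts)   # -> [1.4, 1.4, 2.6, 2.6, 2.6]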
I need to generate a vector of random floats in [0, 1] such that their sum equals 1 and that they are distributed non-uniformly.
Is there any python function that generates such a vector?
Best wishes
The distribution you are probably looking for is called the Dirichlet distribution. There's no built-in function in Python for drawing random numbers from a Dirichlet distribution, but NumPy contains one:
>>> from numpy.random import dirichlet
>>> print(dirichlet([1] * n))
This will give you n numbers that sum up to 1, and the probability of each such combination will be equal.
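If you instead want the vector to systematically favour some positions over others, you can pass unequal concentration parameters; a small usage sketch (the particular values are just an illustration):
>>> dirichlet([1] * 5)            # flat: every combination equally likely
>>> dirichlet([10, 1, 1, 1, 1])   # skewed: the first entry tends to be the largest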
Alternatively, if you don't have NumPy, you can make use of the fact that a random sample from an n-dimensional Dirichlet distribution can be generated by drawing n independent samples from a gamma distribution with shape and scale parameters equal to 1 and then dividing each sample by their sum:
>>> from random import gammavariate
>>> def dirichlet(n):
...     samples = [gammavariate(1, 1) for _ in range(n)]
...     sum_samples = sum(samples)
...     return [x / sum_samples for x in samples]
The reason why you need a Dirichlet distribution is that if you simply draw random numbers uniformly from some interval and then divide them by their sum, the resulting distribution will be biased towards samples consisting of roughly equal numbers. See Luc Devroye's book for more on this topic.
There is a nicer example on the Wikipedia page for the Dirichlet distribution.
The code below generates a k-dimensional sample:
import random

params = [a1, a2, ..., ak]   # concentration parameters, left as placeholders
sample = [random.gammavariate(a, 1) for a in params]
sample = [v / sum(sample) for v in sample]
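For comparison, the same k-dimensional sample can be drawn in one call with NumPy (a usage sketch; the concentration values below are made up):
import numpy as np

params = [0.5, 1.0, 2.0]               # hypothetical concentration parameters
sample = np.random.dirichlet(params)   # array of three values summing to 1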