Computing percentiles when given a distribution - python

Let's say I have a vector of values, and a vector of probabilities. I want to compute the percentile over the values, but using the given vector of probabilities.
Say, for example,
import numpy as np
vector = np.array([4, 2, 3, 1])
probs = np.array([0.7, 0.1, 0.1, 0.1])
Ignoring probs, np.percentile(vector, 10) gives me 1.3. However, it's clear that the lowest 10% here have value of 1, so that would be my desired output.
If the result lies between two data points, I'd prefer linear interpolation as documented for the original percentile function.
How would I solve this in Python most conveniently? As in my example, vector will not be sorted. probs always sums to 1. I'd prefer solutions that don't require "non-standard" packages, by any reasonable definition.

If you're prepared to sort your values, then you can construct an interpolating function that allows you to compute the inverse of the probability distribution. This is probably more easily done with scipy.interpolate than with pure numpy routines:
import scipy.interpolate
ordering = np.argsort(vector)
distribution = scipy.interpolate.interp1d(np.cumsum(probs[ordering]), vector[ordering], bounds_error=False, fill_value='extrapolate')
If you interrogate this distribution with the percentile (in the range 0..1), you should get the answers you want, e.g. distribution(0.1) gives 1.0, distribution(0.5) gives about 3.29.
A similar thing can be done with numpy's interp() function, avoiding the extra dependency on scipy, but that would involve reconstructing the interpolating function every time you want to calculate a percentile. This might be fine if you have a fixed list of percentiles that is known before you estimate the probability distribution.
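For the NumPy-only route, a minimal sketch could look like the following. The helper name `weighted_percentile` is made up for illustration. Note that `np.interp` clamps to the end values outside the range of the cumulative probabilities rather than extrapolating, so very low percentiles simply return the smallest value.

```python
import numpy as np

def weighted_percentile(values, probs, q):
    """Interpolated percentile of `values` under `probs`; q is a fraction in [0, 1]."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    ordering = np.argsort(values)            # sort values, carry the probabilities along
    cumulative = np.cumsum(probs[ordering])  # cumulative probability at each sorted value
    return np.interp(q, cumulative, values[ordering])

vector = np.array([4, 2, 3, 1])
probs = np.array([0.7, 0.1, 0.1, 0.1])
p10 = weighted_percentile(vector, probs, 0.1)
p50 = weighted_percentile(vector, probs, 0.5)
print(p10, p50)
```

The sorting and cumulative sum are redone on every call, which is the cost mentioned above; passing an array of percentiles as `q` computes them all in one go.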

One solution would be to use sampling via numpy.random.choice and then numpy.percentile:
N = 50 # number of samples to draw
samples = np.random.choice(vector, size=N, p=probs, replace=True)
interpolation = "nearest"
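print("25th percentile", np.percentile(samples, 25, interpolation=interpolation))
print("75th percentile", np.percentile(samples, 75, interpolation=interpolation))
Depending on your kind of data (discrete or continuous) you may want to use different values for the interpolation parameter. In NumPy 1.22 and later this parameter is named method, and interpolation is the deprecated older spelling. Note also that the result is only an estimate: it varies from run to run and gets more accurate as N grows.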


Weighted 1D interpolation of cloud data point

I have a cloud of data points (x,y) that I would like to interpolate and smooth.
Currently, I am using scipy :
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter
spl = interp1d(Cloud[:,1], Cloud[:,0]) # interpolation
x = np.linspace(Cloud[:,1].min(), Cloud[:,1].max(), 1000)
smoothed = savgol_filter(spl(x), 21, 1) #smoothing
This works pretty well, except that I would like to assign weights to the data points passed to interp1d. Any suggestions for another function that handles this?
Basically, I thought that I could just repeat each point of the cloud according to its weight, but that is inefficient: it greatly increases the number of points to interpolate and slows down the algorithm.
The default interp1d uses linear interpolation, i.e., it simply computes a line between two points. A weighted interpolation does not make much sense mathematically in such a scenario: there is only one way in Euclidean space to draw a straight line between two points.
Depending on your goal, you can look into other methods of interpolation, e.g., B-splines. Then you can use scipy's scipy.interpolate.splrep and set the w argument:
w - Strictly positive rank-1 array of weights the same length as x and y. The weights are used in computing the weighted least-squares spline fit. If the errors in the y values have standard-deviation given by the vector d, then w should be 1/d. Default is ones(len(x)).
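As a rough sketch of how the w argument might be used, the example below fits a weighted smoothing spline to synthetic data. The data, the per-point standard deviations in `sigma`, and the choice `s=len(x)` are all assumptions made for illustration; with real data, `x` would be `Cloud[:,1]` sorted in increasing order and the weights would come from your own error estimates.

```python
import numpy as np
from scipy.interpolate import splrep, splev

rng = np.random.default_rng(0)

# Synthetic stand-in for the point cloud; splrep expects x in increasing order.
x = np.sort(rng.uniform(0.0, 10.0, 200))
sigma = np.full(x.size, 0.2)   # assumed standard deviation of each point
sigma[::10] = 1.0              # pretend every tenth point is much noisier
y = np.sin(x) + rng.normal(0.0, sigma)

# Weights are 1/sigma, as the documentation quoted above suggests.
weights = 1.0 / sigma
tck = splrep(x, y, w=weights, s=len(x))  # s sets the amount of smoothing

x_new = np.linspace(x.min(), x.max(), 1000)
smoothed = splev(x_new, tck)
```

Because the spline is a least-squares fit rather than an exact interpolant, it smooths and weights in one step, so a separate savgol_filter pass may not be needed.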

`fft` not returning what it should

I am trying to perform Fourier transform using numpy's fft as follows:
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(0,1, 128)
x = np.cos(2*np.pi*t)
s_fft = np.fft.fft(x)
s_fft_freq = np.fft.fftshift(np.fft.fftfreq(t.shape[-1], t[1]-t[0]))
plt.plot(s_fft_freq, np.abs(s_fft))
The result I get (plot not shown) is wrong: I know the FT should peak at f = 1, since the frequency of the cos is 1.
What am I doing wrong?
You are only applying fftshift to the x-axis labels, not the actual FFT magnitudes - you just need to apply s_fft = np.fft.fftshift(np.fft.fft(x)) too.
There are 2 or 3 things you have gotten wrong:
The FFT will peak at two positions for a pure real-valued frequency. This is the plus and minus frequencies. The only way to get a single peak in the Fourier domain is by having a complex valued signal (or having the trivial DC component).
(if with f, you mean frequency index) When using the DFT, the number of samples determines how many frequency components you have. At the highest frequency index, you are always close to the per-sample oscillation: (-1)^t
(if with f, you mean amplitude) There are many definitions of the DFT, affecting both the forward and backward transform. This will affect how the values are interpreted when reading the spectrum.
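Putting the first two points together, a small sketch of the corrected computation is shown below (plotting left out). It reuses the question's signal; the variable names `spectrum`, `freqs` and `peak_freqs` are introduced here for illustration. Because np.linspace(0, 1, 128) includes the endpoint, the sample spacing is 1/127, so the peaks land close to, but not exactly at, f = ±1.

```python
import numpy as np

t = np.linspace(0, 1, 128)
x = np.cos(2 * np.pi * t)

# Shift both the spectrum and the frequency axis so they stay aligned.
spectrum = np.fft.fftshift(np.fft.fft(x))
freqs = np.fft.fftshift(np.fft.fftfreq(t.size, t[1] - t[0]))

magnitude = np.abs(spectrum)
# The two largest bins are the negative and positive frequency peaks.
peak_freqs = np.sort(freqs[np.argsort(magnitude)[-2:]])
print(peak_freqs)
```

Plotting `freqs` against `magnitude` then shows the symmetric pair of peaks expected for a real-valued cosine.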

How to generate a Q-Q plot manually without inverse distribution function in python

I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (q-q) plot.
The formulas for my 4 distributions are:
where K0 is the modified Bessel function of the second kind and zeroth order, and Γ is the gamma function.
My sample style looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7 ...), so I have multiple identical values and also gaps in between them.
I've read the instructions on this site and tried to implement them in python. So, like in the link:
1) I sorted my data from the smallest to the largest value.
2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.
3) And this is the point I can't manage.
As far as I understand, I should now use the values I calculated beforehand (those evenly spaced values), put them in the inverse functions of my above distributions and thus compute the theoretical quantiles of my distributions.
For reference, here are the inverse functions (partly calculated with WolframAlpha, as far as that was possible):
where W is the Lambert W-function and everything in brackets afterwards is the argument.
The problem is that apparently no inverse function exists for the first distribution. The next one would probably produce complex values (a negative number under the root, because b = 0.55 according to the fit), and the last two contain a Lambert W-function (which I'm unsure how to implement in Python).
So my question is, is there a way to calculate the q-q plots without the analytical expressions of the inverse distribution functions?
I'd appreciate any help you could give me very much!
A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose that one that has the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you have already.
The log likelihood is just the sum of log p(x|model) where p(x|model) is the probability density of datum x under a given model. Here "model" = model with parameters selected by maximizing the log likelihood over the possible values of the parameters.
You can be more careful about this by integrating the log likelihood over the parameter space, taking into account also any prior probability assigned to each model; that would be a Bayesian approach.
It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which, despite its heavy name, is pretty simple: it is the largest absolute difference between the model's cdf and the empirical cdf of the sample. That's defensible, but I think comparing log likelihoods is more conventional, and also simpler since you need only the pdf.
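A minimal sketch of the log-likelihood comparison follows. The four densities from the question are not reproduced here, so three built-in scipy.stats distributions stand in as candidate models and the data are synthetic; both are assumptions for illustration only. With your own densities you would sum the log of each fitted pdf over the sample in the same way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)  # stand-in for the real observations

# Candidate models and the fit options for each (floc=0 pins the lower bound at 0).
candidates = {
    "expon": (stats.expon, {"floc": 0}),
    "rayleigh": (stats.rayleigh, {"floc": 0}),
    "norm": (stats.norm, {}),
}

log_likelihoods = {}
for name, (dist, fit_options) in candidates.items():
    params = dist.fit(data, **fit_options)  # maximum-likelihood parameters
    log_likelihoods[name] = np.sum(dist.logpdf(data, *params))

best = max(log_likelihoods, key=log_likelihoods.get)
print(log_likelihoods, best)
```

When the candidates have different numbers of free parameters, a penalized criterion such as AIC is a common refinement of the raw log likelihood.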
It happens that there is an easier way. It's taken me a day or two to dig around until I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!
First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed just add them to the def statement and use them in the return statement as required.
>>> from scipy import stats
>>> param = 3/2
>>> import numpy as np
>>> class NoName(stats.rv_continuous):
...     def _pdf(self, x, param):
...         return param*np.exp(-param*x)
Now create an instance of this object, declare the lower end of its support (ie, the lowest value that the r.v. can assume), and what the parameters are called.
>>> noname = NoName(a=0, shapes='param')
I don't have an actual sample of values to play with. I'll create a pseudo-random sample.
>>> sample = noname.rvs(size=100, param=param)
Sort it to make it into the so-called 'empirical cdf'.
>>> empirical_cdf = sorted(sample)
The sample has 100 elements, so generate 100 points at which to evaluate the inverse cdf, or quantile function, as discussed in the paper you referenced.
>>> theoretical_points = [(_-0.5)/len(sample) for _ in range(1, 1+len(sample))]
Get the quantile function values at these points.
>>> theoretical_cdf = [noname.ppf(_, param=param) for _ in theoretical_points]
Plot it all.
>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_cdf, theoretical_cdf)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
Here's the Q-Q plot that results.
Darn it ... Sorry, I was fixated on a slick solution that would somehow bypass the missing inverse CDF and calculate the quantiles directly (avoiding any numerical approaches). But it can also be done by simple brute force.
First, define a grid of candidate quantiles for your distributions yourself (for instance ten times finer than the original/empirical quantiles). Then calculate the CDF value at each grid point. Then compare these values one by one with the evenly spaced probabilities calculated in step 2 of the question. For each probability, the grid point whose CDF value deviates least from it is the quantile you were looking for.
The precision of this solution is limited by the resolution of the quantiles you defined yourself.
But maybe I'm wrong and there is a more elegant way to solve this problem, then I would be happy to hear it!
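One way the brute-force idea could be sketched is shown below: tabulate the CDF on a fine grid by numerically integrating the pdf, then invert it by interpolation instead of picking the nearest grid point. The function name `numeric_quantiles`, the grid limits, and the exponential test density are assumptions for illustration; the sketch also assumes that nearly all of the probability mass lies between `lo` and `hi`.

```python
import numpy as np

def numeric_quantiles(pdf, probs, lo, hi, n_grid=20001):
    """Approximate quantiles of a density by tabulating and inverting its CDF."""
    x = np.linspace(lo, hi, n_grid)
    density = pdf(x)
    # Cumulative trapezoidal integration of the pdf gives the CDF on the grid.
    steps = 0.5 * (density[1:] + density[:-1]) * np.diff(x)
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    cdf /= cdf[-1]  # renormalise so the tabulated CDF ends exactly at 1
    # Inverting the CDF: look up each probability on the cdf axis.
    return np.interp(probs, cdf, x)

rate = 1.5
pdf = lambda x: rate * np.exp(-rate * x)   # test density with a known quantile function

n = 100
probs = (np.arange(1, n + 1) - 0.5) / n    # plotting positions in (0, 1)
theoretical = numeric_quantiles(pdf, probs, 0.0, 20.0)
median = numeric_quantiles(pdf, 0.5, 0.0, 20.0)
print(median, np.log(2) / rate)
```

Plotting the sorted sample against `theoretical` then gives the Q-Q plot. The accuracy is limited by the grid resolution, as noted above.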

Separate mixture of gaussians in Python

There is a result of some physical experiment, which can be represented as a histogram [i, amount_of(i)]. I suppose that result can be estimated by a mixture of 4 - 6 Gaussian functions.
Is there a package in Python which takes a histogram as an input and returns the mean and variance of each Gaussian distribution in the mixture distribution?
Original data, for example:
This is a mixture of gaussians, and can be estimated using an expectation maximization approach (basically, it finds the means and variances of the component distributions at the same time as it estimates how they are mixed together).
This is implemented in the PyMix package. Below I generate an example of a mixture of normals, and use PyMix to fit a mixture model to them, including figuring out what you're interested in, which is the size of subpopulations:
# requires numpy and PyMix (matplotlib is just for making a histogram)
import numpy as np
from matplotlib import pyplot as plt
import mixture
np.random.seed(10713) # seed NumPy's generator, the one used below, to make it reproducible
# create a mixture of normals:
# 1000 from N(0, 1)
# 2000 from N(6, 2)
mix = np.concatenate([np.random.normal(0, 1, [1000]),
                      np.random.normal(6, 2, [2000])])
# histogram:
plt.hist(mix, bins=20)
All the above code does is generate and plot the mixture. It looks like this:
Now to actually use PyMix to figure out what the percentages are:
data = mixture.DataSet()
data.fromArray(mix) # load the generated sample into the DataSet
# start them off with something arbitrary (probably based on a guess from the figure)
n1 = mixture.NormalDistribution(-1,1)
n2 = mixture.NormalDistribution(1,1)
m = mixture.MixtureModel(2,[0.5,0.5], [n1,n2])
# perform expectation maximization
m.EM(data, 40, .1)
print(m)
The output model of this is:
G = 2
p = 1
pi =[ 0.33307859 0.66692141]
compFix = [0, 0]
Component 0:
Normal: [0.0360178848449, 1.03018725918]
Component 1:
Normal: [5.86848468319, 2.0158608802]
Notice it found the two normals quite correctly (one N(0, 1) and one N(6, 2), approximately). It also estimated pi, which is the fraction in each of the two distributions (you mention in the comments that's what you're most interested in). We had 1000 in the first distribution and 2000 in the second distribution, and it gets the division almost exactly right: [ 0.33307859 0.66692141]. If you want to get this value directly, do m.pi.
A few notes:
This approach takes a vector of values, not a histogram. It should be easy to convert your data into a 1D vector (that is, turn [(1.4, 2), (2.6, 3)] into [1.4, 1.4, 2.6, 2.6, 2.6])
We had to guess the number of gaussian distributions in advance (it won't figure out a mix of 4 if you ask for a mix of 2).
We had to put in some initial estimates for the distributions. If you make even remotely reasonable guesses it should converge to the correct estimates.
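For the first note, the conversion from (value, count) pairs to a flat vector can be sketched with np.repeat. The variable names are illustrative, and the counts are assumed to be non-negative integers.

```python
import numpy as np

histogram = [(1.4, 2), (2.6, 3)]      # (value, count) pairs
values, counts = zip(*histogram)
samples = np.repeat(values, counts)   # each value repeated `count` times
print(samples)                        # [1.4 1.4 2.6 2.6 2.6]
```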

Get the formula of a interpolation function created by scipy

I have done some work in Python, but I'm new to scipy. I'm trying to use the methods from the interpolate library to come up with a function that will approximate a set of data.
I've looked up some examples to get started, and could get the sample code below working in Python(x,y):
import numpy as np
from scipy.interpolate import interp1d, Rbf
import pylab as P
# show the plot (empty for now)
# generate random input data
original_data = np.linspace(0, 1, 10)
# random noise to be added to the data
noise = (np.random.random(10)*2 - 1) * 1e-1
# calculate f(x)=sin(2*PI*x)+noise
f_original_data = np.sin(2 * np.pi * original_data) + noise
# create interpolator
rbf_interp = Rbf(original_data, f_original_data, function='gaussian')
# Create new sample data (for input), calculate f(x)
#using different interpolation methods
new_sample_data = np.linspace(0, 1, 50)
rbf_new_sample_data = rbf_interp(new_sample_data)
# draw all results to compare
P.plot(original_data, f_original_data, 'o', ms=6, label='f_original_data')
P.plot(new_sample_data, rbf_new_sample_data, label='Rbf interp')
The plot is displayed as follows:
Now, is there any way to get a polynomial expression representing the interpolated function created by Rbf (i.e. the method created as rbf_interp)?
Or, if this is not possible with Rbf, any suggestions using a different interpolation method, another library, or even a different tool are also welcome.
The RBF uses whatever function you ask for. It is a global model, so yes, there is a resulting function, but you will probably not like it, since it is a sum over many gaussians. You get:
rbf.nodes # the factors for each of the RBF (probably gaussians)
rbf.xi # the centers.
rbf.epsilon # the width of the gaussian, but remember that the Norm plays a role too
With these you can calculate the distances from a new point to the centers in rbf.xi, then plug the distances, together with the factors in rbf.nodes and rbf.epsilon, into the gaussian (or whatever function you asked it to use). You can check the Python code of __call__ and _call_norm.
So for 1-D data you get something like sum(rbf.nodes[i] * gaussian(rbf.epsilon, sqrt((x - center)**2)) for i, center in enumerate(rbf.xi[0])), to give some funny half code/formula, where x is the point being evaluated. The RBF's function is written in the documentation, but you can also check the Python code.
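To make the half formula concrete, the sketch below rebuilds the interpolant by hand for 1-D data and compares it with the Rbf object. It assumes the legacy scipy.interpolate.Rbf class with function='gaussian', whose documented kernel is exp(-(r/epsilon)**2), and noise-free sample data; the helper name `rbf_formula` is made up here.

```python
import numpy as np
from scipy.interpolate import Rbf

x_data = np.linspace(0, 1, 10)
y_data = np.sin(2 * np.pi * x_data)
rbf = Rbf(x_data, y_data, function='gaussian')

def rbf_formula(x):
    """Evaluate sum_i nodes[i] * exp(-((x - xi[i]) / epsilon)**2) explicitly."""
    centers = rbf.xi[0]                                   # 1-D centers
    x = np.asarray(x, dtype=float)
    r = np.abs(x[:, None] - centers[None, :])             # distance to every center
    return np.exp(-(r / rbf.epsilon) ** 2) @ rbf.nodes    # weighted sum of gaussians

x_new = np.linspace(0, 1, 50)
manual = rbf_formula(x_new)
builtin = rbf(x_new)
print(np.max(np.abs(manual - builtin)))
```

So the "formula" is fully determined by `rbf.nodes`, `rbf.xi` and `rbf.epsilon`: one gaussian term per data point, which is why it is exact but not short.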
The answer is no, there is no "nice" way to write down the formula, or at least not in a short way. Some types of interpolations, like RBF and Loess, do not directly search for a parametric mathematical function to fit to the data and instead they calculate the value of each new data point separately as a function of the other points.
These interpolations are guaranteed to always give a good fit for your data (such as in your case), and the reason for this is that to describe them you need a very large number of parameters (basically all your data points). Think of it this way: you could interpolate linearly by connecting consecutive data points with straight lines. You could fit any data this way and then describe the function in a mathematical form, but it would take a large number of parameters (at least as many as the number of points). Actually what you are doing right now is pretty much a smoothed version of that.
If you want the formula to be short, this means you want to describe the data with a mathematical function that does not have many parameters (specifically the number of parameters should be much lower than the number of data points). Such examples are logistic functions, polynomial functions and even the sine function (that you used to generate the data). Obviously, if you know which function generated the data that will be the function you want to fit.
RBF likely stands for Radial Basis Function. I wouldn't be surprised if scipy.interpolate.Rbf was the function you're looking for.
However, I doubt you'll be able to find a polynomial expression to represent your result.
If you want to try different interpolation methods, check the corresponding SciPy documentation, which links to RBF, splines, and so on.
I don’t think SciPy’s RBF will give you the actual function. But one thing that you could do is sample the function that SciPy’s RBF gave you (e.g. at 100 points), then use Lagrange interpolation with those points. This will generate a polynomial function for you. Here is an example of how this would look. If you do not want to use Lagrange interpolation, you can also use Newton’s divided difference method to generate a polynomial function.
My answer is based on NumPy only:
import matplotlib.pyplot as plt
import numpy as np
x_data = [324, 531, 806, 1152, 1576, 2081, 2672, 3285, 3979, 4736]
y_data = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
x = np.array(x_data)
y = np.array(y_data)
model = np.poly1d(np.polyfit(x, y, 2))
ynew = model(x)
plt.plot(x, y, 'o', x, ynew, '-')
plt.ylabel(str(model).strip())
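The same polyfit approach can be pointed at the question's kind of data to obtain a short explicit polynomial. In the sketch below a noise-free sine is sampled directly as a stand-in (values sampled from a fitted interpolant would be used the same way), and the degree 5 is an arbitrary choice for illustration.

```python
import numpy as np

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)          # stand-in for values sampled from an interpolant

degree = 5
coeffs = np.polyfit(x, y, degree)  # least-squares coefficients, highest power first
poly = np.poly1d(coeffs)

max_error = np.max(np.abs(poly(x) - y))
print(poly)                        # the explicit polynomial expression
print(max_error)
```

A low degree keeps the formula short at the cost of some approximation error, whereas a very high degree tends to become numerically unstable.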

