Distributions and p-values in Python

I have a big list of numbers, and I would like to create a distribution out of this data, plot it, then find the p-value for every number in my list with respect to that distribution.
Is it possible to do this in Python? I can't find it in the matplotlib documentation. Should I be using something else?

I would suggest looking into the stats module of scipy; it offers numerous statistical functions for tasks like this. For plotting, I would still use matplotlib.
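As a minimal sketch of that suggestion (assuming, purely for illustration, that a normal distribution fits the data; scipy.stats offers many other distributions with the same fit/sf interface):

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

data = np.random.normal(loc=5.0, scale=2.0, size=1000)   # placeholder data

mu, sigma = stats.norm.fit(data)           # fit the distribution to the data
pvals = stats.norm.sf(data, mu, sigma)     # upper-tail p-value for each number

plt.hist(data, bins=30, density=True, alpha=0.5)          # the data
xs = np.linspace(data.min(), data.max(), 200)
plt.plot(xs, stats.norm.pdf(xs, mu, sigma))               # the fitted density
plt.show()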

You can use the searchsorted function from the numpy module, which gives you the rank a set of values would have in a sorted array. You can then turn that rank into a p-value simply by dividing it by the length of the original array:
import numpy as np

data = np.sort(np.random.rand(10))    # reference sample, kept sorted
new_data = np.random.rand(5)          # new values to evaluate
pvals = np.searchsorted(data, new_data) / len(data)
print(pvals)
# five values between 0.0 and 1.0; for the sorted sample itself,
# np.searchsorted(data, data) / len(data) gives [0.  0.1 0.2 ... 0.9]
In fact, if you want the p-values of the original numbers you don't need any special function at all: the p-values are just each value's rank in the sorted dataset divided by its length.
If you need the p-values of new values with respect to your original ones, you can use the snippet above.
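For completeness, a short sketch of that rank-divided-by-length idea for the original numbers:

import numpy as np

data = np.random.rand(10)
ranks = np.argsort(np.argsort(data))   # 0-based rank of each value in the sorted order
pvals_original = ranks / len(data)     # each value's empirical p-value: 0.0, 0.1, ..., 0.9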

Related

Merging similar columns in NumPy, probability vector

I have a numpy array of the following form:
[image: a 2×N array; the first row holds probabilities, the second row holds values]
It is a probability vector, where the second row corresponds to a value and the first row to the probability that this value is realized. (e.g. the probability of getting 1.0 is 20%)
When two values are close to each other, I want to merge their columns by adding up the probabilities. In this example I want to have:
[image: the desired merged array, with the probabilities of close values added up]
My current solution involves 3 loops and is really slow for larger arrays. Does someone know an efficient way to program this in NumPy?
While it won't do exactly what you want, you could try to use np.histogram to tackle the problem.
For example, say you just want two "bins", as in your example. You could do
import numpy as np
x = np.array([[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]])
# sum the probabilities (the weights) of the values that fall into each bin
hist, bin_edges = np.histogram(x[1, :], bins=[0, 1.0, 1.5], weights=x[0, :])
and then stack your histogram with the leading bin edges to get your output
print(np.stack([hist, bin_edges[:-1]]))
This will print
[[0.7 0.3]
[0. 1. ]]
You can use the bins parameter to get your desired output. I hope this helps.
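An alternative sketch (not part of the histogram approach above): round the values to an assumed tolerance, then sum the probabilities of the columns that collapse onto the same rounded value:

import numpy as np

x = np.array([[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]])
decimals = 1                                     # assumed merge tolerance

rounded = np.round(x[1, :], decimals)            # values considered "close" become equal
values, inverse = np.unique(rounded, return_inverse=True)
merged_probs = np.bincount(inverse, weights=x[0, :])
print(np.stack([merged_probs, values]))
# [[0.7 0.3]
#  [0.  1. ]]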

Constraints on search space - python - scikit

I'm searching for minimum in 100d space. I'm using gp_minimize from skopt (python 3.6).
space = [(0., 1.) for _ in range(100)]
res = gp_minimize(f, space)
However, I also have a constraint that the value in each subsequent dimension is not larger than in the previous dimension. For example, in the 5-d case the point [1, 0.9, 0.9, 0.8, 0.7] is OK, while the point [1, 0.3, 0.5, 0.4, 0.2] is not.
How to add this constraint using skopt?
The best way I found is to modify the function f. Choose an upper bound for f, and everywhere in the domain where f is not supposed to be evaluated, have it return this upper bound.
It is straightforward to see that this is a mathematically sound approach, as it doesn't change the minimum and constrains your search space in roughly the same spirit as a penalty method (or Lagrange multipliers). However, I don't know how well it plays with the algorithm, since I don't really know how Bayesian optimization handles wide plateaus.
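A minimal sketch of that idea (the objective f, the bound UPPER_BOUND and the 5-d space below are placeholders, not taken from the question):

from skopt import gp_minimize

UPPER_BOUND = 1e6   # any value known to exceed f over the feasible region

def f(x):
    # hypothetical objective; replace with your own function
    return sum(xi ** 2 for xi in x)

def f_constrained(x):
    # reject points whose coordinates increase anywhere
    if any(x[i] < x[i + 1] for i in range(len(x) - 1)):
        return UPPER_BOUND
    return f(x)

space = [(0.0, 1.0) for _ in range(5)]
res = gp_minimize(f_constrained, space, n_calls=30, random_state=0)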

Computing percentiles when given a distribution

Let's say I have a vector of values, and a vector of probabilities. I want to compute the percentile over the values, but using the given vector of probabilities.
Say, for example,
import numpy as np
vector = np.array([4, 2, 3, 1])
probs = np.array([0.7, 0.1, 0.1, 0.1])
Ignoring probs, np.percentile(vector, 10) gives me 1.3. However, it's clear that the lowest 10% here have a value of 1, so that would be my desired output.
If the result lies between two data points, I'd prefer linear interpolation as documented for the original percentile function.
How would I solve this in Python most conveniently? As in my example, vector will not be sorted. probs always sums to 1. I'd prefer solutions that don't require "non-standard" packages, by any reasonable definition.
If you're prepared to sort your values, then you can construct an interpolating function that allows you to compute the inverse of the probability distribution. This is probably more easily done with scipy.interpolate than with pure numpy routines:
import numpy as np
import scipy.interpolate

ordering = np.argsort(vector)
distribution = scipy.interpolate.interp1d(
    np.cumsum(probs[ordering]), vector[ordering],
    bounds_error=False, fill_value='extrapolate')
If you interrogate this distribution with the percentile (in the range 0..1), you should get the answers you want, e.g. distribution(0.1) gives 1.0, distribution(0.5) gives about 3.29.
A similar thing can be done with numpy's interp() function, avoiding the extra dependency on scipy, but that would involve reconstructing the interpolating function every time you want to calculate a percentile. This might be fine if you have a fixed list of percentiles that is known before you estimate the probability distribution.
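For reference, a minimal sketch of that np.interp variant (note that np.interp clips rather than extrapolates outside the cumulative-probability range):

import numpy as np

vector = np.array([4, 2, 3, 1])
probs = np.array([0.7, 0.1, 0.1, 0.1])

ordering = np.argsort(vector)
cum_probs = np.cumsum(probs[ordering])

print(np.interp(0.1, cum_probs, vector[ordering]))   # 1.0
print(np.interp(0.5, cum_probs, vector[ordering]))   # about 3.29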
One solution would be to use sampling via numpy.random.choice and then numpy.percentile:
N = 50  # number of samples to draw
samples = np.random.choice(vector, size=N, p=probs, replace=True)

interpolation = "nearest"
print("25th percentile", np.percentile(samples, 25, interpolation=interpolation))
print("75th percentile", np.percentile(samples, 75, interpolation=interpolation))
Depending on your kind of data (discrete or continuous) you may want to use different values for the interpolation parameter.

How to generate a Q-Q plot manually without inverse distribution function in python

I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (q-q) plot.
The formulas for my 4 distributions are:
[formula images omitted]
where K0 is the modified Bessel function of the second kind and zeroth order, and Γ is the gamma function.
My sample looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7 ...), so I have multiple identical values and also gaps in between them.
I've read the instructions on this site and tried to implement them in python. So, like in the link:
1) I sorted my data from the smallest to the largest value.
2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.
3) And this is the point I can't manage.
As far as I understand, I should now use the values I calculated beforehand (those evenly spaced values), put them in the inverse functions of my above distributions and thus compute the theoretical quantiles of my distributions.
For reference, here are the inverse functions (partly calculated with WolframAlpha, as far as that was possible):
[formula images omitted]
where W is the Lambert W-function and everything in brackets afterwards is its argument.
The problem is that there apparently is no inverse function for the first distribution. The next one would probably produce complex values (negative under the root, because b = 0.55 according to the fit), and the last two involve the Lambert W-function (and I'm unsure how to implement that in Python).
So my question is, is there a way to calculate the q-q plots without the analytical expressions of the inverse distribution functions?
I'd appreciate any help you could give me very much!
A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose the one with the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you already have.
The log likelihood is just the sum of log p(x|model) where p(x|model) is the probability density of datum x under a given model. Here "model" = model with parameters selected by maximizing the log likelihood over the possible values of the parameters.
You can be more careful about this by integrating the log likelihood over the parameter space, taking into account also any prior probability assigned to each model; that would be a Bayesian approach.
It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which, despite its heavy name, is pretty simple -- it is the maximum difference between the model's CDF and the empirical CDF. That's defensible, but I think comparing log likelihoods is more conventional, and also simpler since you need only the pdf.
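As a concrete illustration of that comparison, here is a minimal sketch using two stand-in scipy.stats models (gamma and lognorm), since the four actual densities aren't reproduced here; swap in your own fitted models:

import numpy as np
from scipy import stats

# the sample from the question
sample = np.array([0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7])

for dist in (stats.gamma, stats.lognorm):
    params = dist.fit(sample, floc=0)              # maximum-likelihood fit, location fixed at 0
    loglik = np.sum(dist.logpdf(sample, *params))  # sum of log p(x|model)
    print(dist.name, loglik)
# prefer the model with the greatest log likelihood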
It happens that there is an easier way. It's taken me a day or two to dig around until I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!
First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed just add them to the def statement and use them in the return statement as required.
>>> from scipy import stats
>>> param = 3/2
>>> from math import exp
>>> class NoName(stats.rv_continuous):
...     def _pdf(self, x, param):
...         return param*exp(-param*x)
...
Now create an instance of this object, declare the lower end of its support (ie, the lowest value that the r.v. can assume), and what the parameters are called.
>>> noname = NoName(a=0, shapes='param')
I don't have an actual sample of values to play with. I'll create a pseudo-random sample.
>>> sample = noname.rvs(size=100, param=param)
Sort it to make it into the so-called 'empirical cdf'.
>>> empirical_cdf = sorted(sample)
The sample has 100 elements, so generate 100 points at which to sample the inverse cdf, or quantile function, as discussed in the paper you referenced.
>>> theoretical_points = [(_-0.5)/len(sample) for _ in range(1, 1+len(sample))]
Get the quantile function values at these points.
>>> theoretical_cdf = [noname.ppf(_, param=param) for _ in theoretical_points]
Plot it all.
>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_cdf, theoretical_cdf)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
>>> plt.show()
Here's the Q-Q plot that results.
Darn it ... Sorry, I was fixated on a slick solution to somehow bypass the missing inverse CDF and calculate the quantiles directly (and avoid any numerical approaches). But it can also be done by simple brute force.
First, you have to define quantiles for your distributions yourself (for instance, ten times finer than the original/empirical quantiles). Then you calculate the corresponding CDF values. Then you compare these values one by one with the ones calculated in step 2 of the question. The quantiles whose CDF values show the smallest deviations are the ones you were looking for.
The precision of this solution is limited by the resolution of the quantiles you defined yourself.
But maybe I'm wrong and there is a more elegant way to solve this problem, then I would be happy to hear it!
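A minimal sketch of that brute-force inversion, using an exponential CDF as a stand-in for the actual (non-invertible) CDFs; as noted above, the grid resolution limits the precision:

import numpy as np

def cdf(x, rate=1.5):
    # placeholder CDF; replace with the CDF of your fitted distribution
    return 1.0 - np.exp(-rate * x)

grid = np.linspace(0.0, 10.0, 10001)     # fine grid of candidate quantiles
grid_cdf = cdf(grid)

n = 11                                   # sample size
probs = (np.arange(1, n + 1) - 0.5) / n  # evenly spaced points from step 2

# for each probability, pick the grid point whose CDF value is closest
idx = np.abs(grid_cdf[None, :] - probs[:, None]).argmin(axis=1)
theoretical_quantiles = grid[idx]
print(theoretical_quantiles)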

Avoid interpolation problems with numpy float

I need to handle timespans in a library I am creating. My first idea was to keep it simple and encode them as years, using floats.
The problems arise, for instance, when I wish to perform interpolations. Say I have
xs = np.array([0, 0.7, 1.2, 3.0]) # times
ys = np.array([np.nan, 124.3, 214.0, np.nan]) # values associated
Outside the [0.7, 1.2] interval I would like to get np.nan, but inside it the obvious linear interpolation, in particular at the endpoints.
However, using
#!/usr/bin/python3.5
import numpy as np
from fractions import Fraction
import scipy.interpolate as scInt

if __name__ == "__main__":
    xs = np.array([0, 0.7, 1.2, 3.0])               # times
    ys = np.array([np.nan, 124.3, 214.0, np.nan])   # values associated

    interp = scInt.interp1d(xs, ys)
    xsInt = np.array([0, 7/10, 6/5 - 0.0001, 6/5, 6/5 + 0.0001])
    print(interp(xsInt))
I get
[nan, 124.3, 213.98206, nan, nan]
So, the correct value for 7/10, but a nan for 6/5, which is 1.2. There is no mystery in this; machine representation of floats can cause things like this. But anyway, it is an issue I need to deal with.
My first idea was to double the values in xs, so that I would interpolate over [x1-eps, x1+eps, x2-eps, x2+eps, ..., xn-eps, xn+eps], repeating each element of the ys vector twice: [y1, y1, y2, y2, y3, y3, ..., yn, yn]. This works, but it is quite ugly.
Then I thought I would use fractions.Fraction instead, but NumPy complained, saying that "object arrays are not supported". A pity; this seemed the way to go, although surely there would be a loss of performance.
There is another side to this problem: it would be nice to be able to create dictionaries whose keys are times of the same kind, and I fear that when I search using a float as a key, some lookups would fail due to the same issue.
My last idea was to use dates, like datetime.date, but I am not too happy with it because of the ambiguity in converting the difference between dates to year fractions.
What would be the best approach for this, is there a nice solution?
I think there is just no easy way out of this.
Floats are fundamentally not suitable to be checked for equality, and by evaluating your interpolation on the edges of its domain (or using floats as keys in dictionaries), you are doing exactly this.
Your solution using epsilons is a bit hacky, but honestly there probably is no more elegant way of working around this problem.
In general, having to check floats for equality can be a symptom of a bad design choice. You recognized this, because you mentioned that you were thinking of using datetime.date. (Which I agree, is overkill.)
The best way to go is to accept that the interpolation is not defined on the edges of its domain and to work this assumption into the design of the program. The exact solution then depends on what you want to do.
Did you consider using seconds or days instead of years? Maybe by using seconds, you can avoid querying your interpolation at the borders of its definition range? If you only use integer values of seconds, you can easily use them as keys in your dictionary.
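For illustration, a minimal sketch of that integer-time idea, using whole days as the assumed unit (seconds work the same way); exact integer keys make the dictionary lookup reliable:

DAYS_PER_YEAR = 365.25          # assumed conversion, for illustration only

def to_days(years):
    return int(round(years * DAYS_PER_YEAR))

values_by_day = {to_days(0.7): 124.3, to_days(1.2): 214.0}
print(values_by_day[to_days(6 / 5)])   # 214.0 -- the lookup cannot miss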
