Avoid interpolation problems with numpy float - python

I need to handle timespans in a library I am creating. My first idea was to keep it simple and encode them as years, as floats.
The problems arise, for instance, when I wish to perform interpolations. Say I have
xs = np.array([0, 0.7, 1.2, 3.0]) # times
ys = np.array([np.nan, 124.3, 214.0, np.nan]) # values associated
Outside the [0.7, 1.2] interval I would like to get np.nan, but inside it the obvious linear interpolation, including at the endpoints themselves.
However, using
#!/usr/bin/python3.5
import numpy as np
from fractions import Fraction
import scipy.interpolate as scInt

if __name__ == "__main__":
    xs = np.array([0, 0.7, 1.2, 3.0])              # times
    ys = np.array([np.nan, 124.3, 214.0, np.nan])  # values associated
    interp = scInt.interp1d(xs, ys)
    xsInt = np.array([0, 7/10, 6/5 - 0.0001, 6/5, 6/5 + 0.0001])
    print(interp(xsInt))
I get
[nan, 124.3, 213.98206, nan, nan]
So, the correct value for 7/10, but a nan for 6/5, which is 1.2. There is no mystery here: machine representation of floats can cause things like this. But either way, it is an issue I need to deal with.
My first workaround was to double the knots in xs, so that I would interpolate on
[x1-eps, x1+eps, x2-eps, x2+eps, ..., xn-eps, xn+eps], repeating each entry of the ys vector twice:
[y1, y1, y2, y2, y3, y3, ..., yn, yn]. This works, but it is quite ugly.
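For concreteness, a minimal sketch of that epsilon-doubling (eps, the restriction to the two non-nan knots, and the use of fill_value are choices made just for this illustration):

import numpy as np
import scipy.interpolate as scInt

eps = 1e-9                        # tolerance chosen for this illustration
xs = np.array([0.7, 1.2])         # the non-nan knots from the question
ys = np.array([124.3, 214.0])

# widen every knot into [x - eps, x + eps] and repeat the corresponding y
xs_wide = np.repeat(xs, 2) + np.tile([-eps, eps], len(xs))
ys_wide = np.repeat(ys, 2)

interp = scInt.interp1d(xs_wide, ys_wide, bounds_error=False, fill_value=np.nan)
print(interp([0.7, 1.2, 1.2 + 2 * eps]))   # [124.3 214.  nan]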
Then I thought I would use fractions.Fraction instead, but NumPy complained that "object arrays are not supported".
A pity, this seemed the way to go, although surely there would be a loss of performance.
There is another side to this problem: it would be nice to be able to create dictionaries whose keys are times of the same kind, and I fear that when I look something up with a float key, some searches would fail due to the same issue.
My last idea was to use dates, like datetime.date, but I am not too happy with it because of the ambiguity when converting the difference between dates to year fractions.
What would be the best approach for this, is there a nice solution?

I think there is just no easy way out of this.
Floats are fundamentally not suitable to be checked for equality, and by evaluating your interpolation on the edges of its domain (or using floats as keys in dictionaries), you are doing exactly this.
Your solution using epsilons is a bit hacky, but honestly there probably is no more elegant way of working around this problem.
In general, having to check floats for equality can be a symptom of a bad design choice. You recognized this yourself, because you mentioned that you were thinking of using datetime.date (which, I agree, is overkill).
The best way to go is to accept that the interpolation is not defined on the edges of its domain and to work this assumption into the design of the program. The exact solution then depends on what you want to do.
Did you consider using seconds or days instead of years? Maybe by using seconds, you can avoid querying your interpolation at the borders of its definition range? If you only use integer values of seconds, you can easily use them as keys in your dictionary.
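For instance, a minimal sketch of the integer-key idea (the 365-day convention and the helper name are assumptions made for illustration, not something from the question):

# Sketch only: store times as integer days and use those as exact dict keys.
def to_days(years, days_per_year=365):   # days_per_year is an assumed convention
    return round(years * days_per_year)

values = {}
values[to_days(0.7)] = 124.3     # keys are plain ints
values[to_days(1.2)] = 214.0
print(values[to_days(0.7)])      # exact integer lookup, no float equality involved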

Related

Constraints on search space - python - scikit

I'm searching for minimum in 100d space. I'm using gp_minimize from skopt (python 3.6).
space = [(0., 1.) for _ in range(100)]
res = gp_minimize(f, space)
However, I also have a constraint that value in each subsequent dimension is not larger than in the previous dimensions. For example for the case of 5d, point [1, 0.9, 0.9, 0.8, 0.7] is ok, while point [1, 0.3, 0.5, 0.4, 0.2] is not.
How to add this constraint using skopt?
The best way I found is to modify the function f. Choose an upper bound for f, and everywhere in the domain where f is not supposed to be evaluated, have it return this upper bound.
It is straightforward to see that this is a mathematically sound approach, as it doesn't change the minimum and constrains your search space in roughly the same way a Lagrange-multiplier-style penalty would. However, I don't know if it plays nicely with the algorithm, since I don't really know how Bayesian optimization handles wide plateaus.
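A minimal sketch of that wrapper (the upper bound is assumed, and the quadratic f is only a stand-in for your real 100-d objective):

import numpy as np
from skopt import gp_minimize

UPPER_BOUND = 1e3   # assumed bound on f over the feasible region

def f(x):           # stand-in for the real objective from the question
    return float(np.sum(np.square(x)))

def f_constrained(x):
    x = np.asarray(x)
    if np.any(np.diff(x) > 0):      # a later coordinate exceeds an earlier one
        return UPPER_BOUND          # penalise infeasible points
    return f(x)

space = [(0., 1.) for _ in range(100)]
res = gp_minimize(f_constrained, space, n_calls=30)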

vectorized interpolation on array with nans

I am trying to interpolate an image cube with shape (dim_frequ, dim_spaxel1, dim_spaxel2) along the frequency axis. The aim is to oversample the frequency space. The array may contain nans. It would, of course, be possible to run two for loops over the array, but that's definitely too slow.
What I want in pseudo code:
import numpy as np
from scipy.interpolate import interp1d
dim_frequ, dim_spaxel1, dim_spaxel2 = 2559, 70, 70
cube = np.random.rand(dim_frequ, dim_spaxel1, dim_spaxel2)
cube.ravel()[np.random.choice(cube.size, 1000, replace=False)] = np.nan
wavelength = np.arange(1.31, 2.5894999999, 5e-4) # step chosen so that len(wavelength) == dim_frequ
wavelength_over = np.arange(1.31, 2.5894999999, 5e-5)
cube_over = interp1d(wavelength, cube, axis=0, kind='quadratic', fill_value="extrapolate")(wavelength_over)
cube_over[np.isnan(cube_over)] # array([], dtype=float64)
I've tried np.interp which can only handle 1D data (?)
I've tried scipy.interpolate.interp1d which can in principle handle
arrays along a given axis, but returns nans (I assume because of the
nans in the array)
This actually works when kind='linear'. I'd like something a bit fancier though; as soon as I set kind to 'quadratic' it returns nans.
I've tried the scipy.interpolate.CubicSpline
which raises a ValueError again because of the nans.
Any ideas what else to try? I am quite free in terms of the type of the interpolation, but it shouldn't be too fancy, i.e. nothing crazier than a spline or a low-order polynomial.
So a couple of things.
First
This line returns an empty array because cube_over has no nan in it after the interpolation:
cube_over[np.isnan(cube_over)]
Since np.isnan(cube_over) is all False, the boolean mask selects nothing.
Otherwise it appears to be interpolating everything in the wavelength_over array.
Second
scipy doesn't like nans (see the docs). Typical practice is to drop the nans from your set of points before interpolating, since they will not add any value to the interpolation function.
It does appear to work with your interp1d example above, though; I am guessing it drops them along the axis when it builds the interpolation function, but I am not sure.
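A 1-D sketch of that drop-the-NaNs practice (for the cube you would have to do this per (spaxel1, spaxel2) column, because the NaN positions differ from column to column; the sine data is just for illustration):

import numpy as np
from scipy.interpolate import interp1d

x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x)
y[[3, 11]] = np.nan              # pretend a couple of samples are missing

good = ~np.isnan(y)              # keep only the valid points
f = interp1d(x[good], y[good], kind='quadratic', fill_value='extrapolate')
y_over = f(np.linspace(0.0, 1.0, 200))
print(np.isnan(y_over).any())    # False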
Third
What value do you actually want to interpolate? I am not sure what your desired output / endpoint is. It appears that your code is working more or less as expected when you interpolate at the wavelength_over array, seeing as its values are so similar to (if not the same as) those of the wavelength array. I think you might benefit from a 2d interpolation method, but again I do not have a good understanding of your goal.
See 2d interpolation options in scipy docs
Hope this helps.

Is there a function in tensorflow for doing transformations that are functions of the indices?

I'm looking for (but have been completely unable to find) a function in tensorflow that will allow me to do a 'map' on a tensor.
map
Firstly, I'm not even sure if there is a 'map' function? By this I mean something that lets me apply a given f(x) to every entry in a tensor, e.g. I want something like this:
def f(x):
    return x**2

X = tf.Variable(np.array([[1.0, 2.0],
                          [3.0, 4.0]]))
Y = tf.map_function(X, f)
producing (after suitably running in a session, obviously) a tensor with values
Y = [[1.0,  4.0],
     [9.0, 16.0]]
Does this exist for general f? (I realise that specific functions like tf.nn.relu and tf.nn.sigmoid are available.) On one hand, it seems like it should, since map is a pretty fundamental operation. On the other hand, it would involve taking the supplied python function and somehow converting it to be executed on the GPU, and that sounds like something that might not be possible.
Am I asking for the moon on a stick here?
mapi
If such a function exists, is there a version that allows me to use an index-aware f? e.g.
def f(x, i):
    if i != [0, 0]:
        k2 = np.sum([n**2 for n in i])
    else:
        k2 = 1.0  # To avoid division by zero
    return x / k2

X = tf.Variable(np.ones(shape=(2, 3)))
Y = tf.mapi_function(X, f)
producing
Y = [[1.0, 1.0, 0.25],
     [1.0, 0.5, 0.2]]
If such functions don't exist, would it be possible (for fixed f) for me to add them by building tensorflow from (slightly modified) source?
Why I need such a function
The reason I'm asking this is that I'm trying to use tensorflow to numerically integrate a PDE. As part of that I need to compute the Laplacian (d^2/dx^2 + d^2/dy^2 + d^2/dz^2) u(x,y,z). In a Fourier-transformed representation of the field u(k_x, k_y, k_z) this involves dividing by k_x^2 + k_y^2 + k_z^2.
I could precompute a tensor of inverse squared wavenumber values and do an element-wise multiply. But this would use up a lot of memory, and I suspect it would also be slower to load those values from memory.
In your specific example of wanting to map individually to each of the x,y,z coordinates, you can accomplish this readily with tf.split() and tf.stack(). That is, I presume you have an input tensor (call it K) that is of size [n,m,...,3]; that is, where the last dimension indexes the x,y,z coordinates. If so, then use tf.split() to break up K into Kx,Ky,Kz. Then apply your map operation (I use tf.map_fn() for this purpose typically), and then finally stack things back together with tf.stack().
If I understand the setup correctly that should do it. If not, please provide a minimal working example that will make the problem concrete; otherwise we are at best guessing at a solution.
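An illustrative sketch of that split / map / stack pattern (the tensor K and its shape [n, m, 3] are taken from the answer's assumptions, and the random data is just a placeholder):

import numpy as np
import tensorflow as tf

K = tf.constant(np.random.rand(4, 5, 3), dtype=tf.float32)

# split the last axis into k_x, k_y, k_z and drop the size-1 dimension
kx, ky, kz = [tf.squeeze(t, axis=-1) for t in tf.split(K, 3, axis=-1)]

# tf.map_fn applies a function to the tensor unstacked along its first axis;
# for a purely element-wise f like squaring, plain tensor ops would also do.
kx2 = tf.map_fn(lambda row: row ** 2, kx)
ky2 = tf.map_fn(lambda row: row ** 2, ky)
kz2 = tf.map_fn(lambda row: row ** 2, kz)

k_squared = tf.stack([kx2, ky2, kz2], axis=-1)   # back to shape [n, m, 3]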

How to generate a Q-Q plot manually without inverse distribution function in python

I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (q-q) plot.
The formulas for my 4 distributions are:
where K0 is the modified Bessel function of the second kind and zeroth order, and Γ is the gamma function.
My sample style looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7 ...), so I have multiple identical values and also gaps in between them.
I've read the instructions on this site and tried to implement them in python. So, like in the link:
1) I sorted my data from the smallest to the largest value.
2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.
3) And this is the point I can't manage.
As far as I understand, I should now use the values I calculated beforehand (those evenly spaced values), put them in the inverse functions of my above distributions and thus compute the theoretical quantiles of my distributions.
For reference, here are the inverse functions (partly calculated with wolframalpha, as far as that was possible):
where W is the Lambert W-function and everything in brackets afterwards is the argument.
The problem is, apparently there doesn't exist an inverse function for the first distribution. The next one would probably produce complex values (negative under the root, because b = 0.55 according to the fit), and the last two of them involve a Lambert W-function (where I'm unsure how to implement it in python).
So my question is, is there a way to calculate the q-q plots without the analytical expressions of the inverse distribution functions?
I'd appreciate any help you could give me very much!
A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose that one that has the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you have already.
The log likelihood is just the sum of log p(x|model) where p(x|model) is the probability density of datum x under a given model. Here "model" = model with parameters selected by maximizing the log likelihood over the possible values of the parameters.
You can be more careful about this by integrating the log likelihood over the parameter space, taking into account also any prior probability assigned to each model; that would be a Bayesian approach.
It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which despite its heavy name is pretty simple: it is the difference between the proposed quantile function and the empirical quantiles. That's defensible, but I think comparing log likelihoods is more conventional, and also simpler since you need only the pdf.
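A minimal sketch of that log-likelihood comparison (pdf, params, models and sample are placeholder names of mine, not anything from the question):

import numpy as np

# Each candidate model is assumed to be available as a density pdf(x, *params)
# with parameters already fitted to the sample.
def log_likelihood(sample, pdf, params):
    return np.sum(np.log(pdf(np.asarray(sample), *params)))

# e.g.
# scores = {name: log_likelihood(sample, pdf, params)
#           for name, (pdf, params) in models.items()}
# best_model = max(scores, key=scores.get)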
It happens that there is an easier way. It's taken me a day or two to dig around until I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!
First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed just add them to the def statement and use them in the return statement as required.
>>> from scipy import stats
>>> param = 3/2
>>> from math import exp
>>> class NoName(stats.rv_continuous):
...     def _pdf(self, x, param):
...         return param*exp(-param*x)
...
Now create an instance of this object, declare the lower end of its support (ie, the lowest value that the r.v. can assume), and what the parameters are called.
>>> noname = NoName(a=0, shapes='param')
I don't have an actual sample of values to play with. I'll create a pseudo-random sample.
>>> sample = noname.rvs(size=100, param=param)
Sort it to make it into the so-called 'empirical cdf'.
>>> empirical_cdf = sorted(sample)
The sample has 100 elements, therefore generate 100 points at which to sample the inverse cdf, or quantile function, as discussed in the paper you referenced.
>>> theoretical_points = [(_-0.5)/len(sample) for _ in range(1, 1+len(sample))]
Get the quantile function values at these points.
>>> theoretical_cdf = [noname.ppf(_, param=param) for _ in theoretical_points]
Plot it all.
>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_cdf, theoretical_cdf)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
>>> plt.show()
Here's the Q-Q plot that results.
Darn it ... Sorry, I was fixated on a slick solution to somehow bypass the missing inverse CDF and calculate the quantiles directly (and avoid any numerical approaches). But it can also be done by simple brute force.
First you have to define a fine set of quantiles for your distributions yourself (for instance ten times more finely spaced than the original/empirical quantiles). Then you need to calculate the corresponding CDF values. Then you have to compare these values one by one with the ones which were calculated in step 2 in the question. The quantiles whose CDF values show the smallest deviations are the ones you were looking for.
The precision of this solution is limited by the resolution of the quantiles you defined yourself.
But maybe I'm wrong and there is a more elegant way to solve this problem; if so, I would be happy to hear it!
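A numpy sketch of that brute-force inversion (pdf, x_max and n_grid are placeholders / assumptions of mine; the support is assumed to start at 0):

import numpy as np

def quantiles_by_grid(pdf, probs, x_max=20.0, n_grid=100000):
    xs = np.linspace(0.0, x_max, n_grid)            # fine grid of candidate quantiles
    cdf = np.cumsum(pdf(xs)) * (xs[1] - xs[0])      # crude cumulative integral of the pdf
    cdf /= cdf[-1]                                  # renormalise away the tabulation error
    idx = np.searchsorted(cdf, probs)               # nearest tabulated cdf value from the right
    return xs[np.clip(idx, 0, n_grid - 1)]

# probs would be the evenly spaced points from step 2) of the question.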

Distributions and p-values in python

I have a big list of numbers, and I would like to create a distribution out of this data, plot it, then find the p-value for every number in my list with regards to the distribution.
Is it possible to do this in python? I can't find it in the matplotlib documentation. Should I be using something else?
I would suggest to look into the stats module of scipy; it offers numerous statistical functions for things like this. For plotting, I would still use matplotlib.
You can use the searchsorted function from the numpy module, which will give you the order of a set of values in an ordered array. You can then transform it to a p-value just by renormalizing it to the length of the original array:
import numpy as np

data = np.sort(np.random.rand(10))
new_data = np.random.rand(5)
pvals = np.searchsorted(data, new_data) / len(data)
print(pvals)  # five values between 0 and 1, one per element of new_data
Well, in fact if you want the p-values of the original numbers you don't need any special function at all: the p-values are just the order in the sorted dataset divided by its length.
If you need the p-values of new values with respect to your original ones, you can use the snippet I gave you above.
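For completeness, a tiny sketch of that rank-based claim (the random sample is just for illustration):

import numpy as np

# For the original data itself, the p-values are just the ranks in the
# sorted sample divided by the sample size.
data = np.sort(np.random.rand(10))
pvals = np.arange(len(data)) / len(data)
print(pvals)   # [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]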
