ValueError: A value in x_new is below the interpolation range - python

This is a scikit-learn error that I get when I do
my_estimator = LassoLarsCV(fit_intercept=False, normalize=False, positive=True, max_n_alphas=1e5)
Note that if I decrease max_n_alphas from 1e5 down to 1e4 I do not get this error any more.
Does anyone have an idea of what's going on?
The error happens when I call
my_estimator.fit(x, y)
I have 40k data points in 40 dimensions.
The full stack trace looks like this
File "/usr/lib64/python2.7/site-packages/sklearn/linear_model/least_angle.py", line 1113, in fit
axis=0)(all_alphas)
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/polyint.py", line 79, in __call__
y = self._evaluate(x)
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/interpolate.py", line 498, in _evaluate
out_of_bounds = self._check_bounds(x_new)
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/interpolate.py", line 525, in _check_bounds
raise ValueError("A value in x_new is below the interpolation "
ValueError: A value in x_new is below the interpolation range.

There must be something particular to your data. LassoLarsCV() seems to be working correctly with this synthetic example of fairly well-behaved data:
import numpy
import sklearn.linear_model
# create 40000 x 40 sample data from linear model with a bit of noise
npoints = 40000
ndims = 40
numpy.random.seed(1)
X = numpy.random.random((npoints, ndims))
w = numpy.random.random(ndims)
y = X.dot(w) + numpy.random.random(npoints) * 0.1
clf = sklearn.linear_model.LassoLarsCV(fit_intercept=False, normalize=False, max_n_alphas=1e6)
clf.fit(X, y)
# coefficients are almost exactly recovered, this prints 0.00377
print max(abs( clf.coef_ - w ))
# alphas actually used are 41 or ndims+1
print clf.alphas_.shape
This is with sklearn 0.16; I don't have the positive=True option.
I'm not sure why you would want to use a very large max_n_alphas anyway. While I don't know why 1e+4 works and 1e+5 doesn't in your case, I suspect the paths you get from max_n_alphas=ndims+1 and max_n_alphas=1e+4 (or whatever) would be identical for well-behaved data. The optimal alpha estimated by cross-validation in clf.alpha_ is also going to be identical. Check out the Lasso path using LARS example for what alpha is trying to do.
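As a quick sanity check, here is a sketch (reusing X, y and ndims from the snippet above) that compares the cross-validated alpha for two very different max_n_alphas values; for well-behaved data the two should coincide:
clf_small = sklearn.linear_model.LassoLarsCV(fit_intercept=False, normalize=False,
                                             max_n_alphas=ndims + 1).fit(X, y)
clf_large = sklearn.linear_model.LassoLarsCV(fit_intercept=False, normalize=False,
                                             max_n_alphas=int(1e4)).fit(X, y)
# both cross-validated alphas should agree for well-behaved data
print(clf_small.alpha_)
print(clf_large.alpha_)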
Also, from the LassoLars documentation
alphas_ : array, shape (n_alphas + 1,)
Maximum of covariances (in absolute value) at each iteration. n_alphas is either max_iter, n_features, or the number of nodes in the path with correlation greater than alpha, whichever is smaller.
so it makes sense that we end up with alphas_ of size ndims+1 (i.e. n_features+1) above.
P.S. Tested with sklearn 0.17.1 and positive=True as well, also tested with some positive and negative coefficients, same result: alphas_ is ndims+1 or less.

Related

Least Squares method in practice

Very simple regression task. I have three variables x1, x2, x3 with some random noise, and I know the target equation: y = q1*x1 + q2*x2 + q3*x3. Now I want to find the target coefficients q1, q2, q3 and evaluate the performance of the prediction using the mean Relative Squared Error (RSE), (Prediction/Real - 1)^2.
From my research, I see that this is an ordinary Least Squares problem, but I can't figure out from examples on the internet how to solve this particular problem in Python. Let's say I have data:
import numpy as np
sourceData = np.random.rand(1000, 3)
koefs = np.array([1, 2, 3])
target = np.dot(sourceData, koefs)
(In real life the data are noisy and not normally distributed.) How can I find these koefs using a Least Squares approach in Python? Any library is fine.
@ayhan made a valuable comment.
And there is a problem with your code: there is actually no noise in the data you generate. The input data is random, but after the multiplication you don't add any additional noise to the measurements.
I've added some noise to your measurements and used the least squares formula to fit the parameters, here's my code:
data = np.random.rand(1000,3)
true_theta = np.array([1,2,3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
estimated_theta = np.linalg.inv(data.T @ data) @ data.T @ noisy_measurements
The estimated_theta will be close to true_theta. If you don't add noise to the measurements, they will be equal.
I've used the Python 3 matrix multiplication syntax (@).
You could use np.dot instead of @. That makes the code longer, so I've split the formula:
MTM_inv = np.linalg.inv(np.dot(data.T, data))
MTy = np.dot(data.T, noisy_measurements)
estimated_theta = np.dot(MTM_inv, MTy)
You can read up on least squares here: https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#The_general_problem
UPDATE:
Or you could just use the built-in least squares function:
np.linalg.lstsq(data, noisy_measurements)
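Note that np.linalg.lstsq returns a 4-tuple rather than just the solution, so in practice you would unpack it; a minimal sketch (newer NumPy versions also expect an explicit rcond, e.g. rcond=None):
estimated_theta, residuals, rank, singular_values = np.linalg.lstsq(data, noisy_measurements)
# estimated_theta is the least-squares solution; the other entries are diagnostics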
In addition to @lhk's answer, I have found the great scipy least_squares function. It is easy to get the requested behavior with it.
This way we can provide a custom function that returns residuals and form the Relative Squared Error instead of the absolute squared difference:
import numpy as np
from scipy.optimize import least_squares
data = np.random.rand(1000,3)
true_theta = np.array([1,2,3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
# noisy_measurements[-1] = data[-1] @ (1000 * true_theta)  # uncomment this outlier to see how much better the Relative Squared Error estimator works than the default abs diff for this case
def my_func(params, x, y):
    res = (x @ params) / y - 1  # if we change this line to (x @ params) - y, we will get the same result as np.linalg.lstsq
    return res
x0 = np.ones(3)  # initial guess for the three coefficients
res = least_squares(my_func, x0, args=(data, noisy_measurements))
estimated_theta = res.x
Also, we can provide a custom loss via the loss argument, a function that will process the residuals and form the final loss.
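For instance, one of the built-in robust losses can be selected through the same argument (a callable works too); a minimal sketch reusing my_func, x0, data and noisy_measurements from above (the soft_l1 loss and f_scale=0.1 are just illustrative choices):
res_robust = least_squares(my_func, x0, loss='soft_l1', f_scale=0.1,
                           args=(data, noisy_measurements))
estimated_theta_robust = res_robust.x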

Resistivity non linear fit in Python

I'm trying to fit the Einstein approximation of resistivity in a solid to a set of experimental data.
I have resistivity vs temperature (from 200 K down to 4 K):
import xlrd as xd
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
import scipy as sp
from scipy.optimize import curve_fit
#retrieve data from file
data = pl.loadtxt('salita.txt')
Temp = data[:, 1]
Res = data[:, 2]
#define fitting function
def einstein_func(T, ro0, AE, TE):
    nl = np.sinh(TE/(2*T))
    return ro0 + AE*nl*T
p0 = sp.array([1, 1, 1])
coeffs, cov = curve_fit(einstein_func, Temp, Res, p0)
But I get these warnings
crio.py:14: RuntimeWarning: divide by zero encountered in divide
nl = np.sinh(TE/(2*T))
crio.py:14: RuntimeWarning: overflow encountered in sinh
nl = np.sinh(TE/(2*T))
crio.py:15: RuntimeWarning: divide by zero encountered in divide
return ro0 + AE*np.sinh(TE/(2*T))*T
crio.py:15: RuntimeWarning: overflow encountered in sinh
return ro0 + AE*np.sinh(TE/(2*T))*T
crio.py:15: RuntimeWarning: invalid value encountered in multiply
return ro0 + AE*np.sinh(TE/(2*T))*T
Traceback (most recent call last):
File "crio.py", line 19, in <module>
coeffs, cov = curve_fit(einstein_func, Temp, Res, p0)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/optimize/minpack.py", line 511, in curve_fit
raise RuntimeError(msg)
RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 800.
I don't understand why it keeps saying that there is a divide by zero in sinh, since I have strictly positive values. Varying my starting guess has no effect on it.
EDIT: My dataset is organized like this:
4.39531E+0 1.16083E-7
4.39555E+0 -5.92258E-8
4.39554E+0 -3.79045E-8
4.39525E+0 -2.13213E-8
4.39619E+0 -4.02736E-8
4.43130E+0 -1.42142E-8
4.45900E+0 -2.60594E-8
4.46129E+0 -9.00232E-8
4.46181E+0 1.42142E-7
4.46195E+0 -2.13213E-8
4.46225E+0 4.26426E-8
4.46864E+0 -2.60594E-8
4.47628E+0 1.37404E-7
4.47747E+0 9.47612E-9
4.48008E+0 2.84284E-8
4.48795E+0 1.35035E-7
4.49804E+0 1.39773E-7
4.51151E+0 -1.75308E-7
4.54916E+0 -1.63463E-7
4.59176E+0 -2.36902E-9
where the first column is temperature and the second one is resistivity (the negative values are due to noise in the trial current, since the sample is a PbIn alloy which becomes superconducting at temperatures lower than 6.7-6.9 K; here we are at 4.5 K).
The arguments I'm providing to sinh are NumPy arrays, and with a linear function ro0 + AE*T my code works. I've tried scipy.optimize.minimize but the result is the same.
Now I see that I have almost nine hundred values in my file; might that be the problem?
I have edited my dataset removing some lines and now the only warning showing is
RuntimeWarning: overflow encountered in sinh
How can I work around it?
Here are a couple of observations that could help:
You could try the least-squares fit directly with leastsq, providing the Jacobian, which might help tame it.
I'm guessing you don't want the superconducting temperatures in your data set at all if you're fitting to an Einstein model (do you have a source for this eqn, btw?)
Do make sure your initial guesses are as good as they could possibly be (ro0=AE=TE=1 probably won't cut it); see the sketch after this list.
Plot your data and make sure there aren't any weird artefacts
You seem to be indexing your data array in the wrong way in your code example: if the data is structured as you say, you want:
Temp = data[:, 0]
Res = data[:, 1]
(Python indexes start at 0).
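Putting those observations together, here is a minimal sketch of how the fit might be set up (the 7 K cutoff for the superconducting region and the starting guesses are assumptions, not values from the question):
import numpy as np
from scipy.optimize import curve_fit

def einstein_func(T, ro0, AE, TE):
    # same model as in the question
    return ro0 + AE * T * np.sinh(TE / (2 * T))

data = np.loadtxt('salita.txt')
Temp, Res = data[:, 0], data[:, 1]      # temperature, resistivity
mask = Temp > 7.0                       # drop the superconducting points (assumed cutoff)
p0 = [Res[mask].min(), 1e-9, 50.0]      # rough, data-driven starting guesses (assumed)
coeffs, cov = curve_fit(einstein_func, Temp[mask], Res[mask], p0=p0)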

Scipy minimize, fmin, leastsq type problems (setting array element with sequence), bad fit

I'm having trouble with the scipy.optimize.fmin and scipy.optimize.minimize functions. I've checked and confirmed that all the arguments passed to the function are of type numpy.array, as well as the return value of the error function. Also, the carreau function returns a scalar value.
The reason for some of the extra arguments, such as size, is this: I need to fit data with a given model (Carreau). The data is taken at different temperatures, which are corrected with a shift factor (which is also fitted by the model), so I end up with several sets of data which should all be used to calculate the same 4 constants (the parameters p).
I read that I can't pass the fmin function a list of arrays, so I had to concatenate all data into x_data_lin, keeping track of the different sets with the size parameter. t holds different test temperatures, while t_0 is a one-element array which holds the reference temperature.
I am positive (triple checked) that all the arguments passed to the function, as well as the result, are one-dimensional arrays. Here's the code aside from that:
import numpy as np
import scipy.optimize
from scipy.optimize import fmin as simplex
def err_func2(p, x, y, t, t_0, size):
    result = np.array([])
    temp = 0
    for i in range(0, int(len(size)-1)):
        for j in range(int(temp), int(temp+size[i])):
            result = np.append(result, (carreau(p, x[j], t[i], t_0[0])-y[i]))
        temp += size[i]
    return result
p1 = simplex(err_func2, initial_guess,
args=(x_data_lin, y_data_lin, t_list, t_0, size), full_output=0)
Here's the error:
Traceback (most recent call last):
File "C:\Python27\Scripts\projects\Carreau - WLF\carreau_model_fit.py", line 146, in <module>
main()
File "C:\Python27\Scripts\projects\Carreau - WLF\carreau_model_fit.py", line 105, in main
args=(x_data_lin, y_data_lin, t_list, t_0, size), full_output=0)
File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 351, in fmin
res = _minimize_neldermead(func, x0, args, callback=callback, **opts)
File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 415, in _minimize_neldermead
fsim[0] = func(x0)
ValueError: setting an array element with a sequence.
It's worth noting that I got the leastsq function working while passing it lists of arrays. Unfortunately, it did a poor job of fitting the data. But, as it took me a lot of time and research to get to that point, I'll post the code as follows. If somebody is interested in seeing all of the code, I would gladly post it if you can recommend somewhere to upload a few files (as it includes another imported script and of course sample data):
##def error_function(p, x, y, t, t_0):
## result = array([])
## for index in range(len(x)):
## result = np.append(result,(carreau(p, x[index],
## t[index], t_0) - y[index]))
## return result
## p1, success = scipy.optimize.leastsq(error_function, initial_guess,
## args=(x_list, y_list, t_list, t_0),
## maxfev=10000)
:( I was going to post a picture of the graphed data with the leastsq fit, but I don't have the requisite 10 points.
Late Edit: I have now gotten optimize.curve_fit and optimize.leastsq to work (which probably not-so-coincidentally give the same answer), but the curve is bad. I've been trying to figure out optimize.minimize, but it's been a bit of a headache. The simplex (fmin, Nelder-Mead, whatever you want to call it) will run, but produces a crazy answer nowhere close. I've never worked with nonlinear optimization problems before, and I don't really know what direction to head.
Here's the working curve_fit code:
def temp_shift(t_s, t, t_0):
    """ This function calculates the a_t temperature shift factor for polymer
    viscosity curves. Variable is the standard temperature, t_s
    """
    C_1 = 8.86
    C_2 = 101.6
    return(np.exp(
        (C_1*(t_0-t_s) / (C_2+(t_0-t_s))) - (C_1*(t-t_s) / (C_2 + (t-t_s)))
        ))
def pass_data(t, t_0):
    def carreau_2(x, p0, p1, p2, p3):
        visc_0 = p0
        m = p1
        n = p2
        t_s = p3
        a_T = temp_shift(p3, t, t_0)
        return (visc_0 * a_T / (1 + m * x * a_T)**n)
    return carreau_2
initial_guess = np.array([20000, 3, 0.94, -20])
p1, conv = scipy.optimize.curve_fit(pass_data(t_all, t_0), x_data_lin,
                                    y_data_lin, initial_guess)
Here's some sample data:
x_data_lin = np.array([0.01998, 0.04304, 0.2004, 0.43160, 0.92870, 2.0000, 4.30900,
                       9.28500, 15.51954, 21.94936, 37.52960, 90.41786, 204.35230,
                       331.58495, 811.92250, 1694.55309, 3464.27648, 8826.65738,
                       14008.00242])
y_data_lin = np.array([13520.00000, 13740.00000, 12540.00000, 9384.00000, 5201,
                       3232.00000, 2094.00000, 1484.00000, 999.00000, 1162.05088,
                       942.56946, 705.62489, 429.47341, 254.15136, 185.22916,
                       122.07113, 76.46324, 47.85064, 25.74315, 18.84875])
t_all = np.array([190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190, 190,
                  190, 190, 190, 190, 190, 190, 190])
t_0 = 80
Here's a picture of the result of curve_fit (now that I have 10 points and can post!). Note there are 3 curves drawn because I used 3 sets of data to optimize the curve, at 3 different temperatures. Polymers have the property that the shear_rate - viscosity relationship stays the same, just shifted by a temperature factor a_T:
I'd really appreciate any suggestions about how to improve the fit, or how to define the function so that optimize.minimize works, and which method (Nelder-Mead, Powel, BFGS) might work.
Another edit to add: I got the Nelder-Mead (optimize.fmin, and the default of optimize.minimize) function to work - I'll include the revised error function below. Before, I simply summed the result array and returned it. This led to extremely negative values (obviously, since the function's goal is to minimize). Squaring the result before summing it solved that problem. Note that I also changed the function completely to take advantage of numpy's array broadcasting, as suggested by JaminSore (Thanks Jamin!)
def err_func2(p, x, y, t, t_0):
    return ((carreau(p, x, t, t_0)-y)**2).sum()
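A sketch of the corresponding call (this reuses carreau, initial_guess and the data arrays from the question above, so those names are assumptions about that code):
res = scipy.optimize.minimize(err_func2, initial_guess,
                              args=(x_data_lin, y_data_lin, t_all, t_0),
                              method='Nelder-Mead')
p1 = res.x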
Unfortunately, the Nelder-Mead function gives me the same result as leastsq and curve_fit. You can see in the graph above that it's not the optimal fit; in fact, at this point, Microsoft Excel's solver function is doing a better job on the data.
At least, I hope this thread can be useful for beginners to scipy.optimize in the future, since it's taken me quite awhile to discover all of this.
Unlike leastsq, fmin can only deal with error functions that return a scalar, so if possible you have to rewrite your error function so that it returns a scalar. Here is a simple working example.
Import the necessary libraries
import numpy as np
from scipy.optimize import fmin
Define a helper function (you'll see later)
def prob(a, b):
    return (1 + np.exp(b - a))**-1
Simulate some data
true_ = np.random.normal(size = 100) #parameters we're trying to recover
b = np.random.normal(size = 20)
exp_ = prob(true_[:, None], b) #expected
a_s, b_s = true_.shape[0], b.shape[0]
noise = np.random.uniform(size = (a_s, b_s))
response = (noise > (1 - exp_)).astype(int)
Define our error function (I'm using lambdas but this is not recommended in practice)
# sum of the squared residuals
err_func = lambda a : ((prob(a[:, None], b) - response) ** 2).sum()
result = fmin(err_func, np.zeros_like(true_)) #solve
If I remove the .sum() at the end of my error function definition, I get the same error.
OK, now I finally know the answer! First, the final piece, then a recap. The problem with the fit wasn't the fault of curve_fit, leastsq, Nelder-Mead, or Powell (the methods I've tried); it has to do with the relative weights of the errors. Since this data is on a log scale, errors in the fit near the high y values are very costly, while errors near the low y values are insignificant. To correct this, I made the error relative by dividing by the y value of the data, as follows:
def err_func2(p, x, y, t, t_0):
    return (((carreau(p, x, t, t_0)-y)/y)**2).sum()
Now, each relative error is squared, summed, then minimized, giving the following fit (using optimize.minimize with the Powell method, although it should be the same for the other methods as well.)
So now a recap of the answers found in this thread:
The easiest way (or at least for me, the most fool-proof) to deal with curve fitting is to collect all the data into 1D numpy.arrays. Then, you can rely on numpy's array broadcasting to perform all operations element-wise. For example, if array_1 = [a, b] and array_2 = [c, d], then array_1 + array_2 = [a+c, b+d]. This works for addition, subtraction, multiplication, division, and powers: array_1**array_2 = [a**c, b**d].
For the optimize.leastsq function, you need to let the objective function return an array; i.e. return result where result is an array. For optimize.curve_fit, you also return an array. In this case, it's a bit more complicated to pass extra arguments (think other constants), but you can do it with a nested function, as I demonstrated above in the pass_data function.
For optimize.minimize, you need to return a scalar - that is, a single number. You might also be able to return an array of answers, but I avoided this by getting all the data into 1D arrays, as I mentioned earlier. To get this scalar, you can simply square and sum the result (as I have written in this post under err_func2). Squaring the result is very important, otherwise negative errors take over and drive the resulting scalar extremely negative.
Finally, as mentioned, when your data crosses several scales (10**5, 10**4, 10**3, etc.), it may be necessary to normalize the errors. I did this by dividing each error by the y value.
So... I guess that's it? Finally?

scipy.optimize.curvefit() - array must not contain infs or NaNs

I am trying to fit some data to a curve in Python using scipy.optimize.curve_fit. I am running into the error ValueError: array must not contain infs or NaNs.
I don't believe either my x or y data contain infs or NaNs:
>>> x_array = np.asarray_chkfinite(x_array)
>>> y_array = np.asarray_chkfinite(y_array)
>>>
To give some idea of what my x_array and y_array look like at either end (x_array is counts and y_array is quantiles):
>>> type(x_array)
<type 'numpy.ndarray'>
>>> type(y_array)
<type 'numpy.ndarray'>
>>> x_array[:5]
array([0, 0, 0, 0, 0])
>>> x_array[-5:]
array([2919, 2965, 3154, 3218, 3461])
>>> y_array[:5]
array([ 0.9999582, 0.9999163, 0.9998745, 0.9998326, 0.9997908])
>>> y_array[-5:]
array([ 1.67399000e-04, 1.25549300e-04, 8.36995200e-05,
4.18497600e-05, -2.22044600e-16])
And my function:
>>> def func(x,alpha,beta,b):
...     return ((x/1)**(-alpha) * ((x+1*b)/(1+1*b))**(alpha-beta))
...
Which I am executing with:
>>> popt, pcov = curve_fit(func, x_array, y_array)
resulting in the error stack trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py", line 426, in curve_fit
res = leastsq(func, p0, args=args, full_output=1, **kw)
File "/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py", line 338, in leastsq
cov_x = inv(dot(transpose(R),R))
File "/usr/lib/python2.7/dist-packages/scipy/linalg/basic.py", line 285, in inv
a1 = asarray_chkfinite(a)
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 590, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
I'm guessing the error might not be with respect to my arrays, but rather to an array created by scipy in an intermediate step? I've had a bit of a dig through the relevant scipy source files, but things get hairy pretty quickly debugging the problem that way. Is there something obvious I'm doing wrong here? I've seen it casually mentioned in other questions that certain initial parameter guesses (of which I currently don't specify any explicitly) might result in these kinds of errors, but even if this is the case, it would be good to know a) why that is and b) how to avoid it.
Why it is failing
It is not your input arrays that contain nans or infs; rather, evaluating your objective function at some x points and for some values of the parameters produces nans or infs. In other words, the array of values func(x, alpha, beta, b) for some x, alpha, beta and b gives nans or infs during the optimization routine.
Scipy.optimize's curve fitting function uses the Levenberg-Marquardt algorithm, also called damped least-squares optimization. It is an iterative procedure, and a new estimate for the optimal function parameters is computed at each iteration. At some point during the optimization, the algorithm explores a region of parameter space where your function is not defined.
How to fix
1/ Initial guess
The initial guess for the parameters is decisive for convergence. If the initial guess is far from the optimal solution, you are more likely to explore regions where the objective function is undefined. So, if you have a better clue of what your optimal parameters are and feed your algorithm with this initial guess, the error might be avoided.
2/ Model
Alternatively, you could modify your model so that it does not return nans. For those parameter values params where the original function func is not defined, you want the objective function to take huge values, or in other words for func(params) to be far from the Y values to be fitted. At points where your objective function is not defined, you may return a big float, for instance AVG(Y)*10e5 with AVG the average (so that you are sure to be much bigger than the average of the Y values to be fitted).
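A minimal sketch of that idea applied to the question's model (the BIG constant is a stand-in for AVG(Y)*10e5, and the p0 in the commented call is illustrative):
import numpy as np
from scipy.optimize import curve_fit

BIG = 1e6   # stand-in for AVG(Y)*10e5; scale it to your own y values

def func_safe(x, alpha, beta, b):
    # same model as func in the question, but non-finite outputs are replaced
    # by a huge constant so the optimizer is pushed away from parameter
    # regions where the model is undefined
    with np.errstate(all='ignore'):
        vals = x ** (-alpha) * ((x + b) / (1.0 + b)) ** (alpha - beta)
    return np.where(np.isfinite(vals), vals, BIG)

# popt, pcov = curve_fit(func_safe, x_array, y_array, p0=(1.0, 2.0, 1.0))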
Link
You could have a look at this post: Fitting data to an equation in python vs gnuplot
Your function has a negative power (x^-alpha); this is the same as (1/x)^alpha. If x is ever 0, your function will return inf and your curve fit operation will break; I'm surprised a warning/error isn't thrown earlier informing you of a divide by 0.
BTW, why are you multiplying and dividing by 1?
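One straightforward workaround along those lines (a sketch; it assumes the leading zero counts carry no information, reuses func, x_array and y_array from the question, and the p0 values are purely illustrative):
from scipy.optimize import curve_fit

mask = x_array > 0                      # x**(-alpha) is undefined at x = 0
popt, pcov = curve_fit(func, x_array[mask], y_array[mask], p0=(1.0, 2.0, 1.0))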
I was able to reproduce this error in python2.7 like so:
from sklearn.decomposition import FastICA
X = load_data.load("stuff") #this sets X to a 2d numpy array containing
#large positive and negative numbers.
ica = FastICA(whiten=False)
print(np.isnan(X).any()) #this prints False
print(np.isinf(X).any()) #this prints False
ica.fit(X) #this produces the error:
Which always produces the Error:
/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py:58: RuntimeWarning: invalid value encountered in sqrt
return np.dot(np.dot(u * (1. / np.sqrt(s)), u.T), W)
Traceback (most recent call last):
File "main.py", line 43, in <module>
ica()
File "main.py", line 18, in ica
ica.fit(X)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 523, in fit
self._fit(X, compute_sources=False)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 479, in _fit
compute_sources=compute_sources, return_n_iter=True)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 335, in fastica
W, n_iter = _ica_par(X1, **kwargs)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 108, in _ica_par
- g_wtx[:, np.newaxis] * W)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 55, in _sym_decorrelation
s, u = linalg.eigh(np.dot(W, W.T))
File "/usr/lib64/python2.7/site-packages/scipy/linalg/decomp.py", line 297, in eigh
a1 = asarray_chkfinite(a)
File "/usr/lib64/python2.7/site-packages/numpy/lib/function_base.py", line 613, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
Solution:
from sklearn.decomposition import FastICA
X = load_data.load("stuff") #this sets X to a 2d numpy array containing
#large positive and negative numbers.
ica = FastICA(whiten=False)
#this is a column wise normalization function which flattens the
#two dimensional array from very large and very small numbers to
#reasonably sized numbers between roughly -1 and 1
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
print(np.isnan(X).any()) #this prints False
print(np.isinf(X).any()) #this prints False
ica.fit(X) #this works correctly.
Why does that normalization step fix the error?
I found the eureka moment here: sklearn's PLSRegression: "ValueError: array must not contain infs or NaNs"
What I think is happening is that numpy is being fed gigantic numbers and very tiny numbers, and inside its tiny brain it's creating NaNs and Infs. So it's a bug in sklearn. The workaround is to rescale your input data to the algorithm so that there are no very large or very small numbers.
Bad sklearn! No biscuit!

Fitting a gamma distribution with (python) Scipy

Can anyone help me out with fitting a gamma distribution in Python? I've got some data (X and Y coordinates), and I want to find the gamma parameters that fit this distribution. The SciPy docs show that a fit method exists, but I don't know how to use it. First, in what format must the "data" argument be, and how can I provide the second argument (the parameters), since that's what I'm looking for?
Generate some gamma data:
import scipy.stats as stats
alpha = 5
loc = 100.5
beta = 22
data = stats.gamma.rvs(alpha, loc=loc, scale=beta, size=10000)
print(data)
# [ 202.36035683 297.23906376 249.53831795 ..., 271.85204096 180.75026301
# 364.60240242]
Here we fit the data to the gamma distribution:
fit_alpha, fit_loc, fit_beta=stats.gamma.fit(data)
print(fit_alpha, fit_loc, fit_beta)
# (5.0833692504230008, 100.08697963283467, 21.739518937816108)
print(alpha, loc, beta)
# (5, 100.5, 22)
I was unsatisfied with the ss.gamma.rvs function, as it can generate negative numbers, something the gamma distribution is not supposed to have. So I fitted the sample through expected value = mean(data) and variance = var(data) (see Wikipedia for details) and wrote a function that can yield random samples of a gamma distribution without scipy (which, on a side note, I found hard to install properly):
import random
import numpy
data = [6176, 11046, 670, 6146, 7945, 6864, 767, 7623, 7212, 9040, 3213, 6302, 10044, 10195, 9386, 7230, 4602, 6282, 8619, 7903, 6318, 13294, 6990, 5515, 9157]
# Fit gamma distribution through mean and variance
mean_of_distribution = numpy.mean(data)
variance_of_distribution = numpy.var(data)
def gamma_random_sample(mean, variance, size):
    """Yields a list of random numbers following a gamma distribution defined by mean and variance"""
    g_alpha = mean*mean/variance
    g_beta = mean/variance
    for i in range(size):
        yield random.gammavariate(g_alpha, 1/g_beta)
# force integer values to get integer sample
grs = [int(i) for i in gamma_random_sample(mean_of_distribution,variance_of_distribution,len(data))]
print("Original data: ", sorted(data))
print("Random sample: ", sorted(grs))
# Original data: [670, 767, 3213, 4602, 5515, 6146, 6176, 6282, 6302, 6318, 6864, 6990, 7212, 7230, 7623, 7903, 7945, 8619, 9040, 9157, 9386, 10044, 10195, 11046, 13294]
# Random sample: [1646, 2237, 3178, 3227, 3649, 4049, 4171, 5071, 5118, 5139, 5456, 6139, 6468, 6726, 6944, 7050, 7135, 7588, 7597, 7971, 10269, 10563, 12283, 12339, 13066]
If you want a long example including a discussion about estimating or fixing the support of the distribution, then you can find it in https://github.com/scipy/scipy/issues/1359 and the linked mailing list message.
Preliminary support to fix parameters, such as location, during fit has been added to the trunk version of scipy.
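For example, in recent scipy versions you can fix the location during the fit with the floc keyword; a minimal sketch reusing data from the first answer:
fit_alpha, fit_loc, fit_beta = stats.gamma.fit(data, floc=0)
# fit_loc is now exactly 0; only shape and scale are estimated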
OpenTURNS has a simple way to do this with the GammaFactory class.
First, let's generate a sample:
import openturns as ot
gammaDistribution = ot.Gamma()
sample = gammaDistribution.getSample(100)
Then fit a Gamma to it:
distribution = ot.GammaFactory().build(sample)
Then we can draw the PDF of the Gamma:
import openturns.viewer as otv
otv.View(distribution.drawPDF())
which produces a plot of the fitted Gamma PDF.
More details on this topic at: http://openturns.github.io/openturns/latest/user_manual/_generated/openturns.GammaFactory.html
1): the "data" variable could be in the format of a python list or tuple, or a numpy.ndarray, which could be obtained by using:
data=numpy.array(data)
where the 2nd data in the above line should be a list or a tuple, containing your data.
2: the "parameter" variable is a first guess you could optionally provide to the fitting function as a starting point for the fitting process, so it could be omitted.
3) A note on @mondano's answer. Using moments (mean and variance) to work out the gamma parameters is reasonably good for large shape parameters (alpha>10), but can yield poor results for small values of alpha (see Statistical Methods in the Atmospheric Sciences by Wilks, and Thom, H. C. S., 1958: A note on the gamma distribution. Mon. Wea. Rev., 86, 117–122).
Using Maximum Likelihood Estimation, as implemented in the scipy module, is regarded as a better choice in such cases; see the comparison sketch below.
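A small comparison sketch (assuming data is the integer sample from @mondano's answer above):
import numpy
import scipy.stats as stats
m, v = numpy.mean(data), numpy.var(data)
alpha_mom, scale_mom = m * m / v, v / m                         # method-of-moments estimates
alpha_mle, loc_mle, scale_mle = stats.gamma.fit(data, floc=0)   # maximum likelihood fit
print(alpha_mom, scale_mom)
print(alpha_mle, scale_mle)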
