I was doing a normality test in Python with Spark MLlib and saw what I think is a bug.
Here is the setup: I have a dataset that is normalized to the range [-1, 1].
When I do a histogram, I can clearly see that the data is NOT normal:
>>> prices_norm.histogram(10)
([-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
[226, 269, 119, 95, 52, 26, 8, 2, 2, 5])
When I run the Kolmogorov-Smirnov test I get the following results:
>>> testResults = Statistics.kolmogorovSmirnovTest(prices_norm, "norm")
>>> print testResults
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.46231145770077375
pValue = 1.742039845709087E-11
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
The Kolmogorov-Smirnov test defines the null hypothesis (H0) as: the data follows a specified distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm).
In this case the p-value is very low, so we should reject the null hypothesis. This makes sense, as the data is clearly not normal.
So why then, does it say:
Sample follows theoretical distribution
Isn't this wrong? Shouldn't it say that the sample does NOT follow a theoretical distribution? Am I missing something?
This was driving me crazy, so I went to look at the source code directly:
git://git.apache.org/spark.git
spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test/KolmogorovSmirnovTest.scala
The code is correct; the null hypothesis is set as:
object NullHypothesis extends Enumeration {
  type NullHypothesis = Value
  val OneSampleTwoSided = Value("Sample follows theoretical distribution")
}
The verbiage of the string message is just restating the null hypothesis:
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
(the text after the colon is the statement of H0 itself)
Arguably the verbiage is confusing, since it could be read either way, but it is indeed correct.
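As a quick sanity check outside Spark, here is a minimal sketch using scipy.stats.kstest, which tests the same null hypothesis; the data here is made up for illustration:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Clearly non-normal data (uniform on [-1, 1]), standing in for prices_norm.
skewed = rng.uniform(-1, 1, size=800)
stat, pvalue = stats.kstest(skewed, "norm")
print(stat, pvalue)  # tiny p-value -> reject H0 ("sample follows the normal distribution")

# For comparison, data that really is standard normal gives a large p-value,
# i.e. no evidence against that same null hypothesis.
normal = rng.normal(size=800)
print(stats.kstest(normal, "norm"))
Either way the reported null hypothesis reads "sample follows the distribution"; only the p-value tells you whether to reject it.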
How do you use the new Polynomials sub-package in numpy to give it new x values and get an output of y values?
https://numpy.org/doc/stable/reference/routines.polynomials.package.html
In prior versions of numpy it went something like this:
poly = np.poly1d(np.polyfit(x, y, 3))
new_x = np.linspace(0, 100)
new_y = poly(new_x)
With the new version I am struggling to give it x values and get the corresponding y values back:
from numpy.polynomial import Polynomial
poly = Polynomial(Polynomial.fit(x, y, 3))
When I give it an array of x it just returns the coefficients.
You can directly call the resulting series to evaluate it:
from numpy.polynomial import Polynomial
poly = Polynomial.fit(x, y, 3)
new_y = poly(new_x)
Check this page of the documentation; it has several examples.
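For example, here is a self-contained sketch with made-up data (the x, y, and new_x below are placeholders, not the asker's actual arrays):
import numpy as np
from numpy.polynomial import Polynomial

# Made-up data roughly following a cubic, with some noise.
x = np.linspace(0, 100, 50)
y = 0.001 * x**3 - 0.05 * x**2 + x + np.random.normal(scale=5.0, size=x.size)

poly = Polynomial.fit(x, y, 3)   # fit a degree-3 polynomial

new_x = np.linspace(0, 100)      # x values to evaluate at
new_y = poly(new_x)              # calling the series evaluates it
print(new_y[:5])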
Unfortunately, the answer by @Joan Charmant and the supportive comment by @rh109019 do not work.
The intuitive way suggested by @Joan Charmant is, basically, what the question is about: it doesn't work.
Evidently, there is a new method introduced in numpy.polynomial.polynomial devoted specifically to evaluating polynomials. See here.
Here's my code where I'm comparing the two approaches.
import numpy as np
Pgauge = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
NIST = np.asarray([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1])
calibrationCurve = np.polynomial.polynomial.Polynomial.fit(Pgauge, NIST, deg=1)
print("The polynomial: {}".format(calibrationCurve))
x = np.asarray([0, 1]) # values of x to evaluate the polynomial at
c = calibrationCurve.coef # coefficients of the polynomial
print("The intuitive (wrong) way: {}".format(calibrationCurve(x)))
print("The correct way: {}".format(np.polynomial.polynomial.polyval(x, c)))
The first print command prints out the polynomial: 4.6 + 3.5x.
If we want to evaluate it at the points 0 and 1 (x = np.asarray([0, 1])), we expect to get 4.6 and 8.1 respectively.
The second print command (the one labeled "The intuitive (wrong) way") uses the method suggested by @Joan Charmant. It gives [0.1, 1.1] as the result, which is wrong. It looks plausible at first glance, since it returns two numbers as expected, but the numbers themselves are wrong, and I don't know how they were calculated. With a bigger series of data I wouldn't have gone through it with a calculator; I would simply have assumed the result was correct.
The last print command makes use of the polyval method suggested in the user manual cited above, and it works perfectly well: it gives [4.6, 8.1] as the result.
It so happens that my answer is wrong as well (see all the comments below by @user2357112 supports Monica).
But still, I'll leave it here for the folks who, like me, fell victim to the confusing new numpy.polynomial library.
FIRST: why is my code wrong?
Everything is actually fine with it, but the line print("The polynomial: {}".format(calibrationCurve)) doesn't give me what I think it should. It takes the correct polynomial, rescales its coefficients (fit() works internally on a shifted and scaled x), and prints out a polynomial with those changed coefficients. Still, it does store the correct polynomial in its memory, and the approach suggested by @Joan Charmant may give you the correct answer if you ask for it properly.
SECOND: how to use the new numpy.polynomial library in order to get a correct result?
Due to that peculiarity, you have to introduce a new line of code. Namely, do the Polynomial.fit() and immediately afterwards use the .convert() method. Then work with the converted polynomial only.
Here's my code that works correctly now.
import numpy as np
Pgauge = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
NIST = np.asarray([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1])
calibrationCurveMessedUp = np.polynomial.polynomial.Polynomial.fit(Pgauge, NIST, deg=1)
calibrationCurve = calibrationCurveMessedUp.convert()
print("The polynomial: {}".format(calibrationCurve))
print("The rounded polynomial coefficients: {}".format(calibrationCurve.coef))
x = np.asarray([0, 1]) # values of x to evaluate the polynomial at
print(calibrationCurve(x))
THIRD: a little note.
Apparently, there is a possibility to get the correct polynomial without the additional line of code. Probably you have to give the correct window and domain parameters to the Polynomial.fit() function, or maybe there is another way.
If anybody knows such a way, you're welcome to edit my current answer and add your code.
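For what it's worth, here is a sketch of one such way that seems to work for me (treat it as an assumption to verify on your numpy version): passing window equal to the data's own span means fit() does no internal rescaling, so the printed coefficients are already in the original x units.
import numpy as np
from numpy.polynomial import Polynomial

Pgauge = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
NIST = np.asarray([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1])

# With window set to the same interval as the default domain (the span of the
# data), the internal domain -> window mapping is the identity, so the stored
# coefficients are not rescaled.
span = (Pgauge.min(), Pgauge.max())
calibrationCurve = Polynomial.fit(Pgauge, NIST, deg=1, window=span)

print(calibrationCurve)                      # coefficients in the original x units
print(calibrationCurve.coef)
print(calibrationCurve(np.asarray([0, 1])))  # evaluate at x = 0 and x = 1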
I'm trying to fit a sigmoid curve onto a small set of points, basically generating a probability curve from a set of observations. I'm using scipy.optimize.curve_fit, with a slightly modified logistic function (so as to be bound completely within [0,1]). Currently I have had the greatest success with the dogbox method, and an exact tr_solver.
When I attempt to run the code, for certain data points it will raise:
ValueError: `x0` violates bound constraints.
I did not run into this issue (using the same code and data) until I updated to the most recent version of numpy/scipy (numpy 1.17.0, scipy 1.3.1), so I believe it is a result of this update. (I cannot downgrade, as other libraries that I require for other aspects of this project need these versions.)
I'm running this on a large dataset (N ~15000), and for very specific values the curve fit fails, claiming that the initial guess is outside of the bound constraints. This is not the case, and even checking quickly via the print statement before the curve fit in the provided example confirms this.
At first I had thought that it was a numpy precision error and that a value this small was considered to be out of bounds, but altering it slightly or providing a new, arbitrary number of a similar magnitude does not cause a ValueError. Additionally, other failed values are as big as ~1e-10, so I assume it must be something else.
Here is an example that fails for me every time:
import numpy as np
import scipy as sp
from scipy.special import expit, logit
import scipy.optimize
def f(x, x0, g, c, k):
    y = c*expit(k*10.*(x-x0)) + g*(1.-c)
    return y
# x0 g c k
p0 = np.array([8.841357069490852e-01, 4.492363462957287e-19, 5.547073496706608e-01, 7.435378446218519e+00])
bounds = np.array([[-1.,1.], [0.,1.], [0.,1.], [0.,20.]])
x = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.8911796599834791, 1.0, 1.0, 1.0, 0.33232919909076103, 1.0])
y = np.array([0.999, 0.999, 0.999, 0.999, 0.999, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001])
s = np.array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9])
print([pval >= b[0] and pval <= b[1] for pval,b in zip(p0,bounds)])
fit,cov = sp.optimize.curve_fit(f,x,y,p0=p0,sigma=s,bounds=([b[0] for b in bounds],[b[1] for b in bounds]),method='dogbox',tr_solver='exact')
print(fit)
print(cov)
Here is the specific error stack (everything after the above call to curve_fit):
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\optimize\minpack.py", line 763, in curve_fit
**kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\optimize\_lsq\least_squares.py", line 927, in least_squares
tr_solver, tr_options, verbose)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\optimize\_lsq\dogbox.py", line 310, in dogbox
J = jac(x, f)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\optimize\_lsq\least_squares.py", line 874, in jac_wrapped
kwargs=kwargs, sparsity=jac_sparsity)
File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\optimize\_numdiff.py", line 362, in approx_derivative
raise ValueError("`x0` violates bound constraints.")
ValueError: `x0` violates bound constraints.
If anyone has any insight as to what may be causing this, I would greatly appreciate the help! I did some searching and couldn't find any answers that may relate to this scenario, so I decided to open this question up. Thanks!
EDIT 9/9/19:
np.__version__ is 1.17.2 and sp.__version__ is 1.3.1. When I originally posted this I was on numpy 1.17.0, but upgrading has not fixed the issue. I'm running this on Python 3.6.6 on 64-bit Windows 10.
If I change either the second or fourth bound to be +/-np.inf (or change both), then the code does in fact complete -- but I am still unsure how my x0 is invalid (and I still need to have the fit bounded to these values)
EDIT: 1/22/20
Upgraded np.__version__ to 1.18.1 and sp.__version__ to 1.4.1, to no avail. I have opened an issue on the scipy GitHub repository for this error. However, it seems that they are also unable to reproduce the issue and therefore cannot address it.
Horrible hack. Do not do it at home :) But if you just need to get work done, at your own risk:
In
C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\optimize\_numdiff.py
Find:
if np.any((x0 < lb) | (x0 > ub)):
raise ValueError("`x0` violates bound constraints.")
Replace with:
if np.any(((x0 - lb) < -1e-12) | (x0 > ub)):
raise ValueError("`x0` violates bound constraints.")
Here -1e-12 is whatever violation of the bound constraint (x0 - lb) < 0 you think your case can tolerate; x0 is the current guess and lb is the lower bound.
I do not know what numerical horrors would result out of this hack. But if you just want to get going...
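A less invasive variant of the same idea (again at your own risk; whether it avoids the error in your case is not guaranteed, and the 1e-12 slack is just an assumption you may need to tune): rather than patching scipy, widen the offending bounds by a tiny tolerance in the curve_fit call itself, so that a numerically borderline iterate is still considered feasible.
import numpy as np
import scipy as sp
import scipy.optimize
from scipy.special import expit

def f(x, x0, g, c, k):
    return c*expit(k*10.*(x - x0)) + g*(1. - c)

# Same data and initial guess as in the question.
p0 = np.array([8.841357069490852e-01, 4.492363462957287e-19,
               5.547073496706608e-01, 7.435378446218519e+00])
x = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.8911796599834791,
              1.0, 1.0, 1.0, 0.33232919909076103, 1.0])
y = np.array([0.999, 0.999, 0.999, 0.999, 0.999, 0.001,
              0.001, 0.001, 0.001, 0.001, 0.001])
s = np.full(len(y), 0.9)

# Same bounds as in the question, each widened by a tiny slack so that an
# iterate sitting exactly on (or a hair outside) a bound still passes the check.
slack = 1e-12
lb = np.array([-1., 0., 0., 0.]) - slack
ub = np.array([1., 1., 1., 20.]) + slack

fit, cov = sp.optimize.curve_fit(f, x, y, p0=p0, sigma=s, bounds=(lb, ub),
                                 method='dogbox', tr_solver='exact')
print(fit)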
Non-Censored (Complete) Dataset
I am attempting to use the scipy.stats.weibull_min.fit() function to fit some life data. Example generated data is contained below within values.
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, 20683.2,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
I attempt to fit using the function:
fit = scipy.stats.weibull_min.fit(values, loc=0)
The result:
(1.3392877335100251, -277.75467055900197, 9443.6312323849124)
Which isn't far from the nominal beta and eta values of 1.4 and 10000.
Right-Censored Data
The Weibull distribution is well known for its ability to deal with right-censored data, which makes it incredibly useful for reliability analysis. How do I deal with right-censored data within scipy.stats? That is, how do I fit the curve when some of the data has not experienced failures yet?
The input form might look like:
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, np.inf,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
or perhaps using np.nan or simply 0.
Both of the np solutions throw RuntimeWarnings and are definitely not coming close to the correct values. Using numeric values instead, such as 0 or -1, removes the RuntimeWarning, but the returned parameters are obviously flawed.
Other Software
In some reliability or lifetime analysis software (Minitab, lifelines), you supply two columns of data, one for the actual numbers and one to indicate whether the item has failed yet. For instance:
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, 0,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
censored = np.array(
[True, True, True, True, False,
True, True, True, True, True]
)
I see no such paths within the documentation.
Old question, but if anyone comes across this: there is a new survival analysis package for Python, surpyval, that handles this and other cases of censoring and truncation. For the example you provide above it would simply be:
import surpyval as surv
values = np.array([10197.8, 3349.0, 15318.6, 142.6, 6976.5, 2590.7, 11351.7, 10177.0, 3738.4])
# 0 = failed, 1 = right censored
censored = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0])
model = surv.Weibull.fit(values, c=censored)
print(model.params)
(10584.005910580288, 1.038163987652635)
You might also be interested in the Weibull plot:
model.plot(plot_bounds=False)
[Weibull plot]
Full disclosure: I am the creator of surpyval.
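For completeness, and if I remember correctly, newer SciPy releases (1.11 and up) added scipy.stats.CensoredData, which the fit method of the continuous distributions accepts directly. A rough sketch of that route, treating the 20683.2 observation from the question as the right-censored one (my assumption):
import numpy as np
from scipy import stats

# Observed failure times plus one unit that was still running (right censored).
failures = np.array([10197.8, 3349.0, 15318.6, 142.6,
                     6976.5, 2590.7, 11351.7, 10177.0, 3738.4])
still_running = np.array([20683.2])

data = stats.CensoredData(uncensored=failures, right=still_running)

# Fix the location at 0 so only the shape (beta) and scale (eta) are estimated.
beta, loc, eta = stats.weibull_min.fit(data, floc=0)
print(beta, loc, eta)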
I'm using statsmodels' weighted least squares regression, but getting some really huge values.
Here's my code:
X = np.array([[1,2,3],[1,2,3],[4,5,6],[1,2,3],[4,5,6],[1,2,3],[1,2,3],[4,5,6],[4,5,6],[1,2,3]])
y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
w = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
temp_g = sm.WLS(y, X, w).fit()
Now, what I understand is that in WLS regression, just like in any linear regression problem, we provide the endog vector and the exog matrix, and the function finds the line of best fit and tells us what the coefficients/regression parameters ought to be. For example, in my data, where each observation consists of 3 features, I'm expecting there to be 3 parameters.
So I fetch them like this:
parameters = temp_g.params # I'm hoping I've got this right! Or do I need to use "fittedvalues" instead?
The issue is that I'm getting really huge values like this:
temp g params :
[ -7.66645036e+198 -9.01935337e+197 5.86257969e+198]
or this:
temp g params :
[-2.77777778 -0.44444444 1.88888889]
This is creating problems in further usage of these parameters, especially since I also have some exponents to work with: I need to raise e to the power of some of the regression parameters, which is proving impossible with such big numbers, because I keep getting overflow errors when using exp().
Is this normal? Am I doing something wrong? Or is there a specific way to make them useful?
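For reference, here is a minimal sketch of sm.WLS on small, made-up, well-conditioned data, mainly to show that params holds one coefficient per column of X while fittedvalues holds the per-row predictions:
import numpy as np
import statsmodels.api as sm

# Made-up data: three linearly independent feature columns.
X = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [1., 1., 0.],
              [2., 1., 3.],
              [0., 2., 1.],
              [1., 3., 2.]])
y = np.array([3., 2., 1., 6., 3., 5.])
w = np.full(len(y), 0.5)

res = sm.WLS(y, X, weights=w).fit()
print(res.params)        # three coefficients, one per feature column
print(res.fittedvalues)  # the model's predicted y for each observation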
Hi, I'm a scikit-learn newbie. I'm trying to train a classifier that, given an array of floats, decides between 3 classes. I labeled the classes as 0, 0.5, and 1. I also tried 0, 1.0, and 2.0. I still get the following error:
File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line 85, in unique_labels
raise ValueError("Mix type of y not allowed, got types %s" % ys_types)
ValueError: Mix type of y not allowed, got types set(['continuous', 'multiclass'])
I have no idea what that error means.
Try using integer types for your target labels. Or, perhaps better, use string labels like ['a', 'b', 'c'] but with more descriptive names.
If you check the code for this file multiclass.py (code is here) and look for the function type_of_target, you'll see that it is well-documented for this case.
Because some of the data are treated as float type (when 0.5 is included), it will believe you've got continuous-valued outputs, which won't do for multiclass discrete classification.
On the other hand, it will look at [0, 1.0, 2.0] like it is one integer and two floats, which is why you get both continuous and multiclass. Switching the last example to [0, 1, 2] should work. The documentation also makes it sound like switching to [0.0, 1.0, 2.0] would also work, but be careful and test that first.
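If it helps, here is a small sketch using sklearn.utils.multiclass.type_of_target (from the same module the traceback points at) to see how each label set gets classified; the exact strings may vary slightly between versions:
from sklearn.utils.multiclass import type_of_target

print(type_of_target([0, 0.5, 1]))      # 'continuous' -- the 0.5 makes it look like regression targets
print(type_of_target([0, 1, 2]))        # 'multiclass'
print(type_of_target(['a', 'b', 'c']))  # 'multiclass'
print(type_of_target([0.0, 1.0, 2.0]))  # integer-valued floats; check what your version reports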
It's hard to tell for sure without the code, but my guess is that the shape of your y data is not what is expected.
For example, when my code threw this error it was because I was trying to pass y data into classification_report with the shape (60000, 10, 2) when it was expecting the shape (60000, 10).
I was re-running cells where I called to_categorical(y_test) more than once... When I loaded my code into a proper script and ran it, it worked fine :)