getting usable values from statsmodels WLS - python

I'm using statsmodels' weighted least squares regression, but getting some really huge values.
Here's my code:
import numpy as np
import statsmodels.api as sm

X = np.array([[1,2,3],[1,2,3],[4,5,6],[1,2,3],[4,5,6],[1,2,3],[1,2,3],[4,5,6],[4,5,6],[1,2,3]])
y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
w = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
temp_g = sm.WLS(y, X, w).fit()
Now, what I understand is that in WLS regression, just like in any linear regression problem, we provide the endog vector and the exog matrix, and the function finds the line of best fit and tells us the regression coefficient for each feature. For example, since each observation in my data consists of 3 features, I'm expecting 3 parameters.
So I fetch them like this:
parameters = temp_g.params # I'm hoping I've got this right! Or do I need to use "fittedvalues" instead?
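(Side note on that comment: in statsmodels, params is the coefficient vector, one entry per column of the exog matrix, while fittedvalues is the prediction X @ params, one entry per observation. A tiny full-rank sketch, with a made-up toy matrix rather than the question's data:)
import numpy as np
import statsmodels.api as sm

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 10]])  # full-rank toy exog
y = np.array([1.0, 2.0, 3.0])
res = sm.WLS(y, X, weights=np.ones(3)).fit()
print(res.params.shape)        # (3,) -- one coefficient per feature
print(res.fittedvalues.shape)  # (3,) -- one prediction per observation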
The issue is that I'm getting really huge values like this:
temp g params :
[ -7.66645036e+198 -9.01935337e+197 5.86257969e+198]
or this:
temp g params :
[-2.77777778 -0.44444444 1.88888889]
This is creating problems in further usage of these parameters, especially since I need to raise e to the power of some of the regression parameters, which is proving impossible with such big numbers: I keep getting overflow errors when using exp().
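(For context on the overflow: np.exp over float64 overflows once its argument exceeds roughly 709, so parameters on the order of 1e198 cannot be exponentiated directly:)
import numpy as np

print(np.log(np.finfo(np.float64).max))  # ~709.78, the largest safe argument to exp
print(np.exp(710.0))                     # inf, with an overflow RuntimeWarning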
Is this normal? Am I doing something wrong? Or is there a specific way to make them useful?

HMMLearn: Too Many Values to Unpack

I'm trying to use hmmlearn to get the most likely hidden state sequence from a Hidden Markov Model, given start probabilities, transition probabilities, and emission probabilities.
I have two hidden states and four possible emission values, so I'm doing this:
import numpy as np
from hmmlearn import hmm

num_states = 2
num_observations = 4
start_probs = np.array([0.2, 0.8])
trans_probs = np.array([[0.75, 0.25], [0.1, 0.9]])
emission_probs = np.array([[0.3, 0.2, 0.2, 0.3], [0.3, 0.3, 0.3, 0.1]])
model = hmm.MultinomialHMM(n_components=num_states)
model.startprob_ = start_probs
model.transmat_ = trans_probs
model.emissionprob_ = emission_probs
seq = np.array([[3, 3, 2, 2]]).T
model.fit(seq)
log_prob, state_seq = model.decode(seq)
My stack trace points to the decode call and throws this error:
ValueError: too many values to unpack (expected 2)
I thought decode (looking at the docs) returns a log probability and the state sequence, so I'm confused.
Any idea?
Thanks!
The call model.fit(seq) requires seq to be a list of lists, which is how you have set it up.
However, model.decode(seq) requires seq to be a plain list, not a list of lists. Thus,
model.fit([[3, 3, 2, 2]])
log_prob, state_seq = model.decode([3, 3, 2, 2])
should work without throwing an error.
See also here.
The error ValueError: too many values to unpack (expected 2) is thrown from a function called by a function called by a function inside decode. So the error does not mean that decode returned the wrong number of objects; it actually comes from framelogprob.shape somewhere inside base.py. A more meaningful error message would make life easier here.
I had the same issue and it drove me crazy. Hope my post helps somebody.

scipy.stats.weibull_min.fit() - how to deal with right-censored data?

Non-Censored (Complete) Dataset
I am attempting to use the scipy.stats.weibull_min.fit() function to fit some life data. Example generated data is contained below within values.
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, 20683.2,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
I attempt to fit using the function:
fit = scipy.stats.weibull_min.fit(values, loc=0)
The result:
(1.3392877335100251, -277.75467055900197, 9443.6312323849124)
Which isn't far from the nominal beta and eta values of 1.4 and 10000.
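(For clarity, weibull_min.fit returns the tuple (shape, loc, scale); in Weibull reliability notation the shape is beta and the scale is eta, so the fit above can be unpacked as follows, assuming the same imports and values as in the question:)
# weibull_min.fit returns (shape, loc, scale); shape is beta, scale is eta
shape, loc, scale = scipy.stats.weibull_min.fit(values, loc=0)
print(shape, scale)  # the beta and eta estimates quoted above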
Right-Censored Data
The Weibull distribution is well known for its ability to deal with right-censored data, which makes it incredibly useful for reliability analysis. How do I deal with right-censored data within scipy.stats? That is, how do I fit the curve when some of the data has not experienced failure yet?
The input form might look like:
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, np.inf,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
or perhaps using np.nan or simply 0.
Both of the np solutions throw RuntimeWarnings and are definitely not coming close to the correct values. Using numeric values instead, such as 0 and -1, removes the RuntimeWarning, but the returned parameters are obviously flawed.
Other Software
In some reliability or lifetime analysis software (Minitab, lifelines), you supply two columns of data: one with the observed values and one indicating whether the item has failed yet. For instance:
values = np.array(
[10197.8, 3349.0, 15318.6, 142.6, 0,
6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)
censored = np.array(
[True, True, True, True, False,
True, True, True, True, True]
)
I see no such paths within the documentation.
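In the absence of a built-in path, one workaround is to maximize the censored log-likelihood directly with scipy.optimize. This is only a sketch of the standard approach, not scipy.stats functionality: failed units contribute the log-PDF, censored units the log-survival function, and the failed flags below are assumptions matching the example data.
import numpy as np
from scipy.optimize import minimize

values = np.array([10197.8, 3349.0, 15318.6, 142.6, 20683.2,
                   6976.5, 2590.7, 11351.7, 10177.0, 3738.4])
failed = np.array([True, True, True, True, False,
                   True, True, True, True, True])  # False = right-censored

def neg_log_likelihood(log_params):
    # optimize in log space so beta and eta stay positive
    beta, eta = np.exp(log_params)
    t = values
    log_pdf = np.log(beta / eta) + (beta - 1) * np.log(t / eta) - (t / eta) ** beta
    log_sf = -(t / eta) ** beta  # log of the Weibull survival function
    return -(log_pdf[failed].sum() + log_sf[~failed].sum())

res = minimize(neg_log_likelihood, x0=np.log([1.0, values.mean()]), method='Nelder-Mead')
print(np.exp(res.x))  # [beta, eta] estimates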
Old question, but if anyone comes across this: there is a newer survival analysis package for Python, surpyval, that handles this and other cases of censoring and truncation. For the example you provide above it would simply be:
import surpyval as surv
values = np.array([10197.8, 3349.0, 15318.6, 142.6, 6976.5, 2590.7, 11351.7, 10177.0, 3738.4])
# 0 = failed, 1 = right censored
censored = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0])
model = surv.Weibull.fit(values, c=censored)
print(model.params)
(10584.005910580288, 1.038163987652635)
You might also be interested in the Weibull plot:
model.plot(plot_bounds=False)
[Weibull plot]
Full disclosure: I am the creator of surpyval.

Python KNN weighting during .predict()?

I'm using a KNN algorithm for a class (we were instructed to use this algorithm, so it may not be what you'd expect for this application; see below).
Essentially, we have a Raspberry Pi set up to collect the signal strengths of 6 local Wi-Fi routers' MAC addresses. At different locations on a floor of our building we've recorded these signal strengths in .csv files.
Using Python, I've created a script which uses the functions on this page: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
I fit a knn as below:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1, algorithm = 'auto')
knn.fit(strengths, labels)
where strengths is a nested array like this:
[[Loc1strengths],[Loc2strengths],[Loc3strengths],[Loc4strengths],[Loc5strengths],[Loc6strengths]]
labels is set up like this:
[Loc1, Loc2, Loc3, Loc4, Loc5, Loc6]
Later in the script, I collect the signal strengths of the 6 local WIFI router Mac addresses and try to predict the location of my pi using knn.predict() and hope to get the location of the pi, Location1 for example.
The results aren't great; it does a relatively poor job of figuring out where it is.
I was wondering if there is a way to weight knn.predict() so that the neighbors of the most recent location count more heavily, since the Pi won't move to the other side of the floor without crossing the points in between.
Any help would be appreciated!
It's a little bit hacky, but you can do this using the weights parameter in KNeighborsClassifier. If you add your times as an extra feature and then write a custom distance function, you can weight the distance between samples using time. A really simple example is shown here:
import numpy as np

def time_weight(x1, x2):
    # I've added my time variable at the end of my features
    time_diff = np.linalg.norm(x1[-1] - x2[-1])
    feature_diff = np.linalg.norm(x1[:-1] - x2[:-1])
    return time_diff * feature_diff
Some dummy data
X = np.array([[0, 1], [0, 0.5]])
time = np.array([0, 5]).reshape(-1, 1)
y = np.array([0, 1])
X_with_time = np.hstack((X, time))
Test that our weighted distance makes sense:
print(time_weight(np.array([0, 1, 0]), np.array([0, 0.75, 2])))
print(time_weight(np.array([0, 1, 0]), np.array([0, 0.75, 3])))
print(time_weight(np.array([0, 0.5, 5]), np.array([0, 0.75, 2])))
print(time_weight(np.array([0, 0.5, 5]), np.array([0, 0.75, 3])))
Output:
0.5
0.75
0.75
0.5
That's what I expect to see: if something is twice as far away in time, it is twice the distance. So now check that it works with KNeighborsClassifier:
X_with_time = np.hstack((X, time))
knn = KNeighborsClassifier(metric=time_weight, n_neighbors=1)
knn.fit(X_with_time, y)
print(knn.predict([[0, 0.75, 2]]))
print(knn.predict([[0, 0.75, 3]]))
Output:
[0]
[1]
Again, that's what I expected to see. So it looks like it's not too painful to do. I would recommend spending some time thinking about how you want to set up your distance function, as it will really affect the results.
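As a footnote (an alternative approach, not what the answer above does): if you only need neighbor votes weighted by feature-space distance rather than by time, KNeighborsClassifier supports that out of the box via weights='distance'. A minimal sketch with made-up data:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0, 1.0], [0.0, 0.5], [1.0, 1.0]])
y = np.array([0, 1, 0])

# votes are weighted by inverse distance, so nearer neighbors dominate
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn.fit(X, y)
print(knn.predict([[0.0, 0.9]]))  # [0]: the nearby class-0 point outweighs the rest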

Kolmogorov Smirnov Test in Spark (Python) not working?

I was doing a normality test in Python spark-ml and saw what I think is a bug.
Here is the setup: I have a dataset that is normalized (range -1 to 1).
When I do a histogram, I can clearly see that the data is NOT normal:
>>> prices_norm.histogram(10)
([-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
[226, 269, 119, 95, 52, 26, 8, 2, 2, 5])
When I run the Kolmogorov-Smirnov test I get the following results:
>>> testResults = Statistics.kolmogorovSmirnovTest(prices_norm, "norm")
>>> print testResults
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.46231145770077375
pValue = 1.742039845709087E-11
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
The Kolmogorov-Smirnov test defines the null hypothesis (H0) as: the data follows a specified distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm).
In this case the p-value is very low, so we should reject the null hypothesis. This makes sense, as it is clearly not normal.
So why then, does it say:
Sample follows theoretical distribution
Isn't this wrong? Shouldn't it say that the sample does NOT follow a theoretical distribution? Am I missing something?
This was driving me crazy, so I went to look at the source code directly:
git://git.apache.org/spark.git
spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test/KolmogorovSmirnovTest.scala
The code is correct, the null Hypothesis is set as:
object NullHypothesis extends Enumeration {
  type NullHypothesis = Value
  val OneSampleTwoSided = Value("Sample follows theoretical distribution")
}
The verbiage of the string message is just restating the null hypothesis:
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
(the text after the colon is just H0 restated)
Arguably the verbiage is confusing as it could be interpreted both ways. But it is indeed correct.
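As a quick sanity check of that interpretation outside Spark, scipy's one-sample KS test states the same null hypothesis, and a tiny p-value likewise means rejecting it (a minimal sketch, not Spark code):
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(size=500)  # clearly non-normal data
stat, p = stats.kstest(sample, 'norm')
print(stat, p)  # large statistic, tiny p-value -> reject H0 ("sample follows N(0,1)")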

PyMC Bernoulli model checking

I am currently trying to do model checking with PyMC where my model is a Bernoulli model and I have a Beta prior. I want to do both a (i) gof plot as well as (ii) calculate the posterior predictive p-value.
I have got my code running with a Binomial model, but I am struggling to find the right way to make a Bernoulli model work. Unfortunately, there is no example anywhere that I can work from. My code looks like the following:
import pymc as mc
import numpy as np
alpha = 2
beta = 2
n = 13
yes = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
p = mc.Beta('p', alpha, beta)
surv = mc.Bernoulli('surv', p=p, observed=True, value=yes)
surv_sim = mc.Bernoulli('surv_sim', p=p)
mc_est = mc.MCMC({'p': p, 'surv': surv, 'surv_sim': surv_sim})
mc_est.sample(10000, 5000, 2)
import matplotlib.pylab as plt
plt.hist(mc_est.surv_sim.trace(), bins=range(0, 3), normed=True)
plt.figure()
plt.hist(mc_est.p.trace(),bins=100,normed=True)
mc.Matplot.gof_plot(mc_est.surv_sim.trace(), 10/13., name='surv')
#here I have issues
D = mc.discrepancy(yes, surv_sim, p.trace())
mc.Matplot.discrepancy_plot(D)
The main problem I am having is determining the expected values for the discrepancy function. Just using p.trace() does not work here, as those are probabilities. Somehow I need to incorporate the sample size, but I am struggling to do that in the same way I would for a Binomial model. I am also not quite sure whether I am doing the gof_plot correctly.
Hope someone can help me out here! Thanks!
Per the discrepancy function's docstring, the parameters are:
observed : Iterable of observed values (size=(n,))
simulated : Iterable of simulated values (size=(r,n))
expected : Iterable of expected values (size=(r,) or (r,n))
So you need to correct two things:
1) modify your simulated results to have size n (i.e., 13 in your example):
surv_sim = mc.Bernoulli('surv_sim', p=p, size=n)
2) encapsulate your p.trace() with the bernoulli_expval method:
D = mc.discrepancy(yes, surv_sim.trace(), mc.bernoulli_expval(p.trace()))
(bernoulli_expval just spits back p.)
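Putting both fixes together against the question's PyMC 2.x code (a sketch; the names mirror the question's variables, and only the size argument and the discrepancy call change):
surv_sim = mc.Bernoulli('surv_sim', p=p, size=n)
mc_est = mc.MCMC({'p': p, 'surv': surv, 'surv_sim': surv_sim})
mc_est.sample(10000, 5000, 2)
# simulated draws now have shape (r, n); expected is p for each draw
D = mc.discrepancy(yes, surv_sim.trace(), mc.bernoulli_expval(p.trace()))
mc.Matplot.discrepancy_plot(D)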
With those two changes, I get the following: [discrepancy plot]
