Statsmodels Mixed Linear Model predictions - python

I am estimating a Mixed Linear Model using the statsmodels MixedLM package in Python. After fitting the model, I now want to make predictions but am struggling to understand the 'predict' method.
The statsmodels documentation (http://www.statsmodels.org/dev/generated/statsmodels.regression.mixed_linear_model.MixedLM.predict.html) suggests that the predict method takes an array containing the parameters of the model that has been estimated. How can I retrieve this array?
import statsmodels.api as sm
y = raw_data['dependent_var']
X = raw_data[['var1', 'var2', 'var3']]
groups = raw_data['person_id']
model = sm.MixedLM(endog=y, exog=X, groups=groups)
result = model.fit()

I know I am a few months late, but it is worth answering in case someone else has the same question. The parameters you need are available on the result object as result.fe_params:
model.predict(result.fe_params, exog=xtest)
or, using the result object directly:
result.predict(exog=xtest)

To answer user11806155's question: to make predictions from the fixed effects only, you can do
model.predict(result.fe_params, exog=xtest)
The estimated random effects are stored per group in result.random_effects, keyed by group name (e.g. "group1"):
result.random_effects["group1"]
These coefficients belong to the random-effects design (a random intercept by default), so they have to be multiplied by the matching random-effects columns rather than by xtest, unless the model was fitted with exog_re equal to exog. I assume the order of features in the test data should follow the same order as the model's parameters. Adding the fixed-effects and random-effects parts together gives the prediction for a specific group.

Related

using warm_start on a 1D Gaussian Mixture Model does not seem to work - old fits still get neglected?

Here you can see the plot of the newly fit model:
the bins show all the now available data, so the initial data used to fit the model and the new data. The new data does not include the higher values.
These are the model parameters:
GaussianMixture(max_iter=10000, n_components=2, tol=0.0001, warm_start=True)
so warm_start is certainly set to True. When sampling from the model I also do not receive the high values, so it does not seem to be an error in the plot either.
When fitting the model, which is called gmm, with new data I simply do
gmm_new = gmm.fit(new_data)
The new data is already expanded in dimensions so that this works.
When fitting the model again with new AND old data, so the whole dataset, the results look fine though. But wouldn't that mean that I fitted the old data twice?
Am I using the warm-start wrong?
Well, as it turns out, the glossary holds the answer:
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset.
So it does make sense that the results look good when fitting again on the whole data set.
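A small sketch of that behaviour, on synthetic 1D data that is assumed purely for illustration: with warm_start=True the previous solution is only used as the starting point of the next fit, and the data from earlier calls is not kept. Refitting on new data alone therefore describes only the new data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic 1D data (assumed for illustration): a low and a high cluster.
low = rng.normal(0.0, 1.0, size=(1000, 1))
high = rng.normal(10.0, 1.0, size=(1000, 1))
full_data = np.vstack([low, high])
rng.shuffle(full_data)  # shuffle rows so any slice contains both clusters

# Glossary use case: fit on a subset, then fine-tune on the full data.
gmm = GaussianMixture(n_components=2, warm_start=True, random_state=0)
gmm.fit(full_data[:200])  # initial fit on a subset
gmm.fit(full_data)        # second fit starts from the previous solution
means_full = np.sort(gmm.means_.ravel())

# Refitting on new data that lacks the high cluster: the old data is not
# remembered, so both component means end up inside the new data's range.
new_data = rng.normal(0.0, 1.0, size=(500, 1))
gmm.fit(new_data)
means_new_only = np.sort(gmm.means_.ravel())
```

Fitting on the old and new data together does not count the old data twice, because each call to fit sees only the array it is given.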

Confidence score for machine learning with SciKit Learn?

I have followed an example of applying scikit-learn's machine learning to facial recognition.
https://scikit-learn.org/stable/auto_examples/applications/plot_face_recognition.html#sphx-glr-auto-examples-applications-plot-face-recognition-py
I have been able to adapt the example to my own data successfully. However, I am lost on one point:
after preparing the data, training the model, ultimately, you end up with the line:
Y_pred = clf.predict(X_test_pca)
This produces a vector of predictions, one per face.
What I can't figure out is how to get any confidence measurement to correspond with that.
The classification method is a forced choice, so that each face passed in MUST be classified as one of the known faces, even if it isn't even close.
How can I get a number per face that will reflect how well the result matches the known face?
It seems like you are looking for the .predict_proba() method of the scikit-learn estimators. It returns the probabilities of possible outcomes instead of a single prediction.
The example you are referring to is using an SVC. It is a little special in regard to this function as it states:
The model need to have probability information computed at training time: fit with attribute probability set to True.
So, if you are using the same model as in the example, instantiate it with:
SVC(kernel='rbf', class_weight='balanced', probability=True)
and use .predict_proba() instead of .predict():
y_pred = clf.predict_proba(X_test_pca)
This returns an array of shape (n_samples, n_classes), i.e. the probabilities for each class for each sample. The probabilities for class k across all samples are then in column k, y_pred[:, k], while y_pred[i] holds the class probabilities for sample i.
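A minimal sketch of turning these probabilities into one confidence number per sample. The iris data stands in for the PCA-transformed face features, and the names confidence and best_class are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data (an assumption): iris instead of PCA-transformed faces.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True has to be set before fitting.
clf = SVC(kernel="rbf", class_weight="balanced", probability=True, random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)                # shape (n_samples, n_classes)
confidence = proba.max(axis=1)                   # highest class probability per sample
best_class = clf.classes_[proba.argmax(axis=1)]  # class with that probability
proba_class_0 = proba[:, 0]                      # column k belongs to clf.classes_[k]
```

A low value in confidence can then be used to flag a face as "probably none of the known people"; the cut-off is something to tune on your own data. The scikit-learn documentation also notes that for SVC the class with the highest predict_proba value can occasionally differ from what predict returns.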

How to scale the input data for trained model?

I have a trained model that uses regression to predict house prices. It was trained on a standardized dataset (StandardScaler from sklearn). How do I scale my model's input (a single example) now in a different Python program? I can't use StandardScaler on the input, because all features would be reduced to 0 (MinMaxScaler doesn't work either; I also tried saving and loading the scaler from the training script, which didn't work). So, how can I scale my input so that the features won't be 0, allowing the model to predict the price correctly?
What you've described is a contradiction in terms. Scaling refers to a range of data; a single datum does not have a "range"; it's a point.
What you seem to be asking is how to scale the input data so that it matches the transformation you applied during training. The answer is again straightforward: you have to use the same transformation, with the same coefficients, that you applied when you trained. Standard practice is to keep the fitted scaling function together with the model so that it can be applied to every later input; if you didn't do that, and you didn't make any note of that function's coefficients, then you do not have the information needed to apply the same transformation to future input. In short, your trained model isn't particularly useful.
You could try to recover the coefficients by running the scaling function on the original data set, making sure to output the resulting function. Then you could apply that function to your input examples.
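As a rough sketch of that workflow, assuming the scaler was a StandardScaler: fit it once on the training features, persist it, and in the prediction program only call transform. The tiny feature matrix is invented for illustration, and pickle is used here only to keep the example self-contained.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Training script: fit the scaler on the training features only.
# (Made-up data: living area and number of rooms.)
X_train = np.array([[50.0, 1.0], [80.0, 2.0], [120.0, 3.0], [200.0, 5.0]])
scaler = StandardScaler().fit(X_train)
saved = pickle.dumps(scaler)  # in practice, write these bytes to a file

# Prediction script: load the fitted scaler and only call transform().
loaded = pickle.loads(saved)
x_new = np.array([100.0, 3.0])                     # one example
x_scaled = loaded.transform(x_new.reshape(1, -1))  # shape (1, n_features)

# Fitting a fresh scaler on the single example gives all zeros instead,
# because the example's own values are subtracted as the mean.
refit_on_one = StandardScaler().fit_transform(x_new.reshape(1, -1))
```

The all-zero result described in the question is what refit_on_one shows: it comes from fitting on the single example rather than reusing the training statistics stored in scaler.mean_ and scaler.scale_.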

Multi-output regression

I have been looking into multi-output regression for the last few weeks. I am working with the scikit-learn package. My machine learning problem has an input of 3 features and needs to predict two output variables. Some ML models in the sklearn package support multi-output regression natively. If a model does not support this, the sklearn MultiOutputRegressor class can be used to convert it. The multi-output class fits one regressor per target.
Do the MultiOutputRegressor class or the natively supported multi-output regression algorithms take the underlying relationship between the targets into account?
Instead of a multi-output regression algorithm, should I use a neural network?
1) For your first question, I have divided that into two parts.
First part has the answer written in the documentation you linked and also in this user guide topic, which states explicitly that:
As MultiOutputRegressor fits one regressor per target it can not take advantage of correlations between targets.
The second part of the first question asks about other algorithms which support this. For that you can look at the "inherently multiclass" part in the user guide. Inherently multi-class means that they don't use a One-vs-Rest or One-vs-One strategy to handle multiple classes (OvO and OvR use multiple models to fit multiple classes and so may not use the relationship between targets). Instead, they can structure the multi-class setting into a single model. The user guide lists the following:
sklearn.naive_bayes.BernoulliNB
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.naive_bayes.GaussianNB
sklearn.neighbors.KNeighborsClassifier
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
sklearn.svm.LinearSVC (setting multi_class="crammer_singer")
sklearn.linear_model.LogisticRegression (setting multi_class="multinomial")
...
...
...
Try replacing the 'Classifier' at the end with 'Regressor' and see the documentation of fit() method there. For example let's take DecisionTreeRegressor.fit():
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (real numbers).
Use dtype=np.float64 and order='C' for maximum efficiency.
You see that it supports a 2-d array for targets (y). So it may be able to use correlation and underlying relationship of targets.
2) Now for your second question about using neural network or not, it depends on personal preference, the type of problem, the amount and type of data you have, the training iterations you want to do. Maybe you can try multiple algorithms and choose what gives best output for your data and problem.
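To make the difference between the two approaches concrete, here is a small sketch on synthetic data with 3 features and 2 targets. The data and the choice of GradientBoostingRegressor as the wrapped model are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # 3 input features
Y = np.column_stack([X[:, 0] + X[:, 1],         # target 1
                     X[:, 0] - 2.0 * X[:, 2]])  # target 2
Y += rng.normal(scale=0.1, size=Y.shape)

# Native multi-output: a single tree whose splits are scored on both targets.
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, Y)

# Wrapper: one independent regressor is fitted per target column.
wrapped = MultiOutputRegressor(GradientBoostingRegressor(random_state=0)).fit(X, Y)

tree_pred = tree.predict(X[:5])        # shape (5, 2)
wrapped_pred = wrapped.predict(X[:5])  # shape (5, 2)
n_wrapped_models = len(wrapped.estimators_)
```

Both return one column per target; which one predicts better on a given problem is an empirical question, so comparing them with cross-validation is a reasonable next step.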

apply fitted model to data and obtain loglikelihood

I would like to do the following in Python, preferably with the statsmodels package (but if you know a solution with another package, I would be glad to hear about it as well):
I have data olddata and predictors predictors. I used
import statsmodels.api as sta
model = sta.GLM(olddata,predictors,family=sta.families.Binomial())
fitted = model.fit()
loglikelihood = fitted.llf
to fit a model and obtain the loglikelihood. Now I would like to find out, how well this fitted model describes a new dataset newdata. If I just used
loglikelihood = sta.GLM(newdata,predictors,family=sta.families.Binomial()).fit().llf
I would of course just obtain the loglikelihood for a fitted model with new weights for my new data. What I would like to obtain, however, is the llhood for the old model given the new data. I would be glad, if someone could tell me how this can be done without manually calculating the loglikelihood.
Thanks a lot in advance
I don't think there is a way to change the data in a model. However, the loglikelihood can be evaluated for any parameters.
In this case, we can create a new model with the new data, but evaluate the model.loglike at the old parameter estimate, something like
model_new = sta.GLM(newdata, predictors, family=sta.families.Binomial())
llf2 = model_new.loglike(fitted_old.params)
where fitted_old is the old estimation results instance.
This should work with statsmodels master, but I don't remember when loglike was added to work this way.
Note, result_instance.llf is the value of the loglikelihood at the estimate, model_instance.loglike is the method that can be evaluated at any params.
