apply fitted model to data and obtain loglikelihood - python

I would like to do the following in Python, preferably with the statsmodels package (but if you know a solution with another package, I would be glad to hear about it as well):
I have data olddata and predictors predictors. I used
import statsmodels.api as sta
model = sta.GLM(olddata,predictors,family=sta.families.Binomial())
fitted = model.fit()
loglikelihood = fitted.llf
to fit a model and obtain the loglikelihood. Now I would like to find out how well this fitted model describes a new dataset newdata. If I just used
loglikelihood = sta.GLM(newdata,predictors,family=sta.families.Binomial()).fit().llf
I would of course just obtain the loglikelihood for a model fitted with new weights to my new data. What I would like to obtain, however, is the loglikelihood for the old model given the new data. I would be glad if someone could tell me how this can be done without manually calculating the loglikelihood.
Thanks a lot in advance

I don't think there is a way to change the data in a model; however, the loglikelihood can be evaluated for any parameters.
In this case, we can create a new model with the new data, but evaluate the model.loglike at the old parameter estimate, something like
model_new = sta.GLM(newdata, predictors, family=sta.families.Binomial())
llf2 = model_new.loglike(fitted_old.params)
where fitted_old is the old estimation results instance.
This should work with statsmodels master, but I don't remember when loglike was added to work this way.
Note: result_instance.llf is the value of the loglikelihood at the estimate; model_instance.loglike is the method that can be evaluated at any params.
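Putting the pieces together, a rough sketch using the names from the question (it evaluates on the same design matrix predictors; if newdata contains different observations, pass the matching design matrix instead):
import statsmodels.api as sta

# Fit on the original data.
model_old = sta.GLM(olddata, predictors, family=sta.families.Binomial())
fitted_old = model_old.fit()
print(fitted_old.llf)  # loglikelihood at the fitted parameters

# Wrap the new data in a model, but do NOT refit it.
model_new = sta.GLM(newdata, predictors, family=sta.families.Binomial())

# Evaluate the new model's loglikelihood at the OLD parameter estimates.
llf_new = model_new.loglike(fitted_old.params)
print(llf_new)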

Related

Using warm_start on a 1D Gaussian Mixture Model does not seem to work - the old fit still gets neglected?

Here you can see the plot of the newly fitted model:
the bins show all the data now available, i.e. the initial data used to fit the model plus the new data. The new data does not include the higher values.
These are the model parameters:
GaussianMixture(max_iter=10000, n_components=2, tol=0.0001, warm_start=True)
so warm_start certainly is set to True. When sampling from the model I also do not get the high values, so it does not seem to be an error in the plot either.
When fitting the model, which is called gmm, with the new data, I simply do
gmm_new = gmm.fit(new_data)
The new data is already expanded in dimensions so that this works.
When fitting the model again with the new AND old data, i.e. the whole dataset, the results look fine. But wouldn't that mean that I fitted the old data twice?
Am I using the warm-start wrong?
Well, as it turns out, the glossary holds the answer:
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset.
So it does make sense that the results look good when fitting again on the whole dataset.
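To illustrate the intended pattern, a minimal sketch on synthetic data (all names and values are made up): fit on a subset first, then warm-start the fit on the full dataset, since the previous solution is only used as the starting point.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated components; GaussianMixture expects a 2-D array.
data = np.concatenate([rng.normal(0, 1, 500),
                       rng.normal(8, 1, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=10000, tol=0.0001,
                      warm_start=True)
gmm.fit(data[:300])   # initial fit on a subset
gmm.fit(data)         # warm-started refit on the FULL dataset
Note that fit always re-estimates the parameters from whatever data it is given; warm_start only supplies the initialization for EM. So refitting on the whole dataset does not "double-count" the old data, and fitting on the new data alone will drop whatever the old data contributed.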

Sklearn / scikit-learn: using the fit method

How does the fit() method work in sklearn.preprocessing's Imputer class?
What exactly does fit() do in the background, and why is it necessary for the code below?
Everywhere I see "fitting" - fitting what, with what, why, and how?
from sklearn.preprocessing import Imputer
impt = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
impt = impt.fit(X[:,1:3])
X[:,1:3] = impt.transform(X[:,1:3])
The idea is to 'fit' your pre-processing on your training data only (as you would your model). It will learn some state; for the imputer this might be the mean of your feature. Then when you transform your test / validation data, you use that state (i.e. the mean in this case) to impute the new, unseen data. Using this design, it is really easy to avoid data leaks. Consider if you had imputed on your entire dataset: the mean that you use for the imputation now uses some of the information from your supposedly unseen test data. This is a data leak; your data is no longer truly unseen. Scikit-learn uses the fit / transform pattern to easily mitigate this common pitfall in machine learning.
Furthermore, because ALL sklearn transformers and estimators use this fit API, you can chain them up in a pipeline, making it possible to do all your pre-processing easily on each fold of a k-fold cross-validation, which would otherwise be a very fiddly and error-prone thing to do by hand.
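As a concrete sketch of the fit-on-train / transform-on-test pattern described above (using SimpleImputer, which replaced Imputer in scikit-learn 0.20+; the data here is made up):
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan], [6.0]])
X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_train)                        # learns the mean from the TRAINING data only
X_test_imputed = imputer.transform(X_test)  # fills test NaNs with the train mean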
Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
The above line creates an Imputer object which will impute/replace the missing values which are denoted as NaN's with the mean value of the values.
impt = impt.fit(X[:,1:3])
So it needs some data from which it can calculate the mean that will replace the missing values. This is normally done by the fit method, which calculates the values needed (the mean in this case). fit takes in some data to calculate these values; this step is normally called the training phase.
impt.transform(X[:,1:3])
Once the values are calculated, they can be used on any new data presented to the imputer. In this case it will replace the missing data with the mean calculated during fit. This is done via the transform method.
Sometimes one might want to run fit and transform on the same data. In such cases, instead of calling fit followed by transform, we can use the fit_transform method.
X[:,1:3] = impt.fit_transform(X[:,1:3])
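If you want to inspect what fit actually learned, the computed state is stored on the fitted object; for the imputer it is the per-column means:
impt = impt.fit(X[:, 1:3])
print(impt.statistics_)  # the column means that transform substitutes for NaNs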
Well, the aim of "fit" in the preprocessing stage is to compute the necessary values (like the min and max of each variable). With those values scikit-learn can then preprocess your data, which it couldn't do before. It is also useful because you can reuse your preprocessor object later.
You can also use fit_transform if you'd like to do these two steps in one.

Statsmodels Mixed Linear Model predictions

I am estimating a Mixed Linear Model using the statsmodels MixedLM package in Python. After fitting the model, I now want to make predictions but am struggling to understand the 'predict' method.
The statsmodels documentation (http://www.statsmodels.org/dev/generated/statsmodels.regression.mixed_linear_model.MixedLM.predict.html) suggests that the predict method takes an array containing the parameters of the model that has been estimated. How can I retrieve this array?
y = raw_data['dependent_var']
X = raw_data[['var1', 'var2', 'var3']]
groups = raw_data['person_id']
model = sm.MixedLM(endog=y, exog=X, groups=groups)
result = model.fit()
I know I am late by a few months, but it's good to answer in case someone else has the same question. The required params are available in the result object: result.fe_params
model.predict(result.fe_params, exog=xtest)
or with result object
result.predict(exog=xtest)
To answer user11806155's question: to make predictions purely on the fixed effects, you can do
model.predict(result.fe_params, exog=xtest)
To make predictions including the random effects, you can swap in the parameters for a particular group (e.g. "group1"):
model.predict(result.random_effects["group1"], exog=xtest)
I assume the order of features in the test data should follow the same order as the model's parameters. You can add the two predictions together to get the prediction specifically for a group.
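Putting it together, a rough sketch (column names follow the question; X_test is a hypothetical new dataset with the same columns, and only the fixed effects are used):
import statsmodels.api as sm

y = raw_data['dependent_var']
X = raw_data[['var1', 'var2', 'var3']]
groups = raw_data['person_id']

model = sm.MixedLM(endog=y, exog=X, groups=groups)
result = model.fit()

# Fixed-effects-only predictions for new data:
pred = model.predict(result.fe_params, exog=X_test)
# or, equivalently, via the results object:
pred = result.predict(exog=X_test)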

Using cross_val_predict for predictions

I have the following code where I want to use k-fold cross validation for a Linear Regression model:
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import mean_squared_error
import pandas

kf = KFold(n_splits=100)
predi = cross_val_predict(model, train[columns], train[target], cv=kf)
predi = pandas.Series(predi)
model.fit(data[columns], data[target])
pred_test = model.predict(test[columns])
print(mean_squared_error(pred_test, test[target]))
However, I am not sure whether the code does what I would like it to do. Specifically, I am not sure about the model.fit part. Does it even use the cross-validation?
The reason why I am not sure is that calculating it like this yields worse results than without cross-validation.
No. CV is just for checking the performance of the model on the data (or rather, on different parts of it).
When you call fit(), it will fit on the whole data supplied at that time, whereas cross-validation only uses parts of the data (leaving one fold out in each iteration). So this difference in data may cause the estimator to perform better or worse.
model.fit doesn't have any functionality to divide the data. It just solves the cost-function minimization problem and creates a model (i.e. finds the parameters).
Also, if you think that by creating a loop, dividing the data on every iteration, and calling model.fit again and again you would get a more generalized model, that's not possible: calling fit a second time on a linear regression model object makes it forget the old data.
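To make the two roles explicit, a sketch (names follow the question): cross-validation produces an out-of-sample error estimate, while the final fit is a separate step on all of the training data.
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=100)

# Each row of predi comes from a model that never saw that row during fitting.
predi = cross_val_predict(model, train[columns], train[target], cv=kf)
print(mean_squared_error(train[target], predi))     # CV estimate of the error

# Separate step: refit on ALL the training data for final use.
model.fit(train[columns], train[target])
pred_test = model.predict(test[columns])
print(mean_squared_error(test[target], pred_test))  # error on the held-out test set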

Set the weights of decision functions through stdin in Sklearn

Is there a method that lets me input the coefficients into the clf of an SVC in my script, and then apply the clf.score() or clf.predict() function for further testing?
Currently I am using joblib.dump(clf, 'file.pkl') to save all the information of a trained clf, but this involves disk writing/reading. It would be helpful if I could just define a clf from a few arrays representing the support vectors (clf.support_vectors_), weights (clf.coef_/clf.dual_coef_), and bias (clf.intercept_) respectively.
This line calls the prediction function from libsvm. It looks like this (but please take a look at the whole function _dense_predict):
libsvm.predict(
    X, self.support_, self.support_vectors_, self.n_support_,
    self.dual_coef_, self._intercept_,
    self.probA_, self.probB_, svm_type=svm_type, kernel=kernel,
    degree=self.degree, coef0=self.coef0, gamma=self._gamma,
    cache_size=self.cache_size)
You can use this call and give it all the relevant information directly to obtain a raw prediction. In order to do this, you must import libsvm: from sklearn.svm import libsvm. If your initial fitted classifier is called svc, you can obtain all the relevant information from it by replacing all the self keywords with svc and keeping the values. If svc._impl gives you "c_svc", then you set svm_type=0.
Note that at the beginning of the _dense_predict function you have X = self._compute_kernel(X). If your data is X, then you need to transform it by doing K = svc._compute_kernel(X), and call the libsvm.predict function with K as the first argument.
Scoring is independent of all this. Take a look at sklearn.metrics, where you will find e.g. accuracy_score, which is the default score for SVM classification.
This is of course a somewhat suboptimal way of doing things, but in this specific case, if it is impossible (I didn't check very hard) to set the coefficients directly, then going into the code, seeing what it does, and extracting the relevant part is surely an option.
Check out this blog post on memory usage of sklearn models using succinct tries to see if it is applicable.
If the other location does not have access to the sklearn package, you would need to create your own score and predict functions, since clf.score() and clf.predict() require clf to be an sklearn object.
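For the special case of a linear kernel, the decision function is just a dot product with the stored weights, so manual predict and score functions need nothing beyond the saved arrays. A sketch for the binary case (not the library's implementation; it assumes coef, intercept, and classes were taken from a fitted linear-kernel clf as clf.coef_, clf.intercept_, and clf.classes_):
import numpy as np

def manual_predict(X, coef, intercept, classes=np.array([0, 1])):
    # Binary linear-SVC decision function: f(x) = w . x + b
    decision = X @ coef.ravel() + intercept
    return classes[(decision > 0).astype(int)]

def manual_score(X, y, coef, intercept):
    # Accuracy, mirroring what clf.score() computes for classifiers.
    return np.mean(manual_predict(X, coef, intercept) == y)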
