manually alter a sm.OLS coefficient - python

I'm running a multivariate regression in statsmodels. However, I would like to manually alter one of the coefficients for an independent variable prior to predicting. How would I go about doing that?
For example, say I train my data on a 2 year time period starting 4 years back. I return coefficients for wind, rain, and sun.
Now say that I train my data on the most recent 2 years of data and again get the coefficients in the regression output.
If I want to use the wind coefficient from the first regression output with the rain and sun coefficients from the second regression, how do I manually change wind prior to using predict?
EDIT:
Regression code/parameters:
import statsmodels.api as sm
model = sm.OLS(y[:train], X[:train]).fit()
predictions = model.predict(X[-test:])
Where X is [['rain','sun','wind']] and y is ['growth']

The prediction in OLS is just a linear function of the explanatory variables, x dot params.
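my_params = results.params.copy()  # results is the fitted OLS results object
my_params["wind"] = -99999  # params is a pandas Series here; use my_params[2] if it is a NumPy array
my_predict = x.dot(my_params)  # x holds the explanatory variables to predict for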
I recommend not changing any numbers directly in the model, because then any inferential results are invalid for the changed model.
If you have known parameters, then you can estimate a restricted model, e.g. with GLM.fit_constrained, or add them to the offset in GLM.
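As a rough sketch of the idea above, assuming you have kept the coefficients from both fits: start from the recent fit, swap in the older wind coefficient, and compute the prediction as a plain dot product. The coefficient values and the names params_old, params_new and new_rows are made-up placeholders, not output from a real regression.

```python
# Hypothetical coefficients from two separate fits (placeholder values).
params_old = {"rain": 0.8, "sun": 1.5, "wind": -0.3}  # fit on the older 2 years
params_new = {"rain": 1.1, "sun": 1.2, "wind": -0.7}  # fit on the most recent 2 years

# Start from the recent fit and swap in the older wind coefficient.
my_params = dict(params_new)
my_params["wind"] = params_old["wind"]


def predict(row, params):
    """OLS prediction for one observation: the dot product x . params."""
    return sum(params[name] * row[name] for name in params)


# Hypothetical new observations to predict for.
new_rows = [
    {"rain": 2.0, "sun": 5.0, "wind": 3.0},
    {"rain": 0.0, "sun": 8.0, "wind": 1.0},
]
predictions = [predict(row, my_params) for row in new_rows]
print(predictions)
```

With pandas objects the same computation is X_new.dot(my_params), where my_params is a copy of the params Series. The fitted model itself is left untouched, so its inferential results stay valid for the original fit.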

Related

Why are probabilities hand-calculated from sklearn.linear_model.LogisticRegression coefficients different from .predict_proba()?

I am running a multinomial logistic regression in sklearn, using sklearn.linear_model.LogisticRegression(multi_class="multinomial"). The dependent categorical variable has 3 options: Agree, Disagree, Unsure. The independent variables are two categorical variables: Education and Gender (binary gender for simplicity in this example). I get different results when I hand-calculate the probabilities from the regression coefficients versus use the built-in predict_proba().
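import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

mnlr = LogisticRegression(multi_class="multinomial")
mnlr.fit(
    pd.get_dummies(df[["Education", "Gender"]]),
    preprocessing.LabelEncoder().fit_transform(df["statement"]),
)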
I concatenate the outputs of mnlr.intercept_ and mnlr.coef_ into a regression coefficients table that looks like this:
Using mnlr.predict_proba(), I get results that I cast into a dataframe to which I add the independent variables like this:
These sum to 1 across the 3 potential categories for each data point.
However, I cannot seem to reproduce these results when I try to calculate the predicted probabilities by hand from the logistic regression coefficients.
First, for each Gender x Education combination, I calculate the logit (aka log-odds, if I understand correctly) by simply adding the intercept and the relevant variable terms. For example, to get the logit for a Woman with a Bachelor's degree with the Agree regression: 0.88076 + 0.21827 + 0.21687 = 1.31590. The table of logits looks like this:
From this table, as I understand it, I should be able to convert these logits (log-odds) to predicted probabilities: p = e^logit/(1+e^logit) for a given model and respondent (e.g., probability that Women with Bachelor's Agree with the statement). When I try this, however, I get much different results than I receive from .predict_proba() and the hand-calculated probabilities do not sum to 1, as indicated in the table below:
For example, Women with Bachelor's here have a 0.78850 probability to Agree with the statement, in place of the 0.7819 probability. Additionally, the hand-calculated probabilities across the 3 categories do not sum to 1, but rather to 1.47146.
I am almost certain this is a basic error on my part, but I cannot for the life of me figure it out. What am I doing incorrectly?
I figured this one out eventually. The answer is probably obvious to folks who really know multinomial logistic regression. The struggle I was having was that I needed to apply the softmax function (also known more descriptively as the normalized exponential function) to the logits. This function involves exponentiating the logit (log-odds) for each class and then dividing it by the sum of exponentiated logits for all classes. In this example, for Women with a Bachelor's degree, this would mean:
P(Agree) = e^logit_Agree / (e^logit_Agree + e^logit_Disagree + e^logit_Unsure)
= 0.737007424626824
Hopefully this will be helpful to anyone else trying to understand how to do this by hand! (Which for me is really useful for trying to apply model-based inference as an alternative to design-based inference in sample surveys).
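A minimal sketch of the difference between the two calculations, using only the standard library. The Agree logit (1.31590) comes from the question; the Disagree and Unsure logits are hypothetical placeholders, since the logit table is not reproduced here, so the printed probabilities are illustrative only.

```python
import math


def softmax(logits):
    """Exponentiate each logit and divide by the sum of all exponentiated logits."""
    shift = max(logits)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit):
    """Binary-logistic conversion p = e^logit / (1 + e^logit)."""
    return math.exp(logit) / (1 + math.exp(logit))


# Agree is taken from the question; Disagree and Unsure are made-up values.
logits = {"Agree": 1.31590, "Disagree": -0.40, "Unsure": -0.90}

softmax_probs = dict(zip(logits, softmax(list(logits.values()))))
sigmoid_probs = {name: sigmoid(value) for name, value in logits.items()}

print(softmax_probs, sum(softmax_probs.values()))  # sums to 1 across classes
print(sigmoid_probs, sum(sigmoid_probs.values()))  # generally does not sum to 1
```

Applying the sigmoid to each class separately treats every class as its own binary problem, which is why those values need not sum to 1. The softmax normalizes across all three classes at once.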
Sources that got me here:
How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification, https://en.wikipedia.org/wiki/Softmax_function

continuous hidden Markov models prediction

I am trying to predict the wind power of a wind farm, using activepower, temperature, winddirection and windspeed to train my model. This is my first time working with HMMs, and I am confused about how to make a good prediction from continuous observations.
I am also confused about how the mixture coefficient should be used in this prediction. In the following code, the mixture coefficient was left at 1, which is the default.
Also, should I be calculating the covariance matrix, mean vector, state transition matrix, and observation matrix myself? If so, how can this be done?
features=np.column_stack((activepower,temperature,winddirection,windspeed))
test_data=np.column_stack((activepower_2,temperature_2,winddirection_2,windspeed_2))
features_model = GaussianHMM(n_components=4)
features_model.fit(features)
results = features_model.score(test_data)
forecast,pred_states=features_model.sample(1008)
The code above gives me a prediction with a root mean square error (rmse) of 598.37. I know this can be improved by switching from a hold-out method to rolling window prediction. I am also using 4 hidden states for my model since it gave me the lowest rmse.
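For reference, here is a small sketch of how RMSE and a rolling-window evaluation fit together. It does not use an HMM: the forecaster is a simple persistence rule (repeat the last observed value), standing in for whatever model produces the one-step forecast. The names rolling_forecast, persistence and power, and the data values, are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rolling_forecast(series, window, forecast_next):
    """One-step-ahead forecasts, each made from the previous `window` values."""
    preds = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        preds.append(forecast_next(history))
    return preds


def persistence(history):
    """Placeholder forecaster: predict that the next value equals the last one."""
    return history[-1]


power = [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]  # made-up wind power readings
window = 3
preds = rolling_forecast(power, window, persistence)
error = rmse(power[window:], preds)
print(preds, error)
```

Any callable that maps a history window to a forecast can be passed in place of persistence, for example one that refits a model on each window.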

Sklearn GaussianMixture

For several months I have been teaching myself artificial intelligence through a handwriting character recognition and transcription project. So far I have successfully used Keras, Theano and TensorFlow to implement CNN and CTC neural networks.
Now I am trying Gaussian mixture models as a first step towards hidden Markov models with Gaussian emissions. I used sklearn's mixture module with PCA reduction and selected the best model with the Akaike and Bayesian information criteria (AIC and BIC). I used the full covariance type for AIC, which gives a nice U-curve, and the tied type for BIC, because with full covariance BIC gives just a linear curve. With 12,000 samples, the best model has 60 components according to AIC and 120 components according to BIC.
My input images are 64 pixels on a side and contain only the capital letters of the English alphabet, giving 26 categories numbered from 0 to 25.
The fit method of sklearn's GaussianMixture ignores labels, and the predict method returns the index of the most probable component (0 to 59 or 0 to 119) among the n_components.
How can I retrieve the original label (the position of the character in a list) using sklearn's GaussianMixture?
So, you want to use GaussianMixture as a generative classifier. You need to compute P(Y|X) for each label and estimate the label from these probabilities. To do so, keep one GMM per label and train each with the data from the corresponding label. The model's scoring methods then give you the likelihood P(X|Y) of the given data (in scikit-learn they return log-likelihoods: score_samples per sample, score as the mean over samples). If you multiply the likelihood by the prior, you get the posterior P(Y|X) up to a normalizing constant. For each label you get a posterior, e.g. P(Y=0|X), P(Y=1|X), ... The label with the maximum posterior probability can be reported as the estimated label.
You can get some hints from the code sample below. (It assumes that the prior probabilities are equal; you need to account for the priors in your own implementation.)
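import numpy as np
from sklearn.mixture import GaussianMixture

n_labels = 10  # number of distinct labels in Y
score = np.empty((X_test.shape[0], n_labels))
predictor_list = []
for i in range(n_labels):
    predictor = GaussianMixture()  # one mixture model per label
    predictor.fit(X[Y == i])  # train only on the samples with label i
    predictor_list.append(predictor)
    # per-sample log-likelihood log P(X|Y=i)
    score[:, i] = predictor.score_samples(X_test)
# equal priors assumed: pick the label with the highest log-likelihood
Y_predicted = np.argmax(score, axis=1)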
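To show how unequal priors enter the decision rule, here is a simplified sketch that fits a single one-dimensional Gaussian per label instead of a full mixture model, and adds the log prior to the log-likelihood. The toy data and all helper names are hypothetical. With GaussianMixture the structure would be the same, with the per-sample log-likelihood taking the place of log_likelihood.

```python
import math
from collections import defaultdict


def fit_gaussian(values):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def log_likelihood(x, mean, var):
    """log P(x | label) under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


# Toy training data as (feature, label) pairs; label 0 is more frequent.
train = [(1.0, 0), (1.2, 0), (0.8, 0), (1.1, 0), (3.0, 1), (3.3, 1), (2.9, 1)]

by_label = defaultdict(list)
for x, label in train:
    by_label[label].append(x)

# One density model per label, plus the empirical prior P(Y) in log form.
models = {label: fit_gaussian(xs) for label, xs in by_label.items()}
log_prior = {label: math.log(len(xs) / len(train)) for label, xs in by_label.items()}


def predict(x):
    """Return the label maximizing log P(X|Y) + log P(Y)."""
    scores = {
        label: log_likelihood(x, *models[label]) + log_prior[label]
        for label in models
    }
    return max(scores, key=scores.get)


print([predict(x) for x in (0.9, 3.1)])
```

Working in log space turns the product of likelihood and prior into a sum and avoids numerical underflow for high-dimensional data.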

Statsmodels Python Predict Linear Regression with one less predictor

I have trained a linear regression model with 20 predictors over a year long dataset. Below is x20 which is a list of arrays, each array is a predictor to be fed into the linear regression. y is the observations that I am fitting to, and model is the resulting linear regression model. The observations and predictors are being selected over a training period (all except for the last day (24 hours) which I will verify or predict over):
num_verifydays = 1
##############Train MOS model##################
x20=[predictor1[:-(num_verifydays)*24],predictor2[:-(num_verifydays)*24],
predictor3[:-(num_verifydays)*24],predictor4[:-(num_verifydays)*24],
predictor5[:-(num_verifydays)*24],predictor6[:-(num_verifydays)*24],
predictor7[:-(num_verifydays)*24],predictor8[:-(num_verifydays)*24],
predictor9[:-(num_verifydays)*24],predictor10[:-(num_verifydays)*24],
predictor11[:-(num_verifydays)*24],predictor12[:-(num_verifydays)*24],
predictor13[:-(num_verifydays)*24],predictor14[:-(num_verifydays)*24],
predictor15[:-(num_verifydays)*24],predictor16[:-(num_verifydays)*24],
predictor17[:-(num_verifydays)*24],predictor18[:-(num_verifydays)*24],
predictor19[:-(num_verifydays)*24],predictor20[:-(num_verifydays)*24]]
x20 = np.asarray(x20).T.tolist()
y = result_full['obs'][:-(num_verifydays)*24]
model = sm.OLS(y,x20, missing='drop').fit()
I want to predict using this model over my verification day using all 20 predictors and then just using 19 predictors to see how much of a difference there is in skill when using fewer predictors. I tried setting predictor20 to an array of zeros in x19, as shown below, but that seems to give me weird results:
##################predict with regression model##################
x20=[predictor1[-(num_verifydays)*24:],predictor2[-(num_verifydays)*24:],
predictor3[-(num_verifydays)*24:],predictor4[-(num_verifydays)*24:],
predictor5[-(num_verifydays)*24:],predictor6[-(num_verifydays)*24:],
predictor7[-(num_verifydays)*24:],predictor8[-(num_verifydays)*24:],
predictor9[-(num_verifydays)*24:],predictor10[-(num_verifydays)*24:],
predictor11[-(num_verifydays)*24:],predictor12[-(num_verifydays)*24:],
predictor13[-(num_verifydays)*24:],predictor14[-(num_verifydays)*24:],
predictor15[-(num_verifydays)*24:],predictor16[-(num_verifydays)*24:],
predictor17[-(num_verifydays)*24:],predictor18[-(num_verifydays)*24:],
predictor19[-(num_verifydays)*24:],predictor20[-(num_verifydays)*24:]]
x19=[predictor1[-(num_verifydays)*24:],predictor2[-(num_verifydays)*24:],
predictor3[-(num_verifydays)*24:],predictor4[-(num_verifydays)*24:],
predictor5[-(num_verifydays)*24:],predictor6[-(num_verifydays)*24:],
predictor7[-(num_verifydays)*24:],predictor8[-(num_verifydays)*24:],
predictor9[-(num_verifydays)*24:],predictor10[-(num_verifydays)*24:],
predictor11[-(num_verifydays)*24:],predictor12[-(num_verifydays)*24:],
predictor13[-(num_verifydays)*24:],predictor14[-(num_verifydays)*24:],
predictor15[-(num_verifydays)*24:],predictor16[-(num_verifydays)*24:],
predictor17[-(num_verifydays)*24:],predictor18[-(num_verifydays)*24:],
predictor19[-(num_verifydays)*24:],np.zeros(num_verifydays*24)]
x20 = np.asarray(x20).T.tolist()
x19 = np.asarray(x19).T.tolist()
results20 = model.predict(x20)
results19 = model.predict(x19)
You should fit two different models, one with 19 exogenous variables and the other with 20. This is statistically much sounder than applying the 20-variable model to the 19-variable set, because the fitted coefficients will be different.
model19 = sm.OLS(y,x19, missing='drop').fit()
model20 = sm.OLS(y,x20, missing='drop').fit()
What's the frequency of your data? A test set of 1 day (n=1) isn't going to give you a reliable picture of variable importance.
Another way to gauge the importance of this variable is to look at the incremental R-squared added or lost between the two models.
Also consider checking out sklearn's feature_selection capabilities.
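A self-contained sketch of the incremental R-squared comparison, using a tiny made-up data set with two predictors plus a constant rather than 20. It solves the normal equations in plain Python purely for illustration. In practice you would fit both models with sm.OLS and compare their rsquared attributes. All names and numbers here are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        # partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(X, y, beta):
    """Centered R-squared of the fitted values X b against y."""
    pred = [sum(b * v for b, v in zip(beta, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot


# Made-up data: columns are constant, x1, x2.
X_full = [[1, 1, 2], [1, 2, 1], [1, 3, 4], [1, 4, 3], [1, 5, 6], [1, 6, 5]]
y = [6.1, 8.3, 13.05, 15.6, 19.9, 22.55]
X_reduced = [row[:-1] for row in X_full]  # same data without the last predictor

beta_full = ols(X_full, y)
beta_reduced = ols(X_reduced, y)  # refit, so the coefficients are re-estimated
r2_full = r_squared(X_full, y, beta_full)
r2_reduced = r_squared(X_reduced, y, beta_reduced)
print(r2_full, r2_reduced, r2_full - r2_reduced)
```

The reduced model's coefficients differ from the first entries of beta_full because the remaining predictors absorb part of the dropped variable's effect. Zeroing a column at prediction time does not reproduce this re-estimation.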

Scikit learn linear regression predicting labels

I am trying to use scikit-learn to perform linear regression on labeled time series data.
My data format is data=(timestamp,value,label)
The labels that are assigned to my data are either 0 or 1.
I tried to follow this example from the scikit-learn website
My questions:
1- Where are the labels of the training data in the example? Are they in diabetes_y_train?
2- What does the predict() method return? In my code, it returns an array of n_samples predicted values in the range [0,1]. However, I expected it to return binary values of either 0 or 1 (no intermediate values).
1 - diabetes_y_train holds the labels for the training data.
2 - You are using a regression function, so it is correct to get continuous values. If you want binary output, you are solving a classification problem rather than a regression one. You can either set a threshold to discretise the predictions or use one of the classifiers offered by sklearn.
1 - Yes
2 - predict() returns a floating-point number, because the example is trying to predict a floating-point value and not a binary value. So there is no yes/no answer but a predicted value, and to estimate the error, the squared difference is calculated and averaged in np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2)
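The two points above can be sketched in a few lines of plain Python: the mean squared error used in the example, and a threshold that turns continuous predictions into 0/1 labels. The 0.5 cut-off and the sample values are assumptions for illustration. A classifier such as logistic regression is usually the better tool for binary targets.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences, as in np.mean((pred - y) ** 2)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def to_labels(y_pred, threshold=0.5):
    """Discretise continuous predictions into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in y_pred]


y_true = [0, 1, 1, 0]          # made-up binary labels
y_pred = [0.2, 0.7, 0.4, 0.1]  # made-up continuous regression outputs

labels = to_labels(y_pred)
mse = mean_squared_error(y_true, y_pred)
print(labels, mse)
```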
