I have trained a linear regression model with 20 predictors on a year-long dataset. Below, x20 is a list of arrays, where each array is a predictor to be fed into the linear regression, y is the observations I am fitting to, and model is the resulting linear regression model. The observations and predictors are selected over a training period: everything except the last day (24 hours), which I will verify or predict over:
import numpy as np
import statsmodels.api as sm

num_verifydays = 1
##############Train MOS model##################
x20=[predictor1[:-(num_verifydays)*24],predictor2[:-(num_verifydays)*24],
predictor3[:-(num_verifydays)*24],predictor4[:-(num_verifydays)*24],
predictor5[:-(num_verifydays)*24],predictor6[:-(num_verifydays)*24],
predictor7[:-(num_verifydays)*24],predictor8[:-(num_verifydays)*24],
predictor9[:-(num_verifydays)*24],predictor10[:-(num_verifydays)*24],
predictor11[:-(num_verifydays)*24],predictor12[:-(num_verifydays)*24],
predictor13[:-(num_verifydays)*24],predictor14[:-(num_verifydays)*24],
predictor15[:-(num_verifydays)*24],predictor16[:-(num_verifydays)*24],
predictor17[:-(num_verifydays)*24],predictor18[:-(num_verifydays)*24],
predictor19[:-(num_verifydays)*24],predictor20[:-(num_verifydays)*24]]
x20 = np.asarray(x20).T.tolist()
y = result_full['obs'][:-(num_verifydays)*24]
model = sm.OLS(y,x20, missing='drop').fit()
I want to predict with this model over my verification day using all 20 predictors, and then using just 19 predictors, to see how much difference there is in skill when using fewer predictors. I tried setting predictor20 to an array of zeros in x19, which you will see below, but that seems to give me weird results:
##################predict with regression model##################
x20=[predictor1[-(num_verifydays)*24:],predictor2[-(num_verifydays)*24:],
predictor3[-(num_verifydays)*24:],predictor4[-(num_verifydays)*24:],
predictor5[-(num_verifydays)*24:],predictor6[-(num_verifydays)*24:],
predictor7[-(num_verifydays)*24:],predictor8[-(num_verifydays)*24:],
predictor9[-(num_verifydays)*24:],predictor10[-(num_verifydays)*24:],
predictor11[-(num_verifydays)*24:],predictor12[-(num_verifydays)*24:],
predictor13[-(num_verifydays)*24:],predictor14[-(num_verifydays)*24:],
predictor15[-(num_verifydays)*24:],predictor16[-(num_verifydays)*24:],
predictor17[-(num_verifydays)*24:],predictor18[-(num_verifydays)*24:],
predictor19[-(num_verifydays)*24:],predictor20[-(num_verifydays)*24:]]
x19=[predictor1[-(num_verifydays)*24:],predictor2[-(num_verifydays)*24:],
predictor3[-(num_verifydays)*24:],predictor4[-(num_verifydays)*24:],
predictor5[-(num_verifydays)*24:],predictor6[-(num_verifydays)*24:],
predictor7[-(num_verifydays)*24:],predictor8[-(num_verifydays)*24:],
predictor9[-(num_verifydays)*24:],predictor10[-(num_verifydays)*24:],
predictor11[-(num_verifydays)*24:],predictor12[-(num_verifydays)*24:],
predictor13[-(num_verifydays)*24:],predictor14[-(num_verifydays)*24:],
predictor15[-(num_verifydays)*24:],predictor16[-(num_verifydays)*24:],
predictor17[-(num_verifydays)*24:],predictor18[-(num_verifydays)*24:],
predictor19[-(num_verifydays)*24:],np.zeros(num_verifydays*24)]
x20 = np.asarray(x20).T.tolist()
x19 = np.asarray(x19).T.tolist()
results20 = model.predict(x20)
results19 = model.predict(x19)
You should fit two different models, one with 19 exogenous variables and the other with 20. This is much more statistically sound than testing the 20-variable model on the 19-variable set, because the fitted coefficients will be different.
model19 = sm.OLS(y,x19, missing='drop').fit()
model20 = sm.OLS(y,x20, missing='drop').fit()
What's the frequency of your data? Using a test data set of 1 day (n=1) isn't going to give you a very reliable picture of variable importance.
Other ways to look at the importance of this variable would be to look at the incremental R-squared added or lost between the two models.
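A minimal sketch of that comparison, using the model19 and model20 fits above (note that model19 should be fit on a design matrix that drops predictor20 entirely rather than zeroing it out):
print("R^2 with 20 predictors:", model20.rsquared)
print("R^2 with 19 predictors:", model19.rsquared)
print("Incremental R^2 from predictor20:", model20.rsquared - model19.rsquared)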
Also consider checking out sklearn's feature_selection capabilities.
I have a dataset that consists of different features, like "gender". The task of the model is to determine if the annual income is above or below 50k.
Let's say I have a trained network that does the classification.
Now I want to see how often the classifier makes false positive and false negative predictions, grouped by the gender feature.
The basic idea is a confusion matrix of sorts, but not class versus class, rather class versus feature.
The image below illustrates the result I would like to have.
The basic idea is as follows:
1) Make a prediction with the network.
2) Set the predicted values as a new column in your dataset; you now have a new dataset data_new.
Your dataset now has two relevant columns, one for the predicted values and one for the true values. You can calculate the overall accuracy by boolean comparison (1 vs. 1 is a correct prediction; 0 vs. 1 and 1 vs. 0 are wrong predictions).
3) Now you can filter the new data on any column you want, in this case the specific gender.
4) Now you can calculate the accuracy w.r.t. the chosen gender.
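A minimal sketch of these steps with pandas, assuming a DataFrame data with hypothetical columns 'gender' and 'income_true' (the true labels) and an array y_pred holding the network's predictions for the same rows:

# Step 2: add the predictions as a new column
data_new = data.copy()
data_new['income_pred'] = y_pred

# Overall accuracy by boolean comparison
overall_acc = (data_new['income_pred'] == data_new['income_true']).mean()

# Steps 3-4: filter by the gender feature and count error types per group
for gender, group in data_new.groupby('gender'):
    fp = ((group['income_pred'] == 1) & (group['income_true'] == 0)).sum()
    fn = ((group['income_pred'] == 0) & (group['income_true'] == 1)).sum()
    acc = (group['income_pred'] == group['income_true']).mean()
    print(gender, 'FP:', fp, 'FN:', fn, 'accuracy:', acc)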
Say I have a df that looks like this:
   userId  movieId  rating
0       1       31       1
1       1       34       5
2       1      742       2
3       1     1013       4
4       2       31       1
...
I've split the data using stratified sampling to keep the same users in both the train and test sets.
When training on the train dataset I would usually initialize embedding matrices for users and movies and learn them using SGD.
After the two matrices are learned, say P and Q, I take dot_product(P_i, Q.T_j) to get the prediction for the (i, j)-th position in the ratings matrix.
Since P and Q are learned embeddings, it seems correct to use them to predict on the validation dataset. However, simply computing validation_dataset - dot_product(P, Q) doesn't make sense, because the shapes of the train and validation datasets are different.
One way to do this is to take known ratings out of the original dataset and keep them as a validation set. However, I am wondering if there is a way to split the data first and then apply the learned embeddings to predict the test set (this seems more intuitive to me, but I don't know how to do it...).
The most widely accepted method to calculate test-set performance on collaborative-filtering systems is to keep some number of known user-item interactions separate, in the form of a test set. We exclude those test-set interactions from the training set, which is used to train the model.
After training, for each pair of user u and item i in the test set, we compute the model's predicted interaction score for u and i, and compare it with the known interaction score, which is either 1 or 0 (0 when negative-sampling is used). This is how we compute the test-set performance metrics.
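A minimal sketch of that evaluation for the rating-prediction setup above, assuming P (num_users x k) and Q (num_items x k) are the embeddings learned on the training set, and test_df holds the held-out userId/movieId/rating rows with ids that index directly into the embedding matrices:

import numpy as np

preds = np.array([P[u].dot(Q[i])
                  for u, i in zip(test_df['userId'], test_df['movieId'])])
rmse = np.sqrt(np.mean((test_df['rating'].values - preds) ** 2))
print('test RMSE:', rmse)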
If you use the model's predicted ratings/scores to create new data-points for the test-set, then it may not reflect the true generalization performance of the model on completely unseen/new data. Let me know if that answers your question, or if any clarifications are needed.
I'm running a multivariate regression in statsmodels. However, I would like to manually alter one of the coefficients for an independent variable prior to predicting. How would I go about doing that?
For example, say I train my data on a 2 year time period starting 4 years back. I return coefficients for wind, rain, and sun.
Now say that I train my data on the most recent 2 years of data and again get the coefficients in the regression output.
If I want to use the wind coefficient from the first regression output with the rain and sun coefficients from the second regression, how do I manually change wind prior to using predict?
EDIT:
Regression code/parameters:
model = sm.OLS(y[:train],X[:train]).fit()
predictions = model.predict(X[-test:])
Where X is [['rain','sun','wind']] and y is ['growth']
The prediction in OLS is just a linear function of the explanatory variables, x dot params.
# copy the fitted parameters (results is your fitted OLS results object)
my_params = results.params.copy()
# overwrite the coefficient you want to change, here the third one (wind)
my_params[2] = -99999
# the OLS prediction is just the design matrix dotted with the parameters
my_predict = x.dot(my_params)
I recommend not changing any numbers directly in the model, because then any inferential results are invalid for the changed model.
If you have known parameters, then you can estimate a restricted model, e.g. with GLM.fit_constrained, or add them to the offset in GLM.
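A rough sketch of the fit_constrained approach, assuming a formula interface, a DataFrame df_recent with the most recent 2 years of data, df_future with the rows to predict, and wind_coef_old holding the wind coefficient from the earlier regression (all names hypothetical):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# A Gaussian GLM is equivalent to OLS; fix the wind coefficient and re-estimate the rest
glm = smf.glm('growth ~ rain + sun + wind', data=df_recent,
              family=sm.families.Gaussian())
res = glm.fit_constrained(f'wind = {wind_coef_old}')
predictions = res.predict(df_future)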
Let me first explain the data sets I am using.
I have three sets:
Train set with shape (1277, 927); the target is present about 12% of the time.
Eval set with shape (174, 927); the target is present about 11.5% of the time.
Hold-out set with shape (414, 927); the target is present about 10% of the time.
These sets are built using time slices: the train set is the oldest data, the hold-out set is the newest data, and the eval set is in between.
Now I am building two models.
Model1:
from catboost import CatBoostClassifier

# Initialize CatBoostClassifier
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time=True,
    iterations=300,
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)

## Fitting catboost model
model.fit(
    train_set.values, Y_train.values,
    cat_features=categorical_features_indices,
    eval_set=(test_set.values, Y_test),
    # logging_level='Verbose'  # you can uncomment this for text output
)
I then predict on the hold-out set.
Model2:
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time=True,
    iterations='bestIteration from model1',  # placeholder: the best iteration found by model1
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)

## Fitting catboost model
model.fit(
    train.values, Y.values,
    cat_features=categorical_features_indices,
    # logging_level='Verbose'  # you can uncomment this for text output
)
Both models are identical except for iterations. The first model has a fixed 300 rounds, but it will shrink the model to bestIteration. The second model uses that bestIteration from model1.
However, when I compare feature importance, it looks drastically different.
   Feature  Score_m1  Score_m2     delta
0       x0  3.612309  2.013193 -1.399116
1       x1  3.390630  3.121273 -0.269357
2       x2  2.762750  1.822564 -0.940186
3       x3  2.553052       NaN       NaN
4       x4  2.400786  0.329625 -2.071161
As you can see, one of the features, x3, which was in the top 3 in the first model, dropped off in the second model. Not only that, but there is a large shift in weights between the models for a given feature. There are about 60 features present in model1 that are not present in model2, and about 60 features present in model2 that are not present in model1. delta is the difference between Score_m1 and Score_m2. I have seen models change scores a little, but not this drastically. AUC and LogLoss don't change much whether I use model1 or model2.
Now I have the following questions regarding this situation:
Are these models unstable due to the small number of samples and large number of features? If so, how do I check for this?
Are there features in these models that just don't give much information about the outcome, so that it is random chance which of them ends up creating a split? If so, how do I check for this situation?
Is CatBoost the right model for this situation?
Any help regarding this issue will be appreciated.
Yes. Trees in general are somewhat unstable. If you remove the least important feature, you can get a very different model.
Having more data reduces this tendency.
Having more features increases this tendency.
Tree algorithms are random by nature, so the results will be different.
Things to try:
Run the model a large number of times with different random seeds. Use the results to determine which features seem to be the least important (see the sketch after this list). (How many features do you have?)
Try to balance your training set. This might require you to upsample the rarer cases.
Get more data. Maybe you'll have to combine your train and test set and use the holdout as the test.
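A minimal sketch of the repeated-seed check, reusing train_set, Y_train and categorical_features_indices from the code above (the hyperparameters mirror Model1):

import numpy as np
from catboost import CatBoostClassifier

importances = []
for seed in range(10):
    m = CatBoostClassifier(depth=9, random_seed=seed, l2_leaf_reg=1,
                           iterations=300, learning_rate=0.05,
                           loss_function='Logloss', logging_level='Silent')
    m.fit(train_set.values, Y_train.values,
          cat_features=categorical_features_indices)
    importances.append(m.get_feature_importance())

# Features whose mean importance is small relative to its spread across seeds
# are likely candidates for random, uninformative splits.
mean_imp = np.mean(importances, axis=0)
std_imp = np.std(importances, axis=0)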
I have been teaching myself artificial intelligence for several months through a project on handwritten character recognition and transcription. Up to now I have successfully used Keras, Theano and Tensorflow, implementing CNN and CTC neural networks.
Today I am trying to use Gaussian mixture models, the first step towards hidden Markov models with Gaussian emissions. To do so, I used sklearn's mixture module with PCA reduction to select the best model with the Akaike and Bayesian information criteria: covariance type full for AIC, which gives a nice U-shaped curve, and tied for BIC, because with full covariance BIC gives just a linear curve. With 12,000 samples, I get the best model at 60 components for AIC and 120 components for BIC.
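For reference, a rough sketch of that selection loop, assuming X_reduced is my PCA-reduced training data (the covariance type and component range are as described above):

import numpy as np
from sklearn.mixture import GaussianMixture

best_bic, best_gmm = np.inf, None
for n_components in range(10, 130, 10):
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='tied').fit(X_reduced)
    bic = gmm.bic(X_reduced)
    if bic < best_bic:
        best_bic, best_gmm = bic, gmm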
My input images are 64 pixels on each side and represent only the capital letters of the English alphabet, 26 categories numbered from 0 to 25.
The fit method of sklearn's GaussianMixture ignores labels, and the predict method returns the position of the component (0 to 59 or 0 to 119) among the n components, according to the probabilities.
How can I retrieve the original label (the position of the character in a list) using sklearn's GaussianMixture?
So, you want to use GaussianMixture in a generative classifier. You need to compute P(Y|X) for each label and estimate the label according to these probabilities. To do so, keep one GMM per label and train it with the data from the corresponding label. Then the score method will give you the likelihood P(X|Y) of the given data (or the log-likelihood; you may want to check that). If you multiply the likelihood by the prior, you get the posterior P(Y|X). For each label you will get a posterior, e.g. P(Y=0|X), P(Y=1|X), ... The label with the maximum posterior probability can be reported as the estimated label.
You can get some hints from the code sample below. (Here it is assumed that the prior probabilities are equal; you need to account for that in your implementation.)
import numpy as np
from sklearn.mixture import GaussianMixture

n_classes = 10  # e.g. 10 digit classes; use 26 for capital letters
score = np.empty((Y_test.shape[0], n_classes))
predictor_list = []
for i in range(n_classes):
    # one GMM per label, trained only on that label's samples
    predictor = GaussianMixture()
    predictor.fit(X[Y == i])
    predictor_list.append(predictor)
    # per-sample log-likelihood log P(X|Y=i)
    score[:, i] = predictor.score_samples(X_test)
# with equal priors, the label with the highest likelihood has the highest posterior
Y_predicted = np.argmax(score, axis=1)