Sklearn GaussianMixture - python

I have been teaching myself artificial intelligence for several months through a project on handwritten character recognition and transcription. So far I have successfully used Keras, Theano and Tensorflow, implementing CNN and CTC neural networks.
Today I am trying to use Gaussian mixture models, a first step towards hidden Markov models with Gaussian emissions. To do so, I used sklearn's mixture module with PCA reduction and selected the best model with the Akaike and Bayesian information criteria: covariance type 'full' for AIC, which gives a nice U-curve, and 'tied' for BIC, because with 'full' covariance BIC gives just a linear curve. With 12,000 samples, I get the best model at 60 n_components for AIC and 120 n_components for BIC.
My input images are 64 pixels on each side and represent only the capital letters of the English alphabet, i.e. 26 categories numbered from 0 to 25.
The fit method of sklearn's GaussianMixture ignores labels, and the predict method returns the index of the most probable component (0 to 59, or 0 to 119) among the n_components.
How can I retrieve the original label (the position of the character in a list) using sklearn's GaussianMixture?

So, you want to use GaussianMixture as a generative classifier. You need to compute P(Y|X) for each label and estimate the label according to these probabilities. To do so, keep one GMM per label and train it with the data from the corresponding label. The score_samples method will then give you the per-sample log-likelihood of the given data, i.e. log P(X|Y). If you multiply the likelihood by the prior (or add the log-prior to the log-likelihood), you get the posterior P(Y|X). For each label you get a posterior, e.g. P(Y=0|X), P(Y=1|X), ..., and the label with the maximum posterior probability is reported as the estimated label.
You can get some hints from the code sample below. (Here it is assumed that the prior probabilities are equal; you need to account for them in your own implementation, as sketched after the code.)
import numpy as np
from sklearn.mixture import GaussianMixture

n_classes = 10  # use 26 for the capital letters in your case
log_likelihood = np.empty((X_test.shape[0], n_classes))
predictor_list = []
for i in range(n_classes):
    # one GMM per label; pick n_components / covariance_type per class as needed
    predictor = GaussianMixture()
    predictor.fit(X[Y == i])
    predictor_list.append(predictor)
    # per-sample log-likelihood log P(X | Y = i); score() would only return the mean
    log_likelihood[:, i] = predictor.score_samples(X_test)
Y_predicted = np.argmax(log_likelihood, axis=1)
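If the class priors are not equal, add their logs to the per-class log-likelihoods before taking the argmax. A minimal adjustment, assuming a hypothetical class_counts array holding the number of training samples per label:
# class_counts is hypothetical: number of training samples per label
log_prior = np.log(class_counts / class_counts.sum())
Y_predicted = np.argmax(log_likelihood + log_prior, axis=1)  # posterior ∝ prior × likelihood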

Related

continuous hidden Markov models prediction

I am trying to predict the wind power of a wind farm, using activepower, temperature, winddirection and windspeed to train my model. This is my first time working with HMMs and I am confused about how to make a good prediction from continuous observations.
I am also confused about how the mixture coefficient should be used in this prediction. In the following code, the mixture coefficient was left as 1, which is the default.
Also, should I be calculating the covariance matrix, mean vector, state transition matrix and observation matrix, and how can this be done?
import numpy as np
from hmmlearn.hmm import GaussianHMM

features = np.column_stack((activepower, temperature, winddirection, windspeed))
test_data = np.column_stack((activepower_2, temperature_2, winddirection_2, windspeed_2))
features_model = GaussianHMM(n_components=4)
features_model.fit(features)
results = features_model.score(test_data)             # log-likelihood of the test data
forecast, pred_states = features_model.sample(1008)   # draws 1008 random samples from the fitted model
The code above gives me a prediction with a root mean square error (rmse) of 598.37. I know this can be improved by switching from a hold-out method to rolling window prediction. I am also using 4 hidden states for my model since it gave me the lowest rmse.

How can we interpret feature importances for Stochastic Gradient Descent Classifier?

I have an SGDClassifier model trained with scikit-learn. I extract the feature names with .get_feature_names() and the coefficients with .coef_.
I combine the two columns in a dataframe like this:
feature value
hiroshima 3.918584
wildfire 3.287680
earthquake 3.256817
massacre 3.186762
storm 3.124809
... ...
job -1.696438
song -1.736640
as -1.956571
nowplaying -2.028240
write -2.263968
How can I interpret these feature importances?
What does a high positive value mean?
What does a low negative value mean?
SGDClassifier fits a linear model, meaning that the decision is essentially based on
SUM_i w_i f_i + b
where w_i is the weight attached to feature f_i. Consequently, you can interpret these numbers literally as "votes" for the positive/negative class, at a scale proportional to their absolute value. All your classifier does is add up these weighted features, then add the intercept_ value from your model, and classify based on the sign.
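A minimal illustration of that decision rule (a sketch, not from the original answer), assuming a fitted binary SGDClassifier clf and the feature matrix X produced by the same vectorizer:
import numpy as np

# SUM_i w_i f_i + b, computed by hand
scores = X @ clf.coef_.ravel() + clf.intercept_
manual_pred = clf.classes_[(scores > 0).astype(int)]   # the sign picks the class
# manual_pred should agree with clf.predict(X), i.e. the sign of clf.decision_function(X)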

High AUC but bad predictions with imbalanced data

I am trying to build a classifier with LightGBM on a very imbalanced dataset. Imbalance is in the ratio 97:3, i.e.:
Class
0 0.970691
1 0.029309
The params I used and the code for training are shown below.
import numpy as np
import lightgbm as lgb

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.1,
    'is_unbalance': 'true',  # because training data is unbalanced (replaced with scale_pos_weight)
    'num_leaves': 31,        # we should let it be smaller than 2^(max_depth)
    'max_depth': 6,          # -1 means no limit
    'subsample': 0.78
}
# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10,
                    verbose_eval=10, early_stopping_rounds=40)
nround = cv_results['auc-mean'].index(np.max(cv_results['auc-mean']))
print(nround)
model = lgb.train(lgb_params, dtrain, num_boost_round=nround)
preds = model.predict(test_feats)
preds = [1 if x >= 0.5 else 0 for x in preds]
I ran CV to get the best model and best round. I got 0.994 AUC on CV and a similar score on the validation set.
But when I predict on the test set I get very bad results. I am sure that the train set is sampled perfectly.
What parameters need to be tuned? What is the reason for the problem? Should I resample the dataset so that the majority class is reduced?
The issue is that, despite the extreme class imbalance in your dataset, you are still using the "default" threshold of 0.5 when deciding the final hard classification in
preds = [1 if x >= 0.5 else 0 for x in preds]
This should not be the case here.
This is a rather big topic, and I strongly suggest you do your own research (try googling for threshold or cut off probability imbalanced data), but here are some pointers to get you started...
From a relevant answer at Cross Validated (emphasis added):
Don't forget that you should be thresholding intelligently to make predictions. It is not always best to predict 1 when the model probability is greater 0.5. Another threshold may be better. To this end you should look into the Receiver Operating Characteristic (ROC) curves of your classifier, not just its predictive success with a default probability threshold.
From a relevant academic paper, Finding the Best Classification Threshold in Imbalanced Classification:
2.2. How to set the classification threshold for the testing set
Prediction results are ultimately determined according to prediction probabilities. The threshold is typically set to 0.5. If the prediction probability exceeds 0.5, the sample is predicted to be positive; otherwise, negative. However, 0.5 is not ideal for some cases, particularly for imbalanced datasets.
The post Optimizing Probability Thresholds for Class Imbalances from the (highly recommended) Applied Predictive Modeling blog is also relevant.
Take home lesson from all the above: AUC is seldom enough, but the ROC curve itself is often your best friend...
On a more general level regarding the role of the threshold itself in the classification process (which, according to my experience at least, many practitioners get wrong), check also the Classification probability threshold thread (and the provided links) at Cross Validated; key point:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
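Putting the above into practice, a minimal sketch (not from the original answer): pick the threshold on a held-out validation split instead of hard-coding 0.5. The X_valid/y_valid names below are assumptions; model and test_feats come from the question, and Youden's J statistic (TPR - FPR) is just one possible criterion.
import numpy as np
from sklearn.metrics import roc_curve

# Probabilities on a held-out validation split (X_valid, y_valid are hypothetical)
valid_probs = model.predict(X_valid)                 # LightGBM returns P(class = 1) for a binary objective
fpr, tpr, thresholds = roc_curve(y_valid, valid_probs)
best_threshold = thresholds[np.argmax(tpr - fpr)]    # maximize Youden's J = TPR - FPR

# Apply the tuned threshold instead of the default 0.5
test_probs = model.predict(test_feats)
preds = (test_probs >= best_threshold).astype(int)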

Outlier detection using Gaussian mixture

I have 5000 data points for each of my 17 features in a numpy array, resulting in a 5000 x 17 array. I am trying to find the outliers for each feature using a Gaussian mixture, and I am rather confused about the following: 1) how many components should I use for my GaussianMixture? 2) Should I fit the GaussianMixture directly on the 5000 x 17 array, or on each feature column separately, resulting in 17 GaussianMixture models?
clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
clf.fit(full_feature_array)
or
clf = mixture.GaussianMixture(n_components=17, covariance_type='full')
clf.fit(full_feature_array)
or
clf = {}
for feature in range(full_feature_array.shape[1]):
    clf[feature] = mixture.GaussianMixture(n_components=1, covariance_type='full')
    clf[feature].fit(full_feature_array[:, feature].reshape(-1, 1))
The task of selecting the number of components to model a distribution with a Gaussian mixture model is an instance of model selection. This is not so straightforward, and many approaches exist; a good summary can be found at https://en.m.wikipedia.org/wiki/Model_selection. One of the simplest and most widely used is to perform cross-validation.
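For instance, a cross-validated sketch (assuming your 5000 x 17 array is named full_feature_array, as in the question) that picks n_components by the held-out average log-likelihood:
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

param_grid = {'n_components': range(1, 11), 'covariance_type': ['full', 'diag']}
# With no explicit scoring, GridSearchCV falls back to GaussianMixture.score,
# i.e. the mean log-likelihood on the held-out fold (higher is better).
search = GridSearchCV(GaussianMixture(random_state=0), param_grid, cv=5)
search.fit(full_feature_array)
print(search.best_params_)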
Normally, outliers can be determined as those belonging to the component (or components) with the largest variance. You could call this an unsupervised approach; however, it can still be difficult to decide what the cut-off variance should be.
A better approach (if applicable) is a supervised one, where you train the GMM on outlier-free data (after manually removing outliers) and then classify as outliers the points with particularly low likelihood scores. A second supervised option is to train two GMMs (one for outliers and one for inliers, each with its own model selection) and then perform two-class classification on new data (see the sketch after this paragraph).
Regarding your question about training univariate versus multivariate GMMs: it is difficult to say in general, but for the purposes of outlier detection univariate GMMs (or, equivalently, multivariate GMMs with diagonal covariance matrices) may be sufficient and require fewer parameters than general multivariate GMMs, so I would start with those.
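A hedged sketch of the two-GMM variant of the supervised approach described above; X_inliers, X_outliers and X_new are hypothetical, manually labelled/held-out subsets of your data:
from sklearn.mixture import GaussianMixture

gmm_in = GaussianMixture(n_components=2, covariance_type='diag').fit(X_inliers)
gmm_out = GaussianMixture(n_components=1, covariance_type='diag').fit(X_outliers)

# Flag a new row as an outlier when the outlier model explains it better
# (equal class priors assumed; add log-priors if the classes are very unbalanced)
is_outlier = gmm_out.score_samples(X_new) > gmm_in.score_samples(X_new)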
With a Gaussian mixture model (GMM), any point sitting in a low-density area can be considered an outlier. The challenge is how to define "low density"; for example, you can declare anything below the 4th percentile of density an outlier.
densities = gm.score_samples(X)                  # gm: a GaussianMixture already fitted on X
density_threshold = np.percentile(densities, 4)  # 4th percentile as the cut-off
anomalies = X[densities < density_threshold]
Regarding choosing the number of components, look at the information criteria given by AIC or BIC for different numbers of components; they often agree in such cases, and the lowest value is better.
gm.bic(X)
gm.aic(X)
Alternatively, BayesianGaussianMixture assigns weights of (effectively) zero to clusters that are unnecessary.
from sklearn.mixture import BayesianGaussianMixture
bgm = BayesianGaussianMixture(n_components=8, n_init=10) # n_components should be large enough
bgm.fit(X)
np.round(bgm.weights_, 2)
output
array([0.5 , 0.3, 0.2 , 0. , 0. , 0. , 0. , 0. ])
So here the Bayesian GMM detected that there are three clusters.

Statsmodels Python Predict Linear Regression with one less predictor

I have trained a linear regression model with 20 predictors over a year-long dataset. Below, x20 is a list of arrays; each array is a predictor to be fed into the linear regression. y holds the observations I am fitting to, and model is the resulting linear regression model. The observations and predictors are selected over the training period, i.e. everything except the last day (24 hours), which I will verify or predict over:
num_verifydays = 1
##############Train MOS model##################
x20=[predictor1[:-(num_verifydays)*24],predictor2[:-(num_verifydays)*24],
predictor3[:-(num_verifydays)*24],predictor4[:-(num_verifydays)*24],
predictor5[:-(num_verifydays)*24],predictor6[:-(num_verifydays)*24],
predictor7[:-(num_verifydays)*24],predictor8[:-(num_verifydays)*24],
predictor9[:-(num_verifydays)*24],predictor10[:-(num_verifydays)*24],
predictor11[:-(num_verifydays)*24],predictor12[:-(num_verifydays)*24],
predictor13[:-(num_verifydays)*24],predictor14[:-(num_verifydays)*24],
predictor15[:-(num_verifydays)*24],predictor16[:-(num_verifydays)*24],
predictor17[:-(num_verifydays)*24],predictor18[:-(num_verifydays)*24],
predictor19[:-(num_verifydays)*24],predictor20[:-(num_verifydays)*24]]
x20 = np.asarray(x20).T.tolist()
y = result_full['obs'][:-(num_verifydays)*24]
model = sm.OLS(y,x20, missing='drop').fit()
I want to predict with this model over my verification day, first using all 20 predictors and then using just 19, to see how much the skill differs with fewer predictors. I tried setting predictor20 to an array of zeros in x19, as you will see below, but that seems to give me weird results:
##################predict with regression model##################
x20=[predictor1[-(num_verifydays)*24:],predictor2[-(num_verifydays)*24:],
predictor3[-(num_verifydays)*24:],predictor4[-(num_verifydays)*24:],
predictor5[-(num_verifydays)*24:],predictor6[-(num_verifydays)*24:],
predictor7[-(num_verifydays)*24:],predictor8[-(num_verifydays)*24:],
predictor9[-(num_verifydays)*24:],predictor10[-(num_verifydays)*24:],
predictor11[-(num_verifydays)*24:],predictor12[-(num_verifydays)*24:],
predictor13[-(num_verifydays)*24:],predictor14[-(num_verifydays)*24:],
predictor15[-(num_verifydays)*24:],predictor16[-(num_verifydays)*24:],
predictor17[-(num_verifydays)*24:],predictor18[-(num_verifydays)*24:],
predictor19[-(num_verifydays)*24:],predictor20[-(num_verifydays)*24:]]
x19=[predictor1[-(num_verifydays)*24:],predictor2[-(num_verifydays)*24:],
predictor3[-(num_verifydays)*24:],predictor4[-(num_verifydays)*24:],
predictor5[-(num_verifydays)*24:],predictor6[-(num_verifydays)*24:],
predictor7[-(num_verifydays)*24:],predictor8[-(num_verifydays)*24:],
predictor9[-(num_verifydays)*24:],predictor10[-(num_verifydays)*24:],
predictor11[-(num_verifydays)*24:],predictor12[-(num_verifydays)*24:],
predictor13[-(num_verifydays)*24:],predictor14[-(num_verifydays)*24:],
predictor15[-(num_verifydays)*24:],predictor16[-(num_verifydays)*24:],
predictor17[-(num_verifydays)*24:],predictor18[-(num_verifydays)*24:],
predictor19[-(num_verifydays)*24:],np.zeros(num_verifydays*24)]
x20 = np.asarray(x20).T.tolist()
x19 = np.asarray(x19).T.tolist()
results20 = model.predict(x20)
results19 = model.predict(x19)
You should fit two different models, one with 19 exogenous variables and the other with 20 (here x19 means the training matrix with predictor20 dropped entirely, not zeroed out). This is much sounder statistically than testing the 20-variable model on a 19-variable set, because the fitted coefficients will be different.
model19 = sm.OLS(y, x19, missing='drop').fit()
model20 = sm.OLS(y, x20, missing='drop').fit()
What's the frequency of your data? Using a test data set of 1 day (n=1) isn't going to get you a very true picture of variable importance.
Another way to look at the importance of this variable would be to look at the incremental R-squared added or lost between the two models.
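A minimal sketch of that comparison, assuming model19 and model20 were fit as above (statsmodels exposes the in-sample R-squared on the fitted results object):
# Incremental R-squared attributable to predictor20 (in-sample)
print("R^2 with 20 predictors:", model20.rsquared)
print("R^2 with 19 predictors:", model19.rsquared)
print("Incremental R^2:", model20.rsquared - model19.rsquared)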
Also consider checking out sklearn's feature_selection capabilities.
