Python: Transform my X_train distribution for machine learning

My DataFrame looks like this:
X_Train = Wind_direction, from 0 to 360 degrees => x-axis
Y_Train = Energy_Production => y-axis
How can I transform my X variable in order to obtain better results on my machine learning problem?
The optimal direction seems to be around 140 and 340 degrees.

Some models depend heavily on the input data distribution, like neural networks and other gradient-based techniques.
Some models do not really care about the distribution, like decision trees, random forests, etc.
I would suggest trying out different normalization techniques, like StandardScaler (z-score) or MinMaxScaler for a given range (e.g. rescaling to [0.0, 1.0]).
In the end there is no universal answer as to which normalization technique performs best, since that depends on the problem and on the machine learning algorithm itself.
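As an illustration, here is a minimal sketch of both scalers on a made-up wind-direction column (the values are assumptions, not from the question):
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy wind-direction values in degrees; stands in for the question's X_Train
X = np.array([[10.0], [140.0], [200.0], [340.0]])

X_std = StandardScaler().fit_transform(X)             # z-score: mean 0, std 1
X_minmax = MinMaxScaler((0.0, 1.0)).fit_transform(X)  # rescaled to [0.0, 1.0]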

Related

Why are probabilities hand-calculated from sklearn.linear_model.LogisticRegression coefficients different from .predict_proba()?

I am running a multinomial logistic regression in sklearn, using sklearn.linear_model.LogisticRegression(multi_class="multinomial"). The dependent categorical variable has 3 options: Agree, Disagree, Unsure. The independent variables are two categorical variables: Education and Gender (binary gender for simplicity in this example). I get different results when I hand-calculate the probabilities from the regression coefficients versus using the built-in predict_proba().
mnlr = LogisticRegression(multi_class="multinomial")
mnlr.fit(
    pd.get_dummies(df[["Education", "Gender"]]),
    preprocessing.LabelEncoder().fit_transform(df["statement"])
)
I concatenate the outputs of mnlr.intercept_ and mnlr.coef_ into a regression coefficients table that looks like this:
Using mnlr.predict_proba(), I get results that I cast into a dataframe to which I add the independent variables like this:
These sum to 1 across the 3 potential categories for each data point.
However, I cannot seem to reproduce these results when I try to calculate the predicted probabilities by hand from the logistic regression coefficients.
First, for each Gender x Education combination, I calculate the logit (aka log-odds, if I understand correctly) by simply adding the intercept and the relevant variable terms. For example, to get the logit for a Woman with a Bachelor's degree with the Agree regression: 0.88076 + 0.21827 + 0.21687 = 1.31590. The table of logits looks like this:
From this table, as I understand it, I should be able to convert these logits (log-odds) to predicted probabilities: p = e^logit / (1 + e^logit) for a given model and respondent (e.g., the probability that Women with Bachelor's Agree with the statement). When I try this, however, I get very different results than from .predict_proba(), and the hand-calculated probabilities do not sum to 1, as indicated in the table below:
For example, Women with Bachelor's here have a 0.78850 probability to Agree with the statement, instead of the 0.7819 probability. Additionally, the hand-calculated probabilities across the 3 categories do not sum to 1, but rather to 1.47146.
I am almost certain this is a basic error on my part, but I cannot for the life of me figure it out. What am I doing incorrectly?
I figured this one out eventually. The answer is probably obvious to folks who really know multinomial logistic regression. The struggle I was having was that I needed to apply the softmax function (also known more descriptively as the normalized exponential function) to the logits. This function involves exponentiating the logit (log-odds) for each class and then dividing it by the sum of exponentiated logits for all classes. In this example, for Women with a Bachelor's degree, this would mean:
P(Agree) = e^(logit_Agree) / (e^(logit_Agree) + e^(logit_Disagree) + e^(logit_Unsure)) = 0.737007424626824
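For completeness, a minimal sketch of the same check in code; the data is made up, only the logit and softmax recomputation mirrors the answer:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the dummy-coded Education/Gender design matrix
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4)).astype(float)
y = rng.integers(0, 3, size=200)             # 3 classes: Agree/Disagree/Unsure

mnlr = LogisticRegression(multi_class="multinomial").fit(X, y)

logits = mnlr.intercept_ + X @ mnlr.coef_.T  # intercept plus coefficient terms
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

assert np.allclose(probs, mnlr.predict_proba(X))  # rows now sum to 1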
Hopefully this will be helpful to anyone else trying to understand how to do this by hand! (Which for me is really useful for trying to apply model-based inference as an alternative to design-based inference in sample surveys).
Sources that got me here:
How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification
https://en.wikipedia.org/wiki/Softmax_function

Sklearn GaussianMixture

I have been teaching myself artificial intelligence for several months through a project on handwriting character recognition and transcription. So far I have successfully used Keras, Theano and TensorFlow, implementing CNN and CTC neural networks.
Today, I am trying Gaussian mixture models, a first step towards hidden Markov models with Gaussian emissions. To do so, I used sklearn.mixture with PCA reduction, selecting the best model with the Akaike and Bayesian information criteria. I used full covariance for AIC, which produces a nice U-shaped curve, and tied covariance for BIC, because with full covariance BIC just gives a linear curve. With 12,000 samples, I get the best model at 60 components for AIC and 120 components for BIC.
My input images are 64 pixels on a side and represent only the capital letters of the English alphabet: 26 categories, numbered 0 to 25.
The fit method of sklearn's GaussianMixture ignores labels, and the predict method returns the index of the component (0 to 59 or 0 to 119) among the n components, according to the probabilities.
How can I retrieve the original label (the position of the character in a list) using sklearn GaussianMixture?
So, you want to use GaussianMixture in a generative classifier. You need to compute P(Y|X) for each label and estimate the label according to these probabilities. To do so, you keep one GMM per label and train each with data from the corresponding label. Then the score method gives you the likelihood P(X|Y) of given data (or the log-likelihood; you may want to check that). If you multiply the likelihood by the prior, you get the posterior P(Y|X). For each label you get a posterior, e.g. P(Y=0|X), P(Y=1|X), ... The label with the maximum posterior probability can be reported as the estimated label.
You can get some hints from the code sample below. (Here it is assumed that the prior probabilities are equal; you need to account for that in your own implementation.)
import numpy as np
from sklearn.mixture import GaussianMixture

# X, Y: training data and labels; X_test: test samples (from your own pipeline)
n_classes = 10  # one GMM per label
score = np.empty((X_test.shape[0], n_classes))
predictor_list = []
for i in range(n_classes):
    predictor = GaussianMixture()
    predictor.fit(X[Y == i])                       # train on samples of label i only
    predictor_list.append(predictor)
    score[:, i] = predictor.score_samples(X_test)  # per-sample log-likelihood P(X|Y=i)
Y_predicted = np.argmax(score, axis=1)             # maximum (equal-prior) posterior

Outlier detection using Gaussian mixture

I have 5000 data points for each of my 17 features in a numpy array, resulting in a 5000 x 17 array. I am trying to find the outliers for each feature using a Gaussian mixture, and I am rather confused about the following: 1) How many components should I use for my GaussianMixture? 2) Should I fit the GaussianMixture directly on the 5000 x 17 array, or on each feature column separately, resulting in 17 GaussianMixture models?
clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
clf.fit(full_feature_array)
or
clf = mixture.GaussianMixture(n_components=17, covariance_type='full')
clf.fit(full_feature_array)
or
clf = {}
for feature in range(full_feature_array.shape[1]):
    clf[feature] = mixture.GaussianMixture(n_components=1, covariance_type='full')
    clf[feature].fit(full_feature_array[:, feature].reshape(-1, 1))
The task of selecting the number of components to model a distribution with a Gaussian mixture model is an instance of model selection. This is not so straightforward, and many approaches exist. A good summary can be found at https://en.m.wikipedia.org/wiki/Model_selection. One of the simplest and most widely used is to perform cross-validation.
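For instance, a minimal cross-validation sketch; GridSearchCV maximizes GaussianMixture's built-in score (average log-likelihood), and the candidate range 1-10 is an arbitrary assumption:
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# Stand-in data; substitute your own 5000 x 17 feature array
X = np.random.default_rng(0).normal(size=(5000, 17))

grid = GridSearchCV(GaussianMixture(covariance_type='full'),
                    {'n_components': range(1, 11)}, cv=5)
grid.fit(X)           # unsupervised: no y needed
print(grid.best_params_)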
Normally, outliers can be determined as those belonging to the component or components with the largest variance. You would call this strategy an unsupervised approach; however, it can still be difficult to decide what the cutoff variance should be.
A better approach (if applicable) is a supervised one, where you train the GMM with outlier-free data (by manually removing outliers) and then classify as outliers those points which have particularly low likelihood scores. A second supervised way would be to train two GMMs (one for outliers and one for inliers, using model selection) and then perform two-class classification on new data.
Regarding your question about training univariate versus multivariate GMMs: it is difficult to say, but for the purposes of outlier detection, univariate GMMs (or, equivalently, multivariate GMMs with diagonal covariance matrices) may be sufficient and require fewer parameters to train than general multivariate GMMs, so I would start with those.
Using a Gaussian Mixture Model (GMM), any point sitting in a low-density area can be considered an outlier. Perhaps the challenge is how to define the low-density area: for example, you can say that anything below the 4th percentile of density is an outlier.
densities = gm.score_samples(X)                  # per-sample log-density under the fitted GMM
density_threshold = np.percentile(densities, 4)  # 4th-percentile cutoff
anomalies = X[densities < density_threshold]
Regarding choosing the number of components: look into the "information criterion" provided by AIC or BIC given different numbers of components; they often agree in such cases. The lowest value is better.
gm.bic(X)
gm.aic(X)
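A minimal sketch of that selection loop, assuming X is the array being modeled and an arbitrary candidate range of 1 to 10 components:
from sklearn.mixture import GaussianMixture

# X is assumed to be the data array being modeled (not defined here)
bics = {n: GaussianMixture(n_components=n, random_state=0).fit(X).bic(X)
        for n in range(1, 11)}
best_n = min(bics, key=bics.get)   # lowest BIC wins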
Alternatively, BayesianGaussianMixture assigns weights of (nearly) zero to clusters that are unnecessary.
from sklearn.mixture import BayesianGaussianMixture
bgm = BayesianGaussianMixture(n_components=8, n_init=10) # n_components should be large enough
bgm.fit(X)
np.round(bgm.weights_, 2)
output
array([0.5 , 0.3, 0.2 , 0. , 0. , 0. , 0. , 0. ])
So here the Bayesian GMM detected that there are three clusters.

GradientBoostingTree training with soft labels in scikit-learn?

I'm reconstructing a paper. The authors trained Gradient Boosting Regression Trees given the input X and soft targets y_s to get the final output y with minimum mean squared error. According to the paper, they implemented all decision-tree-based methods using the scikit-learn package without any modification. This is what I want to do.
If you know the solution already I would be happy to hear, otherwise here are my thoughts:
Just for simplification, assume we have a binary problem with
X = [[x1, x2, x3], [x1, x2, x3], ...] and
y_s = [[0.4, 0.6], [0.8, 0.2], ...].
Regarding the GradientBoostingTree for classification (see the first link below), I can only feed in a 1-dim class array:
y : array-like, shape = [n_samples]. Target values (integers in classification, real numbers in regression). For classification, labels must correspond to classes.
So even if I overwrote the cost function (e.g. with cross-entropy, which can handle soft labels), I still could not feed in the 2-dim soft labels.
Another idea was to reduce it to 1-dim by taking only one soft label (this only works for a binary problem where both soft labels add up to 1) and using GradientBoostingRegressor instead. But again, only one class is possible, and I also cannot train independent models like
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor

X = [[1, 2, 3], [1, 2, 3], [4, 5, 6]]
y = [[3.141, 2.718], [3.141, 2.718], [2.718, 3.141]]
rgr = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
rgr.fit(X, y)
X_test = [[1.5, 2.5, 3.5], [3.5, 4.5, 5.5]]
rgr.predict(X_test)
because of the correlation between the outputs.
Big picture:
1. Extraction of combined features
2. a) Training: extracted features (Xb), original labels (y) -> logistic regression
   b) Prediction: soft labels (yb)
3. a) Training: original features (X), soft labels (yb) -> GradientBoostingTree
   b) Evaluation: predicting normal labels (y_) -> importance of original features
The entire procedure is worthless without the soft labels. I mean, it has to be possible somehow, but I cannot figure out how...
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html
scikit-learn's docs on multi-output decision trees should point you in the right direction.
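Not a full answer to the correlated multi-output case, but a minimal sketch of the binary reduction the question mentions (all data made up for illustration):
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# When the two soft labels sum to 1, regress on the probability of
# class 1 alone and recover the other column afterwards.
X = np.array([[1, 2, 3], [1, 2, 3], [4, 5, 6], [4, 5, 6]], dtype=float)
y_s = np.array([[0.4, 0.6], [0.8, 0.2], [0.3, 0.7], [0.9, 0.1]])

rgr = GradientBoostingRegressor(random_state=0).fit(X, y_s[:, 1])
p1 = np.clip(rgr.predict(X), 0.0, 1.0)   # predicted P(class 1), clipped to [0, 1]
p = np.column_stack([1.0 - p1, p1])      # both class probabilities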

Can you use counts in sklearn logistic regression input?

So, I know that in R you can provide data for a logistic regression in this form:
model <- glm( cbind(count_1, count_0) ~ [features] ..., family = 'binomial' )
Is there a way to do something like cbind(count_1, count_0) with sklearn.linear_model.LogisticRegression? Or do I actually have to provide all those duplicate rows? (My features are categorical, so there would be a lot of redundancy.)
If they are categorical, you should provide a binarized version of them. I don't know how that code in R works, but you should always binarize your categorical features. That is because you have to emphasize that each value of a feature is not related to the others; i.e., for a feature "blood_type" with possible values 1, 2, 3, 4, your classifier must learn that 2 is not related to 3, and 4 is not related to 1, in any sense. This is achieved by binarization.
If you have too many features after binarization, you can reduce the dimensionality of the binarized dataset with FeatureHasher or more sophisticated methods like PCA.
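For example, a minimal binarization sketch with pandas (the blood_type column is the hypothetical feature from the answer):
import pandas as pd

# Each value gets its own 0/1 indicator column, so the model
# cannot infer a spurious ordering such as 2 < 3
df = pd.DataFrame({"blood_type": [1, 2, 3, 4, 2, 1]})
X = pd.get_dummies(df["blood_type"], prefix="blood_type")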
