Can you use counts in sklearn logistic regression input? - python

So, I know that in R you can provide data for a logistic regression in this form:
model <- glm( cbind(count_1, count_0) ~ [features] ..., family = 'binomial' )
Is there a way to do something like cbind(count_1, count_0) with sklearn.linear_model.LogisticRegression? Or do I actually have to provide all those duplicate rows? (My features are categorical, so there would be a lot of redundancy.)

If they are categorical - you should provide binarized version of it. I don't know how that code in R works, but you should binarize your categorical feature always. Because you have to emphasize that each value of your feature is not related to other one, i.e. for feature "blood_type" with possible values 1,2,3,4 your classifier must learn that 2 is not related to 3, and 4 is not related to 1 in any sense. These is achieved by binarization.
If you have too many features after binarization - you can reduce dimensionality of binarized dataset by FeatureHasher or more sophisticated methods like PCA.

Related

Why are probabilities hand-calculated from sklearn.linear_model.LogisticRegression coefficients different from .predict_proba()?

I am running a multinomial logistic regression in sklearn, using sklearn.linear_model.LogisticRegression(multiclass="multinomial"). The dependent categorical variable has 3 options: Agree, Disagree, Unsure. The independent variables are two categorical variables: Education and Gender (binary gender for simplicity in this example). I get different results when I hand-calculate the probabilities from the regression coefficients versus use the built-in predict_proba().
mnlr = LogisticRegression(multi_class="multinomial")
mnlr.fit(
pd.get_dummies(df[["Education","Gender"]]),
preprocessing.LabelEncoder().fit_transform(df["statement"])
)
I concatenate the outputs of mnlr.intercept_ and mnlr.coef_ into a regression coefficients table that looks like this:
Using mnlr.predict_proba(), I get results that I cast into a dataframe to which I add the independent variables like this:
These sum to 1 across the 3 potential categories for each data point.
However, I cannot seem to reproduce these results when I try to calculate the predicted probabilities by hand from the logistic regression coefficients.
First, for each Gender x Education combination, I calculate the logit (aka log-odds, if I understand correctly) by simply adding the intercept and the relevant variable terms. For example, to get the logit for a Woman with a Bachelor's degree with the Agree regression: 0.88076 + 0.21827 + 0.21687 = 1.31590. The table of logits looks like this:
From this table, as I understand it, I should be able to convert these logits (log-odds) to predicted probabilities: p = e^logit/(1+e^logit) for a given model and respondent (e.g., probability that Women with Bachelor's Agree with the statement). When I try this, however, I get much different results than I receive from .predict_proba() and the hand-calculated probabilities do not sum to 1, as indicated in the table below:
For example, Women with Bachelor's here have a 0.78850 probability to Agree with the statement, in place of the 0.7819 probability. Additionally, the hand-calculated probabilities across the 3 categories do not sum to 1, but rather to 1.47146.
I am almost certain this is a basic error on my part, but I cannot for the life of me figure it out. What am I doing incorrectly?
I figured this one out eventually. The answer is probably obvious to folks who really know multinomial logistic regression. The struggle I was having was that I needed to apply the softmax function (also known more descriptively as the normalized exponential function) to the logits. This function involves exponentiating the logit (log-odds) for each class and then dividing it by the sum of exponentiated logits for all classes. In this example, for Women with a Bachelor's degree, this would mean:
=
= 0.737007424626824
Hopefully this will be helpful to anyone else trying to understand how to do this by hand! (Which for me is really useful for trying to apply model-based inference as an alternative to design-based inference in sample surveys).
Sources that got me here:
How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification, https://en.wikipedia.org/wiki/Softmax_function

CatBoost Post-Training Feature Information

I would like to understand how I can access information about numerical and categorical features after training a CatBoost model. For the sake of example, here's some toy code:
import pandas as pd
from catboost import CatBoostClassifier, Pool
train_pool = Pool(pd.DataFrame({'size': [1,1,2,1],
'shape': ['square','square','square', 'circle']}),
[1,1,0,1],
feature_names = ['size','shape'],
cat_features= ['shape'])
model = CatBoostClassifier(iterations=2,
cat_features = ['shape'],
ctr_leaf_count_limit=1)
model.fit(train_pool, plot=False)
I would now like to run a function on the model object to obtain the following:
Numerical Feature size has minimum value 0, and max value 1 (this should be part of CatBoosts split logic for numerical features)
Categorical Feature shape has the following training values:
values=['square', None].
Notice that circle is not in values because the car_leaf_count_limit=1 would have selected the most occurring value, which in this case is 'square'. I've put None here because I'm pretty sure cat boost will assign None to any unseen classes.
Next, I've chosen the above data example to make sure that CatBoost decides to split on shape=='square'. Ideally I'd like to see an array used_values=['square'] which emphasizes that there was at least one split on this square value.
It's important to emphasize here that I want to operate on the model object only. Obviously, one can get some of these details by running functions onto of the training data. My motivation is to double-make-sure that I completely understand the training-range of inputs into the model, and what it may do to them in preprocessing.

Python PLSRegression : obtaining the latent variables scores using loadings

In sklearn.cross_decomposition.PLSRegression, we can obtain the latent variables scores from the X array using x_scores_.
I would like to extract the loadings to calculate the latent variables scores for a new array W. Intuitively, what I whould do is: scores = W*loadings (matrix multiplication). I tried this using either x_loadings_, x_weights_, and x_rotations_ as loadings as I could not figure out which array was the good one (there is little info on the sklearn website). I also tried to standardize W (subtracting the mean and dividing by the standard deviation of X) before multiplying by the loadings. But none of these works (I tried using the X array and I cannot obtain the same scores as in the x_scores_ array).
Any help with this?
Actually, I just had to better understand the fit() and transform() methods of Sklearn. I need to use transform(W) to obtain the latent variables scores of the W array:
1.Fit(): generates learning model parameters from training data
2.Transform(): uses the parameters generated from fit() method to transform a particular dataset

Python: Transform my X_train distribution for machine learning

My DataFrame looks like this:
X_Train = Wind_direction from 0 to 360 degree => Xaxis
Y_Train = Energy_Production => Yaxis
How can I transform my X_variable in order to obtain better results on my machine learning problem?
Optimal direction seems to be around 140 and 340 degrees
Some models depends heavily on the input data distribution, like neural networks or other gradient based techniques.
Some models do not really care about the distribution, like decison trees, random forests etc.
I would suggest to try out different normalization techniques, like the StandardScaler (z-score) or a MinMaxScaler for a given range (like rescaling on [0.0, 1.0]).
In the end there is no universal answer which normalization techniques performans best, since it is dependend on the problem and the machine learning algorithm itself.

How do I rank a list of features that I have picked out in the given data and rank the features?

I am currently doing ML on house prediction dataset. > dataset
How do I select 10 best features in my dataset for aiding in building a model for predicting SalePrice and how to rank the 10 features ?
You can use a feature ranking algorithm like "ReliefF"or use a dimensionality reduction technique like PCA, which will transform all of your features into your desired number of features.
This is also a good link for learning more about feature selection:
https://machinelearningmastery.com/feature-selection-machine-learning-python/
You might also need to convert strings to numbers in features where the values are alphanumeric in nature.

Categories

Resources