How to reverse the OneHotEncoding to calculate the coef_ with sklearn? - python

After fitting a linear regression and calculating my R2 score, I would like to look at the coefficients of my features (via the sklearn coef_ attribute).
The point is that my features mix numerical and categorical data. To run LinearRegression() I have one-hot encoded my categorical values, so it is not possible to directly read a coef_ off each original feature.
How would you reverse the process to get back a single column for each categorical feature (one-hot encoding produces as many columns as there are possible values for each categorical feature)?
I saw this great post: https://katstam.com/regression-feature_importance/ which could solve my problem. I added this line from the bottom of the article to my notebook:
onehot_columns = list(clf.named_steps['preprocessor'].named_transformers_['cat'].named_steps['one_hot'].get_feature_names(input_features=categorical_features))
and :
numeric_features_list = list(numeric_features)
numeric_features_list.extend(onehot_columns)
But I don't know what "clf" refers to.
In the article, "clf" refers to this object:
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LinearRegression())])
But in my notebook I handle the preprocessing and the linear regression in two separate steps, so I don't have an equivalent of this "clf" object.
Do you have any idea, or maybe another method I could use?

The author of the post you are referring to lists these features at the beginning of the post: house price, location, age, interest, interest rate, year; and all of a sudden, when displaying the importance weights from ELI5, the features are: relationship, marital_status, education_num, etc. In other words, I don't think the author is displaying the weights of the dataset he/she claims to have used.
Having said that, what you want to do is not possible, and for a reason. If you think about it, you have given your model (linear regression) a set of features: the numerical ones, but also the one-hot encoded ones. The model has learned the weights of those final features, not the initial (raw) ones. So even if you do retrieve the weights of all the features, you will have an array of floats (the weights) that corresponds to the final feature vector you used to train your model.
Let me give you a picture:
Imagine you have gender as a categorical feature and you do One-Hot Encoding. At the end, you will have the features gender_male, gender_female, gender_non-binary, alongside location, age, interest, .... This is the feature space the model is going to be exposed to. On this feature space, the model will learn its weights. And at the end you will have a weight for each of the categories of the categorical variable gender.
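To make this concrete, here is a minimal, self-contained sketch (toy data and made-up column names) of what you can actually recover: one coefficient per expanded column, with the preprocessing and the regression kept as two separate steps as in the question. It assumes scikit-learn >= 1.0 for get_feature_names_out; older releases expose get_feature_names instead, as in the article you quoted.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# Toy data standing in for your own DataFrame (column names are made up).
df = pd.DataFrame({
    "age":      [25, 40, 31, 58, 22, 45],
    "interest": [3.1, 2.4, 5.0, 1.2, 4.4, 2.9],
    "gender":   ["male", "female", "non-binary", "female", "male", "female"],
    "price":    [200, 320, 180, 410, 150, 300],
})
numeric_features = ["age", "interest"]
categorical_features = ["gender"]

preprocessor = ColumnTransformer([
    ("num", "passthrough", numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# Preprocessing and regression as two separate steps, as in the question.
X = preprocessor.fit_transform(df[numeric_features + categorical_features])
model = LinearRegression().fit(X, df["price"])

# One name per expanded column: the numeric columns plus one column per category.
onehot_names = preprocessor.named_transformers_["cat"].get_feature_names_out(categorical_features)
feature_names = numeric_features + list(onehot_names)

# coef_ holds exactly one weight per expanded column, not per raw feature.
print(pd.Series(model.coef_, index=feature_names))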

Related

How can you predict a combination of categorical and continuous variables with Scikit learn?

I have a dataset with a large number of predictive variables and I want to use them to predict a number of output variables. However, some of the things I want to predict are categorical, and others are continuous; the things I want to predict are not independent. Is it possible with scikit-learn to, for example, mix a classifier and a regressor so that I can predict and disentangle these variables? (I'm currently looking at gradient boosting classifiers/regressors, but there may be better options.)
You can certainly use one-hot encoding or dummy-variable encoding to convert labels to numerics. See the link below for the details.
https://codefires.com/how-convert-categorical-data-numerical-data-python/
As an aside, Random Forest is a popular machine learning model that is commonly used for classification tasks as can be seen in many academic papers, Kaggle competitions, and blog posts. In addition to classification, Random Forests can also be used for regression tasks. A Random Forest’s nonlinear nature can give it a leg up over linear algorithms, making it a great option. However, it is important to know your data and keep in mind that a Random Forest can’t extrapolate. It can only make a prediction that is an average of previously observed labels. In this sense it is very similar to KNN. In other words, in a regression problem, the range of predictions a Random Forest can make is bound by the highest and lowest labels in the training data. This behavior becomes problematic in situations where the training and prediction inputs differ in their range and/or distributions. This is called covariate shift and it is difficult for most models to handle but especially for Random Forest, because it can’t extrapolate.
https://towardsdatascience.com/a-limitation-of-random-forest-regression-db8ed7419e9f
https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn
In closing, scikit-learn uses numpy matrices as inputs to its models, so all features become de facto numerical (if you have categorical features you'll need to convert them to numerical values first).
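As a minimal illustration of that conversion (toy column names; pandas.get_dummies shown here as one route, sklearn's OneHotEncoder being the other common one):
import pandas as pd

# Toy frame with one categorical and one numeric predictor (names are made up).
df = pd.DataFrame({
    "city":  ["paris", "lyon", "paris", "nice"],
    "rooms": [2, 3, 4, 2],
})

# Dummy-variable / one-hot encoding: each category becomes its own 0/1 column,
# so the resulting matrix is purely numerical and usable by any sklearn estimator.
X = pd.get_dummies(df, columns=["city"])
print(X)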
I don't think there's a builtin way. There are ClassifierChain and RegressorChain that allow you to use earlier predictions as features in later predictions, but as the names indicate they assume either classification or regression. Two options come to mind:
Manually patch those together for what you want to do. For example, use a ClassifierChain to predict each of your categorical targets using just the independent features, then add those predictions to the dataset before training a RegressorChain with the numeric targets.
Use those classes as a base for defining a custom estimator. In that case you'll probably look mostly at their common parent class _BaseChain. Unfortunately that also uses a single estimator attribute, whereas you'd need (at least) two, one classifier and one regressor.
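A rough, hedged sketch of the first option (manually patching the two chains together); the data shapes, targets, and base estimators below are all made up for illustration:
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.multioutput import ClassifierChain, RegressorChain

# Toy data: X are the independent features, Y_cat the categorical (binary) targets,
# Y_num the continuous targets. Replace with your own arrays.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y_cat = rng.integers(0, 2, size=(200, 2))   # two binary targets
Y_num = rng.normal(size=(200, 3))           # three continuous targets

# Step 1: predict the categorical targets from the independent features only.
clf_chain = ClassifierChain(LogisticRegression()).fit(X, Y_cat)
cat_pred = clf_chain.predict(X)

# Step 2: append those predictions as extra features, then fit a regressor chain
# for the continuous targets.
X_aug = np.hstack([X, cat_pred])
reg_chain = RegressorChain(Ridge()).fit(X_aug, Y_num)

# At prediction time, repeat the same two steps.
X_new = rng.normal(size=(10, 5))
cat_new = clf_chain.predict(X_new)
num_new = reg_chain.predict(np.hstack([X_new, cat_new]))
print(cat_new.shape, num_new.shape)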

Comparing feature importance in LightGBM + Scikit

I have a model trained using LightGBM (LGBMRegressor), in Python, with scikit-learn.
On a weekly basis the model is re-trained, and an updated set of chosen features and their associated feature_importances_ are plotted. I want to compare these magnitudes across weeks, to detect (abrupt) changes in the set of chosen variables and in the importance of each of them. But I fear the raw importances are not directly comparable (I am using the default split option, which scores a feature by the number of times it is used in the model).
My question is: assuming these different raw importances along weeks are not directly comparable, is a normalization enough to allow the comparison? If yes, what is the best way to do such normalization? (maybe just a division by the highest importance of the week).
Thank you very much
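For illustration only, a small self-contained sketch (toy data, invented feature names) of the two per-week normalizations the question hints at: expressing each importance as a share of that week's total, or dividing by that week's maximum.
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor

# Toy weekly training data (feature names are made up).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f1", "f2", "f3", "f4"])
y = 2 * X["f1"] + X["f2"] + rng.normal(scale=0.1, size=500)

model = LGBMRegressor(n_estimators=50).fit(X, y)

raw = model.feature_importances_.astype(float)   # split counts (the default importance_type)
summary = pd.DataFrame({
    "feature": X.columns,
    "raw": raw,
    "share_of_total": raw / raw.sum(),   # fractions summing to 1 within each week
    "vs_max": raw / raw.max(),           # the "divide by the highest importance" idea
})
print(summary.sort_values("share_of_total", ascending=False))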

How to handle text classification problems when multiple features are involved

I am working on a text classification problem with multiple text features, and I need to build a model to predict a salary range. Please refer to the sample dataset.
Most of the resources/tutorials deal with feature extraction on only one column and then predicting a target. I am aware of the usual process: text pre-processing, feature extraction (CountVectorizer or TF-IDF) and then applying the algorithms.
In this problem, I have multiple input text features. How do you handle a text classification problem when multiple features are involved? These are the methods I have already tried, but I am not sure whether they are the right ones. Kindly provide your inputs/suggestions.
1) Applied data cleaning on each feature separately, followed by TF-IDF and then logistic regression. Here I tried to see whether I could use only one feature for classification.
2) Applied data cleaning on all the columns separately, then applied TF-IDF to each feature and merged all the feature vectors into a single feature vector. Finally, logistic regression.
3) Applied data cleaning on all the columns separately and merged all the cleaned columns into one feature, 'merged_text'. Then applied TF-IDF on this merged_text, followed by logistic regression.
All these 3 methods gave me around 35-40% accuracy on the cross-validation & test set. I am expecting at least 60% accuracy on the test set, which is not provided.
Also, I didn't understand how to use 'company_name' & 'experience' alongside the text data. There are about 2000+ unique values in company_name. Please provide input/pointers on how to handle numeric data in a text classification problem.
Try these things:
Apply text preprocessing on 'job description', 'job designation' and 'key skills': remove all stop words, split into words while stripping punctuation, lowercase everything, then apply TF-IDF or CountVectorizer, and don't forget to scale these features before training the model.
Convert experience into two features, minimum experience and maximum experience, and treat each as a discrete numeric feature.
Company and location can be treated as categorical features: create dummy variables / one-hot encodings for them before training the model.
Try combining job type and key skills and then vectorizing the result, and see if it works better.
Use a Random Forest Regressor and tune its hyperparameters (n_estimators, max_depth, max_features) with GridSearchCV.
Hopefully, these will increase the performance of the model; a sketch of how the pieces might fit together follows below.
Let me know how it performs with these.
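A hedged sketch of wiring those suggestions together with a ColumnTransformer; the column names and values below are invented stand-ins for the sample dataset, so adapt them to the real columns:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# Toy rows mimicking the columns mentioned in the question (values are invented).
df = pd.DataFrame({
    "job_description": ["build ml models in python", "maintain java backend services",
                        "analyse sales data in excel", "design deep learning pipelines"],
    "key_skills":      ["python sklearn", "java spring", "excel sql", "python pytorch"],
    "company_name":    ["acme", "globex", "acme", "initech"],
    "min_experience":  [2, 5, 1, 3],
    "max_experience":  [4, 8, 3, 6],
    "salary":          [60000, 90000, 40000, 85000],
})

# TfidfVectorizer works on a single column of strings, so each text column gets
# its own vectorizer; categorical and numeric columns get their own treatment.
preprocess = ColumnTransformer([
    ("desc",    TfidfVectorizer(stop_words="english"), "job_description"),
    ("skills",  TfidfVectorizer(), "key_skills"),
    ("company", OneHotEncoder(handle_unknown="ignore"), ["company_name"]),
    ("exp",     "passthrough", ["min_experience", "max_experience"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=100, random_state=0)),
])
model.fit(df.drop(columns="salary"), df["salary"])
print(model.predict(df.drop(columns="salary")))
If the target is a salary band rather than a number, the RandomForestRegressor can be swapped for a classifier without changing the preprocessing.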

scikit-learn classifiers give varying results when one non-binary feature is added

I'm evaluating some machine learning models for a binary classification problem, and encountering weird results when adding one non-binary feature.
My dataset consists of tweets and some other values related to them, so the main feature vector is a sparse matrix (5000 columns) generated using scikit-learn's Tf-idf Vectoriser on the tweets and SelectKBest feature selection.
I have two other features I want to add, which are both 1-column dense matrices. I convert them to sparse and use scipy's hstack function to append them to the main feature vector. The first of these features is binary, and when I add just that one all is good and I get accuracies of ~60%. However, the second feature takes integer values, and adding it causes varying results.
I am testing Logistic Regression, SVM (rbf), and Multinomial Naive Bayes. When adding the final feature the SVM accuracy increases to 80%, but for Logistic Regression it now always predicts the same class, and MNB is also very heavily skewed towards that class.
SVM confusion matrix
[[13112 3682]
[ 1958 9270]]
MNB confusion matrix
[[13403 9803]
[ 1667 3149]]
LR confusion matrix
[[15070 12952]
[ 0 0]]
Can anyone explain why this could be? I don't understand why this one extra feature could cause two of the classifiers to effectively become redundant yet improve the other one so much. Thanks!
Sounds like your extra feature is non-linear. NB and LR both assume that the features are linear, while the SVM only assumes that the variables are linearly separable. Intuitively this means that there is a "cut-off" value for your variable that the SVM is optimizing for. If you still want to use LR or NB, you could try transforming this variable to make it linear, or you could try converting it to a binary indicator variable based on that threshold; either way you might improve your model's performance.
Take a look at https://stats.stackexchange.com/questions/182329/how-to-know-whether-the-data-is-linearly-separable for some further reading.
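A minimal sketch of the transformation/binarization idea from the answer above. The threshold of 5 and the toy values are assumptions; note also that MultinomialNB requires non-negative inputs, which is one reason the binarized version suits it better than a standard-scaled one:
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import Binarizer, StandardScaler

# The 1-column integer feature from the question (toy values, assumed here).
extra_feature = np.array([[0], [3], [12], [1], [7]], dtype=float)

# Option 1: rescale it so its magnitude is comparable to the tf-idf columns
# (suitable for LR / SVM, but not for MultinomialNB, which rejects negative values).
scaled = StandardScaler().fit_transform(extra_feature)

# Option 2: the binary-indicator idea from the answer; the threshold is an assumption.
binarized = Binarizer(threshold=5).fit_transform(extra_feature)

# Stack whichever version you pick onto the sparse tf-idf matrix, as in the question.
tfidf_matrix = csr_matrix(np.random.rand(5, 10))  # stand-in for the real 5000-column matrix
X = hstack([tfidf_matrix, csr_matrix(binarized)])
print(X.shape)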

Mixed parameter types for machine learning

I wish to fit a logistic regression model with a set of parameters. The parameters that I have include three distinct types of data:
Binary data [0,1]
Categorical data which has been encoded to integers [0,1,2,3,...]
Continuous data
I have two questions regarding pre-processing the parameter data before fitting a regression model:
For the categorical data, I've seen two ways to handle this. The first method is to use a one hot encoder, thus giving a new parameter for each category. The second method, is to just encode the categories with integers within a single parameter variable [0,1,2,3,4,...]. I understand that using a one hot encoder creates more parameters and therefore increases the risk of over-fitting the model; however, other than that, are there any reasons to prefer one method over the other?
I would like to normalize the parameter data to account for the large differences between the continuous and binary data. Is it generally acceptable to normalize the binary and categorical data? Should I normalize the categorical and continuous parameters but not the binary ones, or can I just normalize all the parameter data types?
I realize I could fit this data with a random forest model and not have to worry much about pre-processing, but I'm curious how this applies with a regression type model.
Thank you in advance for your time and consideration.
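For reference, a minimal sketch of the first set-up discussed above (one-hot encoding the categorical column, scaling only the continuous one, and passing the binary column through unchanged); all column names and values are invented:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Toy frame with the three kinds of predictors described in the question.
df = pd.DataFrame({
    "is_member":  [0, 1, 1, 0, 1, 0],            # binary
    "color_code": [0, 2, 1, 3, 2, 0],            # integer-encoded categories
    "income":     [35000, 52000, 61000, 28000, 75000, 40000],  # continuous
    "target":     [0, 1, 1, 0, 1, 0],
})

# One common arrangement: one-hot the categorical column, standardize the
# continuous one, and pass the 0/1 column through unchanged.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color_code"]),
    ("num", StandardScaler(), ["income"]),
    ("bin", "passthrough", ["is_member"]),
])

clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])
clf.fit(df.drop(columns="target"), df["target"])
print(clf.predict(df.drop(columns="target")))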
