How to deal with "None" when using sklearn's DecisionTreeClassifier? - python

When I use sklearn to build a decision tree, for example:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
result = clf.predict(testdata)
X is the training input samples. If there is "None" in X, how do I deal with it?

Decision Trees and ensemble methods like Random Forests (which are built from such trees) only accept numerical data, since they perform splits at each node of the tree in order to minimize a given impurity function (entropy, Gini index, ...).
If you have categorical features or some NaN values in your data, the learning step will throw an error.
To circumvent this:
Transform categorical data into numerical data: for example, use a One Hot Encoder. Here is a link to sklearn's documentation.
Warning: if you have a feature with a lot of categories (e.g. an ID feature), OneHotEncoding may lead to memory issues. Try to avoid encoding such features.
Impute values for the missing ones. Many strategies exist (mean, median, most frequent, ...). Here is a link to sklearn's documentation.
Once you've done this preprocessing, you can fit your Decision Tree to your data, as sketched below.
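A minimal sketch of this preprocessing, assuming a pandas DataFrame with a numeric column containing NaN and a categorical column (the column names and toy values below are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data: "age" has a missing value, "color" is categorical.
X = pd.DataFrame({"age": [25, np.nan, 40, 33],
                  "color": ["red", "blue", "red", "green"]})
Y = [0, 1, 0, 1]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["age"]),            # impute missing numbers
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),  # encode categories
])

clf = Pipeline([("preprocess", preprocess), ("tree", DecisionTreeClassifier())])
clf = clf.fit(X, Y)
result = clf.predict(X)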

Related

Categorical features (object/float) selection for a regression problem using python

I have a set of alphanumeric categorical features (c_1, c_2, ..., c_n) and one numeric target variable (prediction) as a pandas dataframe. Can you please suggest a feature selection algorithm that I can use for this data set?
I'm assuming you are solving a supervised learning problem like Regression or Classification.
First of all, I suggest transforming the categorical features into numeric ones using one-hot encoding. Pandas provides a useful function that already does this:
dataset = pd.get_dummies(dataset, columns=['feature-1', 'feature-2', ...])
If you have a limited number of features and a model that is not too computationally expensive, you can test every combination of features. This is the most exhaustive way, but it is seldom a viable option.
A possible alternative is to sort the features by their correlation with the target, sequentially add them to the model, measure the model's performance at each step, and select the set of features that performs best (see the sketch below).
If you have high-dimensional data, you can consider reducing the dimensionality using PCA or another dimensionality reduction technique. This projects the data into a lower-dimensional space, reducing the number of features; obviously you will lose some information due to the PCA approximation.
These are only some examples of methods to perform feature selection, there are many others.
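As a rough sketch of the correlation-then-add approach mentioned above (the data below is synthetic and only stands in for your one-hot encoded dataframe):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic numeric data standing in for the one-hot encoded dataframe.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
y = 3 * X["f0"] - 2 * X["f2"] + rng.normal(scale=0.1, size=200)

# Rank features by absolute correlation with the target.
ranked = X.corrwith(y).abs().sort_values(ascending=False).index

best_score, best_features, selected = -np.inf, [], []
for feature in ranked:
    selected.append(feature)
    # Cross-validated R^2 on the current feature subset.
    score = cross_val_score(LinearRegression(), X[selected], y, cv=5).mean()
    if score > best_score:
        best_score, best_features = score, list(selected)

print(best_features, best_score)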
Final tips:
Remember to split the data into Training, Validation and Test set.
Often data normalization is recommended to obtain better results.
Some models have an embedded mechanism to perform feature selection (Lasso, Decision Trees, ...).

How to reverse the OneHotEncoding to calculate the coef_ with sklearn?

After computing a linear regression and calculating my R2 score, I would like to look at the coefficients (via the sklearn coef_ attribute) of my features.
The point is that my features contain both numerical and categorical data. To run LinearRegression() I have one-hot encoded my categorical values, so in this case it is not possible to directly read off the coef_ of each original feature.
How would you reverse the process to get a single column for each categorical feature (since OneHotEncoding produces as many columns as there are possible values for each categorical feature)?
I saw this great post : https://katstam.com/regression-feature_importance/ which could solve my problem. I added this line to my notebook (at the bottom of the article) :
onehot_columns = list(clf.named_steps['preprocessor'].named_transformers_['cat'].named_steps['one_hot'].get_feature_names(input_features=categorical_features))
and :
numeric_features_list = list(numeric_features)
numeric_features_list.extend(onehot_columns)
But I don't know what "clf" refers to.
In the article, "clf" refers to this object :
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LinearRegression())])
But in my notebook I handle the preprocessing and the linear regression in two separate steps, so I don't have the equivalent of this "clf" object.
Do you have any idea, or maybe another method to use?
The author of the post you are referring to has these features at the beginning of the post: house price, location, age, interest, interest rate, year; and all of a sudden, when he/she displays the importance weights from ELI5, the features are: relationship, marital_status, education_num, etc. In other words, I don't think the author is displaying the weights of the dataset he/she is claiming to have used.
Having said that, what you want to do is not possible and for a reason. If you think about it, you have given your model (Linear Regression) a set of features. These features are the numerical ones but also the one-hot encoded ones. The model has learned the weights of those final features not the initial (raw) ones! So, even if you do retrieve the weights of all the features, you will have an array of float numbers (the weights) that correspond to the final feature vector you used to train your model.
Let me give you a picture:
Imagine you have gender as a categorical feature and you do One-Hot Encoding. At the end, you will have the features gender_male, gender_female, gender_non-binary, alongside location, age, interest, .... This is the feature space the model is going to be exposed to. On this feature space, the model will learn its weights. And at the end you will have a weight for each of the categories of the categorical variable gender.
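If the goal is simply to see which weight belongs to which encoded column, a minimal sketch follows. It assumes scikit-learn >= 1.0, where ColumnTransformer exposes get_feature_names_out (older versions use get_feature_names, as in the quoted snippet); the data and column names are made up:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with one numeric and one categorical feature.
df = pd.DataFrame({"age": [20, 35, 50, 27],
                   "gender": ["male", "female", "non-binary", "female"],
                   "price": [100.0, 180.0, 240.0, 150.0]})
numeric_features = ["age"]
categorical_features = ["gender"]

preprocessor = ColumnTransformer([("num", "passthrough", numeric_features),
                                  ("cat", OneHotEncoder(), categorical_features)])
clf = Pipeline(steps=[("preprocessor", preprocessor),
                      ("regressor", LinearRegression())])
clf.fit(df[numeric_features + categorical_features], df["price"])

# One weight per *encoded* column (e.g. cat__gender_female), not per raw feature.
feature_names = clf.named_steps["preprocessor"].get_feature_names_out()
print(pd.Series(clf.named_steps["regressor"].coef_, index=feature_names))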

Finding label-specific top features for non-linear classifier

Is there any function that gives the top features of each label in a Random Forest / XGBoost classifier? The classifier.feature_importances_ attribute only gives the top features for the classifier as a whole.
I'm looking for something similar to classifier.coef_, which gives label-specific top features for SVM and Naive Bayes classifiers in sklearn.
import pandas as pd
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
Try with this!
One-vs-Rest is also a good option, but it takes a lot of time.
Firstly, Random Forest / XGBoost, or even a simple DecisionTree or any tree ensemble, is an inherently multi-class classification model. Hence it predicts multi-class output without using any wrapper (one-vs-one / one-vs-rest) on top of a binary classifier (which is what logistic regression / SVM / SGDClassifier would do).
Hence, you can get the feature importance for the overall multi-class classification alone and not for individual labels.
If you really want to know the feature importance for individual labels, then use the OneVsRest wrapper with DecisionTree / RandomForest / XGBoost as the estimator, as sketched below.
This is not the recommended approach, because the results could be suboptimal compared with a single decision tree.
Some examples here.
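A rough sketch of that one-vs-rest idea, using the iris dataset purely as a stand-in; each per-class forest exposes its own feature_importances_:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True, as_frame=True)

# One binary random forest per class, each with its own importances.
ovr = OneVsRestClassifier(RandomForestClassifier(n_estimators=100, random_state=0))
ovr.fit(X, y)

for label, estimator in zip(ovr.classes_, ovr.estimators_):
    top = sorted(zip(X.columns, estimator.feature_importances_),
                 key=lambda t: t[1], reverse=True)
    print(label, top[:3])  # top 3 features for this label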

How to handle text classification problems when multiple features are involved

I am working on a text classification problem with multiple text features and need to build a model to predict salary range. Please refer to the sample dataset.
Most of the resources/tutorials deal with feature extraction on only one column and then predicting a target. I am aware of the process: text pre-processing, feature extraction (CountVectorizer or TF-IDF) and then applying algorithms.
In this problem, I have multiple input text features. How to handle text classification problems when multiple features are involved? These are the methods I have already tried but I am not sure if these are the right methods. Kindly provide your inputs/suggestion.
1) Applied data cleaning on each feature separately, followed by TF-IDF and then logistic regression. Here I tried to see if I could use only one feature for classification.
2) Applied data cleaning on all the columns separately, then applied TF-IDF to each feature and merged all the feature vectors into one. Finally, logistic regression.
3) Applied data cleaning on all the columns separately and merged all the cleaned columns into a single feature 'merged_text'. Then applied TF-IDF on this merged_text, followed by logistic regression.
All these 3 methods gave me around 35-40% accuracy on the cross-validation and test sets. I am expecting at least 60% accuracy on the test set, which is not provided.
Also, I didn't understand how to use 'company_name' and 'experience' together with the text data. There are 2000+ unique values in company_name. Please provide input/pointers on how to handle numeric data in a text classification problem.
Try these things:
Apply text preprocessing on 'job description', 'job designation' and 'key skills': remove all stop words, split on punctuation, lowercase all words, then apply TF-IDF or CountVectorizer; don't forget to scale these features before training the model.
Convert experience into two features, minimum experience and maximum experience, and treat them as discrete numeric features.
Company and location can be treated as categorical features; create dummy variables / one-hot encodings before training the model.
Try combining job type and key skills and then vectorizing; see if it works better.
Use a Random Forest Regressor and tune the hyperparameters n_estimators, max_depth and max_features using GridSearchCV.
Hopefully, these will increase the performance of the model.
Let me know how it performs with these.
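A minimal sketch of how these pieces can be combined with a ColumnTransformer, so each text column gets its own vectorizer while company and experience are handled alongside them. The column names and toy rows below are invented and will differ from your dataset:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows; column names only mirror the question.
df = pd.DataFrame({
    "job_description": ["build ml models in python", "manage sales team targets"],
    "key_skills": ["python sklearn pandas", "negotiation crm excel"],
    "company_name": ["acme", "globex"],
    "min_experience": [2, 5],
    "salary_range": ["low", "high"],
})

preprocess = ColumnTransformer([
    # A separate TF-IDF per text column (each vectorizer receives a single 1-D column).
    ("desc", TfidfVectorizer(stop_words="english"), "job_description"),
    ("skills", TfidfVectorizer(stop_words="english"), "key_skills"),
    # Categorical and numeric columns are handled alongside the text features.
    ("company", OneHotEncoder(handle_unknown="ignore"), ["company_name"]),
    ("experience", "passthrough", ["min_experience"]),
])

clf = Pipeline([("preprocess", preprocess),
                ("model", LogisticRegression(max_iter=1000))])
clf.fit(df.drop(columns=["salary_range"]), df["salary_range"])
print(clf.predict(df.drop(columns=["salary_range"])))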

Getting the list of features used during training of Random Forest Regressor

I used one set of data to train a Random Forest Regressor, and right now I have another dataset with a smaller number of features (a subset of the previous set).
Is there a function which allows to get the list of names of columns used during the training of the Random Forest Regressor model?
If not, then is there a function which for the missing columns would assign Nulls?
Is there a function which allows to get the list of names of columns used during the training of the Random Forest Regressor model?
RF uses all features from your dataset. At each split, a tree only considers a random subset of the columns (sqrt(num_of_features), log2(num_of_features), or whatever max_features is set to), but because these columns are chosen at random, RF usually ends up covering all columns in your dataset.
There may be an edge case where you use a small number of estimators in the RF and some features are never considered. In that case, RandomForestRegressor.feature_importances_ (a zero value can be an indicator here) or diving into each tree in RandomForestRegressor.estimators_ may help.
If not, then is there a function which for the missing columns would assign Nulls?
RF does not accept missing values. Either encode missing values as a separate category (and use it during training too), or switch to something like XGBoost, which handles missing values natively.
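A small sketch of both points, assuming scikit-learn >= 1.0 (which records feature_names_in_ when you fit on a DataFrame) and made-up column names; note that the reindexed NaN columns would still need to be imputed before calling predict:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy training frame; column names are made up.
X_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.1, 0.2, 0.3], "c": [5.0, 6.0, 7.0]})
y_train = [10.0, 20.0, 30.0]
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Columns seen during fit (available when fitting on a DataFrame, scikit-learn >= 1.0).
print(rf.feature_names_in_)

# A new frame with only a subset of the training columns; reindex adds the
# missing ones filled with NaN, but the forest will not accept NaN at predict time.
X_new = pd.DataFrame({"a": [1.5], "c": [5.5]})
X_new = X_new.reindex(columns=rf.feature_names_in_)
print(X_new)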
