I have binary classification text data with 10 text features.
I use various techniques like bag of words, TF-IDF, etc. to convert them to numerical features.
I use hstack() to stack all those features back together after processing them.
After conversion each feature has a large number of columns, so my dataset ends up with around 3000 columns.
My question is: when I fit this dataset into a decision tree classifier (sklearn), how does the classifier recognize which columns belong to a particular feature?
For example, the first 51 columns out of 3000 belong to the US_states bag of words.
Now, how will the DT recognize that?
PS: Before processing, the data is in a pandas DataFrame.
After processing, it is a stacked numpy array being fed into the classifier.
The decision tree won't recognize which original feature each column came from; it only sees the stacked numeric matrix, with the columns in whatever order you hstack()-ed the blocks.
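A minimal sketch with toy data (the column names us_state and job_title are made up) showing how you can keep that column-to-feature mapping yourself and translate the tree's split indices back to the original feature blocks:

import numpy as np
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the real data: two text features and a binary target.
df = pd.DataFrame({
    "us_state": ["texas", "ohio", "texas", "maine"],
    "job_title": ["data analyst", "sales rep", "data engineer", "sales manager"],
})
y = [1, 0, 1, 0]

vec_state = CountVectorizer()
vec_title = CountVectorizer()
X_state = vec_state.fit_transform(df["us_state"])
X_title = vec_title.fit_transform(df["job_title"])

# Column order in the stacked matrix is exactly the order passed to hstack(),
# so the classifier just sees columns 0..n-1; you keep the mapping yourself.
X = hstack([X_state, X_title]).tocsr()
feature_names = np.concatenate([
    ["us_state__" + t for t in vec_state.get_feature_names_out()],
    ["job_title__" + t for t in vec_title.get_feature_names_out()],
])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# tree_.feature holds the column index used at each split (negative for leaves),
# so split indices can be translated back to the original feature blocks:
used = clf.tree_.feature[clf.tree_.feature >= 0]
print(feature_names[used])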
From what I understand about the random forest algorithm, it randomly samples the original dataset to build new sampled/bootstrapped datasets, and each sampled dataset is then turned into a decision tree.
In scikit-learn you can visualize each individual tree in the random forest, but my question is: how can I show the sampled/bootstrapped dataset behind each of those trees?
I want to see the features and the rows of data used to build each individual tree.
I am not aware of a way to see the bootstrapped rows (samples) of your data, as this is a random process. Nor do I think it matters much, if the algorithm trained well.
Nevertheless, for the features: in your visualization of the tree you can already see which features were used by the splits that were made. But you can also access them directly via the feature_names_in_ attribute (see here) for each DecisionTree in the forest, e.g.
print(rf.estimators_[0].feature_names_in_)
If your feature names in the data have not been defined as described in the documentation, a workaround is to use feature_importances_ (see here) as a proxy instead: features that were never used obviously have an importance of 0. You can compare this against the max_features_ value you set for training to check whether you caught them all.
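A small sketch of that importance-based workaround (toy data; the forest-level feature_names_in_ attribute is used here to recover the names):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit on a DataFrame so the forest records feature_names_in_.
X, y = load_iris(return_X_y=True, as_frame=True)
rf = RandomForestClassifier(n_estimators=10, max_features=2, random_state=0).fit(X, y)

for i, tree in enumerate(rf.estimators_):
    # Features with non-zero importance are the ones this tree actually split on.
    used = np.flatnonzero(tree.feature_importances_ > 0)
    print(f"tree {i} used:", list(rf.feature_names_in_[used]))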
I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. The output of the random forest should be the items the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient, since a user interacts with no more than a couple of hundred items at most, and sometimes as few as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embeddings, but for categorical features. In practical terms, you define an embedding of your discrete feature space into a low-dimensional vector space. This can improve your results and save memory. The downside is that you need to train a neural network model beforehand to learn the embedding.
Check this article for more information.
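A minimal sketch of the idea (PyTorch is assumed here, and all sizes are made up): the ~10 000 item IDs are mapped through a small embedding table that is trained on an auxiliary task, and the learned vectors then replace the wide one-hot block:

import torch
import torch.nn as nn

n_items, emb_dim = 10_000, 16

class ItemModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(n_items, emb_dim)   # one 16-d vector per item
        self.head = nn.Linear(emb_dim, n_items)     # predicts the next interacted item

    def forward(self, item_ids):
        # item_ids: (batch, n_history) -> average the history embeddings
        return self.head(self.emb(item_ids).mean(dim=1))

model = ItemModel()
# ... train with nn.CrossEntropyLoss on (history, next_item) pairs ...

# After training, the embedding table replaces the 10 000-wide one-hot block:
item_vectors = model.emb.weight.detach().numpy()   # shape (10000, 16)

The 16-dimensional vectors can then be fed to the tree model in place of the one-hot encoding.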
XGBoost doesn't support categorical features directly; you need to preprocess them yourself before using it, for example with one-hot encoding. One-hot encoding usually works well when the categorical feature has a few frequent values.
CatBoost does have categorical feature support - both one-hot encoding and the calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with the one_hot_max_size parameter; by default the statistics are calculated. The statistics usually work better for categorical features with many values.
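A rough sketch of the CatBoost side (toy data, arbitrary parameter values): columns listed in cat_features are handled natively, and one_hot_max_size controls the cardinality below which one-hot encoding is used instead of the target statistics:

import pandas as pd
from catboost import CatBoostClassifier

X = pd.DataFrame({
    "item_category": ["books", "music", "books", "games"],
    "country": ["US", "DE", "US", "FR"],
})
y = [1, 0, 1, 0]

model = CatBoostClassifier(
    iterations=100,
    one_hot_max_size=10,   # features with <= 10 unique values get one-hot encoded
    verbose=False,
)
model.fit(X, y, cat_features=["item_category", "country"])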
Assuming you have enough domain expertise, you could create a new categorical column from the existing one.
For example, if your column has the values
A, B, C, D, E, F, G, H
and you know that A, B, C are similar, D, E, F are similar, and G, H are similar, the new column would be
Z, Z, Z, Y, Y, Y, X, X.
In your random forest model you should remove the previous column and include only this new one. Keep in mind that by transforming your features like this you lose some explainability of your model.
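A minimal pandas sketch of that grouping (the column name grade is hypothetical):

import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "C", "D", "E", "F", "G", "H"]})

group_map = {"A": "Z", "B": "Z", "C": "Z",
             "D": "Y", "E": "Y", "F": "Y",
             "G": "X", "H": "X"}

# Replace the original column with the coarser grouping before training.
df["grade_group"] = df["grade"].map(group_map)
df = df.drop(columns=["grade"])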
I'm trying to build a predictive model (random forest, SGD, etc.) using scikit-learn, and it seems like every model only lets you fit text data such as
classifier.fit(X,Y)
...where Y is the target and X is a text feature matrix (count_vec -> tf_idf). Is there any way to have a model which, in addition to the text feature matrix, also contains several categorical variables? Can I simply append them as new columns on the right side of X?
You will need to convert the categorical data first - simply appending string categories to the numeric values from a feature extractor like CountVectorizer or TfidfVectorizer will not work. Here's a SO question and answer on converting categories into numerical feature data that you can append on the right.
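One possible way to do it (column names are made up): encode the categorical columns numerically and stack them to the right of the TF-IDF matrix:

import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "text": ["great product", "terrible support", "great support", "bad product"],
    "channel": ["web", "phone", "web", "email"],
    "region": ["us", "eu", "eu", "us"],
})
y = [1, 0, 1, 0]

X_text = TfidfVectorizer().fit_transform(df["text"])
X_cat = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["channel", "region"]])

X = hstack([X_text, X_cat]).tocsr()           # text columns first, then categorical
clf = RandomForestClassifier(random_state=0).fit(X, y)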
I am working on a text classification problem with multiple text features, and I need to build a model to predict a salary range. Please refer to the sample dataset.
Most resources/tutorials deal with feature extraction on only one column and then predicting the target. I am aware of the usual steps such as text pre-processing, feature extraction (CountVectorizer or TF-IDF) and then applying algorithms.
In this problem I have multiple input text features. How do I handle text classification problems when multiple features are involved? These are the methods I have already tried, but I am not sure if they are the right ones. Kindly provide your inputs/suggestions.
1) Applied data cleaning on each feature separately, followed by TF-IDF, and then logistic regression. Here I tried to see if I could use only one feature for classification.
2) Applied data cleaning on all the columns separately, applied TF-IDF to each feature, merged all the feature vectors into a single feature matrix, and finished with logistic regression.
3) Applied data cleaning on all the columns separately, merged all the cleaned columns into one feature 'merged_text', applied TF-IDF on this merged_text, and followed with logistic regression.
All three methods gave me around 35-40% accuracy on the cross-validation and test sets. I am expecting at least 60% accuracy on the test set, which these methods do not deliver.
Also, I don't understand how to use 'company_name' and 'experience' with the text data; there are about 2000+ unique values in company_name. Please provide input/pointers on how to handle such non-text data in a text classification problem.
Try these things:
Apply text preprocessing on 'job description', 'job designation' and 'key skills': remove all stop words, split the text into words, strip punctuation, and lowercase everything, then apply TF-IDF or CountVectorizer; don't forget to scale these features before training the model.
Convert 'experience' into two features, minimum experience and maximum experience, and treat them as discrete numeric features.
Treat company and location as categorical features and create dummy variables/one-hot encodings before training the model.
Try combining job type and key skills before vectorization and see if that works better.
Use a RandomForestRegressor and tune the hyperparameters n_estimators, max_depth and max_features using GridSearchCV.
Hopefully these will increase the performance of the model.
Let me know how it performs with these changes.
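Putting several of these suggestions together, a rough sklearn pipeline could look like the sketch below (the column names job_description, job_designation, key_skills, company_name, location, min_experience, max_experience and salary are assumed from the question, not taken from the actual dataset):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# A separate TF-IDF per text column; one-hot for categoricals; numeric columns pass through.
preprocess = ColumnTransformer([
    ("desc",   TfidfVectorizer(stop_words="english"), "job_description"),
    ("title",  TfidfVectorizer(stop_words="english"), "job_designation"),
    ("skills", TfidfVectorizer(stop_words="english"), "key_skills"),
    ("cat",    OneHotEncoder(handle_unknown="ignore"), ["company_name", "location"]),
    ("num",    "passthrough", ["min_experience", "max_experience"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(random_state=0)),
])

grid = GridSearchCV(
    model,
    param_grid={
        "rf__n_estimators": [200, 500],
        "rf__max_depth": [None, 20],
        "rf__max_features": ["sqrt", 0.3],
    },
    cv=3,
)
# grid.fit(df, df["salary"])   # df is the training DataFrame with these columns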
I am starting out with scikit-learn and I am trying to transform a set of documents into a format on which I could apply clustering and classification. I have seen the details about the vectorization methods and the tf-idf transformations for loading the files and indexing their vocabularies.
However, I have extra metadata for each document, such as the authors, the division that was responsible, a list of topics, etc.
How can I add features to each document vector generated by the vectorizing function?
You could use the DictVectorizer for the extra categorical data and then use scipy.sparse.hstack to combine them.
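A rough sketch of that combination (the metadata keys are made up): vectorize the text and the metadata dicts separately, then stack the two sparse matrices column-wise:

from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["clustering of documents", "classification of documents"]
meta = [
    {"author": "smith", "division": "research", "topic_count": 3},
    {"author": "jones", "division": "legal", "topic_count": 1},
]

X_text = TfidfVectorizer().fit_transform(docs)
X_meta = DictVectorizer().fit_transform(meta)   # one-hot for strings, as-is for numbers

X = hstack([X_text, X_meta]).tocsr()            # ready for clustering/classification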