How to handle text classification problems when multiple features are involved - python

I am working on a text classification problem with multiple text features and need to build a model to predict salary range. Please refer to the sample dataset.
Most of the resources/tutorials deal with feature extraction on only one column and then predicting the target. I am aware of the usual process: text pre-processing, feature extraction (CountVectorizer or TF-IDF) and then applying algorithms.
In this problem, I have multiple input text features. How should text classification problems be handled when multiple features are involved? These are the methods I have already tried, but I am not sure if they are the right ones. Kindly provide your inputs/suggestions.
1) Applied data cleaning on each feature separately, followed by TF-IDF and then logistic regression. Here I tried to see whether I could use only one feature for classification.
2) Applied data cleaning on all the columns separately, applied TF-IDF to each feature, then merged all the feature vectors to create a single feature vector (a sketch of this approach is at the end of this question). Finally, logistic regression.
3) Applied data cleaning on all the columns separately and merged all the cleaned columns to create one feature, 'merged_text'. Then applied TF-IDF on this merged_text, followed by logistic regression.
All these 3 methods gave me around 35-40% accuracy on cross-validation and the test set. I am expecting at least 60% accuracy on the test set, which is not provided.
Also, I don't understand how to use 'company_name' & 'experience' with the text data; there are 2000+ unique values in company_name. Please provide input/pointers on how to handle numeric data in a text classification problem.
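For reference, a minimal sketch of approach 2 (one TF-IDF per text column, the resulting matrices stacked side by side before logistic regression). Column names such as 'job_description', 'key_skills' and the target 'salary_range' are assumptions for illustration only:

import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("salary_data.csv")              # assumed file name
text_cols = ["job_description", "key_skills"]    # assumed column names

X_train, X_test, y_train, y_test = train_test_split(
    df[text_cols], df["salary_range"], test_size=0.2, random_state=42)

# One vectorizer per text column, fitted on the training data only.
vectorizers = {col: TfidfVectorizer(stop_words="english") for col in text_cols}
X_train_vec = hstack([vectorizers[c].fit_transform(X_train[c].fillna("")) for c in text_cols])
X_test_vec = hstack([vectorizers[c].transform(X_test[c].fillna("")) for c in text_cols])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)
print(clf.score(X_test_vec, y_test))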

Try these things:
Apply text preprocessing on 'job description', 'job designation' and 'key skills': remove all stop words, split the text into words by removing punctuation, lowercase all words, then apply TF-IDF or CountVectorizer; don't forget to scale these features before training the model.
Convert 'experience' into two features, minimum experience and maximum experience, and treat them as discrete numeric features.
Company and location can be treated as categorical features; create dummy variables/one-hot encodings before training the model.
Try combining job type and key skills and then do the vectorization; see if it works better.
Use a Random Forest Regressor and tune its hyperparameters (n_estimators, max_depth, max_features) using GridSearchCV.
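Here is a rough end-to-end sketch of these suggestions using a ColumnTransformer; the column names ('job_description', 'key_skills', 'company_name', 'location', 'experience') and the target 'salary_range' are assumptions, so adapt them to your dataset:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("salary_data.csv")  # assumed file name

# Split the raw 'experience' strings (e.g. "2-5 yrs") into numeric min/max features.
exp = df["experience"].str.extract(r"(\d+)\s*-\s*(\d+)").astype(float).fillna(0)
df["min_exp"], df["max_exp"] = exp[0], exp[1]

preprocess = ColumnTransformer([
    ("desc", TfidfVectorizer(stop_words="english"), "job_description"),
    ("skills", TfidfVectorizer(stop_words="english"), "key_skills"),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["company_name", "location"]),
    ("num", "passthrough", ["min_exp", "max_exp"]),
])

model = Pipeline([("prep", preprocess), ("rf", RandomForestClassifier())])

params = {"rf__n_estimators": [200, 500], "rf__max_depth": [None, 20]}
search = GridSearchCV(model, params, cv=3)
search.fit(df, df["salary_range"])
print(search.best_score_, search.best_params_)

Keeping the vectorizers, encoder and model in a single pipeline also means GridSearchCV re-fits them on each training fold, so no information from the validation folds leaks into the features.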
Hopefully, these will increase the performance of the model.
Let me know how it performs with these changes.

Related

Continuous data prediction with all categorical features analysis

I have a case where I want to predict columns H1 and H2, which are continuous, from all-categorical features, in the hope of finding a combination of features that gives optimal results for H1 and H2. However, the distribution of the categories is uneven; some categories occur only once.
Here's my data:
and my information on category frequency in each column:
What I want to ask:
Does the imbalance of the categorical features greatly affect the predictions? What is the right solution to deal with this problem?
How do you find the optimal combination? Do you have to run a simulation predicting every combination of features with the trained model?
What analytical technique is appropriate to determine the relationship between the features and H1 and H2? So far I'm converting the category data using one-hot encoding and then computing a correlation map.
What ML model can be applied to my case? So far I have tried RF, KNN, and SVR models, but the RMSE is still high.
What keywords describe similar cases and could help me search for articles on Google? This is my first time working on an ML/DS case for a paper.
Thank you very much.
A prediction based on a single observation won't be too reliable, of course. Binning rare categories into a sort of 'other' category is one common approach.
Feature selection is a vast topic (g: filter methods, embedded methods, wrapper methods). Personally I prefer studying mutual information and variance inflation factor first.
We cannot rely on Pearson's correlation when talking about categorical or binary features. The basic approach would be grouping your dataset by categories and comparing the target distributions for each one, running statistical tests perhaps to check whether the difference is significant. Also g: ANOVA, Kendall rank.
That said, preprocessing your data to get rid of useless or redundant features often yields much more improvement than using more complex models or hyperparameter tuning. Regardless, trying out gradient boosting models never hurts (catboost even provides a robust automatic handling of categorical features). ExtraTreesRegressor is less prone to overfitting than classic RF. Linear models should not be ignored either, especially ones like Lasso with embedded feature selection capability.
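As an illustration of the 'other' binning mentioned above, a small pandas sketch (the column name 'category_col' is hypothetical):

import pandas as pd

def bin_rare_categories(series: pd.Series, min_count: int = 5) -> pd.Series:
    """Replace categories occurring fewer than min_count times with 'other'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "other")

df = pd.DataFrame({"category_col": ["a", "a", "a", "b", "c", "c", "d"]})
df["category_binned"] = bin_rare_categories(df["category_col"], min_count=2)
print(df)   # 'b' and 'd' appear only once here, so they become 'other'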

categorical features (object/float) selection for regression problem using python

I have a set of alphanumeric categorical features (c_1,c_2, ..., c_n) and one numeric target variable (prediction) as a pandas dataframe. Can you please suggest to me any feature selection algorithm that I can use for this data set?
I'm assuming you are solving a supervised learning problem like Regression or Classification.
First of all, I suggest transforming the categorical features into numeric ones using one-hot encoding. Pandas provides a useful function that already does this:
dataset = pd.get_dummies(dataset, columns=['feature-1', 'feature-2', ...])
If you have a limited number of features and a model that is not too computationally expensive, you can test every possible combination of features. It is the most thorough way, but it is seldom a viable option.
A possible alternative is to sort all the features by their correlation with the target, then sequentially add them to the model, measure the model's performance, and select the set of features that performs best.
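A sketch of that sequential procedure, assuming dataset has already been one-hot encoded as above and the target column is named 'prediction' (as in the question):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = dataset.drop(columns=["prediction"]).astype(float)
y = dataset["prediction"]

# Rank features by absolute correlation with the target.
ranking = X.corrwith(y).abs().sort_values(ascending=False).index

best_score, best_subset, selected = -float("inf"), [], []
for feature in ranking:
    selected.append(feature)
    # Cross-validated score of the model with the current feature subset.
    score = cross_val_score(LinearRegression(), X[selected], y, cv=5).mean()
    if score > best_score:
        best_score, best_subset = score, list(selected)

print(best_score, best_subset)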
If you have high-dimensional data, you can consider reducing the dimensionality with PCA or another dimensionality reduction technique; it projects the data into a lower-dimensional space, reducing the number of features, though you will obviously lose some information due to the approximation.
These are only some examples of methods to perform feature selection, there are many others.
Final tips:
Remember to split the data into Training, Validation and Test set.
Often data normalization is recommended to obtain better results.
Some models have embedded mechanisms to perform feature selection (Lasso, Decision Trees, ...).

How to use skmultilearn to train models on label specific data

I am using the skmultilearn library to solve a multi-label machine learning problem. There are 5 labels with binary data (0 or 1). Sklearn's logistic regression is being used as the base classifier. But I need to set label-specific features for each classifier, with the label data of one classifier used as a feature of another classifier.
I am not able to figure out on how to do that.
Any help appreciated.
One-vs-Rest is the problem-transformation method for the multi-label problem you are trying to address. You just need to generate a different training set for each simple classifier, so that you have all the combinations between the original attributes and each of the labels. Pandas can be useful for manipulating the data and generating the different datasets for each simple classifier. Note that using this strategy in its original form ignores the relationships between the labels.
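As a minimal illustration of this transformation (one binary logistic regression per label, each with its own training target), with toy feature and label names:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: 6 samples, 3 numeric features, 5 binary labels (names illustrative).
X = pd.DataFrame({"f1": [0, 1, 2, 3, 4, 5],
                  "f2": [1, 0, 1, 0, 1, 0],
                  "f3": [5, 4, 3, 2, 1, 0]})
Y = pd.DataFrame({f"label_{i}": [i % 2, 1, 0, 1, 0, (i + 1) % 2] for i in range(1, 6)})

classifiers = {}
for label in Y.columns:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, Y[label])          # a separate training target per label
    classifiers[label] = clf

# Predict all 5 labels for new samples.
X_new = pd.DataFrame({"f1": [2, 4], "f2": [1, 0], "f3": [3, 1]})
predictions = pd.DataFrame({label: clf.predict(X_new) for label, clf in classifiers.items()})
print(predictions)

If you also want the prediction for one label fed in as a feature for the next label's classifier, that is essentially a classifier chain (see e.g. sklearn.multioutput.ClassifierChain), which does model the relationships between labels.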

NLP: What is the appropriate way to use engineered features in a sklearn pipeline?

I'm conducting a text classification task for the first time (Twitter sentiment analysis), but I'm unsure of how to incorporate engineered features into my sklearn pipeline.
So far I have tried transformations before outputting a classifier. For example:
model = Pipeline([('t', mean_vectorizer), ('logreg', LogisticRegression())])
but all of these basic pipelines yield very low scores. So I want to start conducting grid searches and incorporating my own features.
So far, my data set (X_train) is such that the rows are tweets (single string). This is the format handled by mean_vectorizer (and tfidf_vectorizer if I use it).
Incorporating new features
Take for example 1 new feature, a boolean value for whether or not a positive word exists (just a basic example). I would create a (len(X_train), 1)-dimensional array of boolean values corresponding to each tweet.
My ideas:
After preprocessing the tweets, tokenize them and replace the words with values from a word2index dict. Pad the tweets to equal length, and then concatenate this array with my features. Then pass this into the Pipeline as normal.
Maybe there is a way in which these features can be passed individually into the Pipeline?
Maybe the transformations will have issues using an array of integers instead of strings?
Question
Could someone please advise on the best way to go forward with this using sklearn?
Assume that the data is a list of sentences (training separate from testing) and each sentence is a single string.
I think this will be really helpful for other people starting out with NLP, so please be as general as possible.
Incorporating features into a pipeline can be done using sklearn's FeatureUnion, as suggested by Vivek Kumar with details found here on the scikit-learn website: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html.
When using the Pipeline, it should be split into a number of sections, with 3 important ones: the extraction of the data, then the FeatureUnion, and then a classifier. Within the FeatureUnion, there will likely be multiple pipelines corresponding to different features, such as a bag-of-words model, TF-IDF, ad-hoc features, etc.
As can be seen in detail in the link provided above, the pseudo-structure is like so:
pipeline = Pipeline([
    # Get array of text
    ('text', TextExtractor()),
    # Use FeatureUnion to combine the features used in classification
    ('union', FeatureUnion(
        transformer_list=[
            ('text2', Pipeline([('f', feature1)])),
            ('body_bow', Pipeline([('tfidf', some_vectorizer)])),
            # Pipeline for pulling ad hoc features from text
            ('body_stats', Pipeline([('f2', feature_dictionary)])),
        ])),
    # Use a SVC classifier on the combined features
    ('clf', classifier()),
])
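For completeness, here is a small runnable version of that structure, combining TF-IDF features with a hand-made "contains a positive word" flag; the word list and the data are toy examples:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

POSITIVE_WORDS = {"good", "great", "happy", "love"}

class PositiveWordFlag(BaseEstimator, TransformerMixin):
    """Outputs a single 0/1 column: does the tweet contain a positive word?"""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        flags = [int(any(w in tweet.lower().split() for w in POSITIVE_WORDS))
                 for tweet in X]
        return np.array(flags).reshape(-1, 1)

model = Pipeline([
    ("union", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
        ("pos_flag", PositiveWordFlag()),
    ])),
    ("logreg", LogisticRegression()),
])

tweets = ["I love this", "this is terrible", "great day", "so sad"]
labels = [1, 0, 1, 0]
model.fit(tweets, labels)
print(model.predict(["what a great movie", "awful experience"]))

The custom transformer just has to return a 2-D array with one row per document; FeatureUnion then stacks it next to the (sparse) TF-IDF matrix.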

Computing TF-IDF on the whole dataset or only on training data?

In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author, while pre-processing the data, uses scikit-learn's fit_transform function to get the TF-IDF features of the text for training. The author gives all the text data to the function before separating it into train and test. Is this the correct approach, or must we separate the data first and then perform fit_transform on train and transform on test?
According to the documentation of scikit-learn, fit() is used in order to
Learn vocabulary and idf from training set.
On the other hand, fit_transform() is used in order to
Learn vocabulary and idf, return term-document matrix.
while transform()
Transforms documents to document-term matrix.
On the training set you need to apply both fit() and transform() (or just fit_transform(), which essentially joins both operations); on the testing set, however, you only need to transform() the testing instances (i.e. the documents).
Remember that the training set is used for learning purposes (learning is achieved through fit()), while the testing set is used to evaluate whether the trained model generalises well to new, unseen data points.
For more details you can refer to the article fit() vs transform() vs fit_transform()
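In code, the split looks like this (toy documents for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ["first document", "second document", "another text", "more text here"]
labels = [0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=0)

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)  # learn vocabulary + idf on train only
X_test_tfidf = vectorizer.transform(X_test)        # reuse the train vocabulary on test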
Author gives all text data before separating train and test to
function. Is it a true action or we must separate data first then
perform tfidf fit_transform on train and transform on test?
I would consider this as already leaking some information about the test set into the training set.
I tend to always follow the rule that before any pre-processing first thing to do is to separate the data, create a hold-out set.
As we are talking about text data, we have to make sure that the model is trained only on the vocabulary of the training set: when we deploy a model in real life it will encounter words it has never seen before, so we have to validate on the test set with that in mind.
We have to make sure that the new words in the test set are not a part of the vocabulary of the model.
Hence we have to use fit_transform on the training data and transform on the test data.
If you think about doing cross validation, then you can use this logic across all the folds.
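A sketch of that logic: wrapping the vectorizer and the classifier in a Pipeline makes cross_val_score re-fit the vectorizer on the training portion of every fold, so the test fold's vocabulary never leaks into training (toy data for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["good movie", "bad movie", "great film", "terrible film",
         "loved it", "hated it", "really good", "really bad",
         "awesome plot", "awful plot"]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(model, texts, labels, cv=5)
print(scores.mean())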
