Building a predictive model with text data and other predictors - python

I'm trying to build a predictive model (random forest, sgd, etc.) using scikit-learn and it seems like every model only allows you to fit text data such as
classifier.fit(X,Y)
...where Y is the target and X is a text feature vector (count_vec -> tf_idf). Is there any way to have a model which in addition to the text feature matrix also contains several categorical variables? Can I simply append them as new columns on the right side of X?

You will need to convert the categorical data first: simply appending string categories to the numeric values produced by a feature extractor such as TfidfVectorizer will not work. Here's a SO question and answer on converting categories into numerical feature data that you can then append as extra columns on the right.
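As a rough sketch of what "convert, then append" can look like in scikit-learn: a ColumnTransformer can vectorize the text column and one-hot encode the categorical columns, then stack the results side by side. The column names (text, channel, region) and the toy data are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one free-text column plus two categorical columns.
df = pd.DataFrame({
    "text": [
        "cheap flights to paris",
        "buy cheap watches",
        "flight delayed again",
        "luxury watches on sale",
    ],
    "channel": ["email", "web", "email", "web"],
    "region": ["eu", "us", "eu", "us"],
})
y = [0, 1, 0, 1]

preprocess = ColumnTransformer([
    # TfidfVectorizer expects a 1-D sequence of strings, so the column is given as a plain string.
    ("text", TfidfVectorizer(), "text"),
    # OneHotEncoder expects 2-D input, so the columns are given as a list.
    ("cats", OneHotEncoder(handle_unknown="ignore"), ["channel", "region"]),
])

model = Pipeline([
    ("features", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(df, y)

# The combined matrix: TF-IDF columns on the left, one-hot columns appended on the right.
X = model.named_steps["features"].transform(df)
print(X.shape)
```

Wrapping both steps in one Pipeline also means the same encoding is reapplied automatically at prediction time.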

Related

Encoding large number of categorical features [duplicate]

I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. My output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient, as a user interacts with no more than a couple of hundred of the items at most, and sometimes as few as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embeddings, but for categorical features. In practical terms, you define an embedding of your discrete feature space into a low-dimensional vector space. It can improve your results and save memory. The downside is that you need to train a neural network model to define the embedding beforehand.
Check this article for more information.
XGBoost doesn't support categorical features directly; you need to preprocess them before you can use it with categorical features. For example, you could use one-hot encoding. One-hot encoding usually works well if your categorical feature has some frequent values.
CatBoost does support categorical features, both through one-hot encoding and through calculating various statistics on the categorical features. One-hot encoding has to be enabled with the one_hot_max_size parameter; by default, statistics are calculated. Statistics usually work better for categorical features with many values.
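On the memory concern from the question: if you do go the one-hot route, the encoded matrix does not have to be stored densely. Below is a minimal sketch, assuming each user's history is available as a list of item ids (the data here is randomly generated for illustration), that builds a sparse multi-hot matrix with scikit-learn's MultiLabelBinarizer and compares its footprint with a dense array of the same shape.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.default_rng(0)
n_users, n_items = 500, 10_000

# Hypothetical interaction histories: each user has touched between 5 and 200 items.
histories = [
    rng.choice(n_items, size=rng.integers(5, 201), replace=False)
    for _ in range(n_users)
]

# sparse_output=True returns a scipy CSR matrix instead of a dense array.
mlb = MultiLabelBinarizer(classes=np.arange(n_items), sparse_output=True)
X_items = mlb.fit_transform(histories)

# Rough memory comparison: dense float64 array vs. the three CSR arrays.
dense_bytes = X_items.shape[0] * X_items.shape[1] * 8
sparse_bytes = X_items.data.nbytes + X_items.indices.nbytes + X_items.indptr.nbytes
print(X_items.shape, X_items.nnz)
print(dense_bytes, sparse_bytes)
```

Only the non-zero entries are stored, so the cost grows with the number of interactions rather than with the number of possible items. scikit-learn's random forests accept sparse input like this.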
Assuming you have enough domain expertise, you could create a new categorical column from the existing column.
For example, if your column has the values
A,B,C,D,E,F,G,H
and you know that A,B,C are similar, D,E,F are similar, and G,H are similar, your new column would be
Z,Z,Z,Y,Y,Y,X,X.
In your random forest model you should remove the previous column and include only this new column. Note that by transforming your features like this you lose some explainability of your model.
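The grouping described above could be done with a plain dictionary and pandas' Series.map. This is only a sketch: the column names category and category_group are placeholders, and the mapping itself has to come from your own domain knowledge.

```python
import pandas as pd

# Hypothetical column holding the original categories.
df = pd.DataFrame({"category": list("ABCDEFGH")})

# Domain-knowledge mapping from the original values to coarser groups.
groups = {
    "A": "Z", "B": "Z", "C": "Z",
    "D": "Y", "E": "Y", "F": "Y",
    "G": "X", "H": "X",
}

# Values missing from the mapping would become NaN, so make sure it covers every category.
df["category_group"] = df["category"].map(groups)

# Drop the original column so that only the grouped column is passed to the model.
df = df.drop(columns="category")
print(df)
```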

Prediction using a model in machine learning python

Say I have created a random forest regression model using the test/train data available to me.
This contains feature scaling and categorical data encoding.
Now if I get a new dataset on a new day and I need to use this model to predict the outcome of this new dataset and compare it with the new dataset outcome that I have, do I need to apply feature scaling and categorical data encoding on this dataset as well?
For example .. day 1 I have 10K rows with 6 features and 1 label -- a regression problem
I built a model using this.
On day 2, I get 2K rows with same features and label but of course the data within it would be different.
First, I want to use this model on the day 2 data to predict what the label should be according to my model.
Secondly, using this result I want to compare the outcome of the model against the day 2 original label that I have.
So in order to do this, when I pass the day 2 features as the test set to the model, do I need to first do feature scaling and categorical data encoding on them?
This is somewhat to do with making predictions and validating with the received data in order to assess the data quality of the received data.
You always need to pass the data to the model in the format it expects. If the model has been trained on scaled, encoded, ... data, you need to perform all these transformations every time you push new data into the trained model (for whatever reason). The transformers should be the ones fitted on the training data, not refitted on the new data.
The easiest solution is to use sklearn's Pipeline to create a pipeline with all those transformations included and then use it, instead of the model itself to make predictions for new entries so that all those transformations are automatically applied.
example - automatically applying StandardScaler's scaling feature before passing data into the model:
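from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# the scaler is fitted together with the model and reapplied on every predict
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
# then call pipe.fit(...), pipe.score(...) and pipe.predict(...) on the
# pipeline instead of on the model itself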
The same holds for the dependent variable. If you scaled it before training your model, you will need to scale the new values as well, or apply the inverse operation to the model's output before comparing it with the original dependent variable values.
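For the day 1 / day 2 scenario from the question, a sketch along these lines might help. The data generator, column names (amount, age, segment) and model settings are invented for illustration; the point is only that the pipeline fitted on day 1 is reused unchanged on day 2.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)


def make_day(n_rows):
    """Hypothetical stand-in for loading one day's features and labels."""
    frame = pd.DataFrame({
        "amount": rng.normal(100, 20, n_rows),
        "age": rng.integers(18, 80, n_rows),
        "segment": rng.choice(["a", "b", "c"], n_rows),
    })
    label = 0.5 * frame["amount"] + (frame["segment"] == "a") * 10 + rng.normal(0, 1, n_rows)
    return frame, label


X_day1, y_day1 = make_day(1000)
X_day2, y_day2 = make_day(200)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["amount", "age"]),
    # handle_unknown="ignore" keeps prediction from failing on categories unseen on day 1.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])

pipe.fit(X_day1, y_day1)          # scaler and encoder are fitted on day 1 only
pred_day2 = pipe.predict(X_day2)  # the same fitted transformations are reapplied

# Compare the model's output with the labels that arrived on day 2.
mae_day2 = mean_absolute_error(y_day2, pred_day2)
print(mae_day2)
```

Note that the day 2 data is passed in raw: no separate scaling or encoding step is run by hand.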

Structure of data for multilabel Classification problem

I am working on a prediction problem where a user can have access to multiple targets and each access is stored in a separate row. Below is the data:
import pandas as pd
df = pd.DataFrame({"ID": [12567, 12567, 12567, 12568, 12568], "UnCode": ["LLLLLLL", "LLLLLLL", "LLLLLLL", "KKKKKK", "KKKKKK"],
"CoCode": [1000, 1000, 1000, 1111, 1111], "CatCode": [1, 1, 1, 2, 2], "RoCode": ["KK", "KK", "KK", "MM", "MM"], "Target": [12, 4, 6, 1, 6]
})
**Here ID is unique per user but is repeated when a user has accessed multiple targets, and a target can also be repeated when it is accessed by different IDs**
I have converted this data to OHE and used it for prediction with binary relevance, where my X is constant and the target varies.
The problem I am facing with this approach is that the data becomes sparse, and the number of features in my original data is around 1300.
Can someone tell me whether this approach is correct and what other methods/approaches I could use for this type of problem? Also, can this problem be treated as multilabel classification?
Below is the input data for model

How does decision tree recognize the features from a given text dataset?

I have a binary classification text data in which there are 10 text features.
I use various techniques like Bag of words, TFIDF etc. to convert them to numerical.
I use hstack() to stack all those features together again after processing them.
After converting them to numerical features, each feature now has a large number of columns, so my dataset has around 3000 columns after conversion.
My question is: when I fit this dataset with a decision tree classifier (sklearn), how does the classifier recognize the columns which belong to a particular feature?
For example, the first 51 columns out of 3000 belong to the US_states bag of words.
Now, how will the DT recognize it?
PS: Data before processing is in pandas Dataframe.
After processing, it is a stacked numpy array being input in the classifier.
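The decision tree won't recognize which original feature each column came from. It treats every one of the ~3000 columns as an independent numeric feature and chooses each split on a single column, so any grouping of columns by source feature is something you have to track yourself.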

Mixed parameter types for machine learning

I wish to fit a logistic regression model with a set of parameters. The parameters that I have include three distinct types of data:
Binary data [0,1]
Categorical data which has been encoded to integers [0,1,2,3,...]
Continuous data
I have two questions regarding pre-processing the parameter data before fitting a regression model:
For the categorical data, I've seen two ways to handle this. The first method is to use a one hot encoder, thus giving a new parameter for each category. The second method, is to just encode the categories with integers within a single parameter variable [0,1,2,3,4,...]. I understand that using a one hot encoder creates more parameters and therefore increases the risk of over-fitting the model; however, other than that, are there any reasons to prefer one method over the other?
I would like to normalize the parameter data to account for the large differences between the continuous and binary data. Is it generally acceptable to normalize the binary and categorical data? Should I normalize the categorical and the continuous parameters but not the binary parameters, or can I just normalize all the parameter data types?
I realize I could fit this data with a random forest model and not have to worry much about pre-processing, but I'm curious how this applies with a regression type model.
Thank you in advance for your time and consideration.
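For reference, one common arrangement (a sketch, not a definitive answer to the two questions above) is to leave binary columns as they are, one-hot encode the integer-coded categories, and standardize only the continuous columns. In a linear model a single integer-coded column gets one coefficient and so imposes an ordering on the categories, which one-hot encoding avoids. The column names and data below are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300

# Hypothetical data with the three parameter types from the question.
X = pd.DataFrame({
    "is_member": rng.integers(0, 2, n),  # binary [0, 1]
    "plan": rng.integers(0, 4, n),       # categorical, integer-coded [0, 1, 2, 3]
    "spend": rng.normal(50, 15, n),      # continuous
})
y = ((X["spend"] > 50) | (X["plan"] == 2)).astype(int)

preprocess = ColumnTransformer([
    ("binary", "passthrough", ["is_member"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ("continuous", StandardScaler(), ["spend"]),
])
clf = Pipeline([
    ("prep", preprocess),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(X, y)

# One coefficient per binary column, per category level, and per continuous column.
n_coefs = clf.named_steps["logreg"].coef_.shape[1]
print(n_coefs, clf.score(X, y))
```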
