Building a machine learning model with a dataset that has only string values - python

I am working with a dataset consisting of 190 columns and more than 3 million rows.
Unfortunately, all of the data is stored as string values.
Is there any way of building a model with this kind of data, other than tokenising it?
Thank you and regards!

This may not answer your question fully, but I hope it will shed some light on what you can and cannot do.
Firstly, I don't think there is any straightforward answer, as most ML models depend on the data that you have. If your string data is simply Yes or No (binary), it can easily be dealt with by replacing Yes = 1 and No = 0, but that doesn't work for something like country.
One-Hot Encoding - For features like country, it is fairly simple to one-hot encode them and train the model on the resulting numerical data. But with the number of columns you have, and given how many unique values such a large dataset is likely to contain, the dimensionality will increase dramatically.
Assigning numeric values - We also cannot simply assign arbitrary numeric values to the strings and expect the model to work, as there is a very high chance that the model will pick up on a numeric ordering that does not exist in the first place. (more info)
Bag of Words, Word2Vec - Since you excluded tokenisation, I am not sure whether you want to go down this route, but these options exist as well.
Breiman's randomForest in R - "This implementation would allow you to use actual strings as inputs." I am not familiar with R, so I cannot confirm how far this is true. Nevertheless, you can find more about it here.
One-Hot + VectorAssembler - I only came up with this in theory. If you manage to convert your string data into numeric values (using one-hot or other encoders) while preserving the underlying representation of the data, the numerical features can be combined into a single vector using VectorAssembler in PySpark (Spark); see the sketch below. More info about VectorAssembler is here.
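As a rough illustration of that last option, here is a minimal PySpark sketch. The file path and column names are placeholders, and it assumes Spark 3.x (where OneHotEncoder accepts inputCols/outputCols):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", header=True)           # placeholder path

string_cols = ["country", "gender"]                     # placeholder column names

# index each string column, then one-hot encode the indices
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in string_cols]
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in string_cols],
                        outputCols=[c + "_vec" for c in string_cols])

# combine all encoded columns into a single feature vector
assembler = VectorAssembler(inputCols=[c + "_vec" for c in string_cols],
                            outputCol="features")

pipeline = Pipeline(stages=indexers + [encoder, assembler])
features = pipeline.fit(df).transform(df).select("features")
```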

Related

How to deal with the categorical variable of more than 33 000 cities?

I work in Python. I have a problem with the categorical variable "city".
I'm building a predictive model on a large dataset of over 1 million rows.
I have over 100 features. One of them is "city", consisting of 33 000 different cities.
I use e.g. XGBoost, where I need to convert categorical variables into numeric ones. Dummifying causes the number of features to increase enormously, and XGBoost (and my 20 GB of RAM) can't handle this.
Is there any other way to deal with this variable than e.g. One Hot Encoding or dummies?
(When using One Hot Encoding, for example, I run into performance problems: there are too many features in my model and I'm running out of memory.)
Is there any way to deal with this?
XGBoost has, since version 1.3.0, added experimental support for categorical features.
Copying my answer from another question (Nov 23, 2020). From the docs:
1.8.7 Categorical Data
Other than users performing encoding, XGBoost has experimental support for categorical data using gpu_hist and gpu_predictor. No special operation needs to be done on input test data since the information about categories is encoded into the model during training.
https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf
In the DMatrix section the docs also say:
enable_categorical (boolean, optional) – New in version 1.3.0. Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it's only available for gpu_hist tree method with 1 vs rest (one hot) categorical split. Also, JSON serialization format, gpu_predictor and pandas input are required.
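A minimal sketch of what those docs describe might look like this. Column names and data are placeholders; it assumes XGBoost >= 1.3.0, a GPU, and pandas categorical-dtype input, per the requirements above:

```python
import pandas as pd
import xgboost as xgb

df = pd.read_csv("data.csv")                      # placeholder path
df["city"] = df["city"].astype("category")        # pandas categorical dtype is required

X = df[["city", "some_numeric_feature"]]          # placeholder feature set
y = df["target"]                                  # placeholder target

dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)
params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",                    # per the docs, only gpu_hist is supported
    "predictor": "gpu_predictor",
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```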
Other model options:
If you don't need to use XGBoost, you can use a model like LightGBM or CatBoost, which support categorical features without one-hot encoding out of the box.
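For example, a minimal sketch with both libraries (the DataFrame, target, and column name are placeholders): CatBoost takes the categorical columns explicitly, while LightGBM picks up the pandas "category" dtype automatically.

```python
import lightgbm as lgb
from catboost import CatBoostClassifier

# X is a pandas DataFrame with a string "city" column, y the target (placeholders)

# CatBoost: just name the categorical columns, raw strings are fine
cb_model = CatBoostClassifier(cat_features=["city"], verbose=0).fit(X, y)

# LightGBM: convert the column to pandas "category" dtype and it is handled natively
X_lgb = X.copy()
X_lgb["city"] = X_lgb["city"].astype("category")
lgb_model = lgb.LGBMClassifier().fit(X_lgb, y)
```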
You could use some kind of embedding that better reflects those cities (and reduces the total number of features compared to direct OHE), maybe using some features to describe the continent each city belongs to, then some other features to describe the country/region, etc.
Note that since you didn't provide any specific detail about this task, I've used only geographical data in my example, but you could use other variables related to each city, like the mean temperature, the population, the area, etc., depending on the task you are trying to address here.
Another approach could be replacing the city name with its coordinates (latitude and longitude). Again, this may be helpful depending on the task for your model.
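For the coordinates idea, a minimal sketch could look like this; city_coordinates.csv is a hypothetical lookup table (e.g. built from a public gazetteer) with columns city, lat, lon:

```python
import pandas as pd

# hypothetical lookup table: one row per city with its latitude/longitude
city_coords = pd.read_csv("city_coordinates.csv")         # columns: city, lat, lon

df = df.merge(city_coords, on="city", how="left")          # df is your training data
df = df.drop(columns=["city"])                             # 2 numeric features instead of 33 000 dummies
```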
Hope this helps
Besides the models, you could also decrease the number of feature levels (cities) by grouping them into geographical regions. Another option is grouping them by population size.
Another option is grouping them by their frequency using quantile bins. Target encoding might be another option for you.
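A minimal pandas sketch of the frequency/quantile-bin idea (the DataFrame and column name are placeholders):

```python
import pandas as pd

# how often each city occurs in the data
city_freq = df["city"].map(df["city"].value_counts())

# group cities into 10 frequency bins instead of 33 000 levels
df["city_freq_bin"] = pd.qcut(city_freq, q=10, labels=False, duplicates="drop")
```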
Feature engineering in many cases involves a lot of manual work; unfortunately, you cannot always have everything sorted out automatically.
There are already great responses here.
Another technique I would use is to cluster those cities into groups using K-means, based on some of the city-specific features in your dataset.
That way you can use the cluster number in place of the actual city, which could reduce the number of levels quite a bit.
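A minimal sketch of that idea, assuming you have (or can build) a small table of per-city features such as population and coordinates; all column names here are placeholders:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# hypothetical per-city feature table
city_features = df.groupby("city")[["population", "lat", "lon"]].first()

scaled = StandardScaler().fit_transform(city_features)
city_features["city_cluster"] = KMeans(n_clusters=50, random_state=0,
                                       n_init=10).fit_predict(scaled)

# replace the 33 000-level "city" column with ~50 cluster ids
df = df.merge(city_features[["city_cluster"]], left_on="city", right_index=True)
df = df.drop(columns=["city"])
```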

Encoding Categorical Variables like "State Names"

I have a categorical column with state names. I'm unsure which type of categorical encoding I should perform in order to convert them to a numeric type.
There are 83 unique state names.
A label encoder is used for ordinal categorical variables, while one-hot encoding would increase the number of columns, since there are 83 unique state names.
Is there anything else I can try?
I would use scikit-learn's OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) or CategoricalEncoder with encoding set to 'onehot'. It automatically finds the unique values for each feature and turns them into a one-hot vector. It does increase the input dimensionality for that feature, but it is necessary for this kind of work. If you convert the feature to a single ordinal integer instead of a vector of binary values, an algorithm may draw incorrect conclusions about two (possibly completely separate) categorical values that just happen to be close together in the categorical space.
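A minimal sketch with scikit-learn's OneHotEncoder; the file path and column name are placeholders, and the output stays sparse by default, so 83 extra columns cost little memory:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("data.csv")                           # placeholder path

ohe = OneHotEncoder(handle_unknown="ignore")           # sparse output by default
state_matrix = ohe.fit_transform(df[["state_name"]])   # shape: (n_samples, 83)
```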
There are other powerful encoding schemes besides one-hot, which do not increase the number of columns. You can try the following (in increasing order of complexity):
count encoding: encode each category by the number of times it occurs in the data, which is useful in some cases. For example, if you want to encode the information that New York is a big city, the count of NY in the data really contains that information, as we expect NY to occur frequently.
target encoding: encode each category by the average value of the target/outcome (if the target is continuous) within that category, or by the probability of the target if it is discrete. An example is when you want to encode neighborhood, which is obviously important for predicting house price: you can replace each neighborhood name with the average house price in that neighborhood. This improves prediction considerably (as shown in my Kaggle notebook for predicting house price).
There are also other useful encoding schemes such as CatBoost encoding, weight of evidence, etc. A really nice thing is that all of these schemes are already implemented in the categorical encoders library here.
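A minimal sketch of count and target encoding, assuming the library referred to is category_encoders; the column name, splits, and target are placeholders, and the target encoder is fitted on the training split only to avoid leakage:

```python
import category_encoders as ce

# count encoding: each state is replaced by how often it occurs
count_enc = ce.CountEncoder(cols=["state_name"])
X_train_ce = count_enc.fit_transform(X_train)
X_test_ce = count_enc.transform(X_test)

# target encoding: each state is replaced by the mean target value within it
target_enc = ce.TargetEncoder(cols=["state_name"])
X_train_te = target_enc.fit_transform(X_train, y_train)   # fit on training data only
X_test_te = target_enc.transform(X_test)
```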

Preparing CSV data for ML

I would like to implement an ML model for a classification problem. My CSV data looks like this:
Method1; Method2; Method3; Method4; Category; Class
result1; result2; result3; result4; Sport; 12
...
...
Each method gives a text. Sometimes it is a single word, sometimes more, and sometimes the cell is empty (no answer from that method). The "category" column always has a text, and the "class" column is a number indicating which methods gave the correct answer (i.e. the number 12 means that only the results from methods 1 and 2 are correct). I may add more columns if necessary.
Now, given new answers from all the methods, I would like to classify them into one of the classes.
How should I prepare this data? I know the data should be numerical, but how do I do that, and how do I handle all the empty cells and the inconsistent number of words in each answer?
There are many different ways of doing this, but the simplest would be to just use a Bag of Words representation, which means concatenating all your Methodx columns and counting how many times each word appears in them.
With that, you have a vector representation (each word is a column/feature, each count is a numerical value).
Now, from here there are several problems (the main one is that the number of columns/features in your dataset will be quite large), so you may have to either preprocess your data further or find an ML technique that can deal with it for you. In any case, I would recommend having a look at several NLP tutorials to get a better idea of this and a better sense of what the best solution for your dataset would be.
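A minimal sketch of that approach; the file name and separator follow the question's example, and the rest are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("data.csv", sep=";")                      # placeholder file name
method_cols = ["Method1", "Method2", "Method3", "Method4"]

# concatenate the method answers row-wise; empty cells simply contribute nothing
text = df[method_cols].fillna("").astype(str).agg(" ".join, axis=1)

vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(text)                    # sparse bag-of-words matrix
y = df["Class"]
```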

Preprocess large datafile with categorical and continuous features

First, thanks for reading, and thanks a lot if you can give any clue to help me solve this.
As I'm new to scikit-learn, don't hesitate to provide any advice that can help me improve the process and make it more professional.
My goal is to classify data into two categories. I would like to find a solution that gives me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.
In my data I have 24 values: 13 are nominal, 6 are binarized, and the others are continuous. Here is an example of a line:
"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97
I have around 900K lines for training and I do my testing over 100K lines.
As I want to compare several algorithm implementations, I wanted to encode all the nominal values so they can be used with several classifiers.
I tried several things:
LabelEncoder: this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
OneHotEncoder: if I understand well, it is almost perfect for my needs because I can select the columns to binarize. But as I have a lot of nominal values, it always ends in a MemoryError. Moreover, its input must be numerical, so it is compulsory to LabelEncode everything first.
StandardScaler: this is quite useful, but not for what I need here. I decided to use it to scale my continuous values.
FeatureHasher: at first I didn't understand what it does; then I saw that it is mainly used for text analysis. I tried to use it for my problem. I cheated by creating a new array containing the result of the transformation. I don't think it was built to work that way, and it wasn't even logical.
DictVectorizer: could be useful, but it looks like OneHotEncoder and puts even more data in memory.
partial_fit: this method is offered by only 5 classifiers. I would like to be able to use it with Perceptron, KNearest, and RandomForest at least, so it doesn't match my needs.
I looked at the documentation and found this information on the Preprocessing and Feature Extraction pages.
I would like a way to encode all the nominal values so that they are not treated as ordered, and that can be applied to large datasets with a lot of categories and limited resources.
Is there any way I didn't explore that can fit my needs?
Thanks for any clue and piece of advice.
To convert unordered categorical features you can try get_dummies in pandas; for more details, refer to its documentation. Another way is to use CatBoost, which can handle categorical features directly without transforming them into a numerical type.
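A minimal sketch of the get_dummies route; the column names are placeholders, and sparse=True keeps the memory footprint manageable for high-cardinality columns:

```python
import pandas as pd

nominal_cols = ["brand", "model", "fuel", "month"]          # placeholder nominal columns
df_encoded = pd.get_dummies(df, columns=nominal_cols, sparse=True)
```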

input for scikit-learn random forest

I am trying to predict the outcome of tennis matches - just a fun side project. I'm using a random forest regressor to do this. Now, one of the features is the ranking of a player before a specific match. For many matches I don't have a ranking (I only have the first 200 ranked players). The question is: is it better to put in a value that is not an integer, like the string "NoRank", or to put in an integer that is beyond the range of 1-200? Considering the learning algorithm, I'm inclined to put in the value 201, but I would like to hear your opinions on this.
Thanks!
scikit-learn random forests do not support missing values, unfortunately. If you think that unranked players are likely to behave worse than players ranked 200 on average, then imputing the rank 201 makes sense.
Note: all scikit-learn models expect homogeneous numerical input features, not string labels or other Python objects. If you have string labels as features, you first need to find the right feature extraction strategy depending on the meaning of your string features (e.g. categorical variable identifiers, or free text to be extracted as a bag of words).
I would be careful with just assigning 201 (or any other value) to the unranked players.
RF normalizes the data (see "Do I need to normalize (or scale) data for randomForest (R package)?"), which means it may group 200 with 201 in a split, or it might not; you are basically faking data that you do not have.
I would add another column, "haverank", and use 0/1 for it: 0 for players without a rank, 1 for players with a rank. Call it "highrank" if that name sounds better.
You can also add another column named "veryhighrank" and give the value 1 to all players ranked 1-50, etc.
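A minimal pandas sketch combining both suggestions, the sentinel value 201 and the indicator column (the DataFrame and column name are placeholders):

```python
import pandas as pd

# "rank" is NaN for players outside the top 200 (placeholder column name)
df["haverank"] = df["rank"].notna().astype(int)   # 1 = ranked, 0 = unranked
df["rank"] = df["rank"].fillna(201)               # sentinel just outside the 1-200 range
```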
