Preprocess large datafile with categorical and continuous features - python

First, thanks for reading this, and thanks a lot if you can give any clue to help me solve it.
As I'm new to scikit-learn, don't hesitate to give any advice that could help me improve the process and make it more professional.
My goal is to classify data between two categories. I would like to find the solution that gives me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.
In my data I have 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line:
"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97
I have around 900K lines for training, and I run my tests over 100K lines.
As I want to compare several algorithm implementations, I wanted to encode all the nominal values so they can be used with several classifiers.
I tried several things:
LabelEncoder: this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
OneHotEncoder: if I understand it well, it is almost perfect for my needs because I can select the columns to binarize. But as I have a lot of nominal values, it always ends in a MemoryError. Moreover, its input must be numerical, so everything has to be label-encoded first.
StandardScaler: this is quite useful, but not for what I need here. I decided to use it to scale my continuous values.
FeatureHasher: at first I didn't understand what it does; then I saw that it is mainly used for text analysis. I tried to use it for my problem by creating a new array containing the result of the transformation, but I don't think it was built to work that way, and it wasn't even logical.
DictVectorizer: could be useful, but it looks like OneHotEncoder and puts even more data in memory.
partial_fit: this method is provided by only five classifiers. I would like to be able to use at least Perceptron, KNearest and RandomForest, so it doesn't match my needs.
I looked at the documentation and found this information on the Preprocessing and Feature Extraction pages.
I would like a way to encode all the nominal values so that they are not treated as ordered, and that can be applied to large datasets with a lot of categories and limited resources.
Is there any way I didn't explore that can fit my needs?
Thanks for any clue and piece of advice.

To convert unordered categorical features you can try get_dummies in pandas; its documentation has more details. Another option is CatBoost, which can handle categorical features directly, without transforming them into a numerical type.
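Below is a minimal pandas sketch of the get_dummies approach; the DataFrame and column names are made up for illustration, and sparse=True is one way to keep the dummy matrix memory-friendly on a wide dataset:

```python
import pandas as pd

# Hypothetical example: df stands in for the full dataset, and nominal_cols
# lists the nominal columns to binarize (names are illustrative).
df = pd.DataFrame({
    "brand": ["RENAULT", "PEUGEOT", "RENAULT"],
    "fuel": ["Diesel", "Essence", "Diesel"],
    "price": [668.77, 712.10, 247.97],
})
nominal_cols = ["brand", "fuel"]

# get_dummies one-hot encodes only the listed columns; sparse=True stores
# the dummies as SparseArrays, which reduces memory usage on wide data.
encoded = pd.get_dummies(df, columns=nominal_cols, sparse=True)
print(encoded.head())
```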

Related

How should I handle NaN values in a Finance DF?

I am a beginner in machine learning. My question is: how should I encode the column "OECDSTInterbkRate"? I don't know how to replace the missing values, and especially with what. Should I just delete them, or replace them with the mean/median of the values?
There are many approaches to this issue.
The simplest: if you have a huge amount of data, drop the NaNs.
Replace the NaNs with the mean/median/etc. of the whole non-NaN dataset, or of the dataset grouped by one or several columns. E.g. for your dataset you can fill the Australian NaNs with the mean of the Australian non-NaNs, and do the same for the other countries.
A common approach is to create an additional indicator column after the imputation, marking the rows where the missing data was replaced with a value. This column is then fed to your ML algorithm as yet another input.
Take a look at the docs (assuming you work with Pandas) - the developers of the library have already created some tools for the missing data: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
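As a minimal pandas sketch of the group-wise fill plus the indicator column, assuming a "Country" grouping column exists (all column names other than "OECDSTInterbkRate" are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame; "Country" is an assumed grouping column for illustration.
df = pd.DataFrame({
    "Country": ["Australia", "Australia", "Japan", "Japan"],
    "OECDSTInterbkRate": [1.5, np.nan, 0.1, np.nan],
})

# Keep an indicator of where values were missing before imputation.
df["OECDSTInterbkRate_was_missing"] = df["OECDSTInterbkRate"].isna()

# Fill each country's NaNs with that country's mean of the observed values.
df["OECDSTInterbkRate"] = df["OECDSTInterbkRate"].fillna(
    df.groupby("Country")["OECDSTInterbkRate"].transform("mean")
)
print(df)
```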
There's no specific answer to your question; it's a general problem in statistics called "imputation", and depending on the application the answer could be many things.
A few alternatives come to mind, but don't forget that "no data" is almost always better than "bad/wrong data". If you still have more than enough rows after dropping the rows with NaNs, you may simply drop them. Otherwise you can consider the following:
Can you mathematically derive the column you need from the other columns already in your dataset? If so, you have your answer.
Check the correlation of that column (using its non-missing rows) with the other columns and see whether they are highly correlated. If so, you might just as well drop the whole column (not always a good idea, but often a reasonable one).
Can you build an estimator (such as a regression model) that predicts the missing values with really good accuracy, by learning the pattern from the values you already have and the other columns? Then you may have an answer (benchmark it against the options below; a scikit-learn sketch follows below). Keep in mind that this is a risky operation that can produce bad estimates and degrade the performance of your overall model, so try it only if your estimations are really good.
Is it a regression problem? Using the statistical mean could be a good idea.
Is it a classification problem? Using the median could be a good idea.
In some cases, using the mode might also be a good idea, depending on the distribution.
I suggest you try all of these and see which one works best, because there's really no concrete answer to your problem. You can build a model without the column and use its performance as a baseline, then benchmark the performance (accuracy) of each approach against that baseline.
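As a minimal sketch of the estimator-based option with scikit-learn: IterativeImputer models each column that has missing values as a regression on the other columns (the toy matrix here is purely illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric matrix with missing entries; in practice X would be your
# feature matrix (values here are illustrative only).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
])

# Each column with NaNs is modelled as a regression on the other columns,
# iterating until the imputed values stabilise.
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Benchmark a model trained on such imputed data against the no-column baseline described above before trusting it.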
Note: I am just a graduate student with some insights, so please comment if anything I said is not correct!

Null values in dataset

I'm using a dataset to predict the effects of covid-19 on the economy. The dataset contains 9k rows, and around 1k rows in each column are empty. Do I need to fill them in manually by looking at other datasets online, can I fill in the average, or should I leave the dataset as it is?
Generally, I'd say that combining datasets from multiple sources without being really clear about your rationale can raise pretty big questions about the reliability of your data.
Otherwise, either assuming averages or leaving the nulls are both valid options depending on what you're trying to do. If you're using scikit-learn (for example) you'll probably find that nulls throw errors, so filling with assumed averages is a relatively common thing to do, although you might want to watch out given you've got more than 10% nulls!
From experience, I'd say think about what you're trying to do and what will best get you there, then be really clear about presenting your methodology alongside your findings.
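If you do go the averages route with scikit-learn, here is a minimal sketch using SimpleImputer (the matrix below is illustrative; add_indicator=True also keeps track of which values were filled in):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative matrix with nulls; replace with your own economic features.
X = np.array([
    [1.2, np.nan, 3.0],
    [0.8, 2.1, np.nan],
    [1.0, 2.0, 2.5],
])

# Fill nulls with the column mean and append binary "was missing" columns
# so downstream models can still see where values were imputed.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_filled = imputer.fit_transform(X)
print(X_filled)
```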

How to deal with the categorical variable of more than 33 000 cities?

I work in Python. I have a problem with the categorical variable "city".
I'm building a predictive model on a large dataset of over 1 million rows.
I have over 100 features. One of them is "city", consisting of 33 000 different cities.
I use e.g. XGBoost, where I need to convert categorical variables into numeric ones. Dummifying causes the number of features to increase sharply, and XGBoost (and my 20 GB of RAM) can't handle this.
Is there any other way to deal with this variable than e.g. one-hot encoding, dummies, etc.?
(When using one-hot encoding, for example, I have performance problems: there are too many features in my model and I run out of memory.)
Is there any way to deal with this?
XGBoost has also, since version 1.3.0, added experimental support for categorical features.
Copying my answer from another question (Nov 23, 2020):
XGBoost has since version 1.3.0 added experimental support for categorical features. From the docs (1.8.7 Categorical Data):
Other than users performing encoding, XGBoost has experimental support for categorical data using gpu_hist and gpu_predictor. No special operation needs to be done on input test data since the information about categories is encoded into the model during training.
https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf
In the DMatrix section the docs also say:
enable_categorical (boolean, optional) – New in version 1.3.0. Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it's only available for gpu_hist tree method with 1 vs rest (one hot) categorical split. Also, JSON serialization format, gpu_predictor and pandas input are required.
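As a rough sketch of what using this experimental flag can look like (the exact requirements depend on your XGBoost version, the categorical column must use pandas' category dtype, and gpu_hist needs a GPU build; the data below is invented):

```python
import pandas as pd
import xgboost as xgb

# Illustrative data: "city" must be a pandas categorical column for
# enable_categorical to apply.
df = pd.DataFrame({
    "city": pd.Categorical(["Paris", "Lyon", "Paris", "Nice"]),
    "income": [30_000, 28_000, 35_000, 27_000],
    "target": [1, 0, 1, 0],
})

dtrain = xgb.DMatrix(
    df[["city", "income"]],
    label=df["target"],
    enable_categorical=True,  # experimental categorical support (>= 1.3.0)
)

# Per the 1.3.0 docs this currently requires the gpu_hist tree method.
params = {"tree_method": "gpu_hist", "objective": "binary:logistic"}
booster = xgb.train(params, dtrain, num_boost_round=10)
```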
Other model options:
If you don't need to use XGBoost, you can use a model like LightGBM or CatBoost, which support categorical features out of the box, without one-hot encoding.
You could use some kind of embedding that represents those cities better (and compresses the total number of features compared to direct OHE), for example some features describing the continent each city belongs to, then some other features describing the country/region, and so on.
Note that since you didn't provide any specific detail about this task, I've used only geographical data in my example, but you could use other variables related to each city, like the mean temperature, the population, the area, etc., depending on the task you are trying to address here.
Another approach could be replacing the city name with its coordinates (latitude and longitude). Again, this may be helpful depending on the task for your model.
Hope this helps
Besides changing the model, you could also decrease the number of features (cities) by grouping them into geographical regions. Another option is grouping them by population size.
Yet another option is grouping them by their frequency using quantile bins, and target encoding might also work for you (a short pandas sketch of these two ideas follows below).
Feature engineering often involves a lot of manual work; unfortunately, you cannot always have everything sorted out automatically.
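A minimal pandas sketch of the frequency-bin and target-encoding ideas, on invented data (note that naive target encoding leaks the target; in practice compute it out-of-fold):

```python
import pandas as pd

# Illustrative frame: "city" has many levels, "y" is the binary target.
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Nice", "Paris", "Lyon"],
    "y":    [1,       0,      1,       0,      1,       1],
})

# Frequency binning: replace each city by the quantile bin of its frequency.
freq = df["city"].map(df["city"].value_counts())
df["city_freq_bin"] = pd.qcut(freq, q=2, labels=False, duplicates="drop")

# Naive target encoding: replace each city by the mean target for that city.
# Compute this out-of-fold in a real pipeline to avoid target leakage.
df["city_target_enc"] = df.groupby("city")["y"].transform("mean")
print(df)
```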
There are already great responses here.
Another technique I would use is to cluster those cities into groups with k-means, using some of the city-specific features in your dataset.
That way you could use the cluster number in place of the actual city, which could reduce the number of levels quite a bit.
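A small sketch of that idea, with made-up city-level features (in practice use whatever attributes of your cities you can gather):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up city-level features: latitude, longitude, population.
cities = pd.DataFrame({
    "city": ["Paris", "Lyon", "Nice", "Lille"],
    "lat":  [48.86, 45.76, 43.70, 50.63],
    "lon":  [2.35, 4.84, 7.27, 3.06],
    "pop":  [2.1e6, 5.2e5, 3.4e5, 2.3e5],
})

# Cluster the cities into a handful of groups and use the cluster id as
# the model feature instead of the raw city name.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cities["city_cluster"] = kmeans.fit_predict(cities[["lat", "lon", "pop"]])

# Then map the cluster id back onto the main DataFrame by city name, e.g.:
# df["city_cluster"] = df["city"].map(cities.set_index("city")["city_cluster"])
```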

Handle missing values: when 99% of the data is missing from most columns (important ones)

I am facing a dilemma with a project of mine: a few of the variables don't have enough data, meaning almost 99% of the observations are missing.
I am thinking of a couple of options:
Impute the missing values with mean/kNN imputation
Impute the missing values with 0
I couldn't think of anything else in this direction. If someone can help, that would be great.
P.S. I am not comfortable using mean imputation when 99% of the data is missing. If someone has a reasoning for or against it, kindly let me know.
The data has 397,576 observations; the per-column missing-value counts were shown in an attached screenshot (not reproduced here).
99% of the data is missing!!!???
Well, if your dataset has fewer than 100,000 examples, then you may want to remove those columns instead of imputing them with any method.
If you have a larger dataset, then mean imputation or kNN imputation would be... OK. These methods don't capture the statistics of your data and can eat up memory. Instead, use Bayesian machine-learning methods, like fitting a Gaussian process to your data, or a variational auto-encoder for those sparse columns.
1.) Here are a few links to learn about and use Gaussian processes to sample missing values for the dataset:
What is a Random Process?
How to handle missing values with GP?
2.) You can also use a VAE to impute the missing values!
Try reading this paper.
I hope this helps!
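As a heavily caveated sketch of the Gaussian-process idea with scikit-learn: fit a GP regressor on the few rows where the sparse column is observed, using the other columns as inputs, then predict the missing entries (all names and data below are invented, and with 99% missing the predictions deserve a lot of scepticism):

```python
import numpy as np
import pandas as pd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Invented data: "sparse_col" is mostly missing, "x1"/"x2" are fully observed.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
mask = rng.random(200) < 0.05  # only ~5% observed here, just for the demo
df["sparse_col"] = np.where(mask, 2.0 * df["x1"], np.nan)

observed = df["sparse_col"].notna()
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(df.loc[observed, ["x1", "x2"]], df.loc[observed, "sparse_col"])

# Predict only the missing entries and remember which ones were imputed.
df["sparse_col_was_missing"] = ~observed
df.loc[~observed, "sparse_col"] = gp.predict(df.loc[~observed, ["x1", "x2"]])
```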
My first question before giving a good answer would be:
What are you actually trying to achieve with the completed data?
People impute data for different reasons, and the use case makes a big difference. For example, you could use imputation as:
Preprocessing step for training a machine learning model
Solution to have a nice Graphic/Plot that does not have gaps
Statistical inference tool to evaluate scientific or medical studies
99% missing data is a lot; in most cases you can expect that nothing meaningful will come out of this.
For some variables it still might make sense and produce at least something meaningful, but you have to handle this with care and think a lot about your solution.
In general, imputation does not create entries out of thin air: a pattern must be present in the existing data, which is then applied to the missing data.
You probably will have to decide on a variable basis what makes sense.
Take your variable "email" as an example:
Depending on how your data is structured, it might be that each row represents a different customer with a specific email address, so every row is supposed to contain a unique address. In that case imputation won't have any benefit: how should the algorithm guess the email? But if the data is structured differently and customers appear in multiple rows, then an algorithm can still fill in something meaningful, e.g. seeing that customer number 4 always has the same address and filling it in for rows where only customer number 4 is given and the email is missing.
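A tiny pandas sketch of that customer-wise fill, with hypothetical customer_id and email columns:

```python
import pandas as pd

# Hypothetical frame: customers appear on several rows and the email is
# sometimes missing (column names and values are invented).
df = pd.DataFrame({
    "customer_id": [4, 4, 4, 7],
    "email": ["a@x.com", None, "a@x.com", None],
})

# For each customer, fill missing emails with that customer's first
# observed address; customer 7 stays missing because nothing is known.
df["email"] = df["email"].fillna(
    df.groupby("customer_id")["email"].transform("first")
)
print(df)
```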

Outlier detection in non-normally distributed data

I have a big dataset containing the yearly reports of companies.
In this dataset I want to detect errors/outliers, which are mainly human input errors. I have trouble deciding on the best strategy for this problem, since my data is not normally distributed.
My dataset contains about 100 columns.
Does anyone have some input on techniques for detecting human errors?
Think of comma errors, too many zeros, etc.
Thank you in advance
Well, it looks like a complicated problem.
It seems your data has the following characteristics:
1. NLP is needed: company reports are pieces of text, so NLP has to be applied to analyse them.
2. High dimensionality: you currently have about 100 columns, and considering the decomposed NLP result you might end up with thousands of columns in certain cases.
3. Not normally distributed.
To solve it, you may try to:
1. Use NLP to transform the text into numeric information.
2. Use typical novelty or outlier detection tools to find the errors; you can try the scikit-learn models.
https://scikit-learn.org/stable/modules/outlier_detection.html
Hope it can help you.
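For the scikit-learn step, one option that makes no normality assumption is IsolationForest; here is a small sketch on invented report figures, with one simulated "extra zeros" error:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Invented numeric report figures; a "comma error" or extra zeros shows up
# as a value off by a factor of 100 or 1000.
rng = np.random.default_rng(0)
X = rng.normal(loc=1_000, scale=50, size=(500, 5))
X[10, 2] = 1_000_000  # simulated human input error

# IsolationForest does not assume a normal distribution; contamination is
# a rough guess of the share of erroneous rows.
clf = IsolationForest(contamination=0.01, random_state=0)
labels = clf.fit_predict(X)   # -1 = flagged as outlier, 1 = inlier
print(np.where(labels == -1)[0])
```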
