Scikit Learn Categorical data with random forests - python

I am working on the Titanic survival challenge on Kaggle: https://www.kaggle.com/c/titanic.
I am not experienced in R, so I am using Python and scikit-learn for the random forest classifier.
I see many people using scikit-learn convert their categorical variables with many levels into dummy variables.
I don't understand the point of doing this: why can't we just map the levels to numeric values and be done with it?
I also saw someone do the following:
There was a categorical feature Pclass with three levels; he created 3 dummy variables for it and dropped the one with the lowest survival rate. I couldn't understand this either; I thought decision trees didn't care about correlated features.

If you just map levels to numeric values, Python will treat your values as numeric. That is, numerically 1 < 2 and so on, even if your levels were initially unordered. Think about the "distance" problem: the distance between 1 and 2 is 1, and between 1 and 3 it is 2. But what were the original distances between your categories? For example, what are the distances between "banana", "peach" and "apple"? Do you suppose they are all equal?
About dummy variables: if you have 3 classes and create 3 dummy variables, they are not just correlated, they are linearly dependent. This is never good.
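A minimal sketch of the two encodings with pandas (the column and level names are invented for illustration):

import pandas as pd

df = pd.DataFrame({"fruit": ["banana", "peach", "apple", "banana"]})

# Integer mapping imposes an artificial order and spacing on the levels
df["fruit_code"] = df["fruit"].map({"apple": 1, "banana": 2, "peach": 3})

# Dummy variables avoid that; drop_first=True keeps n-1 columns,
# so they are not linearly dependent
dummies = pd.get_dummies(df["fruit"], prefix="fruit", drop_first=True)
print(df.join(dummies))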

Related

Sklearn regression with label encoding

I'm attempting to use sklearn's linear regression model to predict fantasy players' points. I have numeric stats for each player and obviously their name, which I have encoded with LabelEncoder. My question is that when the encoded values are included in the training, the linear regression doesn't seem to recognize them as IDs but instead treats them as numeric values.
So is there a better way to encode player names so they are treated as IDs, so the model recognizes that player 1 averages 25 points compared to player 2's 20? Or is this type of encoding even possible with linear regression? Thanks in advance.
Apart from one-hot encoding (which might create way too many columns in this case), mean target encoding does exactly what you need: it encodes each category with its mean target value. You should be wary of target leakage in the case of rare categories, though. The sklearn-compatible category_encoders library provides several robust implementations, such as LeaveOneOutEncoder().
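A minimal sketch of what that could look like, assuming a DataFrame with a "player" column and a numeric target (both names are illustrative):

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    "player": ["A", "A", "B", "B", "C"],
    "points": [25, 27, 20, 18, 30],
})

# Leave-one-out target encoding: each row's category is replaced by the
# mean target of the other rows in that category, which limits leakage
encoder = ce.LeaveOneOutEncoder(cols=["player"])
X_encoded = encoder.fit_transform(df[["player"]], df["points"])
print(X_encoded)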

Random Forest or other machine learning techniques [need advice]

I am trying to get a sense of the relationship between some independent variables and a dependent variable, and to quantify their importance. I came across methods like the random forest that can quantify the importance of variables and then predict the outcome. However, I have an issue with the nature of the data to be used with the random forest or similar methods. An example of the data structure is provided below; as you can see, some variables in the time series, like population and age, do not change over time, though they differ between cities, while other variables, such as temperature and #internet users, change over time and within the cities. My question is: how can I quantify the importance of these variables on the "Y" variable? BTW, I prefer to apply the method in a Python environment.
"How can I quantity the importance" is very common question also known as "feature-importance".
The feature importance depends on your model; with a regression you have importance in your coefficients, in random forest you can use (but, some would not recommend) the build-in feature_importances_ or better the SHAP-values. Further more you can use som correlaion i.e Spearman/Pearson correlation between your features and your target.
Unfortunately there is no "free lunch", you will need to decide that based on what you want to use it for, how your data looks like etc.
I think the one you came across might be Boruta where you shuffle up your variables, add them to your data set and then create a threshold based on the "best shuffled variable" in a Random Forest.
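As a rough sketch (with invented feature names and synthetic data standing in for the city panel), both the built-in importances and permutation importance could be computed like this:

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the city/month data described in the question
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["population", "age_mean", "temperature", "internet_users"])

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Built-in impurity-based importances (fast, but can be biased)
print(pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False))

# Permutation importance is often a more reliable alternative
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False))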
My idea is as follows. Your outcome variable 'Y' has only a few possible values, so you can build a classifier (Random Forest is one of many existing classifiers) to predict, say, 'Y in [25-94, 95-105, 106-150]'. You then have three outcomes that rule each other out. (Other interval limits than 95 and 105 are possible, if that better suits your application; a small sketch of the binning step follows this answer.)
Some of your predictive variables are time series whereas others are constant, as you explain. You should use a sliding window technique where your classifier predicts 'Y' based on the time-related variables in, say, the month of January. It doesn't matter that some variables are constant, as the actual variable 'City' has the four outcomes '[City_1, City_2, City_3, City_4]'. Similarly, use 'Population' and 'Age_mean' as the actual variables.
Once you use classifiers, many approaches to feature ranking and feature selection are available. You can use a web service like insight classifiers to do it for you, or download a package like Weka for that.
The key point is that you organize your model and its predictive variables such that a classifier can learn correctly.
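A small sketch of discretizing a continuous Y into such interval classes (the bin edges are the illustrative ones above):

import pandas as pd

y = pd.Series([88, 97, 102, 110, 93, 140])

# Bin the continuous outcome into three mutually exclusive classes
y_class = pd.cut(y, bins=[25, 94, 105, 150], labels=["25-94", "95-105", "106-150"])
print(y_class)
# A RandomForestClassifier can then be trained on these labels instead of the raw Y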
If city and month are also independent variables, you should convert them from the index into ordinary columns. If you use pandas to read your file, df.reset_index() can do the job for you.
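For example, with an invented DataFrame indexed by city and month:

import pandas as pd

df = pd.DataFrame({"City": ["City_1", "City_1", "City_2"],
                   "Month": ["Jan", "Feb", "Jan"],
                   "Y": [98, 104, 91]}).set_index(["City", "Month"])

# Move City and Month from the index back into ordinary feature columns
df = df.reset_index()
print(df)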

Can the input for "Lasso" in python contain categorical variables?

I want to perform multiple linear regression in Python with the lasso. I am not sure whether the input observation matrix X can contain categorical variables. I read the documentation here: lasso in python
But it is brief and does not indicate which input types are allowed. For example, my code includes:
from sklearn.linear_model import Lasso

model = Lasso(fit_intercept=False, alpha=0.01)
model.fit(X, y)
In the code above, X is an observation matrix of size n-by-p; can one of the p variables be of categorical type?
You need to represent the categorical variables using 1s and 0s. If a categorical variable is binary, meaning each observation belongs to one of two categories, then you replace category A with 0 and category B with 1. If some variables have more than two categories, you will need to use dummy variables.
I usually have my data in a Pandas dataframe, in which case I use houses = pd.get_dummies(houses), which creates the dummy variables.
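A minimal sketch of that workflow with a toy DataFrame (the column names are invented for illustration):

import pandas as pd
from sklearn.linear_model import Lasso

houses = pd.DataFrame({
    "size": [120, 80, 95, 60],
    "neighborhood": ["north", "south", "east", "north"],
    "price": [300, 180, 220, 150],
})

# Expand the categorical column into 0/1 dummy columns
X = pd.get_dummies(houses.drop(columns="price"))
y = houses["price"]

model = Lasso(alpha=0.01)
model.fit(X, y)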
A previous poster has a good answer for this: you need to encode your categorical variables. The standard way is one-hot encoding (or dummy encoding), but there are many methods for doing this.
Here is a good library with many different ways to encode your categorical variables. They are also implemented to work with scikit-learn.
https://contrib.scikit-learn.org/categorical-encoding/
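For instance, the encoders there can be dropped into a scikit-learn Pipeline; a sketch using the library's BinaryEncoder (the data and column names are invented):

import pandas as pd
import category_encoders as ce
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

X = pd.DataFrame({"size": [120, 80, 95, 60],
                  "neighborhood": ["north", "south", "east", "north"]})
y = [300, 180, 220, 150]

# BinaryEncoder keeps the column count low for high-cardinality categoricals
pipe = Pipeline([
    ("encode", ce.BinaryEncoder(cols=["neighborhood"])),
    ("model", Lasso(alpha=0.01)),
])
pipe.fit(X, y)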

Scikit-Learn Random Forest regression: mix two sets of true values (y)

I am training random forests with two sets of "true" y values (empirical). I can easily tell which one is better.
However, I was wondering if there is a simple method, other than brute force, to pick the values from each set that would produce the best model. In other words, I would like to automatically mix both y sets to produce a new, ideal one.
Take, for instance, biological activity: different experiments and different databases provide different values. This is a simple example showing two different sets of y values in columns 3 and 4.
4a50,DQ7,47.6,45.4
3atu,ADP,47.7,30.7
5i9i,5HV,47.7,41.9
5jzn,GUI,47.7,34.2
4bjx,73B,48.0,44.0
4a6c,QG9,48.1,45.5
I know that column 3 is better because I have already trained different models against each of them, and also because I checked a few articles to verify which value is correct, and column 3 is right more often than column 4. However, I have thousands of rows and cannot read thousands of papers.
So I would like to know if there is an algorithm that, for instance, would use column 3 as the base for the true y values but would pick values from column 4 when the model improves by doing so.
It would be useful if it reported the final y column and could handle more than two sets, but I think I can figure that out.
The idea now is to find out if there is already a solution out there so that I don't need to reinvent the wheel.
Best,
Miro
NOTE: The features (x) are in a different file.
The problem is that an algorithm alone doesn't know which label is better.
What you could do: train a classifier on data which you know is correct, use the classifier to predict a value for each data point, compare this value to the two lists of labels which you already have, and choose the label which is closer.
This solution obviously isn't perfect, since the results depend on the quality of the classifier which predicts the value, and you still need enough labelled data to train it. Additionally, there is a chance that the classifier itself predicts a better value than either of your two lists of labels.
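A rough sketch of that idea, using a regressor since the values are continuous (all data here is synthetic and the names are made up):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: features and two noisy candidate label sets
X = rng.normal(size=(200, 5))
y_true = 10 * X[:, 0] + rng.normal(size=200)
y_col3 = y_true + rng.normal(scale=2.0, size=200)   # usually closer to the truth
y_col4 = y_true + rng.normal(scale=5.0, size=200)

# Train only on rows whose labels we trust (here: the first 50 rows of column 3)
trusted = np.arange(50)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[trusted], y_col3[trusted])

pred = model.predict(X)

# For each row, keep whichever candidate label is closer to the model's prediction
y_mixed = np.where(np.abs(y_col3 - pred) <= np.abs(y_col4 - pred), y_col3, y_col4)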
Choose columns 3 and 4 together as the target/predicted y values when fitting the Random Forest model, and predict both with your result. That way your algorithm can keep track of both y values and their correlation with the predicted values. Your problem seems to be a multi-output classification problem, where there are multiple target/predicted variables (multiple y values), as you suggest.
Random forests support this multi-output setting: the fit(X, y) method accepts y as array-like with shape [n_samples, n_outputs].
multioutput-classification
sklearn.ensemble.RandomForestClassifier.fit
Check multi-class and multi-output classification
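A minimal sketch of a multi-output fit; since the question involves continuous activity values, the regressor is used here, though RandomForestClassifier accepts the same [n_samples, n_outputs] target shape (the data is synthetic):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Stack the two sets of y values (columns 3 and 4) into one [n_samples, 2] target
Y = np.column_stack([X[:, 0] + rng.normal(size=100),
                     X[:, 0] + rng.normal(size=100)])

model = RandomForestRegressor(random_state=0)
model.fit(X, Y)              # multi-output fit
print(model.predict(X[:3]))  # two predicted values per sample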

How to normalise dataset for linear/multi regression in python

I am using a data set to make some predictions using multi-variable regression techniques. I have to predict the salary of employees based on some independent variables like gender, percentage, date of birth, marks in different subjects, degree, specialization, etc.
Numeric parameters (e.g. marks and percentages in different subjects) are fine to use with the regression model. But how do we normalize the non-numeric parameters (gender, date of birth, degree, specialization) here?
P.S. : I am using the scikit-learn : machine learning in python package.
You want to encode your categorical parameters.
For binary categorical parameters such as gender, this is relatively easy: introduce a single binary parameter: 1=female, 0=male.
If there are more than two categories, you could try one-hot encoding.
Read more in the scikit-learn documentation:
http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
Note that date is not a categorical parameter! Convert it into a unix timestamp (seconds since epoch) and you have a nice parameter on which you can regress.
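A small sketch of those transformations with an invented DataFrame:

import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "degree": ["BSc", "MSc", "PhD"],
    "date_of_birth": ["1990-05-01", "1985-11-23", "1992-02-14"],
})

# Binary category -> a single 0/1 column
df["gender"] = (df["gender"] == "female").astype(int)

# Multi-level category -> one-hot columns
df = pd.get_dummies(df, columns=["degree"])

# Date -> unix timestamp (seconds since epoch), a plain numeric feature
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"]).astype("int64") // 10**9
print(df)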
"Normaliz[ing] non-numeric parameters" is actually a huge area of regression. The most common treatment is to turn each categorical into a set of binary variables called dummy variables.
Each categorical with n values should be converted into n-1 dummy variables. So for example, for gender, you might have one variable, "female", that would be either 0 or 1 at each observation. Why n-1 and not n? Because you want to avoid the dummy variable trap, where basically the intercept column of all 1's can be reconstructed from a linear combination of your dummy columns. In relatively non-technical terms, that's bad because it messes up the linear algebra needed to do the regression.
I am not so familiar with the scikit-learn library but I urge you to make sure that whatever methods you do use, you ensure that each categorical becomes n-1 new columns, and not n.
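With pandas, one way to get n-1 rather than n columns is the drop_first flag of get_dummies (a small sketch with invented data):

import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "male"],
                   "degree": ["BSc", "MSc", "PhD"]})

# drop_first=True produces n-1 dummy columns per categorical,
# avoiding the dummy variable trap described above
print(pd.get_dummies(df, drop_first=True))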
I hope this can help you. The full description of how to use sklearn.preprocessing.normalize is available at this link:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
