I am using a dataset to make some predictions using multivariable regression techniques. I have to predict the salary of employees based on independent variables like gender, percentage, date of birth, marks in different subjects, degree, specialization, etc.
Numeric parameters (e.g. marks and percentage in different subjects) are fine to use with the regression model. But how do we normalize the non-numeric parameters (gender, date of birth, degree, specialization) here?
P.S.: I am using the scikit-learn machine learning package for Python.
You want to encode your categorical parameters.
For binary categorical parameters such as gender, this is relatively easy: introduce a single binary parameter: 1=female, 0=male.
If there are more than two categories, you could use one-hot encoding.
Read more in the scikit-learn documentation:
http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
Note that date is not a categorical parameter! Convert it into a unix timestamp (seconds since epoch) and you have a nice parameter on which you can regress.
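For example, with pandas (assuming a reasonably recent version; the column names and values below are made up for illustration):

import pandas as pd

# Hypothetical example data; column names are illustrative only.
df = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "degree": ["BSc", "MSc", "BSc"],
    "date_of_birth": ["1990-05-01", "1988-11-23", "1992-02-14"],
    "percentage": [78.5, 82.0, 69.3],
})

# Date of birth -> unix timestamp (seconds since epoch), a plain numeric feature.
df["dob_ts"] = pd.to_datetime(df["date_of_birth"]).astype("int64") // 10**9

# One-hot encode the categorical columns; numeric columns pass through untouched.
df_encoded = pd.get_dummies(df.drop(columns=["date_of_birth"]),
                            columns=["gender", "degree"])
print(df_encoded.head())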
"Normaliz[ing] non-numeric parameters" is actually a huge area of regression. The most common treatment is to turn each categorical into a set of binary variables called dummy variables.
Each categorical with n values should be converted into n-1 dummy variables. So for example, for gender, you might have one variable, "female", that would be either 0 or 1 at each observation. Why n-1 and not n? Because you want to avoid the dummy variable trap, where basically the intercept column of all 1's can be reconstructed from a linear combination of your dummy columns. In relatively non-technical terms, that's bad because it messes up the linear algebra needed to do the regression.
I am not so familiar with the scikit-learn library, but I urge you to make sure that, whatever methods you use, each categorical becomes n-1 new columns, and not n.
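For example, with pandas (the column name below is made up):

import pandas as pd

# A single categorical with 3 levels becomes 2 dummy columns;
# the dropped level acts as the reference category.
s = pd.Series(["BSc", "MSc", "PhD", "MSc"], name="degree")
dummies = pd.get_dummies(s, prefix="degree", drop_first=True)
print(dummies)   # columns: degree_MSc, degree_PhD (degree_BSc is the reference)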
I hope this can help you. The full description of how to use that function is available at this link.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
I'm attempting to use sklearn's linear regression model to predict fantasy players' points. I have numeric stats for each player and obviously their names, which I have encoded with the LabelEncoder function. My question is: when the encoded values are included in the training data, the linear regression doesn't seem to recognize them as IDs, but instead treats them as ordinary numeric values.
So is there a better way to encode player names so they are treated as IDs, and the model recognizes that player 1 averages 25 points compared to player 2's 20? Or is this type of encoding even possible with linear regression? Thanks in advance.
Apart from one-hot encoding (which might create way too many columns in this case), mean target encoding does exactly what you need (it encodes the category with its mean target value). You should be wary of target leakage in the case of rare categories, though. The sklearn-compatible category_encoders library provides several robust implementations, such as LeaveOneOutEncoder().
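A minimal sketch, assuming the category_encoders package is installed (pip install category_encoders); the data here is made up:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    "player": ["A", "A", "B", "B", "C"],
    "points": [25, 27, 20, 18, 30],
})

# Each player name is replaced by the mean target of the *other* rows with the
# same name, which limits target leakage compared to plain mean encoding.
encoder = ce.LeaveOneOutEncoder(cols=["player"])
X_encoded = encoder.fit_transform(df[["player"]], df["points"])
print(X_encoded)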
I am trying to get a sense of the relationship between some independent variables and a dependent variable, and to quantify their importance for that dependent variable. I came across methods like random forests that can quantify the importance of variables and then predict the outcome. However, I have an issue with the nature of the data to be used with random forests or similar methods. An example of the data structure is provided below, and as you can see the time series has some variables, like population and age, that do not change with time, though they differ among the cities, while other variables, such as temperature and number of internet users, change through time and within the cities. My question is: how can I quantify the importance of these variables for the "Y" variable? BTW, I prefer to apply the method in a Python environment.
"How can I quantity the importance" is very common question also known as "feature-importance".
The feature importance depends on your model; with a regression you have importance in your coefficients, in random forest you can use (but, some would not recommend) the build-in feature_importances_ or better the SHAP-values. Further more you can use som correlaion i.e Spearman/Pearson correlation between your features and your target.
Unfortunately there is no "free lunch", you will need to decide that based on what you want to use it for, how your data looks like etc.
I think the one you came across might be Boruta where you shuffle up your variables, add them to your data set and then create a threshold based on the "best shuffled variable" in a Random Forest.
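As a rough sketch of the built-in importances and the (usually more reliable) permutation importances in scikit-learn; the data and feature layout here are invented, and permutation_importance assumes scikit-learn >= 0.22:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # e.g. temperature, internet users, population, age
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("built-in importances:", model.feature_importances_)

# Permutation importance shuffles one feature at a time and measures the drop in score.
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("permutation importances:", perm.importances_mean)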
My idea is as follows. Your outcome variable 'Y' has only a few possible values. You can build a classifier (Random Forest is one of many existing classifiers) to predict, say, 'Y in [25-94, 95-105, 106-150]'. You will then have three different outcomes that are mutually exclusive. (Other interval limits than 95 and 105 are possible, if that better suits your application.)
Some of your predictive variables are time series whereas others are constant, as you explain. You should use a sliding window technique where your classifier predicts 'Y' based on the time-related variables in, say, the month of January. It doesn't matter that some variables are constant, as the actual variable 'City' has the four outcomes '[City_1, City_2, City_3, City_4]'. Similarly, use 'Population' and 'Age_mean' as the actual variables.
Once you use classifiers, many approaches to feature ranking and feature selection have been developed. You can use a web service like insight classifiers to do it for you, or download a package like Weka for that.
Key point is that you organize your model and its predictive variables such that a classifier can learn correctly.
If city and month are also your independent variables, you should convert them from the index into columns. If you use pandas to read your file, df.reset_index() can do the job for you.
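A minimal sketch; the file name and index column names are assumptions for illustration:

import pandas as pd

# Read with City/Month as a MultiIndex (hypothetical file and column names).
df = pd.read_csv("data.csv", index_col=["City", "Month"])

# Turn the index levels back into ordinary columns so they can be used as features.
df = df.reset_index()
print(df.columns)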
I want to perform multiple linear regression in Python with Lasso. I am not sure whether the input observation matrix X can contain categorical variables. I read the instructions from here: lasso in python
But it is brief and does not indicate the allowed input types. For example, my code includes:
from sklearn.linear_model import Lasso

model = Lasso(fit_intercept=False, alpha=0.01)
model.fit(X, y)
In the code above, X is an observation matrix of size n-by-p; can one of the p variables be of categorical type?
You need to represent the categorical variables using 1s and 0s. If a categorical variable is binary, meaning each observation belongs to one of two categories, then you replace category A with 0 and category B with 1. If some variables have more than two categories, you will need to use dummy variables.
I usually have my data in a Pandas dataframe, in which case I use houses = pd.get_dummies(houses), which creates the dummy variables.
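For example, feeding the dummies into the Lasso from the question (the column names and values below are made up):

import pandas as pd
from sklearn.linear_model import Lasso

X = pd.DataFrame({
    "sqft": [1500, 2100, 900, 1200],
    "neighborhood": ["north", "south", "north", "west"],
})
y = [300000, 450000, 200000, 260000]

# Expand the categorical column into dummy columns; numeric columns are left alone.
X_dummies = pd.get_dummies(X, columns=["neighborhood"], drop_first=True)

model = Lasso(alpha=0.01)
model.fit(X_dummies, y)
print(model.coef_)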
A previous poster has a good answer for this: you need to encode your categorical variables. The standard way is one-hot encoding (or dummy encoding), but there are many methods for doing this.
Here is a good library that has many different ways to encode your categorical variables. These are also implemented to work with scikit-learn.
https://contrib.scikit-learn.org/categorical-encoding/
I am trying to work on the Titanic survival challenge on Kaggle https://www.kaggle.com/c/titanic.
I am not experienced in R, so I am using Python and scikit-learn for the random forest classifier.
I am seeing many people using scikit-learn convert their categoricals with many levels into dummy variables.
I don't understand the point of doing this. Why can't we just map the levels to numeric values and be done with it?
I also saw someone do the following:
There was a categorical feature Pclass with three levels; he created 3 dummy variables for it and dropped the one which had the lowest survival rate. I couldn't understand this either; I thought decision trees didn't care about correlated features.
If you just map levels to numeric values, Python will treat your values as numeric. That is, numerically 1 < 2 and so on, even if your levels were initially unordered. Think about the "distance" problem: the distance between 1 and 2 is 1, while between 1 and 3 it is 2. But what were the original distances between your categories? For example, what are the distances between "banana", "peach" and "apple"? Do you suppose that they are all equal?
About dummy variables: if you have 3 classes and create 3 dummy variables, they are not just correlated, they are linearly dependent (together with the intercept). This is never good.
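A tiny sketch of that dependence with pandas, using the Pclass example from the question:

import pandas as pd

pclass = pd.Series([1, 2, 3, 1, 3], name="Pclass")

# Three levels -> three dummy columns whose row sums are always 1,
# i.e. together they reconstruct the intercept column.
full = pd.get_dummies(pclass, prefix="Pclass")
print(full.sum(axis=1).unique())    # [1]

# Dropping one level removes the dependence.
reduced = pd.get_dummies(pclass, prefix="Pclass", drop_first=True)
print(reduced.columns.tolist())     # ['Pclass_2', 'Pclass_3']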
[Preface: I now realize I should've used a classification model (maybe decision tree) instead but I ended up using a linear regression model.]
I had a pandas dataframe as such:
And I want to predict audience score using genre, year, and Tomato-meter score. But as constructed, the genres for each movie come in a list, so I felt the need to isolate each genre and pass each one into my model as a separate variable.
After doing so, my modified dataframe looks like this, with duplicate rows for each movie but each genre element of that movie isolated (just one movie pulled from the dataframe to show):
Now, my question is, can I pass in this second dataframe as is to Patsy and statsmodel linear regression, or will the row duplication introduce bias into my model?
from patsy import dmatrices

y1, X1 = dmatrices('Q("Audience Score") ~ Year + Q("Tomato-meter") + Genre',
                   data=DF2, return_type='dataframe')
In summary, I'm looking for a way for patsy and my model to recognize and treat each genre as a separate variable, but I want to make sure I'm not fudging the numbers/model by passing in a dataframe in this format as the data (as not every movie has the same number of genres).
I see two problems with the approach:
Parameter estimates:
If there are different numbers of repeated observations, the weight given to observations with multiple categories will be larger than for observations with only a single category. This could be corrected by using weights in the linear model: use WLS with weights equal to the inverse of the number of repetitions (or the square root of it?). Weights are not available for other models like Poisson or Logit or GLM-Binomial. This will not make a large difference for the parameter estimates if the "pattern", i.e. the underlying parameters, is not systematically different across movies with different numbers of categories.
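A small sketch of the WLS weighting idea with statsmodels; the numbers are invented and the weights are simply one over the number of duplicated rows per movie:

import numpy as np
import statsmodels.api as sm

# Toy data: 5 expanded rows coming from 3 original movies; the last movie was
# duplicated 3 times because it has 3 genres.
X = sm.add_constant(np.array([[2000.0], [2005.0], [2010.0], [2010.0], [2010.0]]))
y = np.array([60.0, 72.0, 80.0, 80.0, 80.0])
n_repeats = np.array([1, 1, 3, 3, 3])

# Down-weight repeated rows so each original movie contributes equally overall.
result = sm.WLS(y, X, weights=1.0 / n_repeats).fit()
print(result.params)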
Inference, standard error of parameter estimates:
All basic models like OLS, Poisson and so on assume that each row is an independent observation. The total number of rows will be larger than the number of actual observations and the estimated standard errors of the parameters will be underestimated. (We could use cluster robust standard errors, but I never checked how well they work with duplicate observations, i.e. response is identical across several observations.)
Alternative
As an alternative to repeating observations, I would encode the categories into non-exclusive dummy variables. For example, if we have three levels of the categorical variable, movie categories in this case, then we add a 1 in each corresponding column if the observation is "in" that category.
Patsy doesn't have premade support for this, so the design matrix for the movie category would need to be built by hand or as the sum of the individual dummy matrices (without dropping a reference category).
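One way to build such non-exclusive dummy columns by hand is scikit-learn's MultiLabelBinarizer; the frame below is made up and not the original data:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "Genre": [["Comedy", "Mystery"], ["Drama"]],
})

# One 0/1 column per genre; a movie can be "in" several columns at once.
mlb = MultiLabelBinarizer()
genre_dummies = pd.DataFrame(mlb.fit_transform(df["Genre"]),
                             columns=mlb.classes_, index=df.index)

design = pd.concat([df.drop(columns=["Genre"]), genre_dummies], axis=1)
print(design)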
Alternative model
This is not directly related to the issue of multiple categories in the explanatory variables.
The response variable movie ratings is bound to be between 0 and 100. A linear model will work well as a local approximation, but will not take into account that observed ratings are in a limited range and will not enforce it for prediction.
Poisson regression could be used to take the non-negativity into account, but wouldn't use the upper bound. Two alternatives that will be more appropriate are GLM with Binomial family and a total count for each observation set to 100 (maximum possible rating), or use a binary model, e.g. Logit or Probit, after rescaling the ratings to be between 0 and 1.
The latter corresponds to estimating a model for proportions which can be estimated with the statsmodels binary response models. To have inference that is correct even if the data is not binary, we can use robust standard errors. For example
result = sm.Logit(y_proportion, x).fit(cov_type='HC0')
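For the GLM-Binomial alternative mentioned above, a rough sketch (with made-up ratings rescaled to proportions) might look like this:

import numpy as np
import statsmodels.api as sm

ratings = np.array([55.0, 78.0, 92.0, 40.0])          # audience scores out of 100
y_proportion = ratings / 100.0
x = sm.add_constant(np.array([[2001.0], [2005.0], [2012.0], [1998.0]]))

# The Binomial family keeps predictions inside the 0-1 range.
glm_binom = sm.GLM(y_proportion, x, family=sm.families.Binomial())
print(glm_binom.fit().params)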
Patsy doesn't have any built-in way to separate out a "multi-category" like your Genre variable, and as far as I know there's no direct way to represent it in Pandas either.
I'd break Genre into a bunch of boolean columns, one per category: Mystery = True/False, Comedy = True/False, etc. That fits better with both pandas' and patsy's way of representing things.
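If the genres are stored as a single delimited string, pandas can do this split directly; the separator and data below are assumptions:

import pandas as pd

df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "Genre": ["Comedy|Mystery", "Drama"],
})

# One boolean column per genre, True where the movie lists that genre.
genre_flags = df["Genre"].str.get_dummies(sep="|").astype(bool)
df = pd.concat([df.drop(columns=["Genre"]), genre_flags], axis=1)
print(df)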