Consider a dummy dataframe:
A  B  C   D   …  Z
1  2  as  we     2
2  4  qq  rr     5
4  5  tz  rc     9
This dataframe has 25 independent variables and one target variable. The independent variables are a mixture of high-cardinality categorical features, numerical features, and low-cardinality categorical features, and the target variable is numerical. I first want to select or filter the variables that are helpful in predicting the target variable. Any suggestions or tips towards achieving this goal are appreciated. I hope my question is clear; if the form of the question is unclear, I welcome suggestions for correcting it.
What have I tried so far?
I applied target mean encoding (smoothed mean) to the categorical features with respect to the target variable. Then I fit a random forest to get variable importances. The weird thing is that the random forest selects only one feature every time; I expected at least 3-4 meaningful variables. I tried neural networks, but the result is no different. What would be the reason for this? What does it mean if the algorithms use only one variable? The test predictions are also not very accurate: the RMSE is about 2.4, where the target usually ranges from 20 to 40 in value. Thank you for your patience in reading this.
P.S.: I am using sklearn in Python.
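For concreteness, here is a minimal sketch of the pipeline the question describes, assuming a DataFrame df whose numeric target column is named target (a hypothetical name) and whose categoricals have object dtype; the smoothing helper is illustrative, not a library function:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    def smooth_target_encode(df, cat_col, target_col, m=10):
        """Replace a categorical column with its smoothed target mean (m is the smoothing weight)."""
        global_mean = df[target_col].mean()
        agg = df.groupby(cat_col)[target_col].agg(["mean", "count"])
        smoothed = (agg["count"] * agg["mean"] + m * global_mean) / (agg["count"] + m)
        return df[cat_col].map(smoothed)

    # Ideally compute the encoding inside each CV fold to avoid leaking the target.
    df_enc = df.copy()
    cat_cols = df_enc.drop(columns="target").select_dtypes(include="object").columns
    for col in cat_cols:
        df_enc[col] = smooth_target_encode(df_enc, col, "target")

    X, y = df_enc.drop(columns="target"), df_enc["target"]
    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(10))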
I have a regression problem with 1 target and 10 features. When I look at the outliers for each feature with a box plot, the features have different numbers of outliers. While removing outliers, do I also need to remove the target values associated with those outliers?
I mean, let's say for feature #1 I have 12 outliers and I remove them along with their 12 target values; then for feature #2 I have 23 outliers and I remove them along with their 23 target values, and so on. Should the procedure be like this, or how should I proceed?
I imagine each row of your data contains an ID, the value of the target, and 10 feature values, one for each feature. To answer your question: if you want to remove the outliers, you have to remove the whole observation/row - the value you classify as an outlier, the corresponding target value, and the other 9 corresponding feature values. So you would filter each row on the entry of feature_i being smaller than the threshold_i you defined for an outlier.
The reason is that a multiple linear regression calculates the influence of an incremental change in one feature on the target, assuming all other 9 features are held constant. Removing a single feature value without removing the target and the other features of that observation simply does not work in such a model (assuming you are using OLS).
However, I would be cautious about removing outliers. I don't know your sample size or what you consider an outlier, and it would help to know more about your research question, data, and methodology.
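Here is a hedged sketch of that row-wise filtering, assuming a DataFrame df with hypothetical column names and illustrative IQR-based fences as the outlier thresholds:

    import pandas as pd

    def iqr_fences(s, k=1.5):
        """Return lower/upper outlier fences for a numeric Series (an illustrative threshold choice)."""
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return q1 - k * iqr, q3 + k * iqr

    feature_cols = [c for c in df.columns if c not in ("ID", "target")]  # hypothetical names
    keep = pd.Series(True, index=df.index)
    for col in feature_cols:
        lo, hi = iqr_fences(df[col])
        keep &= df[col].between(lo, hi)

    # Drop the whole row: the outlying value, its target, and the other 9 features go together.
    df_clean = df[keep]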
I built a Linear Regression model in Python, and my target variable, for example Sales, had values like 10, 9, 8.
I decided to log my target variable: df["Sales"] = np.log(df["Sales"]), so afterwards the values are roughly 2.3, 2.2, 2.1.
My question is: how can I interpret the results of this model, knowing that the target was logged? Currently I would read a coefficient as, for example, "at night, sales decrease by 1.333 on average", but that is probably a bad interpretation, because without the log of the target the result would be a much larger quantity, for example "at night, sales decrease by 2,589 on average".
So how can I interpret the results of the Linear Regression after logging the target, given that the logged target has such low values?
Your transformation is called a "log-level" regression. That is, your target variable was log-transformed and your independent variables are left in their normal scales.
The model should be interpreted as follows:
On average, a one-unit change in X_i is associated with a change of approximately 100 * B_i percent in Y (exactly, (exp(B_i) - 1) * 100 percent; the approximation works well when B_i is small).
Do note that if you transformed any of your independent variables, the interpretation changes too. For example, if you changed X_i to np.log(df['X_i']), then you would interpret B_i as a log-log coefficient.
You can find a handy cheat sheet to help you remember how to interpret models here.
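As a small illustration (using statsmodels, and a hypothetical "night" predictor), you can recover both the approximate and the exact percentage interpretation from the fitted coefficient:

    import numpy as np
    import statsmodels.api as sm

    y = np.log(df["Sales"])                      # log-level: log the target only
    X = sm.add_constant(df[["night"]])           # "night" is a hypothetical regressor
    model = sm.OLS(y, X).fit()

    b = model.params["night"]
    # Approximate: a 1-unit change in "night" changes Sales by ~100*b percent.
    # Exact: the change is (exp(b) - 1) * 100 percent.
    print(f"approx: {100 * b:.1f}%   exact: {(np.exp(b) - 1) * 100:.1f}%")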
I am using regression to analyze server data to find feature importance.
Some of my IVs (independent variables) or Xs are in percentages like % of time, % of cores, % of resource used, while others are in numbers like number of bytes, etc.
I standardized all my Xs with (X-X_mean)/X_stddev. (Am I wrong in doing so?)
Which algorithm should I use in Python in case my IVs are a mix of numeric and %s and I predict Y in the following cases:
Case 1: Predict a continuous valued Y
a. Will using a Lasso regression suffice?
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
Case 2: Predict a %-valued Y, like "% resource used".
a. Should I use Beta regression? If so, which package in Python offers this?
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
If I am wrong in standardizing the Xs that are already a %, is it fine to use these numbers as 0.30 for 30%, so that they fall within the range 0-1? That means I would not standardize them, but I would still standardize the other numeric IVs.
Final Aim for both Cases 1 and 2:
To find the % of impact of IVs on Y.
e.g.: When X1 increases by 1 unit, Y increases by 21%
I understand from other posts that we can NEVER add up all coefficients to a total of 100 to assess the % of impact of each and every IV on the DV. I hope I am correct in this regard.
Having a mix of predictors doesn't matter for any form of regression; it only changes how you interpret the coefficients. What does matter, however, is the type/distribution of your Y variable.
Case 1: Predict a continuous valued Y
a. Will using a Lasso regression suffice?
Regular OLS regression will work fine for this.
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
The interpretation of coefficients always follows a format like "for a 1 unit change in X, we expect an x-coefficient amount of change in Y, holding the other predictors constant"
Because you have standardized X, your unit is a standard deviation. So the interpretation will be "for a 1 standard deviation change in X, we expect an X-coefficient amount of change in Y..."
c. How do I interpret the X-coefficient if X is standardized and is a %?
Same as above. Your units are still standard deviations, even though the variable originally came from a percentage.
Case 2: Predict a %-valued Y, like "% resource used".
a. Should I use Beta regression? If so, which package in Python offers this?
This is tricky. The typical recommendation is to use something like binomial logistic regression when your Y outcome is a percentage.
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
Same as interpretations above. But if you use logistic regression, they are in the units of log odds. I would recommend reading up on logistic regression to get a deeper sense of how this works
If I am wrong in standardizing the Xs that are already a %, is it fine to use these numbers as 0.30 for 30%, so that they fall within the range 0-1? That means I would not standardize them, but I would still standardize the other numeric IVs.
Standardizing is perfectly fine for variables in regression, but as I said, it changes your interpretation: your unit is now a standard deviation.
Final Aim for both Cases 1 and 2: To find the % of impact of IVs on Y. E.g.: When X1 increases by 1 unit, Y increases by 21%
If your Y is a percentage and you use something like OLS regression, then that is exactly how you would interpret the coefficients (for a 1 unit change in X1, Y changes by some percent)
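To make the Case 1 interpretation concrete, here is a sketch with standardized predictors (the column names are made up purely for illustration):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    X = df[["pct_cores", "num_bytes"]]   # one % feature, one numeric feature (hypothetical)
    y = df["y_continuous"]               # hypothetical continuous target

    X_std = StandardScaler().fit_transform(X)   # (X - mean) / std, column by column
    ols = LinearRegression().fit(X_std, y)

    # Each coefficient: expected change in y for a 1-standard-deviation increase in that
    # predictor, holding the others constant -- whether the raw feature was a % or a count.
    print(pd.Series(ols.coef_, index=X.columns))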
Your question confuses some concepts and jumbles a lot of terminology. Essentially you're asking about (a) feature preprocessing for (linear) regression, (b) the interpretability of linear regression coefficients, and (c) sensitivity analysis (the effect of feature X_i on Y). But be careful: you're making a big assumption that Y depends linearly on each X_i; see below.
Standardization is not an "algorithm", just a technique for preprocessing data.
Standardization is useful for regression (and effectively required for regularized variants like Lasso), but it is not needed for tree-based algorithms (RF/XGB/GBT) - with those, you can feed in raw numeric features directly (percentages, totals, whatever).
(X-X_mean)/X_stddev is standardization (z-scoring) - so yes, what you did is standardization. An alternative is min-max normalization, (X-X_min)/(X_max-X_min), which rescales each variable into the range [0, 1].
Lastly, you ask about sensitivity analysis in regression: can we directly interpret the regression coefficient for X_i as the sensitivity of Y to X_i?
Stop and think about your underlying linearity assumption in "Final Aim for both cases 1 & 2: To find the % of impact of IVs on Y. Eg: When X1 increases by 1 unit, Y increases by 21%".
You're assuming that the dependent variable has a linear relationship with each independent variable. But that is often not the case; it may be nonlinear. For example, if you're looking at the effect of Age on Salary, you would typically see it increase up to the 40s/50s, then decrease gradually, and then decrease sharply once you hit retirement age (say 65).
So you would model the effect of Age on Salary as quadratic or a higher-order polynomial, by throwing in Age^2 and maybe Age^3 terms (or sometimes sqrt(X), log(X), log1p(X), exp(X), etc. - whatever best captures the nonlinear relationship). You may also see variable-variable interaction terms, although strong correlation between predictors (multicollinearity) makes the individual coefficients unstable.
Obviously, Age has a huge effect on Salary, but we would not measure the sensitivity of Salary to Age by combining the (absolute values of the) coefficients of Age, Age^2, and Age^3.
If we only had a linear term for Age, the single coefficient for Age would massively understate the influence of Age on Salary: it would average out the strong positive relationship in the regime Age < 40 against the negative relationship for Age > 50.
So the general answer to "Can we directly interpret the regression coefficient for X_i as the sensitivity of Y on X_i?" is "Only if the relationship between Y and that X_i is linear, otherwise no".
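A toy sketch of that idea, using hypothetical Age/Salary columns and polynomial terms (one way among many to capture the curvature):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    age = df[["Age"]].to_numpy()                 # hypothetical columns
    salary = df["Salary"].to_numpy()

    poly = PolynomialFeatures(degree=3, include_bias=False)   # Age, Age^2, Age^3
    X_poly = poly.fit_transform(age)
    model = LinearRegression().fit(X_poly, salary)

    # The three coefficients jointly describe the curve; no single one of them is
    # "the sensitivity of Salary to Age".
    print(dict(zip(poly.get_feature_names_out(["Age"]), model.coef_)))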
In general, a better and easier way to do sensitivity analysis (without assuming linear response, or needing standardization of % features) is tree-based algorithms (RF/XGB/GBT) which generate feature importances.
As an aside, I understand your exercise tells you to use regression, but in general you get better, faster feature-importance information from tree-based methods (RF/XGB), especially with shallow trees (a small max_depth and a large minimum leaf size - nodesize in R, min_samples_leaf in sklearn - e.g. > 0.1% of the training-set size). That's why people use them even when their final goal is regression.
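A sketch of that shallow-forest setup (the particular values and column names are illustrative, not tuned or prescribed):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    X = df.drop(columns="y")    # "y" is a hypothetical target column name
    y = df["y"]

    rf = RandomForestRegressor(
        n_estimators=300,
        max_depth=4,                                     # shallow trees
        min_samples_leaf=max(1, int(0.001 * len(X))),    # roughly 0.1% of the training set per leaf
        random_state=0,
    ).fit(X, y)

    print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))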
(Your question would get better answers over at CrossValidated, but it's fine to leave it here on SO; there is some crossover.)
I am trying to create a classification model. While pre-processing the data, I looked at the variance in each column; this is the amount of variance in each column. I am confused about which columns I should log-transform before modelling. How much variance is acceptable? Could somebody please shed some light on this?
Temparature 2.318567e-01
HR 4.747868e+02
SpO2 1.179291e+01
SBP 6.263887e+02
MAP 2.905884e+02
RR 2.794205e+01
FiO2 9.061920e+00
PaO2 1.327011e+03
PaCO2 7.466527e+01
pH 4.851681e-03
A.a.gradient 0.000000e+00
HCO3 1.358290e+01
Hb 5.337076e+00
TLC 6.326940e+07
Platelets 1.062145e+10
K 3.332203e-01
Na 4.429681e+01
Serum.Cr 1.897277e+00
Blood.Urea 7.321509e+02
Bili 3.352918e+00
Urine.output 5.157271e+05
Lactate 3.795719e+00
INR 5.362644e-01
dtype: float64
I would say that looking only at variance of columns is mostly useful to delete columns with 0 variance.
If your column has at least minimal variance, you cannot conclude that the column is useless without further investigation.
I would say it depends on the priors you have on the data. There isn't an "acceptable range of variance" unless it comes with context.
For classification purposes it is best to train on as many samples as you can, but you do want to leave some for validation, as @desertnaut has suggested.
Bottom line: I would take the (say) 80% most variable columns and log-transform them. The other 20% will remain for validation.
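A rough sketch of that suggestion, assuming X is the DataFrame of predictor columns; the 80% cutoff is arbitrary, and log1p only makes sense for non-negative, right-skewed columns:

    import numpy as np

    variances = X.var().sort_values(ascending=False)
    cutoff = variances.quantile(0.20)                 # columns above this are the top ~80% by variance
    to_transform = variances[variances > cutoff].index

    X_log = X.copy()
    for col in to_transform:
        X_log[col] = np.log1p(X_log[col])             # log1p handles zeros gracefully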
I have labeled 2D data. There are 4 labels in the set, and I know the correspondence of every point to its label. I'd like to, given a new arbitrary data point, find the probability that it has each of the 4 labels. It must belong to one and only one of the labels, so the probabilities should sum to 1.
What I've done so far is to train 4 independent sklearn GMMs (sklearn.mixture.GaussianMixture) on the data points associated with each label. It should be noted that I do not wish to train a single GMM with 4 components because I already know the labels, and don't want to re-cluster in a way that is worse than my known labels. (It would appear that there is a way to provide Y= labels to the fit() function, but I can't seem to get it to work).
In the above plot, points are colored by their known labels, and the contours represent the four independent GMMs fitted to these 4 sets of points.
For a new point, I attempted to compute the probability of its label in a couple ways:
GaussianMixture.predict_proba(): Since each independent GMM has only one distribution, this simply returns a probability of 1 for all models.
GaussianMixture.score_samples(): According to the documentation, this returns the "weighted log probabilities for each sample". My procedure is, for a single new point, to make four calls to this function, one for each of the four independently trained GMMs representing the distributions above. I do get semi-sensible results here - typically a positive number for the correct model and negative numbers for each of the three incorrect models, with more muddled results for points near intersecting distribution boundaries. Here's a typical clear-cut result:
2.904136, -60.881554, -20.824841, -30.658509
This point is actually associated with the first label and is least likely to belong to the second (it is farthest from the second distribution). My issue is how to convert the above scores into probabilities that sum to 1 and accurately represent the chance that the given point belongs to one and only one of the four distributions. Given that these are four independent models, is this possible? If not, is there another method I have overlooked that would allow me to train GMM(s) based on known labels and provide probabilities that sum to 1?
In general, if you don't know how the scores are calculated but you know that there is a monotonic relationship between the scores and the probability, you can simply use the softmax function to approximate a probability, with an optional temperature variable that controls the spikiness of the distribution.
Let V be your array of scores (a NumPy array, so the element-wise operations work) and tau be the temperature. Then,
p = np.exp(V / tau) / np.sum(np.exp(V / tau))
is your answer.
P.S.: Luckily, we know how sklearn's GMM scoring works: score_samples returns log-likelihoods, so softmax with tau=1 is your exact answer (assuming equal priors over the four labels).
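Putting the whole procedure together as a sketch (assuming X_by_label is a dict mapping each of the four labels to its array of 2D points; equal class priors and tau=1 are assumed):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # One single-component GMM per known label
    gmms = {lab: GaussianMixture(n_components=1, random_state=0).fit(pts)
            for lab, pts in X_by_label.items()}

    def label_probabilities(x_new, gmms, tau=1.0):
        labels = list(gmms)
        # score_samples returns the log-likelihood of the point under each model
        scores = np.array([gmms[lab].score_samples(x_new.reshape(1, -1))[0] for lab in labels])
        z = np.exp((scores - scores.max()) / tau)     # subtract the max for numerical stability
        return dict(zip(labels, z / z.sum()))

    print(label_probabilities(np.array([1.5, -0.3]), gmms))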