Removing Outliers in a Multi-feature Regression Problem - python

I have a regression problem with 1 target and 10 features. When I look at the outliers of each feature with a box plot, each feature has a different number of outliers. When removing outliers, do I also need to remove the target values associated with those outliers?
I mean, let's say for feature #1 I have 12 outliers and I remove them together with their 12 target values. Then for feature #2 I have 23 outliers and I remove them together with their 23 target values as well, and so on. Is this the right procedure, or how should I proceed?

I imagine each row of your data contains an ID, the value of the target, and 10 feature values, one for each feature. To answer your question: if you want to remove the outliers, you have to remove the whole observation/row - the value you classify as an outlier, the corresponding target value, and the other 9 corresponding feature values. So you would filter out every row in which the entry for feature_i exceeds the threshold_i you defined for outliers.
The reason is that a multiple linear regression estimates the influence of an incremental change in one feature on the target, holding the other 9 features constant. Removing a single feature value without removing the target and the other features of that observation simply does not work in such a model (assuming you are using OLS).
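As a rough sketch of that row-wise filtering (assuming a pandas DataFrame df with a "target" column and feature columns, and using the common 1.5 x IQR box-plot whiskers as the per-feature thresholds):

import pandas as pd

def drop_outlier_rows(df, feature_cols, k=1.5):
    # Per-feature box-plot whiskers: [Q1 - k*IQR, Q3 + k*IQR]
    q1 = df[feature_cols].quantile(0.25)
    q3 = df[feature_cols].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Keep a row only if every feature lies inside its own whiskers;
    # the target and the other features of that row are dropped with it
    inside = (df[feature_cols] >= lower) & (df[feature_cols] <= upper)
    return df[inside.all(axis=1)]

# clean = drop_outlier_rows(df, ["f1", "f2", "f3"])  # hypothetical column names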
However, I would be cautious with removing outliers. I don't know your sample size or what you consider an outlier, and it would help to know more about your research question, data, and methodology.


Variable selection involving a mixture of numerical, high-cardinality, and low-cardinality features

Consider a dummy dataframe:
A   B   C    D    ...   Z
1   2   as   we   ...   2
2   4   qq   rr   ...   5
4   5   tz   rc   ...   9
This dataframe has 25 independent variables and one target variable. The independent variables are a mixture of high-cardinality categorical features, numerical features, and low-cardinality categorical features, and the target variable is numerical. I first want to select or filter the variables that are helpful in predicting the target variable. Any suggestions or tips towards achieving this goal are appreciated. I hope my question is clear; if the form of the question is unclear, I welcome suggestions to improve it.
What I have tried so far:
I applied target-mean encoding (smoothed mean) to the categorical features with respect to the target variable. Then I applied a random forest to get the variable importances. The weird thing is that the random forest selects only one feature every time; I expected at least 3-4 meaningful variables. I tried neural networks, but the result is no different. What could be the reason for this? What does it mean if the algorithms use only one variable? Also, the test predictions are not very accurate: the RMSE is about 2.4, while the target typically ranges from 20 to 40 in value. Thank you for your patience in reading this.
P.S.: I am using scikit-learn in Python.
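For reference, a minimal sketch of the pipeline described above (column names are hypothetical: "C" and "D" as categoricals from the dummy dataframe, "target" as the numeric target), using smoothed target-mean encoding followed by random-forest importances:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def smooth_target_encode(df, cat_col, target_col, m=10.0):
    # Blend the per-category mean with the global mean; m controls the smoothing
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return df[cat_col].map(smoothed)

# df = ...  # 25 features plus a numeric "target" column
# for col in ["C", "D"]:  # categorical columns
#     df[col] = smooth_target_encode(df, col, "target")
# rf = RandomForestRegressor(n_estimators=500, random_state=0)
# rf.fit(df.drop(columns="target"), df["target"])
# importances = pd.Series(rf.feature_importances_, index=df.columns.drop("target"))
# print(importances.sort_values(ascending=False))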

Detecting and Replacing Outliers

In my mind, there are multiple ways to treat dataset outliers:
- Delete the data
- Transform the data using a log transform or binning (a small sketch follows the list)
- Replace the values using the mean or median
- Test the outliers separately
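A minimal sketch of the transform option (assuming X is a pandas DataFrame and "amount" is a hypothetical non-negative, right-skewed column):

import numpy as np
import pandas as pd

X["amount_log"] = np.log1p(X["amount"])  # log(1 + x) compresses the long right tail
X["amount_bin"] = pd.cut(X["amount"], bins=10, labels=False)  # equal-width bins 0-9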
I have a dataset of around 50,000 observations with quite a few outlier values (some variables have a small number of outliers, some have 100-200), so deleting data is not the option I'm looking for, as it would cause me to lose a huge chunk of data.
I read somewhere that replacing with the mean or median is meant for artificial outliers, but in my case I think the outliers are natural.
I was actually about to use the median to flag the outliers and then fill in the resulting missing values, which doesn't seem quite right, but I used it nevertheless with this code:
import numpy as np

median = X.median()
std = X.std()
# Flag values more than one standard deviation away from the median
outliers = (X - median).abs() > std
# Blank out the flagged values and fill them with the column medians
X[outliers] = np.nan
X.fillna(median, inplace=True)
It did lower the overfitting of one model (logistic regression), but Random Forest still gives 100%, and the shape of the graph changed (the before/after plots are not shown here).
So I'm really confused about which technique to use. I also tried replacing values below the 5th and above the 95th percentile, but that didn't work either. Should I bin the data in each column into 1-10? Also, should I normalize or standardize my data before applying any model? Any guidance will be appreciated.
Check out robust statistics.
I would suggest looking at Huber's method / winsorization, which you also have available in Python.
For hypothesis testing you have the Wilcoxon signed-rank test and, I think, the Mann-Whitney test.
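A minimal sketch of winsorization with SciPy (assuming X is a pandas DataFrame of numeric features; the 5% limits are an arbitrary choice): the most extreme values on each side are clipped rather than dropped, so no rows are lost.

import numpy as np
from scipy.stats.mstats import winsorize
from sklearn.preprocessing import RobustScaler

X_wins = X.copy()
for col in X_wins.columns:
    # Clip the lowest and highest 5% of each column to the 5th/95th percentile values
    X_wins[col] = np.asarray(winsorize(X_wins[col].to_numpy(), limits=[0.05, 0.05]))

# RobustScaler centers by the median and scales by the IQR, so remaining
# extreme values have little influence on the scaling itself
X_scaled = RobustScaler().fit_transform(X_wins)

The tests mentioned above are available as scipy.stats.wilcoxon and scipy.stats.mannwhitneyu.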

Use Unsupervised Nearest Neighbors with NaN

I want to use unsupervised nearest neighbors, and I have NaN in my data. When a feature for a record is NaN, I want it not to count towards the distance to any other record. Filling NaN with 0 would not work: it would make the record close to records whose value for that feature is near 0 and far from records whose value is far from 0.
I created a Euclidean-style metric that does this, since NaN propagates through - and **, but is treated as 0 by nansum. However, I am still getting an error due to the NaN.
Is there any way to fix this error? I would consider using a module other than sklearn if needed.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def metric(x1, x2):
    # NaN propagates through the subtraction and the square; nansum then ignores those entries
    return np.nansum((x1 - x2) ** 2)

nn = NearestNeighbors(n_neighbors=10, metric=metric, n_jobs=-1)
nn.fit(x)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
What I mean is: if a record has NaN for, say, the 10th feature, then the 10th feature should not count in the distance to any other record, so the record is equally close to every other record regardless of whether they have -1, 0, 13, or any other number for the 10th feature.
Dropping records with NaN would not work; it would actually drop all records. Setting NaN to 0 or any other number would not work either. I want to mask the NaNs out of the sum over features in the distance.
I had the same problem when implementing a kNN classifier for data with missing values. When calling the fit() method, scikit-learn checks whether there are NaNs in the data and then raises the error. I didn't find a solution and ended up writing my own kNN classifier.
Assuming your data is scaled to zero mean and unit variance, replacing NaN with 0 is not a good idea, as you already stated. So I also decided to ignore a feature in the distance computation between two samples if at least one of the two values is NaN. However, this increases the chance that samples with many missing values end up with small distances to other samples. So it makes sense to normalize the distance by the number of features that are present in both samples, and to only consider samples as nearest neighbors when a minimum number of features have values in both samples.
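A sketch of that approach in plain NumPy (not the exact classifier I wrote; the function names are my own): features where either sample is NaN are ignored, the squared distance is rescaled by n_features / n_shared, and candidates sharing too few observed features are excluded.

import numpy as np

def nan_distance(a, b):
    # Use only the coordinates observed in both samples
    shared = ~np.isnan(a) & ~np.isnan(b)
    n_shared = shared.sum()
    if n_shared == 0:
        return np.inf  # nothing to compare on
    sq = np.sum((a[shared] - b[shared]) ** 2)
    # Rescale so samples with many NaNs are not unfairly close
    return np.sqrt(sq * a.size / n_shared)

def nearest_neighbors(X, query, k=10, min_shared=3):
    dists = np.array([nan_distance(query, row) for row in X])
    # Require a minimum number of features observed in both samples
    overlap = np.sum(~np.isnan(X) & ~np.isnan(query), axis=1)
    dists[overlap < min_shared] = np.inf
    return np.argsort(dists)[:k]

As far as I know, newer scikit-learn releases also ship sklearn.metrics.pairwise.nan_euclidean_distances, which applies the same kind of rescaling, if you'd rather not hand-roll the metric.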

Convert independent sklearn GaussianMixture log probability scores to probabilities summing to 1

I have labeled 2D data. There are 4 labels in the set, and I know the correspondence of every point to its label. I'd like to, given a new arbitrary data point, find the probability that it has each of the 4 labels. It must belong to one and only one of the labels, so the probabilities should sum to 1.
What I've done so far is to train 4 independent sklearn GMMs (sklearn.mixture.GaussianMixture) on the data points associated with each label. It should be noted that I do not wish to train a single GMM with 4 components because I already know the labels, and don't want to re-cluster in a way that is worse than my known labels. (It would appear that there is a way to provide Y= labels to the fit() function, but I can't seem to get it to work).
In the plot I made (not reproduced here), points are colored by their known labels, and the contours represent the four independent GMMs fitted to these 4 sets of points.
For a new point, I attempted to compute the probability of its label in a couple ways:
GaussianMixture.predict_proba(): Since each independent GMM has only one distribution, this simply returns a probability of 1 for all models.
GaussianMixture.score_samples(): According to the documentation, this returns the "weighted log probabilities for each sample". My procedure is, for a single new point, to make one call to this function on each of the four independently trained GMMs representing the distributions above. I do get semi-sensible results here: typically a positive number for the correct model and negative numbers for each of the three incorrect models, with more muddled results for points near intersecting distribution boundaries. Here's a typical clear-cut result:
2.904136, -60.881554, -20.824841, -30.658509
This point is actually associated with the first label and is least likely to be the second label (it is farthest from the second distribution). My question is how to convert the above scores into probabilities that sum to 1 and accurately represent the chance that the given point belongs to one and only one of the four distributions. Given that these are 4 independent models, is this possible? If not, is there another method I have overlooked that would allow me to train GMM(s) based on known labels and would provide probabilities that sum to 1?
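For concreteness, a minimal sketch of the setup described above (variable names are hypothetical; X is the 2D data and y holds the four known labels):

import numpy as np
from sklearn.mixture import GaussianMixture

labels = np.unique(y)
models = {}
for lab in labels:
    # One independent GMM per known label (a single component each here)
    models[lab] = GaussianMixture(n_components=1, random_state=0).fit(X[y == lab])

# Scores for one new point: four weighted log probabilities, one per model
x_new = np.array([[0.5, -1.2]])  # arbitrary example point
scores = np.array([models[lab].score_samples(x_new)[0] for lab in labels])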
In general, if you don't know how the scores are calculated but you know that there is a monotonic relationship between the scores and the probability, you can simply use the softmax function to approximate a probability, with an optional temperature variable that controls the spikiness of the distribution.
Let V be your list of scores and tau be the temperature. Then,
p = np.exp(V/tau) / np.sum(np.exp(V/tau))
is your answer.
P.S.: Luckily, we do know how sklearn's GMM scoring works: score_samples() returns log densities, so softmax with tau=1 (exponentiate and normalize) is your exact answer, assuming equal priors over the four labels.
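Plugging in the scores quoted above with tau = 1 (subtracting the maximum first for numerical stability, which does not change the result):

import numpy as np

V = np.array([2.904136, -60.881554, -20.824841, -30.658509])
tau = 1.0
z = V / tau - np.max(V / tau)       # shift for numerical stability
p = np.exp(z) / np.sum(np.exp(z))
print(p)  # ~[1.0, 2e-28, 5e-11, 3e-15]: essentially all mass on the first label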

python sklearn: what is the difference between "sklearn.preprocessing.normalize(X, norm='l2')" and "sklearn.svm.LinearSVC(penalty='l2')"?

Here are two methods that use "l2":
1: this one is used in data pre-processing: sklearn.preprocessing.normalize(X, norm='l2')
2: the other one is used in the classifier: sklearn.svm.LinearSVC(penalty='l2')
I want to know: what is the difference between them? Must both steps be used in a complete model, or is using just one of them enough?
These are two different things, and you normally need both of them to build a good SVC model.
1) The first one scales (normalizes) the data matrix X: by default, normalize(X, norm='l2') divides each sample (row) by its L2 norm, sqrt(sum(abs(X[i, :])**2)); pass axis=0 if you want to divide each column by its norm instead. This ensures that no values become too large, which can make it hard for some algorithms to converge.
2) Irrespective of how well scaled (and small in value) your data is, there may still be outliers, or some features may be far too dominant, and your algorithm (LinearSVC()) might trust them more than it should. This is where L2 regularization comes into play: in addition to the loss the algorithm minimizes, a cost proportional to the squared coefficients is added so that they don't become too big. In other words, the coefficients of the model become an additional cost term in the SVC objective, sum_j beta[j]**2. How strong this penalty is relative to the data-fitting term is controlled by C: in scikit-learn the objective is roughly ||beta||**2 / 2 + C * sum(hinge loss), so a smaller C means stronger regularization.
To sum up, the first one tells you by which value to divide each row (or column) of the X matrix; the second one decides how much the coefficients should burden the cost function.
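A small sketch contrasting the two (assuming numeric arrays X and y are already loaded):

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# 1) Pre-processing: rescale each sample (row) to unit L2 norm.
#    Pass axis=0 instead to rescale each column/feature.
X_norm = normalize(X, norm="l2")

# 2) Regularization: the classifier penalizes large coefficients while fitting.
#    In scikit-learn, smaller C means a stronger L2 penalty.
clf = LinearSVC(penalty="l2", C=1.0)
clf.fit(X_norm, y)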
