I want to use unsupervised nearest neighbors and I have NaN in my data. When a feature of a record is NaN, I want it not to count towards the distance to any other record. Filling NaN with 0 would not work: it would make the record close to records whose value for that feature is near 0 and far from records whose value is far from 0.
I created a Euclidean metric that does this, since NaN propagates through - and **, but is ignored by nansum. However, I am still getting an error due to the NaN.
Is there any way to fix this error? I would consider using another module than sklearn if needed.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def metric(x1, x2):
    return np.nansum((x1 - x2) ** 2)

nn = NearestNeighbors(n_neighbors=10, metric=metric, n_jobs=-1)
nn.fit(x)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
What I mean is: if a record has a NaN for, say, the 10th feature, that feature should not count in the distance to any other record, so the record will be equally close to any other record whether that record has -1, 0, 13 or any other number for the 10th feature.
Dropping records with NaN would not work; it would actually drop all records. Setting NaN to 0 or any other number would not work either. I want to mask the NaN out of the sum of distances over the features.
I had the same problem when implementing a kNN classifier for data with missing values. When the fit() method is called, scikit-learn checks whether there are NaNs in the data and then raises the error. I didn't find a solution and ended up writing my own kNN classifier.
Assuming your data is scaled to zero mean and unit variance, replacing NaN with 0 is not a good idea, as you already stated. So I also decided to ignore a feature in the distance computation between two samples if at least one of the values is NaN. However, this increases the chance that samples with many missing values have small distances to other samples. So it makes sense to normalize the distance by the number of features where both samples are complete, and to only consider samples as nearest neighbors when a minimum number of features have values in both samples.
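A minimal NumPy sketch of that idea (the function names are illustrative, not part of scikit-learn):

import numpy as np

def masked_euclidean(x1, x2, min_shared=1):
    # distance over the features present in both samples, normalized by their count
    both_present = ~np.isnan(x1) & ~np.isnan(x2)
    n_shared = both_present.sum()
    if n_shared < min_shared:
        return np.inf                         # too little overlap to compare reliably
    diff = x1[both_present] - x2[both_present]
    return np.sqrt(np.sum(diff ** 2) / n_shared)

def k_nearest(X, query, k=10, min_shared=1):
    # brute-force neighbor search, since NearestNeighbors.fit() rejects NaN input
    dists = np.array([masked_euclidean(query, row, min_shared) for row in X])
    return np.argsort(dists)[:k]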
I have a regression problem with 1 target and 10 features. When I look at the outliers for each feature with a box plot, the features have different numbers of outliers. While removing outliers, do I also need to remove the associated target values?
I mean, let's say: for feature #1 I have 12 outliers and I remove them along with their 12 target values. Then for feature #2 I have 23 outliers and I remove them along with their 23 target values as well, and so on. Should the procedure be like this, or how should I proceed?
I imagine each row of your data contains an ID, the value of the target and 10 feature values, one per feature. To answer your question: if you want to remove the outliers, you have to remove the whole observation/row - the value that you classify as an outlier, the corresponding target value, as well as the 9 other corresponding feature values. So you would have to keep only the rows in which the entry of feature_i is smaller than the threshold_i that you defined as the outlier cut-off.
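For example, a minimal pandas sketch of that row-wise filtering (the column names and the thresholds dictionary are hypothetical):

import pandas as pd

def drop_outlier_rows(df, thresholds):
    # thresholds maps each feature column to the cut-off you consider an outlier
    keep = pd.Series(True, index=df.index)
    for feature, threshold in thresholds.items():
        keep &= df[feature] <= threshold
    # dropping a row removes the flagged value, the target and the other 9 features together
    return df[keep]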
The reason is that a multiple linear regression calculates the influence of an incremental change in one feature on the target, assuming the other 9 features are held constant. Removing a single feature value without removing the target and the other features of that observation simply does not work in such a model (assuming you are using OLS).
However, I would be cautious with removing outliers. I don't know about your sample size and what you consider an outlier and it would help to know more about your research question, data and methodology.
I've looked everywhere but couldn't quite find what I want. Basically the MNIST dataset has images with pixel values in the range [0, 255]. People say that in general, it is good to do the following:
Scale the data to the [0,1] range.
Normalize the data to have zero mean and unit standard deviation (data - mean) / std.
Unfortunately, no one ever shows how to do both of these things. They all subtract a mean of 0.1307 and divide by a standard deviation of 0.3081. These values are basically the mean and the standard deviation of the dataset divided by 255:
from torchvision.datasets import MNIST
import torchvision.transforms as transforms
trainset = MNIST(root='./data', train=True, download=True)
print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(trainset.data.min(), trainset.data.max()))
print('Mean Pixel Value {} \nPixel Values Std: {}'.format(trainset.data.float().mean(), trainset.data.float().std()))
print('Scaled Mean Pixel Value {} \nScaled Pixel Values Std: {}'.format(trainset.data.float().mean() / 255, trainset.data.float().std() / 255))
This outputs the following
Min Pixel Value: 0
Max Pixel Value: 255
Mean Pixel Value 33.31002426147461
Pixel Values Std: 78.56748962402344
Scaled Mean Pixel Value 0.13062754273414612
Scaled Pixel Values Std: 0.30810779333114624
However, this clearly does neither of the above! The resulting data 1) will not be in [0, 1] and 2) will not have mean 0 or std 1. In fact, this is what we are doing:
[data - (mean / 255)] / (std / 255)
which is very different from this
[(scaled_data) - (mean/255)] / (std/255)
where scaled_data is just data / 255.
I may have stumbled upon this a little too late, but hopefully I can help a little bit.
Assuming that you are using torchvision.transforms, the following code can be used to normalize the MNIST dataset.
import torch
from torchvision import datasets, transforms

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])))
Usually, transforms.ToTensor() is used to turn input data in the range [0, 255] into a 3-dimensional tensor, and it automatically scales the data to the range [0, 1] (this is the scaling step).
Therefore, it makes sense that the mean and std used in transforms.Normalize(...) are 0.1307 and 0.3081, the statistics of the already-scaled data (this is the zero-mean, unit-standard-deviation step).
Please refer to the link below for a better explanation.
https://pytorch.org/vision/stable/transforms.html
I think you misunderstand one critical concept: these are two different, and inconsistent, scaling operations. You can have only one of the two:
mean = 0, stdev = 1
data range [0,1]
Think about it, considering the [0,1] range: if the data are all small positive values, with min=0 and max=1, then the sum of the data must be positive, giving a positive, non-zero mean. Similarly, the stdev cannot be 1 when none of the data can possibly be as much as 1.0 different from the mean.
Conversely, if you have mean=0, then some of the data must be negative.
You use only one of the two transformations. Which one you use depends on the characteristics of your data set, and -- ultimately -- which one works better for your model.
For the [0,1] scaling, you simply divide by 255.
For the mean=0, stdev=1 scaling, you perform the simple linear transformation you already know:
new_val = (old_val - old_mean) / old_stdev
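For concreteness, a quick NumPy sketch of the two separate options (the array is just illustrative):

import numpy as np

data = np.random.randint(0, 256, size=1000).astype(float)   # fake pixel values

scaled = data / 255.0                                # option 1: range [0, 1]
standardized = (data - data.mean()) / data.std()     # option 2: mean 0, stdev 1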
Does that clarify it for you, or have I entirely missed your point of confusion?
Purpose
Two of the most important reasons for feature scaling are:
You scale features to make them all of the same magnitude (i.e. importance or weight).
Example:
Dataset with two features: Age and Weight, with the ages in years and the weights in grams! Now a person in their twenties who weighs only 60 kg would translate to the vector [20 yrs, 60000 g], and so on for the whole dataset. The Weight attribute will dominate during the training process. How much, depends on the type of algorithm you are using - some are more sensitive than others. E.g. in a Neural Network, the Gradient Descent updates are affected by the magnitude of the network's thetas (i.e. the weights), and those in turn vary in correlation with the inputs (i.e. the features) during training; Feature Scaling also improves convergence. Another example is the K-Means clustering algorithm, which requires features of the same magnitude since it is isotropic in all directions of space.
You scale features to speed up execution time.
This is straightforward: all those matrix multiplications and parameter summations are faster with small numbers than with very large ones (or with the very large numbers produced by multiplying features by other parameters, etc.).
Types
The most popular types of Feature Scalers can be summarized as follows:
StandardScaler: usually your first option and very commonly used. It works by standardizing the data (i.e. centering it), bringing it to std = 1 and mean = 0. It is affected by outliers and should only be used if your data have a Gaussian-like distribution.
MinMaxScaler: usually used when you want to bring all your data points into a specific range (e.g. [0, 1]). It is heavily affected by outliers simply because it uses the range.
RobustScaler: it is "robust" against outliers because it scales the data according to the quantile range. However, you should know that outliers will still exist in the scaled data.
MaxAbsScaler: mainly used for sparse data.
Unit Normalization: it scales the vector of each sample to have unit norm, independently of the distribution of the samples. (A short usage sketch of these scalers follows below.)
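A short, hedged sketch of how these are applied with scikit-learn (the toy array is illustrative):

import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   RobustScaler, MaxAbsScaler, Normalizer)

X = np.array([[20.0, 60000.0],
              [35.0, 80000.0],
              [50.0, 75000.0]])

X_std    = StandardScaler().fit_transform(X)   # mean 0, std 1 per feature
X_minmax = MinMaxScaler().fit_transform(X)     # each feature mapped into [0, 1]
X_robust = RobustScaler().fit_transform(X)     # centered on the median, scaled by the IQR
X_maxabs = MaxAbsScaler().fit_transform(X)     # divided by the max absolute value
X_unit   = Normalizer().fit_transform(X)       # each *row* scaled to unit norm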
Which One & How Many
You need to get to know your dataset first. As mentioned above, there are things you need to look at beforehand, such as the distribution of the data, the existence of outliers, and the algorithm being used.
Anyhow, you need one scaler per dataset, unless there is a specific requirement, such as an algorithm that works only if the data are within a certain range and have zero mean and unit standard deviation, all at the same time. Nevertheless, I have never come across such a case.
Key Takeaways
There are different types of Feature Scalers that are used based on some rules of thumb mentioned above.
You pick one Scaler based on the requirements, not randomly.
You scale data for a purpose; for example, with the Random Forest algorithm you usually do NOT need to scale.
Well, the data gets scaled to [0, 1] by torchvision.transforms.ToTensor() and then the normalization (0.1307, 0.3081) is applied.
You can read about it in the PyTorch documentation: https://pytorch.org/vision/stable/transforms.html.
Hope that answers your question.
I'm using scikit-optimize to run a BayesSearchCV over my RandomForestClassifier hyperparameter space. One hyperparameter should also be able to take the value 0 (zero) while having a log-uniform distribution:
ccp_alpha = Real(min(ccp_alpha), max(ccp_alpha), prior='log-uniform')
Since log(0) is undefined, it is apparently impossible for the parameter to take the value 0 at any point.
Consequently, the following error is thrown:
ValueError: Not all points are within the bounds of the space.
Is there any way to work around this?
Note that getting a 0 from a log-uniform distribution is not well defined. How would you normalize this distribution, or in other words what would the odds be of drawing a 0?
The simplest approach would be to supply a list of values to try with the distribution you want. Since the values in this list will be sampled uniformly, you can encode any distribution you like. For example, with the list
reals = [0, 0, 0, 0, x1, x2, x3, x4]
where x1 to x4 are log-uniformly distributed, you will have odds of 4/8 of drawing a 0 and odds of 4/8 of drawing a log-uniformly distributed value.
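A hedged sketch of that list approach using scikit-optimize's Categorical dimension (the bounds and counts are illustrative):

import numpy as np
from skopt.space import Categorical

# draw four log-uniformly distributed values (illustrative bounds 1e-4 to 1)
x_samples = np.exp(np.random.uniform(np.log(1e-4), np.log(1.0), size=4))
candidates = [0.0] * 4 + list(x_samples)      # half zeros, half log-uniform draws

# BayesSearchCV samples uniformly from a Categorical dimension,
# so roughly half of the draws will be exactly 0.
ccp_alpha_dim = Categorical(candidates, name='ccp_alpha')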
If you really wanted to, you could also implement a class, say MyReal (probably subclassed from Real), with an rvs method that yields the distribution you want.
In my mind, there are multiple ways to treat dataset outliers:
- Delete the data
- Transform using log or binning
- Replace using the mean or median
- Test separately
I have a dataset of around 50000 observations, and each observation has quite a few outlier values (some variables have a small number of outliers, some have 100-200), so excluding data is not what I'm looking for, as it would cause me to lose a huge chunk of data.
I read somewhere that using the mean or median is for artificial outliers, but in my case I think the outliers are natural.
I was actually about to use the median to get rid of the outliers and then the mean to fill in missing values, but it doesn't seem right. However, I did use it nevertheless, with this code:
import numpy as np

median = X.median()
std = X.std()
outliers = (X - median).abs() > std
X[outliers] = np.nan               # mask outlier cells with NaN
X.fillna(median, inplace=True)
It did lower the overfitting of one model, logistic regression, but Random Forest still gives 100%, and the shape of the graph changed (the before/after plots are omitted here).
So I'm really confused about which technique to use. I also tried clipping the data at the 5th and 95th percentiles, but that didn't work either. Should I bin the data in each column from 1-10? Also, should I normalize or standardize my data before applying any model? Any guidance will be appreciated.
Check robust statistics.
I would suggest checking Huber's method / winsorization, which are also available in Python.
For hypothesis testing you have the Wilcoxon signed-rank test and, I think, the Mann-Whitney test.
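For example, winsorization is available in SciPy (a minimal sketch; the 5% limits are illustrative):

import numpy as np
from scipy.stats.mstats import winsorize

x = np.random.standard_t(df=3, size=1000)        # heavy-tailed toy data

# clip the lowest and highest 5% of values to the 5th/95th percentiles
x_winsorized = winsorize(x, limits=[0.05, 0.05])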
So I'm currently working on a project that involves the use of Principal Component Analysis, or PCA, and I'm attempting to learn it on the fly. Luckily, Python has a very convenient module, sklearn.decomposition, that seems to do most of the work for you. Before I really start to use it, though, I'm trying to figure out exactly what it's doing.
The dataframe I've been testing on looks like this:
0 1
0 1 2
1 3 1
2 4 6
3 5 3
And when I call PCA.fit() and then view the components I get:
array([[ 0.5172843 , 0.85581362],
[ 0.85581362, -0.5172843 ]])
From my rather limited knowledge of PCA, I kind of grasp how this was calculated, but where I get lost is when I then call PCA.transform. This is the output it gives me:
array([[-2.0197033 , -1.40829634],
[-1.84094831, 0.8206152 ],
[ 2.95540408, -0.9099927 ],
[ 0.90524753, 1.49767383]])
Could someone potentially walk me through how it takes the original dataframe and components and transforms it into this new array? I'd like to be able to understand the exact calculations it's doing so that when I scale up I'll have a better sense of what's going on. Thanks!
When you call fit, PCA computes some vectors that you can project your data onto in order to reduce its dimension. Since each row of your data is 2-dimensional, there will be at most 2 vectors onto which the data can be projected, and each of those vectors will be 2-dimensional. Each row of PCA.components_ is a single vector onto which things get projected, and it has the same size as the number of columns in your training data. Since you did a full PCA you get 2 such vectors, so you get a 2x2 matrix. The first of those vectors maximizes the variance of the projected data. The second maximizes the variance of what's left after the first projection. Typically one passes a value of n_components that's less than the dimension of the input data, so that you get back fewer rows and you have a wide but not tall components_ array.
When you call transform you're asking sklearn to actually do the projection. That is, you are asking it to project each row of your data into the vector space that was learned when fit was called. For each row of the data you pass to transform you'll have 1 row in the output and the number of columns in that row will be the number of vectors that were learned in the fit phase. In other words, the number of columns will be equal to the value of n_components you passed to the constructor.
Typically one uses PCA when the source data has lots of columns and you want to reduce the number of columns while preserving as much information as possible. Suppose you had a data set with 100 rows and each row had 500 columns. If you constructed a PCA like PCA(n_components = 10) and then called fit you'd find that components_ has 10 rows, one for each of the components you requested, and 500 columns as that's the input dimension. If you then called transform all 100 rows of your data would be projected into this 10-dimensional space so the output would have 100 rows (1 for each in the input) but only 10 columns thus reducing the dimension of your data.
The short answer to how this is done is that PCA computes a Singular Value Decomposition and then keeps only some of the columns of one of those matrices. Wikipedia has much more information on the actual linear algebra behind this - it's a bit long for a StackOverflow answer.
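As a quick illustration of that relationship on the example above, transform can be reproduced by centering the data with the mean learned during fit and projecting onto components_ (the signs of the components may differ between scikit-learn versions):

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2], [3, 1], [4, 6], [5, 3]], dtype=float)

pca = PCA()
transformed = pca.fit_transform(X)

# transform() centers each row with the learned mean and
# projects it onto the rows of components_
manual = (X - pca.mean_) @ pca.components_.T

print(np.allclose(transformed, manual))   # True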