I read about using RPCA to find outliers in time series data. I have a rough understanding of the fundamentals and the theory behind RPCA. Using a Python library that implements RPCA, I obtained two matrices as output, L and S: a low-rank approximation of the input data and a sparse matrix.
Input data (each row is a day, with 10 features as columns):
DAY 1 - 100,300,345,126,289,387,278,433,189,153
DAY 2 - 300,647,245,426,889,987,278,133,295,153
DAY 3 - 200,747,145,226,489,287,378,1033,295,453
Output obtained:
L
[[ 125.20560531 292.91525518 92.76132814 141.33797061 282.93586313
185.71134917 199.48789246 96.04089205 192.11501055 118.68811072]
[ 174.72737183 408.77013914 129.45061871 197.24046765 394.84366245
259.16456278 278.39005349 134.0273274 268.1010231 165.63205458]
[ 194.38951303 454.76920678 144.01774873 219.43601655 439.27557808
288.32845493 309.71739782 149.10947628 298.27053871 184.27069609]]
S
[[ -25.20560531 0. 252.23867186 -0. 0.
201.28865083 78.51210754 336.95910795 -0. 34.31188928]
[ 125.27262817 238.22986086 115.54938129 228.75953235 494.15633755
727.83543722 -0. -0. 26.8989769 -0. ]
[ 0. 292.23079322 -0. 0. 49.72442192
-0. 68.28260218 883.89052372 0. 268.72930391]]
Inference: (My question)
Now, how do I infer which points could be classified as outliers? For example, by looking at the data, we could say 1033 looks like an outlier. The corresponding entry in the S matrix is 883.89052372, which is large compared to the other entries in S. Could a fixed threshold on the deviation of an S matrix entry from the corresponding value in the input matrix be used to decide that the point is an outlier? Or have I completely misunderstood the concept of RPCA? TIA for your help.
You understood the concept of robust PCA (RPCA) correctly: The sparse matrix S contains the outliers.
However, S will often contain many observations (non-zero values) that you might not classify as anomalies yourself. As you suggest, it is therefore a good idea to filter out these points.
Applying a fixed threshold to identify relevant outliers could potentially work for one dataset. However, using the same threshold on many datasets might give poor results if the mean and variance of the underlying distribution change.
Ideally you calculate an anomaly score and then classify the outliers based on that score. A simple method (and often used in outlier detection) is to see if your data point (potential outlier) is at the tail of your assumed distribution.
For example, if you assume your distribution is Gaussian you can calculate the Z-score (z):
z = (x-μ)/σ,
where μ is the mean and σ is the standard deviation.
You can then apply a threshold to the calculated Z-score in order to identify an outlier. For example, if z > 3 for a given observation, the data point is an outlier. This means your observation is more than 3 standard deviations from the mean and lies in roughly the 0.1% tail of the Gaussian distribution. This approach is more robust to changes in the data than using a threshold on the non-standardized values. Furthermore, tuning the z value at which you classify the outlier is simpler than finding a raw-scale value (883.89052372 in your case) for each dataset.
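A minimal sketch of that Z-score filter, stubbing S with the (rounded) values from the question; the cut-off of 3 is just the example threshold mentioned above:

import numpy as np

# S as returned by the RPCA decomposition (rounded values from the question).
S = np.array([
    [ -25.2056,   0.0   , 252.2387,   0.0   ,   0.0   , 201.2887,  78.5121, 336.9591,   0.0   ,  34.3119],
    [ 125.2726, 238.2299, 115.5494, 228.7595, 494.1563, 727.8354,   0.0   ,   0.0   ,  26.8990,   0.0   ],
    [   0.0   , 292.2308,   0.0   ,   0.0   ,  49.7244,   0.0   ,  68.2826, 883.8905,   0.0   , 268.7293],
])

# Z-score of every entry of S relative to the distribution of all S entries.
z = (S - S.mean()) / S.std()

# Flag entries whose score exceeds the chosen cut-off; with these values the
# entry corresponding to the 1033 observation should be the only one flagged.
threshold = 3
rows, cols = np.nonzero(np.abs(z) > threshold)
for r, c in zip(rows, cols):
    print(f"day {r + 1}, feature {c + 1}: S = {S[r, c]:.2f}, z = {z[r, c]:.2f}")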
I am working with two-dimensional data
X = array([[5.40310335, 0. ],
[6.86136114, 6.56225717],
[0. , 0. ],
...,
[5.88838732, 0. ],
[6.0003473 , 0. ],
[6.25971331, 0. ]])
Looking for clusters using Euclidean distance, I run affinity propagation from scikit-learn on this raw data as follows:
af = AffinityPropagation(damping=.9, max_iter=300, random_state=0).fit(X)
obtaining 9 clusters as a result.
I understand that when you want to use another distance you have to pass the negative distance matrix and use affinity='precomputed', as follows:
af_c = AffinityPropagation(damping=.9, max_iter=300,
affinity='precomputed', random_state=0).fit(distM)
If as distM I use the negative Euclidean distance matrix, calculated as follows,
distM = -distance_matrix(X, X)
np.fill_diagonal(distM, np.median(distM))
filling the diagonal with the median, since that is also the default preference value in the method.
Using this, I get 34 clusters as a result, whereas I would expect 9, as when working with the default distance. I don't know whether I'm interpreting the way of passing the distance matrix correctly or whether the library does something different when everything is left at its defaults.
I would appreciate any help.
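One detail that may be worth checking: scikit-learn documents affinity='euclidean' as the negative squared Euclidean distance between points, so a precomputed matrix built the same way should come closer to reproducing the default behaviour. A minimal sketch, with random data standing in for X:

import numpy as np
from scipy.spatial import distance_matrix
from sklearn.cluster import AffinityPropagation

# Hypothetical data standing in for X; any (n_samples, 2) array works here.
rng = np.random.default_rng(0)
X = rng.random((50, 2)) * 7

# Default fit: affinity='euclidean' uses negative *squared* Euclidean distances.
af = AffinityPropagation(damping=.9, max_iter=300, random_state=0).fit(X)

# Precomputed similarities that mirror the default -- note the square.
S = -distance_matrix(X, X) ** 2
af_c = AffinityPropagation(damping=.9, max_iter=300,
                           affinity='precomputed', random_state=0).fit(S)

# With identical similarities and the default preference (median of S),
# both runs should find the same number of clusters.
print(len(af.cluster_centers_indices_), len(af_c.cluster_centers_indices_))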
For DBSCAN implementation, is it necessary to have all the feature columns Standardized AND Normalized?
e.g.
[[ 664. , 703. , 2901.069079],
[ 632. , 717. , 2901.069079],
[ 606. , 740. , 4386.449399],
[ 635. , 751. , 4386.449399],
[ 672. , 525. , 4760.874001]]
If I have to run DBSCAN on this, is it mandatory to standardize it first and then normalize it? Or just normalize it?
Additionally, how do these values dictate the choice of eps?
Normalizing or standardizing your data can ruin important properties of your data set.
Some examples:
your data are geo coordinates. Latitude and longitude must never be normalized or standardized
your data are histograms. The only meaningful normalization is to make the sum of the histogram 1. Never transform single variables!
your data has a meaningful zero. For example, it is a monetary value. Transforming with sgn(x)*sqrt(abs(x)) may be helpful in some domains, though (a small sketch after this answer illustrates it).
your data is sparse. Never standardize. (Normalization may be 'okay' if you do not have negative values.)
Choosing a scaling should not be done "because it is always done", but because of the actual data you have! Choose it because it is the right thing, not because it is the "default" or appears in some tutorial.
Most likely, if you resort to normalization or standardization by default, you have not understood your data or how to measure distance or similarity; people then use normalization as a last resort to get "some" result, but you never know whether that result is meaningful at all.
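For the signed square-root transform mentioned above, a minimal sketch with hypothetical monetary values:

import numpy as np

# Hypothetical monetary values: a meaningful zero and both signs.
amounts = np.array([-2500.0, -40.0, 0.0, 15.0, 180.0, 9800.0])

# Signed square root: compresses large magnitudes but keeps the sign and
# leaves zero at zero, unlike standardization or min-max scaling.
transformed = np.sign(amounts) * np.sqrt(np.abs(amounts))
print(transformed)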
I've looked everywhere but couldn't quite find what I want. Basically the MNIST dataset has images with pixel values in the range [0, 255]. People say that in general, it is good to do the following:
Scale the data to the [0,1] range.
Normalize the data to have zero mean and unit standard deviation (data - mean) / std.
Unfortunately, no one ever shows how to do both of these things. They all subtract a mean of 0.1307 and divide by a standard deviation of 0.3081. These values are basically the mean and the standard deviation of the dataset divided by 255:
from torchvision.datasets import MNIST
import torchvision.transforms as transforms
trainset = MNIST(root='./data', train=True, download=True)
print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(trainset.data.min(), trainset.data.max()))
print('Mean Pixel Value {} \nPixel Values Std: {}'.format(trainset.data.float().mean(), trainset.data.float().std()))
print('Scaled Mean Pixel Value {} \nScaled Pixel Values Std: {}'.format(trainset.data.float().mean() / 255, trainset.data.float().std() / 255))
This outputs the following
Min Pixel Value: 0
Max Pixel Value: 255
Mean Pixel Value 33.31002426147461
Pixel Values Std: 78.56748962402344
Scaled Mean Pixel Value 0.13062754273414612
Scaled Pixel Values Std: 0.30810779333114624
However, this clearly does neither of the above! The resulting data will not be in [0, 1] and will not have mean 0 or std 1. In fact, this is what we are doing:
[data - (mean / 255)] / (std / 255)
which is very different from this
[(scaled_data) - (mean/255)] / (std/255)
where scaled_data is just data / 255.
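To make the difference concrete, a tiny numeric check with a hypothetical pixel value (mean and std are the raw dataset statistics printed above):

import numpy as np

# Hypothetical raw pixel value in [0, 255]; raw dataset statistics from above.
data = np.array([128.0])
mean, std = 33.31, 78.57

# What Normalize((0.1307,), (0.3081,)) would do to the *raw* values:
raw_normalized = (data - mean / 255) / (std / 255)

# What it does after the values have first been scaled to [0, 1]:
scaled_normalized = (data / 255 - mean / 255) / (std / 255)

print(raw_normalized, scaled_normalized)  # wildly different magnitudes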
I may have stumbled upon this a little too late, but hopefully I can help a little bit.
Assuming that you are using torchvision.transforms, the following code can be used to normalize the MNIST dataset.
import torch
from torchvision import datasets, transforms

# Build a loader whose samples are scaled to [0, 1] and then standardized.
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=64, shuffle=True)
Usually, transforms.ToTensor() is used to turn input data in the range [0, 255] into a 3-dimensional tensor. This transform automatically scales the input data to the range [0, 1] (this is the scaling to [0, 1] you described).
Therefore, it makes sense that the mean and std passed to transforms.Normalize(...) are 0.1307 and 0.3081, respectively (this is the normalization to zero mean and unit standard deviation you described).
Please refer to the link below for better explanation.
https://pytorch.org/vision/stable/transforms.html
I think you misunderstand one critical concept: these are two different, and inconsistent, scaling operations. You can have only one of the two:
mean = 0, stdev = 1
data range [0,1]
Think about it, considering the [0,1] range: if the data are all small positive values, with min=0 and max=1, then the sum of the data must be positive, giving a positive, non-zero mean. Similarly, the stdev cannot be 1 when none of the data can possibly be as much as 1.0 different from the mean.
Conversely, if you have mean=0, then some of the data must be negative.
You use only one of the two transformations. Which one you use depends on the characteristics of your data set, and -- ultimately -- which one works better for your model.
For the [0,1] scaling, you simply divide by 255.
For the mean=0, stdev=1 scaling, you perform the simple linear transformation you already know:
new_val = (old_val - old_mean) / old_stdev
Does that clarify it for you, or have I entirely missed your point of confusion?
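A quick numerical sketch of the two options described above, with hypothetical data standing in for the pixel values:

import numpy as np

# Hypothetical stand-in for raw pixel values in [0, 255].
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=1000).astype(float)

# Option 1: scale to [0, 1].
scaled = data / 255
print(scaled.min(), scaled.max())                 # in [0, 1], but mean != 0, std != 1

# Option 2: standardize to mean 0, std 1.
standardized = (data - data.mean()) / data.std()
print(standardized.mean(), standardized.std())    # ~0 and ~1, but not confined to [0, 1]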
Purpose
Two of the most important reasons for features scaling are:
You scale features to make them all of the same magnitude (i.e. importance or weight).
Example:
Dataset with two features: Age and Weight, with the ages in years and the weights in grams! A person in their twenties weighing only 60 kg would translate to the vector [20 yrs, 60000 g], and so on for the whole dataset. The Weight attribute will dominate during the training process (a small sketch after this section illustrates this). How much it dominates depends on the type of algorithm you are using; some are more sensitive than others. For example, in a neural network the learning rate for gradient descent is affected by the magnitude of the network's weights (thetas), which in turn vary in correlation with the inputs (i.e. the features) during training; feature scaling also improves convergence. Another example is the k-means clustering algorithm, which requires features of the same magnitude since it is isotropic in all directions of space.
You scale features to speed up execution time.
This is straightforward: all the matrix multiplications and parameter summations are faster with small numbers than with very large ones (or with the very large numbers produced by multiplying features by other parameters, etc.).
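A minimal sketch of the Age/Weight example above, using two hypothetical samples:

import numpy as np

# Two hypothetical people: [age in years, weight in grams].
a = np.array([20.0, 60000.0])
b = np.array([40.0, 61000.0])

# Unscaled Euclidean distance: weight dominates, the 20-year age gap is noise.
print(np.linalg.norm(a - b))               # ~1000.2

# With both features on comparable scales (e.g. years and kilograms),
# the age difference contributes meaningfully again.
a_kg = np.array([20.0, 60.0])
b_kg = np.array([40.0, 61.0])
print(np.linalg.norm(a_kg - b_kg))         # ~20.02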
Types
The most popular types of Feature Scalers can be summarized as follows:
StandardScaler: usually your first option and very commonly used. It standardizes the data (i.e. centers and rescales them) so that they have std = 1 and mean = 0. It is affected by outliers and should only be used if your data have a Gaussian-like distribution.
MinMaxScaler: usually used when you want to bring all your data points into a specific range (e.g. [0, 1]). It is heavily affected by outliers simply because it uses the range.
RobustScaler: It's "robust" against outliers because it scales the data according to the quantile range. However, you should know that outliers will still exist in the scaled data.
MaxAbsScaler: Mainly used for sparse data.
Unit Normalization: It basically scales the vector for each sample to have unit norm, independently of the distribution of the samples.
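A minimal sketch of these scalers in scikit-learn, applied to a small hypothetical matrix with one obvious outlier, to see how differently they react:

import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, RobustScaler,
                                   MaxAbsScaler, Normalizer)

# Hypothetical feature matrix; the last value of the second column is an outlier.
X = np.array([[1.0,  200.0],
              [2.0,  220.0],
              [3.0,  210.0],
              [4.0, 5000.0]])

# Normalizer is scikit-learn's per-sample unit-norm scaler.
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler(),
               MaxAbsScaler(), Normalizer()):
    print(scaler.__class__.__name__)
    print(scaler.fit_transform(X))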
Which One & How Many
You need to get to know your dataset first. As mentioned above, there are things you need to look at beforehand, such as the distribution of the data, the existence of outliers, and the algorithm being used.
Anyhow, you need one scaler per dataset, unless there is a specific requirement, for example an algorithm that works only if the data are within a certain range and have zero mean and unit standard deviation, all at once. Nevertheless, I have never come across such a case.
Key Takeaways
There are different types of Feature Scalers that are used based on some rules of thumb mentioned above.
You pick one Scaler based on the requirements, not randomly.
You scale data for a purpose; for example, with the random forest algorithm you usually do NOT need to scale at all.
Well, the data gets scaled to [0, 1] using torchvision.transforms.ToTensor(), and then the normalization (0.1307, 0.3081) is applied.
You can read about it in the PyTorch documentation: https://pytorch.org/vision/stable/transforms.html.
Hope that answers your question.
I am trying to calculate the entropy (scipy.stats.entropy) between two arrays of numerical values, to quantify the difference of their underlying distributions.
As calculating the entropy requires both inputs to have the same shape, I want to estimate the distribution of the smaller list using KDE to sample new data from it.
Working with inputs between 0 and 1e-02, I am not able to draw plausible numbers from the fitted KDE.
import numpy as np
from sklearn.neighbors import KernelDensity

emp_values = np.array([0.000618, 0.000425, 0.000597, 0.000528, 0.000393, 0.000721,
0.000674, 0.000703, 0.000632, 0.000383, 0.000466, 0.000919,
0.001419, 0.00063 , 0.000433, 0.000516, 0.001419, 0.000655,
0.000674, 0.000676, 0.000694, 0.000396, 0.000688, 0.00061 ,
0.000687, 0.000633, 0.000601, 0.00061 , 0.000747, 0.000356,
0.000824, 0.000931, 0.000691, 0.000907, 0.000553, 0.000748,
0.000828, 0.000907, 0.000457, 0.000494])
kde_emp = KernelDensity().fit(emp_values.reshape(-1, 1))
Using kde_emp.sample to draw random numbers yields values completely out of range:
kde_emp.sample(10)
array([[-3.0811253 ],
[ 1.24822136],
[ 0.07815318],
[ 0.01609681],
[-0.59676707],
[-0.89988083],
[-0.59071966],
[-0.72741754],
[ 0.82296101],
[ 0.08329316]])
What would then be the appropriate way to draw 10,000 random samples from the fitted PDF?
The bandwidth of the KDE defaults to 1, which is far too large for this data. Try using something like this instead:
kde_emp = KernelDensity(bandwidth=5e-5)
kde_emp.fit(emp_values.reshape(-1, 1))
to set the bandwidth to something more sensible for this data. The Wikipedia page on kernel density estimation explains what the bandwidth means.
I don't know scikit-learn very well, but there doesn't seem to be a nice way of estimating this value automatically in it, i.e. you'd have to write your own estimator (roughly 5 to 10 lines of code).
That said, if you're just after a Gaussian KDE, then SciPy has one that does estimate the bandwidth for you; see scipy.stats.gaussian_kde. Note that the SciPy estimator assumes the data is unimodal (as most do), whereas yours certainly isn't, so you'd need a smaller bandwidth value.
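A minimal sketch of both routes, with stand-in data of the same order of magnitude as emp_values:

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

# Stand-in for the emp_values array from the question (same order of magnitude).
rng = np.random.default_rng(0)
emp_values = rng.normal(6.5e-4, 1.5e-4, size=40)

# Route 1: sklearn KDE with an explicitly small bandwidth.
kde_emp = KernelDensity(bandwidth=5e-5).fit(emp_values.reshape(-1, 1))
samples_sklearn = kde_emp.sample(10_000, random_state=0)

# Route 2: SciPy's Gaussian KDE, which estimates the bandwidth from the data
# (Scott's rule by default); pass a smaller bw_method if it oversmooths.
kde_scipy = gaussian_kde(emp_values)
samples_scipy = kde_scipy.resample(10_000)

print(samples_sklearn.min(), samples_sklearn.max())
print(samples_scipy.min(), samples_scipy.max())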
I am doing PCA and I am interested in which original features were most important. Let me illustrate this with an example:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[1,-1, -1,-1], [1,-2, -1,-1], [1,-3, -2,-1], [1,1, 1,-1], [1,2,1,-1], [1,3, 2,-0.5]])
print(X)
Which outputs:
[[ 1. -1. -1. -1. ]
[ 1. -2. -1. -1. ]
[ 1. -3. -2. -1. ]
[ 1. 1. 1. -1. ]
[ 1. 2. 1. -1. ]
[ 1. 3. 2. -0.5]]
Intuitively, one could already say that feature 1 and feature 4 are not very important due to their low variance. Let's apply PCA to this set:
pca = PCA(n_components=2)
pca.fit_transform(X)
comps = pca.components_
Output:
array([[ 0. , 0.8376103 , 0.54436943, 0.04550712],
[-0. , 0.54564656, -0.8297757 , -0.11722679]])
This output represents the importance of each original feature for each of the two principal components (see this for reference). In other words, for the first principal component, feature 2 is most important, then feature 3. For the second principal component, feature 3 looks most important.
The question is: which feature is most important, which one second most, etc.? Can I use the components_ attribute for this? Or am I wrong, and is PCA not the correct method for such an analysis (and should I use a feature selection method instead)?
The components_ attribute on its own is not the right place to look for feature importance. The loadings in the two arrays (i.e. the two components PC1 and PC2) tell you how your original matrix is transformed by each feature (taken together, they form a rotation matrix). But they don't tell you how much each component contributes to describing the transformed feature space, so you don't yet know how to compare the loadings across the two components.
However, the answer that you linked actually tells you what to use instead: the explained_variance_ratio_ attribute. This attribute tells you how much of the variance in your feature space is explained by each principal component:
In [5]: pca.explained_variance_ratio_
Out[5]: array([ 0.98934303, 0.00757996])
This means that the first principal component explains almost 99 percent of the variance. You know from components_ that PC1 has the highest loading for the second feature. It follows, therefore, that feature 2 is the most important feature in your data space. Feature 3 is the next most important, as it has the second-highest loading in PC1.
In PC2, the absolute loadings are nearly swapped between feature 2 and feature 3. But as PC2 explains next to nothing of the overall variance, this can be neglected.
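One way to turn this reasoning into a single score per original feature is to weight the absolute loadings by the explained variance ratio of each component and sum them. This aggregation is my own suggestion, not something built into scikit-learn; a minimal sketch:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1],
              [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])

pca = PCA(n_components=2).fit(X)

# Weight the absolute loadings of each component by the variance it explains,
# then sum per original feature to obtain one importance score per feature.
importance = np.abs(pca.components_).T @ pca.explained_variance_ratio_
ranking = np.argsort(importance)[::-1]

print(importance)   # one score per original feature
print(ranking + 1)  # features ordered from most to least important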