Anomaly detection with mean shift sklearn - python

I'm trying to use mean shift from sklearn to find anomalies and outliers in a dataset. The datasets are signal values from sensors. I have a training dataset to train the algorithm and a test dataset containing dummy anomalies. My problem is that when I use the predict method on the test dataset, mean shift doesn't label the anomalies with -1 or any other value that indicates an anomaly or outlier, but instead assigns them to a valid cluster.
Here is the code:
import pandas as pd
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn import preprocessing

if __name__ == '__main__':
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # scaler fitted on the training data; bandwidth estimated on the raw training data
    scaler = preprocessing.StandardScaler().fit(train)
    bandwidth = estimate_bandwidth(train, n_jobs=-1)

    ms = MeanShift(bandwidth=bandwidth, n_jobs=-1)
    ms.fit(scaler.transform(train))

    prediction = ms.predict(scaler.transform(test))
    test["cluster"] = prediction
    print(np.unique(prediction))
Here are the first 5 rows of the training dataset:
A B C
0 300 0 200
1 300 0 200
2 300 0 350
3 300 1 350
4 400 1 350
Here are the first 5 rows of the test dataset, which contains a dummy anomaly:
A B C
0 300 0 200
1 300 0 200
2 300 0 350
3 100000000 100000000 100000000
4 400 1 350
What can I do to detect the anomalies in the test dataset?

Related

Should the background dataset for shap be standardized?

So I am trying to explain a basic SVM model using SHAP. However, the inputs to the SVM model are standardized (I used StandardScaler().fit() and then transformed the data points with that scaler so they can be used by the SVM model).
My question is now when using SHAP I need to give it a background distribution. Usually the input to this background distribution looks like this:
background_distribution = KMeans(n_clusters=10,random_state=0).fit(xtrain).cluster_centers_
However, I wanted to use my own custom background distribution, which contains selected data points. Does this mean the data points need to be standardized as well? I.e. instead of looking like
[ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
they look like this
[ 0.67028006 -0.18887347 0.90860212 -0.41342579 0.26204266 0.55080012
-0.85479154 0.13743146 -0.70749448 -0.42919754 1.21628074 -0.71418983
-0.26726124 -0.52247913 -0.34755864 0.31234752 -0.23208655 -0.63565412
-0.40904178 0. 4.89897949 -0.23473314 0.64082627 -0.46852129
-0.26726124 -0.44542354 1.15657353 0.53795751]
For clarity: I am asking whether, after retrieving my points, I need to standardize the background dataset, since my original data points are scaled for use in the model, while my background distribution contains non-scaled data points.
The model training looks like this:
ss = StandardScaler().fit(X)
xtrain = ss.transform(xtrain) #Changes values to make them ML compatible -not needed for trees
xtest = ss.transform(xtest)
support_vector_classifier = SVC(kernel='rbf')
support_vector_classifier.fit(xtrain,ytrain)
y_pred_svc = support_vector_classifier.predict(xtest)
Option A:
background_distribution= [ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
shap.KernelExplainer(support_vector_classifier.predict,background_distribution)
Option B:
background_distribution= [ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
shap.KernelExplainer(support_vector_classifier.predict,background_distribution)
Option B is close: your background should be preprocessed in the same way as your training data.
This is the case in any ML situation where you preprocess data: whether you split your data into train, test, and validation sets, or feed data to a trained model for prediction, you always apply the same transformations to every part of your data, sometimes manually, sometimes through a pipeline. SHAP is no exception to this principle.
However, you should keep the following in mind as well: your scaler should be fit on the training data before being applied to test or background data. You can't fit it on the test, validation, or background data, because that would amount to asking to see the future before predicting it ("data leakage", as it is called in ML).
This means, you can't:
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
Rather:
ss = StandardScaler().fit(X_train)
background_distribution = ss.transform(background_distribution)
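Putting the pieces together, a minimal sketch reusing the question's variables (here custom_background is a hypothetical 2-D array holding the hand-picked raw data points, and ss is the scaler fit on the training data):
import numpy as np
import shap

# hand-picked raw data points, shape (n_points, n_features) -- hypothetical name
custom_background = np.asarray(custom_background)

# transform the background with the scaler fit on the training data; do NOT refit it
background_distribution = ss.transform(custom_background)

explainer = shap.KernelExplainer(support_vector_classifier.predict,
                                 background_distribution)
shap_values = explainer.shap_values(xtest)  # xtest was already scaled with the same ss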

Split k-fold where each fold of validation data doesn't include duplicates

Let's say I have a pandas dataframe df containing 1,000 rows, like below.
print(df)
id class
0 0000799a2b2c42d 0
1 00042890562ff68 0
2 0005364cdcb8e5b 0
3 0007a5a46901c56 0
4 0009283e145448e 0
... ... ...
995 04309a8361c5a9e 0
996 0430bde854b470e 0
997 0431c56b712b9a5 1
998 043580af9803e8c 0
999 043733a88bfde0c 0
It has 950 rows of class 0 and 50 rows of class 1.
Now I want to add one more column as fold, like below.
id class fold
0 0000799a2b2c42d 0 0
1 00042890562ff68 0 0
2 0005364cdcb8e5b 0 0
3 0007a5a46901c56 0 0
4 0009283e145448e 0 0
... ... ... ...
995 04309a8361c5a9e 0 4
996 0430bde854b470e 0 4
997 0431c56b712b9a5 1 4
998 043580af9803e8c 0 4
999 043733a88bfde0c 0 4
where the fold column contains 5 folds (0, 1, 2, 3, 4), and each fold has 200 rows: 190 of class 0 and 10 of class 1 (i.e., preserving the percentage of samples of each class).
I've tried StratifiedShuffleSplit from sklearn.model_selection, like below.
sss = StratifiedShuffleSplit(n_splits=5, random_state=2021, test_size=0.2)
for _, val_index in sss.split(df.id, df['class']):
    ....
Then I treated each val_index list as one specific fold, but it ends up giving me duplicate rows across the different val_index lists.
Can someone help me?
What you need is a k-fold splitter as used for cross-validation, not a train/test split. You can use StratifiedKFold. For example, say your dataset is like this:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold

np.random.seed(12345)
df = pd.DataFrame({'id': np.random.randint(1, 1e5, 1000),
                   'class': np.random.binomial(1, 0.1, 1000)})
df['fold'] = np.nan
We use the k-fold splitter, iterate through it as you did, and assign the fold number:
skf = StratifiedKFold(n_splits=5, shuffle=True)
for fold, (train, test) in enumerate(skf.split(df, df['class'])):
    df.loc[test, "fold"] = fold
End product:
pd.crosstab(df['fold'],df['class'])
class 0 1
fold
0.0 182 18
1.0 182 18
2.0 182 18
3.0 182 18
4.0 181 19
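If you later need the train/validation split for a particular fold, you can recover it from the new column (a small usage sketch; k is the fold you want to hold out):
k = 0
val_df = df[df['fold'] == k]
train_df = df[df['fold'] != k]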

How to get indices of instances during cross-validation

I am doing binary classification. How can I extract the real indices of the misclassified (or correctly classified) instances of the training data frame while doing k-fold cross-validation? I found no answer to this question here.
I got the values in folds as described here:
skf=StratifiedKFold(n_splits=10,random_state=111,shuffle=False)
cv_results = cross_val_score(model, X_train, y_train, cv=skf, scoring='roc_auc')
fold_pred = [pred[j] for i, j in skf.split(X_train,y_train)]
fold_pred
Is there any method to get the indices of the misclassified (or correctly classified) instances, so that the output is a dataframe containing only the misclassified (or correctly classified) instances from cross-validation?
Desired output:
Misclassified instances in the dataframe with real indices.
col1 col2 col3 col4 target
13 0 1 0 0 0
14 0 1 0 0 0
18 0 1 0 0 1
22 0 1 0 0 0
where the input has 100 instances and 4 are misclassified (index numbers 13, 14, 18 and 22) during CV.
From cross_val_predict you already have the predictions. It's a matter of subsetting your data frame where the predictions are not the same as your true label, for example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data[:, :5], columns=data.feature_names[:5])
df['label'] = data.target

rfc = RandomForestClassifier()
skf = StratifiedKFold(n_splits=10, random_state=111, shuffle=True)
pred = cross_val_predict(rfc, df.iloc[:, :5], df['label'], cv=skf)
df[df['label'] != pred]
mean radius mean texture ... mean smoothness label
3 11.42 20.38 ... 0.14250 0
5 12.45 15.70 ... 0.12780 0
9 12.46 24.04 ... 0.11860 0
22 15.34 14.26 ... 0.10730 0
31 11.84 18.70 ... 0.11090 0
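If you only need the original row indices of the misclassified instances rather than the full rows (a small follow-up using the same df and pred from above):
misclassified_idx = df.index[df['label'] != pred]
print(misclassified_idx.tolist())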

Visualizing clusters result using PCA (Python)

I have a dataset containing 61 rows (users) and 26 columns, on which I apply clustering with k-means and other algorithms.
As a first step I ran k-means on this data after normalizing it and identified 10 clusters.
In parallel I also want to visualize these clusters, which is why I use PCA to reduce the number of features.
I have written the following code:
UserID Communication_dur Lifestyle_dur Music & Audio_dur Others_dur Personnalisation_dur Phone_and_SMS_dur Photography_dur Productivity_dur Social_Media_dur System_tools_dur ... Music & Audio_Freq Others_Freq Personnalisation_Freq Phone_and_SMS_Freq Photography_Freq Productivity_Freq Social_Media_Freq System_tools_Freq Video players & Editors_Freq Weather_Freq
1 63 219 9 10 99 42 36 30 76 20 ... 2 1 11 5 3 3 9 1 4 8
2 9 0 0 6 78 0 32 4 15 3 ... 0 2 4 0 2 1 2 1 0 0
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Sc = StandardScaler()
X = Sc.fit_transform(df)
pca = PCA(3)
pca.fit(X)
pca_data = pd.DataFrame(pca.transform(X))
print(pca_data.head())
which gives the following results:
0 1 2
0 8 -4 5
1 -2 -2 1
2 1 1 -0
3 2 -1 1
4 3 -1 -3
I want to show a plot of the clusters in my dataset using PCA, and to interpret the results.
I am really new to this space and any advice would be greatly appreciated!
Thanks in advance once again.
Using an example dataset:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
df, y = make_blobs(n_samples=70, centers=10,n_features=26,random_state=999,cluster_std=1)
Perform scaling, PCA and put the PC scores into a dataframe:
Sc = StandardScaler()
X = Sc.fit_transform(df)
pca = PCA(2)
pca_data = pd.DataFrame(pca.fit_transform(X),columns=['PC1','PC2'])
Perform k-means, place the labels into the data frame, and you can already plot it using seaborn:
kmeans = KMeans(n_clusters=10).fit(X)
pca_data['cluster'] = pd.Categorical(kmeans.labels_)
sns.scatterplot(x="PC1", y="PC2", hue="cluster", data=pca_data)
Or matplotlib:
fig, ax = plt.subplots()
scatter = ax.scatter(pca_data['PC1'], pca_data['PC2'],
                     c=pca_data['cluster'], cmap='Set3', alpha=0.7)
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="upper left", title="")
ax.add_artist(legend1)
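If you also want to show how much variance the two components capture (a small optional addition, reusing the fitted pca object from above), you can put the explained variance ratio in the axis labels:
var = pca.explained_variance_ratio_
ax.set_xlabel(f"PC1 ({var[0]:.1%} explained variance)")
ax.set_ylabel(f"PC2 ({var[1]:.1%} explained variance)")
plt.show()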

How do I standardize only int64 columns after train-test split?

I have a dataframe ready for modelling; it contains continuous variables and one-hot-encoded variables:
ID Limit Bill_Sep Bill_Aug Payment_Sep Payment_Aug Gender_M Gender_F Edu_Uni DEFAULT_PAYMT
1 10000 2000 350 1000 350 1 0 1 1
2 30000 3000 5000 500 500 0 1 0 0
3 20000 8000 10000 8000 5000 1 0 1 1
4 45000 450 250 450 250 0 1 0 1
5 60000 700 1000 700 1000 1 0 1 1
6 8000 300 5000 300 2000 1 0 1 0
7 30000 3000 10000 1000 5000 0 1 1 1
8 15000 1000 1250 500 1750 0 1 1 1
All the numerical variables are 'int64' while the one-hot-encoded variables are 'uint8'. The binary outcome variable is DEFAULT_PAYMT.
I have gone down the usual train-test split route here, but I wanted to see if I could apply the StandardScaler only to the int64 variables (i.e., the variables that were not one-hot-encoded).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

featurelist = df.drop(['ID', 'DEFAULT_PAYMT'], axis=1)
X = featurelist
y = df['DEFAULT_PAYMT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
I am attempting the following code and it seems to work; however, I am not sure how to merge the categorical variables (which were not scaled) back into the X_scaled_tr and X_scaled_t arrays. I appreciate any form of help, thank you!
featurelist = df.drop(['ID','DEFAULT_PAYMT'],axis = 1)
X = featurelist
y = df['DEFAULT_PAYMT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
sc = StandardScaler()
X_scaled_tr = X_train.select_dtypes(include=['int64'])
X_scaled_t = X_test.select_dtypes(include=['int64'])
X_scaled_tr = sc.fit_transform(X_scaled_tr)
X_scaled_t = sc.transform(X_scaled_t)
I managed to address the question with the following code, where the StandardScaler is applied only to the continuous variables and NOT the one-hot-encoded variables:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('scale', StandardScaler(), ['Limit', 'Bill_Sep', 'Bill_Aug', 'Payment_Sep', 'Payment_Aug'])],
    remainder='passthrough')

X_train_scaled = ct.fit_transform(X_train)
X_test_scaled = ct.transform(X_test)
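If you do not want to list the continuous columns by hand, you can also select them by dtype (a sketch, assuming scikit-learn >= 0.22 for make_column_selector and >= 1.0 for get_feature_names_out; the uint8 one-hot columns pass through unscaled):
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer(
    [('scale', StandardScaler(), make_column_selector(dtype_include='int64'))],
    remainder='passthrough')

X_train_scaled = ct.fit_transform(X_train)
X_test_scaled = ct.transform(X_test)

# Optionally rebuild DataFrames with column names (scaled columns first, then the
# passthrough columns, in the order ColumnTransformer emits them)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=ct.get_feature_names_out(), index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=ct.get_feature_names_out(), index=X_test.index)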
