One-class Classification - python

I have more than 2500 samples on which static analysis has been performed, with more than 300 features extracted per sample.
Among these samples, I have discriminated more than 10 APT class and my aim is to build, for each class, a one-class classifier.
I'm using python scikit library for machine-learning, and in particular i'm facing with One-class SVM.
First question: There exist some other good one-class classifier for this approach?
Second question: I have to come up with some metrics that can define a sort of "accuracy" of the classifier. Now I know that for one-class SVM the accuracy concept is not so well-define. I report my code and my concept:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
df = pd.read_csv('features_labeled_apt17.csv')
X = df.ix[:,1:341].values
X_train, X_test = train_test_split(X,test_size = 0.3,random_state = 42)
clf = svm.OneClassSVM(nu=0.1,kernel = "linear", gamma =0.1)
y_score = clf.fit(X_train)
pred = clf.predict(X_test)
print(pred)
These represents the output of the code:
[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 -1 1 1 1
1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1]
The 1 represent of course the well-labeled sample, while the -1 represent the wrong one.
First: do you think this can be a good approach?
Second: For metrics, if I divide the total element in the testing set by the wrong labeled?

In my understanding in machine learning algorithms, your use case is not a good one to apply oneclass-SVM classifier.
Normally, oneclass-svm is used for Unsupervised Outlier Detection problems. Refer this page to see the implementation of oneclass-svm to detect outliers.
Just display your data-frame, I will find any new approach to solve your problem.

Related

Correlation between binary variables in pandas

I am trying to calculate the correlation between binary variables using Cramer's statistics:
def cramers_corrected_stat(confusion_matrix):
chi2 = ss.chi2_contingency(confusion_matrix)[0]
n = confusion_matrix.sum()
phi2 = chi2/n
r,k = confusion_matrix.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
However I do not know how to apply the code above within my dataset:
CL UP NS P CL_S
480 1 0 1 0 1
1232 1 0 1 0 1
2308 1 1 1 0 1
1590 1 0 1 0 1
497 1 1 0 0 1
... ... ... ... ... ...
1066 1 1 1 0 1
1817 1 0 1 0 1
2411 1 1 1 0 1
2149 1 0 1 0 1
1780 1 0 1 0 1
I would appreciate your help in guiding me
The function you made is not proper for your dataset.
So, use the follow function cramers_V(var1,var2) given as follows.
from scipy.stats import chi2_contingency
def cramers_V(var1,var2):
crosstab =np.array(pd.crosstab(var1,var2, rownames=None, colnames=None)) # Cross table building
stat = chi2_contingency(crosstab)[0] # Keeping of the test statistic of the Chi2 test
obs = np.sum(crosstab) # Number of observations
mini = min(crosstab.shape)-1 # Take the minimum value between the columns and the rows of the cross table
return (stat/(obs*mini))
example code using the function is as fllows.
cramers_V(df["CL"], df["NS"])
If you want to calculate all possible pairs of your dataset, use this code.
import itertools
for col1, col2 in itertools.combinations(df.columns, 2):
print(col1, col2, cramers_V(df[col1], df[col2]))

Getting "Perfect separation detected, results not available" while building the Logistic Regression model

As part of my assignment I am building logistic regression model but I am getting an error "Perfect separation detected, results not available" while building it.
**X_train :-**
year amt_spnt rank
1 -1.723034 -0.418500 0.272727
2 0.716660 2.088507 -0.636364
3 1.174102 -0.558333 -1.545455
4 -0.503187 -1.297451 1.181818
5 1.326583 -0.628250 -1.545455
**y_train :-**
1 0
2 1
3 1
4 0
5 1
Name: result, dtype: int64
**Logistic Model code:-**
import statsmodels.api as sm
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()
**Dataset before and after scaling**
**Image for evidence:-**
[![Evidence][1]][1]
[1]: https://i.stack.imgur.com/cTncA.png
This is a model setting issue, because of the perfect separation, your model can not converge. Perfect separation means there is one (or more) variable in your independent variables that can perfectly distinct dependent variable = 0 from dependent variable = 1. See the following example:
Y 0 0 0 0 0 0 1 1 1 1
X 1 2 3 4 4 4 5 6 7 8
If X <= 4, Y = 0
If X > 4, Y = 1
A short answer to your question is to find such variable in your independent variable and remove it from your model.

GridSearchCV for number of neurons

I am trying to learn by myself how to grid-search number of neurons in a basic multi-layered neural networks. I am using GridSearchCV and KerasClasifier of Python as well as Keras. The code below works for other data sets very well but I could not make it work for Iris dataset for some reasons and I cannot find it why, I am missing out something here. The result I get is:
Best: 0.000000 using {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 5}
from pandas import read_csv
import numpy
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from sklearn.model_selection import GridSearchCV
dataframe=read_csv("iris.csv", header=None)
dataset=dataframe.values
X=dataset[:,0:4].astype(float)
Y=dataset[:,4]
seed=7
numpy.random.seed(seed)
#encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
#one-hot encoding
dummy_y = np_utils.to_categorical(encoded_Y)
#scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)
def create_model(n_neurons=1):
#create model
model = Sequential()
model.add(Dense(n_neurons, input_dim=X.shape[1], activation='relu')) # hidden layer
model.add(Dense(3, activation='softmax')) # output layer
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, initial_epoch=0, verbose=0)
# define the grid search parameters
neurons=[3, 5]
#this does 3-fold classification. One can change k.
param_grid = dict(n_neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, dummy_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print("%f (%f) with: %r" % (mean, stdev, param))
For the purpose of illustration and computational efficiency I search only for two values. I sincerely apologize for asking such a simple question. I am new to Python, switched from R, by the way because I realized that Deep Learning community is using python.
Haha, this is probably the funniest thing I ever experienced on Stack Overflow :) Check:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=5)
and you should see different behavior. The reason why your model get a perfect score (in terms of cross_entropy having 0 is equivalent to best model possible) is that you haven't shuffled your data and because Iris consist of three balanced classes each of your feed had a single class like a target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 (first fold ends here) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (second fold ends here)2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
Such problems are really easy to be solved by every model - so that's why you've got a perfect match.
Try to shuffle your data before - this should result in an expected behavior.

Get most informative features from very simple scikit-learn SVM classifier

I tried to build the a very simple SVM predictor that I would understand with my basic python knowledge. As my code looks so different from this question and also this question I don't know how I can find the most important features for SVM prediction in my example.
I have the following 'sample' containing features and class (status):
A B C D E F status
1 5 2 5 1 3 1
1 2 3 2 2 1 0
3 4 2 3 5 1 1
1 2 2 1 1 4 0
I saved the feature names as 'features':
A B C D E F
The features 'X':
1 5 2 5 1 3
1 2 3 2 2 1
3 4 2 3 5 1
1 2 2 1 1 4
And the status 'y':
1
0
1
0
Then I build X and y arrays out of the sample, train & test on half of the sample and count the correct predictions.
import pandas as pd
import numpy as np
from sklearn import svm
X = np.array(sample[features].values)
X = preprocessing.scale(X)
X = np.array(X)
y = sample['status'].values.tolist()
y = np.array(y)
test_size = int(X.shape[0]/2)
clf = svm.SVC(kernel="linear", C= 1)
clf.fit(X[:-test_size],y[:-test_size])
correct_count = 0
for x in range(1, test_size+1):
if clf.predict(X[-x].reshape(-1, len(features)))[0] == y[-x]:
correct_count += 1
accuracy = (float(correct_count)/test_size) * 100.00
My problem is now, that I have no idea, how I could implement the code from the questions above so that I could also see, which ones are the most important features.
I would be grateful if you could tell me, if that's even possible for my simple version? And if yes, any tipps on how to do it would be great.
From all feature set, the set of variables which produces the lowest values for square of norm of vector must be chosen as variables of high importance in order

N-Grams to array

For my thesis i am working on a machine learning project using Python which includes feature extraction from text. As a start I am trying to implement bi-grams using sci-kit learn.
Right now, when i process my data trough Countvectorizer, I get an array of just 1's and sometimes a bit more. E.g.:
`[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]`
I want to use these bi-grams to predict my target variable, which is categorical.
When i now execute my code, Python returns that the shape of my two arrays are not identical.
`[[1 3 2 ..., 1 1 1]] [ 0. 0. 1. 0. 0.]`
Can someone tell me what i am doing wrong? I am using this command for the bi-grams. The first part is a loop for every text (film plot) in the dataset.
plottext = [ row[8] ]
wordvec = CountVectorizer(ngram_range=(2,2), analyzer='word')
plotvec = wordvec.fit_transform(plottext).toarray()
matrix_terms = np.array(wordvec.get_feature_names())
matrix_freq = np.asarray(plotvec.sum(axis=0)).ravel()
final_matrix = np.array([matrix_terms,matrix_freq])
target = { 'Age': row[4] }
data.append((final_matrix, target))
# Convert categorial target variable to Y
(X, Ycat) = zip(*data)
vec = DictVectorizer(sparse=False)
Y = vec.fit_transform(Ycat)
#Extract textual features from plot
return (X, Y)
The error message i get
ValueError: could not broadcast input array from shape (2,830) into shape (2)

Categories

Resources