N-Grams to array


For my thesis I am working on a machine learning project in Python that includes feature extraction from text. As a start, I am trying to implement bi-grams using scikit-learn.
Right now, when I process my data through CountVectorizer, I get an array of mostly 1's, with an occasional higher count. E.g.:
`[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]`
I want to use these bi-grams to predict my target variable, which is categorical.
When I now execute my code, Python reports that the shapes of my two arrays are not identical.
`[[1 3 2 ..., 1 1 1]] [ 0. 0. 1. 0. 0.]`
Can someone tell me what I am doing wrong? I am using the code below for the bi-grams. The first part runs inside a loop over every text (film plot) in the dataset.
plottext = [ row[8] ]
wordvec = CountVectorizer(ngram_range=(2,2), analyzer='word')
plotvec = wordvec.fit_transform(plottext).toarray()
matrix_terms = np.array(wordvec.get_feature_names())
matrix_freq = np.asarray(plotvec.sum(axis=0)).ravel()
final_matrix = np.array([matrix_terms,matrix_freq])
target = { 'Age': row[4] }
data.append((final_matrix, target))
# Convert categorical target variable to Y
(X, Ycat) = zip(*data)
vec = DictVectorizer(sparse=False)
Y = vec.fit_transform(Ycat)
#Extract textual features from plot
return (X, Y)
The error message I get:
ValueError: could not broadcast input array from shape (2,830) into shape (2)
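A likely cause, judging only from the code shown, is that a new CountVectorizer is fitted for every single plot. Each plot then gets its own bigram vocabulary (so almost every count is 1) and its own array width, which cannot be stacked into one feature matrix. A minimal sketch of fitting one vectorizer on all plots at once follows. The `plots` and `ages` lists are hypothetical stand-ins for `row[8]` and `row[4]`, not data from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the film plots (row[8]) and the target (row[4]).
plots = [
    "a young wizard goes to school",
    "a young detective goes to prison",
    "the wizard goes to war",
]
ages = ["6", "16", "12"]

# Fit ONE vectorizer on all plots so every row shares the same bigram vocabulary.
vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="word")
X = vectorizer.fit_transform(plots).toarray()
y = np.array(ages)

# X has one row per plot and one column per distinct bigram,
# so its first dimension matches the length of y.
print(X.shape, y.shape)

# The bigram "goes to" occurs once in each of the three plots.
goes_to_column = vectorizer.vocabulary_["goes to"]
print(X[:, goes_to_column])
```

With this layout the categorical target can be encoded separately, and X and y share the same number of rows.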

Related

Stratified Sampling in Python without scikit-learn

I have a vector which contains 10 values of sample 1 and 25 values of sample 2.
Fact = np.array((2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2))
I want to create a stratified output vector where :
sample 1 is divided in 80% : 8 values of 1 and 20% : 2 values of 0.
sample 2 is divided in 80% : 20 values of 1 and 20% : 5 values of 0.
The expected output will be :
Output = np.array((0,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,0,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1))
How can I automate this? I can't use the sampling functions from scikit-learn because this is not for a machine learning experiment.

Here is one way to get your desired result, with reproducibility of output added. We draw random index values for each of the two groups from the input (fact) array, without replacement. Then, we create a new output array where we assign 1's in locations corresponding to the drawn index values and assign 0's everywhere else.
import numpy as np
from numpy.random import RandomState
rng = RandomState(123)
fact = np.array(
(2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2),
dtype='int8'
)
idx_arr = np.hstack(
(
rng.choice(np.argwhere(fact == 1).flatten(), 8, replace=False),
rng.choice(np.argwhere(fact == 2).flatten(), 20, replace=False),
)
)
out = np.zeros_like(fact, dtype='int8')
np.put(out, idx_arr, 1)
print(out)
# [0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1]
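The same idea can be wrapped in a function that handles any number of groups, so the counts 8 and 20 do not have to be typed by hand. This is only a sketch: the function name `stratified_mask` and its `frac` and `seed` parameters are made up for illustration, and it uses NumPy's newer `default_rng` generator, so the drawn positions will differ from the output above.

```python
import numpy as np

def stratified_mask(labels, frac=0.8, seed=123):
    """Return a 0/1 array in which `frac` of each label group is set to 1."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    out = np.zeros(labels.shape[0], dtype="int8")
    for value in np.unique(labels):
        # Positions of the current group in the input array.
        idx = np.flatnonzero(labels == value)
        n_keep = int(round(frac * idx.size))
        # Draw positions without replacement and mark them with 1.
        out[rng.choice(idx, size=n_keep, replace=False)] = 1
    return out

fact = np.array(
    (2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2)
)
out = stratified_mask(fact)

print(out[fact == 1].sum(), "of", (fact == 1).sum())  # ones within sample 1
print(out[fact == 2].sum(), "of", (fact == 2).sum())  # ones within sample 2
```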

Should the background dataset for shap be standardized?

So I am trying to explain a basic SVM model using SHAP. The inputs to the SVM model, however, are standardized (I called StandardScaler().fit() and then transformed the data points with that scaler so that they can be used with the SVM model).
My question is: when using SHAP I need to give it a background distribution. Usually the input to this background distribution looks like this:
background_distribution = KMeans(n_clusters=10,random_state=0).fit(xtrain).cluster_centers_
However, I want to use my own custom background distribution, which contains selected data points. Does this mean those data points need to be standardized as well? I.e., instead of looking like
[ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
they look like this
[ 0.67028006 -0.18887347 0.90860212 -0.41342579 0.26204266 0.55080012
-0.85479154 0.13743146 -0.70749448 -0.42919754 1.21628074 -0.71418983
-0.26726124 -0.52247913 -0.34755864 0.31234752 -0.23208655 -0.63565412
-0.40904178 0. 4.89897949 -0.23473314 0.64082627 -0.46852129
-0.26726124 -0.44542354 1.15657353 0.53795751]
For clarity: I am asking whether, after retrieving my points, I need to standardize the background data set, since the data points used to train the model are scaled but my background distribution contains unscaled data points.
The model training looks like this:
ss = StandardScaler().fit(X)
xtrain = ss.transform(xtrain) #Changes values to make them ML compatible -not needed for trees
xtest = ss.transform(xtest)
support_vector_classifier = SVC(kernel='rbf')
support_vector_classifier.fit(xtrain,ytrain)
y_pred_svc = support_vector_classifier.predict(xtest)
Option A:
background_distribution= [ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
shap.KernelExplainer(support_vector_classifier.predict,background_distribution)
Option B:
background_distribution= [ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
shap.KernelExplainer(support_vector_classifier.predict,background_distribution)

Option B is close. Your background should be preprocessed in the same way as your training data.
This holds whenever you preprocess data in ML: whether you split your data into train, test and validation sets or feed new data to a trained model for prediction, you always apply the same transformations to all parts of your data, sometimes manually, sometimes through a pipeline. SHAP is no exception to this principle.
However, keep the following in mind as well: your scaler should be fit on the training data before it is applied to test or background data. You can't fit it on test, validation or background data, because that would be like asking to see the future before predicting it ("data leakage", as it is called in ML).
This means you can't do:
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
Rather:
ss = StandardScaler().fit(X_train)
background_distribution = ss.transform(background_distribution)
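To make this concrete, here is a small self-contained sketch with synthetic data (the arrays and the names `background_raw` and `background_scaled` are made up for illustration). It fits the scaler on the training data only, reuses it for the background points, and checks that the result equals standardizing with the training mean and standard deviation. The SHAP call itself is left as a comment, since it simply receives the scaled background.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic training data on an arbitrary, unscaled range (illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_train = (X_train[:, 0] > 50.0).astype(int)

# Fit the scaler on the training data and train the SVM on scaled inputs.
scaler = StandardScaler().fit(X_train)
model = SVC(kernel="rbf").fit(scaler.transform(X_train), y_train)

# Hand-picked, unscaled background points: reuse the SAME fitted scaler.
background_raw = X_train[:5]
background_scaled = scaler.transform(background_raw)

# The model only ever sees scaled inputs, so this is what the explainer gets:
# shap.KernelExplainer(model.predict, background_scaled)
preds = model.predict(background_scaled)
print(background_scaled.round(2))
print(preds)
```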

Getting "Perfect separation detected, results not available" while building the Logistic Regression model

As part of my assignment I am building a logistic regression model, but I am getting the error "Perfect separation detected, results not available" while building it.
**X_train :-**
year amt_spnt rank
1 -1.723034 -0.418500 0.272727
2 0.716660 2.088507 -0.636364
3 1.174102 -0.558333 -1.545455
4 -0.503187 -1.297451 1.181818
5 1.326583 -0.628250 -1.545455
**y_train :-**
1 0
2 1
3 1
4 0
5 1
Name: result, dtype: int64
**Logistic Model code:-**
import statsmodels.api as sm
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()
**Dataset before and after scaling**
**Image for evidence:-**
[![Evidence][1]][1]
[1]: https://i.stack.imgur.com/cTncA.png

This is a model setting issue: because of the perfect separation, your model cannot converge. Perfect separation means that one (or more) of your independent variables perfectly distinguishes dependent variable = 0 from dependent variable = 1. See the following example:
Y 0 0 0 0 0 0 1 1 1 1
X 1 2 3 4 4 4 5 6 7 8
If X <= 4, Y = 0
If X > 4, Y = 1
The short answer to your question is to find such a variable among your independent variables and remove it from your model.
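One simple way to look for such a variable is to check, column by column, whether the values for y = 0 and y = 1 overlap. The helper below is a sketch with a made-up name (`separating_columns`) and it only catches separation by a single variable, not by a combination of variables. It is run on the five rows shown in the question.

```python
import numpy as np

def separating_columns(X, y):
    """Return indices of columns whose values for y == 0 and y == 1 do not overlap."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    found = []
    for j in range(X.shape[1]):
        x0, x1 = X[y == 0, j], X[y == 1, j]
        # No overlap means one class lies entirely below the other.
        if x0.max() < x1.min() or x1.max() < x0.min():
            found.append(j)
    return found

columns = ["year", "amt_spnt", "rank"]
X_train = np.array([
    [-1.723034, -0.418500,  0.272727],
    [ 0.716660,  2.088507, -0.636364],
    [ 1.174102, -0.558333, -1.545455],
    [-0.503187, -1.297451,  1.181818],
    [ 1.326583, -0.628250, -1.545455],
])
y_train = np.array([0, 1, 1, 0, 1])

separating = [columns[j] for j in separating_columns(X_train, y_train)]
print(separating)  # ['year', 'rank']
```

On these five rows both year and rank split the two classes completely, while amt_spnt does not.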

One-class Classification

I have more than 2500 samples on which static analysis has been performed, with more than 300 features extracted per sample.
Among these samples, I have identified more than 10 APT classes, and my aim is to build a one-class classifier for each class.
I'm using Python's scikit-learn library for machine learning, and in particular I'm working with the One-class SVM.
First question: Are there other good one-class classifiers for this approach?
Second question: I have to come up with some metric that can define a sort of "accuracy" for the classifier. I know that for a one-class SVM the concept of accuracy is not so well defined. Here are my code and my idea:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
df = pd.read_csv('features_labeled_apt17.csv')
X = df.iloc[:, 1:341].values  # .ix was removed in pandas 1.0; .iloc selects columns by position
X_train, X_test = train_test_split(X,test_size = 0.3,random_state = 42)
clf = svm.OneClassSVM(nu=0.1,kernel = "linear", gamma =0.1)
y_score = clf.fit(X_train)
pred = clf.predict(X_test)
print(pred)
This is the output of the code:
[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 -1 1 1 1
1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1]
The 1 of course represents the correctly labeled samples, while the -1 represents the wrong ones.
First: do you think this can be a good approach?
Second: As a metric, could I divide the total number of elements in the test set by the number of wrongly labeled ones?

As far as I understand machine learning algorithms, your use case is not a good fit for a one-class SVM classifier.
Normally, a one-class SVM is used for unsupervised outlier detection problems. Refer to this page to see an implementation of a one-class SVM for detecting outliers.
If you show your data frame, I will try to find another approach to solve your problem.
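On the metric itself: scikit-learn's OneClassSVM.predict returns +1 for points it treats as inliers and -1 for points it treats as outliers. If the test set contains only samples of the class the model was fitted on, the share of +1 predictions tells you how many of those samples were accepted. A minimal sketch with a made-up prediction array follows. Note that this number says nothing about how well the model rejects samples from the other classes.

```python
import numpy as np

# Hypothetical output of clf.predict(X_test); not the array from the question.
pred = np.array([1, 1, -1, 1, 1, 1, -1, 1, 1, 1])

# Share of held-out same-class samples accepted as inliers (+1)
# and share flagged as outliers (-1).
accepted = np.mean(pred == 1)
rejected = np.mean(pred == -1)

print(f"accepted: {accepted:.2f}, rejected: {rejected:.2f}")
```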

Get most informative features from very simple scikit-learn SVM classifier

I tried to build a very simple SVM predictor that I would understand with my basic Python knowledge. As my code looks so different from this question and also this question, I don't know how I can find the most important features for SVM prediction in my example.
I have the following 'sample' containing features and class (status):
A B C D E F status
1 5 2 5 1 3 1
1 2 3 2 2 1 0
3 4 2 3 5 1 1
1 2 2 1 1 4 0
I saved the feature names as 'features':
A B C D E F
The features 'X':
1 5 2 5 1 3
1 2 3 2 2 1
3 4 2 3 5 1
1 2 2 1 1 4
And the status 'y':
1
0
1
0
Then I build X and y arrays out of the sample, train & test on half of the sample and count the correct predictions.
import pandas as pd
import numpy as np
from sklearn import svm, preprocessing
X = np.array(sample[features].values)
X = preprocessing.scale(X)
X = np.array(X)
y = sample['status'].values.tolist()
y = np.array(y)
test_size = int(X.shape[0]/2)
clf = svm.SVC(kernel="linear", C= 1)
clf.fit(X[:-test_size],y[:-test_size])
correct_count = 0
for x in range(1, test_size+1):
    if clf.predict(X[-x].reshape(-1, len(features)))[0] == y[-x]:
        correct_count += 1
accuracy = (float(correct_count)/test_size) * 100.00
My problem now is that I have no idea how I could apply the code from the questions above so that I could also see which features are the most important.
I would be grateful if you could tell me whether that's even possible for my simple version. And if so, any tips on how to do it would be great.

Out of all feature sets, the set of variables that produces the lowest value for the squared norm of the vector should be chosen as the variables of high importance, in order.
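For a linear kernel, a simpler and commonly used shortcut is to rank the features by the squared weights that the fitted SVM assigns to them. This is my assumption about what is practical here, not necessarily the exact criterion described above. The sketch below uses the four sample rows from the question and fits on all of them, since the point is only to show where the weights come from.

```python
import numpy as np
from sklearn import preprocessing, svm

features = ["A", "B", "C", "D", "E", "F"]
X = np.array([
    [1, 5, 2, 5, 1, 3],
    [1, 2, 3, 2, 2, 1],
    [3, 4, 2, 3, 5, 1],
    [1, 2, 2, 1, 1, 4],
], dtype=float)
y = np.array([1, 0, 1, 0])

# Scale the features so that the weights are comparable across columns.
X_scaled = preprocessing.scale(X)
clf = svm.SVC(kernel="linear", C=1).fit(X_scaled, y)

# coef_ is only available for the linear kernel; one weight per feature.
weights = clf.coef_.ravel()
scores = weights ** 2

# Features sorted from largest to smallest squared weight.
ranking = sorted(zip(features, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

With only four rows the ranking is not meaningful in itself, and for non-linear kernels `coef_` is not available, so this shortcut does not apply there.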
