Generating adversarial data from cleverhans attack models - python

I want a code example of how to generate training data from CleverHans' adversarial attacks.
adv_x = fgsm.generate_np(X_test, **fgsm_params)
This generates adversarial x data but how can I get y?
adv_pred = model.predict_classes(adv_x)
And this will give the "fooled" results right?
What I want is to correctly show the generated x, y, and fooled y (by which I mean the model's predictions, which may be wrong because of the attack). I'm using MNIST, by the way, if it helps.

Based on the code snippets you shared, I would make two suggestions:
It is generally not a good idea to train the model on test data (if you are going to use that test data to evaluate its performance afterwards), so I would replace X_test with X_train in your first line.
To get the labels for your adversarial examples, you can use the original labels of the training data, or the predictions of the model on the original training data, model.predict_classes(X_train) (this assumes that the adversarial examples are not perturbed enough to change the true label of the input).
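Putting those two suggestions together, a minimal sketch could look like the following (it reuses the fgsm, fgsm_params and Keras model objects from your snippets; X_train and y_train are assumed to be the clean MNIST training arrays with integer class labels, so adapt if yours are one-hot encoded):
import numpy as np

# Adversarial inputs crafted from the training set instead of the test set
adv_x = fgsm.generate_np(X_train, **fgsm_params)

# "Correct" labels: reuse the clean labels, assuming the perturbation is too
# small to change the true class of each image
adv_y = y_train

# "Fooled" labels: what the model actually predicts on the adversarial inputs
fooled_y = model.predict_classes(adv_x)

# Fraction of examples the attack actually flipped
print("Attack success rate: {:.2%}".format(np.mean(adv_y != fooled_y)))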


CountVectorizer test data loss?

I am working on a small project that requires using different classification models on BoW data.
I understand that the train and test data must be different to get the model's true accuracy.
For model.score() to work correctly, I need to give it test data and labels in the same dimensions as the initial data. But the test data is in different dimensions, so I do it like this:
vectorizer = CountVectorizer()
traindata_bow = vectorizer.fit_transform(traindata)
testdata_bow = vectorizer.transform(testdata)
Now, the test data has the same dimensions as the initial train data.
Now on to my question:
Test data has its own set of dimensions/"characteristics".
So by transforming it using the vectorizer, aren't we losing some of the test data's characteristics?
I am asking because my model's accuracy ends up being in the 99.9% range, and I worry something is being calculated incorrectly (though my dataset is quite easy).
For example, after the code above:
traindata_bow.shape is (35918, 34319) and
testdata_bow.shape is (8980, 34319)
But if I run:
testdata_bow = vectorizer.fit_transform(testdata), I get
testdata_bow.shape is (8980, 20806)
So is there any data loss (or even partial merging with the train data) in the transform stage?
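For reference, here is a toy sketch of the mechanics behind the shape difference (made-up two-document corpus; the feature-name accessor is get_feature_names() on older scikit-learn versions):
from sklearn.feature_extraction.text import CountVectorizer

# transform() only counts tokens that are in the vocabulary learned by fit_transform()
vectorizer = CountVectorizer()
train_bow = vectorizer.fit_transform(["red apple", "green apple"])
test_bow = vectorizer.transform(["purple apple"])

print(vectorizer.get_feature_names_out())  # ['apple' 'green' 'red']
print(test_bow.toarray())                  # [[1 0 0]] -> 'purple' has no column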

Splitting train test sets for Node2vec link prediction in Stellargraph

I'm trying to understand how to use Stellargraph's EdgeSplitter class. In particular, the example in the documentation for training a link prediction model based on Node2Vec splits the graph into the following parts:
Distribution of samples across train, validation and test sets
Following the examples on the documentation, first you sample 10% of the links of the full graph in order to obtain the test set:
# Define an edge splitter on the original graph:
edge_splitter_test = EdgeSplitter(graph)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from graph, and obtain the
# reduced graph graph_test with the sampled links removed:
graph_test, examples_test, labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global"
)
As far as I understand from the docs, graph_test is the original graph but with the test links removed. Then you perform the same operation with the training set,
# Do the same process to compute a training subset from within the test graph
edge_splitter_train = EdgeSplitter(graph_test)
graph_train, examples, labels = edge_splitter_train.train_test_split(
    p=0.1, method="global"
)
Following the previous logic, graph_train corresponds to graph_test with the training links removed.
Further down the code, my understanding is that we use graph_train to train the embedding and the training samples (examples, labels) to train the classifier. So I have several questions here:
Why are we using disjoint sets of training data to train different parts of the model? Shouldn't we train both the embedding and the classifier with the full training set of links?
Why is the test set so big? Wouldn't it be better to have most samples in the training set?
What is the correct way of using the EdgeSplitter class?
Thank you in advance for your help!
Why disjoint sets:
This may or may not matter depending on the embedding algorithm.
The risk with edges that are both seen by the embedding algorithm and used as targets by the classifier is that the embedding algorithm may encode non-generalizable features.
For example, theoretically one feature of the embedding could be the node id, and other features could encode the entire neighborhood of the node. By combining two nodes' embeddings into a link vector in some contrived way, or by using a multilayer model, one could therefore create a binary feature which is 1 if the two nodes were connected during embedding training and 0 otherwise.
In this case the classifier would perhaps just learn to use this trivial feature which is not present (i.e. has value 0) when you go to the test data.
The above would not happen in a real scenario, but more subtle features could have the same effect to a lesser degree.
In the end, this only risks degrading model selection.
That is, the first split is to make the test reliable. The second split is to improve model selection. You can therefore omit the second split if you wish.
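To make the roles of the two splits concrete, here is an annotated sketch of the same calls from the question (the import path follows StellarGraph's documented layout; no API beyond what your snippets show is assumed):
from stellargraph.data import EdgeSplitter

# First split: hold out edges that neither the embedding nor the classifier
# ever sees, so the final test score is reliable.
edge_splitter_test = EdgeSplitter(graph)
graph_test, examples_test, labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global"
)

# Second split (optional, for model selection): hold out the classifier's target
# edges from the graph the embedding is trained on, so the classifier cannot
# exploit "was this edge present during embedding training" as a feature.
edge_splitter_train = EdgeSplitter(graph_test)
graph_train, examples_train, labels_train = edge_splitter_train.train_test_split(
    p=0.1, method="global"
)

# graph_train                    -> train the Node2Vec embedding
# examples_train, labels_train   -> train the link classifier
# examples_test, labels_test     -> final evaluation only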
Why test set so big:
You are likely to get a higher score with a bigger training set. As long as the experiment is repeated with different splits and the variance is under control, it should be fine to increase the training set size.
What is the correct way to use EdgeSplitter:
I don't know what 'correct' means here. I think graph splitting is still an active research field.

Scikit correct way to calibrate classifiers with CalibratedClassifierCV

Scikit has CalibratedClassifierCV, which allows us to calibrate our models on a particular X, y pair. It also states clearly that data for fitting the classifier and for calibrating it must be disjoint.
If they must be disjoint, is it legitimate to train the classifier with the following?
model = CalibratedClassifierCV(my_classifier)
model.fit(X_train, y_train)
I fear that by using the same training set I'm breaking the disjoint data rule. An alternative might be to have a validation set
my_classifier.fit(X_train, y_train)
model = CalibratedClassifierCV(my_classifier, cv='prefit')
model.fit(X_valid, y_valid)
This has the disadvantage of leaving less data for training. Also, if CalibratedClassifierCV should only be fit on models fit on a different training set, why would its default option be cv=3, which will also fit the base estimator? Does the cross-validation handle the disjoint rule on its own?
Question: what is the correct way to use CalibratedClassifierCV?
I already answered this on Cross Validated for the exact same question. I'm leaving it here anyway since it is not clear to me whether this question belongs here or on Cross Validated.
--- Original answer ---
There are two things mentioned in the CalibratedClassifierCV docs that hint towards the ways it can be used:
base_estimator: If cv=prefit, the classifier must have been fit already on data.
cv: If “prefit” is passed, it is assumed that base_estimator has been fitted already and all data is used for calibration.
I may obviously be interpreting this wrong, but it appears you can use the CCCV (short for CalibratedClassifierCV) in two ways:
Number one:
You train your model as usual, your_model.fit(X_train, y_train).
Then, you create your CCCV instance, your_cccv = CalibratedClassifierCV(your_model, cv='prefit'). Notice you set cv to flag that your model has already been fit.
Finally, you call your_cccv.fit(X_validation, y_validation). This validation data is used solely for calibration purposes.
Number two:
You have a new, untrained model.
Then you create your_cccv=CalibratedClassifierCV(your_untrained_model, cv=3). Notice cv is now the number of folds.
Finally, you call cccv_instance.fit(X, y). Because your model is untrained, X and y have to be used for both training and calibration. The way to ensure the data is 'disjoint' is cross validation: for any given fold, CCCV will split X and y into your training and calibration data, so they do not overlap.
TLDR: Method one allows you to control what is used for training and for calibration. Method two uses cross validation to try and make the most out of your data for both purposes.
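For completeness, here is a runnable sketch of both methods on synthetic data (depending on your scikit-learn version the first argument is named base_estimator or estimator, so it is passed positionally here; cv="prefit" follows the docs quoted above):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# Method one: prefit model, then calibrate on a held-out validation set
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
calibrated_prefit = CalibratedClassifierCV(clf, cv="prefit")
calibrated_prefit.fit(X_valid, y_valid)

# Method two: untrained model, cross-validation keeps training and calibration folds disjoint
calibrated_cv = CalibratedClassifierCV(RandomForestClassifier(random_state=0), cv=3)
calibrated_cv.fit(X_train, y_train)

print(calibrated_prefit.predict_proba(X_valid[:3]))
print(calibrated_cv.predict_proba(X_valid[:3]))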

Retrieve list of training features names from classifier

Is there a way to retrieve the list of feature names used for training of a classifier, once it has been trained with the fit method? I would like to get this information before applying to unseen data.
The data used for training is a pandas DataFrame and in my case, the classifier is a RandomForestClassifier.
I have a solution which works but is not very elegant. This is an old post with no existing solutions, so I suppose there aren't any.
Create and fit your model. For example
model = GradientBoostingRegressor(**params)
model.fit(X_train, y_train)
Then you can add an attribute which is the 'feature_names' since you know them at training time
model.feature_names = list(X_train.columns.values)
I typically then put the model into a binary file to pass it around but you can ignore this
joblib.dump(model, filename)
loaded_model = joblib.load(filename)
Then you can get the feature names back from the model to use them when you predict
f_names = loaded_model.feature_names
loaded_model.predict(X_pred[f_names])
Based on the documentation and previous experience, there is no way to get a list of the features considered in at least one of the splits.
Is your concern that you do not want to use all your features for prediction, just the ones actually used for training? In that case I suggest listing the feature_importances_ after fitting and eliminating the features that do not seem relevant. Then train a new model with only the relevant features and use those features for prediction as well.
You don't need to know which features were selected for the training. Just make sure to give, during the prediction step, to the fitted classifier the same features you used during the learning phase.
The Random Forest Classifier will only use the features on which it makes its splits. Those will be the same as those learnt during the first phase. Others won't be considered.
If the shape of your test data is not the same as the training data, it will throw an error, even if the test data contains all the features used for the splits of your decision trees.
What's more, since Random Forests make a random selection of features for your decision trees (called estimators in sklearn), all the features are likely to be used at least once.
However, if you want to know the features used, you can just call the attributes n_features_ and feature_importances_ on your classifier once fitted.
You can look here to see how you can retrieve the names of the most important features you used.
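As a small sketch of that last point, you can pair feature_importances_ with the DataFrame's column names (assuming, as in the question, that clf is a fitted RandomForestClassifier and X_train is the pandas DataFrame it was fitted on):
import pandas as pd

# Map each column name to its importance and show the most influential features
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))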
You can extract feature names from a trained XGBOOST model as follows:
model.get_booster().feature_names

How to get ordered list of labels after fitting sklearn

train_index, test_index = next(iter(ShuffleSplit(821, train_size=0.2, test_size=0.80, random_state=42)))
print train_index, len(train_index)
print test_index, len(test_index)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, train_size=0.33, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test, labels_test)
print pred, len(pred)
A few questions from this code:
Why do I need the cross_validation.train_test_split line in order to fit and predict with my classifier? (I am not doing any preprocessing on my data except for stopword removal, which I have already done.)
Do the test and train indexes correspond to the classified & predicted labels? My goal is to get all my labels, in their original order, after fitting and predicting them. My features and labels used for training and testing are from a pandas dataframe (two columns), and I need the predicted labels, in order, so that I can feed them back into the pandas dataframe.
Is there a way to predict the labels for the whole set, and not just the test set?
tl;dr
Because your decision tree classifier has to be trained before it can predict anything. It's not a magic algorithm. It has to be shown examples of what to do before it can work out what to do on other things.
cross_validation.train_test_split() facilitates this by splitting your data into a test and training dataset in such a way that you can analyse how well it performed later on. Without this, you have no way of assessing how well your decision tree classifier actually performed.
You can create your own testing and training data without train_test_split() (and I suspect that was what you were trying to do with ShuffleSplit()), but you will need at least some training data.
test_index and train_index have nothing to do with your data. Full stop. They come from a randomly generated process that is completely unrelated to what train_test_split() does.
The purpose of ShuffleSplit() is to give you the indices to partition your data into training and test sets yourself. train_test_split() will instead choose its own indices and partition based on those indices. You should use one or the other, and use it sensibly.
Yes. You can always just call
pred = clf.predict(features) or pred = clf.predict(np.concatenate([features_test, features_train]))
The Full Story
You need cross-validation if you want to do this right. The whole purpose of cross-validation is to avoid overfitting.
Basically, if you evaluate your model on both the training and the testing data, then your model is going to perform really well on the training set (because, well, that's what you trained it on), and that's going to skew your overall metrics of how well your model will perform on real data.
It's a lot like asking a student to perform in an exam and then in real life: if you want to know whether your student learned from the process of preparing for an exam, you don't give him another exam, you ask him to demonstrate his skills in the real world dealing with unknown and complex data.
If you want to know if your model will be useful, then you want to cross-validate. Wikipedia puts it best:
In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset).
The goal of cross validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.
cross_validation.train_test_split doesn't do anything except split the dataset into training and testing data for you.
But perhaps you don't care about metrics, and that's fine. The question then becomes: is it possible to run a decision tree classifier without a training dataset?
The answer is no. Decision tree classifiers are supervised algorithms: they need to be trained on data before they can generalise their model to new results. If you don't give them any data to train on, they will be unable to do anything with any data you feed into predict.
Finally, while it is perfectly possible to get the labels for the whole set (see tl;dr), it is a really bad idea if you actually care about whether or not you're getting sensible results.
You already have the labels for the testing and training data. You don't need another column with predictions on the training data, because they'll either come out identical to the labels or close enough to identical.
I can't think of a single meaningful reason to get back predicted results for your training data short of trying to optimise how it's performing on your training data. If that's what you are trying to do, then do that. What you are doing right now is definitely not that, and I encourage you to think strongly about what your reasons are for blindly inserting numbers into your table without due cause to believe they actually mean something.
There are ways to improve this: get back an accuracy metric, for example, or try k-fold cross-validation to estimate accuracy, or look at log-loss or AUC or any number of other metrics to gauge whether or not your model is performing well.
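For example, a quick way to get a cross-validated accuracy estimate (a sketch using the modern sklearn.model_selection API; features and labels are the full arrays from your question):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 5-fold cross-validation: fit on 4 folds, score on the held-out fold, repeat
clf = DecisionTreeClassifier()
scores = cross_val_score(clf, features, labels, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())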
Using both ShuffleSplit and train_test_split is redundant. You do not even appear to be using the indices returned by ShuffleSplit.
An example of how to use indices (such as those returned by ShuffleSplit) to split the data is below. X and y are np.array. X is number of instances by number of features. y contains the label of each row.
train_inds, test_inds = train_test_split(range(len(y)), test_size=0.33, random_state=42)
X_train, y_train = X[train_inds], y[train_inds]
X_test, y_test = X[test_inds], y[test_inds]
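If the goal is the one stated in the question, feeding the test predictions back into the original pandas dataframe in row order, a sketch along these lines works (df is a hypothetical name for that original two-column dataframe; clf, X, y, train_inds and test_inds are as above):
import pandas as pd

clf.fit(X[train_inds], y[train_inds])
# Align predictions with the original row positions; rows used for training stay NaN
df["predicted"] = pd.Series(clf.predict(X[test_inds]), index=df.index[test_inds])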
You should not test on your training data! But if you want to see what happens just do
pred = clf.predict(features_train)
Also you do not need to pass the labels to predict. You should be using
score = metrics.accuracy_score(y_test, pred)
