naive bayes classifier dynamic training - python

Is it possible (and if so, how) to dynamically train a sklearn MultinomialNB classifier?
I would like to train (update) my spam classifier every time I feed an email into it.
I want this (which does not work):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    clf.fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
to give a similar result to this (which works OK):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
clf.fit(x_train, y_train)
preds = clf.predict(x_test)

Scikit-learn supports incremental learning for multiple algorithms, including MultinomialNB; check the docs here.
You'll need to use the method partial_fit() instead of fit(), so your example code would look like:
import numpy

x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    if i == 0:
        # the first call to partial_fit must be given the full set of classes
        clf.partial_fit([x_train[i]], [y_train[i]], classes=numpy.unique(y_train))
    else:
        clf.partial_fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
Edit: added the classes argument to partial_fit, as suggested by @BobWazowski
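As a side note, once the model has been seeded, new emails can be folded in one at a time, which matches the spam-filter use case in the question. A minimal sketch of that streaming pattern (update_with_email is a hypothetical helper name, and the emails are assumed to be already vectorized into count features):
import numpy
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
# seed the model; the first call must list every class that can ever appear
clf.partial_fit(x_train, y_train, classes=numpy.unique(y_train))

def update_with_email(clf, email_vector, label):
    # hypothetical helper: fold one newly labelled email into the model
    clf.partial_fit([email_vector], [label])
    return clf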

Related

How do I return the result of each cross validation prediction

I have a task that requires me to analyse a model, and I need the output predictions for each cross-validation step, along with the data that the cross-validation used in that step.
Here is my code:
results = cross_validate(MLPClassifier, X_train, y_train, cv=5, return_estimator=True)
Which did not work. Also,
results = cross_val_predict(MLPClassifier, X_train, y_train, cv=5)
Neither worked; however, the second method gave me a set of predictions with the same shape as y_train (the labels). I expected a smaller array to be returned, say 10% the size of y_train.
I'm also unsure how to obtain the data used for each cross-validation step.
How about using one of the Cross Validation iterators?
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)

datasets = {}  # holds the (X_train, y_train) and (X_test, y_test) pair for each split
results = {}
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for idx, (train_index, test_index) in enumerate(ss.split(X)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    datasets[f"train_{idx}"] = X_train, y_train
    datasets[f"test_{idx}"] = X_test, y_test
    model = MLPClassifier(random_state=0).fit(X_train, y_train)
    results[f"accuracy_{idx}"] = model.score(X_test, y_test)
results
Output:
{'accuracy_0': 0.968,
 'accuracy_1': 0.924,
 'accuracy_2': 0.94,
 'accuracy_3': 0.944,
 'accuracy_4': 0.964}
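As a side note on the original attempts: both calls likely failed because MLPClassifier was passed as a class rather than an instance. With an instance, cross_val_predict does work, and its output is intentionally the same shape as y_train: with a k-fold splitter, every sample lands in a test fold exactly once, so every sample gets exactly one out-of-fold prediction. A minimal sketch, reusing the X, y from above:
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neural_network import MLPClassifier

cv = KFold(n_splits=5, shuffle=True, random_state=0)
# note the parentheses: pass an estimator instance, not the class
preds = cross_val_predict(MLPClassifier(random_state=0), X, y, cv=cv)
assert preds.shape == y.shape  # one out-of-fold prediction per sample

# the data used in each step is recoverable from the same cv object
for idx, (train_index, test_index) in enumerate(cv.split(X)):
    print(f"fold {idx}: {len(train_index)} train / {len(test_index)} test samples")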

prediction scores on test vs train dataset different with/without gridsearchcv

I am exploring the use of GridSearchCV from sklearn to predict data. After fitting the data with a RandomForestRegressor, I calculate the score (MSE) for the test and the train data. I can see there is a huge difference between the train MSE and the test MSE, even though I would expect the scores to be similar.
Here is the code:
# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# create regressor pipeline
pipeline_estimators = Pipeline([
    ('RandomForest', RandomForestRegressor()),
])
param_grid = [{'RandomForest__n_estimators': np.linspace(50, 100, 3).astype(int)}]
search = GridSearchCV(estimator=pipeline_estimators,
                      param_grid=param_grid,
                      scoring='neg_mean_squared_error',
                      cv=2)
search.fit(X_train, y_train)
y_test_predicted = search.best_estimator_.predict(X_test)
y_train_predicted = search.best_estimator_.predict(X_train)
print('MSE test predict', metrics.mean_squared_error(y_test, y_test_predicted))
print('MSE train predict', metrics.mean_squared_error(y_train, y_train_predicted))
The output is:
MSE test predict 0.0021045875412650343
MSE train predict 0.000332850878980335
If I don't use GridSearchCV but a for loop over the different 'n_estimators' values, the MSE scores obtained for the test and the train predictions are very close.
To add more detail on the for-loop approach, see the code below:
n_estimators = np.linspace(50, 100, 3).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
mse_train = []
mse_test = []
for val_n_estimators in n_estimators:
    regressor = RandomForestRegressor(n_estimators=val_n_estimators)
    regressor.fit(X, y)
    y_pred = regressor.predict(X_test)
    y_test_predicted = regressor.predict(X_test)
    y_train_predicted = regressor.predict(X_train)
    mse_train.append(metrics.mean_squared_error(y_train, y_train_predicted))
    mse_test.append(metrics.mean_squared_error(y_test, y_test_predicted))
For this code, mse_train and mse_test are very similar. But using GridSearchCV (see the code at the top of the post), they are not.
Any suggestions? Why is there such a score difference when using GridSearchCV?
Thank you.
Marc

Repeated holdout method

How can I implement a "repeated" holdout method? I implemented the holdout method and got an accuracy score, but I need to repeat the holdout 30 times.
Here is my code for the holdout method:
[IN]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y.values.ravel(), random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
[OUT]
Accuracy: 49.62%
I have seen many code examples for repeated methods, but only for k-fold cross-validation, nothing for the holdout method.
To implement a repeated holdout, you could use ShuffleSplit from sklearn. A minimal working example (following the naming conventions you used) might be as follows:
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# create some artificial data to train on; can be replaced by your own data
X, Y = make_classification()

rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)
model = LogisticRegression()
for train_index, test_index in rs.split(X):
    X_train, Y_train = X[train_index], Y[train_index]
    X_test, Y_test = X[test_index], Y[test_index]
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Accuracy: %.2f%%" % (result * 100.0))
n_splits determines how many times you would like to repeat the holdout. test_size determines the fraction of samples drawn as the test set: in this case 75% is sampled as the train set, whereas 25% is sampled as the test set. For reproducible results you can set random_state (any number suffices, as long as you use the same number consistently).
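If you only need the 30 accuracy values rather than the per-split data, a possible shortcut (a sketch under the same assumptions as above) is to hand the ShuffleSplit object to cross_val_score and summarise the result:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, Y = make_classification()

# cross_val_score accepts any CV splitter, so the repeated holdout needs no explicit loop
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)
scores = cross_val_score(LogisticRegression(), X, Y, cv=rs)
print("Mean accuracy: %.2f%% (std %.2f%%)" % (scores.mean() * 100, scores.std() * 100))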

ValueError: Found input variables with inconsistent numbers of samples: [676, 540]

X_train, X_test, y_train, y_test = train_test_split(features, df['Label'], test_size=0.2, random_state=111)
print (X_train.shape) # (540, 4196)
print (X_test.shape) # (136, 4196)
print (y_train.shape) # (540,)
print (y_test.shape) # (136,)
When fitting, it gives an error:
from sklearn.svm import SVC
classifier = SVC(random_state = 0)
classifier.fit(features,y_train)
y_pred = classifier.predict(features)
Error:
ValueError: Found input variables with inconsistent numbers of samples: [676, 540]
I tried this.
You want to call the fit function with your X_train, not with features. The error occurs because features and y_train don't have the same number of samples.
X_train, X_test, y_train, y_test = train_test_split(features, df['Label'], test_size=0.2, random_state=111)
print (X_train.shape)
print (X_test.shape)
print (y_train.shape)
print (y_test.shape)
from sklearn.svm import SVC
classifier = SVC(random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
You'll likely also want to call predict with X_test or X_train. You may want to learn a bit more about train/test splits and why they are used.
Why are you using features along with y_train in .fit()? I think you are supposed to use X_train instead.
Instead of
classifier.fit(features, y_train)
Use:
classifier.fit(X_train, y_train)
You are trying to use two sets of data with different shapes, since you did the split earlier, so features has more samples than y_train.
Also, your predict line should be:
.predict(X_test)

Feature selection from sklearn logisitc regression

I have created a binary classification model for text using sklearn's logistic regression. Now I want to identify the features the model uses. My code looks like this:
train, val, y_train, y_test = train_test_split(np.arange(data.shape[0]), lab, test_size=0.2, random_state=0)
X_train = data[train]
X_test = data[val]
#X_train, X_test, y_train, y_test = train_test_split(data, lab, test_size=0.2)

tfidf_vect = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
X_tfidf_train = tfidf_vect.fit_transform(X_train)
X_tfidf_test = tfidf_vect.transform(X_test)

clf_lr = LogisticRegression(penalty='l1')
clf_lr.fit(X_tfidf_train, y_train)

feature_names = tfidf_vect.get_feature_names()
print(len(feature_names))

y_pred_lr = clf_lr.predict_proba(X_tfidf_test)[:, 1]
What would be the best approach to do this?
You can use sklearn.feature_selection. Here is a link showing how to use it:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE
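For illustration, here is a minimal RFE sketch built on the question's variables (X_tfidf_train, y_train, feature_names); the value of n_features_to_select is arbitrary, and solver='liblinear' is an assumption, since the l1 penalty requires it in recent sklearn versions:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# recursively drop the weakest 10% of features per iteration until 100 remain
selector = RFE(LogisticRegression(penalty='l1', solver='liblinear'),
               n_features_to_select=100, step=0.1)
selector.fit(X_tfidf_train, y_train)

selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print(selected[:10])
Alternatively, since the model already uses an l1 penalty, the surviving features can be read off directly as the ones with nonzero entries in clf_lr.coef_.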
