I have created a binary classification model for text data using scikit-learn's logistic regression. Now I want to select the features used by the model. My code looks like this:
train, val, y_train, y_test = train_test_split(np.arange(data.shape[0]), lab, test_size=0.2, random_state=0)
X_train = data[train]
X_test = data[val]
#X_train, X_test, y_train, y_test = train_test_split(data, lab, test_size=0.2)
tfidf_vect = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')
X_tfidf_train = tfidf_vect.fit_transform(X_train)
X_tfidf_test = tfidf_vect.transform(X_test)
clf_lr = LogisticRegression(penalty='l1')
clf_lr.fit(X_tfidf_train, y_train)
feature_names = tfidf_vect.get_feature_names()
print(len(feature_names))
y_pred_lr = clf_lr.predict_proba(X_tfidf_test)[:, 1]
What would be the best approach to do this?
You can use sklearn.feature_selection. Here's a link showing how you can use recursive feature elimination (RFE):
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE
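Since you already fit an L1-penalized model, SelectFromModel from sklearn.feature_selection is another option: it keeps the tf-idf columns whose coefficients survive the penalty (RFE from the linked page works similarly but refits the model repeatedly, which can be slow on a large tf-idf vocabulary). A minimal sketch, assuming the X_tfidf_train, X_tfidf_test, y_train and tfidf_vect objects from your code; the solver argument is only needed on newer scikit-learn versions, where an L1 penalty requires 'liblinear' or 'saga':
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# Select the tf-idf features with non-zero (above-threshold) L1 coefficients
selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'))
selector.fit(X_tfidf_train, y_train)
mask = selector.get_support()  # boolean mask over the tf-idf columns
feature_names = np.array(tfidf_vect.get_feature_names())  # get_feature_names_out() on newer versions
print(feature_names[mask])  # the selected n-grams
X_tfidf_train_selected = selector.transform(X_tfidf_train)
X_tfidf_test_selected = selector.transform(X_tfidf_test)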
I have a task that requires me to analyse a model, but I need the output predictions for each cross-validation step, along with the data that the cross-validation used in that step.
Here is my code:
results= cross_validate(MLPClassifier, X_train, y_train, cv=5,return_estimator = True)
This did not work. I also tried:
results= cross_val_predict(MLPClassifier, X_train, y_train, cv=5)
Neither worked; however, the second call gave me a set of predictions with the same shape as y_train (the labels), whereas I expected a smaller array to be returned, say 10% of the size of y_train.
Also, I'm unsure how to obtain the data used for each cross-validation step.
How about using one of the Cross Validation iterators?
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit
from sklearn.neural_network import MLPClassifier
X, y = make_classification(n_samples=1000, random_state=0)
datasets = {} # [(X_train, y_train), (X_test, y_test)]
results = {}
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for idx, (train_index, test_index) in enumerate(ss.split(X)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    datasets[f"train_{idx}"] = X_train, y_train
    datasets[f"test_{idx}"] = X_test, y_test
    model = MLPClassifier(random_state=0).fit(X_train, y_train)
    results[f"accuracy_{idx}"] = model.score(X_test, y_test)
results
Output:
{'accuracy_0': 0.968,
'accuracy_1': 0.924,
'accuracy_2': 0.94,
'accuracy_3': 0.944,
'accuracy_4': 0.964}
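As an aside, cross_validate and cross_val_predict from the question also work once an estimator instance (MLPClassifier()) is passed rather than the class itself. A minimal sketch on the same synthetic data; note that cross_val_predict deliberately returns one out-of-fold prediction per sample, which is why its output has the same length as y:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.neural_network import MLPClassifier
X, y = make_classification(n_samples=1000, random_state=0)
# Pass an instance, not the class, so it can be cloned and fitted per fold
cv_results = cross_validate(MLPClassifier(random_state=0), X, y, cv=5, return_estimator=True)
print(cv_results["test_score"])   # one score per fold
print(cv_results["estimator"])    # the five fitted models
# Each sample is predicted by the fold that held it out, hence len(oof_pred) == len(y)
oof_pred = cross_val_predict(MLPClassifier(random_state=0), X, y, cv=5)
print(oof_pred.shape)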
I am exploring the use of GridSearchCV from sklearn to predict data. After fitting the data using RandomForestRegressor, I calculate the score (MSE) for the test and the train data. I can see there is a huge difference between the train MSE and the test MSE (even though the scores should be similar).
Here is the code:
# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Create Regressors Pipeline
pipeline_estimators = Pipeline([
    ('RandomForest', RandomForestRegressor()),
])
param_grid = [{'RandomForest__n_estimators': np.linspace(50, 100, 3).astype(int)}]
search = GridSearchCV(estimator=pipeline_estimators,
                      param_grid=param_grid,
                      scoring='neg_mean_squared_error',
                      cv=2)
search.fit(X_train, y_train)
y_test_predicted = search.best_estimator_.predict(X_test)
y_train_predicted = search.best_estimator_.predict(X_train)
print('MSE test predict', metrics.mean_squared_error(y_test, y_test_predicted))
print('MSE train predict',metrics.mean_squared_error(y_train, y_train_predicted))
The output is:
MSE test predict 0.0021045875412650343
MSE train predict 0.000332850878980335
If I don't use GridSearchCV but a for loop over the different 'n_estimators', the MSE scores obtained for the predicted test and train sets are very close.
To add more detail on the 'for loop' approach, here is the code:
n_estimators = np.linspace(50, 100, 3).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=0)
mse_train = []
mse_test = []
for val_n_estimators in n_estimators:
    regressor = RandomForestRegressor(n_estimators=val_n_estimators)
    regressor.fit(X, y)
    y_pred = regressor.predict(X_test)
    y_test_predicted = regressor.predict(X_test)
    y_train_predicted = regressor.predict(X_train)
    mse_train.append(metrics.mean_squared_error(y_train, y_train_predicted))
    mse_test.append(metrics.mean_squared_error(y_test, y_test_predicted))
For this code, mse_train and mse_test are very similar. But using GridSearchCV (see the code at the top of the post), they are not.
Any suggestions?
Why is there such a score difference when using GridSearchCV?
Thank you.
Marc
I have this in Python:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
# The gamma parameter is the kernel coefficient for kernels rbf/poly/sigmoid
svm = SVC(gamma='auto', probability=True)
svm.fit(X_train,y_train.values.ravel())
prediction = svm.predict(X_test)
prediction_prob = svm.predict_proba(X_test)
print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_prob[:,1]))
print(X_train)
print(y_train)
Now I want to build this with a different kernel (rbf) and store the values in arrays.
So, something like this:
def svm_grid_search(parameters, cv):
    # Store the outcome of the folds in these lists
    means = []
    stds = []
    params = []
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    for parameter in parameters:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
        # The gamma parameter is the kernel coefficient for kernels rbf/poly/sigmoid
        svm = SVC(gamma=1, kernel='rbf', probability=True)
        svm.fit(X_train, y_train.values.ravel())
        prediction = svm.predict(X_test)
        prediction_prob = svm.predict_proba(X_test)
    return means, stddevs, params
I know I want to loop over the parameters and then store the values in the lists, but I'm struggling with how to do so. What I'm trying to do is loop, fit the SVM with kernel = parameter, and store the results in the arrays.
I would be very thankful if you could help me out here.
This is what GridSearchCV is for. Link here
See here for an example
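A minimal sketch of what that could look like for an rbf-kernel SVC; the C and gamma values below are only illustrative, and X and y are assumed to be your existing data:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Illustrative grid; put whatever kernels/parameters you want to compare here
param_grid = {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(probability=True), param_grid, cv=5)
search.fit(X_train, y_train.values.ravel())
# cv_results_ already holds the per-candidate fold statistics you wanted to collect
means = search.cv_results_['mean_test_score']
stds = search.cv_results_['std_test_score']
params = search.cv_results_['params']
for mean, std, p in zip(means, stds, params):
    print('%0.3f (+/- %0.3f) for %r' % (mean, std, p))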
Is it possible (and if so, how) to dynamically train a sklearn MultinomialNB classifier?
I would like to train (update) my spam classifier every time I feed an email into it.
I want this (does not work):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    clf.fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
to have similar result as this (works OK):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
clf.fit(x_train, y_train)
preds = clf.predict(x_test)
Scikit-learn supports incremental learning for multiple algorithms, including MultinomialNB. Check the docs here
You'll need to use the method partial_fit() instead of fit(), so your example code would look like:
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    if i == 0:
        clf.partial_fit([x_train[i]], [y_train[i]], classes=numpy.unique(y_train))
    else:
        clf.partial_fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
Edit: added the classes argument to partial_fit, as suggested by @BobWazowski
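If you want to avoid the i == 0 special case, a variant (a sketch under the same assumptions as the snippets above: features, labels, tts and MultinomialNB from the question) is to compute the class list once and pass it on every call; partial_fit accepts a repeated classes argument as long as it stays consistent:
import numpy
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
classes = numpy.unique(y_train)  # assumes every class appears in y_train
clf = MultinomialNB()
for xi, yi in zip(x_train, y_train):
    clf.partial_fit([xi], [yi], classes=classes)
preds = clf.predict(x_test)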
I would like to do K-fold cross-validation. The code before K-fold cross-validation looks like this, and it works perfectly:
df = pd.read_csv('finalupdatedothers-multilabel.csv')
X= df[['sentences']]
dfy = df[['ADR','WD','EF','INF','SSI','DI','others']]
df1 = dfy.stack().reset_index()
df1.columns = ['a','b','c']
y_train_text = df1.groupby('a')['b'].apply(list)
lb = preprocessing.MultiLabelBinarizer()
# Run classifier
stop_words = stopwords.words('english')
classifier = make_pipeline(CountVectorizer(),
                           TfidfTransformer(),
                           #SelectKBest(chi2, k=4),
                           OneVsRestClassifier(SGDClassifier()))
#combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
random_state = np.random.RandomState(0)
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y_train_text, test_size=.2,
                                                    random_state=random_state)
print(y_train)
# # Binarize the output classes
Y = lb.fit_transform(y_train)
Y_test=lb.transform(y_test)
classifier.fit(X_train, Y)
y_score = classifier.fit(X_train, Y).decision_function(X_test)
print ("y_score"+str(y_score))
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)
#print accuracy_score
print ("accuracy : "+str(accuracy_score(Y_test, predicted)))
print ("micro f-measure "+str(f1_score(Y_test, predicted, average='weighted')))
print("precision"+str(precision_score(Y_test,predicted,average='weighted')))
print("recall"+str(recall_score(Y_test,predicted,average='weighted')))
for item, labels in zip(X_test, all_labels):
    print('%s => %s' % (item, ', '.join(labels)))
When I change the code to use k-fold cross-validation instead of train_test_split, I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 6008]
Updated with iloc
My code to use k-fold cross-validation looks like this:
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y_train_text.iloc[train_index], y_train_text.iloc[test_index]
Would you please let me know which part I'm doing incorrectly?
My data looks like this:
,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1.0,,,,,,
1,I am detoxing from Lexapro now.,,,,,,,1.0
2,I slowly cut my dosage over several months and took vitamin supplements to help.,,,,,,,1.0