Classification problem: iterating over different K values for KNN - Python

I am currently working on a classification problem (tweet sentiment analysis) and I would like to loop over different K values for KNN in the classifiers list below.
I know I could just write:
KNeighborsClassifier(3), KNeighborsClassifier(5), and so on, but I am trying to implement a more elegant solution with a for loop.
Unfortunately, creating an empty list, adding the different K values to it, and then including it in the classifiers list does not work properly. Do you have any good recommendations?
My code:
classifiers = [
    KNeighborsClassifier(3),
    LogisticRegression(),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    MultinomialNB(),
    BernoulliNB()]

for clf in classifiers:
    clf.fit(train_tf, y_train)
    name = clf.__class__.__name__
    expectation = y_train
    train_prediction = clf.predict(train_tf)
    acc = accuracy_score(expectation, train_prediction)
    pre = precision_score(expectation, train_prediction)
    rec = recall_score(expectation, train_prediction)
    f1 = f1_score(expectation, train_prediction)

    fig, ax = plt.subplots(1, 2, figsize=(14, 4))
    plt.suptitle(f'{name} \n', fontsize=18)
    plt.subplots_adjust(top=0.8)
    skplt.metrics.plot_confusion_matrix(expectation, train_prediction, ax=ax[0])
    skplt.metrics.plot_confusion_matrix(expectation, train_prediction, normalize=True, ax=ax[1])
    plt.show()

    print(f"for the {name} we receive the following values:")
    print("Accuracy: {:.3%}".format(acc))
    print('Precision score: {:.3%}'.format(pre))
    print('Recall score: {:.3%}'.format(rec))
    print('F1 score: {:.3%}'.format(f1))
If you need any more info, just let me know :) Thank you very much in advance!

Something like this could work: you essentially freeze the iteration at the KNeighborsClassifier entry until all the neighbor values are exhausted.
classifiers = [
    KNeighborsClassifier(3),
    LogisticRegression(),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    MultinomialNB(),
    BernoulliNB()]

n_neighbors = [3, 5, 6, 10, 15]  # or whatever

class_iter = iter(classifiers)
clf = next(class_iter)
while True:
    try:
        # stay on the KNN classifier while there are neighbor values left
        if isinstance(clf, KNeighborsClassifier) and n_neighbors:
            neighbor_val = n_neighbors.pop()
            clf.set_params(n_neighbors=neighbor_val)
        else:
            clf = next(class_iter)

        # rest of code here
        clf.fit(train_tf, y_train)
        name = clf.__class__.__name__
        expectation = y_train
        train_prediction = clf.predict(train_tf)
        acc = accuracy_score(expectation, train_prediction)
        pre = precision_score(expectation, train_prediction)
        rec = recall_score(expectation, train_prediction)
        f1 = f1_score(expectation, train_prediction)

        fig, ax = plt.subplots(1, 2, figsize=(14, 4))
        plt.suptitle(f'{name} \n', fontsize=18)
        plt.subplots_adjust(top=0.8)
        skplt.metrics.plot_confusion_matrix(expectation, train_prediction, ax=ax[0])
        skplt.metrics.plot_confusion_matrix(expectation, train_prediction, normalize=True, ax=ax[1])
        plt.show()

        print(f"for the {name} we receive the following values:")
        print("Accuracy: {:.3%}".format(acc))
        print('Precision score: {:.3%}'.format(pre))
        print('Recall score: {:.3%}'.format(rec))
        print('F1 score: {:.3%}'.format(f1))
    except StopIteration:
        break
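A simpler alternative, if the goal is just to cover several K values, is to build the KNN variants up front with a list comprehension (a minimal sketch, assuming the same metrics loop as in the question):

# Build one KNN classifier per K, then append the remaining models.
n_neighbors = [3, 5, 6, 10, 15]
classifiers = [KNeighborsClassifier(n) for n in n_neighbors] + [
    LogisticRegression(),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    MultinomialNB(),
    BernoulliNB()]

The original for clf in classifiers loop then runs unchanged; if the report should distinguish the KNN variants, include clf.get_params()['n_neighbors'] in the printed name.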

Related

Interactive plot in Python for DBSCAN clustering

I'm trying to build code for an interactive DBSCAN clustering method, but when I run it I get some errors. One of them says "NameError: name 'train_test_split' is not defined", even though I tried to import it using: from sklearn.model_selection import train_test_split
How can I get the interactive plot working properly?
df_mv = pd.read_csv(r"https://raw.githubusercontent.com/HanaBachi/MachineLearning/main/multishape.csv")

text_trap = io.StringIO()
sys.stdout = text_trap

l = widgets.Text(value=' DBSCAN, Hana Bachi, The University of Texas at Austin',
                 layout=Layout(width='950px', height='30px'))
eps = widgets.FloatSlider(min=0, max=2, value=0.1, step=0.1, description='eps',
                          orientation='horizontal', style={'description_width': 'initial'},
                          continuous_update=False)
minPts = widgets.FloatSlider(min=0, max=5, value=1, step=1, description='minPts %',
                             orientation='horizontal', style={'description_width': 'initial'},
                             continuous_update=False)
color = ['blue', 'red', 'green', 'yellow', 'orange', 'white', 'magenta', 'cyan']
style = {'description_width': 'initial'}
ui = widgets.HBox([eps, minPts],)
ui2 = widgets.VBox([l, ui],)

# create activation function plots
def DBSCAN_plot(eps, minPts):
    db = DBSCAN(eps=0.155, min_samples=5).fit(df_mv)
    labels = db.labels_
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    x = df_mv.values[:, 0]
    y = df_mv.values[:, 1]
    cmap = plt.cm.rainbow
    # norm = mc.BoundaryNorm(labels, cmap.N)
    plt.figure(figsize=(14, 7))
    plt.scatter(x, y, c=labels, cmap='tab10', s=50)
    plt.scatter(x[np.where(labels == -1)], y[np.where(labels == -1)], c='k', marker='x', s=100)
    plt.title('DBSCAN of non-spherical data with noise', fontsize=20)
    plt.colorbar()
    plt.show()
    plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.0, wspace=0.2, hspace=0.3)
    plt.show()

interactive_plot1 = widgets.interactive_output(DBSCAN_plot, {'eps': eps})
interactive_plot1 = widgets.interactive_output(DBSCAN_plot, {'minPts': minPts})
interactive_plot1.clear_output(wait=True)  # reduce flickering by delaying plot updating

# create dashboard/formatting
uia = widgets.HBox([interactive_plot1],)
uia2 = widgets.VBox([eps, uia],)
uib = widgets.HBox([interactive_plot1],)
uib2 = widgets.VBox([minPts, uib],)

display(uib2, interactive_plot)  # display the interactive plot
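A minimal sketch of the usual fix (an assumption, not from the original post): DBSCAN_plot should use its eps and minPts arguments rather than hard-coded values, and both sliders need to be registered with a single interactive_output call so the function actually receives a value for each argument:

# Sketch (assumption): use the slider values inside the callback ...
def DBSCAN_plot(eps, minPts):
    db = DBSCAN(eps=eps, min_samples=int(minPts)).fit(df_mv)
    ...  # plotting code as above

# ... and wire BOTH controls into one interactive_output, then display it:
interactive_plot = widgets.interactive_output(DBSCAN_plot, {'eps': eps, 'minPts': minPts})
display(ui2, interactive_plot)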

How to perform grid search for Multiple ML Models

Normally we use GridSearchCV to perform a grid search over the hyperparameters of one particular model, for example:
model_ada = AdaBoostClassifier()
params_ada = {'n_estimators':[10,20,30,50,100,500,1000], 'learning_rate':[0.5,1,2,5,10]}
grid_ada = GridSearchCV(estimator = model_ada, param_grid = params_ada, scoring = 'accuracy', cv = 5, verbose = 1, n_jobs = -1)
grid_ada.fit(X_train, y_train)
Is there any technique or function that allows us to perform a grid search over the ML models themselves? For example, I want to do something like this:
models = {'model_gbm':GradientBoostingClassifier(), 'model_rf':RandomForestClassifier(), 'model_dt':DecisionTreeClassifier(), 'model_svm':SVC(), 'model_ada':AdaBoostClassifier()}
params_gbm = {'learning_rate':[0.1,0.2,0.3,0.4], 'n_estimators':[50,100,500,1000,2000]}
params_rf = {'n_estimators':[50,100,500,1000,2000]}
params_dt = {'splitter':['best','random'], 'max_depth':[1, 5, 10, 50, 100]}
params_svm = {'C':[1,2,5,10,50,100,500], 'kernel':['rbf','poly','sigmoid','linear']}
params_ada = {'n_estimators':[10,20,30,50,100,500,1000], 'learning_rate':[0.5,1,2,5,10]}
params = {'params_gbm':params_gbm, 'params_rf':params_rf, 'params_dt':params_dt, 'params_svm':params_svm, 'params_ada':params_ada}
grid_ml = "that function"(models = models, params = params)
grid_ml.fit(X_train, y_train)
where "that function" is the function which I need to use to perform this type of operation.
I faced a similar issue myself but couldn't find a predefined package/method that achieves this, so I wrote my own function:
def Algo_search(models, params):
    max_score = 0
    max_model = None
    max_model_params = None
    for name, model in models.items():
        gs = GridSearchCV(estimator=model, param_grid=params[name])
        gs.fit(X_train, y_train)
        score = gs.score(X_test, y_test)
        if score > max_score:
            max_score = score
            max_model = gs.best_estimator_
            max_model_params = gs.best_params_
    return max_score, max_model, max_model_params
# Data points
models = {'model_gbm': GradientBoostingClassifier(), 'model_rf': RandomForestClassifier(),
          'model_dt': DecisionTreeClassifier(), 'model_svm': SVC(), 'model_ada': AdaBoostClassifier()}
params_gbm = {'learning_rate': [0.1, 0.2, 0.3, 0.4], 'n_estimators': [50, 100, 500, 1000, 2000]}
params_rf = {'n_estimators': [50, 100, 500, 1000, 2000]}
params_dt = {'splitter': ['best', 'random'], 'max_depth': [1, 5, 10, 50, 100]}
params_svm = {'C': [1, 2, 5, 10, 50, 100, 500], 'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
params_ada = {'n_estimators': [10, 20, 30, 50, 100, 500, 1000], 'learning_rate': [0.5, 1, 2, 5, 10]}
params = {'model_gbm': params_gbm, 'model_rf': params_rf, 'model_dt': params_dt,
          'model_svm': params_svm, 'model_ada': params_ada}

grid_ml = Algo_search(models=models, params=params)
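Note that Algo_search runs the searches eagerly and returns a tuple rather than a fitted object, so unlike the hypothetical grid_ml.fit(...) from the question, the result should be unpacked (a usage sketch, not in the original answer):

max_score, max_model, max_model_params = Algo_search(models=models, params=params)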
It should be straightforward to run multiple GridSearchCV searches and then compare the results.
Below is a complete example of how to achieve this.
Note that there is room for improvement, which I will leave to you; this is just to give you a sense of the idea.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier, \
    RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def get_param(model_name, params):
    """
    Not the most efficient way.
    I recommend keeping params and models
    in OrderedDict() instead.
    """
    for k, v in params.items():
        mn = str(model_name).upper().split('_')
        for k_ in str(k).upper().split('_'):
            if k_ in mn:
                return v

def models_gridSearchCV(models, params, scorer, X, y):
    all_results = dict.fromkeys(models.keys(), [])
    best_model = {'model_name': None,
                  'best_estimator': None,
                  'best_params': None,
                  'best_score': -9999999}
    for model_name, model in models.items():
        print("Processing {} ...".format(model_name))
        # or use OrderedDict() and zip(models, params) above
        # so there will be no need to check
        param = get_param(model_name, params)
        if param is None:
            continue
        clf = GridSearchCV(model, param, scoring=scorer)
        clf.fit(X, y)
        all_results[model_name] = clf.cv_results_
        if clf.best_score_ > best_model.get('best_score'):
            best_model['model_name'] = model_name
            best_model['best_estimator'] = clf.best_estimator_
            best_model['best_params'] = clf.best_params_
            best_model['best_score'] = clf.best_score_
    return best_model, all_results

### TEST ###
iris = datasets.load_iris()
X, y = iris.data, iris.target

# OrderedDict() is recommended here
# to maintain order between models and params
models = {'model_gbm': GradientBoostingClassifier(),
          'model_rf': RandomForestClassifier(),
          'model_dt': DecisionTreeClassifier(),
          'model_svm': SVC(),
          'model_ada': AdaBoostClassifier()}

params_gbm = {'learning_rate': [0.1, 0.2], 'n_estimators': [50, 100]}
params_rf = {'n_estimators': [50, 100]}
params_dt = {'splitter': ['best', 'random'], 'max_depth': [1, 5]}
params_svm = {'C': [1, 2, 5], 'kernel': ['rbf', 'linear']}
params_ada = {'n_estimators': [10, 100], 'learning_rate': [0.5, 1]}

# OrderedDict() is recommended here
# to maintain order between models and params
params = {'params_gbm': params_gbm,
          'params_rf': params_rf,
          'params_dt': params_dt,
          'params_svm': params_svm,
          'params_ada': params_ada}

best_model, all_results = models_gridSearchCV(models, params, 'accuracy', X, y)
print(best_model)
# print(all_results)
Result
Processing model_gbm ...
Processing model_rf ...
Processing model_dt ...
Processing model_svm ...
Processing model_ada ...
{'model_name': 'model_svm', 'best_estimator': SVC(C=5),
'best_params': {'C': 5, 'kernel': 'rbf'}, 'best_score': 0.9866666666666667}
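For completeness, a third option worth knowing (a sketch, not from the original answers): a single GridSearchCV can search over the models themselves by treating a Pipeline step as a parameter, with one param-grid dict per candidate model.

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# 'clf' is a placeholder step; each dict in the grid swaps in a different model
pipe = Pipeline([('clf', SVC())])
param_grid = [
    {'clf': [SVC()], 'clf__C': [1, 2, 5], 'clf__kernel': ['rbf', 'linear']},
    {'clf': [DecisionTreeClassifier()], 'clf__max_depth': [1, 5]},
]
search = GridSearchCV(pipe, param_grid, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)  # includes the winning estimator under 'clf'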

How can I calculate the coherence score in the sklearn implementation of NMF?

I'm trying to build a utility where a dataset will be processed by the NMF model every couple of days. For the first run, I provide a starting value for the number of topics. How can I calculate the coherence score for this entire dataset? I'm planning to use the calculated score to rebuild the model so that it will be more accurate. Below is the code I've used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import pandas as pd
import clr

# PLOTTING TOOLS
# import matplotlib.pyplot as PLOTTING
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

dataset = pd.read_json('out.json', lines=True)
documents = dataset['attachment']

no_features = 1000
no_topics = 9
# print('Old number of topics: ', no_topics)

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features,
                                   stop_words='english', norm='l2')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

no_topics = tfidf.shape
retrain_value = no_topics[0]
# print('New number of topics :', retrain_value)

nmf = NMF(n_components=retrain_value, random_state=1, alpha=.1, l1_ratio=.5,
          init='nndsvd').fit(tfidf)

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d: " % (topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 20
display_topics(nmf, tfidf_feature_names, no_top_words)
Unfortunately, there is no out-of-the-box coherence model for sklearn.decomposition.NMF.
I've had the very same issue and found a custom implementation that works with Python 3.8.
It should be easy to adapt to your code. Please check the link for the full imports, etc.
A snippet from my recent usage of this technique:
kmin, kmax = 2, 30

topic_models = []
# try each value of k
for k in range(kmin, kmax + 1):
    print("Applying NMF for k=%d ..." % k)
    # run NMF
    model = decomposition.NMF(init="nndsvd", n_components=k)
    W = model.fit_transform(A)
    H = model.components_
    # store for later
    topic_models.append((k, W, H))

class TokenGenerator:
    def __init__(self, documents, stopwords):
        self.documents = documents
        self.stopwords = stopwords
        self.tokenizer = re.compile(r"(?u)\b\w\w+\b")

    def __iter__(self):
        print("Building Word2Vec model ...")
        for doc in self.documents:
            tokens = []
            for tok in self.tokenizer.findall(doc):
                if tok.lower() in self.stopwords:
                    tokens.append("<stopword>")
                elif len(tok) >= 2:
                    tokens.append(tok.lower())
            yield tokens

docgen = TokenGenerator(docs_raw, stop_words)
# note: with gensim >= 4.0 the parameter is vector_size instead of size
w2v_model = gensim.models.Word2Vec(docgen, size=500, min_count=20, sg=1)

def calculate_coherence(w2v_model, term_rankings):
    overall_coherence = 0.0
    for topic_index in range(len(term_rankings)):
        # check each pair of terms
        pair_scores = []
        for pair in combinations(term_rankings[topic_index], 2):
            # with gensim >= 4.0 use w2v_model.wv.similarity(...)
            pair_scores.append(w2v_model.similarity(pair[0], pair[1]))
        # get the mean for all pairs in this topic
        topic_score = sum(pair_scores) / len(pair_scores)
        overall_coherence += topic_score
    # get the mean score across all topics
    return overall_coherence / len(term_rankings)

def get_descriptor(all_terms, H, topic_index, top):
    # reverse sort the values to sort the indices
    top_indices = np.argsort(H[topic_index, :])[::-1]
    # now get the terms corresponding to the top-ranked indices
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append(all_terms[term_index])
    return top_terms

k_values = []
coherences = []
for (k, W, H) in topic_models:
    # get all of the topic descriptors - the term_rankings, based on top 10 terms
    term_rankings = []
    for topic_index in range(k):
        term_rankings.append(get_descriptor(terms, H, topic_index, 10))
    # now calculate the coherence based on our Word2vec model
    k_values.append(k)
    coherences.append(calculate_coherence(w2v_model, term_rankings))
    print("K=%02d: Coherence=%.4f" % (k, coherences[-1]))

%matplotlib inline
plt.style.use("ggplot")
matplotlib.rcParams.update({"font.size": 14})
fig = plt.figure(figsize=(13, 7))
# create the line plot
ax = plt.plot(k_values, coherences)
plt.xticks(k_values)
plt.xlabel("Number of Topics")
plt.ylabel("Mean Coherence")
# add the points
plt.scatter(k_values, coherences, s=120)
# find and annotate the maximum point on the plot
ymax = max(coherences)
xpos = coherences.index(ymax)
best_k = k_values[xpos]
plt.annotate("k=%d" % best_k, xy=(best_k, ymax), xytext=(best_k, ymax),
             textcoords="offset points", fontsize=16)
# show the plot
plt.show()
Results:
K=02: Coherence=0.4157
K=03: Coherence=0.4399
K=04: Coherence=0.4626
K=05: Coherence=0.4333
K=06: Coherence=0.4075
K=07: Coherence=0.4121
...
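Another option, if gensim is already a dependency (a sketch under assumptions, not from the original answer): gensim's CoherenceModel can score the topics produced by sklearn's NMF directly, given the per-topic top-term lists and the tokenized corpus. Here, documents is the raw text series from the question and term_rankings the top-term lists built with get_descriptor above.

from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# tokenize the raw documents (a naive whitespace split for illustration)
texts = [doc.split() for doc in documents]
dictionary = Dictionary(texts)

cm = CoherenceModel(topics=term_rankings, texts=texts,
                    dictionary=dictionary, coherence='c_v')
print(cm.get_coherence())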

LightGBM vs Sklearn LightGBM - Mistake in Implementation - Exact same parameters giving different results

While passing the exact same parameters to LightGBM's native API and to its sklearn implementation, I am getting different results. Initially I got identical results from both; then I made some changes to my code, and now I can't figure out why they no longer match: the performance metrics and the feature importances come out different. Please help me find the mistake; it could be in the way I am using the original LightGBM library or in sklearn's implementation. Link explaining why we should get identical results: light gbm - python API vs Scikit-learn API
x_train, x_test, y_train, y_test = train_test_split(df_dummy[df_merge.columns], labels,
                                                    test_size=0.25, random_state=42)
n_folds = 5
lgb_train = lgb.Dataset(x_train, y_train)

def objective(params, n_folds=n_folds):
    """Objective function for Gradient Boosting Machine hyperparameter tuning"""
    print(params)
    params['max_depth'] = int(params['max_depth'])
    params['num_leaves'] = int(params['num_leaves'])
    params['min_child_samples'] = int(params['min_child_samples'])
    params['subsample_freq'] = int(params['subsample_freq'])
    # perform n_fold cross validation with hyperparameters
    # use early stopping and evaluate based on ROC AUC
    cv_results = lgb.cv(params, lgb_train, nfold=n_folds, num_boost_round=10000,
                        early_stopping_rounds=100, metrics='auc')
    # extract the best score
    best_score = max(cv_results['auc-mean'])
    # loss must be minimized
    loss = 1 - best_score
    num_iteration = int(np.argmax(cv_results['auc-mean']) + 1)

    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([loss, params, num_iteration])
    # dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'status': STATUS_OK, 'estimators': num_iteration}
space = {
    'min_child_samples': hp.quniform('min_child_samples', 5, 100, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.0),
    'max_depth': hp.quniform('max_depth', 3, 10, 1),
    'subsample': hp.quniform('subsample', 0.6, 1, 0.05),
    'num_leaves': hp.quniform('num_leaves', 20, 150, 1),
    'subsample_freq': hp.quniform('subsample_freq', 0, 10, 1),
    'min_gain_to_split': hp.quniform('min_gain_to_split', 0.01, 0.1, 0.01),
    'learning_rate': 0.05,
    'objective': 'binary',
}

out_file = 'results/gbm_trials.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'estimators'])
of_connection.close()

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, trials=trials, max_evals=10)
bayes_trials_results = sorted(trials.results, key=lambda x: x['loss'])

results = pd.read_csv('results/gbm_trials.csv')
# sort with best scores on top and reset index for slicing
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.head()

best_bayes_estimators = int(results.loc[0, 'estimators'])
best['max_depth'] = int(best['max_depth'])
best['num_leaves'] = int(best['num_leaves'])
best['min_child_samples'] = int(best['min_child_samples'])
num_boost_round = int(best_bayes_estimators * 1.1)
best['objective'] = 'binary'
best['boosting_type'] = 'gbdt'
best['subsample_freq'] = int(best['subsample_freq'])
# Actual LightGBM
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)

print('Plotting feature importances...')
ax = lgb.plot_importance(best_gbm, max_num_features=15)
plt.show()

feature_imp = pd.DataFrame()
feature_imp["feature"] = list(x_train.columns)
feature_imp["importance_gain"] = best_gbm.feature_importance(importance_type='gain')
feature_imp["importance_split"] = best_gbm.feature_importance(importance_type='split')
feature_imp.to_clipboard()

y_pred_score = best_gbm.predict(x_test)

roc_auc_score_list = []
f1_score_list = []
accuracy_score_list = []
precision_score_list = []
recall_score_list = []
thresholds = [0.4, 0.5, 0.6, 0.7]
for threshold in thresholds:
    print("threshold is {}".format(threshold))
    y_pred = np.where(y_pred_score >= threshold, 1, 0)
    print(roc_auc_score(y_test, y_pred_score))
    print(f1_score(y_test, y_pred))
    print(accuracy_score(y_test, y_pred))
    print(precision_score(y_test, y_pred))
    print(recall_score(y_test, y_pred))
    roc_auc_score_list.append(roc_auc_score(y_test, y_pred_score))
    f1_score_list.append(f1_score(y_test, y_pred))
    accuracy_score_list.append(accuracy_score(y_test, y_pred))
    precision_score_list.append(precision_score(y_test, y_pred))
    recall_score_list.append(recall_score(y_test, y_pred))

performance_metrics = pd.DataFrame(
    {'thresholds': thresholds,
     'roc_auc_score': roc_auc_score_list,
     'f1_score': f1_score_list,
     'accuracy_score': accuracy_score_list,
     'precision_score': precision_score_list,
     'recall_score': recall_score_list})
performance_metrics.transpose().to_clipboard()
# Sklearn's implementation of LightGBM
best_sk = dict(best)
del best_sk['min_gain_to_split']
sk_best_gbm = lgb.LGBMClassifier(**best_sk, n_estimators=num_boost_round,
                                 learning_rate=0.05, min_split_gain=best['min_gain_to_split'])
sk_best_gbm.fit(x_train, y_train)
sk_best_gbm.get_params()

print('Plotting feature importances...')
ax = lgb.plot_importance(sk_best_gbm, max_num_features=15)
plt.show()

feature_imp = pd.DataFrame()
feature_imp["feature"] = list(x_train.columns)
feature_imp["Importance"] = sk_best_gbm.feature_importances_
feature_imp.to_clipboard()

y_pred_score = sk_best_gbm.predict_proba(x_test)[:, 1]

roc_auc_score_list = []
f1_score_list = []
accuracy_score_list = []
precision_score_list = []
recall_score_list = []
thresholds = [0.4, 0.5, 0.6, 0.7]
for threshold in thresholds:
    print("threshold is {}".format(threshold))
    y_pred = np.where(y_pred_score >= threshold, 1, 0)
    print(roc_auc_score(y_test, y_pred_score))
    print(f1_score(y_test, y_pred))
    print(accuracy_score(y_test, y_pred))
    print(precision_score(y_test, y_pred))
    print(recall_score(y_test, y_pred))
    roc_auc_score_list.append(roc_auc_score(y_test, y_pred_score))
    f1_score_list.append(f1_score(y_test, y_pred))
    accuracy_score_list.append(accuracy_score(y_test, y_pred))
    precision_score_list.append(precision_score(y_test, y_pred))
    recall_score_list.append(recall_score(y_test, y_pred))

performance_metrics = pd.DataFrame(
    {'thresholds': thresholds,
     'roc_auc_score': roc_auc_score_list,
     'f1_score': f1_score_list,
     'accuracy_score': accuracy_score_list,
     'precision_score': precision_score_list,
     'recall_score': recall_score_list})
performance_metrics.transpose().to_clipboard()
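A hedged debugging sketch (an assumption, not from the original post): dump both effective parameter sets and diff them. A key that one API sets and the other leaves at its default, such as the min_gain_to_split / min_split_gain naming difference handled above, is the usual culprit when the native API and the sklearn wrapper disagree.

# Compare the params the native booster was trained with against the sklearn
# wrapper's params; any key that differs (or is missing) breaks reproducibility.
sk_params = sk_best_gbm.get_params()
for key in sorted(set(best) | set(sk_params)):
    if best.get(key) != sk_params.get(key):
        print("{}: native={!r} sklearn={!r}".format(key, best.get(key), sk_params.get(key)))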

PySpark - How to get precision / recall / ROC from TrainValidationSplit?

My current approach to evaluating different parameters for LinearSVC and getting the best one:
tokenizer = Tokenizer(inputCol="Text", outputCol="words")
wordsData = tokenizer.transform(df)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
featurizedData = hashingTF.transform(wordsData)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)

LSVC = LinearSVC()
rescaledData = idfModel.transform(featurizedData)

paramGrid = ParamGridBuilder()\
    .addGrid(LSVC.maxIter, [1])\
    .addGrid(LSVC.regParam, [0.001, 10.0])\
    .build()

crossval = TrainValidationSplit(estimator=LSVC,
                                estimatorParamMaps=paramGrid,
                                evaluator=MulticlassClassificationEvaluator(metricName="weightedPrecision"),
                                testRatio=0.01)

cvModel = crossval.fit(rescaledData.select("KA", "features").selectExpr("KA as label", "features as features"))
bestModel = cvModel.bestModel
Now I would like to get the basic evaluation metrics (precision, recall, etc.). How do I get those?
You can try this:
from pyspark.mllib.evaluation import MulticlassMetrics

# Instantiate metrics object
metrics = MulticlassMetrics(predictionAndLabels)

# Overall statistics
precision = metrics.precision()
recall = metrics.recall()
f1Score = metrics.fMeasure()
print("Summary Stats")
print("Precision = %s" % precision)
print("Recall = %s" % recall)
print("F1 Score = %s" % f1Score)
You can check this link for further info
https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html
