I am testing an SVM with a sigmoid kernel on the iris data using sklearn's SVC. Its performance is extremely poor, with an accuracy of 25%. I'm using exactly the same code, and normalizing the features, as https://towardsdatascience.com/a-guide-to-svm-parameter-tuning-8bfe6b8a452c (sigmoid section), which should increase performance substantially. However, I am not able to reproduce the author's results, and the accuracy only increases to 33%.
Using other kernels (e.g. the linear kernel) produces good results (accuracy of 82%).
Could there be an issue within the SVC(kernel = 'sigmoid') function?
Python code to reproduce the problem:
##sigmoid iris example
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import classification_report

iris = datasets.load_iris()
sepal_length = iris.data[:,0]
sepal_width = iris.data[:,1]
#assessing performance of sigmoid SVM
clf = SVC(kernel='sigmoid')
clf.fit(np.c_[sepal_length, sepal_width], iris.target)
pr=clf.predict(np.c_[sepal_length, sepal_width])
pd.DataFrame(classification_report(iris.target, pr, output_dict=True))
from sklearn.metrics.pairwise import sigmoid_kernel
sigmoid_kernel(np.c_[sepal_length, sepal_width])
#normalizing features
from sklearn.preprocessing import normalize
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
sigmoid_kernel(np.c_[sepal_length_norm, sepal_width_norm])
#assessing performance of sigmoid SVM with normalized features
pr_norm=clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True))
I see what's happening. In sklearn releases prior to 0.22, the default gamma parameter passed to SVC was "auto"; in 0.22 and later it was changed to "scale". The author of the article seems to have been using an earlier version and was therefore implicitly passing gamma="auto" (he mentions that the "current default setting for gamma is ‘auto’"). So if you're on the latest release of sklearn (0.23.2), you'll want to explicitly pass gamma='auto' when instantiating the SVC:
clf = SVC(kernel='sigmoid',gamma='auto')
#normalizing features
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
So now when you print the classification report:
pr_norm=clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
print(pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True)))
# 0 1 2 accuracy macro avg weighted avg
# precision 0.907407 0.650000 0.750000 0.766667 0.769136 0.769136
# recall 0.980000 0.780000 0.540000 0.766667 0.766667 0.766667
# f1-score 0.942308 0.709091 0.627907 0.766667 0.759769 0.759769
# support 50.000000 50.000000 50.000000 0.766667 150.000000 150.000000
What explains the 33% accuracy you were seeing is that with the default gamma="scale", all predictions end up in a single region of the decision plane; since the targets are split evenly into thirds, you get an accuracy of exactly 33.3%:
clf = SVC(kernel='sigmoid')
#normalizing features
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
X = np.c_[sepal_length_norm, sepal_width_norm]
pr_norm = clf.predict(X)
print(pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True)))
# 0 1 2 accuracy macro avg weighted avg
# precision 0.0 0.0 0.333333 0.333333 0.111111 0.111111
# recall 0.0 0.0 1.000000 0.333333 0.333333 0.333333
# f1-score 0.0 0.0 0.500000 0.333333 0.166667 0.166667
# support 50.0 50.0 50.000000 0.333333 150.000000 150.000000
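As a quick sanity check (a sketch reusing the clf and pr_norm from the snippet above), you can confirm that every sample is assigned to the same class:
import numpy as np
# with the default gamma="scale" on these normalized features, only one
# distinct label shows up in the predictions (class 2, per the report above)
print(np.unique(pr_norm))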
Related
Newbie to ML.
I'm having trouble understanding a classification report for a RandomForest model I'm running.
I've cleaned my data and realised there's data imbalance between my target labels (0: 182588, 1: 1137) - essentially 99/1.
As I understand, I need to either oversample or undersample my data to improve the predictive modelling of the RandomForest model.
I've run both. I've also used TargetEncoding to convert my categorical data into numerical data (there are over 40 columns after pre-processing, so I felt this was the best approach - at least better than one-hot encoding, which would result in too much noise).
I then ran my model using both over and under sampling. With over, it returned an accuracy of 99%, with under, accuracy of 84%.
A 99% accuracy seems unrealistic and suggests overfitting; 84% makes more sense. Nonetheless, what's more important to me is accuracy/recall.
How do I interpret the Classification Report - what is 'macro avg' and should I be looking at that to see how accurate my model is?
Here's a snippet of my code, both with over/under sampling. Maybe I'm doing something wrong and just can't see it.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import RandomOverSampler
from category_encoders import TargetEncoder  # assuming the category_encoders package

label = mod_df.iloc[:, -1]
predictors = mod_df.iloc[:, :-1]
X_train, X_test, y_train, y_test = train_test_split(predictors, label, test_size = 0.4, random_state = 1, stratify = mod_df.iloc[:, -1])
oversample = RandomOverSampler(sampling_strategy = 'minority', random_state = 1) # 109553 samples of each class after resampling
X_train, y_train = oversample.fit_resample(X_train, y_train)
encoder = TargetEncoder()
X_train = encoder.fit_transform(X_train, y_train)
X_test = encoder.transform(X_test)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
accuracy_score(y_test, pred) # 0.9954551639678868
print(classification_report(y_test, pred))
precision recall f1-score support
0 1.00 1.00 1.00 73035
1 0.88 0.31 0.46 455
accuracy 1.00 73490
macro avg 0.94 0.65 0.73 73490
weighted avg 0.99 1.00 0.99 73490
With undersampling, the process is the same except that the training set contains only 682 samples of each class.
precision recall f1-score support
0 1.00 0.85 0.92 73035
1 0.04 0.85 0.07 455
accuracy 0.85 73490
macro avg 0.52 0.85 0.49 73490
weighted avg 0.99 0.85 0.92 73490
Snippet of the data I'm working with...
population county_location primary_road secondary_road distance direction weather_1 party_count pcf_violation_category hit_and_run road_surface road_condition_1 lighting control_device bicycle_collision motorcycle_collision truck_collision collision_time party_sex party_age party_sobriety party_safety_equipment_1 party_safety_equipment_2 cellphone_in_use other_associate_factor_1 movement_preceding_collision vehicle_year party_race month day collision_severity
3 100000 to 250000 ventura W KENTWOOD DR H ST 100.0 east clear 3 dui not hit and run wet normal dark with street lights functioning 0 0 0 22:36:00 male 27 had been drinking, under influence air bag not deployed lap/shoulder harness used 0 violation proceeding straight 1989 hispanic 10 05 0
6 >250000 los angeles IMPERIAL HWY MAIN ST 33.0 east clear 2 speeding not hit and run dry normal dark with street lights functioning 0 0 1 21:25:00 male 61 had not been drinking air bag not deployed lap/shoulder harness used 0 none apparent passing other vehicle 2006 black 08 06 0
7 >250000 los angeles IMPERIAL HWY MAIN ST 33.0 east clear 2 speeding not hit and run dry normal dark with street lights functioning 0 0 1 21:25:00 male 33 had not been drinking air bag not deployed lap/shoulder harness used 0 none apparent proceeding straight 2013 hispanic 08 06 0
I'm trying to tune the hyperparameters of MLP classifier using GridSearchCV but facing the following issue:
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan.
Details:
ValueError: learning rate 0.01 is not supported.
FitFailedWarning)
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan.
Details:
ValueError: learning rate 0.02 is not supported
........
Code:
clf = MLPClassifier()
params = {
    'hidden_layer_sizes': hidden_layers_generator(X, np.arange(1,17,1)),
    'solver': ['sgd'],
    'momentum': np.arange(0.1,1.1,0.1),
    'learning_rate': np.arange(0.01,1.01,0.01),
    'max_iter': np.arange(100,2100,100)}
grid = GridSearchCV(clf, params, cv=10, scoring='accuracy')
grid.fit(X, y)
grid_mean_scores = grid.cv_results_['mean_test_score']
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
The code of hidden_layers_generator is as follows:
from itertools import combinations_with_replacement
def hidden_layers_generator(df, hidden_layers):
    hd_sizes = []
    for l in range(1, len(hidden_layers)):
        comb = combinations_with_replacement(np.arange(1, len(df.columns), 10), l)
        hd_sizes.append(list(comb))
    return hd_sizes
Here's a small snippet of X and y dataframes:
X.head()
sl sw pl pw
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
y.head()
0 0
1 1
2 1
3 0
4 0
If you look at the documentation of MLPClassifier, you will see that the learning_rate parameter is not what you think: it is a learning-rate schedule and only accepts the strings 'constant', 'invscaling' and 'adaptive'. What you want is the learning_rate_init parameter. So change this line in the configuration:
'learning_rate': np.arange(0.01,1.01,0.01),
to
'learning_rate_init': np.arange(0.01,1.01,0.01),
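For reference, a corrected grid might look like the sketch below; if you still want to tune the schedule, learning_rate only accepts the strings 'constant', 'invscaling' and 'adaptive':
params = {
    'hidden_layer_sizes': hidden_layers_generator(X, np.arange(1,17,1)),
    'solver': ['sgd'],
    'momentum': np.arange(0.1,1.1,0.1),
    'learning_rate': ['constant', 'invscaling', 'adaptive'],  # schedule type, not a number
    'learning_rate_init': np.arange(0.01,1.01,0.01),          # the actual initial step size
    'max_iter': np.arange(100,2100,100)}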
New to Python. I'm building a classifier that predicts the likelihood of vaccination when trust in government (trustingov) and trust in public health (poptrusthealth) from the dataset are greater than a certain percentage. Not sure how to get both as classes.
UPDATE: Concatenated the dataframe values, but why is the accuracy of the model 1.0?
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
df = pd.read_csv("covidpopulation2.csv")
print(df.head())
99853 8254 219 0.649999976 0.80763793
0 99853 8254 219 0.65 0.807638
1 48490 4007 227 0.49 0.357625
2 190179 8927 107 0.54 0.853186
3 190179 8927 107 0.54 0.853186
4 190179 8927 107 0.54 0.853186
print(df.describe())
99853 8254 219 0.649999976 0.80763793
count 1.342500e+04 13425.000000 13425.000000 13425.000000 13425.000000
mean 3.095292e+05 20555.570056 225.864655 0.473157 0.684484
std 5.070872e+05 28547.608184 218.078176 0.184501 0.167985
min 1.225700e+04 26.000000 2.000000 0.000000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.370000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.490000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.630000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.770000 0.983146
df = pd.read_csv("covidpopulation2.csv", na_values = ['?'], names = ['covidcases','coviddeaths','mortalityperm','trustngov','poptrusthealth'])
print(df.head())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
0 99853 8254 219 0.65 0.807638
1 99853 8254 219 0.65 0.807638
2 48490 4007 227 0.49 0.357625
3 190179 8927 107 0.54 0.853186
4 190179 8927 107 0.54 0.853186
print(df.describe())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
count 1.342600e+04 13426.000000 13426.000000 13426.00000 13426.000000
mean 3.095136e+05 20554.653806 225.864144 0.47317 0.684493
std 5.070715e+05 28546.742358 218.070062 0.18450 0.167982
min 1.225700e+04 26.000000 2.000000 0.00000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.37000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.49000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.63000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.77000 0.983146
df.dropna(inplace=True)
print(df.describe())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
count 1.342600e+04 13426.000000 13426.000000 13426.00000 13426.000000
mean 3.095136e+05 20554.653806 225.864144 0.47317 0.684493
std 5.070715e+05 28546.742358 218.070062 0.18450 0.167982
min 1.225700e+04 26.000000 2.000000 0.00000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.37000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.49000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.63000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.77000 0.983146
all_features = df[['covidcases', 'coviddeaths', 'mortalityperm',
                   'trustngov', 'poptrusthealth']].values
all_classes = (df['poptrusthealth'].values + df['trustngov'].values)
willing = 0
unwilling = 0
label = [None] * 13426
for i in range(len(all_classes)):
    if all_classes[i] > 0.70:
        willing += 1
        label[i] = 1
    else:
        unwilling = unwilling + 1
        label[i] = 0
print(willing)
print(unwilling)
all_classes = label
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
all_features_scaled = scaler.fit_transform(all_features)
from sklearn.model_selection import train_test_split
np.random.seed(1234)
(training_inputs,testing_inputs,training_classes,testing_classes) = train_test_split(all_features_scaled,all_classes,train_size = 0.8,test_size = 0.2,random_state = 1)
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier(random_state=1)
clf.fit(training_inputs, training_classes)
DecisionTreeClassifier(random_state=1)
print(clf)
DecisionTreeClassifier(random_state=1)
print('the accuracy of the decision tree is:',clf.score(testing_inputs, testing_classes))
the accuracy of the decision tree is: 1.0
import pydotplus
from sklearn import tree
import collections
import graphviz
feature_names = ['covidcases', 'coviddeaths', 'mortalityperm', 'trustngov', 'poptrusthealth']
dot_data = tree.export_graphviz(clf, feature_names = feature_names, out_file =None, filled = True, rounded = True)
graph = pydotplus.graph_from_dot_data(dot_data)
colors = ('turquoise','orange')
edges = collections.defaultdict(list)
for edge in graph.get_edge_list():
    edges[edge.get_source()].append(int(edge.get_destination()))
for edge in edges:
    edges[edge].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[edge][i]))[0]
        dest.set_fillcolor(colors[i])
graph.write_png('tree.png')
Any help or ideas would be appreciated.
Sorry, but this makes no sense from a machine learning point of view. Your label is directly created from the input features. That's why the model accuracy is 100%.
Here is your final classifier (without needing any machine learning):
if trustingov + poptrusthealth > 0.7 predict 1, otherwise predict 0.
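In code, the "model" you have built reduces to this one-line rule (a sketch using the column names from your snippet):
def predict_willingness(trustngov, poptrusthealth):
    # the label was constructed from these two features, so this rule reproduces it exactly
    return 1 if trustngov + poptrusthealth > 0.70 else 0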
It is perfectly possible to get 100% accuracy on the training data, since the ML algorithm has already seen it.
You have to apply your model to data that was not used during the learning phase. This is usually done by splitting the data into a training set and a test set.
You then train/fit the model on the training data only, and afterwards test it and calculate the accuracy on the test data. That test accuracy tells you whether your model is well trained and working.
Keeping the test data unused during training is important for a good evaluation, so that the accuracy you measure is unbiased.
I performed a PCA of my data. The data looks like the following:
df
Out[60]:
Drd1_exp1 Drd1_exp2 Drd1_exp3 ... M7_pppp M7_puuu Brain_Region
0 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
3 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
4 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
... ... ... ... ... ... ...
150475 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
150478 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
150479 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
I now used every column up to 'Brain_Region' as a feature. I also standardized them.
These features are different experiments that give me information about a 3D image of a brain.
I'll show you my code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

x = df.loc[:, listend1].values
y = df.loc[:, 'Brain_Region'].values
x = StandardScaler().fit_transform(x)

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents, columns=['principal component 1', 'principal component 2'])
finalDf = pd.concat([principalDf, df[['Brain_Region']]], axis=1)
I then plotted finalDF:
My question now is: how can I find out which features contribute to my components, and how can I interpret the data?
You can use pca.components_.
It has shape (n_components, n_features), in your case (2, n_features), and holds the directions of maximum variance in the data (the eigenvectors). The magnitude of each entry reflects how strongly the corresponding feature contributes to that component (higher magnitude means higher importance). You will have something like this:
[[0.522 0.26 0.58 0.56],
[0.37 0.92 0.02 0.06]]
implying that for the first component (first row) the first, third and last features have a higher importance, while for the second component only the second feature is important.
Have a look at the sklearn PCA attributes description or at this post.
By the way, you can also train a Random Forest classifier using the labels, and after training you can explore its feature importances, e.g. this post.
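A small sketch of how you could line up the loadings with your feature names (assuming listend1 is the list of feature columns you passed to df.loc):
import pandas as pd
# rows = components, columns = original features; large absolute values
# indicate the features that drive that component
loadings = pd.DataFrame(pca.components_, columns=listend1, index=['principal component 1', 'principal component 2'])
print(loadings)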
I imported classification_report from sklearn.metrics, and when I pass my np.arrays as parameters I get the following error:
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1135:
UndefinedMetricWarning: Precision and F-score are ill-defined and
being set to 0.0 in labels with no predicted samples. 'precision',
'predicted', average, warn_for)
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1137:
UndefinedMetricWarning: Recall and F-score are ill-defined and being
set to 0.0 in labels with no true samples. 'recall', 'true',
average, warn_for)
Here is the code:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

svclassifier_polynomial = SVC(kernel = 'poly', degree = 7, C = 5)
svclassifier_polynomial.fit(X_train, y_train)
y_pred = svclassifier_polynomial.predict(X_test)
poly = classification_report(y_test, y_pred)
When I was not using np.array in the past it worked just fine. Any ideas on how I can correct this?
This is not an error, just a warning that not all your labels are included in your y_pred, i.e. there are some labels in your y_test that your classifier never predicts.
Here is a simple reproducible example:
from sklearn.metrics import precision_score, f1_score, classification_report
y_true = [0, 1, 2, 0, 1, 2] # 3-class problem
y_pred = [0, 0, 1, 0, 0, 1] # we never predict '2'
precision_score(y_true, y_pred, average='macro')
[...] UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
0.16666666666666666
precision_score(y_true, y_pred, average='micro') # no warning
0.3333333333333333
precision_score(y_true, y_pred, average=None)
[...] UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
array([0.5, 0. , 0. ])
Exact same warnings are produced for f1_score (not shown).
Practically this only warns you that in the classification_report, the respective values for labels with no predicted samples (here 2) will be set to 0:
print(classification_report(y_true, y_pred))
precision recall f1-score support
0 0.50 1.00 0.67 2
1 0.00 0.00 0.00 2
2 0.00 0.00 0.00 2
micro avg 0.33 0.33 0.33 6
macro avg 0.17 0.33 0.22 6
weighted avg 0.17 0.33 0.22 6
[...] UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
When I was not using np.array in the past it worked just fine
Highly doubtful, since in the example above I have used simple Python lists, and not Numpy arrays...
It means that some labels are present only in the training data and some labels are present only in the test data. Run the following code to understand the distribution of the train and test labels.
from collections import Counter
Counter(y_train)
Counter(y_test)
Use a stratified train_test_split to avoid the situation where some labels are present only in the test dataset.
It might have worked in the past simply because of the random splitting of the dataset; hence, stratified splitting is always recommended.
The first situation (labels that exist in the training data but are never predicted by the model) is more a matter of model fine-tuning or the choice of model.
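A minimal sketch of a stratified split (assuming X and y are your features and labels):
from sklearn.model_selection import train_test_split
# stratify=y keeps the class proportions identical in the train and test sets,
# so no label ends up only in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)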