I have a supervised learning classification problem. I have 4 numeric class labels (0, 1, 2, 3) and I have about 100 trials of 38 separate features as the input.
After feeding this data into an SVC classifier in both Python and Matlab (specifically the Classification Learner App), with matched hyperparameters (C = 1, type = quadratic SVM, multi-class method = one-vs-one, standardised data, no PCA), the reported accuracies vary drastically:
Matlab = 86.7 %
Python = 45.0 %
Has anyone come across this, or does anyone have any ideas on how I could work out which result is correct?
Matlab input: (screenshot of the Classification Learner App settings, not reproduced here)
Python input:
import numpy as np
from sklearn import datasets, linear_model, metrics, svm, preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

symptom = input("What symptom would you like to analyse? \n")
cross_validation = input("With cross validation? \n")

if cross_validation == "Yes":
    no_cvfolds = int(input("Number of folds? \n"))  # np.int is deprecated; plain int works

    # symptomDF, feature and name are defined earlier in the full script
    x = symptomDF[feature]
    y = symptomDF.loc[:, 'updrs_class'].values

    # standardise the features before fitting
    x_new = StandardScaler().fit_transform(x)

    scores = cross_val_score(SVC(kernel='poly', degree=2, C=1.0, decision_function_shape='ovo'),
                             x_new, y, cv=no_cvfolds)
    print(name + " Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
So after a few days, one factor that seemed to help produce accuracies similar to Matlab's was the cross-validator used and its shuffling parameter.
Instead of directly stating the number of folds and passing that into the cross_val_score function, I defined a Stratified K-fold cross-validator. This was better for me because it keeps the class proportions balanced in each fold when the class sizes are imbalanced. I then explicitly set shuffle = True to shuffle the data before the cross-validator split them into batches.
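For illustration, here is a minimal sketch of that change, assuming x_new, y, and no_cvfolds are the variables from the question's code:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# stratified folds keep the class proportions in every split;
# shuffle=True randomises the row order before splitting
cv = StratifiedKFold(n_splits=no_cvfolds, shuffle=True, random_state=42)
scores = cross_val_score(
    SVC(kernel='poly', degree=2, C=1.0, decision_function_shape='ovo'),
    x_new, y, cv=cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))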
Related
I'm trying to apply ML to atomic structures using descriptors. My problem is that I get very different scores depending on the data size I use; I suspect something is wrong with my model, so any suggestions would be appreciated. I used the dataset from this paper (Dataset MoS2(single)).
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ase
from dscribe.descriptors import SOAP
from dscribe.descriptors import CoulombMatrix
from sklearn.model_selection import train_test_split
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from ase.io import read
materials = read('structures.xyz', index=':')
materials = materials[:5000]
energies = pd.read_csv('Energy.csv')
energies = np.array(energies['b'])
energies = energies[:5000]
species = ["H", 'Mo', 'S']
rcut = 8.0
nmax = 1
lmax = 1
# Setting up the SOAP descriptor
soap = SOAP(
    species=species,
    periodic=False,
    rcut=rcut,
    nmax=nmax,
    lmax=lmax,
)
# one SOAP vector per structure, centred on atom index 51
# (the variable name is a leftover from an earlier CoulombMatrix attempt)
coulomb_matrices = soap.create(materials, positions=[[51]]*len(materials))
nsamples, nx, ny = coulomb_matrices.shape
d2_train_dataset = coulomb_matrices.reshape((nsamples,nx*ny))
df = pd.DataFrame(d2_train_dataset)
df['target'] = energies
from sklearn.preprocessing import StandardScaler
X = df.iloc[:, 0:12].values
y = df.iloc[:, 12:].values
st_x = StandardScaler()
st_y = StandardScaler()
X = st_x.fit_transform(X)
y = st_y.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
#krr = GridSearchCV(
#    KernelRidge(kernel="rbf", gamma=0.1),
#    param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3], "gamma": np.logspace(-2, 2, 5)},
#)
svr = GridSearchCV(
    SVR(kernel="rbf", gamma=0.1),
    param_grid={"C": [1e0, 1e1, 1e2, 1e3], "gamma": np.logspace(-2, 2, 5)},
)
svr = svr.fit(X_train, y_train.ravel())
print("Training set score: {:.4f}".format(svr.score(X_train, y_train)))
print("Test set score: {:.4f}".format(svr.score(X_test, y_test)))
and the scores:
Training set score: 0.0414
Test set score: 0.9126
I don't have a full answer to your problem, as recreating it would be very cumbersome, but here are some things to check:
a) You are training on 5 cross-validation folds (the default). First, check the results of all parameter combinations right after fitting with svr.best_score_ (or in more detail with svr.cv_results_) and see what mean score your folds actually produced. If the score really is as low as 0.04 (I assume higher is better, as is usual for these scores), inverting the prediction would actually be quite accurate! If you know you're always wrong, it's really easy to be right. ;D
b) You could go ahead and just use svr.best_params_ in order to train again on the whole X_train set instead of the folds (this can also be achieved with the refit option of GridSearchCV and RandomizedSearchCV), and then check against the test set again. The actual error could also be here: the documentation for the score method of GridSearchCV reads: "Return the score on the given data, if the estimator has been refit." This is not the case in your grid search! Try turning the refit option on. Maybe that works? ... sorry, your code was too cumbersome to replicate quickly, so I didn't check myself ...
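For what it's worth, here is a rough sketch of both checks, reusing the svr object and the train/test split from your code (untested, since I couldn't run your data):
# a) inspect what the cross-validation inside the grid search produced
print(svr.best_score_)    # mean CV score of the best parameter combination
print(svr.best_params_)   # the winning parameter combination
print(svr.cv_results_)    # detailed per-fold results for every combination

# b) refit on the whole training set with the best parameters, then score
from sklearn.svm import SVR
best_svr = SVR(kernel="rbf", **svr.best_params_).fit(X_train, y_train.ravel())
print("Test set score: {:.4f}".format(best_svr.score(X_test, y_test)))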
I am using Random Forest for my regression problem in Python. I have rather large data (5 features, 1 target, 9387 samples).
At first, to obtain the accuracy, I used a simple RF script with train_test_split and metrics.r2_score, and the result gave me a 0.9999 score on both the train set and the test set. Later, I tried to perform cross-validation using cross_val_score with 5 folds. This gives me 5 numbers (see below), some of which seem weird as cross-validation scores.
[-1.44202691 0.25338018 0.70433516 0.98278159 -3.34943088]
Is it really possible to have a negative accuracy, or is there something wrong with how I coded this?
I am still new to coding and python so please bear with me. You can see my code below.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics  # needed for metrics.r2_score below
import csv
from sklearn.model_selection import cross_val_score, train_test_split
from statistics import mean
data = pd.read_csv("Size1.csv", sep=",")
data = data[["X", "Y", "Z", "Tilt_C", "Tilt_A", "Radiation_C"]]
predict = "Radiation_C"
A = np.array(data.drop(columns=[predict]))  # the positional axis argument is deprecated
B = np.array(data[predict])
# Split data for Train and Test
a_train, a_test, b_train, b_test = train_test_split(A, B, test_size=0.25)
# Fitting Random Forest Regression to the dataset
# create regressor object
rf = RandomForestRegressor(random_state=42)
# fit the regressor with A and B data
rf.fit(a_train, b_train)
# Calculate accuracy
b_pred = rf.predict(a_test)
print('R^2:', metrics.r2_score(b_test, b_pred))
# Perform Cross Validation & scores
scores = cross_val_score(rf,A, B, cv=5)
print(scores)
print("Mean: ", mean(scores))
I'm doing k-fold cross-validation in scikit-learn. Here is the script:
import pandas as pd
import numpy as np
from time import time
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import classification_report, accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score  # avoid the private _validation module
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold
r_filenameTSV = "TSV/A19784.tsv"
#DF 300 dimension start
tsv_read = pd.read_csv(r_filenameTSV, sep='\t', names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(" ", n=1).tolist(), columns=['label', 'vector'])
print(df)
#DF 300 dimension end
y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1).ravel()
print(y.shape)
X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
start = time()
clf = svm.SVC(kernel='rbf',
              C=32,
              gamma=8,
              )
print("K-Folds scores:")
originalclass = []
predictedclass = []
def classification_report_with_accuracy_score(y_true, y_pred):
    # collect labels across folds so one classification report can be
    # printed over all folds afterwards
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return accuracy_score(y_true, y_pred)  # return accuracy score
inner_cv = StratifiedKFold(n_splits=10)
outer_cv = StratifiedKFold(n_splits=10)
# outer CV loop (note: inner_cv is defined above but never used here,
# so no parameter optimisation actually happens in this "nested" setup)
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv,
                               scoring=make_scorer(classification_report_with_accuracy_score))
# Average values in classification report for all folds in a K-fold Cross-validation
print(classification_report(originalclass, predictedclass))
print("10 folds processing seconds: {}".format(time() - start))
As you can see, I'm using as input a Pandas data frame which has 300 features.
How can I reduce the features from 300 to 100?
Does everything have to be done in Pandas (i.e. creating a df with at most 100 features per record), or can I use scikit-learn directly?
There are many ways to reduce the number of features in ML models; here are some of them (a sketch of two of these follows the list):
Use statistical methods such as Information Gain or the Fisher Score: compute the score between your features and the target, then select the top 100.
Remove constant or quasi-constant features.
There are wrapper methods such as forward feature selection and backward feature selection; their idea is to search the feature space and choose the best combination. For these you can use mlxtend.feature_selection, a package that is fairly compatible with scikit-learn.
Use PCA, LDA, ....
You can use embedded methods such as Lasso, Ridge, or Random Forest; use the sklearn.feature_selection module from scikit-learn and import SelectFromModel.
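As an illustration, here is a minimal sketch of the first and last suggestions, assuming X is the question's 300-feature matrix converted to floats and y the integer labels:
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectFromModel, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

# filter method: keep the 100 features with the highest mutual information
X_top100 = SelectKBest(mutual_info_classif, k=100).fit_transform(X, y)

# embedded method: keep the 100 features a random forest ranks as most important
selector = SelectFromModel(RandomForestClassifier(n_estimators=100),
                           threshold=-np.inf, max_features=100)
X_embedded = selector.fit_transform(X, y)
print(X_top100.shape, X_embedded.shape)  # both now have 100 columns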
Use correlation and covariance to determine which features are not contributing to accuracy. Dimension reduction reduces confusion and simplifies your model without compromising accuracy. Another approach is stepwise refinement: look at the area-under-the-curve scores for the features and remove those that do not contribute significantly. You can also use t-SNE to visualise your feature clusters (unsupervised learning); a sketch follows the notebook link below.
https://github.com/dnishimoto/python-deep-learning/blob/master/ANSUR%202%20-%20Army%20-%20Dimension%20reduction.ipynb
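A minimal t-SNE sketch, again assuming X (as floats) and y from the question:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# project the 300-dimensional features onto 2 dimensions for plotting
embedded = TSNE(n_components=2).fit_transform(X.astype(float))
plt.scatter(embedded[:, 0], embedded[:, 1], c=y)
plt.show()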
I'm trying to make predictions with logistic regression and to test the accuracy with Python and the sklearn library. I'm using data that I downloaded from here:
http://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength
It's an Excel file. I wrote some code, but I always get the same error, and the error is:
ValueError: Unknown label type: 'continuous'
I used the same logic when I built a linear regression, and it works for linear regression.
This is the code:
import numpy as np
import pandas as pd
import xlrd
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#Reading data from excel
data = pd.read_excel("DataSet.xls").round(2)
data_size = data.shape[0]
#print("Number of data:",data_size,"\n",data.head())
my_data = data[(data["Superpl"] == 0) & (data["FlyAsh"] == 0) & (data["BlastFurSlag"] == 0)].drop(columns=["Superpl","FlyAsh","BlastFurSlag"])
my_data = my_data[my_data["Days"]<=28]
my_data_size = my_data.shape[0]
#print("Size of dataset for 28 days or less:", my_data_size, "\n", my_data.head())
def logistic_regression(data_input, cement, water,
                        coarse_aggr, fine_aggr, days):
    # the last column is the target; all others are input variables
    variable_list = []
    result_list = []

    for column in data_input:
        variable_list.append(column)
        result_list.append(column)

    variable_list = variable_list[:-1]
    result_list = result_list[-1]

    variables = data_input[variable_list]
    results = data_input[result_list]

    # accuracy of prediction (splitting dataframe into train and test)
    var_train, var_test, res_train, res_test = train_test_split(variables, results, test_size=0.3, random_state=42)

    # making the logistic model and fitting the data to it
    log_regression = linear_model.LogisticRegression()
    model = log_regression.fit(var_train, res_train)

    input_values = [cement, water, coarse_aggr, fine_aggr, days]

    # predicting the outcome based on the input_values
    predicted_strength = log_regression.predict([input_values])  # adding values for prediction
    predicted_strength = round(predicted_strength[0], 2)

    # calculating accuracy score
    score = log_regression.score(var_test, res_test)
    score = round(score*100, 2)

    prediction_info = "\nPrediction of future strength: " + str(predicted_strength) + " MPa\n"
    accuracy_info = "Accuracy of prediction: " + str(score) + "%\n"
    full_info = prediction_info + accuracy_info

    return full_info

print(logistic_regression(my_data, 376.0, 214.6, 1003.5, 762.4, 3))  # true value after 3 days: 16.28 MPa
Although you don't provide details of your data, judging from the error and from the comment in the last line of your code:
#true value affter 3 days: 16.28 MPa
I conclude that you are in a regression (i.e. numeric prediction) setting. Linear regression is an appropriate model for this task, but logistic regression is not: logistic regression is for classification problems, and it therefore expects binary (or categorical) target variables rather than continuous values, hence the error.
In short, you are trying to apply a model that is inappropriate for your problem.
UPDATE (after the link to the data): indeed, reading the dataset description closely, you'll see (emphasis added):
The concrete compressive strength is the regression problem
while from scikit-learn User's Guide for logistic regression (again, emphasis added):
Logistic regression, despite its name, is a linear model for classification rather than regression.
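Given that, a minimal sketch of the fix, keeping the variable names from your function (only the model changes):
from sklearn.linear_model import LinearRegression

# a regressor accepts the continuous compressive-strength target that
# made LogisticRegression raise the 'continuous' label error
lin_regression = LinearRegression()
lin_regression.fit(var_train, res_train)
predicted_strength = round(lin_regression.predict([input_values])[0], 2)
score = round(lin_regression.score(var_test, res_test) * 100, 2)  # R^2 (x100), not accuracy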
As a beginner in scikit-learn trying to classify the iris dataset, I'm having problems adjusting the scoring metric from scoring='accuracy' to others, like precision, recall, f1, etc., in the cross-validation step. Below is the full code sample (enough to start at # Test options and evaluation metric).
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection # for command model_selection.cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
#Below, we build and evaluate 6 different models
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn: we calculate the cv-scores, their mean and std for each model
#
results = []
names = []
for name, model in models:
    # below, we do k-fold cross-validation
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Now, apart from scoring='accuracy', I'd like to evaluate other performance metrics for this multiclass classification problem. But when I use scoring='precision', it raises:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
My questions are:
1) I guess the above is happening because 'precision' and 'recall' are defined in scikit-learn only for binary classification-is that correct? If yes, then, which command(s) should replace scoring='accuracy' in the code above?
2) If I want to compute the confusion matrix, precision and recall for each fold while performing the k-fold cross validation, what commands should I type?
3) For the sake of experimentation, I tried scoring='balanced_accuracy', only to find:
ValueError: 'balanced_accuracy' is not a valid scoring value.
Why is this happening, when the model evaluation documentation (https://scikit-learn.org/stable/modules/model_evaluation.html) clearly says balanced_accuracy is a scoring method? I'm quite confused here, so actual code showing how to evaluate other performance metrics would be appreciated! Thanks in advance!!
1) I guess the above is happening because 'precision' and 'recall' are defined in scikit-learn only for binary classification-is that correct?
No. Precision & recall are certainly valid for multi-class problems, too - see the docs for precision & recall.
If yes, then, which command(s) should replace scoring='accuracy' in the code above?
The problem arises because, as you can see from the documentation links I have provided above, the default setting for these metrics is for binary classification (average='binary'). In your case of multi-class classification, you need to specify which exact "version" of the particular metric you are interested in (there is more than one); have a look at the relevant page of the scikit-learn documentation. Some valid options for your scoring parameter are:
'precision_macro'
'precision_micro'
'precision_weighted'
'recall_macro'
'recall_micro'
'recall_weighted'
The documentation link above even contains an example of using 'recall_macro' with the iris data - be sure to check it. A minimal usage sketch, reusing the variables from your loop, is below.
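A sketch, assuming model, kfold, X_train, and Y_train are the variables from your loop:
scoring = 'precision_macro'  # or 'recall_macro', 'f1_macro', ...
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
print("%s: %f (%f)" % (scoring, cv_results.mean(), cv_results.std()))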
2) If I want to compute the confusion matrix, precision and recall for each fold while performing the k-fold cross validation, what commands should I type?
This is not exactly trivial, but you can see a way in my answer to Cross-validation metrics in scikit-learn for each data split; the gist is sketched below.
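The idea is to iterate over the folds yourself instead of going through cross_val_score (a rough sketch, reusing your variables and a macro average as in the metrics above):
from sklearn.metrics import confusion_matrix, precision_score, recall_score

kfold = model_selection.KFold(n_splits=10, random_state=seed)
for train_idx, test_idx in kfold.split(X_train):
    model.fit(X_train[train_idx], Y_train[train_idx])
    pred = model.predict(X_train[test_idx])
    print(confusion_matrix(Y_train[test_idx], pred))  # per-fold confusion matrix
    print(precision_score(Y_train[test_idx], pred, average='macro'))
    print(recall_score(Y_train[test_idx], pred, average='macro'))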
3) For the sake of experimentation, I tried scoring='balanced_accuracy', only to find:
ValueError: 'balanced_accuracy' is not a valid scoring value.
This is because you are probably using an older version of scikit-learn. balanced_accuracy became available only in v0.20 - you can verify that it is not available in v0.18. Upgrade your scikit-learn to v0.20 and you should be fine.
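To check which version you are currently running:
import sklearn
print(sklearn.__version__)  # balanced_accuracy requires >= 0.20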