Train/test score discrepancy with data size - python

I'm trying to apply ML to atomic structures using descriptors. My problem is that I get very different score values depending on the data size I use. I suspect that something is wrong with my model; any suggestions would be appreciated. I used the dataset from this paper (Dataset MoS2(single)).
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ase
from dscribe.descriptors import SOAP
from dscribe.descriptors import CoulombMatrix
from sklearn.model_selection import train_test_split
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from ase.io import read
materials = read('structures.xyz', index=':')
materials = materials[:5000]
energies = pd.read_csv('Energy.csv')
energies = np.array(energies['b'])
energies = energies[:5000]
species = ["H", 'Mo', 'S']
rcut = 8.0
nmax = 1
lmax = 1
# Setting up the SOAP descriptor
soap = SOAP(
    species=species,
    periodic=False,
    rcut=rcut,
    nmax=nmax,
    lmax=lmax,
)
coulomb_matrices = soap.create(materials, positions=[[51]]*len(materials))
nsamples, nx, ny = coulomb_matrices.shape
d2_train_dataset = coulomb_matrices.reshape((nsamples,nx*ny))
df = pd.DataFrame(d2_train_dataset)
df['target'] = energies
from sklearn.preprocessing import StandardScaler
X = df.iloc[:, 0:12].values
y = df.iloc[:, 12:].values
st_x = StandardScaler()
st_y = StandardScaler()
X = st_x.fit_transform(X)
y = st_y.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
#krr = GridSearchCV(
#    KernelRidge(kernel="rbf", gamma=0.1),
#    param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3], "gamma": np.logspace(-2, 2, 5)},
#)
svr = GridSearchCV(
    SVR(kernel="rbf", gamma=0.1),
    param_grid={"C": [1e0, 1e1, 1e2, 1e3], "gamma": np.logspace(-2, 2, 5)},
)
svr = svr.fit(X_train, y_train.ravel())
print("Training set score: {:.4f}".format(svr.score(X_train, y_train)))
print("Test set score: {:.4f}".format(svr.score(X_test, y_test)))
and the scores:
Training set score: 0.0414
Test set score: 0.9126

I don't have a full answer to your problem, as recreating it would be very cumbersome, but here are some things to check:
a) You are training on 5 cross-validation folds (the default). First, check the results of all parameter combinations right after fitting with "svr.best_score_" (or in more detail with the "svr.cv_results_" dict) and see what mean score your folds actually produced. If the score really is as low as 0.04 (I assume higher is better, as is usually the case for these scores), inverting the prediction would actually be really accurate! If you know you're always wrong, it's really easy to be right. ;D
b) You could go ahead and just use svr.best_params_ to train again on the whole X_train set instead of the folds (this can also be achieved with the "refit" option of GridSearchCV and RandomizedSearchCV) and then check against the test set again. The actual error could also be here: the documentation for the score method of GridSearchCV reads "Return the score on the given data, if the estimator has been refit." Make sure the refit option is actually on (it defaults to True, but it's worth verifying nothing turned it off). Sorry, your code was too cumbersome to replicate quickly, so I didn't check this myself.
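As a sketch of both checks (reusing the variable names from the question; refit=True is shown explicitly even though it is the scikit-learn default):
svr = GridSearchCV(
    SVR(kernel="rbf", gamma=0.1),
    param_grid={"C": [1e0, 1e1, 1e2, 1e3], "gamma": np.logspace(-2, 2, 5)},
    refit=True,  # refit the best parameter combination on the whole training set
)
svr.fit(X_train, y_train.ravel())
# (a) inspect what the folds actually produced
print("Best mean CV score:", svr.best_score_)
print(pd.DataFrame(svr.cv_results_)[["params", "mean_test_score", "std_test_score"]])
# (b) retrain explicitly with the best parameters and evaluate on the held-out test set
best_svr = SVR(kernel="rbf", **svr.best_params_).fit(X_train, y_train.ravel())
print("Test set score:", best_svr.score(X_test, y_test))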

Related

Why does the LightGBM Python package give bad predictions for a regression task?

I have a sample time-series dataset of shape (23, 208), which is a pivot table of 24-hour counts for some users. I was experimenting with different regressors from sklearn, which work fine (except for SGDRegressor()), but the LightGBM Python package gives me an almost linear prediction, as follows:
The code I tried:
import pandas as pd
dff = pd.read_csv('ex_data2.csv',sep=',')
dff.set_index("timestamp",inplace=True)
print(dff.shape)
from sklearn.model_selection import train_test_split
trainingSetf, testSetf = train_test_split(dff,
                                          #target_attribute,
                                          test_size=0.2,
                                          random_state=42,
                                          #stratify=y,
                                          shuffle=False)
import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor
username = 'MMC_HEC_LVP' # select one column for plotting & check regression performance
user_list = []
for column in dff.columns:
    user_list.append(column)
index = user_list.index(username)
X_trainf = trainingSetf.iloc[:,:].values
y_trainf = trainingSetf.iloc[:,:].values
X_testf = testSetf.iloc[:,:].values
y_testf = testSetf.iloc[:,:].values
test_set_copy = y_testf.copy()
model_LGBMRegressor = MultiOutputRegressor(lgb.LGBMRegressor()).fit(X_trainf, y_trainf)
pred_LGBMRegressor = model_LGBMRegressor.predict(X_testf)
test_set_copy[:,[index]] = pred_LGBMRegressor[:,[index]]
#plot the results for selected user/column
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.figure(figsize=(12, 10))
plt.xlabel("Date")
plt.ylabel("Values")
plt.title(f"{username} Plot")
plt.plot(trainingSetf.iloc[:,[index]],label='trainingSet')
plt.plot(testSetf.iloc[:,[index]],"--",label='testSet')
plt.plot(test_set_copy[:,[index]],'b--',label='RF_predict')
plt.legend()
So what am I missing if I use the default (hyper-)parameters?
Short Answer
Your dataset has a very small number of rows, and LightGBM's parameters have default values set to provide good performance on medium-sized datasets.
Set the following parameters to force LightGBM to fit to the provided data.
min_data_in_bin = 1
min_data_in_leaf = 1
Long Answer
Before training, LightGBM does some pre-processing on the input data.
For example:
bundling sparse features
binning continuous features into histograms
dropping features which are guaranteed to be uninformative (for example, features which are constant)
The result of that preprocessing is a LightGBM Dataset object, and running that preprocessing is called Dataset "construction". LightGBM performs boosting on this Dataset object, not raw data like numpy arrays or pandas data frames.
To speed up construction and prevent overfitting during training, LightGBM provides the ability to prevent the creation of histogram bins that are too small (min_data_in_bin) or splits that produce leaf nodes matching too few records (min_data_in_leaf).
Setting those parameters to very low values may be required to train on small datasets.
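For illustration, this is roughly where those parameters act. A minimal sketch (assuming numpy arrays X and y; the scikit-learn wrapper used in the question builds this Dataset internally):
import lightgbm as lgb
# the raw arrays are binned into a Dataset before any boosting happens;
# min_data_in_bin is enforced during this construction step
train_data = lgb.Dataset(X, label=y, params={"min_data_in_bin": 1})
train_data.construct()  # run construction explicitly instead of lazily at train time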
I created the following minimal, reproducible example demonstrating this behavior, using Python 3.8.12, lightgbm==3.3.2, numpy==1.22.2, and scikit-learn==1.0.2.
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
# 20-row input data
X, y = make_regression(
    n_samples=20,
    n_informative=5,
    n_features=5,
    random_state=708
)
# training produces 0 trees, and predicts mean(y)
reg = LGBMRegressor(
    num_boost_round=20,
    verbosity=0
)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.000
# training fits and predicts well
reg = LGBMRegressor(
    min_data_in_bin=1,
    min_data_in_leaf=1,
    num_boost_round=20,
    verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.985
If you use LGBMRegressor(min_data_in_bin=1, min_data_in_leaf=1) in the code in the original post, you'll see predictions that fit the provided data better.
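For example, in the setup from the question that would look roughly like this (a sketch reusing the variable names from the post):
model_LGBMRegressor = MultiOutputRegressor(
    lgb.LGBMRegressor(min_data_in_bin=1, min_data_in_leaf=1)
).fit(X_trainf, y_trainf)
pred_LGBMRegressor = model_LGBMRegressor.predict(X_testf)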
This way, the model is overfitted! If you do a random split after creating the dataset and evaluate the model on the test set, you will notice that the performance is essentially the same or worse (as in this example):
# SETUP
# =============================================================
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(
    n_samples=200, n_informative=10, n_features=40, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
# =============================================================
# TEST 1
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.815
reg = LGBMRegressor(
    min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.974
# =============================================================
# TEST 2
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X_train, y_train)
# defaults, evaluated on the training data
print(f"r2 (defaults): {r2_score(y_train, reg.predict(X_train))}")
# 0.759
reg = LGBMRegressor(
    min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X_train, y_train)
# small min_data, evaluated on the held-out test data
print(f"r2 (small min_data): {r2_score(y_test, reg.predict(X_test))}")
# 0.219

Difference in values between xgb.train and xgb.XGBRegressor in Python for certain cases

I noticed that there are two possible implementations of XGBoost in Python, as discussed here and here.
When I ran the same dataset through the two implementations, I noticed that the results were different.
Code
import xgboost
from xgboost.sklearn import XGBRegressor
import pandas as pd
import numpy as np
from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)
Y = df["target"]
X = df.drop('target', axis=1)
#### Code using Native Impl for XGBoost
dtrain = xgboost.DMatrix(X, label=Y, missing=0.0)
params = {'max_depth': 3, 'learning_rate': .05, 'min_child_weight' : 4, 'subsample' : 0.8}
evallist = [(dtrain, 'eval'), (dtrain, 'train')]
model = xgboost.train(dtrain=dtrain, params=params,num_boost_round=200)
predictions = model.predict(dtrain)
#### Code using Sklearn Wrapper for XGBoost
model = XGBRegressor(n_estimators = 200, max_depth=3, learning_rate =.05, min_child_weight=4, subsample=0.8 )
#model = model.fit(X, Y, eval_set = [(X, Y), (X, Y)], eval_metric = 'rmse', verbose=True)
model = model.fit(X, Y)
predictions2 = model.predict(X)
print(np.absolute(predictions-predictions2).sum())
Absolute difference sum using sklearn boston dataset
62.687134
When I ran the same for other datasets like the sklearn diabetes dataset I observed that the difference was much smaller.
Absolute difference sum using sklearn diabetes dataset
0.0011711121
Make sure the random seeds are the same.
For both approaches, set the same seed:
param['seed'] = 123
EDIT: then there are a couple of other things to check.
First, is n_estimators also 200? Are you imputing missing values with 0 in the second implementation as well? Are the other default values also the same? (For this last one I think yes, because it's a wrapper, but check the other two things.)
I hadn't set the "missing" parameter for the sklearn implementation. Once that was set, the values matched.
Also, as Noah pointed out, the sklearn wrapper has a few different default values which need to be matched in order to reproduce the results exactly.
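Concretely, matching the missing-value handling would look something like this (a sketch based on the code in the question, with the other parameters kept as before):
model = XGBRegressor(
    n_estimators=200, max_depth=3, learning_rate=.05,
    min_child_weight=4, subsample=0.8,
    missing=0.0,  # match xgboost.DMatrix(X, label=Y, missing=0.0) from the native API
)
model = model.fit(X, Y)
predictions2 = model.predict(X)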

Cross Validation LOO Model not running, no errors reported

When I run this leave-one-out cross validation, it does nothing, and no error message is reported. I can't figure out what I am missing. I'm using the csv from Kaggle: https://www.kaggle.com/dileep070/heart-disease-prediction-using-logistic-regression/downloads/heart-disease-prediction-using-logistic-regression.zip/1
import csv
from sklearn.model_selection import LeaveOneOut
from sklearn import svm
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
from pandas import read_csv
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
#replace missing values with mean
dataset = read_csv("//Users/crystalfortress/Desktop/CompGenetics/Final_Project_Comp/framingham.csv")
dataset.fillna(dataset.mean(), inplace=True)
print(dataset.isnull().sum())
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 15].values
model = svm.SVC(kernel='linear', C=10, gamma = 0.1)
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print('Accuracy after cross validation:', scores.mean())
predictions = cross_val_predict(model, X, y, cv=loo)
accuracy = metrics.r2_score(y, predictions)
print('Prediction accuracy:', accuracy)
x = metrics.classification_report(y, predictions)
print(x)
cf = metrics.confusion_matrix(y, predictions)
print(cf)
I tried running this on my machine, and it looks like your code works fine (though I did comment out a lot of unnecessary imports). Leave-one-out just takes a really long time to train: it builds n training sets out of n data points, so you have to wait for all of those fits before you get any results. Changing cv= to a number (the default is 3, I believe) will train much more quickly and will most likely give less variance in the model. Also, adding n_jobs=-1 to your cross_val_score call will let Python use all of your processors, e.g.: scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy', n_jobs=-1)
You can also set the verbose parameter in cross_val_score to watch the progress (though, heads up, it isn't fast). I believe the highest useful value is 3 (the online docs don't mention it), but using a higher value is fine. So the final cross_val_score call would look like:
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy', n_jobs=-1, verbose=10)
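If you just want a quicker sanity check before committing to leave-one-out, a plain k-fold split is much faster (a sketch; the choice of 10 folds here is arbitrary):
scores = cross_val_score(model, X, y, cv=10, scoring='accuracy', n_jobs=-1)
print('Accuracy with 10-fold cross validation:', scores.mean())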

Train-test split does not seem to work properly in Python?

I am trying to run a kNN (k-nearest neighbour) algorithm in Python.
The dataset I am using to try and do this is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wine
Here is the code I am using:
#1. LIBRARIES
import os
import pandas as pd
import numpy as np
print(os.getcwd())  # Prints the working directory
os.chdir('C:\\file_path') # Provide the path here
#2. VARIABLES
variables = pd.read_csv('wines.csv')
winery = variables['winery']
alcohol = variables['alcohol']
malic = variables['malic']
ash = variables['ash']
ash_alcalinity = variables['ash_alcalinity']
magnesium = variables['magnesium']
phenols = variables['phenols']
flavanoids = variables['flavanoids']
nonflavanoids = variables['nonflavanoids']
proanthocyanins = variables['proanthocyanins']
color_intensity = variables['color_intensity']
hue = variables['hue']
od280 = variables['od280']
proline = variables['proline']
#3. MAX-MIN NORMALIZATION
alcoholscaled=(alcohol-min(alcohol))/(max(alcohol)-min(alcohol))
malicscaled=(malic-min(malic))/(max(malic)-min(malic))
ashscaled=(ash-min(ash))/(max(ash)-min(ash))
ash_alcalinity_scaled=(ash_alcalinity-min(ash_alcalinity))/(max(ash_alcalinity)-min(ash_alcalinity))
magnesiumscaled=(magnesium-min(magnesium))/(max(magnesium)-min(magnesium))
phenolsscaled=(phenols-min(phenols))/(max(phenols)-min(phenols))
flavanoidsscaled=(flavanoids-min(flavanoids))/(max(flavanoids)-min(flavanoids))
nonflavanoidsscaled=(nonflavanoids-min(nonflavanoids))/(max(nonflavanoids)-min(nonflavanoids))
proanthocyaninsscaled=(proanthocyanins-min(proanthocyanins))/(max(proanthocyanins)-min(proanthocyanins))
color_intensity_scaled=(color_intensity-min(color_intensity))/(max(color_intensity)-min(color_intensity))
huescaled=(hue-min(hue))/(max(hue)-min(hue))
od280scaled=(od280-min(od280))/(max(od280)-min(od280))
prolinescaled=(proline-min(proline))/(max(proline)-min(proline))
alcoholscaled.mean()
alcoholscaled.median()
alcoholscaled.min()
alcoholscaled.max()
#4. DATA FRAME
d = {'alcoholscaled' : pd.Series([alcoholscaled]),
     'malicscaled' : pd.Series([malicscaled]),
     'ashscaled' : pd.Series([ashscaled]),
     'ash_alcalinity_scaled' : pd.Series([ash_alcalinity_scaled]),
     'magnesiumscaled' : pd.Series([magnesiumscaled]),
     'phenolsscaled' : pd.Series([phenolsscaled]),
     'flavanoidsscaled' : pd.Series([flavanoidsscaled]),
     'nonflavanoidsscaled' : pd.Series([nonflavanoidsscaled]),
     'proanthocyaninsscaled' : pd.Series([proanthocyaninsscaled]),
     'color_intensity_scaled' : pd.Series([color_intensity_scaled]),
     'hue_scaled' : pd.Series([huescaled]),
     'od280scaled' : pd.Series([od280scaled]),
     'prolinescaled' : pd.Series([prolinescaled])}
df = pd.DataFrame(d)
#5. TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.matrix(df),np.matrix(winery),test_size=0.3)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
#6. K-NEAREST NEIGHBOUR ALGORITHM
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
In section 5, when I run the train-test split from sklearn.model_selection, it does not appear to work correctly, because it reports the shapes (0, 13) (0, 178) (1, 13) (1, 178).
Then, upon trying to run the kNN, I get the error message: Found array with 0 sample(s) (shape=(0, 13)) while a minimum of 1 is required. This is not due to the max-min normalisation, as I still get this error message even when the variables are not scaled.
I'm not exactly sure where your code is going wrong; it's a slightly different way of going about it compared to the sklearn docs. (One likely culprit, going by the shapes you report, is that pd.Series([alcoholscaled]) and friends wrap each whole column into a single-element Series, so df ends up with only one row.) However, I can show you a different way of getting the train-test split to work on the wine dataset.
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = load_wine(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
                                                    test_size=0.3)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
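If you would rather keep your own CSV-based pipeline, a minimal sketch of the likely fix (an assumption based on the shapes above) is to build the DataFrame from the scaled Series directly, without wrapping each one in pd.Series([...]):
d = {'alcoholscaled': alcoholscaled,
     'malicscaled': malicscaled,
     # ... the remaining scaled columns, unchanged ...
     'prolinescaled': prolinescaled}
df = pd.DataFrame(d)  # one row per wine sample, not a single row of Series objects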

How to get comparable and reproducible results from LogisticRegressionCV and GridSearchCV

I want to score different classifiers with different parameters.
For speed on LogisticRegression I use LogisticRegressionCV (which is at least 2x faster) and plan to use GridSearchCV for the others.
The problem is that while they give me the same C parameters, they do not give the same AUC ROC scores.
I tried fixing many parameters: scorer, random_state, solver, max_iter, tol...
Please look at the example (the real data doesn't matter):
Test data and common part:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
y[y <= y.mean()] = 0; y[y > 0] = 1
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
fold = KFold(len(y), n_folds=5, shuffle=True, random_state=777)
GridSearchCV
grid = {
    'C': np.power(10.0, np.arange(-10, 10)),
    'solver': ['newton-cg'],
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print ('gs.best_score_:', gs.best_score_)
gs.best_score_: 0.939162082194
LogisticRegressionCV
searchCV = LogisticRegressionCV(
    Cs=list(np.power(10.0, np.arange(-10, 10))),
    penalty='l2',
    scoring='roc_auc',
    cv=fold,
    random_state=777,
    max_iter=10000,
    fit_intercept=True,
    solver='newton-cg',
    tol=10,
)
searchCV.fit(X, y)
print ('Max auc_roc:', searchCV.scores_[1].max())
Max auc_roc: 0.970588235294
The newton-cg solver is used just to provide a fixed value; others were tried too.
What did I forget?
P.S. In both cases I also got the warning "/usr/lib64/python3.4/site-packages/sklearn/utils/optimize.py:193: UserWarning: Line Search failed warnings.warn('Line Search failed')", which I can't understand either. I'd be happy if someone could also describe what it means, but I hope it is not relevant to my main question.
EDIT / UPDATES
Following @joeln's comment, I added the max_iter=10000 and tol=10 parameters as well. This does not change the result in any digit, but the warning disappeared.
Here is a copy of the answer by Tom on the scikit-learn issue tracker:
LogisticRegressionCV.scores_ gives the score for all the folds.
GridSearchCV.best_score_ gives the best mean score over all the folds.
To get the same result, you need to change your code:
print('Max auc_roc:', searchCV.scores_[1].max()) # is wrong
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max()) # is correct
By also using the default tol=1e-4 instead of your tol=10, I get:
('gs.best_score_:', 0.939162082193857)
('Max auc_roc:', 0.93915947999923843)
The (small) remaining difference might come from warm starting in LogisticRegressionCV (which is actually what makes it faster than GridSearchCV).
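For clarity, here is a minimal sketch of why the axis matters (assuming the grid above, i.e. 5 folds and 20 values of C; searchCV.scores_[1] is then an array of shape (n_folds, n_Cs)):
per_fold_scores = searchCV.scores_[1]       # shape (5, 20): one AUC per fold per C
mean_per_C = per_fold_scores.mean(axis=0)   # shape (20,): mean AUC over folds for each C
print(mean_per_C.max())                     # comparable to gs.best_score_
print(per_fold_scores.max())                # single best fold/C cell, an optimistic number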
