I'm doing k-fold cross-validation in scikit-learn. Here is the script:
import pandas as pd
import numpy as np
from time import time
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import classification_report, accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold
r_filenameTSV = "TSV/A19784.tsv"
#DF 300 dimension start
df = pd.read_csv(r_filenameTSV, sep='\t', names=["vector"])
# split the single column into a label and the space-separated feature string
df = pd.DataFrame(df.vector.str.split(" ", n=1).tolist(), columns=['label', 'vector'])
print(df)
#DF 300 dimension end
y = df.label.astype(int).to_numpy()
print(y.shape)
# each record is a space-separated list of "key:value" pairs; expand them into columns
X = pd.DataFrame([dict(feat.split(':') for feat in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
start = time()
clf = svm.SVC(kernel='rbf', C=32, gamma=8)
print("K-Folds scores:")
originalclass = []
predictedclass = []
def classification_report_with_accuracy_score(y_true, y_pred):
    # collect the true and predicted labels across all folds
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return accuracy_score(y_true, y_pred)  # return accuracy score
inner_cv = StratifiedKFold(n_splits=10)
outer_cv = StratifiedKFold(n_splits=10)
# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv,
                               scoring=make_scorer(classification_report_with_accuracy_score))
# Average values in classification report for all folds in a K-fold Cross-validation
print(classification_report(originalclass, predictedclass))
print("10 folds processing seconds: {}".format(time() - start))
As you can see, I'm using a Pandas data frame with 300 features as input data.
How can I reduce the features from 300 to 100?
Does everything have to be done in Pandas (i.e. creating a data frame with at most 100 features per record), or can I use scikit-learn directly?
There are many ways to reduce the number of features in ML models. Here are some of them:
Use statistical methods such as Information Gain or the Fisher Score: compute the score between each feature and the target, then select the top 100.
Remove constant or quasi-constant features.
Use wrapper methods such as forward and backward feature selection; the idea is to search the feature space and choose the best combination. For these you can use mlxtend.feature_selection, which is largely compatible with scikit-learn.
Use projection methods such as PCA, LDA, etc.
Use embedded methods such as Lasso, Ridge or Random Forest; from scikit-learn, use the sklearn.feature_selection module and import SelectFromModel (see the sketch below).
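For instance, a minimal sketch of the statistical and embedded approaches above, assuming the X and y built in the question and that you want to keep exactly 100 features (k=100 and the random forest settings are just illustrative):
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectFromModel, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
# statistical filter: keep the 100 features with the highest mutual information with the target
selector = SelectKBest(score_func=mutual_info_classif, k=100)
X_filtered = selector.fit_transform(X.astype(float), y)
print(X_filtered.shape)  # (n_samples, 100)
# embedded method: keep the 100 features a random forest considers most important
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42),
                      threshold=-np.inf, max_features=100)
X_embedded = sfm.fit_transform(X.astype(float), y)
print(X_embedded.shape)  # (n_samples, 100)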
Use correlation (or covariance) to determine which features are not contributing to accuracy. Dimensionality reduction reduces confusion and simplifies your model without compromising accuracy. Another approach is stepwise refinement: look at the area-under-the-curve scores for the features and remove those that do not contribute significantly. You can also use t-SNE to visualise your feature clusters (unsupervised learning); a rough correlation-based sketch follows the linked notebook below.
https://github.com/dnishimoto/python-deep-learning/blob/master/ANSUR%202%20-%20Army%20-%20Dimension%20reduction.ipynb
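As a rough sketch of the correlation-based filtering (assuming X is the pandas DataFrame of features from the question; the 0.95 threshold is arbitrary):
import numpy as np
corr = X.astype(float).corr().abs()
# keep only the upper triangle so each pair of features is considered once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# drop one feature from every highly correlated pair
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)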
Related
I've encountered a problem when using the kernel PCA implemented in sklearn: the order of the samples before KPCA significantly influences the classification accuracy.
Here is my processing procedure:
Run kernel PCA on the input X of shape (n_samples, n_features).
Shuffle X and split it into a training set and a test set (10-fold).
Use an ExtraTrees classifier, SVC, or another classifier implemented in sklearn to perform the binary classification task.
My code:
import numpy as np
import scipy.io as sio  # needed for sio.loadmat below
from sklearn.utils import shuffle
from sklearn.decomposition import KernelPCA
import sklearn.metrics as metrics
from sklearn.model_selection import KFold
from sklearn.ensemble import ExtraTreesClassifier as etclf
# load data
datapath=r"F:\..."
data=sio.loadmat(datapath+"\\...")
x=data["x"]
labels=data["labels"]
# kernel pca
gm = 1e-5  # gamma must be a single float, not a list
nfea = x.shape[1]
kpca = KernelPCA(n_components=nfea, kernel='rbf', gamma=gm, eigen_solver="auto", random_state=42)
x_pca = kpca.fit_transform(x)
# shuffle the x_pca with labels
x_shuffle,y_shuffle=shuffle(x_pca,labels,random_state=42)
data_label=np.concatenate((x_shuffle,y_shuffle),axis=1)
# 10-fold cross validation
kf = KFold(n_splits=10, shuffle=False)
for train, test in kf.split(data_label):
    x_train = data_label[train, :-1]
    x_test = data_label[test, :-1]
    y_train = data_label[train, -1]
    y_test = data_label[test, -1]
    # binary classification prediction
    clf = etclf(n_estimators=10, criterion='gini', random_state=42)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    acc = metrics.accuracy_score(y_test, y_pred)
Before applying kernel PCA, I tried two different orderings of x:
When I sorted x by label from 1 to 0 (i.e., 1111111...111000000...000), I got an accuracy close to 0.99.
When I shuffled x together with its labels (i.e., 1100011100...00101100100011), I got an accuracy of about 0.50.
I also tried other classifiers such as SVC and Gaussian naive Bayes, and the results were similar. I don't think it is a matter of the classifier or of leakage between the training and test sets. It seems more likely that KPCA introduces high correlations between samples that are close together in the ordering, but I don't know how to explain this result.
Thanks for the help!
I have a small time-series dataset of shape (23, 208), which is a pivot table of 24-hour counts for a number of users. I was experimenting with different regressors from sklearn, which work fine (except for SGDRegressor()), but the LightGBM Python package gives me a very flat, almost linear prediction.
The code I tried:
import pandas as pd
dff = pd.read_csv('ex_data2.csv',sep=',')
dff.set_index("timestamp",inplace=True)
print(dff.shape)
from sklearn.model_selection import train_test_split
trainingSetf, testSetf = train_test_split(dff,
                                          #target_attribute,
                                          test_size=0.2,
                                          random_state=42,
                                          #stratify=y,
                                          shuffle=False)
import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor
username = 'MMC_HEC_LVP' # select one column for plotting & check regression performance
user_list = []
for column in dff.columns:
    user_list.append(column)
index = user_list.index(username)
X_trainf = trainingSetf.iloc[:,:].values
y_trainf = trainingSetf.iloc[:,:].values
X_testf = testSetf.iloc[:,:].values
y_testf = testSetf.iloc[:,:].values
test_set_copy = y_testf.copy()
model_LGBMRegressor = MultiOutputRegressor(lgb.LGBMRegressor()).fit(X_trainf, y_trainf)
pred_LGBMRegressor = model_LGBMRegressor.predict(X_testf)
test_set_copy[:,[index]] = pred_LGBMRegressor[:,[index]]
#plot the results for selected user/column
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.figure(figsize=(12, 10))
plt.xlabel("Date")
plt.ylabel("Values")
plt.title(f"{username} Plot")
plt.plot(trainingSetf.iloc[:,[index]],label='trainingSet')
plt.plot(testSetf.iloc[:,[index]],"--",label='testSet')
plt.plot(test_set_copy[:,[index]],'b--',label='RF_predict')
plt.legend()
So what am I missing if I use the default (hyper-)parameters?
Short Answer
Your dataset has a very small number of rows, and LightGBM's parameters have default values set to provide good performance on medium-sized datasets.
Set the following parameters to force LightGBM to fit to the provided data.
min_data_in_bin = 1
min_data_in_leaf = 1
Long Answer
Before training, LightGBM does some pre-processing on the input data.
For example:
bundling sparse features
binning continuous features into histograms
dropping features which are guaranteed to be uninformative (for example, features which are constant)
The result of that preprocessing is a LightGBM Dataset object, and running that preprocessing is called Dataset "construction". LightGBM performs boosting on this Dataset object, not raw data like numpy arrays or pandas data frames.
To speed up construction and prevent overfitting during training, LightGBM provides the ability to prevent the creation of histogram bins that are too small (min_data_in_bin) or splits that produce leaf nodes which match too few records (min_data_in_leaf).
Setting those parameters to very low values may be required to train on small datasets.
I created the following minimal, reproducible example, using Python 3.8.12, lightgbm==3.3.2, numpy==1.22.2, and scikit-learn==1.0.2, that demonstrates this behavior.
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
# 20-row input data
X, y = make_regression(
    n_samples=20,
    n_informative=5,
    n_features=5,
    random_state=708
)
# training produces 0 trees, and predicts mean(y)
reg = LGBMRegressor(
    num_boost_round=20,
    verbosity=0
)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.000
# training fits and predicts well
reg = LGBMRegressor(
    min_data_in_bin=1,
    min_data_in_leaf=1,
    num_boost_round=20,
    verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.985
If you use LGBMRegressor(min_data_in_bin=1, min_data_in_leaf=1) in the code in the original post, you'll see predictions that fit the provided data better.
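Applied to the code in the original post, that change would look roughly like this (only the estimator construction differs):
model_LGBMRegressor = MultiOutputRegressor(
    lgb.LGBMRegressor(min_data_in_bin=1, min_data_in_leaf=1)
).fit(X_trainf, y_trainf)
pred_LGBMRegressor = model_LGBMRegressor.predict(X_testf)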
This way, however, the model is overfitted!
If you do a random split after creating the dataset and evaluate the model on the test set, you will notice that the performance is essentially the same or worse (as in the example below).
# SETUP
# =============================================================
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(
    n_samples=200, n_informative=10, n_features=40, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
# =============================================================
# TEST 1
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.815
reg = LGBMRegressor(
    min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.974
# =============================================================
# TEST 2
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X_train, y_train)
print(f"r2 (defaults): {r2_score(y_train, reg.predict(X_train))}")
# 0.759
reg = LGBMRegressor(
    min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X_train, y_train)
print(f"r2 (small min_data): {r2_score(y_test, reg.predict(X_test))}")
# 0.219
So this is what I put together to run the data through a variance threshold for feature selection, then a normaliser and LDA for dimensionality reduction.
I'm not too sure about the LDA element, as I can't find any examples of it being used in a pipeline as a dimensionality-reduction / data-transformation technique (as opposed to a standalone classifier).
I am a bit worried, because when this is used and the transformed data is passed on to a series of classifiers, they return identical accuracy, precision, recall and F1 scores. Only AdaBoost gives back something different.
Is there something I'm doing wrong here?
pipeline = Pipeline([
    ('feature_selection', VarianceThreshold()),
    ('normaliser', Normalizer()),
    ('lda', LinearDiscriminantAnalysis())], verbose=True)
X_train_post_pipeline = pipeline.fit_transform(X_train, Y_train)
X_test_post_pipeline = pipeline.transform(X_test)
LinearDiscriminantAnalysis is a dimensionality reduction technique that can be compared to PCA, so it can be used within a pipeline as preprocessing.
It is possible that classifiers that use its output end up with the same score, because LDA projects the inputs onto the most discriminative directions and keeps at most n_classes - 1 components, i.e. a single component for a binary problem.
Below is an example of a pipeline that uses LDA as a preprocessing step:
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import Normalizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_classes=2)
pipe = make_pipeline(VarianceThreshold(),
                     Normalizer(),
                     LinearDiscriminantAnalysis(),
                     LogisticRegression())
pipe.fit(X, y)
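A quick way to see why downstream classifiers can score identically: for a binary problem LDA keeps only one discriminant direction, so every classifier placed after it receives the same one-dimensional projection. A small check using the same toy data as above (the shape shown assumes make_classification's default of 100 samples):
X_lda = make_pipeline(VarianceThreshold(),
                      Normalizer(),
                      LinearDiscriminantAnalysis()).fit_transform(X, y)
print(X_lda.shape)  # (100, 1): a single LDA component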
I am using Random Forest for my regression problem in Python. I have fairly large data (5 features, 1 target, 9387 rows).
At first, to check the accuracy, I used simple RF code with train_test_split and metrics.r2_score, and it gave me a score of 0.9999 on both the train set and the test set. Later, I performed cross-validation using cross_val_score with 5 folds. This gives me 5 numbers (see below), some of which seem weird for cross-validation scores.
[-1.44202691 0.25338018 0.70433516 0.98278159 -3.34943088]
Is it really possible to have a negative score, or is there something wrong with my code?
I am still new to coding and python so please bear with me. You can see my code below.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import cross_val_score, train_test_split
from statistics import mean
data = pd.read_csv("Size1.csv", sep=",")
data = data[["X", "Y", "Z", "Tilt_C", "Tilt_A", "Radiation_C"]]
predict = "Radiation_C"
A = np.array(data.drop(columns=[predict]))
B = np.array(data[predict])
# Split data for Train and Test
a_train, a_test, b_train, b_test = train_test_split(A, B, test_size=0.25)
# Fitting Random Forest Regression to the dataset
# create regressor object
rf = RandomForestRegressor(random_state=42)
# fit the regressor with A and B data
rf.fit(a_train, b_train)
# Calculate accuracy
b_pred = rf.predict(a_test)
print('R^2:', metrics.r2_score(b_test, b_pred))
# Perform Cross Validation & scores
scores = cross_val_score(rf, A, B, cv=5)
print(scores)
print("Mean: ", mean(scores))
How can I find the overall accuracy of the output produced by running a decision tree algorithm? I am able to get the top five class labels for the active user input, but I am only getting the accuracy for the X_train and y_train dataset using accuracy_score(). Suppose I get the top five recommendations: I wish to get the accuracy for each class label and, from these, the overall accuracy of the output. Please suggest some ideas.
My Python script is below; here event holds the different class labels.
from itertools import chain
from sklearn.tree import DecisionTreeClassifier

DTC = DecisionTreeClassifier()
DTC.fit(X_train_one_hot, y_train)
print("output from DTC:")
res = DTC.predict_proba(X_test_one_hot)
new = list(chain.from_iterable(res))
# Here I get the index values of the top five probabilities
index = sorted(range(len(new)), key=lambda i: new[i], reverse=True)[:5]
for i in index:
    print(event[i])
Here is the sample code I tried to get the accuracy of the predicted class labels; index holds the indices of the top five class-label probabilities and event holds the different class labels.
for i in index:
    DTC.fit(X_train_one_hot, y_train)
    y_pred = event[i]
    AC = accuracy_score((event, y_pred) * 100)
    print(AC)
Since you have a multi-class classification problem, you can calculate the accuracy of the classifier by using the confusion_matrix function from scikit-learn.
To get overall accuracy, sum the values in the diagonal and divide the sum by the total number of samples.
Consider the following simple multi-class classification example using the IRIS dataset:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
Now to calculate overall accuracy, use confusion matrix:
conf_mat = confusion_matrix(y_test, y_pred)  # argument order: (y_true, y_pred)
acc = np.sum(conf_mat.diagonal()) / np.sum(conf_mat)
print('Overall accuracy: {} %'.format(acc*100))
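If you also want a per-class figure (the per-class recall, which is often reported as per-class accuracy), a small sketch building on the same confusion matrix divides each diagonal entry by the number of true samples of that class:
per_class_acc = conf_mat.diagonal() / conf_mat.sum(axis=1)
for name, class_acc in zip(class_names, per_class_acc):
    print('Accuracy for class {}: {:.2f} %'.format(name, class_acc * 100))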