I have a dataset with 155 features. 40143 samples. It is sorted by date (oldest to newest) then I deleted the date column from the dataset.
label is on the first column.
CV results c. %65 (mean accuracy of scores +/- 0.01) with the code below:
def cross(dataset):
dropz = ["result"]
X = dataset.drop(dropz, axis=1)
X = preprocessing.normalize(X)
y = dataset["result"]
clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
Also I get similar accuracy with the code below:
def train(dataset):
dropz = ["result"]
X = dataset.drop(dropz, axis=1)
X = preprocessing.normalize(X)
y = dataset["result"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, random_state=42)
clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
clf.score(X_test, y_test)
But If I don't use shuffle in the code below it results c. %49
If I use shuffle then it results c. %65
I should mention that I try every 1000 consecutive split of all set from end to beginning and the result is same.
dataset = pd.read_csv("./dataset.csv", header=0,sep=";")
dataset = shuffle(dataset) #!!!???
X_train = dataset.iloc[:-1000,1:]
X_train = preprocessing.normalize(X_train)
y_train = dataset.iloc[:-1000,0]
X_test = dataset.iloc[-1000:,1:]
X_test = preprocessing.normalize(X_test)
y_test = dataset.iloc[-1000:,0]
clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
clf.score(X_test, y_test)
Assuming your question is "Why does it happen":
In both your first and second code snippets you have underlying shuffling happening (in your cross validation and your train_test_split methods), therefore they are equivalent (both in score and algorithm) to your last snippet with shuffling "on".
Since your original dataset is ordered by date there might be (and usually likely) some data that changes over time, which means that since your classifier never sees data from the last 1000 time points - it is unaware of the change in the underlying distribution and therefore fails to classify it.
Addendum to answer further data in comment:
This suggests that there might be some indicative process that is captured in smaller time frames. Two interesting ways to explore it are:
reduce the size of the test set until you find a window size in which the difference between shuffle/no shuffle is negligible.
this process essentially manifests as some dependence between your features so you could see if in a small time frame there is a dependence between your features
Related
I'm new in this field and I'm currently working with gene expression data. I have to do a classification where my data are Counts under matrix form. The features are the genes and the Samples to classify are the patients (7 types of cancer and healthy donors). The book from which I'm replicating the experiment says the following :
For the multi-class SVM classification algorithm, a One-Versus-One (OVO) approach was used. To cross validate the algorithm for all samples in the training cohort, the SVM algorithm was trained by all samples in the training cohort minus one, while the remaining sample was used for (blind) classification. This process was repeated for all samples until each sample was predicted once (leave-one-out cross-validation [LOOCV] procedure).
Now I actually know how to use Loocv on Python as I know how to use OVO by looking online. But I dont get what is mneant to be done here. I tried an attempt and results came out quite similar but im pretty sure I'm doing a horrible mistake somewhere. Please dont flame me I need help , here down below my interpretation (I copied this from internet and added Ovo instead of only svm):
#Function for training
def loocv(train_X,train_y):
# define X and y
X = train_X
y = train_y
# define LOOCV
loo = LeaveOneOut()
loo.get_n_splits(X)
# define true and predict list
y_true,y_pred = [],[]
# run
for train_index, test_index in loo.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model = SVC(kernel='linear',random_state=0)
ovo_classifier = OneVsOneClassifier(model)
ovo_classifier.fit(X_train,y_train)
yhat = ovo_classifier.predict(X_test)
y_true.append(y_test[0])
y_pred.append(yhat[0])
return y_true,y_pred,ovo_classifier
Validation :
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0)
y_true,y_pred,model = loocv(X_train,y_train)
pred_y = model.predict(X_test)
training_accuracy = accuracy_score(y_true,y_pred)
accuracy = accuracy_score(y_test,pred_y)
print(accuracy)
print(training_accuracy)
Results :
0.6918604651162791
0.6658291457286432
I am trying to use cross_val_score to evaluate my regression model (with PolymonialFeatures(degree = 2)). As I noted from different blog posts that I should use cross_val_score with original X, y values, not the X_train and y_train.
r_squareds = cross_val_score(pipe, X, y, cv=10)
r_squareds
>>> array([ 0.74285583, 0.78710331, -1.67690578, 0.68890253, 0.63120873,
0.74753825, 0.13937611, 0.18794756, -0.12916661, 0.29576638])
which indicates my model doesn't perform really well with the mean r2 of only 0.241. Is this supposed to be a correct interpretation?
However, I came across a Kaggle code working on the same data and the guy performed cross_val_score on X_train and y_train. I gave this a try and the average r2 was better.
r_squareds = cross_val_score(pipe, X_train, y_train, cv=10)
r_squareds.mean()
>>> 0.673
Is this supposed to be a problem?
Here is the code for my model:
X = df[['CHAS', 'RM', 'LSTAT']]
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
pipe = Pipeline(
steps=[('poly_feature', PolynomialFeatures(degree=2)),
('model', LinearRegression())]
)
## fit the model
pipe.fit(X_train, y_train)
You first interpretation is correct. The first cross_val_score is training 10 models with 90% of your data as train and 10 as a validation dataset. We can see from these results that the estimator's r_square variance is quite high. Sometimes the model performs even worse than a straight line.
From this result we can safely say that the model is not performing well on this dataset.
It is possible that the obtained result using only the train set on your cross_val_score is higher but this score is most likely not representative of your model performance as the dataset might be to small to capture all its variance. (The train set for the second cross_val_score is only 54% of your dataset 90% of 60% of the original dataset)
I have a dataset with more than 120 features, and I want to use RFE for selecting which features / column names I should use.
I have a problem because RFE is very slow. My code looks like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
full_df = pd.read_csv('data.csv')
x = full_df.iloc[:,:-1]
y = full_df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
model = LogisticRegression(solver ='lbfgs')
for i in range(1,120):
rfe = RFE(model, i)
fit = rfe.fit(x_train, y_train)
acc = fit.score(x_test, y_test)
print(acc)
print(fit.support_)
My problem is this: rfe = RFE(model, i). I do not know what's the best number for i. That's why I put it in for i in range(1,120). Is there any better way to do this? is there any better function in scikit learn that can help me determine the number of features and names of those features?
Because this took to long, I changed my approach, and I want to see what you think about it, is it good / correct approach.
First I did PCA, and I found out that each column participates with around 1-0.4%, except last 9 columns. Last 9 columns participate with less than 0.00001% so I removed them. Now I have 121 features.
pca = PCA()
fit = pca.fit(x)
Then I split my data into train and test (with 121 features).
Then I used SelectFromModel, and I tested it with 4 different classifiers. Each classifier in SelectFromModel reduced the number of columns. I chosed the number of column that was determined by classifier that gave me the best accuracy:
model = SelectFromModel(clf, prefit=True)
#train_score = clf.score(x_train, y_train)
test_score = clf.score(x_test, y_test)
column_res = model.transform(x_train).shape
End finally I used 'RFE'. I have used number of columns that i get with 'SelectFromModel'.
rfe = RFE(model, number_of_columns)
fit = rfe.fit(x_train, y_train)
acc = fit.score(x_test, y_test)
Is this a good approach, or I did something wrong?
Also, If I got the biggest accuracy in SelectFromModel with one classifier, do I need to use the same classifier in RFE?
I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y,
test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print model.score(X_test, y_test)
Which yields:
0.797144744766
Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:
model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print scores
And I get output like this:
[ 0.04614495 -0.26160081 -3.11299397 -0.7326256 -1.04164369]
How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2' parameter to cross_val_score.
I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.
I am using sklearn version 0.16.1
It turns out that my data was ordered in blocks of different classes, and by default cross_validation.cross_val_score picks consecutive splits rather than random (shuffled) splits. I was able to solve this by specifying that the cross-validation should use shuffled splits:
model = linear_model.LinearRegression()
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print scores
Which gives:
[ 0.79714474 0.86636341 0.79665689 0.8036737 0.6874571 ]
This is in line with what I would expect.
train_test_split seems to generate random splits of the dataset, while cross_val_score uses consecutive sets, i.e.
"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"
http://scikit-learn.org/stable/modules/cross_validation.html
Depending on the nature of your data set, e.g. data highly correlated over the length of one segment, consecutive sets will give vastly different fits than e.g. random samples from the whole data set.
Folks, thanks for this thread.
The code in the answer above (Schneider) is outdated.
As of scikit-learn==0.19.1, this will work as expected.
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(regressor, X, y, cv=kf)
Best,
M.
I am making a predictive model in python with scikit-learn, and I am trying to cross validate to get a valid F1 score. However, depending on my CV method, I am getting very different results. It seems that issues like this are typically due to overfitting or bad data, but in this case that shouldn't explain how my methods give consistent internal results even with different splits, but always differ from each other. x is my dataset, y is the labels, and rf_best is my classifier. For example:
cv_scores = cross_val_score(rf_best, x, y, cv=5, scoring='f1')
avg_cv_score = np.mean(cv_scores)
print cv_scores
avg_cv_score
returns
Out[227]:
[ 0.39825853 0.55969686 0.58727599 0.64060356 0.41976476]
0.52111993918160837
while (changing cv from 5 splits to a ShuffleSplit function)
cv = ShuffleSplit(len(y), n_iter=5, test_size=0.25, random_state=1)
cv_scores = cross_val_score(rf_best, x, y, cv=cv, scoring='f1')
avg_cv_score = np.mean(cv_scores)
print cv_scores
avg_cv_score
returns
Out[228]:
[ 0.88029259 0.86664242 0.8898564 0.87900669 0.86130213]
0.87542004725953615
I'm sure the classifier isn't performing this well, and I don't see how I could be overfitting with 5 shufflesplits, especially as I rerun it over and over. And this:
scores = []
for train, test in KFold(len(y), n_folds=5): #.25 tt split
xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]
rf_best.fit(xtrain, ytrain)
scores.append(f1_score(ytest, rf_best.predict(xtest)))
print scores
np.mean(scores)
returns
Out[224]:
[0.3365789517232205, 0.39921963139526107, 0.47179614359341149, .56085913206027882, 0.3765470091576244]
0.42900017358595932
How can 3 methods which are doing close to the same thing return such different results so consistently? Even when I change the random seed or test-set size, the results stay similar to what I posted above. Thanks for your time.