I'm new to this field and currently working with gene expression data. I have to do a classification where my data are counts in matrix form. The features are the genes, and the samples to classify are the patients (7 types of cancer plus healthy donors). The book whose experiment I'm replicating says the following:
For the multi-class SVM classification algorithm, a One-Versus-One (OVO) approach was used. To cross validate the algorithm for all samples in the training cohort, the SVM algorithm was trained by all samples in the training cohort minus one, while the remaining sample was used for (blind) classification. This process was repeated for all samples until each sample was predicted once (leave-one-out cross-validation [LOOCV] procedure).
Now, I actually know how to use LOOCV in Python, and I learned how to use OVO by looking online. But I don't get what is meant to be done here. I made an attempt and the results came out quite similar, but I'm pretty sure I'm making a horrible mistake somewhere. Please don't flame me, I need help. Here below is my interpretation (I copied this from the internet and used OVO instead of a plain SVM):
from sklearn.model_selection import LeaveOneOut
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Function for training
def loocv(train_X, train_y):
    # define X and y
    X = train_X
    y = train_y
    # define LOOCV
    loo = LeaveOneOut()
    loo.get_n_splits(X)
    # define true and predicted lists
    y_true, y_pred = [], []
    # run: train on all samples minus one, predict the held-out sample
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model = SVC(kernel='linear', random_state=0)
        ovo_classifier = OneVsOneClassifier(model)
        ovo_classifier.fit(X_train, y_train)
        yhat = ovo_classifier.predict(X_test)
        y_true.append(y_test[0])
        y_pred.append(yhat[0])
    # note: the classifier returned is the one fitted on the last fold only
    return y_true, y_pred, ovo_classifier
Validation:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
y_true, y_pred, model = loocv(X_train, y_train)
pred_y = model.predict(X_test)
training_accuracy = accuracy_score(y_true, y_pred)
accuracy = accuracy_score(y_test, pred_y)
print(accuracy)
print(training_accuracy)
Results:
0.6918604651162791
0.6658291457286432
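For comparison, here is my reading of what the book literally describes: LOOCV over the whole training cohort, with no extra train_test_split (a sketch, assuming X and y hold the full training cohort):

from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# every sample is predicted exactly once by a model trained on all other samples
ovo_svm = OneVsOneClassifier(SVC(kernel='linear'))
y_pred = cross_val_predict(ovo_svm, X, y, cv=LeaveOneOut())
print(accuracy_score(y, y_pred))

Is that closer to what is meant?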
This is for a project that's due soon, so help would be greatly appreciated; I've never done ML before, so sorry if the mistake is an absolute smooth-brain one.
I have a dataset that's a bunch of tweets along with personality scores, and I need to train a model to predict the scores.
This is what I've done so far by following a bunch of tutorials and stitching together what I learned.
import pandas
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor

train = pandas.read_csv('../dataset/cleaner_dataset.csv')
train['tweet'] = train['tweet'].str.lower()
train['tweet'] = train['tweet'].replace('[^a-zA-Z0-9]', ' ', regex=True)

X = train['tweet']
y = train['neuroticism']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)

vectorizer = TfidfVectorizer(min_df=5)
X_test_vec = vectorizer.fit_transform(X_train)

model = RandomForestRegressor()
model.fit(X_vectorized, y_train)
model.score(X_test_vec, y_test)
However, I'm getting an error on the last line of code when I run it in the notebook.
ValueError: Found input variables with inconsistent numbers of samples: [495, 1980]
Full error message: https://imgur.com/a/GS7jEi5
You are using X_train for both the train and the test transforms, which is why you are getting the error.
try:
vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test) # use the same vectorizer, do not define a new one
As pointed out below, we don't fit the vectorizer on the test set.
BUT you still need to use X_test with y_test when scoring.
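Putting it together, a minimal sketch of the corrected flow (reusing the variable names and split from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor

vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)  # learn the vocabulary on train only
X_test_vec = vectorizer.transform(X_test)         # reuse that vocabulary on test

model = RandomForestRegressor()
model.fit(X_vectorized, y_train)
model.score(X_test_vec, y_test)                   # row counts now match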
I am trying to use cross_val_score to evaluate my regression model (with PolynomialFeatures(degree=2)). I noted from different blog posts that I should use cross_val_score with the original X and y values, not X_train and y_train.
r_squareds = cross_val_score(pipe, X, y, cv=10)
r_squareds
>>> array([ 0.74285583, 0.78710331, -1.67690578, 0.68890253, 0.63120873,
0.74753825, 0.13937611, 0.18794756, -0.12916661, 0.29576638])
which indicates my model doesn't perform very well, with a mean R² of only 0.241. Is this the correct interpretation?
However, I came across a Kaggle notebook working on the same data where the author ran cross_val_score on X_train and y_train. I gave this a try and the average R² was better.
r_squareds = cross_val_score(pipe, X_train, y_train, cv=10)
r_squareds.mean()
>>> 0.673
Is this a problem?
Here is the code for my model:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = df[['CHAS', 'RM', 'LSTAT']]
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

pipe = Pipeline(
    steps=[('poly_feature', PolynomialFeatures(degree=2)),
           ('model', LinearRegression())]
)

## fit the model
pipe.fit(X_train, y_train)
Your first interpretation is correct. The first cross_val_score trains 10 models, each with 90% of your data as the train set and the remaining 10% as the validation set. We can see from these results that the estimator's R² variance is quite high; sometimes the model even performs worse than a constant prediction of the mean (negative R²).
From this result we can safely say that the model is not performing well on this dataset.
It is possible that the result obtained by running cross_val_score on only the train set is higher, but this score is most likely not representative of your model's performance, as the training portion may be too small to capture all of the data's variance. (The train portion of each fold in the second cross_val_score is only 54% of your dataset: 90% of the 60% that went into X_train.)
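As a sketch of one common workflow (reusing pipe and the split from the question), cross-validate on the training portion for model assessment and keep the test set for a single final check:

from sklearn.model_selection import cross_val_score

# model assessment: CV on the training portion only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=10)
print(cv_scores.mean(), cv_scores.std())

# one-shot estimate on data no training fold has seen
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))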
I have two machine learning models with one target. I run each one alone; now I am looking to concatenate both to get one result ...
One of the models contains text (with TF-IDF) and the target, and the other contains 6 attributes with the target. That means my full data contains the text plus 6 attributes, so I am looking to combine everything into one model.
The first one contains two features:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

DTClass = DecisionTreeClassifier(criterion="gini", splitter="best",
                                 random_state=77)
X_train, X_test, y_train, y_test = train_test_split(
    bow, df1["attacktype1_txt"], test_size=1/5, random_state=50)

DTClass.fit(X_train, y_train)
prediction = DTClass.predict(X_test)

print("accuracy score:")
print(accuracy_score(y_test, prediction))
And the second:
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier

array = df.values
X = array[:, 1:7]
Y = array[:, 7]
validation_size = 0.20
seed = 4
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

scoring = 'accuracy'
models = []
models.append(('CART', DecisionTreeClassifier()))
results = []
names = []
for name, model in models:
    # shuffle=True is required when a random_state is passed to KFold
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train,
                                                 cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Your problem seems less an issue of merging models than one of merging data. Unless you have reason to believe that model performance will decrease with the inclusion of more data, losing information by splitting it across models should be avoided.
In this case, it appears the data is a bit chaotic. Perhaps merge to a single X array (I'd suggest doing so in pandas) and a single y. If your y labels are not compatible, then you'd want to correct them.
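For example, here is a minimal sketch of merging both inputs into a single model with a ColumnTransformer; the column names 'text' and f1..f6 are placeholders for your actual columns, while df and "attacktype1_txt" come from your own code:

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

numeric_cols = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']  # placeholder names

# TF-IDF on the text column; numeric attributes passed through unchanged
preprocess = ColumnTransformer([
    ('text', TfidfVectorizer(), 'text'),
    ('num', 'passthrough', numeric_cols),
])

clf = Pipeline([
    ('features', preprocess),
    ('tree', DecisionTreeClassifier(random_state=77)),
])
clf.fit(df[['text'] + numeric_cols], df['attacktype1_txt'])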
Additionally, I'd suggest reviewing the following tools:
Voting Classifiers and Voting Regressors
An extra "hack" is to assign a model's accuracy or f1 score as the weight in the weighted vote. This can generate extreme overfitting, so proceed with caution.
Stacking Classifiers and Stacking Regressors
The outcome of each model in the stack is used as input for the prediction of the final model. In my experience, this has performance comparable to an optimized MLP or a single-layer neural network.
Boosting, Extreme Gradient Boosting, and Light Gradient Boosting
Each is an effective ensemble method that works as a well-calibrated "team" of estimators.
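For instance, a minimal voting/stacking sketch (the base estimators and weights here are purely illustrative):

from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

estimators = [
    ('tree', DecisionTreeClassifier(random_state=77)),
    ('logreg', LogisticRegression(max_iter=1000)),
]

# weighted soft vote; the weights could be each model's CV accuracy (the "hack" above)
vote = VotingClassifier(estimators, voting='soft', weights=[0.6, 0.7])

# stacking: base-model predictions become inputs for the final estimator
stack = StackingClassifier(estimators, final_estimator=LogisticRegression())

Both are fitted and scored like any single estimator (fit / predict / score).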
I am using LinearRegression(). Below you can see what I have already done to predict new features:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

lm = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)  # any fixed seed
lm.fit(X_train, y_train)
lm.predict(X_test)
scr = lm.score(X_test, y_test)

lm.fit(X, y)
pred = lm.predict(X_real)
Do I really need the line lm.fit(X, y), or can I go without it? Also, if I don't need to calculate accuracy, do you think the following approach is better than splitting into training and testing sets? (In case I don't want to test:)
lm.fit(X, y)
pred = lm.predict(X_real)
Even though I am getting 0.997 accuracy, the predicted values are not close to the real ones; they look shifted. Are there ways to make the prediction more accurate?
You don't need to fit multiple times to predict a value from given features, since your algorithm has already learned from your train set. Check the code below.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)

# Fit your algorithm on the train set
lr = LinearRegression()
lr.fit(X_train, y_train)

# Now it can predict
y_pred = lr.predict(X_test)

# Use the test set to see how accurately it predicts (score takes X_test, not y_pred)
lr_score = lr.score(X_test, y_test)
The reason you are getting almost 100% accuracy score is a data leakage, caused by the following line of code:
lm.fit(X, y)
In the line above you gave your model ALL of the data, and then you tested its predictions on a subset of data it has already seen.
This produces a very high accuracy score on the already-seen data, but the model usually performs badly on unseen data.
When do you want / need to fit your model multiple times?
If you are getting new training data and want to improve your model by training it on a new portion of data, then you may want to choose one of the regression algorithms that support incremental learning.
In this case you would use the model.partial_fit() method instead of model.fit()...
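For example, a minimal sketch with SGDRegressor, one of the scikit-learn regressors that supports partial_fit (X_new and y_new stand for a hypothetical new batch of data):

from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
model.partial_fit(X_train, y_train)  # learn from the initial batch

# later, when a new portion of training data arrives:
model.partial_fit(X_new, y_new)      # update the model without retraining from scratch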
I have a dataset with 155 features and 40143 samples. It was sorted by date (oldest to newest), and then I deleted the date column from the dataset.
The label is in the first column.
CV results in c. 65% accuracy (mean of the scores, +/- 0.01) with the code below:
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def cross(dataset):
    dropz = ["result"]
    X = dataset.drop(dropz, axis=1)
    X = preprocessing.normalize(X)
    y = dataset["result"]
    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
    return scores
Also I get similar accuracy with the code below:
from sklearn.model_selection import train_test_split

def train(dataset):
    dropz = ["result"]
    X = dataset.drop(dropz, axis=1)
    X = preprocessing.normalize(X)
    y = dataset["result"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, random_state=42)
    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
    return clf.score(X_test, y_test)
But if I don't use shuffle in the code below, the result is c. 49%; if I use shuffle, it is c. 65%.
I should mention that I tried every 1000-sample consecutive split of the whole set, from the end to the beginning, and the result is the same.
import pandas as pd
from sklearn.utils import shuffle

dataset = pd.read_csv("./dataset.csv", header=0, sep=";")
dataset = shuffle(dataset)  # !!!??? commenting this out drops accuracy to c. 49%

X_train = dataset.iloc[:-1000, 1:]
X_train = preprocessing.normalize(X_train)
y_train = dataset.iloc[:-1000, 0]

X_test = dataset.iloc[-1000:, 1:]
X_test = preprocessing.normalize(X_test)
y_test = dataset.iloc[-1000:, 0]

clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
clf.score(X_test, y_test)
Assuming your question is "Why does it happen":
In both your first and second code snippets there is underlying shuffling happening (inside your cross-validation and train_test_split methods), so they are equivalent (both in score and in algorithm) to your last snippet with shuffling "on".
Since your original dataset is ordered by date, there is likely some data that changes over time. Because your classifier never sees data from the last 1000 time points, it is unaware of the change in the underlying distribution and therefore fails to classify it.
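One way to verify this (a sketch, reusing X, y, and clf from your cross function) is to compare shuffled folds against order-preserving folds:

from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

# shuffled folds: train and test are drawn from the same mixed distribution
shuffled = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=42))

# order-preserving folds: always predict later samples from earlier ones
chrono = cross_val_score(clf, X, y, cv=TimeSeriesSplit(n_splits=10))

print(shuffled.mean(), chrono.mean())  # a large gap points to drift over time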
Addendum, to address the further data in the comments:
This suggests there might be some indicative process that is captured only in smaller time frames. Two interesting ways to explore it:
Reduce the size of the test set until you find a window size at which the difference between shuffle/no shuffle is negligible (see the sketch after this list).
Such a process would essentially manifest as dependence between your features, so you could check whether, within a small time frame, your features are correlated.
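A sketch of the first idea, assuming X and y are the normalized features and labels from your code, still sorted by date:

from sklearn.neighbors import KNeighborsClassifier

# shrink the chronological test window and watch when the shuffle/no-shuffle gap closes
for window in (1000, 500, 250, 100, 50):
    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)
    clf.fit(X[:-window], y[:-window])
    print(window, clf.score(X[-window:], y[-window:]))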