I have two machine learning models that share one target. I currently run each one on its own, and I am looking for a way to concatenate them so that I get a single result.
One model contains text (TF-IDF features) and the target; the other contains 6 attributes and the same target. In other words, my full data set has the text plus 6 attributes, and I would like all of it to be in one model.
The first one contains two features:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

DTClass = DecisionTreeClassifier(criterion="gini", splitter="best", random_state=77)
X_train, X_test, y_train, y_test = train_test_split(
    bow, df1["attacktype1_txt"], test_size=1/5, random_state=50)
DTClass.fit(X_train, y_train)
prediction = DTClass.predict(X_test)
print("accuracy score:")
print(accuracy_score(y_test, prediction))
And the second:
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier

array = df.values
X = array[:, 1:7]
Y = array[:, 7]
validation_size = 0.20
seed = 4
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
scoring = 'accuracy'
models = []
models.append(('CART', DecisionTreeClassifier()))
results = []
names = []
for name, model in models:
    # shuffle=True is needed for random_state to have an effect in KFold
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(
        model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Your problem seems to be less about merging models and more about merging data. Unless you have reason to believe that including more data will hurt model performance, avoid losing information by splitting the data across separate models.
In this case, it appears the data is a bit chaotic. Perhaps merge to a single X array (I'd suggest doing so in pandas) and a single y. If your y labels are not compatible, then you'd want to correct them.
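As a rough illustration of merging everything into a single X, here is a minimal sketch that builds TF-IDF features from a text column and passes numeric attributes through unchanged, all inside one pipeline. The toy data and the column names `summary`, `nkill` and `nwound` are hypothetical stand-ins (only `attacktype1_txt` comes from the question), and only two numeric attributes are shown instead of six.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one text column, two numeric attributes, one target.
df = pd.DataFrame({
    "summary": ["bomb exploded near market", "armed men attacked convoy",
                "bomb found in station", "gunmen attacked checkpoint",
                "explosion hit the bridge", "armed assault on village"],
    "nkill": [3, 1, 0, 2, 5, 4],
    "nwound": [10, 0, 0, 3, 7, 2],
    "attacktype1_txt": ["Bombing", "Armed Assault", "Bombing",
                        "Armed Assault", "Bombing", "Armed Assault"],
})

# The text column is given as a plain string so the vectorizer receives 1-D input;
# the numeric columns are given as a list and passed through unchanged.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "summary"),
    ("numeric", "passthrough", ["nkill", "nwound"]),
])

model = Pipeline([
    ("features", features),
    ("tree", DecisionTreeClassifier(criterion="gini", random_state=77)),
])

X = df.drop(columns="attacktype1_txt")
y = df["attacktype1_txt"]
model.fit(X, y)
predictions = model.predict(X)  # predictions on the training rows, for illustration only
```

In real use you would split into train and test before calling `fit`, so that the vectorizer vocabulary is learned from the training rows only.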
Additionally, I'd suggest reviewing the following tools:
Voting Classifiers and Voting Regressors
An extra "hack" is to assign a model's accuracy or f1 score as the weight in the weighted vote. This can generate extreme overfitting, so proceed with caution.
Stacking Classifiers and Stacking Regressors
The outputs of each model in the stack are used as input for the prediction of the final model. In my experience, this gives performance comparable to that of an optimized MLP or single-layer neural network.
Boosting, Extreme Gradient Boosting, and Light Gradient Boosting
Each is an effective ensemble method that works with well-calibrated "teams" of estimators.
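To make the voting and stacking options a bit more concrete, here is a minimal sketch using scikit-learn's `VotingClassifier` and `StackingClassifier`. The synthetic data, the choice of base estimators and the vote weights are all assumptions made for illustration, not a recommendation for your data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

base_estimators = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("logreg", LogisticRegression(max_iter=1000)),
]

# Weighted soft vote: predicted class probabilities are averaged using the weights.
voting = VotingClassifier(estimators=base_estimators, voting="soft", weights=[1, 1, 2])
voting.fit(X_train, y_train)

# Stacking: cross-validated predictions of the base models feed a final estimator.
stacking = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stacking.fit(X_train, y_train)

voting_accuracy = voting.score(X_test, y_test)
stacking_accuracy = stacking.score(X_test, y_test)
```

If you derive the weights from validation scores, compute those scores on data that is not used for the final evaluation, which is where the overfitting risk mentioned above comes from.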
I'm new to this field and I'm currently working with gene expression data. I have to do a classification where my data are counts in matrix form. The features are the genes, and the samples to classify are the patients (7 types of cancer and healthy donors). The book from which I'm replicating the experiment says the following:
For the multi-class SVM classification algorithm, a One-Versus-One (OVO) approach was used. To cross validate the algorithm for all samples in the training cohort, the SVM algorithm was trained by all samples in the training cohort minus one, while the remaining sample was used for (blind) classification. This process was repeated for all samples until each sample was predicted once (leave-one-out cross-validation [LOOCV] procedure).
I know how to use LOOCV in Python, and I know how to use OVO from looking online, but I don't get what is meant to be done here. I made an attempt and the results came out quite similar, but I'm pretty sure I'm making a horrible mistake somewhere. Please don't flame me, I need help. Below is my interpretation (I copied this from the internet and wrapped the SVM in OVO instead of using only the SVM):
from sklearn.model_selection import LeaveOneOut
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Function for training
def loocv(train_X, train_y):
    # define X and y
    X = train_X
    y = train_y
    # define LOOCV
    loo = LeaveOneOut()
    loo.get_n_splits(X)
    # define true and predict list
    y_true, y_pred = [], []
    # run: fit on all samples but one, predict the held-out sample
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model = SVC(kernel='linear', random_state=0)
        ovo_classifier = OneVsOneClassifier(model)
        ovo_classifier.fit(X_train, y_train)
        yhat = ovo_classifier.predict(X_test)
        y_true.append(y_test[0])
        y_pred.append(yhat[0])
    # the returned classifier is the one fitted in the last fold
    return y_true, y_pred, ovo_classifier
Validation :
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0)
y_true,y_pred,model = loocv(X_train,y_train)
pred_y = model.predict(X_test)
training_accuracy = accuracy_score(y_true,y_pred)
accuracy = accuracy_score(y_test,pred_y)
print(accuracy)
print(training_accuracy)
Results :
0.6918604651162791
0.6658291457286432
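For comparison, the quoted procedure (OVO SVM, each training sample predicted once by a model fitted on all the others) can also be sketched with `cross_val_predict`. This is only one reading of the quote: the iris data is a stand-in for the gene-count matrix, and treating the whole array as the "training cohort" is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Stand-in data; in the question this would be the training cohort.
X, y = load_iris(return_X_y=True)

ovo_svm = OneVsOneClassifier(SVC(kernel="linear", random_state=0))

# Each sample is predicted exactly once by a model fitted on all other samples.
y_pred = cross_val_predict(ovo_svm, X, y, cv=LeaveOneOut())
loocv_accuracy = accuracy_score(y, y_pred)
```

To then score a held-out test set, a separate model would be fitted on the full training cohort, rather than reusing whichever model came out of the last fold.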
I'm trying to classify text into 6 different classes.
Since my dataset is imbalanced, I'm also using the SMOTETomek method, which should synthetically balance the dataset with additional artificial samples.
I've noticed a huge score difference when applying it via a pipeline vs "step by step", where the only difference is (I believe) the place where I call train_test_split.
Here are my features and labels:
for curr_features, label in self.training_data:
    features.append(curr_features)
    labels.append(label)
algorithms = [
linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None),
naive_bayes.MultinomialNB(),
naive_bayes.BernoulliNB(),
tree.DecisionTreeClassifier(max_depth=1000),
tree.ExtraTreeClassifier(),
ensemble.ExtraTreesClassifier(),
svm.LinearSVC(),
neighbors.NearestCentroid(),
ensemble.RandomForestClassifier(),
linear_model.RidgeClassifier(),
]
Using Pipeline:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
# Provide Report for all algorithms
score_dict = {}
for algorithm in algorithms:
    model = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('smote', SMOTETomek()),
        ('classifier', algorithm)
    ])
    model.fit(X_train, y_train)
    # Score
    score = model.score(X_test, y_test)
    score_dict[model] = int(score * 100)
sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
for classifier, score in sorted_score_dict.items():
    print(f'{classifier.__class__.__name__}: score is {score}%')
Using Step by Step:
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
cv = vectorizer.fit_transform(features)
text_tf = transformer.fit_transform(cv).toarray()
smt = SMOTETomek()
X_smt, y_smt = smt.fit_resample(text_tf, labels)
X_train, X_test, y_train, y_test = train_test_split(X_smt, y_smt, test_size=0.2, random_state=0)
self.test_classifiers(X_train, X_test, y_train, y_test, algorithms)

def test_classifiers(self, X_train, X_test, y_train, y_test, classifiers_list):
    score_dict = {}
    for model in classifiers_list:
        model.fit(X_train, y_train)
        # Score
        score = model.score(X_test, y_test)
        score_dict[model] = int(score * 100)
    print()
    print("SCORE:")
    sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
    for model, score in sorted_score_dict.items():
        print(f'{model.__class__.__name__}: score is {score}%')
I'm getting (for the best classifier model) around 65% using pipeline vs 90% using step by step.
Not sure what I am missing.
There is nothing wrong with your code by itself. But your step-by-step approach follows a bad practice in machine learning:
Do not resample your testing data
In your step-by-step approach, you resample all of the data first and then split them into train and test sets. This will lead to an overestimation of model performance because you have altered the original distribution of classes in your test set and it is not representative of the original problem anymore.
What you should do instead is to leave the testing data in its original distribution in order to get a valid approximation of how your model will perform on the original data, which is representing the situation in production. Therefore, your approach with the pipeline is the way to go.
As a side note: you could think about shifting the whole data preparation (vectorization and resampling) out of your fitting and testing loop as you probably want to compare the model performance against the same data anyway. Then you would only have to run these steps once and your code executes faster.
The correct approach in such cases is described in detail in my own answer in the Data Science SE thread Why you shouldn't upsample before cross validation (although the answer is about CV, the rationale is identical for the train/test split case as well). In short, any resampling method (SMOTE included) should be applied only to the training data and not to the validation or test data.
Given that, your Pipeline approach here is correct: you apply SMOTE only to your training data after splitting, and, according to the documentation of the imblearn pipeline:
The samplers are only applied during fit.
So, no SMOTE is actually applied to your test data during model.score, which is exactly as it should be.
Your step-by-step approach, on the other hand, is wrong on many levels, and SMOTE is only one of them; all these preprocessing steps should be applied after the train/test split, and fitted only on the training portion of your data, which is not the case here, thus the results are invalid (no wonder they look "better"). For a general discussion (and a practical demonstration) of how & why such preprocessing should be applied only to the training data, see my (2) answers in Should Feature Selection be done before Train-Test Split or after? (again, the discussion there is about feature selection, but it is applicable to such feature engineering tasks like count vectorizer and TF-IDF transformation as well).
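To see why the order matters, here is a small sketch that tracks row ids instead of real features. The `oversample` helper is a hypothetical, naive stand-in for SMOTE (it only duplicates minority rows), and the 90/10 class split is made up; the point is just to count how many original rows end up on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical rows identified by id: 90 of class 0, 10 of class 1.
row_ids = np.arange(100)
labels = np.array([0] * 90 + [1] * 10)

def oversample(ids, y):
    """Naive stand-in for SMOTE: duplicate minority rows until classes are roughly balanced."""
    minority_ids = ids[y == 1]
    majority_ids = ids[y == 0]
    copies = len(majority_ids) // len(minority_ids)
    new_ids = np.concatenate([majority_ids, np.repeat(minority_ids, copies)])
    new_y = np.concatenate([
        np.zeros(len(majority_ids), dtype=int),
        np.ones(len(minority_ids) * copies, dtype=int),
    ])
    return new_ids, new_y

# Step-by-step order: resample everything, then split.
ids_all, y_all = oversample(row_ids, labels)
train_ids, test_ids, _, _ = train_test_split(
    ids_all, y_all, test_size=0.2, random_state=0)
shared_rows_resample_first = len(set(train_ids) & set(test_ids))

# Pipeline order: split first, then resample the training part only.
train_ids, test_ids, y_train, y_test = train_test_split(
    row_ids, labels, test_size=0.2, random_state=0, stratify=labels)
train_ids, y_train = oversample(train_ids, y_train)
shared_rows_split_first = len(set(train_ids) & set(test_ids))

print("rows shared by train and test (resample first):", shared_rows_resample_first)
print("rows shared by train and test (split first):", shared_rows_split_first)
```

With the split done first, no row can appear on both sides by construction, and the test set keeps the original class distribution. SMOTE interpolates rather than duplicates, but its synthetic points are still built from rows that may land in the test set.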
When I want to evaluate my model with cross validation, should I perform cross validation on the original data (data that's not split into train and test) or on the train / test data?
I know that training data is used for fitting the model, and testing for evaluating. If I use cross validation, should I still split the data into train and test, or not?
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)
Or should I do like this:
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
clf = LogisticRegression()
model = clf.fit(features, results)
accuracy_test = cross_val_score(clf, features, results, cv = 5)
Or maybe something different?
Both your approaches are wrong.
In the first one, you apply cross validation to the test set, which is meaningless
In the second one, you first fit the model with your whole data, and then you perform cross validation, which is again meaningless. Moreover, the approach is redundant (your fitted clf is not used by the cross_val_score method, which does its own fitting)
Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:
Either with a separate test set
Or with cross validation
First way (test set):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)
Second way (cross validation):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
clf = LogisticRegression()
# shuffle data first:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv = 5, scoring='accuracy')
# fit the model afterwards with the whole data, if satisfied with the performance:
model = clf.fit(features, results)
I will try to summarize the "best practice" here:
1) If you want to train your model, fine-tune parameters, and do a final evaluation, I recommend splitting your data into training|val|test.
You fit your model using the training part, and then you check different parameter combinations on the val part. Finally, when you're sure which classifier/parameter combination obtains the best result on the val part, you evaluate on the test part to get the final result.
Once you evaluate on the test part, you shouldn't change the parameters any more.
2) On the other hand, some people follow another way: they split their data into training and test, fine-tune their model using cross-validation on the training part, and at the end evaluate it on the test part.
If your data is quite large, I recommend the first way; if your data is small, the second.
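The second way can be sketched roughly as follows, assuming scikit-learn's `GridSearchCV` does the cross-validated tuning on the training part. The synthetic data and the grid of `C` values are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Tune with 5-fold cross-validation on the training part only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_C = search.best_params_["C"]
# Final evaluation on the untouched test part, done once.
test_accuracy = search.score(X_test, y_test)
```

By default `GridSearchCV` refits the best parameter combination on the whole training part, so `search.score` evaluates that refitted model.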
I have a dataset with more than 120 features, and I want to use RFE for selecting which features / column names I should use.
I have a problem because RFE is very slow. My code looks like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
full_df = pd.read_csv('data.csv')
x = full_df.iloc[:,:-1]
y = full_df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
model = LogisticRegression(solver ='lbfgs')
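for i in range(1, 120):
    # n_features_to_select must be passed by keyword in recent scikit-learn versions
    rfe = RFE(model, n_features_to_select=i)
    fit = rfe.fit(x_train, y_train)
    acc = fit.score(x_test, y_test)
    print(acc)
    print(fit.support_)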
My problem is this: rfe = RFE(model, i). I do not know what the best number for i is. That's why I put it in for i in range(1,120). Is there a better way to do this? Is there a better function in scikit-learn that can help me determine the number of features and the names of those features?
Because this took too long, I changed my approach, and I want to know what you think about it: is it a good / correct approach?
First I did PCA, and I found out that each column participates with around 1-0.4%, except the last 9 columns. The last 9 columns participate with less than 0.00001%, so I removed them. Now I have 121 features.
pca = PCA()
fit = pca.fit(x)
Then I split my data into train and test (with 121 features).
Then I used SelectFromModel, and I tested it with 4 different classifiers. Each classifier in SelectFromModel reduced the number of columns. I chose the number of columns determined by the classifier that gave me the best accuracy:
model = SelectFromModel(clf, prefit=True)
#train_score = clf.score(x_train, y_train)
test_score = clf.score(x_test, y_test)
column_res = model.transform(x_train).shape
And finally I used 'RFE', with the number of columns that I got from 'SelectFromModel'.
rfe = RFE(model, n_features_to_select=number_of_columns)
fit = rfe.fit(x_train, y_train)
acc = fit.score(x_test, y_test)
Is this a good approach, or did I do something wrong?
Also, if I got the highest accuracy in SelectFromModel with one classifier, do I need to use the same classifier in RFE?
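One option that may fit the "which number of features" part of the question is `RFECV` from `sklearn.feature_selection`, which runs recursive feature elimination and picks the number of features by cross-validation. The sketch below uses synthetic data in place of data.csv, and the settings (`step=1`, `cv=5`) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Cross-validated recursive feature elimination on the training part only.
selector = RFECV(
    LogisticRegression(solver="lbfgs", max_iter=1000),
    step=1,
    cv=5,
    scoring="accuracy",
)
selector.fit(x_train, y_train)

n_selected = selector.n_features_      # number of features kept
selected_mask = selector.support_      # boolean mask over the columns
test_accuracy = selector.score(x_test, y_test)
```

With a pandas DataFrame, the selected column names would be `x_train.columns[selector.support_]`.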
I've tried out Linear Regression using SKLearn. I have data something along the lines of: Calories Eaten | Weight.
150 | 150
300 | 190
350 | 200
Basically made up numbers but I've fit the dataset into the linear regression model.
What I'm confused on is, how would I go about predicting with new data, say I got 10 new numbers of Calories Eaten, and I want it to predict Weight?
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test) ??
But how would I go about taking only my 10 new Calories Eaten numbers and making them the test set that I want the regressor to predict on?
You are correct, you simply call the predict method of your model and pass in the new unseen data for prediction. Now it also depends on what you mean by new data. Are you referencing data that you do not know the outcome of (i.e. you do not know the weight value), or is this data being used to test the performance of your model?
For new data (to predict on):
Your approach is correct. You can access all predictions by simply printing the y_pred variable.
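As a minimal sketch of this case, using the three made-up rows from the question as training data and ten hypothetical new calorie values: the one detail that often trips people up is that scikit-learn expects a 2-D feature array, hence the `reshape(-1, 1)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The made-up rows from the question: Calories Eaten -> Weight.
calories = np.array([150, 300, 350]).reshape(-1, 1)  # 2-D: one column, one row per sample
weight = np.array([150, 190, 200])

regressor = LinearRegression()
regressor.fit(calories, weight)

# Ten hypothetical new Calories Eaten values with unknown weight.
new_calories = np.array([100, 200, 250, 275, 320, 400, 450, 500, 550, 600]).reshape(-1, 1)
predicted_weight = regressor.predict(new_calories)
print(predicted_weight)
```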
If you know the respective weight values and want to evaluate the model:
Make sure that you have two separate data sets: x_test (containing the features) and y_test (containing the labels). Generate the predictions as you are doing with the y_pred variable, then calculate the model's performance using one of several performance metrics. The most common one is the root mean squared error (RMSE), which you compute from y_test and y_pred. Here is a list of all the regression performance metrics supplied by sklearn.
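A rough sketch of this evaluation, computing the root mean squared error as the square root of `mean_squared_error`. The synthetic calories/weight data is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Invented data: weight depends roughly linearly on calories, plus noise.
rng = np.random.default_rng(0)
calories = rng.uniform(100, 600, size=(40, 1))
weight = 120 + 0.2 * calories[:, 0] + rng.normal(0, 5, size=40)

x_train, x_test, y_train, y_test = train_test_split(
    calories, weight, test_size=0.25, random_state=42)

regressor = LinearRegression().fit(x_train, y_train)
y_pred = regressor.predict(x_test)

# RMSE is in the same units as the target (here, weight).
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)
```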
If you do not know the weight value of the 10 new data points:
Use train_test_split to split your initial data set into 2 parts: training and testing. You would have 4 datasets: x_train, y_train, x_test, y_test.
from sklearn.model_selection import train_test_split
# random state can be any number (to ensure same split), and test_size indicates a 25% cut
# note the return order: both feature sets first, then both label sets
x_train, x_test, y_train, y_test = train_test_split(calories_eaten, weight, test_size = 0.25, random_state = 42)
Train the model by fitting x_train and y_train. Then evaluate the model's performance by predicting on x_test and comparing these predictions with the actual results from y_test. This way you get an idea of how the model performs on unseen data. After that, you can predict the weight values for the 10 new data points in the same way.
It is also worth reading further on the topic as a beginner. This is a simple tutorial to follow.
You have to split the dataset using model_selection in sklearn, then fit the model on the training portion.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(eaten, weight)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
What I'm confused on is, how would I go about predicting with new
data, say I got 10 new numbers of Calories Eaten, and I want it to
predict Weight?
Yes, Calories Eaten represents the independent variable while Weight represents the dependent variable.
After you split the data into a training set and a test set, the next step is to fit the regressor using the X_train and y_train data.
After the model is trained, you can predict the results for X_test with the predict method, which gives you y_pred.
Now you can compare y_pred (predicted data) with y_test which is real data.
You can also use score method for your created linear model in order to get the performance of your model.
score is calculated using R^2(R squared) metric or Coefficient of determination.
score = regressor.score(X_test, y_test)
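As a quick check that the score method of a linear model really is R^2, the sketch below compares it with `r2_score` from `sklearn.metrics` on invented data. The plain slicing used for the split is only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Invented calories/weight data.
rng = np.random.default_rng(1)
x = rng.uniform(100, 600, size=(30, 1))
y = 120 + 0.2 * x[:, 0] + rng.normal(0, 5, size=30)

# Simple split: first 24 rows for training, last 6 for testing.
X_train, X_test, y_train, y_test = x[:24], x[24:], y[:24], y[24:]

regressor = LinearRegression().fit(X_train, y_train)
score = regressor.score(X_test, y_test)
manual_r2 = r2_score(y_test, regressor.predict(X_test))
print(score, manual_r2)
```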
For splitting the data you can use train_test_split method.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(eaten, weight, test_size = 0.2, random_state = 0)