I want to apply an SVM classifier to my problem, where the prediction vector has two classes. SVM raises a "bad input" error when I try to pass such a prediction vector. Is it possible to provide such input to SVM? If not, how can I cope with this issue?
Y = np.zeros((len(y), max(y)+1))
for i in range(len(y)):
    Y[i, y[i]] = 1
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf.get_n_splits(X)
print(kf)
# Output: KFold(n_splits=3, random_state=None, shuffle=False)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Matrix Y appears as below:
[image of matrix Y, not available]
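One possible way around this, as a minimal sketch: SVC expects a 1-D array of class labels for y rather than a one-hot matrix, so the original label vector y can be passed directly, or the one-hot matrix can be collapsed back to labels with argmax. The toy data below is made up for illustration and is not the data from the question.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (an assumption, standing in for the question's X and y)
X = np.column_stack([np.linspace(-1, 1, 40), np.linspace(1, -1, 40) ** 2])
y = (X[:, 0] > 0).astype(int)

# One-hot matrix built the same way as in the question
Y = np.zeros((len(y), y.max() + 1))
Y[np.arange(len(y)), y] = 1

# Collapse each one-hot row back to its class index before fitting
y_labels = Y.argmax(axis=1)

classifier = SVC(kernel="linear", random_state=0)
classifier.fit(X, y_labels)
y_pred = classifier.predict(X)
print(y_pred[:5], y_labels[:5])
```

Inside the KFold loop this would amount to indexing y (or Y.argmax(axis=1)) with train_index and test_index instead of Y.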
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
Now I want to plot a loss curve over the number of epochs for both the train and test data. How do I do that?
I tried many things, including TensorFlow and Keras, but I ended up confused.
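A note that may help here: LinearRegression solves the least-squares problem in closed form, so it has no epochs to plot. One way to get a per-epoch curve, assuming an iterative model is acceptable, is to swap in SGDRegressor and call partial_fit once per epoch, recording the train and test mean squared error each time. The sketch below uses synthetic data from make_regression in place of the question's X and y, and plot_loss_curves is a hypothetical helper name.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption, standing in for the question's X and y)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = SGDRegressor(random_state=0)
n_epochs = 30
train_loss, test_loss = [], []
for epoch in range(n_epochs):
    # Each partial_fit call makes one pass over the training data
    model.partial_fit(X_train, y_train)
    train_loss.append(mean_squared_error(y_train, model.predict(X_train)))
    test_loss.append(mean_squared_error(y_test, model.predict(X_test)))


def plot_loss_curves(train_loss, test_loss):
    """Plot the per-epoch train and test MSE (hypothetical helper)."""
    import matplotlib.pyplot as plt
    plt.plot(train_loss, label="train")
    plt.plot(test_loss, label="test")
    plt.xlabel("epoch")
    plt.ylabel("MSE")
    plt.legend()
    plt.show()
```

Calling plot_loss_curves(train_loss, test_loss) draws both curves on one set of axes.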
I am trying to use a voting model as an ensemble method. How can I add cross-validation to it?
Thanks
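from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
model_1 = SGDClassifier(random_state=0)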
model_2 = DecisionTreeClassifier(criterion='entropy', max_depth=12,max_leaf_nodes=35,splitter='best')
model_3 = GradientBoostingClassifier(n_estimators=100, random_state=42)
model_4 = MLPClassifier(random_state=42)
# model_3 = KNeighborsClassifier(n_neighbors=2)
X = df1.drop('product', axis = 1)
y = df1['product']
X_new = res_fit.transform(X)
#X_new =pd.DataFrame(X_new,columns = X.iloc[:,res_fit.support_].columns)
y_pred=cross_val_predict(model,X_new,y,cv=10)
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.25, random_state=42)
model_5 = VotingClassifier([('ADA', model_1),
                            ('Tree', model_2),
                            ('GradBoost', model_3),
                            ('MLP', model_4)],
                           voting='hard')
for model in (model_1, model_2, model_3, model_4, model_5):
    model.fit(X_train, y_train)
    print(model.__class__.__name__, model.score(X_test, y_test))
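One way this could be done, as a sketch: a VotingClassifier is an ordinary scikit-learn estimator, so it can be passed to cross_val_score like any single model, and each fold then refits every base estimator. The data below comes from make_classification as a stand-in for df1, and the MLP is left out only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption, standing in for X_new and y from the question)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

voting = VotingClassifier(
    [("SGD", SGDClassifier(random_state=0)),
     ("Tree", DecisionTreeClassifier(criterion="entropy", max_depth=12,
                                     max_leaf_nodes=35, random_state=0)),
     ("GradBoost", GradientBoostingClassifier(n_estimators=100, random_state=42))],
    voting="hard")

# One accuracy score per fold; the ensemble is refit from scratch on each fold
scores = cross_val_score(voting, X, y, cv=5)
print(scores, scores.mean())
```

cross_val_predict(voting, X, y, cv=5) can be used the same way if out-of-fold predictions are wanted instead of scores.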
While evaluating a model trained on the regression problem below, I am confused by the plot of the resulting history. In particular, when I don't specify any metrics:
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
# Reuse the statistics fitted on the training set; do not refit on validation/test data
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, tf.keras.activations.relu, input_shape=X_train.shape[1:]),
    tf.keras.layers.Dense(1)
])
model.compile(loss=tf.keras.losses.mean_squared_error,
              optimizer=tf.keras.optimizers.SGD())
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
pd.DataFrame(history.history).plot()
plt.grid(True)
plt.show()
the final plot includes the loss and val_loss curves, as expected.
But once I add a metric to my model, say tf.keras.metrics.MeanSquaredError(), the plot generated by
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
# Reuse the statistics fitted on the training set; do not refit on validation/test data
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, tf.keras.activations.relu, input_shape=X_train.shape[1:]),
    tf.keras.layers.Dense(1)
])
model.compile(loss=tf.keras.losses.mean_squared_error,
              optimizer=tf.keras.optimizers.SGD(),
              metrics=[tf.keras.metrics.MeanSquaredError()])
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
pd.DataFrame(history.history).plot()
plt.grid(True)
plt.show()
lacks the loss and val_loss curves.
What's the problem here?
Edit:
Here is the content of history.history:
{'loss': [0.880902886390686, 0.6208109855651855, 0.5102624297142029, 0.47074252367019653, 0.4556053578853607, 0.4464321732521057, 0.44210636615753174, 0.43378400802612305, 0.42544370889663696, 0.428415447473526], 'mean_squared_error': [0.880902886390686, 0.6208109855651855, 0.5102624297142029, 0.47074252367019653, 0.4556053578853607, 0.4464321732521057, 0.44210636615753174, 0.43378400802612305, 0.42544370889663696, 0.428415447473526], 'val_loss': [0.6332216262817383, 0.514700710773468, 0.4509757459163666, 0.46695834398269653, 0.5228265523910522, 0.6748611330986023, 0.6648175716400146, 0.7329052090644836, 0.8352308869361877, 1.081600546836853], 'val_mean_squared_error': [0.6332216262817383, 0.514700710773468, 0.4509757459163666, 0.46695834398269653, 0.5228265523910522, 0.6748611330986023, 0.6648175716400146, 0.7329052090644836, 0.8352308869361877, 1.081600546836853]}
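Your loss is the mean squared error and your metric is also the mean squared error, so the two are exactly the same. That means the curves overlap when you plot them: the loss and val_loss lines are still there, but the identical metric lines are drawn on top of them.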
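To illustrate the overlap, here is a small sketch using the first three epochs from the history above (rounded to four decimals). It checks that the columns are identical and then either drops the duplicated columns or draws the metric columns dashed so both stay visible. The plot_history helper is a hypothetical name.

```python
import numpy as np
import pandas as pd

# First three epochs of the posted history.history, rounded for brevity
history = {
    "loss": [0.8809, 0.6208, 0.5103],
    "mean_squared_error": [0.8809, 0.6208, 0.5103],
    "val_loss": [0.6332, 0.5147, 0.4510],
    "val_mean_squared_error": [0.6332, 0.5147, 0.4510],
}
df = pd.DataFrame(history)

# The metric columns carry exactly the same numbers as the loss columns
same_train = np.allclose(df["loss"], df["mean_squared_error"])
same_val = np.allclose(df["val_loss"], df["val_mean_squared_error"])
print(same_train, same_val)

# Option 1: keep only the first of each set of identical columns
df_unique = df.loc[:, ~df.T.duplicated()]
print(list(df_unique.columns))


def plot_history(frame):
    """Option 2: draw metric columns dashed so they do not hide the loss."""
    import matplotlib.pyplot as plt
    styles = ["--" if name.endswith("mean_squared_error") else "-"
              for name in frame.columns]
    frame.plot(style=styles)
    plt.grid(True)
    plt.show()
```

Calling plot_history(pd.DataFrame(history.history)) on the real training history would show dashed metric lines lying on top of solid loss lines.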
I'm trying to print the accuracy score for an XGBoost multilabel classifier. However, I'm stuck on this error:
ValueError: Classification metrics can't handle a mix of multilabel-indicator and binary targets
I think y_test must not be one-hot encoded when it is passed to accuracy_score()? But everything I've tried creates more errors. Any idea how to get this to work?
Code:
X = X.reshape(X.shape[0], -1)
print(X.shape)
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state=42, stratify=y)
dtrain = xgb.DMatrix(data=x_train, label=y_train)
dtest = xgb.DMatrix(data=x_test, label=y_test)
eval_list = [(dtest, 'eval')]
# Set the parameters
params = {
    'max_depth': 3,
    'objective': 'multi:softmax',
    'num_class': 3,
    'tree_method': 'gpu_hist'
}
# Train the model
model = xgb.train(params, dtrain, evals=eval_list, early_stopping_rounds=20, verbose_eval=True)
# Evaluate predictions
y_pred = model.predict(dtest)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Adding argmax to y_test seemed to work:
accuracy = accuracy_score(y_test.argmax(axis=1), predictions)
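A small sketch of why this works, using made-up labels rather than the data from the question: accuracy_score needs both arguments in the same label format, and argmax turns each one-hot row back into its class index, which matches the class ids returned by multi:softmax.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up one-hot targets and class-id predictions (illustrative only)
y_test = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])
predictions = np.array([0, 1, 2, 2])

# Mixing a one-hot matrix with a vector of class ids raises ValueError
raised = False
try:
    accuracy_score(y_test, predictions)
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Convert one-hot rows to class indices so both inputs share one format
accuracy = accuracy_score(y_test.argmax(axis=1), predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```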
I am using StratifiedKFold to check the performance of my classifier. I have two classes and I am trying to build a Logistic Regression classifier.
Here is my code:
r = []  # per-fold accuracy scores
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in skf.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    r.append(score)
    print(score)
print(np.mean(r))
I can print the score for each fold, but I couldn't figure out how to print the confusion matrix and classification report. If I just add a print statement inside the loop,
print(confusion_matrix(y_test, y_pred))
it will print it 10 times, but I want a single report and a single matrix for the final performance of the classifier.
Any help on how to calculate the matrix and the report would be appreciated. Thanks
Cross-validation is used to assess the performance of particular models or hyperparameters across different splits of a dataset. At the end you don't have a single final performance per se; you have the individual performance on each split and the aggregated performance across splits. You could use the tn, fn, fp, and tp counts from each split to compute an aggregated precision, recall, sensitivity, etc., but you could also just use the predefined functions for those metrics in sklearn and aggregate them at the end.
e.g.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accs, precs, recs = [], [], []
for train_index, test_index in skf.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Fit the vectorizer on the training fold only
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    accs.append(acc)
    precs.append(prec)
    recs.append(rec)
    print(f'Accuracy: {acc}, Precision: {prec}, Recall: {rec}')
print(f'Mean Accuracy: {np.mean(accs)}, Mean Precision: {np.mean(precs)}, Mean Recall: {np.mean(recs)}')
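If a single confusion matrix and classification report are wanted, one option (a sketch, not the only way to aggregate) is to sum the per-fold confusion matrices and to pool the out-of-fold predictions for one report. The tiny text dataset below is invented for illustration, and 5 splits are used only because the toy data is small.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold

# Invented two-class text data (an assumption, standing in for x and y)
good = [f"great product works well item{i}" for i in range(10)]
bad = [f"terrible product broke fast item{i}" for i in range(10)]
x = np.array(good + bad)
y = np.array([1] * 10 + [0] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
total_cm = np.zeros((2, 2), dtype=int)  # running sum of per-fold matrices
all_true, all_pred = [], []             # pooled out-of-fold labels and predictions
for train_index, test_index in skf.split(x, y):
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x[train_index])
    x_test = tfidf.transform(x[test_index])
    clf = LogisticRegression(class_weight="balanced")
    clf.fit(x_train, y[train_index])
    y_pred = clf.predict(x_test)
    # labels=[0, 1] keeps every fold's matrix 2x2 so they can be summed
    total_cm += confusion_matrix(y[test_index], y_pred, labels=[0, 1])
    all_true.extend(y[test_index])
    all_pred.extend(y_pred)

report = classification_report(all_true, all_pred)
print(total_cm)
print(report)
```

Because every sample lands in exactly one test fold, the summed matrix equals the confusion matrix of the pooled predictions.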