I have a dataset with X.shape (104481, 34) and y.shape (104481,), and I want to train an SVM model on it.
The steps I do are (1) Split data, (2) Scale data, and (3) Train SVM:
(1) Split data:
Function:
from sklearn.model_selection import train_test_split
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=12, stratify=y)
    return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = split_data_set.split_data(X,y)
The 4 classes are the following. The data set is quite imbalanced, but that is an issue for later.
y_train.value_counts()
out:
Status_9_Substatus_8 33500
Other 33500
Status_62_Substatus_7 2746
Status_62_Substatus_30 256
Name: Status, dtype: int64
y_test.value_counts()
out:
Status_9_Substatus_8 16500
Other 16500
Status_62_Substatus_7 1352
Status_62_Substatus_30 127
Name: Status, dtype: int64
(2) Scale data:
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape)
print(y_train.shape)
(3) Train and predict with SVM:
svm_method.get_svm_model(X_train_scaled, X_test_scaled, y_train, y_test)
Calling this method:
def get_svm_model(X_train, X_test, y_train, y_test):
    print('Loading...')
    print('Training...')
    svm, y_train_pred, y_test_pred = train_svm_model(X_train, y_train, X_test)
    print('Training Complete')
    print('Plotting Confusion Matrix...')
    performance_measure.plot_confusion_matrix(y_test, y_test_pred, normalize=True)
    print('Plotting Performance Measure...')
    performance_measure.get_performance_measures(y_test, y_test_pred)
    return svm
Which calls this method:
from sklearn.svm import SVC

def train_svm_model(X_train, y_train, X_test):
    # Build the classifier
    svm = SVC(kernel='poly', gamma='auto', random_state=12)
    # Fitting the model
    svm.fit(X_train, y_train)
    # Predicting values
    y_train_pred = svm.predict(X_train)
    y_test_pred = svm.predict(X_test)
    return svm, y_train_pred, y_test_pred
The resulting output is this screenshot.
What is strange is that samples from all four classes are present (since I used the stratify parameter when calling train_test_split), yet it looks like some of the classes disappear in the output.
The SVM and confusion matrix functions worked well with a toy data set:
from sklearn.datasets import load_wine
data = load_wine()
X = pd.DataFrame(data.data, columns = data.feature_names)
y = pd.DataFrame(data.target)
y = np.array(y)
y = np.ravel(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
svm, y_train_pred, y_test_pred = train_svm_model(X_train, y_train, X_test)
get_svm_model(X_train, X_test, y_train, y_test)
Any idea what is going on here?
Thanks in advance.
The CM code:
def plot_confusion_matrix(y_true, y_pred,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    classes = unique_labels(y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")
    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    plt.show()
    return ax
Your confusion matrix is not zero:
The x-axis shows the predicted labels and the y-axis the true labels. Let's have a look at the third row from the top:
0.94: 94% of the true label Status_62_Substatus_7 are predicted as class Other, which is wrong.
0.00 of the same true label are predicted as the next class, also wrong.
0.00 of the same true label are predicted correctly (this is the value that should be high).
0.06 are again predicted wrong.
Your problem is so imbalanced that you simply get zero predictions for two of the labels.
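If you do come back to the imbalance later, one common first step is to reweight the classes inside the SVM. This is only a sketch (it reuses the scaled arrays and SVC settings from your question, with class_weight added), not something you already have:
from sklearn.svm import SVC
# class_weight='balanced' weights each class by n_samples / (n_classes * class_count),
# so the two rare classes are not drowned out during training
svm = SVC(kernel='poly', gamma='auto', class_weight='balanced', random_state=12)
svm.fit(X_train_scaled, y_train)
y_test_pred = svm.predict(X_test_scaled)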
Related
I made a random forest regression with filters for the y and x variables, and I also wanted to add a Shapley-value overview by creating a graph and a table with one column for the variable and one column for its Shapley value. The code plots the graph, but the table is not showing.
So far my code looks like this:
x = widgets.SelectMultiple(
    options=list(dataset.select_dtypes('number').columns),
    disabled=False,
    value=("NUMBER_SPOTS",)
)
def randomforest(y, x):
    x = dataset[list(x)]
    y = dataset[y]
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
    shap.initjs()
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    y_predict = model.predict(X_test)
    print('Mean Squared Error:', mean_squared_error(y_test, y_predict)**(0.5))
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train)
    shap.summary_plot(shap_values, features=X_train, feature_names=X_train.columns, plot_size=[15, 8])
    shap_vals = shap_values[0, :]
    feature_importance = pd.DataFrame(list(zip(X_train.columns, shap_vals)), columns=['X_train', 'shap_vals'])
    feature_importance.sort_values(by=['shap_vals'], ascending=False, inplace=True)
    feature_importance
interact(randomforest, y = list(dataset.select_dtypes('number').columns), x = x)
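A likely reason the table never shows up (this is an assumption, not something visible in the output above) is that a bare feature_importance expression inside a function is silently discarded; only the last expression of a notebook cell is auto-displayed. A minimal, self-contained sketch of the idea, using IPython's display with a toy DataFrame standing in for the real table:
import pandas as pd
from IPython.display import display  # available inside Jupyter/IPython

def show_table():
    # Toy stand-in for the Shapley-value table built inside randomforest()
    feature_importance = pd.DataFrame({'X_train': ['feat_a', 'feat_b'],
                                       'shap_vals': [0.42, -0.13]})
    feature_importance.sort_values(by=['shap_vals'], ascending=False, inplace=True)
    # A bare `feature_importance` here would do nothing; display() renders it explicitly
    display(feature_importance)

show_table()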
I'm trying to find the slope and y-intercept coefficients for a linear equation. I created a test domain and range to make sure the numbers I was receiving were correct. The equation should be y = 2x + 1, but the model is saying the slope is 24 and the y-intercept is 40.3125. The model accurately predicts every value I give it, but I'm questioning how I can get the proper values.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = np.arange(0, 40)
y = (2 * X) + 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
X_train = [[i] for i in X_train]
X_test = [[i] for i in X_test]
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('Coefficients: \n', regr.coef_)
print('Y-intercept: \n', regr.intercept_)
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred))
print('Coefficient of determination: %.2f'
      % r2_score(y_test, y_pred))
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
print(X_test)
plt.xticks()
plt.yticks()
plt.show()
This is happening because you scaled your training and testing data. Even though you generated y as a linear function of X, you converted X_train and X_test to another scale by standardizing them (subtracting the mean and dividing by the standard deviation), so the fitted slope and intercept refer to the scaled X, not the original one.
If we run your code but omit the lines where you scale the data, you get the expected results.
X = np.arange(0, 40)
y = (2 * X) + 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
X_train = [[i] for i in X_train]
X_test = [[i] for i in X_test]
# Skip the scaling of X_train and X_test
#sc = StandardScaler()
#X_train = sc.fit_transform(X_train)
#X_test = sc.transform(X_test)
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('Coefficients: \n', regr.coef_)
> Coefficients:
[2.]
print('Y-intercept: \n', regr.intercept_)
> Y-intercept:
1.0
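If you do want to keep the scaling, the slope and intercept on the original x scale can be recovered from the fitted model together with the scaler. A sketch, assuming the same StandardScaler setup as in the question:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(0, 40).reshape(-1, 1)
y = 2 * X.ravel() + 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
regr = LinearRegression().fit(X_train_scaled, y_train)

# x_scaled = (x - mean) / std, so:
#   slope_orig     = slope_scaled / std
#   intercept_orig = intercept_scaled - slope_scaled * mean / std
slope = regr.coef_ / sc.scale_
intercept = regr.intercept_ - np.sum(regr.coef_ * sc.mean_ / sc.scale_)
print(slope)      # ~[2.]
print(intercept)  # ~1.0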
I have tried to create a confusion matrix for a KNN classifier in Python, but the labelled classes are wrong.
The class attribute of the dataset is 2 (for benign) or 4 (for malignant), but when I plot the confusion matrix, all labels show 2. The code I use is:
Data source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
KNN classifier on Breast Cancer Wisconsin (Diagnostic) Data Set from UCI:
data = pd.read_csv('/breast-cancer-wisconsin.data')
data.replace('?', 0, inplace=True)
data.drop('id', 1, inplace = True)
X = np.array(data.drop(' class ', 1))
Y = np.array(data[' class '])
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)
Plot confusion matrix
from sklearn.metrics import plot_confusion_matrix
disp = plot_confusion_matrix(clf, X_test, Y_test,
                             display_labels=Y,
                             cmap=plt.cm.Blues)
Confusion matrix
The problem is that you're passing Y as the display_labels argument, where it should just be the target names used for plotting. As it stands, the plot uses the first two values that appear in Y, which happen to be 2, 2. Note too that, as mentioned in the docs, the displayed labels will be the same as those specified in labels if it is provided, so you just need:
from sklearn.metrics import plot_confusion_matrix

fig, ax = plt.subplots(figsize=(8, 8))
disp = plot_confusion_matrix(clf, X_test, Y_test,
                             labels=np.unique(Y),
                             cmap=plt.cm.Blues, ax=ax)
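If you also want human-readable names on the axes instead of 2 and 4, you can additionally pass display_labels in the same order as labels (an optional extension of the answer above, reusing clf, X_test and Y_test):
fig, ax = plt.subplots(figsize=(8, 8))
disp = plot_confusion_matrix(clf, X_test, Y_test,
                             labels=[2, 4],                           # the actual class values
                             display_labels=['benign', 'malignant'],  # names shown on the plot
                             cmap=plt.cm.Blues, ax=ax)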
I'm not able to see my resulting accuracy score in my final graph, and I get warnings that precision/recall are ill-defined, even though I don't see any 0's.
I'm using this yeast data: https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.data
I've tried making the whole set my training set by making train_frac=1.
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.naive_bayes import GaussianNB
df = pd.read_csv("<my_dir>",names = ['sample','mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc','site'])
df=df.drop(columns=['sample'])
model_type = GaussianNB()
target = 'site'
train_frac = 0.5
Y = df[target]
df2 = df.drop(columns=[target])
# df2.columns now contains everything but 'site'
X = df[df2.columns[:]]
def naive_split(X, Y, n):
    # Take first n lines of X and Y for training and the rest for testing
    X_train = X[:n]
    X_test = X[n:]
    Y_train = Y[:n]
    Y_test = Y[n:]
    return (X_train, X_test, Y_train, Y_test)
def train_model(n=int(train_frac*df.shape[0])):
    X_train, X_test, Y_train, Y_test = naive_split(X, Y, n)
    clf = model_type
    clf = clf.fit(X_train, Y_train)
    return (X_test, Y_test, clf)
X_test, Y_test, clf = train_model()
import sklearn.metrics as metrics
from sklearn import model_selection
sizes = np.arange(0.98,0.01, -0.02)
result = {}
for size in sizes:
    X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
        X, Y, test_size=size, random_state=200)
    clf = model_type
    clf = clf.fit(X_train, Y_train)
    score = clf.score(X_test, Y_test)
    precision = metrics.precision_score(Y_test, clf.predict(X_test), average='weighted')
    recall = metrics.recall_score(Y_test, clf.predict(X_test), average='weighted')
    result[len(Y_train)] = (score, precision, recall)
result = pd.DataFrame(result).transpose()
result.columns = ['Accuracy','Precision', 'Recall']
result.plot(marker='*', figsize=(15,5))
plt.title('Metrics measures using random train/test splitting')
plt.xlabel('Size of training set')
plt.ylabel('Value');
Instead, I get the following warnings when I expect it to run without them:
C:\Users\<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Users\<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\classification.py:1137: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)
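Those warnings usually mean that, for some of the splits, certain yeast classes are never predicted (or never appear) in the test set, so their per-class precision or recall would divide by zero and sklearn sets it to 0.0 instead. A small diagnostic sketch (an assumption about the cause, reusing clf, X_test and Y_test from the loop) to see which classes are affected:
import numpy as np
y_pred = clf.predict(X_test)
# classes present in the test set but never predicted -> precision is ill-defined for them
print('never predicted:', np.setdiff1d(np.unique(Y_test), np.unique(y_pred)))
# classes seen during training but absent from this test split -> recall is ill-defined for them
print('missing from test set:', np.setdiff1d(clf.classes_, np.unique(Y_test)))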
I am trying to plot the train_test_split results while maintaining the indices; here is my code.
#df.insert(0, 'x', range(0, 0 + len(df)))
X_train, X_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=.1)
regressor = RandomForestClassifier()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
plt.plot(X_train,y_pred_train,'bo')
plt.show()
It seems like the y_pred values are plotted against the wrong x-axis values, as there is a huge gap in the middle of the data and some overlapping points.
How can I make the corresponding x-values of y_pred and y_pred_train appear at their original positions from the data frame?
You will need to include the index in the plot. Usually y is represented as the colour of each point. Here is how to do so:
plt.scatter(X_test.index,X_test.values,c=y_predict_test)
plt.show()
Here is a random example: the points coloured yellow belong to class 0 and the points coloured purple belong to class 1.
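For completeness, a minimal sketch of plotting both splits at their original DataFrame positions (assuming a single feature column, and that y_pred_train comes from regressor.predict(X_train) as in your code):
y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)

# x-axis is the original row index, so train and test points fall back into place
plt.scatter(X_train.index, X_train.values, c=y_pred_train, marker='o', label='train')
plt.scatter(X_test.index, X_test.values, c=y_pred_test, marker='x', label='test')
plt.xlabel('original DataFrame index')
plt.legend()
plt.show()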