I am trying to plot the train_test_split results while maintaining the original indices; here is my code.
#df.insert(0, 'x', range(0, 0 + len(df)))
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1)

regressor = RandomForestClassifier()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
y_pred_train = regressor.predict(X_train)

plt.plot(X_train, y_pred_train, 'bo')
plt.show()
It seems like the y_pred values are being plotted against the wrong x-axis values, as there is a huge gap in the middle of the data and some overlapping points.
How can I make the corresponding x values of y_pred and y_pred_train appear at their original positions from the data frame?
You will need to include the index in the plot; y is usually represented as the colour of each point. Here is how to do so:
plt.scatter(X_test.index, X_test.values, c=y_pred)
plt.show()
Here is a random example:
The points coloured yellow belong to class 0 and the points coloured purple belong to class 1.
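A fuller sketch along the same lines (assuming the data live in a DataFrame df with a single feature column 'x' and numeric labels in a column 'y'; these names are placeholders): train_test_split preserves the pandas index, so plotting against X_train.index and X_test.index puts every point back at its original row position.

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Keep X as a DataFrame so the original row index survives the split
X_train, X_test, y_train, y_test = train_test_split(df[['x']], df['y'], test_size=0.1)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Plot train and test predictions at their original index positions,
# with the (numeric) predicted class encoded as the point colour
plt.scatter(X_train.index, X_train['x'], c=clf.predict(X_train), marker='o', label='train')
plt.scatter(X_test.index, X_test['x'], c=clf.predict(X_test), marker='s', label='test')
plt.legend()
plt.show()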
I have irradiated some radiochromic film at different doses and scanned it in as a TIFF (48-bit RGB) to use as a calibration. My dataset comprises three columns representing the colour channels (X) and the corresponding dose (y). The input data is of the form:
I have used the sklearn RandomForestRegressor to train and test successfully. When I use the following (net optical density RGB values):
X = df[['RedDensity', 'GreenDensity', 'BlueDensity']].values
y = df['Dose'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
regressor.predict(np.array([0.349, 0.296, 0.107]).reshape(1, 3))
I get a predicted value for the dose. My question is: how do I predict the values for a test image of shape (m x n x 3)? I could scan across the test image and read the RGB value of each element, but is there a more elegant way?
Thanks
James
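One possible approach, sketched here under the assumption that regressor is the fitted RandomForestRegressor above (predict_dose_map is a name chosen for illustration): reshape the (m, n, 3) image into an (m*n, 3) table with one RGB row per pixel, predict all pixels in a single call, and reshape the result back into an (m, n) dose map.

import numpy as np

def predict_dose_map(regressor, image):
    m, n, _ = image.shape
    flat_rgb = image.reshape(-1, 3)       # one row of (R, G, B) per pixel
    doses = regressor.predict(flat_rgb)   # vectorised prediction, no Python loop
    return doses.reshape(m, n)            # back to the image's spatial shape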
I have tried to create a confusion matrix for a KNN classifier in Python, but the class labels come out wrong.
The class attribute of the dataset takes the values 2 (for benign) and 4 (for malignant), but when I plot the confusion matrix, all the displayed labels are 2. The code I use is:
Data source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
KNN classifier on Breast Cancer Wisconsin (Diagnostic) Data Set from UCI:
import numpy as np
import pandas as pd
from sklearn import neighbors
from sklearn.model_selection import train_test_split

data = pd.read_csv('/breast-cancer-wisconsin.data')
data.replace('?', 0, inplace=True)
data.drop('id', axis=1, inplace=True)

X = np.array(data.drop(' class ', axis=1))
Y = np.array(data[' class '])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)
Plot confusion matrix
from sklearn.metrics import plot_confusion_matrix

disp = plot_confusion_matrix(clf, X_test, Y_test,
                             display_labels=Y,
                             cmap=plt.cm.Blues)
(Screenshot of the resulting confusion matrix.)
The problem is that you're specifying the display_labels argument as Y, whereas it should just be the target names used for plotting. As it stands, it uses the first two values that appear in Y, which happen to be 2, 2. Note too that, as mentioned in the docs, the displayed labels will be the same as those specified in labels if it is provided, so you just need:
from sklearn.metrics import plot_confusion_matrix

fig, ax = plt.subplots(figsize=(8, 8))
disp = plot_confusion_matrix(clf, X_test, Y_test,
                             labels=np.unique(Y),
                             cmap=plt.cm.Blues, ax=ax)
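If you are on a newer scikit-learn release (1.2 or later), plot_confusion_matrix has been removed; the equivalent call goes through ConfusionMatrixDisplay.from_estimator:

from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(8, 8))
ConfusionMatrixDisplay.from_estimator(clf, X_test, Y_test,
                                      labels=np.unique(Y),
                                      cmap=plt.cm.Blues, ax=ax)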
I have a dataset with X.shape (104481, 34) and y.shape (104481,), and I want to train an SVM model on it.
The steps I do are (1) Split data, (2) Scale data, and (3) Train SVM:
(1) Split data:
Function:
from sklearn.model_selection import train_test_split

def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=12, stratify=y)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data_set.split_data(X, y)
The 4 classes are the following. The data set is quite imbalanced, but that is an issue for later.
y_train.value_counts()
out:
Status_9_Substatus_8 33500
Other 33500
Status_62_Substatus_7 2746
Status_62_Substatus_30 256
Name: Status, dtype: int64
y_test.value_counts()
out:
Status_9_Substatus_8 16500
Other 16500
Status_62_Substatus_7 1352
Status_62_Substatus_30 127
Name: Status, dtype: int64
(2) Scale data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape)
print(y_train.shape)
(3) Train and predict with SVM:
svm_method.get_svm_model(X_train_scaled, X_test_scaled, y_train, y_test)
Calling this method:
def get_svm_model(X_train, X_test, y_train, y_test):
    print('Loading...')
    print('Training...')
    svm, y_train_pred, y_test_pred = train_svm_model(X_train, y_train, X_test)
    print('Training Complete')
    print('Plotting Confusion Matrix...')
    performance_measure.plot_confusion_matrix(y_test, y_test_pred, normalize=True)
    print('Plotting Performance Measure...')
    performance_measure.get_performance_measures(y_test, y_test_pred)
    return svm
Which calls this method:
from sklearn.svm import SVC

def train_svm_model(X_train, y_train, X_test):
    # Polynomial-kernel SVM
    svm = SVC(kernel='poly', gamma='auto', random_state=12)
    # Fitting the model
    svm.fit(X_train, y_train)
    # Predicting values
    y_train_pred = svm.predict(X_train)
    y_test_pred = svm.predict(X_test)
    return svm, y_train_pred, y_test_pred
The resulting output is this screenshot:
What is strange is that samples from all four classes are present (since I used the stratify parameter when calling train_test_split), yet it looks like some of the classes disappear from the predictions.
The SVM and confusion matrix functions worked well with a toy data set:
from sklearn.datasets import load_wine
data = load_wine()
X = pd.DataFrame(data.data, columns = data.feature_names)
y = pd.DataFrame(data.target)
y = np.array(y)
y = np.ravel(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
svm, y_train_pred, y_test_pred = train_svm_model(X_train, y_train, X_test)
get_svm_model(X_train, X_test, y_train, y_test)
Any idea what is going on here?
Thanks in advance.
The CM code:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

def plot_confusion_matrix(y_true, y_pred,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    classes = unique_labels(y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    plt.show()
    return ax
Your confusion matrix is not all zeros:
The x-axis shows the predicted label and the y-axis the true label. Let's have a look at the third row from the top:
0.94 of the samples with true label Status_62_Substatus_7 are predicted as class Other, which is wrong.
0.00 of that same true label are predicted as another wrong class.
0.00 of that same true label are predicted as the correct class (this is the cell that should be high; higher is better).
0.06 are again predicted as a wrong class.
Your problem is so imbalanced that you simply get zero predictions for two of the labels.
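A minimal sketch of one common mitigation, not part of the original answer: give the rare classes more weight so they are not drowned out by the two large ones.

from sklearn.svm import SVC

# class_weight='balanced' weights each class by n_samples / (n_classes * class count),
# so the two rare classes contribute more to the fit
svm = SVC(kernel='poly', gamma='auto', random_state=12, class_weight='balanced')
svm.fit(X_train_scaled, y_train)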
I'm trying to plot a scatter plot of the actual sales values (y) against the predicted sales values (ŷ).
I have imported the csv file, and the code I currently have for the linear regression model is:
result = smf.ols('sales ~ discount + holiday + product', data=data).fit()
print(result.summary())
Since I only have the actual sales values, how do I find the predicted sales (ŷ) values to plot the scatter plot? I have tried researching and found lm.predict() and result.predict() (where lm = LinearRegression()). Is there a difference?
Thank you in advance!
Without the data it is hard to help, but I guess you have X and y from your dataset, since you want to perform linear regression. You can split the data into training and test sets using scikit-learn:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3)
Then you need to fit linear regression to the training set:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
and afterwards predict test set results:
y_pred = regressor.predict(X_test)
Finally, you can plot your test or training results:
# Visualising the Training set results
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Discount vs Sales (Training set)')
plt.xlabel('Discount percentage')
plt.ylabel('Sales')
plt.show()
# Visualising the Test set results
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Discount vs Sales (Test set)')
plt.xlabel('Discount percentage')
plt.ylabel('Sales')
plt.show()
(In this scenario we want to predict how many sales we will get if we set a specific value of, e.g., the discount percentage.) If you have more than one X parameter, things are more complicated: you will need dummy variables, further statistical analysis, and so on.
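Since the question fits the model with smf.ols, a minimal sketch of that route as well (assuming the same data DataFrame): the fitted result exposes the in-sample predictions via fittedvalues, while result.predict(new_data) scores new observations.

import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

result = smf.ols('sales ~ discount + holiday + product', data=data).fit()
y_hat = result.fittedvalues   # predicted sales for the rows already in `data`

plt.scatter(data['sales'], y_hat)
plt.xlabel('Actual sales')
plt.ylabel('Predicted sales')
plt.show()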
TL;DR: I'm trying to understand the meaning of the train_score_ attribute of a GradientBoostingClassifier, and specifically why it doesn't match my following attempt to calculate it directly:
my_train_scores = [clf.loss_(y_train, y_pred) for y_pred in clf.staged_predict(X_train)]
More details: I'm interested in the loss scores for both the test and the train data during the different fit stages of the classifier. I can use staged_predict and loss_ to calculate the loss scores for the test data:
test_scores = [clf.loss_(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
I'm okay with that. My problem is with the train loss scores. The documentation suggests using clf.train_score_:
The i-th score train_score_[i] is the deviance (= loss) of the model
at iteration i on the in-bag sample. If subsample == 1 this is the
deviance on the training data.
yet these clf.train_score_ values do not match my attempt to calculate them directly in my_train_scores above. What am I missing here?
The code I used:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2()
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = GradientBoostingClassifier(n_estimators=5, loss='deviance')
clf.fit(X_train, y_train)
test_scores = [clf.loss_(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
print(test_scores)
print(clf.train_score_)

my_train_scores = [clf.loss_(y_train, y_pred) for y_pred in clf.staged_predict(X_train)]
print(my_train_scores, '<= NOT the same values as in the previous line. Why?')
Producing e.g. this output...
[0.71319004170311229, 0.74985670836977902, 0.79319004170311214, 0.55385670836977885, 0.32652337503644546]
[ 1.369166 1.35366377 1.33780865 1.32352935 1.30866325]
[0.65541226392533436, 0.67430115281422309, 0.70807893059200089, 0.51096781948088987, 0.3078567083697788] <= NOT the same values as in the previous line. Why?
...where the last two rows do not match.
The attribute self.train_score_ is recreated in the following way (note that the loss is evaluated on the raw scores from staged_decision_function, not on the hard class labels returned by staged_predict, which is why your values differ):
test_dev = []
for i, pred in enumerate(clf.staged_decision_function(X_test)):
    test_dev.append(clf.loss_(y_test, pred))
ax = plt.gca()
ax.plot(np.arange(clf.n_estimators) + 1, test_dev, color='#d7191c', label='Test', linewidth=2, alpha=0.7)
ax.plot(np.arange(clf.n_estimators) + 1, clf.train_score_, color='#2c7bb6', label='Train', linewidth=2, alpha=0.7, linestyle='--')
ax.set_xlabel('n_estimators')
plt.legend()
plt.show()
See the result below. Note that the curves are on top of each other as the training and test data are the same data.
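By the same pattern, a hedged sketch of the training-side values (assuming subsample=1, so that, per the documentation quoted above, train_score_ is the deviance on the full training data):

# Evaluate the loss on the raw decision values for the training set at each stage;
# with subsample == 1 these values should line up with clf.train_score_.
train_dev = [clf.loss_(y_train, pred)
             for pred in clf.staged_decision_function(X_train)]
print(train_dev)
print(clf.train_score_)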