ROC curve for each class - python

Code:
import pandas as pd
from sklearn.preprocessing import label_binarize, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

df = pd.read_csv(r"model_data_TBI_3(o).csv", index_col=["X", "Y"])
X = df.drop("class", axis="columns")
y = df["class"].values

# Binarize the 1/2/3 labels into an (n_samples, 3) indicator matrix
y = label_binarize(y, classes=[1, 2, 3], pos_label=1, neg_label=0)
n_classes = y.shape[1]
print(n_classes)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45, train_size=0.10)

mm = MinMaxScaler()
mm.fit(X_train)
X_train_mm = mm.transform(X_train)
X_test_mm = mm.transform(X_test)

svm_tuned = SVC(kernel="rbf", C=2.33, gamma=0.00046, probability=True,
                break_ties=False, decision_function_shape="ovo",
                random_state=1, shrinking=True, tol=0.12,
                class_weight={1: 100, 2: 75, 3: 55})
y_score = svm_tuned.fit(X_train_mm, y_train).decision_function(X_test_mm)
Result:
ValueError: y should be a 1d array, got an array of shape (329, 3) instead.
I want to plot the ROC curve of my SVM model. My data has classes 1, 2, and 3. I binarized them with label_binarize, but I am still getting this error; the traceback points to the y_score line. To anyone willing to help: please don't just send me the iris-dataset example, I can look that up on the sklearn website. An explanation, or code written for this problem, would be much appreciated. Sorry for any mistakes in how I posted the question; I am new to Stack Overflow and to Python and machine learning.
Thank you
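One likely explanation: SVC.fit expects a 1-D array of class labels, but after label_binarize, y is an (n_samples, 3) indicator matrix, which is exactly what the error message complains about. A common way to get one ROC curve per class is to wrap the estimator in OneVsRestClassifier, which fits one binary SVC per class and whose decision_function returns one score column per class. A minimal sketch, reusing the variable names from the question (the SVC hyperparameters here are placeholders, not a recommendation):
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# One binary SVC per class; y_train stays in its binarized (n_samples, 3) form.
clf = OneVsRestClassifier(SVC(kernel="rbf", probability=True, random_state=1))
y_score = clf.fit(X_train_mm, y_train).decision_function(X_test_mm)  # shape (n_samples, 3)

# Column i of y_test / y_score is the one-vs-rest problem for class i.
for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label="class %d (AUC = %.2f)" % (i + 1, auc(fpr, tpr)))

plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()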

Related

Fit linear model on all data after doing it just on training data in order to make predictions about future data

In linear regression (and, more generally, in any other model), is there a need to call .fit() on all of the input data after having fit only the training data? I'll try to explain with an example. I will artificially create a trivial set of input and output data:
import numpy as np
from random import uniform
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
X = np.arange(0, 500, 5).reshape(-1, 1)
y = np.zeros(100)
for i in range(len(X)):
    y[i] = 3 * X[i] + uniform(-10, 10)
X_train = X[:80]
y_train = y[:80]
X_test = X[80:]
y_test = y[80:]
reg = LinearRegression()
It could be seen as a time series (X contains the time periods).
Now, I train the model using the training sets:
reg.fit(X_train, y_train)
and I make predictions using the testing set:
y_pred = reg.predict(X_test)
Using r2_score(y_test, y_pred) I get about 0.99, so my model is able to fit the data correctly.
Now my question: I want to make a prediction about the future, i.e. about a series of data not contained in X, let's say:
X_future = np.arange(500, 525, 5).reshape(-1, 1)
Should I proceed this way:
y_future = reg.predict(X_future)
or this way:
reg.fit(X, y)
y_future = reg.predict(X_future)
In other words, do I have to refit the model found earlier on the entire (X, y) before making predictions about the future? Is there any chance that both approaches are correct?
Thanks to anyone who will answer me.
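A general note (an addition, not taken from the question above): the usual practice is to use the train/test split only to estimate how well the chosen model generalizes; once that estimate is acceptable, refit the same model on all of (X, y) and use the refitted model for forecasting, since it then also benefits from the most recent observations. For a simple, well-specified linear trend like this one, both approaches will give very similar forecasts, so neither is wrong. A minimal sketch:
# 1) Validate on held-out data
reg = LinearRegression().fit(X_train, y_train)
print(r2_score(y_test, reg.predict(X_test)))   # about 0.99 in this example

# 2) Refit on everything, then forecast the future
reg_final = LinearRegression().fit(X, y)
X_future = np.arange(500, 525, 5).reshape(-1, 1)
y_future = reg_final.predict(X_future)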

What KNN accuracy is shown here?

I use the KNN algorithm from the sklearn library. In the last line of code, I get an accuracy of 95.5%. Does that mean that 95.5% of my test dataset X_test is predicted correctly?
Here is a short part of my script.
# Model, predict and solve
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
# KNN
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
How can I see the wrongly predicted rows?
Thank you!
Your predictions are in the Y_pred variable. Print it along with the y_test variable; you can then compare them and see which observations are predicted correctly and which are misclassified.
PS: From your code snippet, it's not clear whether you actually have a y_test variable.
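A side note and a sketch (an addition, not part of the answer above): as written, acc_knn = knn.score(X_train, Y_train) is measured on the training data, so the 95.5% is training accuracy, not test accuracy. If a labeled test set y_test does exist, the misclassified rows can be listed roughly like this:
import numpy as np

Y_pred = knn.predict(X_test)
wrong_mask = Y_pred != y_test          # boolean mask of misclassified rows
print(X_test[wrong_mask])              # the feature rows that were predicted incorrectly
print("test accuracy:", np.mean(Y_pred == y_test))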

Error when calculating predicted values of polynomial regression python

I am trying to calculate predicted values after running a polynomial regression in Python using the following code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x) + x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
X = X_train.reshape(-1, 1)
X_predict = np.linspace(0, 10, 100)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X)
model = LinearRegression()
reg_poly = model.fit(X_train_poly, y_train)
y_predict = model.predict(X_predict)
After running it I get the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[ 0. 0.1010101 0.2020202 0.3030303 0.4040404 0.50505051 ......
Reshape your data either using array.reshape(-1, 1)
if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I tried reshaping the array as was said in the error message, so the last line of code would be:
y_predict = model.predict(X_predict.reshape(-1,1))
But as a result I got this error:
ValueError: shapes (100,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)
Can someone please explain what I am doing wrong?
You forgot to prepare the data for your prediction in the same way you prepared the training data for the model. In particular, you forgot to fit_transform your X_predict with PolynomialFeatures.
Since the shape of the data used for prediction has to match the shape used for training exactly, you need to recreate for X_predict everything you did for X_train_poly (which you used for training). Your last line should therefore look like:
y_predict = model.predict(poly.fit_transform(X_predict.reshape(-1, 1)))
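A small aside (an addition to the answer above): since poly has already been fitted on the training data, calling transform() alone is enough and is the usual pattern, even though for PolynomialFeatures the difference is mostly cosmetic. The end of the corrected snippet could then look like this:
X_predict = np.linspace(0, 10, 100).reshape(-1, 1)   # 2-D: 100 samples, 1 feature
X_predict_poly = poly.transform(X_predict)           # same degree-2 expansion as training
y_predict = model.predict(X_predict_poly)            # shape (100,)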

Function for cross validation and oversampling (SMOTE)

I wrote the below code. X is a dataframe with the shape (1000,5) and y is a dataframe with shape (1000,1). y is the target data to predict, and it is imbalanced. I want to apply cross validation and SMOTE.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, recall_score, f1_score
from sklearn.linear_model import SGDClassifier
from imblearn.over_sampling import SMOTE

def Learning(n, est, X, y):
    s_k_fold = StratifiedKFold(n_splits=n)
    acc_scores = []
    rec_scores = []
    f1_scores = []

    for train_index, test_index in s_k_fold.split(X, y):
        X_train = X[train_index]
        y_train = y[train_index]

        # Oversample only the training fold
        sm = SMOTE(random_state=42)
        X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

        X_test = X[test_index]
        y_test = y[test_index]

        est.fit(X_resampled, y_resampled)
        y_pred = est.predict(X_test)

        acc_scores.append(accuracy_score(y_test, y_pred))
        rec_scores.append(recall_score(y_test, y_pred))
        f1_scores.append(f1_score(y_test, y_pred))

    print('Accuracy:', np.mean(acc_scores))
    print('Recall:', np.mean(rec_scores))
    print('F1:', np.mean(f1_scores))
Learning(3, SGDClassifier(), X_train_s_pca, y_train)
When I run the code, I get the below error:
None of [Int64Index([ 4231,  4235,  4246,  4250,  4255,  4295,  4317,  4344,  4381,  4387,
            ...
            13122, 13123, 13124, 13125, 13126, 13127, 13128, 13129, 13130, 13131],
           dtype='int64', length=8754)] are in the [columns]"
Any help getting this to run is appreciated.
If you look carefully at the error stack trace (which is important, but you did not include it), you will see that the error comes from lines like this one:
X_train = X[train_index]
This way of selecting rows only works for NumPy arrays. Since you are using Pandas DataFrames, and split() returns positional indices, you should select rows with iloc:
X_train = X.iloc[train_index]
Alternatively, you can convert the DataFrames to NumPy arrays (to minimize code changes) by using values:
Learning(3, SGDClassifier(), X_train_s_pca.values, y_train.values)
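For reference (an addition, not part of the answer above), the full row selection inside the loop would then look like this, assuming X and y are pandas DataFrames as described in the question:
for train_index, test_index in s_k_fold.split(X, y):
    # iloc selects rows by position, which is what split() yields
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # ... SMOTE, fitting and scoring as before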

Python sklearn polynomial preprocessing and dimensional problems

I am experimenting with fitting polynomial transformations of degree 1 to 3 to the original data, using 100 predicted values for each. I first 1) reshaped the original data, 2) applied fit_transform to the test set and to the prediction space (of data features), 3) obtained linear predictions on the prediction space, and 4) exported them into an array, using the following code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 100
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+n/6 + np.random.randn(n)/10
x = x.reshape(-1, 1)
y = y.reshape(-1, 1)
pred_data = np.linspace(0,10,100).reshape(-1,1)
results = []
for i in [1, 2, 3]:
    poly = PolynomialFeatures(degree=i)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    x_poly1 = poly.fit_transform(x_train)
    pred_data = poly.fit_transform(pred_data)
    linreg1 = LinearRegression().fit(x_poly1, y_train)
    pred = linreg1.predict(pred_data)
    results.append(pred)
results
However, I did not get what I wanted: Python did not return an array of shape (3, 100) as I was expecting, and I received an error message instead:
ValueError: shapes (100,10) and (4,1) not aligned: 10 (dim 1) != 4 (dim 0)
It seems to be a dimensional problem resulting either from the "reshape" step or from the "fit_transform" step. I am confused, as this was supposed to be a straightforward test. Would anyone enlighten me on this? It would be much appreciated.
Thank you.
Sincerely,
First, as I suggested in a comment, you should always call just transform() on test data (pred_data in your case).
But even if you do that, a different error occurs. The error is due to this line:
pred_data = poly.fit_transform(pred_data)
Here you are replacing the original pred_data with the transformed version. It therefore works for the first iteration of the loop, but becomes invalid for the second and third iterations, because they need the original pred_data of shape (100, 1) defined in this line above the loop:
pred_data = np.linspace(0,10,100).reshape(-1,1)
Change the name of the variable inside the loop to something else and everything works fine:
for i in [1, 2, 3]:
    poly = PolynomialFeatures(degree=i)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    x_poly1 = poly.fit_transform(x_train)
    # Changed here
    pred_data_poly1 = poly.transform(pred_data)
    linreg1 = LinearRegression().fit(x_poly1, y_train)
    pred = linreg1.predict(pred_data_poly1)
    results.append(pred)
results
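One extra note (an addition, not part of the answer above): each pred has shape (100, 1) because y was reshaped into a column vector, so results ends up as a list of three (100, 1) arrays rather than a (3, 100) array. If a (3, 100) array is the goal, something like this should do it:
results_arr = np.array(results).reshape(3, 100)   # one row per polynomial degree
print(results_arr.shape)                          # (3, 100)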
