Confusion matrix report accuracy problem in Jupyter - Python

I want to plot a confusion matrix to visualize the classifier's performance, but the accuracy and recall do not show.
[Accuracy screenshot]

I don't see any data here, or any code either. Anyway, this works for me.
from sklearn.metrics import classification_report
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 10-class problem
X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=12,
                           n_clusters_per_class=1, n_classes=10,
                           class_sep=2.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y)
clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)

# classification_report expects y_true first, then y_pred
df = pd.DataFrame(classification_report(y_test,
                                        clf.predict(X_test), digits=2,
                                        output_dict=True)).T
df['support'] = df.support.apply(int)
# In a notebook, display the styled report
df.style.background_gradient(cmap='viridis',
                             subset=pd.IndexSlice['0':'9', :'f1-score'])

import seaborn as sns
sns.heatmap(df, annot=True)
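The heatmap above visualizes the classification report. If the goal is the confusion matrix itself, here is a minimal sketch reusing clf and the test split from above (confusion_matrix puts true classes on rows and predictions on columns):
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, clf.predict(X_test))
sns.heatmap(cm, annot=True, fmt='d', cmap='viridis')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()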

Related

KNN: why is my variable not defined in Python?

I am working on an assignment and I ran into this error. I am using Python to perform KNN on a data set. I'm pretty sure I defined the variable, but it says otherwise. The code is below.
import pandas as PD
import numpy as np
import matplotlib.pyplot as mtp
data_set= PD.read_csv('hw6.data.csv.gz')
x= data_set.iloc[:,[2,3]].valuesS
y= data_set.iloc[:, 4].values
from sklearn.model_selection import train_test_split
x_train, x_train, y_train, y_train= train_test_split(x,y, test_size=.25, random_state=0)
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
The error says "x_test" is not defined: Pylance (reportUndefinedVariable)
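The unpacking line assigns to x_train and y_train twice and never creates x_test or y_test, which is why Pylance reports x_test as undefined. (There is also a stray S in .valuesS.) For reference, a corrected version of the split and scaling, reusing the names from the code above:
x = data_set.iloc[:, [2, 3]].values   # was .valuesS (typo)
y = data_set.iloc[:, 4].values
# Unpack into four distinct names: train/test for x, train/test for y
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)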

Why does score from LinearRegression give a different result than r2_score from sklearn.metrics?

Ideally I should get the same result, since score is nothing but R-squared. But I am not sure why the results come out different.
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
data.data.shape
data.feature_names
data.target_names
import pandas as pd
house_data = pd.DataFrame(data.data, columns=data.feature_names)
house_data.describe()
house_data['Price'] = data.target
X = house_data.iloc[:, 0:8].values
y = house_data.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
#Check R-square on training data
from sklearn.metrics import mean_squared_error, r2_score
y_pred = linear_model.predict(X_test)
print(linear_model.score(X_test, y_test))
print(r2_score(y_pred, y_test))
Output
0.5957643114594776
0.34460597952465033
From the docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
sklearn.metrics.r2_score(y_true, y_pred, ...)
You are passing y_true and y_pred the wrong way around. If you switch them, you get the correct result.
print(linear_model.score(X_test, y_test))
print(r2_score(y_test, y_pred))
0.5957643114594777
0.5957643114594777
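The reason the order matters: r2_score is not symmetric. R^2 = 1 - SS_res / SS_tot, and SS_tot is computed from the first argument (the deviation of y_true around its own mean), so swapping the arguments changes the denominator. A quick toy check with made-up numbers:
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.9])
print(r2_score(y_true, y_pred))  # SS_tot from y_true
print(r2_score(y_pred, y_true))  # SS_tot from y_pred: a different value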

How to avoid Collection Error Python Numpy

I am trying to train a LinearRegression model to continue a graph.
I have a couple of thousand lines of data in my CSV file that I import into numpy arrays. Here is my code:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import csv
import math
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
def predict():
    sample_data = pd.read_csv("includes\\csv.csv")
    x = np.array(sample_data["day"])
    y = np.array(sample_data["balance"])
    for x in x:
        x = x.reshape(1, -1)
        #lol
    for y in y:
        y.reshape(1, -1)
        #lol
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    clf = LinearRegression()
    clf.fit(x_train, y_train)
    clf.score(x_test, y_test)
When I run this, the error is:
TypeError: Singleton array 6014651 cannot be considered a valid collection.
Any ideas why that's a thing?
After discussion in comments:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import csv
import math
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def predict():
    sample_data = pd.read_csv("includes\\csv.csv")
    x = np.array(sample_data["day"])
    y = np.array(sample_data["balance"])
    # Reshape the whole arrays once: (n,) -> (n, 1)
    x = x.reshape(-1, 1)
    y = y.reshape(-1, 1)
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    clf = LinearRegression()
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
X_train and X_test should be capitalized; Python variables are case-sensitive. The TypeError itself comes from the original loops: for x in x rebinds x to the last element of the array (a single number such as 6014651), so train_test_split received a scalar instead of a collection. Reshaping the whole arrays once, as above, avoids that.
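A minimal illustration of the rebinding (values are hypothetical):
import numpy as np

x = np.array([10, 20, 6014651])
for x in x:   # each pass rebinds the name x to an element
    pass
print(x)      # 6014651: x is now a single number, not an array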

Cross-validation and standardization in scikit-learn

I would like to find the accuracy of a sklearn classifier with k-fold cross-validation. I can estimate the accuracy normally, without cross-validation. However, how can I improve this code to do cross-validation and apply a StandardScaler at the same time?
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.pipeline import Pipeline
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
pipe_lrSVC = Pipeline([('scaler', StandardScaler()), ('clf', svm.LinearSVC())])
pipe_lrSVC.fit(X_train, y_train)
y_pred = pipe_lrSVC.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
Simply use the pipeline as the estimator input to cross_val_score:
cross_val_score(pipe_lrSVC, iris.data, iris.target, cv=5)
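This works because cross_val_score clones and re-fits the entire pipeline on each training fold, so the StandardScaler learns its mean and variance from that fold's training data only and nothing leaks in from the held-out fold. A sketch of reporting the result:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe_lrSVC, iris.data, iris.target, cv=5)
print(scores)                       # accuracy per fold
print(scores.mean(), scores.std())  # summary across folds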

Scikit-Learn: Adjust train_size or test_size?

This is a question regarding best practices for sklearn.
While experimenting with SVMs using the iris dataset provided in the sklearn library, and using train_test_split, I was wondering which parameter to adjust to avoid overfitting. I was taught to adjust test_size (roughly to ~0.3), but there is also a train_size parameter. Would it not make sense to adjust train_size to avoid overfitting, or am I misunderstanding something here?
I get similar results regardless of which parameter I adjust, but I don't know if that's always the case.
Appreciate any help. Thanks!
Here is the code I am currently working with:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
scaler = StandardScaler()
scaler.fit(df)
scaled_df = scaler.transform(df)
df = pd.DataFrame(data=scaled_df, columns=iris.feature_names)
x = df
y = iris.target
#test_size is used here, but is swapped with train_size to experiment
x_train, x_test, y_train, y_test = tts(x, y, test_size=0.33)
c_param = np.arange(1, 100, 10)
gamma_param = np.arange(0.0001, 1, 0.001)
params = {'C':c_param, 'gamma':gamma_param}
grid = GridSearchCV(estimator=SVC(), param_grid=params, verbose=0)
grid_fit = grid.fit(x_train, y_train)
grid_pred = grid.predict(x_test)
print(grid.best_params_)
print('\n')
print("Number of training records: ", len(x_train))
print("Number of test records: ", len(x_test))
print('\n')
print(classification_report(y_true=y_test, y_pred=grid_pred))
print('\n')
print(confusion_matrix(y_true=y_test, y_pred=grid_pred))
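One observation that may explain the similar results: in train_test_split, test_size and train_size are complements. If only one is given, the other defaults to the remainder, so test_size=0.33 and train_size=0.67 describe the same split; neither parameter is more "correct" for avoiding overfitting. A quick check, reusing the names from the code above:
a_train, a_test, _, _ = tts(x, y, test_size=0.33, random_state=0)
b_train, b_test, _, _ = tts(x, y, train_size=0.67, random_state=0)
print(len(a_train), len(a_test))  # same sizes either way
print(len(b_train), len(b_test))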
