ValueError while classifying multidimensional output classes using SVMs - Python

I am trying to fit and classify my data using SVMs.
My input data consists of 11 features (dimensions) with 1335 samples, and my output data consists of 17 classes (1335x17).
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svccl = svclassifier.fit(x_train, y_train)
(the same happens for kernel='poly')
I get the following error:
ValueError: y should be a 1d array, got an array of shape (934, 17) instead.
The same error occurs when I try to classify using a Naive Bayes classifier:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(x_train, y_train)
gnb_predictions = gnb.predict(x_test)
Where am I going wrong in my approach?

SVC and GaussianNB do not support multi-target classification, so they only accept a 1-d array for y. To handle multiple targets, you need to fit one classifier per target.
scikit-learn already provides an API for this: multioutput classification with MultiOutputClassifier.
You can combine it with any classifier you want.
Combining MultiOutputClassifier with SVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC
import numpy as np
# dummy data: 934 samples, 100 features, 17 target columns
X = np.random.rand(934, 100)
Y = np.random.randint(17, size=[934, 17])
svc = SVC()
# fits one SVC per target column, in parallel
multi_target_svc = MultiOutputClassifier(svc, n_jobs=-1)
multi_target_svc.fit(X, Y).predict(X)
Combining MultiOutputClassifier with GaussianNB
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB
import numpy as np
# dummy data: 934 samples, 100 features, 17 target columns
X = np.random.rand(934, 100)
Y = np.random.randint(17, size=[934, 17])
gnb = GaussianNB()
# fits one GaussianNB per target column, in parallel
multi_target_gnb = MultiOutputClassifier(gnb, n_jobs=-1)
multi_target_gnb.fit(X, Y).predict(X)
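Applied to the question's own data (a minimal sketch; x_train, y_train and x_test are the variables from the question, with 11 features and 17 target columns), this would look like:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC
# x_train: shape (n_samples, 11), y_train: shape (n_samples, 17), as described in the question
multi_svc = MultiOutputClassifier(SVC(kernel='linear'), n_jobs=-1)
multi_svc.fit(x_train, y_train)      # one SVC fitted per target column
y_pred = multi_svc.predict(x_test)   # shape (n_test_samples, 17)
If the 17 columns are actually a one-hot encoding of a single label, an alternative is to convert them back with y_train.argmax(axis=1) and fit a plain SVC.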

Related

How do I manually compute `predict_proba` from a logistic regression model in scikit-learn?

I am trying to manually reproduce the predictions of a logistic regression model using the coefficient and intercept outputs from a scikit-learn model. However, I can't match up my probability predictions with the predict_proba method of the classifier.
I have tried:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from scipy.special import expit
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)
# use sklearn's predict_proba function
sk_probas = clf.predict_proba(X[:1, :])
# and attempting manually (using scipy's inverse logit)
manual_probas = expit(np.dot(X[:1], clf.coef_.T)+clf.intercept_)
# with a completely manual inverse logit
full_manual_probas = 1/(1+np.exp(-(np.dot(X[:1], clf.coef_.T)+clf.intercept_)))
outputs:
>>> sk_probas
array([[9.81815067e-01, 1.81849190e-02, 1.44120963e-08]])
>>> manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
>>> full_manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
I do seem to get the classes to match (using np.argmax), but the probabilities are different. What am I missing?
I've looked at a couple of related questions but haven't managed to figure it out yet.
The documentation states that
For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class
That is, in order to get the same values as sklearn you have to normalize using softmax, like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X, y)
decision = np.dot(X[:1], clf.coef_.T)+clf.intercept_
print(clf.predict_proba(X[:1]))
print(np.exp(decision) / np.exp(decision).sum())
To use sigmoids instead you can do it like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000, multi_class='ovr').fit(X, y) # Notice the extra argument
full_manual_probas = 1/(1+np.exp(-(np.dot(X[:1], clf.coef_.T)+clf.intercept_)))
print(clf.predict_proba(X[:1]))
print(full_manual_probas / full_manual_probas.sum())
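As a sanity check (not part of the original answer), LogisticRegression also exposes decision_function, which returns exactly these raw scores, so you can verify the manual dot product before applying softmax or the sigmoid:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X, y)
# manual raw scores vs sklearn's decision_function; these should match
manual_decision = np.dot(X[:1], clf.coef_.T) + clf.intercept_
print(np.allclose(manual_decision, clf.decision_function(X[:1])))  # True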

Calculating AUC for LogisticRegression model

Let's take this data:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
data = load_breast_cancer()
X = data.data
y = data.target
I want to create a model using only the first principal component and calculate AUC for it.
My work so far:
from sklearn.linear_model import LogisticRegression
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=1)
principalComponents = pca.fit_transform(X_scaled)
principalDf = pd.DataFrame(data = principalComponents,
                           columns = ['principal component 1'])
clf = LogisticRegression()
clf = clf.fit(principalDf, y)
pred = clf.predict_proba(principalDf)
But when I try to use
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
the following error occurs:
y should be a 1d array, got an array of shape (569, 2) instead.
I tried to reshape my data:
fpr, tpr, thresholds = metrics.roc_curve(y.reshape(1,-1), pred, pos_label=2)
but it didn't solve the issue (it outputs):
multilabel-indicator format is not supported
Do you have any idea how I can compute AUC on this first principal component?
You may wish to try:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
pca = PCA(n_components=1)  # only the first principal component, as in the question
clf = LogisticRegression()
ppl = Pipeline([("scaler", scaler), ("pca", pca), ("clf", clf)])
ppl.fit(X_train, y_train)
preds = ppl.predict_proba(X_test)[:, 1]  # probability of the positive class
fpr, tpr, thresholds = metrics.roc_curve(y_test, preds, pos_label=1)
metrics.RocCurveDisplay.from_estimator(ppl, X_test, y_test)
The problem is that predict_proba returns a column for each class. Generally, with binary classification your classes are 0 and 1, so you want the probability of the second column; it's quite common to slice as follows (replacing the last line of your code block):
pred = clf.predict_proba(principalDf)[:, 1]
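Putting both points together, a minimal end-to-end sketch (assuming the same breast-cancer data) that scores the first principal component with AUC could look like this:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn import metrics
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# scale -> keep only the first principal component -> logistic regression
ppl = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=1)), ("clf", LogisticRegression())])
ppl.fit(X_train, y_train)
probs = ppl.predict_proba(X_test)[:, 1]  # probability of the positive class (label 1)
fpr, tpr, thresholds = metrics.roc_curve(y_test, probs, pos_label=1)
print(metrics.auc(fpr, tpr))                 # AUC from the ROC curve
print(metrics.roc_auc_score(y_test, probs))  # or directly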

How to interpret base_value of multi-class classification problem when using SHAP?

I am using the shap library for ML interpretability to better understand k-means segmentation clusters. In a nutshell, I make some blobs, use k-means to cluster them, then take the clusters as labels and use xgboost to try to predict them. I have 5 clusters, so it is a single-label multi-class classification problem.
import numpy as np
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import xgboost as xgb
import shap
X, y = make_blobs(n_samples=500, centers=5, n_features=5, random_state=0)
data = pd.DataFrame(np.concatenate((X, y.reshape(500,1)), axis=1), columns=['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'cluster_id'])
data['cluster_id'] = data['cluster_id'].astype(int).astype(str)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.iloc[:,:-1])
kmeans = KMeans(n_clusters=5)
kmeans.fit(scaled_features)
data['predicted_cluster_id'] = kmeans.labels_.astype(int).astype(str)
clf = xgb.XGBClassifier()
clf.fit(scaled_features, data['predicted_cluster_id'])
shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(scaled_features[0, :].reshape(1, -1))
shap.force_plot(explainer.expected_value[0], shap_values[0], link='logit') # repeat changing 0 for i in range(0, 5)
The resulting force plots make sense, as the class is '3'. But why this base_value? Shouldn't it be 1/5? I asked a similar question a while ago, but this time I already set link='logit'.
link="logit" does not seem right for multiclass, as it's only suitable for binary output. This is why you do not see probabilities summing up to 1.
Let's streamline your code:
import numpy as np
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import xgboost as xgb
import shap
from scipy.special import softmax, logit, expit
np.random.seed(42)
X, y_true = make_blobs(n_samples=500, centers=5, n_features=3, random_state=0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=5)
y_predicted = kmeans.fit_predict(X_scaled)
clf = xgb.XGBClassifier()
clf.fit(X_scaled, y_predicted)
shap.initjs()
Then, what you see as expected values in:
explainer = shap.TreeExplainer(clf)
explainer.expected_value
array([0.67111245, 0.60223354, 0.53357694, 0.50821152, 0.50145331])
are base scores in raw space.
The multi-class raw scores can be converted to probabilities with softmax:
softmax(explainer.expected_value)
array([0.22229282, 0.20749694, 0.19372895, 0.18887673, 0.18760457])
shap.force_plot(..., link="logit") doesn't make sense for multiclass, and it seems impossible to switch from raw to probability and still maintain additivity (because softmax(x+y) ≠ softmax(x) + softmax(y)).
Should you wish to analyze your data in probability space, try KernelExplainer:
from shap import KernelExplainer
masker = shap.maskers.Independent(X_scaled, 100)
ke = KernelExplainer(clf.predict_proba, data=masker.data)
ke.expected_value
# array([0.18976762, 0.1900516 , 0.20042894, 0.19995041, 0.21980143])
shap_values=ke.shap_values(masker.data)
shap.force_plot(ke.expected_value[0], shap_values[0][0])
or a waterfall plot:
from shap import Explanation
shap.waterfall_plot(Explanation(shap_values[0][0],ke.expected_value[0]))
These SHAP values are now additive in probability space and align well with both the base probabilities (see above) and the predicted probabilities for the 0th datapoint:
clf.predict_proba(masker.data[0].reshape(1,-1))
array([[2.2844513e-04, 8.1287889e-04, 6.5225776e-04, 9.9737883e-01,
9.2762709e-04]], dtype=float32)
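As a quick check (not part of the original answer; assuming the older shap convention where shap_values returns one array per class), the KernelExplainer values are additive in probability space:
import numpy as np
# base value + sum of per-feature SHAP values for class 0, datapoint 0
reconstructed = ke.expected_value[0] + shap_values[0][0].sum()
# should (approximately) equal the model's predicted probability of class 0
predicted = clf.predict_proba(masker.data[:1])[0, 0]
print(np.isclose(reconstructed, predicted, atol=1e-3))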

Trying to implement XGBoost into my Artificial Neural Network

I'm completely unaware as to why I'm receiving this error. I am trying to implement XGBoost, but it returns the error "ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric", even after I've one-hot encoded my categorical data. If anyone knows what is causing this and a possible solution, I'd greatly appreciate it. Here is my code, written in Python:
# Artificial Neural Networks - With XGBoost
# PRE PROCESS
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
# Encoding Categorical Data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1, 2])],
remainder = 'passthrough')
X = np.array(ct.fit_transform(X), dtype=float)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 0)
# Fitting XGBoost to the training set
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(x_train, y_train)
# Predicting the Test set Results
y_pred = classifier.predict(x_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = x_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()

Naive Bayes MultinomialNB scikit-learn/sklearn

I am building a naive Bayes classifier, following the tutorial on the scikit-learn website.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import csv
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Importing dataset
data = pd.read_csv("test.csv", quotechar='"', delimiter=',',quoting=csv.QUOTE_ALL, skipinitialspace=True,error_bad_lines=False)
df2 = data.set_index("name", drop = False)
df2['sentiment'] = df2['rating'].apply(lambda rating : +1 if rating > 3 else -1)
train, test = train_test_split(df2, test_size=0.2)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train['review'])
test_matrix = count_vect.transform(test['review'])
clf = MultinomialNB().fit(X_train_counts, train['sentiment'])
CountVectorizer builds the vocabulary dictionary and returns a document-term matrix, which becomes the first argument to fit.
What should the second argument be (in the tutorial it is twenty_train.target)?
Edit: data example
Name, review, rating
film1,......,1
film2, the film is....,5
film3, film about..., 4
With this instruction I created a new column: if the rating is > 3 the review is positive, otherwise it is negative.
df2['sentiment'] = df2['rating'].apply(lambda rating : +1 if rating > 3 else -1)
The fit method of MultinomialNB expects x and y as input.
Here, x should be the training vectors (training data) and y should be the target values.
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
In more detail:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target values.
Note: Make sure that shape = [n_samples, n_features] and shape = [n_samples] of x and y are defined correctly. Otherwise, the fit will throw an error.
Toy example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
vectorizer = TfidfVectorizer()
# the following will be the training data
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape
newsgroups_test = fetch_20newsgroups(subset='test',
categories=categories)
# this is the test data
vectors_test = vectorizer.transform(newsgroups_test.data)
clf = MultinomialNB(alpha=.01)
# the fitting is done using the TRAINING data
# Check the shapes before fitting
vectors.shape
#(2034, 34118)
newsgroups_train.target.shape
#(2034,)
# fit the model using the TRAINING data
clf.fit(vectors, newsgroups_train.target)
# the PREDICTION is done using the TEST data
pred = clf.predict(vectors_test)
EDIT:
The newsgroups_train.target is just a numpy array that contains the labels (or targets or classes).
import numpy as np
newsgroups_train.target
array([1, 3, 2, ..., 1, 0, 1])
np.unique(newsgroups_train.target)
array([0, 1, 2, 3])
So in this example we have 4 different classes/targets.
This variable is needed in order to fit a classifier.
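Applied to the question's own data (a minimal sketch, reusing the train/test split, CountVectorizer output and sentiment column from the question), the second argument is simply the 1-d sentiment column:
from sklearn.naive_bayes import MultinomialNB
# X_train_counts: document-term matrix from CountVectorizer, shape [n_samples, n_features]
# train['sentiment']: 1-d array of target labels (+1 / -1), shape [n_samples]
clf = MultinomialNB().fit(X_train_counts, train['sentiment'])
# predict on the transformed test reviews
predictions = clf.predict(test_matrix)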
