I am trying to practice using scikit-learn to build a K-Nearest Neighbors prediction model with the Iris dataset. This is what I have written:
import sklearn
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
iris = datasets.load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)
This is my output:
>>> KNeighborsClassifier(n_neighbors=6)
However, I think I should be getting:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=6, p=2, weights='uniform')
Also, I tried to predict the target value based on a new array of X values (X_new) as below:
X_new = np.array([[5.6,2.8,3.9,1.1],[5.7,2.6,3.8,1.3],[4.7,3.2,1.3,0.2]])
Pred = knn.predict(X_new)
print(Pred)
However, it didn't print any output at all. Any assistance/advice would be appreciated!
I think your code works fine; I ran it on Google Colab (link to the notebook - https://colab.research.google.com/drive/1FROuNe4NMD6D2HCCEtz6TePlCccbGFZm?usp=sharing).
Do check it out and try reproducing the issue there.
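For completeness: since scikit-learn 0.23 the printed repr only shows parameters that differ from their defaults, which is why you see KNeighborsClassifier(n_neighbors=6) rather than the fully expanded form. A small sketch (assuming a recent scikit-learn) of two ways to see every hyperparameter:
import sklearn
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=6)

# get_params() always returns the full dict of hyperparameters
print(knn.get_params())

# or tell scikit-learn to print unchanged defaults as well
sklearn.set_config(print_changed_only=False)
print(knn)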
I am trying to manually predict a logistic regression model using the coefficient and intercept outputs from a scikit-learn model. However, I can't match up my probability predictions with the predict_proba method from the classifier.
I have tried:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from scipy.special import expit
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)
# use sklearn's predict_proba function
sk_probas = clf.predict_proba(X[:1, :])
# and attempting manually (using scipy's inverse logit)
manual_probas = expit(np.dot(X[:1], clf.coef_.T)+clf.intercept_)
# with a completely manual inverse logit
full_manual_probas = 1/(1+np.exp(-(np.dot(X[:1], clf.coef_.T)+clf.intercept_)))
outputs:
>>> sk_probas
array([[9.81815067e-01, 1.81849190e-02, 1.44120963e-08]])
>>> manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
>>> full_manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
I do seem to get the classes to match (using np.argmax), but the probabilities are different. What am I missing?
I've looked at a couple of related questions but haven't managed to figure it out yet.
The documentation states that
For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class
That is, in order to get the same values as sklearn you have to normalize using softmax, like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X, y)
decision = np.dot(X[:1], clf.coef_.T)+clf.intercept_
print(clf.predict_proba(X[:1]))
print(np.exp(decision) / np.exp(decision).sum())
To use sigmoids instead you can do it like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000, multi_class='ovr').fit(X, y) # Notice the extra argument
full_manual_probas = 1/(1+np.exp(-(np.dot(X[:1], clf.coef_.T)+clf.intercept_)))
print(clf.predict_proba(X[:1]))
print(full_manual_probas / full_manual_probas.sum())
I am using the shap library for ML interpretability to better understand k-means segmentation clusters. In a nutshell I make some blobs, use k-means to cluster them, then take the cluster assignments as labels and use xgboost to try to predict them. I have 5 clusters, so it is a single-label multi-class classification problem.
import numpy as np
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import xgboost as xgb
import shap
X, y = make_blobs(n_samples=500, centers=5, n_features=5, random_state=0)
data = pd.DataFrame(np.concatenate((X, y.reshape(500,1)), axis=1), columns=['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'cluster_id'])
data['cluster_id'] = data['cluster_id'].astype(int).astype(str)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.iloc[:,:-1])
kmeans = KMeans(n_clusters=5)
kmeans.fit(scaled_features)
data['predicted_cluster_id'] = kmeans.labels_.astype(int).astype(str)
clf = xgb.XGBClassifier()
clf.fit(scaled_features, data['predicted_cluster_id'])
shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(scaled_features[0].reshape(1, -1))
shap.force_plot(explainer.expected_value[0], shap_values[0], link='logit') # repeat changing 0 for i in range(0, 5)
The plots above make sense as the class is '3'. But why this base_value, shouldn't it be 1/5? I asked a similar question a while ago, but this time I have already set link='logit'.
link="logit" does not seem right for multiclass, as it's only suitable for binary output. This is why you do not see probabilities summing up to 1.
Let's streamline your code:
import numpy as np
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import xgboost as xgb
import shap
from scipy.special import softmax, logit, expit
np.random.seed(42)
X, y_true = make_blobs(n_samples=500, centers=5, n_features=3, random_state=0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=5)
y_predicted = kmeans.fit_predict(X_scaled)
clf = xgb.XGBClassifier()
clf.fit(X_scaled, y_predicted)
shap.initjs()
Then, what you see as expected values in:
explainer = shap.TreeExplainer(clf)
explainer.expected_value
array([0.67111245, 0.60223354, 0.53357694, 0.50821152, 0.50145331])
are base scores in raw space.
The multi-class raw scores can be converted to probabilities with softmax:
softmax(explainer.expected_value)
array([0.22229282, 0.20749694, 0.19372895, 0.18887673, 0.18760457])
shap.force_plot(..., link="logit") doesn't make sense for multiclass, and it seems impossible to switch from raw to probability and still maintain additivity (because softmax(x+y) ≠ softmax(x) + softmax(y)).
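A quick numeric illustration of that non-additivity (the values below are arbitrary, just for demonstration):
import numpy as np
from scipy.special import softmax

x = np.array([0.5, 1.0, -0.5])
y = np.array([1.5, -1.0, 0.2])

print(softmax(x + y))           # a proper probability vector, sums to 1
print(softmax(x) + softmax(y))  # element-wise sum of two distributions, sums to 2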
Should you wish to analyze your data in probability space, try KernelExplainer:
from shap import KernelExplainer
masker = shap.maskers.Independent(X_scaled, 100)
ke = KernelExplainer(clf.predict_proba, data=masker.data)
ke.expected_value
# array([0.18976762, 0.1900516 , 0.20042894, 0.19995041, 0.21980143])
shap_values = ke.shap_values(masker.data)
shap.force_plot(ke.expected_value[0], shap_values[0][0])
or a waterfall plot:
from shap import Explanation
shap.waterfall_plot(Explanation(shap_values[0][0],ke.expected_value[0]))
which are now additive for SHAP values in probability space and align well with both the base probabilities (see above) and the predicted probabilities for the 0th datapoint:
clf.predict_proba(masker.data[0].reshape(1,-1))
array([[2.2844513e-04, 8.1287889e-04, 6.5225776e-04, 9.9737883e-01,
9.2762709e-04]], dtype=float32)
I am a beginner at machine learning, and as part of learning I chose the student performance dataset from UCI. I want to predict the final result of a student based on the features given.
I first tried using the two main and highly correlated features, G1 and G2, which are the grades of two exams. I used the LinearRegression algorithm and got an accuracy of 0.4 or less.
Then I tried feature engineering on all the features that are objects in the dataframe, and the accuracy is still the same.
How can I improve the accuracy score?
My code (from a Python notebook):
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error,accuracy_score
df = pd.read_csv('student-mat.csv',sep=';')
df2 = pd.read_csv('student-por.csv',sep=';')
df = pd.concat([df, df2])
df = pd.get_dummies(df)
X = df.drop('G3',axis=1)
y = df['G3']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state=42)
model = LinearRegression()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
y_pred = [int(round(i)) for i in y_pred]
accuracy_score(y_test,y_pred)
Accuracy calculated on continuous variables is not very useful. Use the mean squared error instead, which is the relevant metric for continuous output.
As for improving your model, you can try the different tools at your disposal to identify the most relevant features. I recommend the statsmodels API (https://www.statsmodels.org/stable/regression.html) for a more in-depth analysis.
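For example, a minimal sketch reusing the variable names from the question (the statsmodels part is an assumption about how you might inspect feature relevance, not the only way):
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

# regression metrics instead of classification accuracy
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))

# statsmodels OLS reports per-feature coefficients, p-values and confidence intervals
X_train_sm = sm.add_constant(X_train.astype(float))
ols = sm.OLS(y_train.astype(float), X_train_sm).fit()
print(ols.summary())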
I am working with the yeast dataset available at:
http://archive.ics.uci.edu/ml/datasets/yeast
and I want to build a neural network classifier model and plot the learning curves. I have used scikit-learn's model_selection twice: once for making the training and testing sets, and once more for splitting off a validation set. From these two sets I would like to plot the learning curves; my code is the following:
import numpy as np
import pandas as pd
from sklearn import model_selection, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
def readFile(file):
    head = ["seq_n","mcg","gvh","alm","mit","erl","pox","vac","nuc","site"]
    f = pd.read_csv(file, delimiter=r"\s+", header=None)  # the file has no header row
    f.columns = head
    return f
def NeuralClass(X, y):
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
    X_tr, X_val, y_tr, y_val = model_selection.train_test_split(X_train, y_train, test_size=0.2)
    mlp = MLPClassifier(activation="relu", max_iter=3000)
    mlp.fit(X_train, y_train)
    print(mlp.score(X_train, y_train))
    plt.plot(mlp.loss_curve_)
    mlp.fit(X_val, y_val)
    plt.plot(mlp.loss_curve_)
def main():
    f = readFile("yeast.data")
    cols = ["seq_n","site"]
    X = f.drop(cols, axis=1)
    y = f["site"]
    NeuralClass(X, y)

if __name__ == "__main__":
    main()
I have obtained a graph like the following, which I do not know if it's correct:
The question is whether this is the correct way to plot the validation curve, or whether the method I followed is the right one.
Thanks
Didn't test it, but should be something like this:
def NeuralClass(X, y):
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=0.2)
    mlp = MLPClassifier(
        activation="relu",
        max_iter=3000,
        validation_fraction=0.2,
        early_stopping=True)
    mlp.fit(X_train, y_train)
    print(mlp.score(X_train, y_train))
    plt.plot(mlp.loss_curve_)
    plt.plot(mlp.validation_scores_)
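Note that the two attributes are on different scales: loss_curve_ is the training loss per iteration, while validation_scores_ (only available with early_stopping=True) is the accuracy on the held-out validation_fraction. A possible way to plot them inside the same function, sketched on separate axes:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(mlp.loss_curve_)
ax1.set_title("training loss")
ax1.set_xlabel("iteration")
ax2.plot(mlp.validation_scores_)
ax2.set_title("validation accuracy")
ax2.set_xlabel("iteration")
plt.show()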
I am encountering some problems using GridSearchCV from the sklearn package with parallelization.
To see if the issue was coming from my data, I tried to perform the same kind of processing on the iris dataset embedded in sklearn.
I am trying to optimize an SVM classifier over different parameter values. When I set n_jobs=1 there is no problem, it works well (the following code runs in about 0.5 s). But if I change n_jobs to any other value, it runs indefinitely (I did not let it run for several hours, but after more than 10 minutes I still don't get any result).
Here is the code with n_jobs=3:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn import cross_validation
from sklearn import datasets
from sklearn import grid_search
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = StandardScaler().fit_transform(X)
parameters = {'C': [0.1,0.3,0.5,0.7,1.0],\
'coef0': [0.1,0.3,0.5,0.7,0.9],\
'degree': [2,3,4],\
'gamma': [0.0,0.1,0.2]}
clf = SVC(random_state=0, max_iter=-1, probability=False,shrinking=True, verbose=False,\
class_weight='auto',tol=0.0001, kernel='poly')
gridsearch_RF = grid_search.GridSearchCV(clf, parameters, n_jobs=3, cv=2, verbose=0)
gridsearch_RF = gridsearch_RF.fit(X,y)
print gridsearch_RF.best_score_
print gridsearch_RF.best_params_
I also tried with another classifier (RandomForest) and I get the same problem.
Has anyone had the same problem?
(I use WinPython-64bits-2.7.9.5 and sklearn version 0.16.1)
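One commonly suggested workaround on Windows (an assumption that the hang comes from joblib's multiprocessing backend re-importing the script in each worker process) is to move the parallel call under a main guard; a minimal sketch against the same sklearn 0.16 API:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
from sklearn import grid_search

parameters = {'C': [0.1, 0.3, 0.5, 0.7, 1.0],
              'coef0': [0.1, 0.3, 0.5, 0.7, 0.9],
              'degree': [2, 3, 4],
              'gamma': [0.0, 0.1, 0.2]}

def run_search():
    iris = datasets.load_iris()
    X = StandardScaler().fit_transform(iris.data)
    y = iris.target
    clf = SVC(random_state=0, max_iter=-1, class_weight='auto',
              tol=0.0001, kernel='poly')
    gs = grid_search.GridSearchCV(clf, parameters, n_jobs=3, cv=2, verbose=0)
    gs.fit(X, y)
    print(gs.best_score_)
    print(gs.best_params_)

# On Windows, child processes re-import this module; without the guard the
# parallel call itself would run again in every worker and can hang.
if __name__ == '__main__':
    run_search()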