plot calibration curve for machine learning - python

I have the code below, but it works only with binary classes. How can I use it with three classes?
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import scikitplot as skp

original_data = pd.read_excel("movie.xls")
# Extract the text (first column) and the labels (second column)
text = original_data.iloc[:, 0]
label = original_data.iloc[:, 1]
# `fe` is the feature matrix extracted from `text` (extraction step not shown)
x_train, x_test, y_train, y_test = train_test_split(fe, label, test_size=0.30, random_state=40)
DT = DecisionTreeClassifier()
# plot_calibration_curve expects a list of predicted-probability arrays
DT_probas = DT.fit(x_train, y_train).predict_proba(x_test)
clf_names = ['Decision Tree']
skp.metrics.plot_calibration_curve(y_test, [DT_probas], clf_names)
plt.show()

Since you use the scikit-plot module, note that it has no function for multiclass problems. Reading the source code here:
This function currently only works for binary classification.
So you can either 1) modify the source code, or 2) open a GitHub issue and request a function for multiclass problems.
EDIT 1:
Using scikit-learn, you have some ML models that can handle multiclass problems natively. For example, for the LinearSVC function here, multiclass support is handled according to a one-vs-the-rest scheme.
So you can use models like this and then apply the plot_calibration_curve function to each case (one vs. rest) separately.
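For reference, here is a minimal sketch of that one-vs-rest idea, using sklearn's own calibration_curve instead of scikit-plot; it assumes the fitted DT and the x_test/y_test split from the question:
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.preprocessing import label_binarize

probas = DT.predict_proba(x_test)                    # shape (n_samples, n_classes)
y_bin = label_binarize(y_test, classes=DT.classes_)  # one binary column per class
for i, cls in enumerate(DT.classes_):
    # Treat class `cls` as positive and everything else as negative
    frac_pos, mean_pred = calibration_curve(y_bin[:, i], probas[:, i], n_bins=10)
    plt.plot(mean_pred, frac_pos, marker="o", label=f"{cls} vs rest")
plt.plot([0, 1], [0, 1], "k--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()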

Related

How to pickle or otherwise save an RFECV model after fitting for rapid classification of novel data

I am generating a predictive model for cancer diagnosis from a moderately large dataset (>4500 features).
I have got the RFECV to work, providing me with a model that I can evaluate nicely using ROC curves, confusion matrices, etc., and which performs acceptably when classifying novel data.
Please find a truncated version of my code below.
logo = LeaveOneGroupOut()
model = RFECV(LinearDiscriminantAnalysis(), step=1, cv=logo.split(X, y, groups=trial_number))
model.fit(X, y)
As I say, this works well and provides a model I'm happy with. The trouble is, I would like to be able to save this model so that I don't need to do the lengthy retraining every time I want to evaluate new data.
When I have tried to pickle a standard LDA or another model object, this has worked fine. When I try to pickle this RFECV object, however, I get the following error:
Traceback (most recent call last):
File "/rds/general/user/***/home/data_analysis/analysis_report_generator.py", line 56, in <module>
pickle.dump(key, file)
TypeError: cannot pickle 'generator' object
In trying to address this, I have spent a long time trying to RTFM, googling extensively, and digging as deep as I dared into Stack, without any luck.
I would be grateful if anyone could identify what I can do to pickle this model successfully for future extraction and reuse, or whether there is an equivalent way to save the parameters of the feature-extracted LDA model for rapid analysis of new data.
This occurs because LeaveOneGroupOut().split(X, y, groups=groups) returns a generator object, which cannot be pickled for reasons previously discussed.
To pickle it, you'd have to cast it to a finite list of splits with something like the following, or replace it with StratifiedKFold, which does not have this issue.
rfecv = RFECV(
    # ...
    cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
)
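Alternatively, if the group structure is not essential for your validation scheme (an assumption on my part), any picklable CV object works directly:
from sklearn.model_selection import StratifiedKFold

rfecv = RFECV(
    # ... same estimator and other arguments as before
    cv=StratifiedKFold(n_splits=5),  # CV objects pickle fine; only generators do not
)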
MRE putting all the pieces together (here I've assigned groups randomly):
import pickle

from numpy.random import default_rng
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFECV
from sklearn.model_selection import LeaveOneGroupOut

rng = default_rng()
X, y = make_classification(
    n_samples=500, n_features=15, n_informative=3, n_redundant=2,
    n_repeated=0, n_classes=8, n_clusters_per_class=1, class_sep=0.8,
    random_state=0,
)
groups = rng.integers(0, 5, size=len(y))

rfecv = RFECV(
    estimator=LinearDiscriminantAnalysis(),
    step=1,
    cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
    scoring="accuracy",
    min_features_to_select=1,
    n_jobs=4,
)
rfecv.fit(X, y)

with open("rfecv_lda.pickle", "wb") as fh:
    pickle.dump(rfecv, fh)
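As a quick round-trip check, continuing the script above (the new samples here are a made-up stand-in for real data):
with open("rfecv_lda.pickle", "rb") as fh:
    restored = pickle.load(fh)
X_new = rng.normal(size=(5, 15))  # stand-in for unseen samples with 15 features
print(restored.predict(X_new))    # predictions from the wrapped LDA
print(restored.support_)          # boolean mask of the features RFE kept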
Side note: A better method would be to avoid pickling the RFECV in the first place. rfecv.transform(X) masks feature columns that the search deemed unnecessary. If you have >4500 features and only need 10, you might want to simplify your data pipeline elsewhere.

Import error. Can't import kmeans_plusplus

I'm trying to replicate an example of a clustering model with scikit-learn:
import sklearn
sklearn.__version__
Returns:
'0.23.2'
And:
from sklearn.cluster import kmeans_plusplus
Returns the error message:
ImportError: cannot import name 'kmeans_plusplus' from 'sklearn.cluster' (C:\Users\sddss\anaconda3\lib\site-packages\sklearn\cluster\__init__.py)
According to the documentation, kmeans_plusplus is
New in version 0.24.
so it is not available in version 0.23.2, which you are using.
Nevertheless, this should not be a real issue; the only difference from the "good old" KMeans already available in scikit-learn is the initialization of the cluster centers according to the k-means++ algorithm, and this is already available in the standard KMeans. From the standard KMeans documentation regarding the init argument:
'k-means++' : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence
So, what you need to do is simply use the "vanilla" KMeans of scikit-learn with the argument init='k-means++':
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=n_clusters, init='k-means++')
There is no kmeans_plusplus function in version 0.23.2. You need to import KMeans and set the init keyword argument to 'k-means++' to obtain the behaviour you want:
from sklearn.cluster import KMeans
kmeans = KMeans(init='k-means++')
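A short end-to-end sketch on toy data (the blob parameters are just illustrative):
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=0)
labels = kmeans.fit_predict(X)        # cluster index for each sample
print(kmeans.cluster_centers_.shape)  # (4, 2): one centroid per cluster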

Simple question: I am not getting the output I expected (linear regression)

I am new to programming. Currently, I am learning machine learning from this video.
This is related to linear regression
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df=pd.read_csv('homeprices.csv')
reg = linear_model.LinearRegression()
Problem 1
reg.fit(df[['area']],df.price)
Expected output should be
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
My output:
LinearRegression()
Problem 2
reg.predict(3300)
It gives an error when I pass the plain number, but when I use a 2D array [[...]] it gives the correct output. I want to know why it does not give the output shown in the video when I use only parentheses.
Problem 1:
This is how fitted models are displayed in the newest versions of sklearn, i.e. 0.23+. The parameters are the same; parameters left at their default values are simply no longer shown in the output.
You can use reg.get_params() to view all the parameters.
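If you prefer the verbose repr shown in the video, you can also switch the display behaviour back globally:
import sklearn

sklearn.set_config(print_changed_only=False)
print(reg)  # now shows all parameters, including the defaults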
Problem 2:
Newer versions of scikit-learn require 2D inputs for the predict function; we can make 3300 2D with [[3300]]:
reg.predict([[3300]])
Problem 1:
It depends on the default parameters, which may differ between versions or have been changed elsewhere, but you can easily set the parameters you want while initializing the regressor, like this:
reg = linear_model.LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Problem 2:
reg.predict(3300): it is not correct to pass the value as a bare scalar, and you can see that the instructor has also corrected his answer to reg.predict([3300]) in the description of the YouTube post.
Try this, but note that you should define your variables and fit the model to get the desired output:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression  # no parentheses in an import

df = pd.read_csv('homeprices.csv')
reg = LinearRegression()
reg.fit(df[['area']], df.price)  # fit before calling predict

How to use the imbalanced library with sklearn pipeline?

I am trying to solve a text classification problem, and I want to create a baseline model using MultinomialNB.
My data is highly imbalanced for a few categories, hence I decided to use the imbalanced-learn library with an sklearn pipeline, referring to the tutorial.
The model fails with an error after introducing the two sampling stages in the pipeline as suggested in the docs.
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)

# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,
                                                   ngram_range=(1, 2),
                                                   tokenizer=tokenize_and_stem)),
                          ('tfidf', TfidfTransformer(use_idf=True)),
                          ('enn', EditedNearestNeighbours()),
                          ('renn', RepeatedEditedNearestNeighbours()),
                          ('clf-gnb', MultinomialNB())])
Error:
TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
Can someone please help here? I am also open to using a different approach (boosting/SMOTE) as well.
It seems that make_pipeline from imblearn doesn't support naming the steps like sklearn's Pipeline does. From the imblearn documentation:
*steps : list of estimators.
You should modify your code to pass the estimators directly:
pipe = make_pipeline_imb(CountVectorizer(max_features=100000,
                                         ngram_range=(1, 2),
                                         tokenizer=tokenize_and_stem),
                         TfidfTransformer(use_idf=True),
                         EditedNearestNeighbours(),
                         RepeatedEditedNearestNeighbours(),
                         MultinomialNB())
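If you do want named steps (handy for setting grid-search parameters), imblearn also provides a Pipeline class that accepts (name, estimator) tuples, mirroring sklearn's; here is a sketch reusing the question's tokenize_and_stem helper:
from imblearn.pipeline import Pipeline

pipe = Pipeline([
    ('vect', CountVectorizer(max_features=100000,
                             ngram_range=(1, 2),
                             tokenizer=tokenize_and_stem)),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('enn', EditedNearestNeighbours()),
    ('renn', RepeatedEditedNearestNeighbours()),
    ('clf-gnb', MultinomialNB()),
])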

Leave one out Cross validation using sklearn (Multiple CSV)

I have 52 CSV files in a folder, and I want to build a model based on this data. That's why I want to do leave-one-out cross-validation on these data. How can I do this using scikit-learn in Python?
I tried the scikit-learn documentation and also searched many resources, but I didn't find a solution. I have tried this code:
import glob
import numpy as np
import pandas as pd
from sklearn.cross_validation import LeaveOneOut

path = r'...................\Data\New design process data'
filelist = glob.glob(path + "/*.csv")
loo = LeaveOneOut()
for train, test in loo.split(filelist):
    print("%s %s" % (train, test))
But it showed this error:
__init__() missing 1 required positional argument: 'n'
I am new to Python as well as scikit-learn. If anyone can help me, it would be a great convenience.
You should use the newer version of the module, which is located in sklearn.model_selection instead of sklearn.cross_validation (the cross_validation module was deprecated in 0.18). Using this version, you can instantiate the class without the positional argument, and it also does not fail when you call split.
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut()  # works without passing an argument
loo.get_n_splits(X)  # returns 2
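For the original 52-file use case, here is a sketch that treats each CSV file as the held-out unit (it assumes all files share the same column layout; the model-fitting step is left to you):
import glob
import pandas as pd
from sklearn.model_selection import LeaveOneOut

filelist = sorted(glob.glob(path + "/*.csv"))
loo = LeaveOneOut()
for train_idx, test_idx in loo.split(filelist):
    # Concatenate all but one file for training; hold the remaining file out
    train_df = pd.concat(pd.read_csv(filelist[i]) for i in train_idx)
    test_df = pd.read_csv(filelist[test_idx[0]])
    # Fit your model on train_df and evaluate it on test_df here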
