I have a problem with this code:
from sklearn import svm
model_SVC = SVC()
model_SVC.fit(X_scaled_df_train, y_train)
svm_prediction = model_SVC.predict(X_scaled_df_test)
The error message is:
NameError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_14392/1339209891.py in <module>
----> 1 svm_prediction = model_SVC.predict(X_scaled_df_test)

NameError: name 'model_SVC' is not defined
Any ideas?
Use:
from sklearn.svm import SVC
The line from sklearn import svm imports only the svm module, not the SVC class, so model_SVC = SVC() raised NameError: name 'SVC' is not defined and model_SVC was never created; that is why the later predict call fails. Either import the class directly, as above, or keep the module import and write svm.SVC().
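Both working variants, as a minimal sketch reusing the names from your snippet (either option constructs the same class):

# Option 1: import the class directly
from sklearn.svm import SVC
model_SVC = SVC()

# Option 2: import the module and use the qualified name
from sklearn import svm
model_SVC = svm.SVC()

model_SVC.fit(X_scaled_df_train, y_train)
svm_prediction = model_SVC.predict(X_scaled_df_test)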
The relevant documentation is sklearn.svm.SVC. When choosing this model, be mindful of the dataset size; the documentation notes:
The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using LinearSVC instead.
from sklearn.svm import LinearSVC
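A drop-in sketch with LinearSVC, again reusing the variable names from your snippet:

model = LinearSVC()
model.fit(X_scaled_df_train, y_train)
prediction = model.predict(X_scaled_df_test)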
For more background, see When should one use LinearSVC or SVC?
I am generating a predictive model for cancer diagnosis from a moderately large dataset (>4500 features).
I have got the RFECV to work, providing me with a model that I can evaluate nicely using ROC curves, confusion matrices, etc., and which performs acceptably when classifying novel data.
Please find a truncated version of my code below.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFECV
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
model = RFECV(LinearDiscriminantAnalysis(), step=1, cv=logo.split(X, y, groups=trial_number))
model.fit(X, y)
As I say, this works well and provides a model I'm happy with. The trouble is, I would like to save this model so that I don't need to repeat the lengthy retraining every time I want to evaluate new data.
When I have tried to pickle a standard LDA or other model object, this has worked fine. When I try to pickle this RFECV object, however, I get the following error:
Traceback (most recent call last):
File "/rds/general/user/***/home/data_analysis/analysis_report_generator.py", line 56, in <module>
pickle.dump(key, file)
TypeError: cannot pickle 'generator' object
In trying to address this, I have spent a long time reading the docs, googling extensively, and digging as deep as I dared into Stack Overflow, without any luck.
I would be grateful if anyone could identify what I could do to pickle this model successfully for future extraction and re-use, or whether there is an equivalent way to save the parameters of the feature-extracted LDA model for rapid analysis of new data.
This occurs because LeaveOneGroupOut().split(X, y, groups=groups) returns a generator object, and generators cannot be pickled: their execution state (a paused stack frame) has no serializable representation.
To make the RFECV picklable, materialize the splits into a list with something like the following, or pass a CV splitter object such as StratifiedKFold, which does not have this issue.
rfecv = RFECV(
    # ...
    cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
)
MRE putting all the pieces together (here I've assigned groups randomly):
import pickle
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut
from numpy.random import default_rng
rng = default_rng()
X, y = make_classification(n_samples=500, n_features=15, n_informative=3, n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1, class_sep=0.8, random_state=0)
groups = rng.integers(0, 5, size=len(y))
rfecv = RFECV(
    estimator=LinearDiscriminantAnalysis(),
    step=1,
    cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
    scoring="accuracy",
    min_features_to_select=1,
    n_jobs=4,
)
rfecv.fit(X, y)

with open("rfecv_lda.pickle", "wb") as fh:
    pickle.dump(rfecv, fh)
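Alternatively, a sketch passing a picklable splitter object instead of a materialized list; note that StratifiedKFold ignores the group structure, so only do this if the grouping constraint is not essential:

from sklearn.model_selection import StratifiedKFold

rfecv = RFECV(
    estimator=LinearDiscriminantAnalysis(),
    step=1,
    cv=StratifiedKFold(n_splits=5),  # splitter objects pickle cleanly
)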
Side note: a better approach may be to avoid pickling the RFECV in the first place. rfecv.transform(X) drops the feature columns that the search deemed unnecessary, so if you have >4500 features and only need a handful, you might want to simplify your data pipeline elsewhere.
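For instance, a minimal sketch (assuming rfecv has been fit as above; X_new is a hypothetical array of new samples) that persists only the boolean support mask and the refit estimator:

import pickle

# rfecv.support_ is a boolean mask over the original feature columns;
# rfecv.estimator_ is the final LDA refit on the selected features.
with open("lda_selected.pickle", "wb") as fh:
    pickle.dump({"support": rfecv.support_, "estimator": rfecv.estimator_}, fh)

# Later, for new data: apply the mask, then predict.
with open("lda_selected.pickle", "rb") as fh:
    saved = pickle.load(fh)
preds = saved["estimator"].predict(X_new[:, saved["support"]])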
I am on my second day of re-taking Python for the gazillionth time!
I am doing a tutorial on ML in Python, using the following code:
import sklearn.tree
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import tree
music_data = pd.read_csv('music.csv')
x = music_data.drop(columns=['genre'])
y = music_data['genre']
model = DecisionTreeClassifier()
model.fit(x,y)
tree.export_graphviz(model, out_file='music-recommender.dot',
feature_names=['age','gender'],
class_names= sorted(y.unique()),
label='all',
rounded=True,
filled=True)
I keep getting the following error:
ImportError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13088/3820271611.py in <module>
2 import pandas as pd
3 from sklearn.tree import DecisionTreeClassifier
----> 4 from sklearn.tree import tree
5
6 music_data = pd.read_csv('music.csv')
ImportError: cannot import name 'tree' from 'sklearn.tree' (C:\Anaconda\lib\site-packages\sklearn\tree\__init__.py)
I've tried to find a solution online, but I don't think it's the version of Python/Anaconda, because I literally just installed both. I also don't think it's sklearn.tree, since I was able to import DecisionTreeClassifier.
As this answer indicates, you're looking at some older code; tutorials going stale is a perennial risk in programming. But there's another thing you need to know about your code.
First off, scikit-learn contains several modules, and almost everything you need from it is in one of those. In my experience, most people import things like this:
from sklearn.tree import DecisionTreeRegressor # A regressor class.
from sklearn.tree import plot_tree # A helpful function.
from sklearn.metrics import mean_squared_error # An evaluation function.
It looks like the tutorial wants something similar to plot_tree(). This newer function is much easier to use than the older Graphviz visualization, so unless you really need the DOT file for some reason, you should be able to do this:
from sklearn.tree import plot_tree
plot_tree(model)
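A slightly fuller sketch, reusing the feature and class names from your export_graphviz call (assumes model is the fitted classifier from above):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plot_tree(model,
          feature_names=['age', 'gender'],
          class_names=sorted(y.unique()),
          label='all', filled=True, rounded=True)
plt.show()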
Bottom line: there will probably be more broken things in that material. So if I were you I'd either make a new environment with a version of sklearn matching whatever material you're using... or ditch that material and look for something newer.
from sklearn.tree import tree looks wrong. Did you mean from sklearn import tree?
According to the official scikit-learn Decision Trees documentation, you really do not need many imports.
It can be done simply as follows:
from sklearn import tree
import pandas as pd
music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
y = music_data['genre']
model = tree.DecisionTreeClassifier()
model.fit(X,y)
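With the module import, the tutorial's Graphviz export also works via tree.export_graphviz, reusing the arguments from your original call:

tree.export_graphviz(model, out_file='music-recommender.dot',
                     feature_names=['age', 'gender'],
                     class_names=sorted(y.unique()),
                     label='all',
                     rounded=True,
                     filled=True)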
I am trying to solve a text classification problem and want to create a baseline model using MultinomialNB.
My data is highly imbalanced for a few categories, hence I decided to use the imbalanced-learn library with an sklearn pipeline, following the tutorial.
The model fails with an error after I introduce the two resampling stages into the pipeline as suggested in the docs.
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
RepeatedEditedNearestNeighbours)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()
pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,\
ngram_range= (1, 2),tokenizer=tokenize_and_stem)),\
('tfidf', TfidfTransformer(use_idf= True)),\
('enn', EditedNearestNeighbours()),\
('renn', RepeatedEditedNearestNeighbours()),\
('clf-gnb', MultinomialNB()),])
Error:
TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
Can someone please help? I am also open to a different implementation (boosting/SMOTE) as well.
The make_pipeline helper from imblearn, like sklearn's own make_pipeline, takes the estimators themselves rather than (name, estimator) tuples; step names are generated automatically. From the imblearn documentation:
*steps : list of estimators.
You should modify your code to:
pipe = make_pipeline_imb(
    CountVectorizer(max_features=100000,
                    ngram_range=(1, 2),
                    tokenizer=tokenize_and_stem),
    TfidfTransformer(use_idf=True),
    EditedNearestNeighbours(),
    RepeatedEditedNearestNeighbours(),
    MultinomialNB(),
)
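If you do want named steps (e.g. for grid-search parameter names), a sketch using imblearn's Pipeline class, which does accept (name, estimator) tuples; tokenize_and_stem is assumed to be your existing tokenizer function:

from imblearn.pipeline import Pipeline

pipe = Pipeline([
    ('vect', CountVectorizer(max_features=100000,
                             ngram_range=(1, 2),
                             tokenizer=tokenize_and_stem)),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('enn', EditedNearestNeighbours()),
    ('renn', RepeatedEditedNearestNeighbours()),
    ('clf-gnb', MultinomialNB()),
])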
I would like to train a voting classifier in scikit-learn with three different classifiers. I'm having issues with the final step, which is printing the final accuracy scores of the classifiers.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
log_clf=LogisticRegression()
rnd_clf=RandomForestClassifier()
svm_clf=SVC()
voting_clf=VotingClassifier(estimators=[('lr',log_clf),('rf',rnd_clf),('svc',svm_clf)],voting='hard')
voting_clf.fit(X_train, y_train)
I am getting errors when I run the following code:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_predict=clf.predict(X_test)
    print(clf._class_._name_,accuracy_score(y_test,y_pred))
When I run this chunk of code I get the following:
AttributeError: 'LogisticRegression' object has no attribute '_class_'
I am assuming that calling 'class' is a bit outdated, so I changed class to 'classes_':
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred=clf.predict(X_test)
    print(clf.classes_._name_,accuracy_score(y_test,y_pred))
When I run this chunk of code I get the following:
AttributeError: 'numpy.ndarray' object has no attribute '_name_'
When I remove 'name' and run the following code, I still get an error:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred=clf.predict(X_test)
    print(clf.classes_,accuracy_score(y_test,y_pred))
Error:
NameError: name 'accuracy_score' is not defined
I'm not sure why accuracy_score is not defined, seeing that I imported the function.
For the first error, about class, you need two underscores on each side of both class and name.
Change
print(clf._class_._name_,accuracy_score(y_test,y_pred))
to:
print(clf.__class__.__name__, accuracy_score(y_test,y_pred))
See this question for other ways to get the name of an object in python:
Getting the class name of an instance?
Now for the second error, about 'accuracy_score' not being defined: this happens when accuracy_score has not been imported in the session that runs the loop. Your code does import it, so make sure you are executing the line print(clf.__class__.__name__, accuracy_score(y_test,y_pred)) in the same file (or notebook kernel) where the import was actually run, not in a different one.
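Putting both fixes together, a sketch of the corrected loop (assuming X_train, X_test, y_train, y_test already exist from your split):

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # __class__.__name__ yields the class name, e.g. 'LogisticRegression'
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))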
I want to use libsvm as a classifier for prediction. I have used the following code:
import numpy as np
import sklearn
from sklearn.svm import libsvm
X = np.array([[0,1.22,45,2.111,9.344,0], [0,1.5,25,5,1,0]])
y = np.array([0.0,1.0])
clf=sklearn.svm.libsvm
clf.fit(X,y)
print(clf.predict([1,1.12,42,4.223,2.33,0]))
I got the following error:
File "sklearn/svm/libsvm.pyx", line 270, in sklearn.svm.libsvm.predict (sklearn/svm/libsvm.c:3917)
TypeError: predict() takes at least 6 positional arguments (1 given)
Is this the correct way? How can I resolve the error?
Use sklearn.svm.SVC instead: sklearn.svm.libsvm is a low-level internal binding whose functions expect many positional arguments, not an estimator class you can fit directly. As stated in the sklearn documentation, SVC is based on libsvm:
class SVC(BaseSVC):
C-Support Vector Classification.
The implementation is based on libsvm. The fit time complexity
is more than quadratic with the number of samples which makes it hard
to scale to dataset with more than a couple of 10000 samples.
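A minimal corrected sketch of your snippet; note that predict expects a 2-D array, so the single sample is wrapped in an outer list:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 1.22, 45, 2.111, 9.344, 0], [0, 1.5, 25, 5, 1, 0]])
y = np.array([0.0, 1.0])

clf = SVC()
clf.fit(X, y)
print(clf.predict([[1, 1.12, 42, 4.223, 2.33, 0]]))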