I am new to implementing Machine Learning in Python and am currently trying out KNN classification following YouTube tutorials. Here is the code.
import numpy as np
#from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
import pandas as pd
df=pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?', -99999, inplace=True)
df.drop(['id'],1,inplace=True)
X=np.array(df.drop(['class'],1))
y=np.array(df['class'])
X_train, X_test, y_train, y_test=cross_validate.train_test_split(X,y,test_size=0.2)
I get the following error:
X_train, X_test, y_train, y_test=cross_validate.train_test_split(X,y,test_size=0.2)
AttributeError: 'function' object has no attribute 'train_test_split'
I tried importing train_test_split as
from sklearn.model_selection import train_test_split
but then I get the same error. Any help is appreciated. Thanks!
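train_test_split is a standalone function in sklearn.model_selection (see the docs), not an attribute of cross_validate. cross_validate is itself a function, which is why Python reports that a 'function' object has no attribute 'train_test_split'. The correct usage here is (assuming scikit-learn v0.20):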
from sklearn.model_selection import train_test_split
# [...]
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2)
The sklearn.cross_validation module was deprecated in version 0.18 and removed in version 0.20.
Use sklearn.model_selection.train_test_split
Related
Assume that I want to apply several feature selection methods using sklearn pipeline. An example is provided below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
fs_pipeline = Pipeline([('vt', VarianceThreshold(0.01)),
('kbest', SelectKBest(chi2, k=5)),
])
X_new = fs_pipeline.fit_transform(X_train, y_train)
I get the selected features using the fit_transform method. If I call the fit method on the pipeline instead, I get the fitted pipeline object back.
Now, assume that I want to add an ML model to the pipeline like below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = Pipeline([('vt', VarianceThreshold(0.01)),
('kbest', SelectKBest(chi2, k=5)),
('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
If I use fit_transform method in the above code (model.fit_transform(X_train, y_train)), I get the error:
AttributeError: 'GradientBoostingClassifier' object has no attribute 'transform'
So I should use model.fit(X_train, y_train). But how can I be sure that the pipeline applied the fit_transform method for the feature selection steps?
A pipeline is meant for sequential data transformation (for which it needs multiple calls to .fit_transform()). You can be sure that .fit_transform() is called on the intermediate steps (basically on all steps but the last one) of a pipeline as that's how it works by design.
Namely, when you call .fit() or .fit_transform() on a Pipeline instance, .fit_transform() is called in order on every step except the last one, and the output of each call is passed as input to the next step. On the very last step, either .fit() or .fit_transform() is called, depending on which method was called on the pipeline itself. In practice the last step is usually an estimator rather than a transformer (as with your GradientBoostingClassifier).
Whenever the last step is an estimator rather than a transformer, as in your case, you cannot call .fit_transform() on the pipeline instance, because the pipeline exposes the same methods as its final step, and estimators such as GradientBoostingClassifier expose neither .transform() nor .fit_transform().
Summing up,
case with an estimator in the last step (you can only call .fit() on the pipeline); model.fit(X_train, y_train) means the following, with y_train passed along to every step:
final_estimator.fit(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)
which in your case becomes
gbc.fit(kbest.fit_transform(vt.fit_transform(X_train, y_train), y_train), y_train)
case with a transformer in the last step (you can call either .fit() or .fit_transform() on the pipeline, but let's suppose you're calling .fit_transform()); model.fit_transform(X_train, y_train) means the following:
final_transformer.fit_transform(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)
Finally, here's the reference in the source code: https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/pipeline.py#L351
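One way to check this yourself is to compare the pipeline's fitted steps against a manual chain of fit_transform calls. The following is only a sketch under the assumptions of the question's setup (same dataset, split, and step names); the variable names X_manual and X_pipe are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)

# Manual chain: fit_transform each selector in order, outside any pipeline.
vt = VarianceThreshold(0.01)
kbest = SelectKBest(chi2, k=5)
X_manual = kbest.fit_transform(vt.fit_transform(X_train, y_train), y_train)

# The steps stored in the pipeline are already fitted, so calling
# transform on them reproduces what the final classifier was trained on.
X_pipe = model.named_steps['kbest'].transform(
    model.named_steps['vt'].transform(X_train))

same_features = np.allclose(X_manual, X_pipe)
print(same_features, X_pipe.shape)
```

If the two arrays match, the selectors inside the pipeline were fitted exactly as the manual chain was, which is what calling model.fit is expected to do.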
I am getting the following error:
ImportError: cannot import name 'predict' from 'sklearn.linear_model' (/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/__init__.py)
I have tried everything. Can anyone help?
predict is not part of the sklearn.linear_model module. It's a method of the linear models that are within the module. For example:
from sklearn.linear_model import LinearRegression
# X and y stand for your own training data
regressor = LinearRegression()
regressor.fit(X, y)
regressor.predict(X)
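For a version that runs as is, here is a minimal sketch with a tiny made-up dataset (the numbers are purely illustrative and follow y = 2x):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, invented for this example: y is exactly 2 * x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

regressor = LinearRegression()
regressor.fit(X, y)

# predict is called on the fitted model instance, not imported.
pred = regressor.predict(np.array([[4.0]]))
print(pred)
```

Since the toy data is perfectly linear, the prediction for x = 4 should come out very close to 8.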
I tried to import cross_validation by using the following statement in python 2
from sklearn import cross_validation
but I am receiving the following error
cannot import name cross_validation
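The cross_validation module was removed in scikit-learn 0.20, and its functions now live in sklearn.model_selection. For example, the cross_validate function is imported as,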
from sklearn.model_selection import cross_validate
Basically, all the cross-validation related functions have been moved under model_selection in scikit-learn.
EDIT:
To import train_test_split,
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
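To make the difference between the two names concrete, here is a small sketch that uses both. It assumes the built-in iris dataset and a KNeighborsClassifier purely for illustration; neither comes from the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# train_test_split is a function that returns the four split arrays.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# cross_validate is a different function: it runs cross-validation on an
# estimator and returns a dict of per-fold timings and scores.
clf = KNeighborsClassifier(n_neighbors=5)
cv_results = cross_validate(clf, X_train, y_train, cv=5)

print(X_train.shape, X_test.shape)
print(sorted(cv_results))
print(cv_results['test_score'].mean())
```

Both are plain functions imported from sklearn.model_selection, so neither is accessed as an attribute of the other.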
I would like to train a voting classifier in scikit-learn with three different classifiers. I'm having issues with the final step, which is printing the final accuracy scores of the classifiers.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
log_clf=LogisticRegression()
rnd_clf=RandomForestClassifier()
svm_clf=SVC()
voting_clf=VotingClassifier(estimators=[('lr',log_clf),('rf',rnd_clf),('svc',svm_clf)],voting='hard')
voting_clf.fit(X_train, y_train)
I am getting errors when I run the following code:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_predict=clf.predict(X_test)
    print(clf._class_._name_,accuracy_score(y_test,y_pred))
When I run this chunk of code I get the following:
AttributeError: 'LogisticRegression' object has no attribute '_class_'
I assumed that calling '_class_' is a bit outdated, so I changed it to 'classes_':
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred=clf.predict(X_test)
    print(clf.classes_._name_,accuracy_score(y_test,y_pred))
When I run this chunk of code I get the following:
AttributeError: 'numpy.ndarray' object has no attribute '_name_'
When I remove 'name' and run the following code, I still get an error:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred=clf.predict(X_test)
    print(clf.classes_,accuracy_score(y_test,y_pred))
Error:
NameError: name 'accuracy_score' is not defined
I'm not sure why accuracy_score is not defined, seeing that I imported it.
For the first error about class, you need two underscores on each side of both names: __class__ and __name__, not _class_ and _name_.
Change
print(clf._class_._name_,accuracy_score(y_test,y_pred))
to:
print(clf.__class__.__name__, accuracy_score(y_test,y_pred))
See this question for other ways to get the name of an object in python:
Getting the class name of an instance?
As for the second error, a NameError for accuracy_score means the name was not imported in the place where the line ran. I can see that your code does import accuracy_score, so are you sure you are executing the line print(clf.__class__.__name__, accuracy_score(y_test,y_pred)) in the same file as the import, and not in a different one?
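Putting both points together, a self-contained version of the loop might look like the sketch below. The iris dataset, the random_state values, and max_iter=1000 are assumptions added here so the example runs on its own; they are not part of the original question.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data so the example is runnable (assumption, not from the question).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

log_clf = LogisticRegression(max_iter=1000)
rnd_clf = RandomForestClassifier(random_state=0)
svm_clf = SVC()
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

scores = {}
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Double underscores on both sides: __class__ and __name__.
    scores[clf.__class__.__name__] = accuracy_score(y_test, y_pred)

for name, score in scores.items():
    print(name, score)
```

Keeping the import of accuracy_score in the same file as the loop avoids the NameError, and type(clf).__name__ is an equivalent way to get the class name.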
I have a Python script running on an Amazon Web Services server. Initially the CPU utilization is high, about 60%. However, it gradually drops to roughly 0-1% and stays around there while the same script keeps running. Why does this happen?
My python script is as follows:
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth',100)
import scipy as sp
from sklearn import preprocessing as pp
import pickle
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_predict, GridSearchCV, train_test_split
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
pred = pd.read_pickle('./couple_data_lasso_feature_selection_predictors')
target = pd.read_pickle('./couple_data_without_resample_target')
X_train, X_test, y_train, y_test = train_test_split(pred.values, target.values.ravel(), test_size=0.3, random_state=42)
# Try TPOT
from tpot import TPOTClassifier
pipeline_optimizer = TPOTClassifier(scoring='f1', cv=5, random_state=42, verbosity=2, n_jobs=-1)
pipeline_optimizer.fit(X_train, y_train)
pipeline_optimizer.export('./tpot_exported_pipeline.py')