Python error: cannot import name cross_validation

I tried to import cross_validation in Python 2 with the following statement:
from sklearn import cross_validation
but I receive the following error:
cannot import name cross_validation

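The sklearn.cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20. Its functionality now lives in sklearn.model_selection; for example, to import the cross_validate function:
from sklearn.model_selection import cross_validate
Essentially all of the cross-validation functions were moved under model_selection.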
EDIT :
To import train_test_split,
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
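If the same script has to run on both old and new scikit-learn installs (for instance a legacy Python 2 environment), one option is a guarded import. This is only a sketch: the fallback branch assumes an old release in which sklearn.cross_validation still exists, and the toy arrays are made up for illustration.

```python
import numpy as np

try:
    # Location of train_test_split in scikit-learn 0.18 and later.
    from sklearn.model_selection import train_test_split
except ImportError:
    # Fallback for older releases (assumption: the old module is still present).
    from sklearn.cross_validation import train_test_split

# Toy data, purely illustrative: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

On a current install only the first branch runs, so the fallback costs nothing.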

Related

How to be sure that an sklearn pipeline applies the fit_transform method when using feature selection and an ML model in the pipeline?

Assume that I want to apply several feature selection methods using sklearn pipeline. An example is provided below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
fs_pipeline = Pipeline([('vt', VarianceThreshold(0.01)),
('kbest', SelectKBest(chi2, k=5)),
])
X_new = fs_pipeline.fit_transform(X_train, y_train)
I get the selected features using the fit_transform method. If I call the fit method on the pipeline instead, I get the pipeline object back.
Now, assume that I want to add an ML model to the pipeline, as below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = Pipeline([('vt', VarianceThreshold(0.01)),
('kbest', SelectKBest(chi2, k=5)),
('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
If I use fit_transform method in the above code (model.fit_transform(X_train, y_train)), I get the error:
AttributeError: 'GradientBoostingClassifier' object has no attribute 'transform'
So I should use model.fit(X_train, y_train). But how can I be sure that the pipeline applied the fit_transform method for the feature selection steps?
A pipeline is meant for sequential data transformation (for which it needs multiple calls to .fit_transform()). You can be sure that .fit_transform() is called on the intermediate steps (basically on all steps but the last one) of a pipeline as that's how it works by design.
Namely, when .fit() or .fit_transform() is called on a Pipeline instance, .fit_transform() is called in sequence on every step except the last one, and the output of each call is passed as input to the next. On the very last step, either .fit() or .fit_transform() is called, depending on which method was called on the pipeline itself; the last step is usually an estimator rather than a transformer (as with your GradientBoostingClassifier).
Whenever the last step is an estimator rather than a transformer, as in your case, you cannot call .fit_transform() on the pipeline, because the pipeline exposes the same methods as its final step, and estimators such as this one expose neither .transform() nor .fit_transform().
Summing up,
case with an estimator in the last step (you can only call .fit() on the pipeline); model.fit(X_train, y_train) means the following:
final_estimator.fit(transformer_n.fit_transform(transformer_n_minus_1.fit_transform(...transformer0.fit_transform(X_train, y_train))))
which in your case becomes
gbc.fit(kbest.fit_transform(vt.fit_transform(X_train, y_train), y_train), y_train)
case with a transformer in the last step (you can either call .fit() or .fit_transform() on the pipeline, but let's suppose you're calling .fit_transform()); model.fit_transform(X_train, y_train) means the following:
final_estimator.fit_transform(transformer_n.fit_transform(transformer_n_minus_1.fit_transform(...transformer0.fit_transform(X_train, y_train))))
Finally, here's the reference in the source code: https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/pipeline.py#L351
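The nesting described above can be checked empirically. The sketch below fits the pipeline from the question and then repeats the same steps by hand with separate estimator instances; the variable names (X_vt, X_kb, same_predictions, selected_mask) are introduced here for illustration only. If the pipeline really calls fit_transform on the intermediate steps, both routes should give identical predictions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Route 1: let the pipeline drive every step.
model = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
    ("gbc", GradientBoostingClassifier(random_state=0)),
])
model.fit(X_train, y_train)
pipeline_pred = model.predict(X_test)

# Route 2: the same chain written out by hand.
vt = VarianceThreshold(0.01)
kbest = SelectKBest(chi2, k=5)
gbc = GradientBoostingClassifier(random_state=0)
X_vt = vt.fit_transform(X_train, y_train)
X_kb = kbest.fit_transform(X_vt, y_train)
gbc.fit(X_kb, y_train)
manual_pred = gbc.predict(kbest.transform(vt.transform(X_test)))

same_predictions = np.array_equal(pipeline_pred, manual_pred)
print("identical predictions:", same_predictions)

# The fitted selector inside the pipeline is reachable by its step name.
selected_mask = model.named_steps["kbest"].get_support()
print("features kept by kbest:", int(selected_mask.sum()))
```

Inspecting model.named_steps is another way to confirm that the intermediate steps were fitted, since get_support() only works on a fitted selector.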

'function' object has no attribute 'train_test_split'

I am new to implementing Machine Learning in Python and am currently trying out KNN classification following YouTube tutorials. Here is the code.
import numpy as np
#from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
import pandas as pd
df=pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?', -99999, inplace=True)
df.drop(['id'],1,inplace=True)
X=np.array(df.drop(['class'],1))
y=np.array(df['class'])
X_train, X_test, y_train, y_test=cross_validate.train_test_split(X,y,test_size=0.2)
I get the following error: -
X_train, X_test, y_train, y_test=cross_validate.train_test_split(X,y,test_size=0.2)
AttributeError: 'function' object has no attribute 'train_test_split'
I tried importing train_test_split as
from sklearn.model_selection import train_test_split
but then I get the same error. Any help is appreciated. Thanks!
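train_test_split is a separate function in sklearn.model_selection, not an attribute of cross_validate (which is itself a function, hence the error message), and the two are not meant to be combined like this; the correct usage here is (assuming scikit-learn v0.20):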
from sklearn.model_selection import train_test_split
# [...]
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2)
sklearn.cross_validation was deprecated in version 0.18 and removed in version 0.20.
Use sklearn.model_selection.train_test_split instead.
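To make the distinction concrete, here is a minimal sketch that uses the two functions for their separate jobs: train_test_split to hold out a test set, and cross_validate to score a model on the training portion. It swaps in scikit-learn's bundled breast cancer dataset for the CSV file in the question, and the choice of cv=5 is an assumption made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled dataset used as a stand-in for the CSV file in the question.
X, y = load_breast_cancer(return_X_y=True)

# train_test_split is called directly; it is not a method of cross_validate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

knn = KNeighborsClassifier()

# cross_validate takes an estimator plus data and returns a dict of arrays.
cv_results = cross_validate(knn, X_train, y_train, cv=5)
print("mean CV accuracy:", cv_results["test_score"].mean())

# Final fit on the training split, scored on the held-out test split.
knn.fit(X_train, y_train)
test_accuracy = knn.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```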

Issues printing the class, name and accuracy score of the voting classifier

I would like to train a voting classifier in scikit-learn with three different classifiers. I'm having issues with the final step, which is printing the final accuracy scores of the classifiers.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
log_clf=LogisticRegression()
rnd_clf=RandomForestClassifier()
svm_clf=SVC()
voting_clf=VotingClassifier(estimators=[('lr',log_clf),('rf',rnd_clf),('svc',svm_clf)],voting='hard')
voting_clf.fit(X_train, y_train)
I am getting errors when I run the following code:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_predict=clf.predict(X_test)
    print(clf._class_._name_,accuracy_score(y_test,y_pred))
When I run this chunk of code I get the following:
AttributeError: 'LogisticRegression' object has no attribute '_class_'
I assumed that '_class_' was outdated, so I changed it to 'classes_':
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred=clf.predict(X_test)
    print(clf.classes_._name_,accuracy_score(y_test,y_pred))
When I run this chunk of code I get the following:
AttributeError: 'numpy.ndarray' object has no attribute '_name_'
When I remove 'name' and run the following code, I still get an error:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred=clf.predict(X_test)
    print(clf.classes_,accuracy_score(y_test,y_pred))
Error:
NameError: name 'accuracy_score' is not defined
I'm not sure why accuracy_score is not defined, seeing that I imported it.
For the first error, about class, you need two underscores on each side of both class and name.
Change
print(clf._class_._name_,accuracy_score(y_test,y_pred))
to:
print(clf.__class__.__name__, accuracy_score(y_test,y_pred))
See this question for other ways to get the name of an object in python:
Getting the class name of an instance?
The second error, about accuracy_score not being defined, occurs when accuracy_score has not been imported in the session where the line runs. Your code does import it, so are you sure you are executing print(clf.__class__.__name__, accuracy_score(y_test,y_pred)) in the same file (or the same interpreter session) as the import? Note also that your first loop assigns y_predict but passes y_pred to accuracy_score.
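Putting both fixes together, a self-contained version of the loop might look like the sketch below. The question does not show how X_train and the other splits were built, so a synthetic dataset from make_classification stands in for them; that dataset, the random_state values, and the scores dictionary are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: the original data is not shown).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_clf = LogisticRegression(max_iter=1000)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(random_state=42)
voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
    voting="hard",
)

scores = {}
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Double underscores on both sides: __class__ and __name__.
    name = clf.__class__.__name__
    scores[name] = accuracy_score(y_test, y_pred)
    print(name, scores[name])
```

type(clf).__name__ is an equivalent spelling of the same lookup.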

Why does the AWS EC2 CPU utilization drop gradually while the same Python script is running?

I have a Python script running on Amazon Web Services (EC2). Initially the CPU utilization is high, around 60%. However, it gradually drops to around 0-1% and stays there while the same script keeps running. Why does this happen?
My python script is as follows:
import numpy as np
import pandas as pd
pd.set_option('max_colwidth',100)
import scipy as sp
from sklearn import preprocessing as pp
import pickle
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_predict, GridSearchCV, train_test_split
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
pred = pd.read_pickle('./couple_data_lasso_feature_selection_predictors')
target = pd.read_pickle('./couple_data_without_resample_target')
X_train, X_test, y_train, y_test = train_test_split(pred.values, target.values.ravel(), test_size=0.3, random_state=42)
# Try TPOT
from tpot import TPOTClassifier
pipeline_optimizer = TPOTClassifier(scoring='f1', cv=5, random_state=42, verbosity=2, n_jobs=-1)
pipeline_optimizer.fit(X_train, y_train)
pipeline_optimizer.export('./tpot_exported_pipeline.py')

How to perform simple grid search with Apache Spark

I tried to use scikit-learn's GridSearchCV class to tune the hyperparameters of my logistic regression algorithm.
However, GridSearchCV, even when running multiple jobs in parallel, takes days to finish unless you are tuning only one parameter. I thought about using Apache Spark to speed this up, but I have two questions.
In order to use Apache Spark, do you actually need multiple machines to distribute the workload? For example, if you only have one laptop, is it pointless to use Apache Spark?
Is there a simple way to use scikit-learn's GridSearchCV in Apache Spark?
I have read the documentation, but it talks about running parallel workers on an entire machine learning pipeline, whereas I only want it for the parameter tuning.
Imports
import datetime
%matplotlib inline
import pylab
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.pylab as pylab
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
from datetime import datetime as dt
import scipy
import itertools
ucb_w_reindex = pd.read_csv('clean_airbnb.csv')
ucb = pd.read_csv('clean_airbnb.csv')
pylab.rcParams[ 'figure.figsize' ] = 15 , 10
plt.style.use("fivethirtyeight")
new_style = {'grid': False}
plt.rc('axes', **new_style)
Algorithm Hyper Parameter Tuning
X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=42, stratify=y)
knn = KNeighborsClassifier()
parameters = {'leaf_size': range(1, 100), 'n_neighbors': range(1, 10), 'weights': ['uniform', 'distance'],
'algorithm': ['kd_tree', 'ball_tree', 'brute', 'auto']}
# ======== What I want to do in Apache Spark ========= #
%%time
parameters = {'n_neighbors': range(1, 100)}
clf1 = GridSearchCV(estimator=knn, param_grid=parameters, n_jobs=5).fit(X_train, y_train)
best = clf1.best_estimator_
# ==================================================== #
You can use a library called spark-sklearn to run distributed parameter sweeps. You're correct in that you'd need a cluster of machines, or a single multi-CPU machine to get parallel speedup.
Hope this helps,
Roope - Microsoft MMLSpark Team
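For the single multi-CPU machine case mentioned above, scikit-learn's own GridSearchCV can already spread the parameter sweep over local cores through n_jobs. The sketch below shows that route only; it is not a Spark or spark-sklearn example. The iris dataset and the reduced n_neighbors range are assumptions standing in for the Airbnb data and the grid in the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small bundled dataset as a stand-in (assumption: original CSV not available).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# A deliberately small grid; every extra parameter multiplies the work.
param_grid = {"n_neighbors": list(range(1, 21))}

# n_jobs=-1 asks scikit-learn to use all available local CPU cores.
search = GridSearchCV(
    KNeighborsClassifier(), param_grid=param_grid, cv=5, n_jobs=-1
)
search.fit(X_train, y_train)

best = search.best_estimator_
print(search.best_params_, search.score(X_test, y_test))
```

The total cost is the number of parameter combinations times the number of folds, so shrinking the grid often helps more than adding workers.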
