I tried to use scikit-learn's GridSearchCV class to tune the hyperparameters of my logistic regression algorithm.
However, GridSearchCV, even with multiple jobs running in parallel, takes literally days unless you are tuning only one parameter. I thought about using Apache Spark to speed this up, but I have two questions.
To use Apache Spark, do you literally need multiple machines to distribute the workload? For example, if you only have one laptop, is it pointless to use Apache Spark?
Is there a simple way to use scikit-learn's GridSearchCV in Apache Spark?
I have read the documentation, but it talks about running parallel workers on an entire machine learning pipeline; I just want it for the parameter tuning.
Imports
import datetime
%matplotlib inline
import pylab
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.pylab as pylab
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE
from sklearn.svm import SVC
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split is already imported from sklearn.model_selection above
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
from datetime import datetime as dt
import scipy
import itertools
ucb_w_reindex = pd.read_csv('clean_airbnb.csv')
ucb = pd.read_csv('clean_airbnb.csv')
pylab.rcParams[ 'figure.figsize' ] = 15 , 10
plt.style.use("fivethirtyeight")
new_style = {'grid': False}
plt.rc('axes', **new_style)
Algorithm Hyper Parameter Tuning
X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=42, stratify=y)
knn = KNeighborsClassifier()
parameters = {'leaf_size': range(1, 100), 'n_neighbors': range(1, 10), 'weights': ['uniform', 'distance'],
'algorithm': ['kd_tree', 'ball_tree', 'brute', 'auto']}
%%time
# ======== What I want to do in Apache Spark ========= #
parameters = {'n_neighbors': range(1, 100)}
clf1 = GridSearchCV(estimator=knn, param_grid=parameters, n_jobs=5).fit(X_train, y_train)
best = clf1.best_estimator_
# ==================================================== #
You can use a library called spark-sklearn to run distributed parameter sweeps. You're correct that you'd need either a cluster of machines or a single multi-CPU machine to get a parallel speedup.
Hope this helps,
Roope - Microsoft MMLSpark Team
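For the single-machine case, here is a minimal sketch of a parameter sweep that uses every core of one laptop without Spark. It relies only on scikit-learn's own GridSearchCV with n_jobs=-1. The iris data and the small KNN grid are placeholders I picked for illustration, not the Airbnb data from the question. As far as I recall from the spark-sklearn README, its GridSearchCV is meant as a near drop-in replacement that takes the SparkContext as an extra first argument; treat that as an assumption and check the project's documentation before relying on it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data; substitute your own X and y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# A deliberately small grid: 30 x 2 = 60 candidates, each fitted 5 times.
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
}

# n_jobs=-1 asks joblib to use every available core on this one machine.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))

# Assumed spark-sklearn equivalent (not run here, verify against its docs):
#   from spark_sklearn import GridSearchCV
#   search = GridSearchCV(sc, KNeighborsClassifier(), param_grid)
```

If the sweep is still too slow on one machine, shrinking the grid usually helps more than adding workers, since the number of fits grows multiplicatively with every extra parameter.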
Related
I have combined random forest with AdaBoost as follows:
clf = AdaBoostClassifier(n_estimators=10, base_estimator=RandomForestClassifier(n_estimators=10,max_depth=20))
Now I want to combine AdaBoost with XGBoost, and I have tried this:
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
clf = AdaBoostClassifier(base_estimator=XGBClassifier(eval_metric='mlogloss'))
It is not working correctly. How can I do this?
You would just use it like this:
import lib1, lib2, lib3, lib4, lib5
I'm trying to do a ShuffleSplit with a GridSearchCV in scikit-learn.
Here's my MWE, which is a modification to what one can find in the Deep Learning with Python book, by François Chollet. In the book, he doesn't use scikit-learn.
from keras import models
from keras import layers
import numpy as np
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit
from keras.datasets import boston_housing
(train_data,train_targets),(test_data,test_targets)=boston_housing.load_data()
# Standardise each feature separately (axis=0) rather than with one global mean/std
mean=np.mean(train_data,axis=0)
std=np.std(train_data,axis=0)
train_data_norm=(train_data-mean)/std
test_data_norm=(test_data-mean)/std
def build_model():
    model=models.Sequential()
    model.add(layers.Dense(64,activation="relu",
                           input_shape=(train_data_norm.shape[1],)))
    model.add(layers.Dense(64,activation="relu"))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop',loss="mse",metrics=["mae"])
    return model
model=KerasRegressor(build_fn=build_model,epochs=30,verbose=0)
param_grid = {"epochs":range(1,11)}
ss = ShuffleSplit(n_splits=4, test_size=0.1, random_state=0)
grid_model=GridSearchCV(model,param_grid,cv=ss,n_jobs=-1,scoring='neg_mean_squared_error')
# Fit and evaluate on the normalised arrays computed above
grid_model.fit(train_data_norm, train_targets)
mean_squared_error(test_targets, grid_model.predict(test_data_norm))
One thing I find strange is that, when using ShuffleSplit, I have to define the size of my test data again, and it applies only to (train_data,train_targets) when fitting the model. Also, I thought that using ShuffleSplit would stabilise the MSE prediction performance compared to a simple CV, but the opposite happens. If I use
grid_model=GridSearchCV(model,param_grid,cv=4,n_jobs=-1,scoring='neg_mean_squared_error')
instead, then I get a smaller MSE when predicting for the test_data.
Am I using ShuffleSplit correctly in GridSearchCV?
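To make the difference between the two cv settings concrete, here is a small sketch on 20 toy samples (not the Boston data). It only shows which indices each splitter hands to GridSearchCV as validation sets. With an integer cv and a regressor, scikit-learn uses KFold, so every sample is validated exactly once. ShuffleSplit draws each validation set independently, so test_size is the share of the training data held out in each split and has nothing to do with the final test set. With test_size=0.1 each validation score rests on few samples, which may be part of why the results look less stable.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(20).reshape(-1, 1)  # 20 toy samples

ss = ShuffleSplit(n_splits=4, test_size=0.1, random_state=0)
kf = KFold(n_splits=4)  # what cv=4 resolves to for a regressor

# Collect only the validation indices of every split.
ss_val = [val for _, val in ss.split(X)]
kf_val = [val for _, val in kf.split(X)]

# ShuffleSplit: 10% of 20 samples per split; KFold: 1/4 of 20 per split.
print([len(v) for v in ss_val])
print([len(v) for v in kf_val])

# How many distinct samples were ever used for validation?
ss_covered = np.unique(np.concatenate(ss_val))
kf_covered = np.unique(np.concatenate(kf_val))
print(len(ss_covered), "of 20 samples validated by ShuffleSplit")
print(len(kf_covered), "of 20 samples validated by KFold")
```

Raising test_size or n_splits in ShuffleSplit gives each candidate more validation data, at the cost of more fits.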
I tried to import cross_validation by using the following statement in python 2
from sklearn import cross_validation
but I am receiving the following error
cannot import name cross_validation
The cross_validation module was removed in scikit-learn 0.20. Its functions now live in model_selection, so you import them from there, for example:
from sklearn.model_selection import cross_validate
Basically, all the cross-validation-related functions have been moved under model_selection in scikit-learn.
EDIT :
To import train_test_split,
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
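As a rough illustration of the migration, the sketch below imports the common helpers from model_selection and runs them on the iris data, which is just a placeholder. Note that cross_validate is a function that returns a dict of timings and scores, not a replacement for the old module itself.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate

X, y = load_iris(return_X_y=True)  # placeholder data
clf = LogisticRegression(max_iter=1000)

# Previously: cross_validation.cross_val_score(clf, X, y, cv=5)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

# cross_validate returns a dict with fit/score times and test scores.
results = cross_validate(clf, X, y, cv=5)
print(sorted(results))
```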
I have a Python script running on an Amazon Web Services instance. Initially the CPU utilization is high, ~60%. However, it gradually drops to roughly 0-1% and stays around there while the same script is still running. Why does this happen?
My python script is as follows:
import numpy as np
import pandas as pd
pd.set_option('max_colwidth',100)
import scipy as sp
from sklearn import preprocessing as pp
import pickle
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_predict, GridSearchCV, train_test_split
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
pred = pd.read_pickle('./couple_data_lasso_feature_selection_predictors')
target = pd.read_pickle('./couple_data_without_resample_target')
X_train, X_test, y_train, y_test = train_test_split(pred.values, target.values.ravel(), test_size=0.3, random_state=42)
# Try TPOT
from tpot import TPOTClassifier
pipeline_optimizer = TPOTClassifier(scoring='f1', cv=5, random_state=42, verbosity=2, n_jobs=-1)
pipeline_optimizer.fit(X_train, y_train)
pipeline_optimizer.export('./tpot_exported_pipeline.py')
In sklearn, it is possible to define a pipeline in the following way:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
Is it also possible to include custom feature extraction methods like
extract_features(image, cspace='RGB',
pix_per_cell=128, cell_per_block=32,
hog_channel=10)
and how would I do that?
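One possible approach, as a sketch rather than a definitive answer: wrap the function in a small transformer class so the pipeline can call it like any other step. The extract_features below is a hypothetical stand-in (a plain intensity histogram with an n_bins parameter I made up), since the real function's body isn't shown in the question. The FeatureExtractor class name, the random images and the labels are likewise only for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def extract_features(image, n_bins=8):
    """Hypothetical stand-in for the real extractor: an intensity histogram."""
    hist, _ = np.histogram(image, bins=n_bins, range=(0.0, 1.0))
    return hist.astype(float)


class FeatureExtractor(TransformerMixin, BaseEstimator):
    """Applies extract_features to every image in X."""

    def __init__(self, n_bins=8):
        self.n_bins = n_bins

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        # One feature row per image.
        return np.array([extract_features(img, n_bins=self.n_bins) for img in X])


# Random toy "images" with values in [0, 1) and alternating labels.
rng = np.random.RandomState(0)
images = rng.rand(40, 16, 16)
labels = np.array([0, 1] * 20)

pipe = Pipeline([
    ("features", FeatureExtractor(n_bins=8)),
    ("reduce_dim", PCA(n_components=3)),
    ("clf", SVC()),
])
pipe.fit(images, labels)
print(pipe.predict(images[:5]))
```

Because the extractor's options are constructor arguments, they can be tuned like any other step parameter, e.g. with the key features__n_bins in a GridSearchCV param_grid. For a stateless function, sklearn.preprocessing.FunctionTransformer is a shorter alternative to writing the class.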