Doing a ShuffleSplit with GridSearchCV - python

I'm trying to do a ShuffleSplit with a GridSearchCV in scikit-learn.
Here's my MWE, which is a modification of an example from the book Deep Learning with Python by François Chollet. In the book, he doesn't use scikit-learn.
from keras import models
from keras import layers
import numpy as np
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit
from keras.datasets import boston_housing
(train_data,train_targets),(test_data,test_targets)=boston_housing.load_data()
mean=np.mean(train_data)
std=np.std(train_data)
train_data_norm=(train_data-mean)/std
test_data_norm=(test_data-mean)/std
def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation="relu",
                           input_shape=(train_data_norm.shape[1],)))
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(1))
    model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
    return model
model=KerasRegressor(build_fn=build_model,epochs=30,verbose=0)
param_grid = {"epochs":range(1,11)}
ss = ShuffleSplit(n_splits=4, test_size=0.1, random_state=0)
grid_model=GridSearchCV(model,param_grid,cv=ss,n_jobs=-1,scoring='neg_mean_squared_error')
grid_model.fit(train_data, train_targets)
mean_squared_error(grid_model.predict(test_data),test_targets)
One thing I find strange is that when using ShuffleSplit I have to define the size of my test data again, even though the split is only applied to (train_data, train_targets) when fitting the model. Also, I thought that using ShuffleSplit would stabilise the MSE prediction performance compared to a simple CV, but the opposite happens. If I use
grid_model=GridSearchCV(model,param_grid,cv=4,n_jobs=-1,scoring='neg_mean_squared_error')
instead, then I get a smaller MSE when predicting on test_data.
Am I using ShuffleSplit correctly with GridSearchCV?
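For what it's worth, the test_size in ShuffleSplit only controls the size of the validation fold inside each CV split of the training data; it is unrelated to the held-out test set, which never enters GridSearchCV. A minimal sketch that makes the splits explicit (the toy data here is just for illustration):
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)   # toy "training" data
ss = ShuffleSplit(n_splits=4, test_size=0.1, random_state=0)
for fold, (train_idx, val_idx) in enumerate(ss.split(X)):
    # each split holds out 10% of the training data as a validation fold
    print(fold, len(train_idx), len(val_idx))
Note also that ShuffleSplit with test_size=0.1 validates on 10% of the data per split (with possible overlap between splits), while cv=4 validates on 25% per fold, so the two settings are not directly comparable and some difference in the resulting test MSE is expected.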

Related

How can I combine xgboost with adaboost?

I have combined a random forest with AdaBoost as follows:
clf = AdaBoostClassifier(n_estimators=10, base_estimator=RandomForestClassifier(n_estimators=10,max_depth=20))
Now I want to combine AdaBoost with XGBoost, and I have tried this:
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
clf = AdaBoostClassifier(base_estimator=XGBClassifier(eval_metric='mlogloss'))
but it is not working correctly. How can I do this?
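A minimal sketch of the combination on a toy dataset, assuming reasonably recent scikit-learn and xgboost versions (in scikit-learn >= 1.2 the argument is named estimator; older versions use base_estimator):
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

# toy 3-class problem so that mlogloss is an appropriate eval metric
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# AdaBoost reweights samples each round, so the base learner must accept
# sample_weight in fit(); XGBClassifier does.
base = XGBClassifier(n_estimators=50, max_depth=3, eval_metric='mlogloss')
clf = AdaBoostClassifier(estimator=base, n_estimators=10)  # base_estimator= on older sklearn
clf.fit(X, y)
print(clf.score(X, y))
If this still fails for you, the exact error message would help pin down whether it is a version mismatch or something else.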

Using Scikit-Learn's pipelines to combine transformers and an estimator

I'm trying to use Scikit-Learn's Pipeline to organize our transformers and estimator, and I'm having trouble building a pipeline that combines one_hot_transformer with a LinearRegression() estimator. I can't figure out how to connect the following pieces:
from sklearn.preprocessing import OneHotEncoder
cat_feats = np.array([[1,10],[2,20],[3,10],[4,20],[3,10],[2,20],[1,10]])
OneHotEncoder(sparse=False).fit_transform(cat_feats)
one_hot_transformer = OneHotEncoder(sparse=False).fit_transform(X,y)
from sklearn.pipeline import Pipeline
linear_est = Pipeline([one_hot_transformer], LinearRegression())
linear_est.fit(X,y)
predicted = linear_est.predict(X)
grader.score('intro_ml__linear_model', linear_est.predict)
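For reference, a minimal sketch of the pattern being asked about: the Pipeline takes the unfitted transformer itself as a named step, not the output of fit_transform. The targets below are made up for the sketch, and sparse_output assumes scikit-learn >= 1.2 (older versions use sparse=False):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = np.array([[1, 10], [2, 20], [3, 10], [4, 20], [3, 10], [2, 20], [1, 10]])
y = np.array([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0])  # illustrative targets

# each step is a (name, estimator) tuple; only the last step may be a non-transformer
linear_est = Pipeline([
    ('one_hot', OneHotEncoder(sparse_output=False)),
    ('lr', LinearRegression()),
])
linear_est.fit(X, y)
predicted = linear_est.predict(X)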

How to perform simple grid search with Apache Spark

I tried to use Scikit-Learn's GridSearch class to tune the hyperparameters of my logistic regression algorithm.
However, GridSearch, even when using multiple jobs in parallel, takes literally days to run unless you are only tuning one parameter. I thought about using Apache Spark to speed this process up, but I have two questions.
In order to use Apache Spark, do you literally need multiple machines to distribute the workload? For example, if you only have one laptop, is it pointless to use Apache Spark?
Is there a simple way to use Scikit-Learn's GridSearch in Apache Spark?
I have read the documentation, but it talks about running parallel workers on an entire machine learning pipeline, whereas I just want it for the parameter tuning.
Imports
import datetime
%matplotlib inline
import pylab
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.pylab as pylab
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
from datetime import datetime as dt
import scipy
import itertools
ucb_w_reindex = pd.read_csv('clean_airbnb.csv')
ucb = pd.read_csv('clean_airbnb.csv')
pylab.rcParams[ 'figure.figsize' ] = 15 , 10
plt.style.use("fivethirtyeight")
new_style = {'grid': False}
plt.rc('axes', **new_style)
Algorithm Hyper Parameter Tuning
X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=42, stratify=y)
knn = KNeighborsClassifier()
parameters = {'leaf_size': range(1, 100), 'n_neighbors': range(1, 10), 'weights': ['uniform', 'distance'],
'algorithm': ['kd_tree', 'ball_tree', 'brute', 'auto']}
# ======== What I want to do in Apache Spark ========= #
%%time
parameters = {'n_neighbors': range(1, 100)}
clf1 = GridSearchCV(estimator=knn, param_grid=parameters, n_jobs=5).fit(X_train, y_train)
best = clf1.best_estimator_
# ==================================================== #
You can use a library called spark-sklearn to run distributed parameter sweeps. You're correct in that you'd need a cluster of machines, or a single multi-CPU machine to get parallel speedup.
Hope this helps,
Roope - Microsoft MMLSpark Team
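A minimal sketch of what that looks like, assuming the spark-sklearn package is installed and reusing the SparkContext sc, knn, parameters and training split from the code above; its GridSearchCV is intended as a drop-in replacement that takes the SparkContext as its first argument:
from spark_sklearn import GridSearchCV  # distributed drop-in for sklearn's GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
parameters = {'n_neighbors': range(1, 100)}

# each parameter combination is fitted as a separate Spark task
clf1 = GridSearchCV(sc, estimator=knn, param_grid=parameters)
clf1.fit(X_train, y_train)
best = clf1.best_estimator_
On a single laptop this only parallelizes across local cores, which is roughly what n_jobs already gives you; the real benefit appears once the work is spread over a cluster.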

SKlearn pipeline using KNeighborsClassifier

I am trying to build a GridSearchCV pipeline in sklearn using KNeighborsClassifier and SVM. So far, I have tried the following code:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
from sklearn import svm
from sklearn.svm import SVC
clf = SVC(kernel='linear')
pipeline = Pipeline([ ('knn',neigh), ('sVM', clf)]) # Code breaks here
weight_options = ['uniform','distance']
param_knn = {'weights':weight_options}
param_svc = {'kernel':('linear', 'rbf'), 'C':[1,5,10]}
grid = GridSearchCV(pipeline, param_knn, param_svc, cv=5, scoring='accuracy')
but I am getting the following error:
TypeError: All intermediate steps should be transformers and implement fit and transform. 'KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='uniform')' (type <class 'sklearn.neighbors.classification.KNeighborsClassifier'>) doesn't
Can anyone please help me with what I am doing wrong, and how to correct it? I think there is something wrong with the last line as well, regarding the params.
The error clearly says that KNeighborsClassifier doesn't have a transform method; it only implements fit and predict. You can pass any number of steps to a Pipeline, but every step except the last must be a transformer, i.e. implement fit and transform. Please refer to the link below:
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
The scikit-learn Pipeline steps (all but the last) are required to have a transform() method. You might want to try the pipeline from imblearn instead.
See for instance here: https://bsolomon1124.github.io/oversamp/
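If the goal is to let GridSearchCV choose between KNN and SVM rather than chain them, one common pattern is a single swappable step whose candidate estimators are listed in the parameter grid. A minimal sketch, with the step name 'clf' chosen here purely for illustration:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# a one-step pipeline; the single step is replaced by the grids below
pipeline = Pipeline([('clf', KNeighborsClassifier())])

# a list of grids: each dict swaps in a different estimator with its own parameters
param_grid = [
    {'clf': [KNeighborsClassifier(n_neighbors=3)],
     'clf__weights': ['uniform', 'distance']},
    {'clf': [SVC()],
     'clf__kernel': ['linear', 'rbf'],
     'clf__C': [1, 5, 10]},
]

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
# grid.fit(X, y)  # X, y: your training data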

cross_val_score fails with tensorflow(skflow)

I am using Python 3.5 with TensorFlow 0.11 and sklearn 0.18.
I wrote a simple example to calculate the cross-validation score on the iris data using TensorFlow. I used skflow as the wrapper.
import tensorflow.contrib.learn as skflow
from sklearn import datasets
from sklearn import cross_validation
iris=datasets.load_iris()
feature_columns = skflow.infer_real_valued_columns_from_input(iris.data)
classifier = skflow.DNNClassifier(hidden_units=[10, 10, 10], n_classes=3, feature_columns=feature_columns)
print(cross_validation.cross_val_score(classifier, iris.data, iris.target, cv=2, scoring = 'accuracy'))
But I got the error below. It seems that skflow is not compatible with sklearn's cross_val_score.
TypeError: Cannot clone object '' (type ): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
Is there any other way to deal with this problem?
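cross_val_score has to clone the estimator, which requires the scikit-learn get_params/set_params protocol that tf.contrib.learn estimators don't implement. One workaround is to run the folds by hand. A minimal sketch, assuming the TF 0.11 contrib.learn API where fit takes a steps argument and predict returns class labels (in some versions predict returns a generator, hence the list(...) call):
import numpy as np
import tensorflow.contrib.learn as skflow
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

iris = datasets.load_iris()
scores = []
for train_idx, test_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(iris.data):
    # re-create the estimator for every fold, since it cannot be cloned
    feature_columns = skflow.infer_real_valued_columns_from_input(iris.data[train_idx])
    classifier = skflow.DNNClassifier(hidden_units=[10, 10, 10], n_classes=3,
                                      feature_columns=feature_columns)
    classifier.fit(iris.data[train_idx], iris.target[train_idx], steps=200)
    preds = list(classifier.predict(iris.data[test_idx]))
    scores.append(accuracy_score(iris.target[test_idx], preds))
print(np.mean(scores))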
