Python library to perform stratified KFold cross-validation in Keras - python

I have a set of data on which I would like to train a Neural Net, although I believe my question pertains to any type of machine learning.
My data falls into two classes, but I have many more examples of class one than of class two. Before I go ahead and train a neural net on my data, I intend to split the data into 3 independent groups (Training, Validation and Testing), and within each one, duplicate the data I have for class two enough times that I have equal amounts of data from each class in that group.
This is really tedious to do, and I'm willing to bet that other people have had the same problem. Is there a python library that does this for me? Or at least part of it?
tl;dr: I want a python library that splits my data into 3 parts and equalizes the amount of data I have in each class without throwing away data

Yes, use scikit-learn. Copy pasting KeironO's answer from https://github.com/fchollet/keras/issues/1711:
from sklearn.model_selection import StratifiedKFold  # sklearn.cross_validation was removed in newer releases

def load_data():
    # load your data using this function
    ...

def create_model():
    # create your model using this function
    ...

def train_and_evaluate_model(model, data_train, labels_train, data_test, labels_test):
    # fit and evaluate here, e.g. model.fit(...)
    ...

if __name__ == "__main__":
    n_folds = 10
    data, labels, header_info = load_data()
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True)
    for i, (train, test) in enumerate(skf.split(data, labels)):
        print("Running Fold", i + 1, "/", n_folds)
        model = None  # Clearing the NN.
        model = create_model()
        train_and_evaluate_model(model, data[train], labels[train], data[test], labels[test])
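To also equalize the classes inside each training fold without discarding data (the second half of the question), one option, not part of KeironO's answer, is to duplicate minority-class rows with sklearn.utils.resample. A minimal sketch, assuming data and labels are NumPy arrays as returned by load_data above:

import numpy as np
from sklearn.utils import resample

def upsample_to_balance(X, y, seed=0):
    # Duplicate rows of each under-represented class until every class
    # matches the size of the largest class.
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        X_cls, y_cls = X[y == cls], y[y == cls]
        if count < target:
            X_cls, y_cls = resample(X_cls, y_cls, replace=True,
                                    n_samples=target, random_state=seed)
        X_parts.append(X_cls)
        y_parts.append(y_cls)
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Inside the fold loop, balance only the training split:
# X_bal, y_bal = upsample_to_balance(data[train], labels[train])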

Related

How to predict on new data with saved OPTICS clustering model

I work with density based clustering and usually cluster on data (text) as and when I get it.
However, I want to save and re-use one of my clustering models since it would reduce memory costs by sending my data sequentially instead of having to dump all of it in memory and cluster on-demand.
I tried to achieve this by pickling my OPTICS clusterer object.
This is how I want to use the model:
import pickle

def load_pickle(pickle_filepath: str):
    model_file = pickle.load(open(pickle_filepath, "rb"))
    return model_file

class StoredClusterer:
    def __init__(self, dimred_model, clustering_model):
        self.dimred_model = dimred_model
        self.clustering_model = clustering_model

    def predict(self, embedding):
        reduced_embedding = self.dimred_model.transform(embedding)
        approximate_cluster = self.clustering_model.predict(reduced_embedding)
        return approximate_cluster

dimred_model = load_pickle('path/to/pickle/file')
clustering_model = load_pickle('path/to/pickle/file')
clusterer_object = StoredClusterer(dimred_model, clustering_model)
I have gone through the documentation for OPTICS, but it does not seem to have a predict() like KMeans does.
Is there any way to custom-build the predict() for OPTICS?
If not, what would be the best solution to remember and predict clusters with density-based clustering algorithms?
Since the output of my density-based algorithm is a couple hundred very fine-grained clusters, it does not make sense to use the output of fit_predict() as training data for a supervised learning algorithm.
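One possible direction, sketched here as an assumption rather than an established OPTICS feature: keep the reduced training embeddings alongside the pickled models and assign each new point the cluster label of its nearest training point. The name train_embeddings is hypothetical; OPTICS itself does not store the data it was fit on.

import numpy as np
from sklearn.neighbors import NearestNeighbors

class NearestLabelPredictor:
    def __init__(self, train_embeddings, optics_labels):
        # Index the reduced training points (not stored by OPTICS itself).
        self.nn = NearestNeighbors(n_neighbors=1).fit(train_embeddings)
        self.labels = np.asarray(optics_labels)  # label -1 means noise in OPTICS

    def predict(self, reduced_embedding):
        # Each new row gets the cluster label of its closest training point.
        _, idx = self.nn.kneighbors(reduced_embedding)
        return self.labels[idx.ravel()]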

Get instance variable of custom transformer in sklearn pipeline

I am tasked with a supervised learning problem on a dataset and want to create a full Pipeline from beginning to end.
Starting with the train-test split, I wrote a custom class to wrap sklearn's train_test_split as a step in the sklearn pipeline. Its fit_transform returns the training set. Later I still want to access the test set, so I made it an instance variable in the custom transformer class like this:
self.test_set = test_set
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

class train_test_splitter([...]):
    [...
    ...]

    def transform(self, X):
        train_set, test_set = train_test_split(X, test_size=0.2)
        self.test_set = test_set
        return train_set

split_pipeline = Pipeline([
    ('splitter', train_test_splitter()),
])
df_train = split_pipeline.fit_transform(df)
Now I want to get the test set like this:
df_test = splitter.test_set
It's not working. How do I get the variables of the instance "splitter"? Where does it get stored?
You can access the steps of a pipeline in a number of ways. For example,
split_pipeline['splitter'].test_set
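Sketched here for illustration, a few equivalent ways to reach the fitted step (assuming split_pipeline has already been fit):

splitter = split_pipeline['splitter']               # index the pipeline by step name
splitter = split_pipeline.named_steps['splitter']   # or via the named_steps mapping
splitter = split_pipeline.steps[0][1]               # or via the list of (name, estimator) tuples
df_test = splitter.test_set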
That said, I don't think this is a good approach. When you fill out the pipeline with more steps, at fit time everything will work how you want, but when predicting/transforming on other data you will still be calling your transform method, which will generate a new train-test split, forgetting the old one, and sending the new train set down the pipe for the remaining steps.

How to invoke Sagemaker XGBoost endpoint post model creation?

I have been following along with this really helpful XGBoost tutorial on Medium (code used towards bottom of article): https://medium.com/analytics-vidhya/random-forest-and-xgboost-on-amazon-sagemaker-and-aws-lambda-29abd9467795.
To date, I've been able to get the data appropriately formatted for ML purposes, a model created from the training data, and test data fed through the model to give useful results.
Whenever I leave and come back to work more on the model or feed in new test data, however, I find I need to re-run all the model creation steps in order to make any further predictions. Instead, I would like to just call my already-created model endpoint based on the Image_URI and feed in new data.
Current steps performed:
Model Training
xgb = sagemaker.estimator.Estimator(containers[my_region],
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket_name, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(eta=0.06,
                        alpha=0.8,
                        lambda_bias=0.8,
                        gamma=50,
                        min_child_weight=6,
                        subsample=0.5,
                        silent=0,
                        early_stopping_rounds=5,
                        objective='reg:linear',
                        num_round=1000)
xgb.fit({'train': s3_input_train})
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
Evaluation
test_data_array = test_data.drop([ 'price','id','sqft_above','date'], axis=1).values #load the data into an array
xgb_predictor.serializer = csv_serializer # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
predictions_array = np.fromstring(predictions[1:], sep=',') # and turn the prediction into an array
print(predictions_array.shape)
from sklearn.metrics import r2_score
print("R2 score : %.2f" % r2_score(test_data['price'],predictions_array))
It seems that this particular line:
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
needs to be re-written in order to not reference xgb.predictor but instead reference the model location.
I have tried the following
trained_model = sagemaker.model.Model(
    model_data='s3://{}/{}/output/xgboost-2020-11-10-00-00/output/model.tar.gz'.format(bucket_name, prefix),
    image_uri='XXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
    role=role)  # your role here; could be a different name
trained_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
and then replaced
xgb_predictor.serializer = csv_serializer # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8') # predict!
with
trained_model.serializer = csv_serializer # set the serializer type
predictions = trained_model.predict(test_data_array).decode('utf-8') # predict!
but I get the following error:
AttributeError: 'Model' object has no attribute 'predict'
That's a good question :) I agree, many of the official tutorials tend to show the full train-to-invoke pipeline and don't emphasize enough that each step can be done separately. In your specific case, when you want to invoke an already-deployed endpoint, you can either (A) use the invoke API call in one of the numerous SDKs (for example the CLI or boto3), or (B) instantiate a predictor with the high-level Python SDK, either the generic sagemaker.predictor.Predictor class or its XGBoost-specific child sagemaker.xgboost.model.XGBoostPredictor, as illustrated below:
from sagemaker.xgboost.model import XGBoostPredictor
predictor = XGBoostPredictor(endpoint_name='your-endpoint')
predictor.predict('<payload>')
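For option (A), a minimal sketch with boto3; the endpoint name and CSV payload below are placeholders for your own:

import boto3

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='your-endpoint',
    ContentType='text/csv',
    Body='0.5,1.2,3.4',  # one CSV row of features
)
prediction = response['Body'].read().decode('utf-8')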
Similar question: How to use a pretrained model from s3 to predict some data?
Note:
If you want the model.deploy() call to return a predictor, your model must be instantiated with a predictor_cls. This is optional; you can also deploy a model first and then invoke it as a separate step with the above technique.
Endpoints create charges even if you don't invoke them; they are charged per uptime. So if you don't need an always-on endpoint, don't hesitate to shut it down to minimize costs.
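For example, assuming the predictor object from the sketch above, one way to tear the endpoint down when you are done:

predictor.delete_endpoint()  # stops the per-uptime endpoint charges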

Setting up an optimization solver on top of a neural network model

I have a trained neural network model developed using the Keras framework in a Jupyter notebook. It is a regression problem, where I am trying to predict an output variable using some 14 input variables or features.
As a next step, I would like to minimize my output and want to determine what configuration/values these 14 inputs would take to get to the minimal value of the output.
So, essentially, I would like to pass the trained model object as my objective function in a solver, and also a bunch of constraints on the input variables to optimize/minimize the objective.
What is the best Python solver that can help me get there?
Thanks in advance!
So you already have your trained model, which we can think of as f(x) = y.
The standard SciPy method to minimize this is appropriately named scipy.optimize.minimize.
To use it, you just need to adapt your f(x) = y function to fit the API that SciPy uses. That is, the first function argument is the list of params to optimize over. Any further arguments are fixed for the entire optimization and are supplied through minimize's args tuple (e.g. your trained model).
def score_trained_model(params, model):
    # `model` arrives via minimize's `args` tuple, which SciPy unpacks
    # into extra positional arguments.
    # Run the model on the params, return the output to minimize.
    return model_predict(model, params)
With this, plus an initial guess, you can use the minimize function now:
import scipy.optimize

# Nelder-Mead is my go-to to start with.
# But it doesn't take advantage of the gradient.
# Something that does, e.g. BFGS, may perform better for your case.
method = 'Nelder-Mead'

# All zeros is fine, but improving this initial guess can help.
guess_params = [0] * 14

# Given a trained model, optimize the inputs to minimize the output.
result = scipy.optimize.minimize(
    score_trained_model,
    guess_params,
    args=(trained_model,),
    method=method,
)
optim_params = result.x  # the minimizing inputs
It is possible to supply constraints and bounds to some of the optimization methods. Nelder-Mead does not support constraints, but you can simply return a very large error when a constraint is violated, as sketched below.
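A minimal sketch of that penalty idea, assuming (purely for illustration) that every input must stay within [0, 1]:

def penalized_score(params, model):
    # Out-of-bounds inputs get a huge score, steering the optimizer back
    # into the feasible region.
    if any(p < 0 or p > 1 for p in params):
        return 1e12
    return score_trained_model(params, model)

# Use it exactly like score_trained_model:
# scipy.optimize.minimize(penalized_score, guess_params, args=(trained_model,), method='Nelder-Mead')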
Older answer, kept for reference (the OP wants to optimize the inputs, x, not the hyperparameters):
It sounds like you want to do hyperparameter optimization. My Python library of choice is hyperopt: https://github.com/hyperopt/hyperopt
Given that you already have some training and scoring code, for example:
def train_and_score(args):
    # Unpack args and train your model.
    model = make_model(**args)
    trained = train_model(model, **args)
    # Return the output you want to minimize.
    return score_model(trained)
You can easily use hyperopt to tune parameters like the learning rate, dropout, or choice of activations:
import numpy as np
from hyperopt import fmin, hp, tpe, space_eval, Trials

space = {
    'lr': hp.loguniform('lr', np.log(0.01), np.log(0.5)),
    'dropout': hp.uniform('dropout', 0, 1),
    'activation': hp.choice('activation', ['relu', 'sigmoid']),
}

# Minimize the training score over the space.
trials = Trials()
best = fmin(train_and_score, space, trials=trials, algo=tpe.suggest, max_evals=100)

# Print details about the best results and hyperparameters.
print(best)
print(space_eval(space, best))
There are also libraries that will help you directly integrate this with Keras. A popular choice is hyperas: https://github.com/maxpumperla/hyperas

What is the need to return a function object while creating a data set using tensorflow

I am new to machine learning and I am trying to create a machine learning model using the TensorFlow API, following the tutorial in the TensorFlow documentation here
But I am having trouble understanding this part of the code
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
    def input_function():  # inner function, this will be returned
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))  # create tf.data.Dataset object with data and its label
        if shuffle:
            ds = ds.shuffle(1000)  # randomize order of data
        ds = ds.batch(batch_size).repeat(num_epochs)  # split dataset into batches of 32 and repeat process for number of epochs
        return ds  # return a batch of the dataset
    return input_function  # return a function object for use
Then storing the output of the function in a variable
train_input_fn = make_input_fn(dftrain, y_train)
And at last training the model with the data set
linear_est.train(train_input_fn)
I fail to see what we are trying to do by returning the inner function object from make_input_fn instead of just returning our dataset and passing it in to train the model.
I am a beginner in Python and have just started to learn machine learning. I am unable to find a proper answer to my question, so if anyone can kindly explain it in a beginner-friendly way I would be much obliged.
I fail to see what we are trying to do by returning the inner function object from make_input_fn instead of just returning our dataset and passing it in to train the model.
In Python programming this is called currying: a multiple-argument function is transformed into nested functions of fewer arguments, with the outer call binding its arguments (here data_df, label_df, and the batching options) so that the returned inner function can use them when it is eventually called.
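A tiny, generic illustration of the same pattern, for explanation only (make_adder is an invented name, not from the tutorial):

def make_adder(n):
    def add(x):       # inner function, returned as an object
        return x + n  # n is remembered from the outer call
    return add

add_five = make_adder(5)  # nothing is computed yet
print(add_five(3))        # 8 - the work happens only when the returned function is called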
In TensorFlow, based on the documentation (https://www.tensorflow.org/api_docs/python/tf/estimator/LinearClassifier#train):
train(
input_fn, hooks=None, steps=None, max_steps=None, saving_listeners=None
)
The train method of the estimator expects an input_fn parameter. The reason is that every time you call Estimator.train() it builds a new graph by invoking both input_fn and model_fn and connecting them together. If you supply a tensor or a dataset directly instead of a function, this will lead to errors.
