Batch transform sparse matrix with AWS SageMaker Python SDK - python

I have successfully trained a Scikit-Learn LSVC model with AWS SageMaker.
I want to run batch predictions (a.k.a. batch transform) on a relatively big dataset: a scipy sparse matrix of shape 252772 x 185128. (The number of features is high because of one-hot-encoded bag-of-words and n-gram features.)
I struggle because of:
the size of the data
the format of the data
I did several experiments to check what was going on:
1. predict locally on sample sparse matrix data
It works
Deserialize the model artifact locally on a SageMaker notebook and predict on a sample of the sparse matrix.
This was just to check that the model can predict on this kind of data.
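For reference, a minimal sketch of that local check, with illustrative file names (the artifact and sample paths are placeholders, not the actual project files):

# Local sanity check (illustrative paths): load the artifact extracted from
# the model archive and predict on a small slice of the sparse matrix.
import joblib
from scipy import sparse

clf = joblib.load('model.joblib')            # extracted from the model artifact
X_sample = sparse.load_npz('X_sample.npz')   # a few rows of the 252772 x 185128 matrix
print(clf.predict(X_sample[:100]))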
2. Batch Transform on a sample csv data
It works
Launch a Batch Transform job on SageMaker and request a transform of a small sample in dense CSV format: it works but obviously does not scale.
The code is:
sklearn_model = SKLearnModel(
    model_data=model_artifact_location_on_s3,
    entry_point='my_script.py',
    role=role,
    sagemaker_session=sagemaker_session)

transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    max_payload=100)

transformer.transform(
    data=batch_data,
    content_type='text/csv',
    split_type=None)
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
where:
'my_script.py' implements a simple model_fn to deserialize the model artifact:
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf
batch_data is the S3 path of the CSV file.
3. Batch Transform of a sample dense numpy dataset.
It works
I prepared a sample of the data and saved it to S3 in NumPy .npy format. According to this documentation, the SageMaker Scikit-learn model server can deserialize NPY-formatted data (along with JSON and CSV data).
The only difference with the previous experiment (2) is the argument content_type='application/x-npy' in transformer.transform(...).
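For completeness, a rough sketch of how such a sample could be written to S3 and transformed (the bucket, key, and X_sample names are placeholders, not the actual ones used):

# Illustrative only: save a small dense sample as .npy, upload it to S3, and
# request the application/x-npy content type.
import io
import boto3
import numpy as np

buf = io.BytesIO()
np.save(buf, X_sample)  # X_sample: small dense NumPy array
buf.seek(0)
boto3.resource('s3').Bucket('my-bucket').Object('batch-input/sample.npy').upload_fileobj(buf)

transformer.transform(
    data='s3://my-bucket/batch-input/sample.npy',
    content_type='application/x-npy',
    split_type=None)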
This solution does not scale either, and we would like to pass a SciPy sparse matrix instead:
4. Batch Transform of a big sparse matrix.
Here is the problem
The SageMaker Python SDK does not support a sparse matrix input format out of the box.
Following these resources:
https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/
Errors running Sagemaker Batch Transformation with LDA model
I used write_spmatrix_to_sparse_tensor to write the data in protobuf format to S3. The function I used is:
def write_protobuf(X_sparse, bucket, prefix, obj):
    """Write sparse matrix to protobuf format at location bucket/prefix/obj."""
    buf = io.BytesIO()
    write_spmatrix_to_sparse_tensor(file=buf, array=X_sparse, labels=None)
    buf.seek(0)
    key = '{}/{}'.format(prefix, obj)
    boto3.resource('s3').Bucket(bucket).Object(key).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket, key)
Then the code used for launching the batch transform job is:
sklearn_model = SKLearnModel(
    model_data=model_artifact_location_on_s3,
    entry_point='my_script.py',
    role=role,
    sagemaker_session=sagemaker_session)

transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    max_payload=100)

transformer.transform(
    data=batch_data,
    content_type='application/x-recordio-protobuf',
    split_type='RecordIO')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
I get the following error:
sagemaker_containers._errors.ClientError: Content type application/x-recordio-protobuf is not supported by this framework.
Questions:
(Reference doc for Transformer: https://sagemaker.readthedocs.io/en/stable/transformer.html)
If content_type='application/x-recordio-protobuf' is not allowed, what should I use?
Is split_type='RecordIO' the proper setting in this context?
Should I provide an input_fn function in my script to deserialize the data?
Is there another better approach to tackle this problem?
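For what it's worth, here is a rough, untested sketch of what such an input_fn in my_script.py might look like, assuming the payload can be parsed with read_records from sagemaker.amazon.common and fits in memory; the Record protobuf field access below is an assumption, not a verified recipe:

# Hypothetical input_fn sketch: turn a recordio-protobuf payload (written with
# write_spmatrix_to_sparse_tensor) back into a scipy CSR matrix.
import io
import scipy.sparse as sp
from sagemaker.amazon.common import read_records

def input_fn(input_data, content_type):
    if content_type == 'application/x-recordio-protobuf':
        records = read_records(io.BytesIO(input_data))  # input_data assumed to be raw bytes
        # assumption: each record holds one sparse row under the 'values' feature,
        # with keys = column indices and shape = [n_cols]
        n_cols = int(records[0].features['values'].float32_tensor.shape[0])
        rows, cols, vals = [], [], []
        for i, record in enumerate(records):
            tensor = record.features['values'].float32_tensor
            cols.extend(tensor.keys)
            vals.extend(tensor.values)
            rows.extend([i] * len(tensor.keys))
        return sp.csr_matrix((vals, (rows, cols)), shape=(len(records), n_cols))
    raise ValueError('Unsupported content type: {}'.format(content_type))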

Related

What to do when the data is too big to be stored in memory

I want to train a neural network. I work with Python (3.6.9) and TensorFlow (2.4.0), and my problem is that my dataset is too big to fit in memory.
A bit of context:
My network takes as input a small complex matrix of dimension 64 by 32.
My dataset is stored in the form of a very large ".mat" file generated by MATLAB code.
In the .mat file, the samples are stored in a large cell array.
I use the h5py library to open the .mat file.
Example Python code to load a single sample from the file:
f = h5py.File('dataset.mat', 'r')
refs = f['data'] # array of reference of each sample
sample = f[refs[0]][()].view(np.complex) # load the first sample
Currently, I load only a small part of the dataset and store it in a TensorFlow dataset (ds = tf.data.Dataset.from_tensor_slices(datas)).
I would like to take advantage of h5py's ability to load each sample individually, so that examples are loaded on the fly during network training.
I tried the following approach:
f = h5py.File('dataset.mat', 'r')
refs = f['data'] # array of reference of each sample
ds_index = tf.data.Dataset.range(len(refs))
ds = ds_index.map(lambda i : f[refs[i]][()].view(np.complex))
but I get the following error:
NotImplementedError: in user code:
<ipython-input-66-6cf802c8359a>:15 __call__ *
return self._f[self._rs[i]]['channel'][()].view(np.complex).astype(np.complex64).T
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py:855 __array__
" a NumPy call, which is not supported".format(self.name))
NotImplementedError: Cannot convert a symbolic Tensor (args_0:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported
Do you know how to fix this error, or is there a better way to load the examples on the fly?
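Not an authoritative fix, but one direction worth trying is to stream the samples with tf.data.Dataset.from_generator, so each example is read from the .mat file lazily in ordinary Python code instead of inside a graph-traced map; the shape and dtype below are assumptions based on the 64 by 32 complex matrices described above:

# Hedged sketch: read samples lazily from the HDF5-based .mat file with a
# Python generator and let tf.data wrap it.
import h5py
import numpy as np
import tensorflow as tf

def sample_generator():
    with h5py.File('dataset.mat', 'r') as f:
        refs = f['data']  # array of references, one per sample
        for i in range(len(refs)):
            yield f[refs[i]][()].view(np.complex).astype(np.complex64)

ds = tf.data.Dataset.from_generator(
    sample_generator,
    output_types=tf.complex64,
    output_shapes=(64, 32))  # assumed sample shape
ds = ds.batch(32).prefetch(tf.data.experimental.AUTOTUNE)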

How to validate data during inference using tfx / tfdv / tensorflow serving?

I'm building a tfx pipeline and using tensorflow serving to serve my model. I save the signature with model.save(...).
So far I have been able to use the transform layer to transform the features before prediction with tf_transform_output.transform_features_layer() (see my code below).
However, I'm wondering how one can detect anomalies in the input data? For instance, I don't want to predict for an input value that is too far away from the distribution that a feature was trained with before.
The tfdv library offers functions like generate_statistics_from_[csv|dataframe|tfrecord], but I was not able to find any good example of generating statistics for serialized tf.Examples (or for anything that is not stored in a file, like CSV, TFRecords, etc.).
I'm aware of the following example in the documentation:
import tensorflow_data_validation as tfdv
import tfx_bsl
import pyarrow as pa
decoder = tfx_bsl.coders.example_coder.ExamplesToRecordBatchDecoder()
example = decoder.DecodeBatch([serialized_tfexample])
options = tfdv.StatsOptions(schema=schema)
anomalies = tfdv.validate_instance(example, options)
But in this example serialized_tfexample is a string, whereas in my code below the argument serialized_tf_examples is a Tensor of strings.
Sorry if this is an obvious question. I spent all day trying to find a solution without success. Maybe I'm getting this whole thing wrong, or maybe this is not the right place to put validation. So my more general question is actually: how do you validate incoming input data before prediction when serving, in production, a model created through a tfx pipeline?
I'm thankful for any lead into the right direction.
Here is my code to which I want to add validation:
...
tf_transform_output = tft.TFTransformOutput(...)
model.tft_layer = tf_transform_output.transform_features_layer()
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None], dtype=tf.string, name='examples')
])
def serve_tf_examples_fn(serialized_tf_examples):
    #### How can I generate stats and validate serialized_tf_examples? ###
    #### Is this the right place? ###
    feature_spec = tf_transform_output.raw_feature_spec()
    feature_spec.pop(TARGET_LABEL)
    parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)
    transformed_features = model.tft_layer(parsed_features)
    return model(transformed_features)

...

model.save(serving_model_dir,
           save_format='tf',
           signatures={
               'serving_default': serve_tf_examples_fn
           })
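Not an authoritative answer, but a rough sketch of how the documented decoder snippet might be applied to a batch of serialized examples when running eagerly (e.g. in a client or a preprocessing step, not inside the traced serving function); the helper name is made up and the schema is assumed to be available:

# Hedged sketch: decode a batch of serialized tf.Examples and validate it with
# tfdv, assuming eager mode so .numpy() can be called on the tensor.
import tensorflow_data_validation as tfdv
import tfx_bsl

def validate_serialized_batch(serialized_tf_examples, schema):
    # serialized_tf_examples: an eager tf.Tensor of dtype tf.string, shape [None]
    examples = list(serialized_tf_examples.numpy())
    decoder = tfx_bsl.coders.example_coder.ExamplesToRecordBatchDecoder()
    record_batch = decoder.DecodeBatch(examples)
    options = tfdv.StatsOptions(schema=schema)
    return tfdv.validate_instance(record_batch, options)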

How to load the Fashion MNIST dataset in TensorFlow Federated?

I am working on a project with TensorFlow Federated. I have managed to use the libraries provided by the TensorFlow Federated Learning simulations to load, train, and test some datasets.
For example, I load the EMNIST dataset:
emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data()
and the datasets returned by load_data() are instances of tff.simulation.ClientData. This is an interface that allows me to iterate over client ids and to select subsets of the data for simulations.
len(emnist_train.client_ids)
3383
emnist_train.element_type_structure
OrderedDict([('pixels', TensorSpec(shape=(28, 28), dtype=tf.float32, name=None)), ('label', TensorSpec(shape=(), dtype=tf.int32, name=None))])
example_dataset = emnist_train.create_tf_dataset_for_client(
    emnist_train.client_ids[0])
I am trying to load the fashion_mnist dataset with Keras to perform some federated operations:
fashion_train,fashion_test=tf.keras.datasets.fashion_mnist.load_data()
but I get this error
AttributeError: 'tuple' object has no attribute 'element_spec'
because Keras returns a tuple of NumPy arrays instead of a tff.simulation.ClientData like before:
def tff_model_fn() -> tff.learning.Model:
    return tff.learning.from_keras_model(
        keras_model=factory.retrieve_model(True),
        input_spec=fashion_test.element_spec,
        loss=loss_builder(),
        metrics=metrics_builder())

iterative_process = tff.learning.build_federated_averaging_process(
    tff_model_fn, Parameters.server_adam_optimizer_fn, Parameters.client_adam_optimizer_fn)
server_state = iterative_process.initialize()
To sum up:
Is there any way to create a tff.simulation.ClientData from the Keras tuple of NumPy arrays?
Another solution that comes to mind is to use tff.simulation.HDF5ClientData and manually load the appropriate files in HDF5 format (train.h5, test.h5) in order to get the tff.simulation.ClientData, but my problem is that I can't find the URL for a fashion_mnist HDF5 file, i.e. something like the following for both train and test:
fileprefix = 'fed_emnist_digitsonly'
sha256 = '55333deb8546765427c385710ca5e7301e16f4ed8b60c1dc5ae224b42bd5b14b'
filename = fileprefix + '.tar.bz2'
path = tf.keras.utils.get_file(
    filename,
    origin='https://storage.googleapis.com/tff-datasets-public/' + filename,
    file_hash=sha256,
    hash_algorithm='sha256',
    extract=True,
    archive_format='tar',
    cache_dir=cache_dir)

dir_path = os.path.dirname(path)
train_client_data = hdf5_client_data.HDF5ClientData(
    os.path.join(dir_path, fileprefix + '_train.h5'))
test_client_data = hdf5_client_data.HDF5ClientData(
    os.path.join(dir_path, fileprefix + '_test.h5'))

return train_client_data, test_client_data
My final goal is to make the fashion_mnist dataset work with TensorFlow Federated learning.
You're on the right track. To recap: the datasets returned by tff.simulation.dataset APIs are tff.simulation.ClientData objects. The object returned by tf.keras.datasets.fashion_mnist.load_data is a tuple of numpy arrays.
So what is needed is to implement a tff.simulation.ClientData to wrap the dataset returned by tf.keras.datasets.fashion_mnist.load_data. Some previous questions about implementing ClientData objects:
Federated learning : convert my own image dataset into tff simulation Clientdata
How define tff.simulation.ClientData.from_clients_and_fn Function?
Is there a reasonable way to create tff clients datat sets?
This does require answering an important question: how should the Fashion MNIST data be split into individual users? The dataset doesn't include features that could be used for partitioning. Researchers have come up with a few ways to synthetically partition the data, e.g. randomly sampling some labels for each participant, but this can have a great effect on model training, so it is worth investing some thought here.
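As a rough illustration only (the exact ClientData constructor depends on the installed TFF version, and the round-robin split below is just one arbitrary partitioning choice), wrapping the Keras arrays could look something like this:

# Hedged sketch: synthetically partition the Keras Fashion MNIST arrays into
# per-client tf.data.Datasets. Assumes tff.simulation.ClientData.from_clients_and_fn
# is available in the installed TFF version.
import collections
import numpy as np
import tensorflow as tf
import tensorflow_federated as tff

(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()

NUM_CLIENTS = 10
client_ids = [str(i) for i in range(NUM_CLIENTS)]
# naive i.i.d. split: shard the training set round-robin across clients
shards = {cid: np.arange(i, len(x_train), NUM_CLIENTS)
          for i, cid in enumerate(client_ids)}

def create_dataset_for_client(client_id):
    idx = shards[client_id]
    return tf.data.Dataset.from_tensor_slices(
        collections.OrderedDict(
            pixels=x_train[idx].astype(np.float32) / 255.0,
            label=y_train[idx].astype(np.int32)))

fashion_train = tff.simulation.ClientData.from_clients_and_fn(
    client_ids, create_dataset_for_client)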

Preprocessing a Keras dataset using the Keras tokenizer

I am trying to do some preprocessing with the Keras tokenizer on data that I read using the following code:
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.interleave(lambda x:
    tf.data.TFRecordDataset(x).prefetch(params.num_parallel_readers),
    cycle_length=params.num_parallel_readers,
    block_length=1)
dataset = dataset.map(_parse_example, num_parallel_calls=params.num_parallel_calls)
Now that I have the parsed example (the output of the _parse_example map function), I want to do some preprocessing on the text using the tf.keras.preprocessing.text.Tokenizer method texts_to_sequences.
However, texts_to_sequences expects Python strings as input, whereas I get Tensors in the parsed example.
I can work around this by using py_func to wrap my code (see 'emb': tf.py_func.. in the code below), but then I will not be able to serialize my model (according to the py_func documentation).
dataset = dataset.map(lambda features, labels:
    ({'window': features['window'],
      'winSize': features['winSize'],
      'LandingPage': features['LandingPage'],
      'emb': tf.py_func(getEmb, [features['window']], tf.int32)},
     tf.one_hot(labels, hparams.numClasses)))
I am looking for a way to do that (or a link to a similar example).
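One graph-friendly direction (a sketch under assumptions, not a drop-in replacement for texts_to_sequences) is to build a lookup table from the fitted tokenizer's word_index, so the word-to-id mapping runs as regular TensorFlow ops and the model stays serializable; `tokenizer` is assumed to be an already-fitted Tokenizer, and tf.lookup.StaticHashTable requires a reasonably recent TF version:

# Hedged sketch: replace the py_func text-to-ids step with a static lookup table.
import tensorflow as tf

keys = list(tokenizer.word_index.keys())    # tokenizer: fitted Keras Tokenizer (assumed)
values = [tokenizer.word_index[k] for k in keys]

table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys, values, key_dtype=tf.string, value_dtype=tf.int64),
    default_value=0)                        # 0 for out-of-vocabulary words

def text_to_ids(words):
    # words: a dense tf.string tensor of already-split tokens
    return table.lookup(words)

Note that this assumes the text is already split into words; reproducing the tokenizer's own filters and splitting with graph ops is a separate concern.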

Is TensorFlow dataset preprocessing done once for the whole dataset or for each call to iterator.get_next()?

Hi, I am studying the Dataset API in TensorFlow and I have a question regarding the dataset.map() function, which performs data preprocessing.
file_name = ["image1.jpg", "image2.jpg", ......]
im_dataset = tf.data.Dataset.from_tensor_slices(file_names)
im_dataset = im_dataset.map(lambda image:tuple(tf.py_func(image_parser(), [image], [tf.float32, tf.float32, tf.float32])))
im_dataset = im_dataset.batch(batch_size)
iterator = im_dataset.make_initializable_iterator()
The dataset takes in image names and parses them into 3 tensors (3 pieces of information about the image).
If I have a very large number of images in my training folder, preprocessing them will take a long time.
My question is: since the Dataset API is said to be designed for efficient input pipelines, is the preprocessing done for the whole dataset before I feed it to my workers (let's say GPUs), or does it only preprocess one batch of images each time I call iterator.get_next()?
The map transformation is applied lazily: elements are preprocessed only as they are pulled through the pipeline, i.e. one batch at a time as you call iterator.get_next(), not all upfront. That said, if your preprocessing pipeline is very long and its output is small, the processed data may fit in memory. If this is the case, you can use tf.data.Dataset.cache to cache the processed data in memory or in a file.
From the official performance guide:
The tf.data.Dataset.cache transformation can cache a dataset, either in memory or on local storage. If the user-defined function passed into the map transformation is expensive, apply the cache transformation after the map transformation as long as the resulting dataset can still fit into memory or local storage. If the user-defined function increases the space required to store the dataset beyond the cache capacity, consider pre-processing your data before your training job to reduce resource usage.
Example use of cache in memory
Here is an example where each preprocessing step takes a long time (0.5s). The second epoch over the dataset will be much faster than the first.
def my_fn(x):
    time.sleep(0.5)
    return x

def parse_fn(x):
    return tf.py_func(my_fn, [x], tf.int64)

dataset = tf.data.Dataset.range(5)
dataset = dataset.map(parse_fn)
dataset = dataset.cache()   # cache the processed dataset, so every input will be processed once
dataset = dataset.repeat(2) # repeat for multiple epochs

res = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    for i in range(10):
        # First 5 iterations will take 0.5s each, last 5 will not
        print(sess.run(res))
Caching to a file
If you want to write the cached data to a file, you can provide an argument to cache():
dataset = dataset.cache('/tmp/cache') # will write cached data to a file
This will allow you to only process the dataset once, and run multiple experiments on the data without reprocessing it again.
Warning: You have to be careful when caching to a file. If you change your data, but keep the /tmp/cache.* files, it will still read the old data that was cached. For instance, if we use the data from above and change the range of the data to be in [10, 15], we will still obtain data in [0, 5]:
dataset = tf.data.Dataset.range(10, 15)
dataset = dataset.map(parse_fn)
dataset = dataset.cache('/tmp/cache')
dataset = dataset.repeat(2) # repeat for multiple epochs
res = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    for i in range(10):
        print(sess.run(res))  # will still be in [0, 5]...
Always delete the cached files whenever the data that you want to cache changes.
Another issue that may arise is if you interrupt the script before all the data is cached. You will receive an error like this:
AlreadyExistsError (see above for traceback): There appears to be a concurrent caching iterator running - cache lockfile already exists ('/tmp/cache.lockfile'). If you are sure no other running TF computations are using this cache prefix, delete the lockfile and re-initialize the iterator.
Make sure that you let the whole dataset be processed to have an entire cache file.
