Preprocessing a Keras dataset using the Keras tokenizer - python

I am trying to do some preprocessing using the Keras tokenizer on data that I read with the following code:
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.interleave(
    lambda x: tf.data.TFRecordDataset(x).prefetch(params.num_parallel_readers),
    cycle_length=params.num_parallel_readers,
    block_length=1)
dataset = dataset.map(_parse_example, num_parallel_calls=params.num_parallel_calls)
Now that I have the parsed example (the output of the _parse_example map function), I want to preprocess the text using the texts_to_sequences method of tf.keras.preprocessing.text.Tokenizer.
However, texts_to_sequences expects Python strings as input, whereas the parsed example contains Tensors.
I can work around this by wrapping my code in py_func (see 'emb': tf.py_func.. in the code below), but then I will not be able to serialize my model (according to the py_func documentation).
dataset = dataset.map(lambda features, labels: (
    {'window': features['window'],
     'winSize': features['winSize'],
     'LandingPage': features['LandingPage'],
     'emb': tf.py_func(getEmb, [features['window']], tf.int32)},
    tf.one_hot(labels, hparams.numClasses)))
I am looking for a way to do this without py_func (or a link to a similar example).

Related

Store original data (e.g., text, image) along with tensor data in PyTorch DataLoader

Currently, I am using TensorDataset followed by DataLoader to load my dataset like below:
tensor_loader = TensorDataset(x_input_ids,x_seg_ids,x_atten_masks,y)
data_loader = DataLoader(tensor_loader, shuffle=True, batch_size=batch_size)
I now want to also store original (text) data along with the tensor data in the data_loader like below:
tensor_loader = TensorDataset(x_input_ids,x_seg_ids,x_atten_masks,y, x_input_strs)
Note: x_input_strs is the text data corresponding to x_input_ids. This fails because TensorDataset accepts only tensors. I also tried something like this:
tensor_loader = Dataset(x_input_ids,x_seg_ids,x_atten_masks,y, x_input_strs)
But it gives the following error:
TypeError: object.__new__() takes exactly one argument (the type to instantiate)
Any suggestions are appreciated.
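One possible way around both errors is a small map-style dataset class that keeps the strings next to the tensors. The sketch below is only an illustration under stated assumptions: TextTensorDataset is a hypothetical name, plain Python lists stand in for the tensors so that the snippet runs without PyTorch, and in real code the class would normally subclass torch.utils.data.Dataset, which only requires __len__ and __getitem__. The TypeError above appears to arise because the base Dataset class defines no constructor that accepts data.

```python
class TextTensorDataset:
    """Map-style dataset returning tensor fields together with the raw text.

    Hypothetical helper; with PyTorch it would subclass torch.utils.data.Dataset.
    """

    def __init__(self, input_ids, seg_ids, atten_masks, labels, input_strs):
        fields = (input_ids, seg_ids, atten_masks, labels, input_strs)
        # All fields must describe the same number of examples.
        if len({len(f) for f in fields}) != 1:
            raise ValueError("all fields must have the same length")
        self.fields = fields

    def __len__(self):
        return len(self.fields[0])

    def __getitem__(self, idx):
        # One example: (input_ids, seg_ids, atten_mask, label, text)
        return tuple(f[idx] for f in self.fields)


# Toy stand-in data (lists instead of tensors).
x_input_ids = [[101, 7592, 102], [101, 2088, 102]]
x_seg_ids = [[0, 0, 0], [0, 0, 0]]
x_atten_masks = [[1, 1, 1], [1, 1, 1]]
y = [0, 1]
x_input_strs = ["hello", "world"]

dataset = TextTensorDataset(x_input_ids, x_seg_ids, x_atten_masks, y, x_input_strs)
ids, seg, mask, label, text = dataset[1]
```

With real tensors, passing such an object to DataLoader(dataset, shuffle=True, batch_size=batch_size) should batch the tensor fields as usual, while the default collate function is expected to hand back the strings of each batch as a plain sequence of strings.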

Understanding what Keras and TensorFlow to use in text classification

I was trying to classify my text with TensorFlow and Keras. Every time I tried to use Keras to read my files from a directory, it would throw an error saying that the feature for reading the text was not available, even though it is included in the documentation.
I wrote my own file reader (described in "how to read text files in keras using os.walk and converting to batched dataset") and am now trying to vectorize my text using Keras preprocessing, but again the module is not available.
As asked in the comments: I was trying to follow the Keras guide at https://keras.io/api/preprocessing/text/ and the example at https://keras.io/examples/nlp/text_classification_from_scratch/, which uses vectorization. I then started digging into using tf to make tokens from the text, and I found how to vectorize the text here: https://www.tensorflow.org/tutorials/text/word2vec
But now the problem is that I cannot use these functions because I do not understand them very well. My code is as follows:
train_dataset = get_files_from_dir(train_path, batch_size=batch_size, seed=seed)  # calls the fetch text which returns dataset of text and labels batched
text_ds = train_dataset.map(lambda x, y: x)  # get features only (text with no labels)
def vec_maker(text):
    tokens = text.lower().split()
    vocab, index = {}, 1  # start indexing from 1
    vocab['<pad>'] = 0  # add a padding token
    for token in tokens:
        if token not in vocab:
            vocab[token] = index
            index += 1
    self.vocab = vocab
    return text
Now my problem is how to map text_ds through this function to make the vectors. If I pass the dataset directly as the function argument, I get:
File "/home/kim/Desktop/programs/python/text_processing/prog/text_process.py", line 78, in vec_maker
tokens = text.lower().split()
AttributeError: 'MapDataset' object has no attribute 'lower'
Help and an explanation would be much appreciated.
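The traceback indicates that vec_maker received the whole dataset object instead of a single string. A minimal pure-Python sketch of the intended logic follows, assuming the texts can be iterated as ordinary Python strings (for a tf.data dataset this would mean decoding each element first, for example while looping over as_numpy_iterator()). build_vocab and vectorize are hypothetical helper names, not Keras or TensorFlow APIs.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct lower-cased token."""
    vocab = {'<pad>': 0}  # id 0 is reserved for padding
    for text in texts:  # iterate over individual strings, not the dataset object
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def vectorize(text, vocab, length):
    """Map one string to a fixed-length list of ids, padded or truncated."""
    ids = [vocab[token] for token in text.lower().split() if token in vocab]
    ids = ids[:length]
    return ids + [vocab['<pad>']] * (length - len(ids))


texts = ["The cat sat", "the dog sat down"]
vocab = build_vocab(texts)
vectors = [vectorize(t, vocab, length=5) for t in texts]
```

Unknown tokens are silently dropped in this sketch. The TextVectorization layer used in the linked Keras example covers vocabulary building (through its adapt method) and out-of-vocabulary handling inside TensorFlow, so it is probably the more practical route once the module import works.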

How to validate data during inference using tfx / tfdv / tensorflow serving?

I'm building a tfx pipeline and using tensorflow serving to serve my model. I save the signature with model.save(...).
So far I was able to use the transform layer to transform the feature before prediction with tf_transform_output.transform_features_layer() (see my code below).
However, I'm wondering how one can detect anomalies in the input data? For instance, I don't want to predict for an input value that is too far away from the distribution that a feature was trained with before.
The tfdv library offers functions like generate_statistics_from_[csv|dataframe|tfrecord] but I was not able to find any good example to generate statistics for serialized tf.Examples (or something that is not saved in a file, like csv, tfrecords etc.).
I'm aware of the following example in the documentation:
import tensorflow_data_validation as tfdv
import tfx_bsl
import pyarrow as pa
decoder = tfx_bsl.coders.example_coder.ExamplesToRecordBatchDecoder()
example = decoder.DecodeBatch([serialized_tfexample])
options = tfdv.StatsOptions(schema=schema)
anomalies = tfdv.validate_instance(example, options)
But in this example serialized_tfexample is a string, whereas in my code below the argument serialized_tf_examples is a Tensor of strings.
Sorry if this is an obvious question. I spent all day trying to find a solution without success. Maybe I'm getting this whole thing wrong, or maybe this is not the right place to put validations. So my more general question is: how do you validate incoming input data before prediction when you serve, in production, a model that was created through a tfx pipeline?
I'm thankful for any pointer in the right direction.
Here is my code to which I want to add validation:
...
tf_transform_output = tft.TFTransformOutput(...)
model.tft_layer = tf_transform_output.transform_features_layer()
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None], dtype=tf.string, name='examples')
])
def serve_tf_examples_fn(serialized_tf_examples):
    #### How can I generate stats and validate serialized_tf_examples? ###
    #### Is this the right place? ###
    feature_spec = tf_transform_output.raw_feature_spec()
    feature_spec.pop(TARGET_LABEL)
    parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)
    transformed_features = model.tft_layer(parsed_features)
    return model(transformed_features)
...
model.save(serving_model_dir,
           save_format='tf',
           signatures={
               'serving_default': serve_tf_examples_fn
           })

How to load Fashion MNIST dataset in TensorFlow Federated?

I am working on a project with Tensorflow federated. I have managed to use the libraries provided by TensorFlow Federated Learning simulations in order to load, train, and test some datasets.
For example, I load the EMNIST dataset:
emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data()
and the datasets returned by load_data() are instances of tff.simulation.ClientData. This interface allows me to iterate over client ids and to select subsets of the data for simulations.
len(emnist_train.client_ids)
3383
emnist_train.element_type_structure
OrderedDict([('pixels', TensorSpec(shape=(28, 28), dtype=tf.float32, name=None)), ('label', TensorSpec(shape=(), dtype=tf.int32, name=None))])
example_dataset = emnist_train.create_tf_dataset_for_client(
    emnist_train.client_ids[0])
I am trying to load the fashion_mnist dataset with Keras to perform some federated operations:
fashion_train,fashion_test=tf.keras.datasets.fashion_mnist.load_data()
but I get this error
AttributeError: 'tuple' object has no attribute 'element_spec'
because Keras returns a Tuple of Numpy arrays instead of a tff.simulation.ClientData like before:
def tff_model_fn() -> tff.learning.Model:
    return tff.learning.from_keras_model(
        keras_model=factory.retrieve_model(True),
        input_spec=fashion_test.element_spec,
        loss=loss_builder(),
        metrics=metrics_builder())
iterative_process = tff.learning.build_federated_averaging_process(
    tff_model_fn, Parameters.server_adam_optimizer_fn, Parameters.client_adam_optimizer_fn)
server_state = iterative_process.initialize()
To sum up: is there any way to create tff.simulation.ClientData objects from the tuple of NumPy arrays that Keras returns?
Another solution that comes to mind is to use tff.simulation.HDF5ClientData and manually load the appropriate files in HDF5 format (train.h5, test.h5) to get the tff.simulation.ClientData. My problem is that I can't find the URL for Fashion MNIST in HDF5 format, i.e. something like the following for both train and test:
fileprefix = 'fed_emnist_digitsonly'
sha256 = '55333deb8546765427c385710ca5e7301e16f4ed8b60c1dc5ae224b42bd5b14b'
filename = fileprefix + '.tar.bz2'
path = tf.keras.utils.get_file(
    filename,
    origin='https://storage.googleapis.com/tff-datasets-public/' + filename,
    file_hash=sha256,
    hash_algorithm='sha256',
    extract=True,
    archive_format='tar',
    cache_dir=cache_dir)
dir_path = os.path.dirname(path)
train_client_data = hdf5_client_data.HDF5ClientData(
    os.path.join(dir_path, fileprefix + '_train.h5'))
test_client_data = hdf5_client_data.HDF5ClientData(
    os.path.join(dir_path, fileprefix + '_test.h5'))
return train_client_data, test_client_data
My final goal is to make the fashion_mnist dataset work with the TensorFlow federated learning.
You're on the right track. To recap: the datasets returned by the tff.simulation.datasets APIs are tff.simulation.ClientData objects. The object returned by tf.keras.datasets.fashion_mnist.load_data is a tuple of NumPy arrays.
So what is needed is to implement a tff.simulation.ClientData to wrap the dataset returned by tf.keras.datasets.fashion_mnist.load_data. Some previous questions about implementing ClientData objects:
Federated learning : convert my own image dataset into tff simulation Clientdata
How define tff.simulation.ClientData.from_clients_and_fn Function?
Is there a reasonable way to create tff clients datat sets?
This does require answering an important question: how should the Fashion MNIST data be split into individual users? The dataset doesn't include features that could be used for partitioning. Researchers have come up with a few ways to synthetically partition the data, e.g. randomly sampling some labels for each participant, but the choice will have a great effect on model training, so it is worth investing some thought here.
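As an illustration of the label-based synthetic partitioning mentioned above, the following sketch assigns example indices to clients in pure Python. It is a toy under stated assumptions, not a TFF API: partition_by_label, num_clients, and labels_per_client are hypothetical names, and each client simply receives a share of the examples of the labels sampled for it.

```python
import random
from collections import defaultdict


def partition_by_label(labels, num_clients, labels_per_client, seed=0):
    """Return {client_id: [example indices]} with few labels per client."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    # Each client samples the subset of labels it is allowed to hold.
    client_labels = {
        c: rng.sample(classes, labels_per_client) for c in range(num_clients)
    }
    # Group example indices by label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    partition = {c: [] for c in range(num_clients)}
    for label, indices in by_label.items():
        rng.shuffle(indices)
        holders = [c for c in range(num_clients) if label in client_labels[c]]
        # Deal this label's examples round-robin to the clients holding it.
        # Labels that no client sampled are left out of the partition.
        for pos, idx in enumerate(indices):
            if holders:
                partition[holders[pos % len(holders)]].append(idx)
    return partition


labels = [i % 10 for i in range(1000)]  # toy stand-in for Fashion MNIST labels
partition = partition_by_label(labels, num_clients=20, labels_per_client=2)
```

The per-client index lists could then be used to slice the NumPy arrays returned by load_data() when building a ClientData object, for instance along the lines of the from_clients_and_fn question referenced above. Changing labels_per_client is one simple way to vary how non-IID the simulated clients are.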

Batch transform sparse matrix with AWS SageMaker Python SDK

I have successfully trained a Scikit-Learn LSVC model with AWS SageMaker.
I want to make batch predictions (a.k.a. batch transform) on a relatively big dataset, which is a scipy sparse matrix with shape 252772 x 185128. (The number of features is high because of one-hot encoding of bag-of-words and n-gram features.)
I struggle because of:
the size of the data
the format of the data
I did several experiments to check what was going on:
1. predict locally on sample sparse matrix data
It works
Deserialize the model artifact locally on a SageMaker notebook and predict on a sample of the sparse matrix.
This was just to check that the model can predict on this kind of data.
2. Batch Transform on a sample csv data
It works
Launch a Batch Transform Job on SageMaker and request to transform a small sample in dense csv format : it works but does not scale, obviously.
The code is:
sklearn_model = SKLearnModel(
    model_data=model_artifact_location_on_s3,
    entry_point='my_script.py',
    role=role,
    sagemaker_session=sagemaker_session)
transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    max_payload=100)
transformer.transform(
    data=batch_data,
    content_type='text/csv',
    split_type=None)
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
where:
'my_script.py' implements a simple model_fn to deserialize the model artifact:
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf
batch_data is the s3 path for the csv file.
3. Batch Transform of a sample dense numpy dataset.
It works
I prepared a sample of the data and saved it to s3 in Numpy .npy format. According to this documentation, SageMaker Scikit-learn model server can deserialize NPY-formatted data (along with JSON and CSV data).
The only difference with the previous experiment (2) is the argument content_type='application/x-npy' in transformer.transform(...).
This solution does not scale and we would like to pass a Scipy sparse matrix:
4. Batch Transform of a big sparse matrix.
Here is the problem
SageMaker Python SDK does not support sparse matrix format out of the box.
Following this:
https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/
Errors running Sagemaker Batch Transformation with LDA model
I used write_spmatrix_to_sparse_tensor to write the data to protobuf format on s3. The function I used is:
def write_protobuf(X_sparse, bucket, prefix, obj):
    """Write sparse matrix to protobuf format at location bucket/prefix/obj."""
    buf = io.BytesIO()
    write_spmatrix_to_sparse_tensor(file=buf, array=X_sparse, labels=None)
    buf.seek(0)
    key = '{}/{}'.format(prefix, obj)
    boto3.resource('s3').Bucket(bucket).Object(key).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket, key)
Then the code used for launching the batch transform job is:
sklearn_model = SKLearnModel(
    model_data=model_artifact_location_on_s3,
    entry_point='my_script.py',
    role=role,
    sagemaker_session=sagemaker_session)
transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    max_payload=100)
transformer.transform(
    data=batch_data,
    content_type='application/x-recordio-protobuf',
    split_type='RecordIO')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
I get the following error:
sagemaker_containers._errors.ClientError: Content type application/x-recordio-protobuf is not supported by this framework.
Questions:
(Reference doc for Transformer: https://sagemaker.readthedocs.io/en/stable/transformer.html)
If content_type='application/x-recordio-protobuf' is not allowed, what should I use?
Is split_type='RecordIO' the proper setting in this context?
Should I provide an input_fn function in my script to deserialize the data?
Is there another better approach to tackle this problem?
