understanding what keras and TensorFlow to use in text classification - python

I was trying to classify my text in tensorflow and keras and every time I tried using the keras to read my files from my directory then it would through and error that the features for reading the text was not available yet it is included in the documentation
I made my own file reader functionality which is here how to read text files in keras using os.walk and converting to batched dataset and now trying to vectorize my text using keras preprocessing then again the module is not available
as asked in the comment I was trying to use keras https://keras.io/api/preprocessing/text/ guide and the one https://keras.io/examples/nlp/text_classification_from_scratch/ here which uses the vectorizing, I started digging into uisng tf to make tokens from the text and I found how to vectorize the text here https://www.tensorflow.org/tutorials/text/word2vec
but now the problem was that I could not use the functions because I could not understand it very well my code was as follows
train_dataset = get_files_from_dir(train_path,batch_size=batch_size, seed=seed) # calls the fetch text which returns dataset of text and labels batched
text_ds = train_dataset.map(lambda x, y: x) # get featues only (text with no labels)
def vec_maker(text):
tokens = text.lower().split()
vocab, index = {}, 1 # start indexing from 1
vocab['<pad>'] = 0 # add a padding token
for token in tokens:
if token not in vocab:
vocab[token] = index
index += 1
self.vocab =vocab
return text
now my problem is how do I map the text_ds to the function to make the vectors because if I try passing the variable as function arguments it direct is says that
File"/home/kim/Desktop/programs/python/text_processing/prog/text_process.py", line 78, in vec_maker
tokens = text.lower().split()
AttributeError: 'MapDataset' object has no attribute 'lower'
help and explanation will much be appreciated

Related

How can I add the decode_batch_predictions() method into the Keras Handwriting Recognition OCR model?

I need to add the decode_batch_predictions() method to the output of the Keras Handwriting Recognition OCR model. The reason for that is that I want to convert the model to TF Lite and I want the output to be decoded since I didn't find any way to decode the output on TF Lite in Android. I already saw a similar post for a similar Keras model but it wouldn't work for this model.
I have not much knowledge in Python so it's difficult for me to adapt the answers on that post for this model so I would really appreciate any help, thanks!
I tried using the code from that post but it wouldn't work
In the notebook for model given in your link, make the following changes after prediction_model:
prediction_model = keras.models.Model(
model.get_layer(name="image").input, model.get_layer(name="dense2").output
) # This line is present in the handwriting_recognition notebook.
def CTCDecoder():
def decode_batch_predictions(pred):
input_len = np.ones(pred.shape[0]) * pred.shape[1]
# Use greedy search. For complex tasks, you can use beam search
results = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0][:, :max_length]
# Iterate over the results and get back the text
output_text = []
for res in results:
#print(res)
res = tf.strings.reduce_join(num_to_char(res)).numpy().decode("utf-8")
output_text.append(res)
return output_text
return tf.keras.layers.Lambda(decode_batch_predictions, name='decode')
decoded_pred_model = keras.models.Model(prediction_model.input, outputs=CTCDecoder()(prediction_model.output))
Convert the decoded_pred_model to a .tflite and use it in android.

Using dataset.map function passes Tensor without values

Hello i'm using tensorflow 2.10 and struggling with tensorflow data api. I wrote the following simple piece of code:
def loaddata(filepath, label):
data = np.load(filepath.numpy().decode())
return data, label
filenames = []
labels = []
/*appending some data into filenames and labels*/
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(loaddata)
However, i receive "AttributeError: 'Tensor' object has no attribute 'numpy'" error. I have search similar error but could not find effective way to pass tuples via dataset.map() function.
Any help will be appreciated.

How to load Fashion MNIST dataset in Tensorflow Fedarated?

I am working on a project with Tensorflow federated. I have managed to use the libraries provided by TensorFlow Federated Learning simulations in order to load, train, and test some datasets.
For example, i load the emnist dataset
emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data()
and it got the data sets returned by load_data() as instances of tff.simulation.ClientData. This is an interface that allows me to iterate over client ids and allow me to select subsets of the data for simulations.
len(emnist_train.client_ids)
3383
emnist_train.element_type_structure
OrderedDict([('pixels', TensorSpec(shape=(28, 28), dtype=tf.float32, name=None)), ('label', TensorSpec(shape=(), dtype=tf.int32, name=None))])
example_dataset = emnist_train.create_tf_dataset_for_client(
emnist_train.client_ids[0])
I am trying to load the fashion_mnist dataset with Keras to perform some federated operations:
fashion_train,fashion_test=tf.keras.datasets.fashion_mnist.load_data()
but I get this error
AttributeError: 'tuple' object has no attribute 'element_spec'
because Keras returns a Tuple of Numpy arrays instead of a tff.simulation.ClientData like before:
def tff_model_fn() -> tff.learning.Model:
return tff.learning.from_keras_model(
keras_model=factory.retrieve_model(True),
input_spec=fashion_test.element_spec,
loss=loss_builder(),
metrics=metrics_builder())
iterative_process = tff.learning.build_federated_averaging_process(
tff_model_fn, Parameters.server_adam_optimizer_fn, Parameters.client_adam_optimizer_fn)
server_state = iterative_process.initialize()
To sum up,
Is any way to create tuple elements of tff.simulation.ClientData from Keras Tuple Numpy arrays?
Another solution that comes to my mind is to use the
tff.simulation.HDF5ClientData and load
manually the appropriate files in aHDF5format (train.h5, test.h5) in order to get the tff.simulation.ClientData, but my problem is that i cant find the url for fashion_mnist HDF5 file format i mean something like that for both train and test:
fileprefix = 'fed_emnist_digitsonly'
sha256 = '55333deb8546765427c385710ca5e7301e16f4ed8b60c1dc5ae224b42bd5b14b'
filename = fileprefix + '.tar.bz2'
path = tf.keras.utils.get_file(
filename,
origin='https://storage.googleapis.com/tff-datasets-public/' + filename,
file_hash=sha256,
hash_algorithm='sha256',
extract=True,
archive_format='tar',
cache_dir=cache_dir)
dir_path = os.path.dirname(path)
train_client_data = hdf5_client_data.HDF5ClientData(
os.path.join(dir_path, fileprefix + '_train.h5'))
test_client_data = hdf5_client_data.HDF5ClientData(
os.path.join(dir_path, fileprefix + '_test.h5'))
return train_client_data, test_client_data
My final goal is to make the fashion_mnist dataset work with the TensorFlow federated learning.
You're on the right track. To recap: the datasets returned by tff.simulation.dataset APIs are tff.simulation.ClientData objects. The object returned by tf.keras.datasets.fashion_mnist.load_data is a tuple of numpy arrays.
So what is needed is to implement a tff.simulation.ClientData to wrap the dataset returned by tf.keras.datasets.fashion_mnist.load_data. Some previous questions about implementing ClientData objects:
Federated learning : convert my own image dataset into tff simulation Clientdata
How define tff.simulation.ClientData.from_clients_and_fn Function?
Is there a reasonable way to create tff clients datat sets?
This does require answering an important question: how should the Fashion MNIST data be split into individual users? The dataset doesn't include features that that could be used for partitioning. Researchers have come up with a few ways to synthetically partition the data, e.g. randomly sampling some labels for each participant, but this will have a great effect on model training and is useful to invest some thought here.

Pre processing keras dataset using keras tokenizer

I am trying to do some pre processing using the keras tokenizer on data I read using the following code:
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.interleave(lambda x:
tf.data.TFRecordDataset(x).prefetch(params.num_parallel_readers),
cycle_length=params.num_parallel_readers,
block_length=1)
dataset = dataset.map(_parse_example, num_parallel_calls = params.num_parallel_calls)
Now that I have the parsed example (output of _parse_example map function) I want to do some pre-processing on the text using tf.keras.preprocessing.text.Tokenizer method texts_to_sequences.
However, texts_to_sequences expects an input of python strings and I get Tensors in the parsed_example.
I can work around it by using py_func to wrap my code (see 'emb': tf.py_func.. in the code below), but then I will not be able to serialize my model (according to the py_func documentation).
dataset = dataset.map(lambda features, labels:
({'window': features['window'],
'winSize': features['winSize'],
'LandingPage': features['LandingPage'],
'emb': tf.py_func(getEmb, [features['window']], tf.int32)},
tf.one_hot(labels, hparams.numClasses) ))
Looking for a way to do that (or a link to some similar example)

Visualize Gensim Word2vec Embeddings in Tensorboard Projector

I've only seen a few questions that ask this, and none of them have an answer yet, so I thought I might as well try. I've been using gensim's word2vec model to create some vectors. I exported them into text, and tried importing it on tensorflow's live model of the embedding projector. One problem. It didn't work. It told me that the tensors were improperly formatted. So, being a beginner, I thought I would ask some people with more experience about possible solutions.
Equivalent to my code:
import gensim
corpus = [["words","in","sentence","one"],["words","in","sentence","two"]]
model = gensim.models.Word2Vec(iter = 5,size = 64)
model.build_vocab(corpus)
# save memory
vectors = model.wv
del model
vectors.save_word2vec_format("vect.txt",binary = False)
That creates the model, saves the vectors, and then prints the results out nice and pretty in a tab delimited file with values for all of the dimensions. I understand how to do what I'm doing, I just can't figure out what's wrong with the way I put it in tensorflow, as the documentation regarding that is pretty scarce as far as I can tell.
One idea that has been presented to me is implementing the appropriate tensorflow code, but I don’t know how to code that, just import files in the live demo.
Edit: I have a new problem now. The object I have my vectors in is non-iterable because gensim apparently decided to make its own data structures that are non-compatible with what I'm trying to do.
Ok. Done with that too! Thanks for your help!
What you are describing is possible. What you have to keep in mind is that Tensorboard reads from saved tensorflow binaries which represent your variables on disk.
More information on saving and restoring tensorflow graph and variables here
The main task is therefore to get the embeddings as saved tf variables.
Assumptions:
in the following code embeddings is a python dict {word:np.array (np.shape==[embedding_size])}
python version is 3.5+
used libraries are numpy as np, tensorflow as tf
the directory to store the tf variables is model_dir/
Step 1: Stack the embeddings to get a single np.array
embeddings_vectors = np.stack(list(embeddings.values(), axis=0))
# shape [n_words, embedding_size]
Step 2: Save the tf.Variable on disk
# Create some variables.
emb = tf.Variable(embeddings_vectors, name='word_embeddings')
# Add an op to initialize the variable.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables and save the
# variables to disk.
with tf.Session() as sess:
sess.run(init_op)
# Save the variables to disk.
save_path = saver.save(sess, "model_dir/model.ckpt")
print("Model saved in path: %s" % save_path)
model_dir should contain files checkpoint, model.ckpt-1.data-00000-of-00001, model.ckpt-1.index, model.ckpt-1.meta
Step 3: Generate a metadata.tsv
To have a beautiful labeled cloud of embeddings, you can provide tensorboard with metadata as Tab-Separated Values (tsv) (cf. here).
words = '\n'.join(list(embeddings.keys()))
with open(os.path.join('model_dir', 'metadata.tsv'), 'w') as f:
f.write(words)
# .tsv file written in model_dir/metadata.tsv
Step 4: Visualize
Run $ tensorboard --logdir model_dir -> Projector.
To load metadata, the magic happens here:
As a reminder, some word2vec embedding projections are also available on http://projector.tensorflow.org/
Gensim actually has the official way to do this.
Documentation about it
The above answers didn't work for me. What I found out pretty useful was this script (will be added to gensim in the future) Source
To transform the data to metadata:
model = gensim.models.Word2Vec.load_word2vec_format(model_path, binary=True)
with open( tensorsfp, 'w+') as tensors:
with open( metadatafp, 'w+') as metadata:
for word in model.index2word:
encoded=word.encode('utf-8')
metadata.write(encoded + '\n')
vector_row = '\t'.join(map(str, model[word]))
tensors.write(vector_row + '\n')
Or follow this gist
the gemsim provide convert method word2vec to tf projector file
python -m gensim.scripts.word2vec2tensor -i ~w2v_model_file -o output_folder
add in projector wesite, upload the metadata

Categories

Resources