Visualize Gensim Word2vec Embeddings in Tensorboard Projector - python

I've only seen a few questions that ask this, and none of them have an answer yet, so I thought I might as well try. I've been using gensim's word2vec model to create some vectors. I exported them into text, and tried importing it on tensorflow's live model of the embedding projector. One problem. It didn't work. It told me that the tensors were improperly formatted. So, being a beginner, I thought I would ask some people with more experience about possible solutions.
Equivalent to my code:
import gensim
corpus = [["words","in","sentence","one"],["words","in","sentence","two"]]
model = gensim.models.Word2Vec(iter = 5,size = 64)
model.build_vocab(corpus)
# save memory
vectors = model.wv
del model
vectors.save_word2vec_format("vect.txt",binary = False)
That creates the model, saves the vectors, and then prints the results out nice and pretty in a tab delimited file with values for all of the dimensions. I understand how to do what I'm doing, I just can't figure out what's wrong with the way I put it in tensorflow, as the documentation regarding that is pretty scarce as far as I can tell.
One idea that has been presented to me is implementing the appropriate tensorflow code, but I don’t know how to code that, just import files in the live demo.
Edit: I have a new problem now. The object I have my vectors in is non-iterable because gensim apparently decided to make its own data structures that are non-compatible with what I'm trying to do.
Ok. Done with that too! Thanks for your help!

What you are describing is possible. What you have to keep in mind is that Tensorboard reads from saved tensorflow binaries which represent your variables on disk.
More information on saving and restoring tensorflow graph and variables here
The main task is therefore to get the embeddings as saved tf variables.
Assumptions:
in the following code embeddings is a python dict {word:np.array (np.shape==[embedding_size])}
python version is 3.5+
used libraries are numpy as np, tensorflow as tf
the directory to store the tf variables is model_dir/
Step 1: Stack the embeddings to get a single np.array
embeddings_vectors = np.stack(list(embeddings.values(), axis=0))
# shape [n_words, embedding_size]
Step 2: Save the tf.Variable on disk
# Create some variables.
emb = tf.Variable(embeddings_vectors, name='word_embeddings')
# Add an op to initialize the variable.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables and save the
# variables to disk.
with tf.Session() as sess:
sess.run(init_op)
# Save the variables to disk.
save_path = saver.save(sess, "model_dir/model.ckpt")
print("Model saved in path: %s" % save_path)
model_dir should contain files checkpoint, model.ckpt-1.data-00000-of-00001, model.ckpt-1.index, model.ckpt-1.meta
Step 3: Generate a metadata.tsv
To have a beautiful labeled cloud of embeddings, you can provide tensorboard with metadata as Tab-Separated Values (tsv) (cf. here).
words = '\n'.join(list(embeddings.keys()))
with open(os.path.join('model_dir', 'metadata.tsv'), 'w') as f:
f.write(words)
# .tsv file written in model_dir/metadata.tsv
Step 4: Visualize
Run $ tensorboard --logdir model_dir -> Projector.
To load metadata, the magic happens here:
As a reminder, some word2vec embedding projections are also available on http://projector.tensorflow.org/

Gensim actually has the official way to do this.
Documentation about it

The above answers didn't work for me. What I found out pretty useful was this script (will be added to gensim in the future) Source
To transform the data to metadata:
model = gensim.models.Word2Vec.load_word2vec_format(model_path, binary=True)
with open( tensorsfp, 'w+') as tensors:
with open( metadatafp, 'w+') as metadata:
for word in model.index2word:
encoded=word.encode('utf-8')
metadata.write(encoded + '\n')
vector_row = '\t'.join(map(str, model[word]))
tensors.write(vector_row + '\n')
Or follow this gist

the gemsim provide convert method word2vec to tf projector file
python -m gensim.scripts.word2vec2tensor -i ~w2v_model_file -o output_folder
add in projector wesite, upload the metadata

Related

How can I add the decode_batch_predictions() method into the Keras Handwriting Recognition OCR model?

I need to add the decode_batch_predictions() method to the output of the Keras Handwriting Recognition OCR model. The reason for that is that I want to convert the model to TF Lite and I want the output to be decoded since I didn't find any way to decode the output on TF Lite in Android. I already saw a similar post for a similar Keras model but it wouldn't work for this model.
I have not much knowledge in Python so it's difficult for me to adapt the answers on that post for this model so I would really appreciate any help, thanks!
I tried using the code from that post but it wouldn't work
In the notebook for model given in your link, make the following changes after prediction_model:
prediction_model = keras.models.Model(
model.get_layer(name="image").input, model.get_layer(name="dense2").output
) # This line is present in the handwriting_recognition notebook.
def CTCDecoder():
def decode_batch_predictions(pred):
input_len = np.ones(pred.shape[0]) * pred.shape[1]
# Use greedy search. For complex tasks, you can use beam search
results = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0][:, :max_length]
# Iterate over the results and get back the text
output_text = []
for res in results:
#print(res)
res = tf.strings.reduce_join(num_to_char(res)).numpy().decode("utf-8")
output_text.append(res)
return output_text
return tf.keras.layers.Lambda(decode_batch_predictions, name='decode')
decoded_pred_model = keras.models.Model(prediction_model.input, outputs=CTCDecoder()(prediction_model.output))
Convert the decoded_pred_model to a .tflite and use it in android.

How to convert TensorFlow checkpoint files to TensorFlowJS?

I have a project that was developed on TensorFlow v1 I think. It works in Python 3.8 like this:
...
saver = tf.train.Saver(var_list=vars)
...
saver.restore(self.sess, tf.train.latest_checkpoint(checkpoint_dir))
...
The checkpoint files reside in the "checkpoint_dir"
I would like to use this with TFjs but I can't figure out how to transform the checkpoint files to something that can be loaded with TFjs.
What should I do?
thanks,
John
Ok, I figured it out. Hope this helps other beginners like me too.
The checkpoint files do not contain the model, they only contain the values (weights, etc) of the model.
The model is actually built in the code. So, here are the steps to convert the Tensorflow v1 checkpoint files to TensorflowJS loadable model:
First I saved the checkpoint again because there was a file that was missing (.meta file) This contains some meta information about the values in the checkpoint. To save the checkpoint with meta I used this code right after the saver.restore(... call like this:
...
saver.save(self.sess,save_path='./newcheckpoint/')
...
Save the model as a frozen model file like this:
import tensorflow.compat.v1 as tf
meta_path = './newcheckpoint/.meta' # Your .meta file
output_node_names = ['name_of_the_output_node'] # Output nodes
with tf.Session() as sess:
# Restore the graph
saver = tf.train.import_meta_graph(meta_path)
# Load weights
saver.restore(sess,tf.train.latest_checkpoint('./newcheckpoint/'))
# Freeze the graph
frozen_graph_def = tf.graph_util.convert_variables_to_constants(
sess,
sess.graph_def,
output_node_names)
# Save the frozen graph
with open('./freeze/output_graph.pb', 'wb') as f:
f.write(frozen_graph_def.SerializeToString())
This will save the model to ./freeze/output_graph.pb
Using tensorflowjs_converter convert the frozen model to a web model like this:
tensorflowjs_converter --input_format=tf_frozen_model --output_node_names='final_add' --skip_op_check ./freeze/output_graph.pb ./web_model/
Had to use the --skip_op_check due to some missing op errors/warnings when trying to convert.
As a result of step 3, the ./webmodel/ folder will contain the JSON and binary files required by the TensorflowJS library.
Here's how I load the model using tfjs 2.x:
model=await tf.loadGraphModel('web_model/model.json');

How do I convert a .meta .index and .data file into SavedModel (.pb) format without losing metagraphdef?

I'm trying to convert these three files of a pre-trained model:
semantic_model.data-00000-of-00001
semantic_model.index
semantic_model.meta
into a Saved Model format, so that I can later convert it into TFLite format for Inference.
Searching StackOverflow, I'd come across this code, which properly generates the Saved_model.pb, however as noted in some comments, doing it in this way doesn't keep the Meta Graph Definitions, which causes an error when I later try to convert it into TFlite format or freeze it.
import os
import tensorflow.compat.v1 as tf
tf.compat.v1.disable_eager_execution()
export_dir = '/tf-end-to-end/export_dir'
#trained_checkpoint_prefix = 'Models/semantic_model' \tf-end-to-end\Models
trained_checkpoint_prefix = 'PATH TO MODEL DIRECTORY'
tf.reset_default_graph()
graph = tf.Graph()
loader = tf.train.import_meta_graph(trained_checkpoint_prefix + ".meta" )
sess = tf.Session()
loader.restore(sess,trained_checkpoint_prefix)
builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.TRAINING, tf.saved_model.tag_constants.SERVING], strip_default_attrs=True)
builder.save()
This is the error I get when trying to use the saved_model:
RuntTimeError: MetaGraphDef associated with tags {'serve'} could not be found in SavedModel
Running the showsavedmodelcli --all doesn't display anything under signature definitions for the created saved_model.
My question is, how do I maintain the data and convert this to saved_model, for later conversion into TFLite format?
Model Structure and creation details can be seen here, including the checkpoint files mentioned: https://github.com/OMR-Research/tf-end-to-end
Refer to these steps for converting checkpoints to a TFLite model: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/r1/convert/python_api.md#convert-checkpoints-

How to validate data during inference using tfx / tfdv / tensorflow serving?

I'm building a tfx pipeline and using tensorflow serving to serve my model. I save the signature with model.save(...).
So far I was able to use the transform layer to transform the feature before prediction with tf_transform_output.transform_features_layer() (see my code below).
However, I'm wondering how one can detect anomalies in the input data? For instance, I don't want to predict for an input value that is too far away from the distribution that a feature was trained with before.
The tfdv library offers functions like generate_statistics_from_[csv|dataframe|tfrecord] but I was not able to find any good example to generate statistics for serialized tf.Examples (or something that is not saved in a file, like csv, tfrecords etc.).
I'm aware of the following example in the documentation:
import tensorflow_data_validation as tfdv
import tfx_bsl
import pyarrow as pa
decoder = tfx_bsl.coders.example_coder.ExamplesToRecordBatchDecoder()
example = decoder.DecodeBatch([serialized_tfexample])
options = tfdv.StatsOptions(schema=schema)
anomalies = tfdv.validate_instance(example, options)
But in this example serialized_tfexample is a string, whereas in my code below the argument serialized_tf_examples is a Tensor of strings.
Sorry if this is an obvious question. I spent all day to find a solution without success. Maybe I'm getting this all thing wrong. Maybe this is not the right place to put validations. So my more generalized question is actually: How do you validate incoming input data before prediction when you serve a model, which you created through a tfx pipeline, in production?
I'm thankful for any lead into the right direction.
Here is my code to which I want to add validation:
...
tf_transform_output = tft.TFTransformOutput(...)
model.tft_layer = tf_transform_output.transform_features_layer()
#tf.function(input_signature=[
tf.TensorSpec(shape=[None], dtype=tf.string, name='examples')
])
def serve_tf_examples_fn(serialized_tf_examples):
#### How can I generate stats and validate serialized_tf_examples? ###
#### Is this the right place? ###
feature_spec = tf_transform_output.raw_feature_spec()
feature_spec.pop(TARGET_LABEL)
parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)
transformed_features = model.tft_layer(parsed_features)
return model(transformed_features)
...
model.save(serving_model_dir,
save_format='tf',
signatures={
'serving_default': serve_tf_examples_fn
})

How do you find the output_node_names from a frozen model (a pb file) in Tensorflow?

I am trying to convert my frozen_model.pb to a tensorflow JS compatible (.pb) file, which is based on SSD Mobilenet V2 COCO pretrained model by Tensorflow. I am stuck at how to get the output_node_names parameter which is needed while using the tensorflowjs_converter. How do I get to know the output node names?
I have tried to get the operation names by using the below Python script, but am not able to understand which one is the output node.
def load_graph(model_file):
graph = tf.Graph()
graph_def = tf.GraphDef()
with open(model_file, "rb") as f:
graph_def.ParseFromString(f.read())
with graph.as_default():
tf.import_graph_def(graph_def)
return graph
graph = load_graph('frozen_model.pb')
ops = graph.get_operations()
Firstly, you can inspect all the nodes in your graph_def as follows:
for node in graph_def.node
print(node.name)
Alternatively, if you want to visually see the graph and determine which node to be used as output, using TensorBoard is the way to go. There is a tool called import_pb_to_tensorboard. It's essentially using a handful of lines to write the graph to a log_dir, which you can point tensorboard to. You can simply copy those lines into your own script to achieve the same thing without building from the tensorflow repo.
Thirdly, there is another tool called summarize_graph tool:
bazel build tensorflow/tools/graph_transforms:summarize_graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph --in_graph=/path/to/your/graph.pb

Categories

Resources