Fine tune Tensorflow Universal sentence encoder - python

I want to fine tune Tensorflow Universal sentence encoder. But when I try to use own data as input to Keras layer with sentence encoder I got the error
import tensorflow as tf
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
hub_layer = hub.KerasLayer(model, output_shape=[512], dtype=tf.string)
hub_layer(np.array('test sentence'))
InvalidArgumentError: input must be a vector, got shape: []
I tried different variants of input data: strings, numpy arrays, but it didn't work. Does anybody know what this model takes as input and how can I adapt my own data for this?

Related

torch: Error loading caffe2_detectron_ops_gpu.dll or one of its dependencies

Setup
I have Anaconda virtual environment on a Windows machine. Torch, transformers, tensorflow and CUDA installed. I previously used GPU acceleration from the transformers pipeline.
What I want to do ultimately
I want to use BERT to take word embeddings of the text in my dataset, and input that in LDA to do topic modeling. The pseudo-code I intend to run:
import pandas as pd
import tensorflow as tf
import numpy as np
from transformers import BertTokenizer, TFBertModel
# Load your dataset into a pandas dataframe
df = pd.read_csv("topic_modeling_input_dataset.csv")
# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenize the reviews in the dataframe
df["tokenized_reviews"] = df["review"].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
# Convert the tokenized reviews to tensors
input_ids = tf.constant(list(df["tokenized_reviews"]))
# Extract the word embeddings using the pre-trained BERT model
bert_model = TFBertModel.from_pretrained("bert-base-uncased")
_, word_embeddings = bert_model(input_ids)
# Convert the word embeddings from tensors to numpy arrays
word_embeddings = word_embeddings.numpy()
# Average the word embeddings for each review to obtain sentence embeddings
sentence_embeddings = np.mean(word_embeddings, axis=1)
# Use the sentence embeddings as input to Latent Dirichlet Allocation (LDA) for topic modeling
from sklearn.decomposition import LatentDirichletAllocation
# Initialize the LDA model
lda_model = LatentDirichletAllocation(n_components=10)
# Fit the LDA model on the sentence embeddings
lda_model.fit(sentence_embeddings)
# Print the topics learned by the LDA model
for index, topic in enumerate(lda_model.components_):
print(f"Topic {index}:")
words = [tokenizer.convert_ids_to_tokens[i] for i in np.argsort(topic)[::-1][:10]]
print(words)
But can't get past through importing the libraries
Problem
The command
from transformers import BertTokenizer, TFBertModel gives the error:
RuntimeError: Failed to import transformers.models.bert.modeling_tf_bert because of the following error (look up to see its traceback):
Failed to import transformers.data.data_collator because of the following error (look up to see its traceback):
[WinError 182] The operating system cannot run %1. Error loading "C:\Users\myuser\Anaconda3\envs\text_mining\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Debugging Attempt
In the directory, I only have caffe2_detectron_ops_gpu.dll and no caffe2_detectron_ops.dll, which was the problem in all reported cases I read online.
I also tried reinstalling caffe2 in conda, but can't get a clean command or way to do it. caffe2 documentation mentions that the commands could have unresolved bugs.

using onnx runtime time to run inference on keras converted onnx model

I am attempting to run inference on my .onnx model converted from a keras' multi-label text classification model using https://keras.io/examples/nlp/multi_label_classification/. This is a text classification model that takes in text and provides a predicted category.
I am following this tutorial here: https://github.com/onnx/keras-onnx/blob/master/tutorial/TensorFlow_Keras_MNIST.ipynb BUT I am not sure what I am missing with regards to finding the format for 'feed'.
The keras model looks like this:
def make_model():
shallow_mlp_model = keras.Sequential(
[
layers.Dense(512, activation="relu"),
layers.Dense(256, activation="relu"),
layers.Dense(lookup.vocabulary_size(), activation="sigmoid"),
]
)
return shallow_mlp_model
The feed is a dictionary of input name to data. In the original tutorial the data for the input named 'dense_input' was created with this:
data = [digit_image.astype(np.float32)]
The data needs to be a numpy array as ONNX Runtime knows nothing about BatchDataset (based on the output in your question that's the type returned by make_dataset).

BERT Domain Adaptation

I am using transformers.BertForMaskedLM to further pre-train the BERT model on my custom dataset. I first serialize all the text to a .txt file by separating the words by a whitespace. Then, I am using transformers.TextDataset to load the serialized data with a BERT tokenizer given as tokenizer argument. Then, I am using BertForMaskedLM.from_pretrained() to load the pre-trained model (which is what transformers library presents). Then, I am using transformers.Trainer to further pre-train the model on my custom dataset, i.e., domain adaptation, for 3 epochs. I save the model with trainer.save_model(). Then, I want to load the further pre-trained model to get the embeddings of the words in my custom dataset. To load the model, I am using AutoModel.from_pretrained() but this pops up a warning.
Some weights of the model checkpoint at {path to my further pre-trained model} were not used when initializing BertModel
So, I know why this pops up. Because I further pre-trained using transformers.BertForMaskedLM but when I load with transformers.AutoModel, it loads it as transformers.BertModel. What I do not understand is if this is a problem or not. I just want to get the embeddings, e.g., embedding vector with a size of 768.
You saved a BERT model with LM head attached. Now you are going to load the serialized file into a standalone BERT structure without any extra element and the warning is issued. This is pretty normal and there is no Fatal error to do so! You can check the list of unloaded params like below:
from transformers import BertTokenizer, BertModel
from transformers import BertTokenizer, BertLMHeadModel, BertConfig
import torch
lmbert = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
lmbert.save_pretrained('you_desired_path/BertLMHeadModel')
lmbert_params = []
for name, param in lmbert.named_parameters():
lmbert_params.append(name)
bert = BertModel.from_pretrained('you_desired_path/BertLMHeadModel')
bert_params = []
for name, param in bert.named_parameters():
bert_params.append(name)
params_ralated_to_lm_head = [param_name for param_name in lmbert_params if param_name.replace('bert.', '') not in bert_params]
params_ralated_to_lm_head
output:
['cls.predictions.bias',
'cls.predictions.transform.dense.weight',
'cls.predictions.transform.dense.bias',
'cls.predictions.transform.LayerNorm.weight',
'cls.predictions.transform.LayerNorm.bias']

Printing label for Tf Lite model Image Classification

I am working on a Image Claasification TF Lite model to detect mask or no mask from human faces using this link. I followed the link and trained an image multi class classification in vertex AI and downloaded the TF lite model. The labels of the model are "mask" and "no_mask". In order to test the model, I wrote the following code:
interpret= tf.lite.Interpreter(model_path="<FILE_PATH>")
input= interpret.get_input_details()
output= interpret.get_output_details()
interpret.allocate_tensors()
pprint(input)
pprint(output)
data= cv2.imread("file.jpeg")
new_image= cv2.resize(data,(224,224))
interpret.resize_tensor_input(input[0]["index"],[1,224,224,3])
interpret.allocate_tensors()
interpret.set_tensor(input[0]["index"],[new_image])
interpret.invoke()
result= interpret.get_tensor(output[0]['index'])
print (" Prediction is - {}".format(result))
Using this code for one of my image is giving me the result as :
[[30 246]]
Now I want to print the label in the result as well. For example:
mask: 30
no_mask: 46
Is there any way I can implement this?
Please help as I am new to TF Lite
I solved it myself. The .tflite model downloaded from Vertex AI contains the label file called 'dict.txt' that contains all the labels. Check the GCP doc here. To get this label file we first need to unzip the .tflite file which will give us the dict.txt. For more information, check out the tflite documentation and how to read associate file from the models.
After that I executed the following code taking reference from the github link label.py:
import argparse
import time
import numpy as np
from PIL import Image
import tensorflow as tf
interpret= tf.lite.Interpreter(model_path="<FILE_PATH>")
input= interpret.get_input_details()
output= interpret.get_output_details()
interpret.allocate_tensors()
pprint(input)
pprint(output)
data= cv2.imread("file.jpeg")
new_image= cv2.resize(data,(224,224))
interpret.resize_tensor_input(input[0]["index"],[1,224,224,3])
interpret.allocate_tensors()
interpret.set_tensor(input[0]["index"],[new_image])
interpret.invoke()
floating_model= input[0]['dtype'] == np.float32
op_data= interpret.get_tensor(output[0]['index'])
result= np.squeeze(op_data)
top_k=result.agrsort()[-5:][::1]
labels=load_labels("dict.txt")
for i in top_k:
if floating_model:
print('{:08.6f}: {}'.format(float(result[i]), labels[i]))
else:
print('{:08.6f}: {}'.format(float(result[i] / 255.0), labels[i]))

How to obtain input data from ONNX model?

I have exported my PyTorch model to ONNX. Now, is there a way for me to obtain the input layer from that ONNX model?
Exporting PyTorch model to ONNX
import torch.onnx
checkpoint = torch.load("./saved_pytorch_model.pth")
model.load_state_dict(checkpoint['state_dict'])
input = torch.tensor(df_X.values).float()
torch.onnx.export(model, input, "onnx_model.onnx")
Loading ONNX model
onnx_model = onnx.load('onnx_model.onnx')
I want to be able to somehow obtain the input layer from onnx_model. Is this possible?
The ONNX model is a protobuf structure, as defined here (https://github.com/onnx/onnx/blob/master/onnx/onnx.in.proto). You can work with it using the standard protobuf methods generated for python (see: https://developers.google.com/protocol-buffers/docs/reference/python-generated). I don't understand what exactly you want to extract. But you can iterate through the nodes that make up the graph (model.graph.node). The first node in the graph may or may not correspond to what you might consider the first layer (it depends on how the translation was done). You can also get the inputs of the model (model.graph.input).
Onnx library provides APIs to extract the names and shapes of all the inputs as follows:
model = onnx.load(onnx_model)
inputs = {}
for inp in model.graph.input:
shape = str(inp.type.tensor_type.shape.dim)
inputs[inp.name] = [int(s) for s in shape.split() if s.isdigit()]

Categories

Resources