BERT Domain Adaptation

BERT Domain Adaptation - python

I am using transformers.BertForMaskedLM to further pre-train the BERT model on my custom dataset. I first serialize all the text to a .txt file by separating the words by a whitespace. Then, I am using transformers.TextDataset to load the serialized data with a BERT tokenizer given as tokenizer argument. Then, I am using BertForMaskedLM.from_pretrained() to load the pre-trained model (which is what transformers library presents). Then, I am using transformers.Trainer to further pre-train the model on my custom dataset, i.e., domain adaptation, for 3 epochs. I save the model with trainer.save_model(). Then, I want to load the further pre-trained model to get the embeddings of the words in my custom dataset. To load the model, I am using AutoModel.from_pretrained() but this pops up a warning.
Some weights of the model checkpoint at {path to my further pre-trained model} were not used when initializing BertModel
So, I know why this pops up. Because I further pre-trained using transformers.BertForMaskedLM but when I load with transformers.AutoModel, it loads it as transformers.BertModel. What I do not understand is if this is a problem or not. I just want to get the embeddings, e.g., embedding vector with a size of 768.

You saved a BERT model with LM head attached. Now you are going to load the serialized file into a standalone BERT structure without any extra element and the warning is issued. This is pretty normal and there is no Fatal error to do so! You can check the list of unloaded params like below:
from transformers import BertTokenizer, BertModel
from transformers import BertTokenizer, BertLMHeadModel, BertConfig
import torch
lmbert = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
lmbert.save_pretrained('you_desired_path/BertLMHeadModel')
lmbert_params = []
for name, param in lmbert.named_parameters():
lmbert_params.append(name)
bert = BertModel.from_pretrained('you_desired_path/BertLMHeadModel')
bert_params = []
for name, param in bert.named_parameters():
bert_params.append(name)
params_ralated_to_lm_head = [param_name for param_name in lmbert_params if param_name.replace('bert.', '') not in bert_params]
params_ralated_to_lm_head
output:
['cls.predictions.bias',
'cls.predictions.transform.dense.weight',
'cls.predictions.transform.dense.bias',
'cls.predictions.transform.LayerNorm.weight',
'cls.predictions.transform.LayerNorm.bias']

Related

torch: Error loading caffe2_detectron_ops_gpu.dll or one of its dependencies

Setup
I have Anaconda virtual environment on a Windows machine. Torch, transformers, tensorflow and CUDA installed. I previously used GPU acceleration from the transformers pipeline.
What I want to do ultimately
I want to use BERT to take word embeddings of the text in my dataset, and input that in LDA to do topic modeling. The pseudo-code I intend to run:
import pandas as pd
import tensorflow as tf
import numpy as np
from transformers import BertTokenizer, TFBertModel
# Load your dataset into a pandas dataframe
df = pd.read_csv("topic_modeling_input_dataset.csv")
# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenize the reviews in the dataframe
df["tokenized_reviews"] = df["review"].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
# Convert the tokenized reviews to tensors
input_ids = tf.constant(list(df["tokenized_reviews"]))
# Extract the word embeddings using the pre-trained BERT model
bert_model = TFBertModel.from_pretrained("bert-base-uncased")
_, word_embeddings = bert_model(input_ids)
# Convert the word embeddings from tensors to numpy arrays
word_embeddings = word_embeddings.numpy()
# Average the word embeddings for each review to obtain sentence embeddings
sentence_embeddings = np.mean(word_embeddings, axis=1)
# Use the sentence embeddings as input to Latent Dirichlet Allocation (LDA) for topic modeling
from sklearn.decomposition import LatentDirichletAllocation
# Initialize the LDA model
lda_model = LatentDirichletAllocation(n_components=10)
# Fit the LDA model on the sentence embeddings
lda_model.fit(sentence_embeddings)
# Print the topics learned by the LDA model
for index, topic in enumerate(lda_model.components_):
print(f"Topic {index}:")
words = [tokenizer.convert_ids_to_tokens[i] for i in np.argsort(topic)[::-1][:10]]
print(words)
But can't get past through importing the libraries
Problem
The command
from transformers import BertTokenizer, TFBertModel gives the error:
RuntimeError: Failed to import transformers.models.bert.modeling_tf_bert because of the following error (look up to see its traceback):
Failed to import transformers.data.data_collator because of the following error (look up to see its traceback):
[WinError 182] The operating system cannot run %1. Error loading "C:\Users\myuser\Anaconda3\envs\text_mining\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Debugging Attempt
In the directory, I only have caffe2_detectron_ops_gpu.dll and no caffe2_detectron_ops.dll, which was the problem in all reported cases I read online.
I also tried reinstalling caffe2 in conda, but can't get a clean command or way to do it. caffe2 documentation mentions that the commands could have unresolved bugs.

What preprocessing is done by Tensorflow Lite Image Searcher ImageDataLoader

I'm trying to use Tensorflow Lite Image Searcher with mobilenet_v3 to query a database for a similar image and I'm getting surprisingly bad results.
Thus, I suspect there is a mistake in one of my steps.
Here is my code:
from tflite_model_maker import searcher
from tflite_support.task import vision
# Load pretrained model:
model_name = "lite-model_imagenet_mobilenet_v3_large_100_224_feature_vector_5_metadata_1.tflite"
data_loader = searcher.ImageDataLoader.create(model_name)
# Load db images, calc feature vectors
path_db = 'dir_with_db_jpg_images'
data_loader.load_from_folder(path_db)
# Set model:
scann_options = searcher.ScaNNOptions(
distance_measure="dot_product",
tree=searcher.Tree(num_leaves=10, num_leaves_to_search=2),
score_ah=searcher.ScoreAH(2, anisotropic_quantization_threshold=0.2))
model = searcher.Searcher.create_from_data(data_loader, scann_options)
# Export as tflite model:
model.export(
export_filename="searcher.tflite",
userinfo="",
export_format=searcher.ExportFormat.TFLITE)
# Load model:
image_searcher = vision.ImageSearcher.create_from_file("searcher.tflite")
# Predict NN for query image:
image = vision.TensorImage.create_from_file('path_query_img.jpg')
result = image_searcher.search(image)
result.nearest_neighbors
Do the bad results stem from missing preprocessing steps for the input images (DB or query)?
What happens to the images once load_from_folder(path) is called before the feature vector is created? And is this different from image_searcher.search(vision.TensorImage.create_from_file('path_query_img.jpg')) which might explain the bad results.
I couldn't figure out a way to first load the DB images and then feed them to the model for it to extract the feature vector and append it to the searcher model. Do such methods exist? That would allow me to experiment more with preprocessing steps (e.g. resizing the images to the 224x244 expected image size of the network - which I tried).
Maybe there is another problem with my code?
Thank you!

How do I convert a .meta .index and .data file into SavedModel (.pb) format without losing metagraphdef?

I'm trying to convert these three files of a pre-trained model:
semantic_model.data-00000-of-00001
semantic_model.index
semantic_model.meta
into a Saved Model format, so that I can later convert it into TFLite format for Inference.
Searching StackOverflow, I'd come across this code, which properly generates the Saved_model.pb, however as noted in some comments, doing it in this way doesn't keep the Meta Graph Definitions, which causes an error when I later try to convert it into TFlite format or freeze it.
import os
import tensorflow.compat.v1 as tf
tf.compat.v1.disable_eager_execution()
export_dir = '/tf-end-to-end/export_dir'
#trained_checkpoint_prefix = 'Models/semantic_model' \tf-end-to-end\Models
trained_checkpoint_prefix = 'PATH TO MODEL DIRECTORY'
tf.reset_default_graph()
graph = tf.Graph()
loader = tf.train.import_meta_graph(trained_checkpoint_prefix + ".meta" )
sess = tf.Session()
loader.restore(sess,trained_checkpoint_prefix)
builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.TRAINING, tf.saved_model.tag_constants.SERVING], strip_default_attrs=True)
builder.save()
This is the error I get when trying to use the saved_model:
RuntTimeError: MetaGraphDef associated with tags {'serve'} could not be found in SavedModel
Running the showsavedmodelcli --all doesn't display anything under signature definitions for the created saved_model.
My question is, how do I maintain the data and convert this to saved_model, for later conversion into TFLite format?
Model Structure and creation details can be seen here, including the checkpoint files mentioned: https://github.com/OMR-Research/tf-end-to-end

Refer to these steps for converting checkpoints to a TFLite model: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/r1/convert/python_api.md#convert-checkpoints-

Fine tune Tensorflow Universal sentence encoder

I want to fine tune Tensorflow Universal sentence encoder. But when I try to use own data as input to Keras layer with sentence encoder I got the error
import tensorflow as tf
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
hub_layer = hub.KerasLayer(model, output_shape=[512], dtype=tf.string)
hub_layer(np.array('test sentence'))
InvalidArgumentError: input must be a vector, got shape: []
I tried different variants of input data: strings, numpy arrays, but it didn't work. Does anybody know what this model takes as input and how can I adapt my own data for this?

How to obtain input data from ONNX model?

I have exported my PyTorch model to ONNX. Now, is there a way for me to obtain the input layer from that ONNX model?
Exporting PyTorch model to ONNX
import torch.onnx
checkpoint = torch.load("./saved_pytorch_model.pth")
model.load_state_dict(checkpoint['state_dict'])
input = torch.tensor(df_X.values).float()
torch.onnx.export(model, input, "onnx_model.onnx")
Loading ONNX model
onnx_model = onnx.load('onnx_model.onnx')
I want to be able to somehow obtain the input layer from onnx_model. Is this possible?

The ONNX model is a protobuf structure, as defined here (https://github.com/onnx/onnx/blob/master/onnx/onnx.in.proto). You can work with it using the standard protobuf methods generated for python (see: https://developers.google.com/protocol-buffers/docs/reference/python-generated). I don't understand what exactly you want to extract. But you can iterate through the nodes that make up the graph (model.graph.node). The first node in the graph may or may not correspond to what you might consider the first layer (it depends on how the translation was done). You can also get the inputs of the model (model.graph.input).

Onnx library provides APIs to extract the names and shapes of all the inputs as follows:
model = onnx.load(onnx_model)
inputs = {}
for inp in model.graph.input:
shape = str(inp.type.tensor_type.shape.dim)
inputs[inp.name] = [int(s) for s in shape.split() if s.isdigit()]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

BERT Domain Adaptation - python

Related

torch: Error loading caffe2_detectron_ops_gpu.dll or one of its dependencies

What preprocessing is done by Tensorflow Lite Image Searcher ImageDataLoader

How do I convert a .meta .index and .data file into SavedModel (.pb) format without losing metagraphdef?

Fine tune Tensorflow Universal sentence encoder

How to obtain input data from ONNX model?

Categories

Resources