I am trying to use a simple pipeline offline. I am only allowed to download files directly from the web.
I went to https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/tree/main and downloaded all the files into a local folder, C:\\Users\\me\\mymodel.
However, when I try to load the model I get a strange error:
from transformers import pipeline
classifier = pipeline(task="sentiment-analysis",
                      model="C:\\Users\\me\\mymodel",
                      tokenizer="C:\\Users\\me\\mymodel")
ValueError: unable to parse C:\Users\me\mymodel\modelcard.json as a URL or as a local path
What is the issue here?
Thanks!
It must be one of two cases:
You didn't download all the required files properly
Folder path is wrong
FYI, I am listing out the required contents in the directory:
config.json
pytorch_model.bin or tf_model.h5
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.txt
The solution was slightly indirect:
load the model on a computer with internet access
save the model with save_pretrained()
transfer the resulting folder to the offline machine and pass its path in the pipeline call
The folder will contain all the expected files.
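Before calling pipeline() on the offline machine, it can help to verify that the folder actually contains everything from the list above. A minimal stdlib-only sketch (the check_model_dir helper and its file sets are my own, based on the directory contents listed earlier in this thread):

```python
from pathlib import Path

# Files an offline checkpoint folder is expected to contain; either the
# PyTorch or the TensorFlow weights file is enough.
REQUIRED = {"config.json", "special_tokens_map.json", "tokenizer.json",
            "tokenizer_config.json", "vocab.txt"}
WEIGHTS = {"pytorch_model.bin", "tf_model.h5"}

def check_model_dir(path):
    """Return a sorted list of required files missing from `path`."""
    present = {p.name for p in Path(path).iterdir()}
    missing = sorted(REQUIRED - present)
    if not (WEIGHTS & present):
        missing.append("pytorch_model.bin or tf_model.h5")
    return missing
```

An empty return value means the folder looks complete; anything returned is a candidate for re-downloading or re-copying.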
Related
I'd like to be able to load a Hugging Face transformer base model (xlm-roberta-base) from GCS. However, when loading from the pytorch_model.bin file, a directory containing a config.json file must be given as an argument, and GCS buckets obviously do not act like regular directories. How can I achieve this?
So far what I have attempted is something like this:
import gcsfs
from transformers import XLMRobertaModel

fs = gcsfs.GCSFileSystem(project="{project_name}")
XLMRobertaModel.from_pretrained(
    fs.cat("{bucket}/xlm-roberta-base/pytorch_model.bin"),
    from_pt=True,
    config=fs.cat("{bucket}/xlm-roberta-base/config.json"))
This produces error message:
OSError: Can't load the configuration of '<File-like object GCSFileSystem, bucket/xlm-roberta-base/config.json>'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '<File-like object GCSFileSystem, bucket/xlm-roberta-base/config.json>' is the correct path to a directory containing a config.json file
I know fs.cat("{bucket}/xlm-roberta-base/config.json") is not going to return a path to a directory, but I'm not sure what I should give as argument given the directory is in a GCS bucket.
Is it possible to do this at all?
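For what it's worth, one common workaround is to copy the checkpoint files out of the bucket into a local directory first, then hand from_pretrained() that directory. A sketch only, assuming gcsfs's fsspec-style ls/get methods and an illustrative bucket layout:

```python
import os

def fetch_model_dir(fs, remote_prefix, local_dir):
    """Copy every file under `remote_prefix` in the bucket into `local_dir`
    so that from_pretrained() can treat it as a plain local directory."""
    os.makedirs(local_dir, exist_ok=True)
    for remote_path in fs.ls(remote_prefix):
        filename = os.path.basename(remote_path)
        if filename:  # skip the directory entry itself
            fs.get(remote_path, os.path.join(local_dir, filename))
    return local_dir

# Hypothetical usage:
# local_dir = fetch_model_dir(gcsfs.GCSFileSystem(project="{project_name}"),
#                             "{bucket}/xlm-roberta-base", "./xlm-roberta-base")
# model = XLMRobertaModel.from_pretrained(local_dir)
```

The copy costs disk space, but it sidesteps the "directory containing a config.json" requirement entirely.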
I am using the Universal Sentence Encoder to find sentence similarity. Below is the code I use to load the model:
import tensorflow_hub as hub
model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")
Instead of pointing at the URL on TensorFlow Hub, is there a way to download the model programmatically to a local location and load it from the local filesystem?
Well, I am not sure and I haven't tried it, but I checked the source of hub.load() and found some interesting facts that may help with your problem.
First of all, the docs say:
This function is roughly equivalent to the TF2 function
tf.saved_model.load() on the result of hub.resolve(handle).
Calling this function requires TF 1.14 or newer. It can be called
both in eager and graph mode.
That means the function can handle both a URL and a saved model on a file system. To confirm, I checked the documentation of hub.resolve(), which is used internally by hub.load(), and there I found something of interest:
def resolve(handle):
  """Resolves a module handle into a path.

  This function works both for plain TF2 SavedModels and the legacy TF1 Hub
  format.

  Resolves a module handle into a path by downloading and caching in
  location specified by TF_HUB_CACHE_DIR if needed.

  Currently, three types of module handles are supported:
    1) Smart URL resolvers such as tfhub.dev, e.g.:
       https://tfhub.dev/google/nnlm-en-dim128/1.
    2) A directory on a file system supported by Tensorflow containing module
       files. This may include a local directory (e.g. /usr/local/mymodule) or a
       Google Cloud Storage bucket (gs://mymodule).
    3) A URL pointing to a TGZ archive of a module, e.g.
       https://example.com/mymodule.tar.gz.

  Args:
    handle: (string) the Module handle to resolve.

  Returns:
    A string representing the Module path.
  """
  return registry.resolver(handle)
The documentation clearly says that a local filesystem path pointing to the module/model files is supported, so you should now perform some experiments and give it a try. For more details, have a look at the source file.
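For handle type 3 above (a TGZ archive URL), the download-and-extract step needs nothing beyond the standard library; tfhub.dev serves the archive when you append ?tf-hub-format=compressed to the model URL. A sketch under that assumption:

```python
import os
import tarfile
import urllib.request

def fetch_and_extract(url, archive_path, model_dir):
    """Download a module archive and unpack it into `model_dir`,
    which can then be passed to hub.load()."""
    urllib.request.urlretrieve(url, archive_path)
    os.makedirs(model_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(model_dir)
    return model_dir

# Hypothetical usage for the model in the question:
# fetch_and_extract(
#     "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3?tf-hub-format=compressed",
#     "model.tar.gz", "./use_multilingual_large_3")
# model = hub.load("./use_multilingual_large_3")
```

Once extracted, the directory falls under handle type 2, so hub.load() accepts it offline.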
I am a newbie, currently working on my final project. I watched a YouTube video that taught me to code abstractive text summarization with Google's Pegasus model. It works fine, but I need it to be more efficient.
So here is the code
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
Every time I run that code, it downloads the "google/pegasus-xsum" model, which is about 2.2 GB.
Here is a sample of the code in a notebook: https://github.com/nicknochnack/PegasusSummarization/blob/main/Pegasus%20Tutorial.ipynb
Running it downloads the model again each time.
Is there any way to download the model first, save it locally, and have the code load the local copy every time I run it?
Something like caching or saving the model locally, maybe?
Thanks.
Mac
Using inspect you can find and locate the modules easily.
import inspect
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
print(inspect.getfile(PegasusForConditionalGeneration))
print(inspect.getfile(PegasusTokenizer))
You will get their paths, something like this:
/usr/local/lib/python3.9/site-packages/transformers/models/pegasus/modeling_pegasus.py
/usr/local/lib/python3.9/site-packages/transformers/models/pegasus/tokenization_pegasus.py
Now, if you go and look inside the tokenization_pegasus.py file, you will notice that the vocabulary for google/pegasus-xsum is probably fetched by the following lines:
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "google/pegasus-xsum": "https://huggingface.co/google/pegasus-xsum/resolve/main/spiece.model"
    }
}
If you open:
https://huggingface.co/google/pegasus-xsum/resolve/main/spiece.model
the file is downloaded directly to your machine.
UPDATE
After some searching on Google, I found something important: you can get the models in use, and all their related files, downloaded to your working directory with the following:
tokenizer.save_pretrained("local_pegasus-xsum_tokenizer")
model.save_pretrained("local_pegasus-xsum_tokenizer_model")
Ref:
https://github.com/huggingface/transformers/issues/14561
After running it, you will see the files saved automatically in your working directory. You can then load the models directly by passing those local folder paths to from_pretrained().
Also, the 2.2 GB file whose local path you asked about is located here online:
https://huggingface.co/google/pegasus-xsum/tree/main
After downloading the models to your directory, you will see that the weights file is named pytorch_model.bin, just as it is named online.
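On the caching part of the question: transformers already caches downloads, so the 2.2 GB should only be fetched once per cache directory. The helper below is a sketch of where that cache usually lives (TRANSFORMERS_CACHE and HF_HOME are real environment variables; the default path shown is the library's usual location, but treat it as an assumption since it has changed across versions):

```python
import os
from pathlib import Path

def hf_cache_dir():
    """Best-effort guess at where transformers caches downloads:
    TRANSFORMERS_CACHE wins, then HF_HOME/hub, then the usual default."""
    if "TRANSFORMERS_CACHE" in os.environ:
        return Path(os.environ["TRANSFORMERS_CACHE"])
    if "HF_HOME" in os.environ:
        return Path(os.environ["HF_HOME"]) / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"
```

Pointing TRANSFORMERS_CACHE at a folder you keep between runs means subsequent runs reuse the same files instead of re-downloading.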
I use AutoGluon to create ML models locally on my computer.
Now I want to deploy them through AWS, but I realized that all the pickle files created in the process use hardcoded path references to other pickle files:
/home/myname/Desktop/ETC_PATH/AutoGluon/
I use cloudpickle.dump(predictor, open('FINAL_MODEL.pkl', 'wb')) to pickle the final ensemble model, but AutoGluon creates numerous other pickle files of the individual models, which are then referenced as /home/myname/Desktop/ETC_PATH/AutoGluon/models/ and /home/myname/Desktop/ETC_PATH/AutoGluon/models/specific_model/ and so forth...
How can I make all the absolute paths everywhere be replaced by relative paths like root/AutoGluon/WHATEVER_PATH, where root could be set to anything depending on where the model is later saved?
Any pointers would be helpful.
EDIT: This solved the problem: instead of loading FINAL_MODEL.pkl (which seems to hardcode paths), I use AutoGluon's predictor = task.load(model_dir), which finds all dependencies correctly whether or not the AutoGluon folder as a whole was moved. This issue on GitHub helped.
I've been trying to use a BERT model from tf-hub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2.
import tensorflow_hub as hub
bert_layer = hub.KerasLayer('./bert_en_uncased_L-12_H-768_A-12_2', trainable=True)
But the problem is that it downloads the data on every run.
So I downloaded the .tar file from tf-hub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2
Now I'm trying to use this downloaded tar file (after untarring it).
I've followed this tutorial: https://medium.com/@xianbao.qian/how-to-run-tf-hub-locally-without-internet-connection-4506b850a915
But it didn't work out well, and no further information or script is provided in that blog post.
Could someone provide a complete script to use the downloaded model locally (without internet), or improve the above blog post?
I've also tried
untarredFilePath = './bert_en_uncased_L-12_H-768_A-12_2'
bert_lyr = hub.load(untarredFilePath)
print(bert_lyr)
Output
<tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x7f05c46e6a10>
It doesn't seem to work.
Or is there any other method to do this?
Hmm I cannot reproduce your problem. What worked for me:
script.sh
# download the model file using the 'wget' program
wget "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2?tf-hub-format=compressed"
# rename the downloaded file name to 'tar_file.tar.gz'
mv 2\?tf-hub-format\=compressed tar_file.tar.gz
# extract tar_file.tar.gz to the local directory
tar -zxvf tar_file.tar.gz
# turn off internet
# run a test script
python3 test.py
# running the last command prints some tensorflow warnings, and then '<tensorflow_hub.keras_layer.KerasLayer object at 0x7fd702a7d8d0>'
test.py
import tensorflow_hub as hub
print(hub.KerasLayer('.'))
I wrote this script using this medium article (https://medium.com/@xianbao.qian/how-to-run-tf-hub-locally-without-internet-connection-4506b850a915) as a reference.

I am creating a cache directory within my project; the TensorFlow model is cached locally in that directory, and I can load the model from it. Hope this helps you.
import os
# set the cache location before loading any models
os.environ["TFHUB_CACHE_DIR"] = r'C:\Users\USERX\PycharmProjects\PROJECTX\tf_hub'

import tensorflow as tf
import tensorflow_hub as hub
import hashlib

handle = "https://tfhub.dev/google/universal-sentence-encoder/4"
# the cache subfolder name is the sha1 digest of the handle
hashlib.sha1(handle.encode("utf8")).hexdigest()

embed = hub.load(handle)

def get_sentence_embeddings(paragraph_array):
    embeddings = embed(paragraph_array)
    return embeddings
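The hashlib.sha1 line above is not decorative: tensorflow_hub names each module's cache folder after the sha1 hex digest of its handle, so you can predict where a given model will land. A small sketch making that explicit (the naming scheme mirrors the snippet above and is a library implementation detail):

```python
import hashlib
import os

def tfhub_cache_path(handle, cache_dir):
    """Return the folder where tensorflow_hub would cache `handle`:
    the sha1 hex digest of the handle, inside the cache directory."""
    digest = hashlib.sha1(handle.encode("utf8")).hexdigest()
    return os.path.join(cache_dir, digest)
```

Knowing the path lets you pre-populate the cache on an offline machine or check whether a model is already downloaded.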
After getting information from the tf-hub team, they provided this solution.
Let's say you have downloaded the .tar.gz file from the official tf-hub model page using the download button.
You have extracted it, and you got a folder containing assets, variables and the model.
You put it in your working directory.
In your script, just add the path to that folder:
import tensorflow_hub as hub

# The path must point to the folder that contains assets, variables and
# saved_model.pb, not to saved_model.pb itself.
model_path = './bert_en_uncased_L-12_H-768_A-12_2'  # in my case
lyr = hub.KerasLayer(model_path, trainable=True)
Hope it works for you as well. Give it a try.
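If hub.KerasLayer still rejects the folder, a quick stdlib check of the SavedModel layout can confirm you are pointing at the right level of the extracted archive (the helper is my own; the saved_model.pb plus variables layout is the one described above):

```python
from pathlib import Path

def looks_like_saved_model(path):
    """True if `path` has the minimal SavedModel layout that
    hub.KerasLayer / hub.load expect to find."""
    p = Path(path)
    return (p / "saved_model.pb").is_file() and (p / "variables").is_dir()
```

A common mistake is passing the parent folder of the extraction, or the saved_model.pb file itself, instead of the folder that directly contains it.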
The tensorflow_hub library caches downloaded and uncompressed models on disk to avoid repeated downloads. The documentation at tensorflow.org/hub/caching has been expanded to discuss this and other cases.