I want to use a FastText model in an ML pipeline I built; the model is saved as a .bin file on S3. My goal is to keep everything in a cloud-based pipeline, so I want to avoid local files. I feel like I am really close, but I can't figure out how to create a temporary .bin file. I am also not sure whether I am saving and reading the FastText model in the most efficient way. The code below works, but it writes the file locally, which I want to avoid.
import fasttext
import smart_open

file = smart_open.smart_open(f's3://{bucket_name}/{key}')  # S3 location of the .bin model
listed = b''.join([i for i in file])
with open("ml_model.bin", "wb") as binary_file:
    binary_file.write(listed)
model = fasttext.load_model("ml_model.bin")
If you want to use the fasttext wrapper for the official Facebook FastText code, you may need to create a local temporary copy; your trouble suggests that code insists on opening a local file path.
You could also try the Gensim package's separate FastText support, which should accept an S3 path via its load_facebook_model() function:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
(Note, though, that Gensim doesn't support all FastText functionality, like the supervised mode.)
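For example, a minimal sketch (assuming gensim is installed alongside smart_open with its S3 extras; the bucket and key are placeholders):

from gensim.models.fasttext import load_facebook_model

# hypothetical S3 location; gensim opens paths through smart_open internally,
# so a remote URI may work if boto3 is available
bucket_name, key = 'my-bucket', 'path/to/ml_model.bin'
model = load_facebook_model(f's3://{bucket_name}/{key}')
print(model.wv.most_similar('example'))  # then use the ordinary gensim API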
As partially answered by the response above, a temporary file was needed. On top of that, the temporary file's path had to be passed as a plain string, which is a bit odd. Working code below:
import tempfile
import fasttext
import smart_open
from pathlib import Path

# stream the model bytes from S3
file = smart_open.smart_open(f's3://{bucket_name}/{key}')
listed = b''.join([i for i in file])

# fasttext.load_model() only accepts a local path, and only as a str,
# so write the bytes to a temporary file and pass str(tfile)
with tempfile.TemporaryDirectory() as tdir:
    tfile = Path(tdir).joinpath('tempfile.bin')
    tfile.write_bytes(listed)
    model = fasttext.load_model(str(tfile))
I am very much a newbie and am currently working on my final project. I watched a YouTube video that taught me how to code abstractive text summarization with Google's Pegasus model. It works fine, but I need it to be more efficient.
So here is the code
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
Every time I run that code, it downloads the google/pegasus-xsum model again, which is about 2.2 GB.
Here is a sample of the code in a notebook: https://github.com/nicknochnack/PegasusSummarization/blob/main/Pegasus%20Tutorial.ipynb
Running it downloads the model every time.
Is there any way to download the model once, save it locally, and have the code load the local copy on every run? Something like caching or saving the model locally, maybe?
Thanks.
Using inspect, you can easily find and locate the modules.
import inspect
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
print(inspect.getfile(PegasusForConditionalGeneration))
print(inspect.getfile(PegasusTokenizer))
You will get their paths, something like this:
/usr/local/lib/python3.9/site-packages/transformers/models/pegasus/modeling_pegasus.py
/usr/local/lib/python3.9/site-packages/transformers/models/pegasus/tokenization_pegasus.py
Now, if you look at what is inside the tokenization_pegasus.py file, you will notice that the vocab file for google/pegasus-xsum is probably fetched by the following lines:
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {"google/pegasus-xsum": "https://huggingface.co/google/pegasus-xsum/resolve/main/spiece.model"}
}
If you open
https://huggingface.co/google/pegasus-xsum/resolve/main/spiece.model
that file is downloaded directly to your machine.
UPDATE
After some searching on Google, I found something important: you can save the models you use, along with all their related files, to your working directory as follows:
tokenizer.save_pretrained("local_pegasus-xsum_tokenizer")
model.save_pretrained("local_pegasus-xsum_tokenizer_model")
Ref:
https://github.com/huggingface/transformers/issues/14561
After running this, those folders are saved automatically in your working directory, and you can load the models directly from them.
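A minimal sketch of loading from those saved folders (from_pretrained() accepts a local directory path, so no network access is needed):

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# load from the local folders created by save_pretrained() above
tokenizer = PegasusTokenizer.from_pretrained("local_pegasus-xsum_tokenizer")
model = PegasusForConditionalGeneration.from_pretrained("local_pegasus-xsum_tokenizer_model")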
Also, the roughly 2.2 GB file whose local path you wanted to know is located here online:
https://huggingface.co/google/pegasus-xsum/tree/main
After downloading the models to your directory, you will find it saved as pytorch_model.bin, the same name it has online.
I was following a tutorial online on using TensorFlow, and the author used this code:
prepWork = tf.keras.utils.get_file('shakespeare.txt', urlToTextFile)
If I want to use this code for my own project, I need to read a local text file, let's say 'prepWork.txt', from my machine. I can't use get_file, because that only works for online files. How would I do this? Nothing I've tried so far works.
You can find a TextLoader class, which reads a text file and transforms it into batches of consecutive fixed-size sequences of words, in the following repository (in the utils.py file): https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/utils.py
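For simply reading a local file, a minimal sketch in plain Python (no Keras utility needed; 'prepWork.txt' is the file named in the question):

# plain file I/O is enough for a local file; get_file is only for downloads
with open('prepWork.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(len(text))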
I've been trying to use a BERT model from TF Hub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2.
import tensorflow_hub as hub

bert_layer = hub.KerasLayer('./bert_en_uncased_L-12_H-768_A-12_2', trainable=True)
The problem is that it downloads the data on every run.
So I downloaded the .tar file from TF Hub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2
Now I'm trying to use this downloaded tar file (after untarring it).
I've followed this tutorial: https://medium.com/@xianbao.qian/how-to-run-tf-hub-locally-without-internet-connection-4506b850a915
But it didn't work out well, and no further information or script is provided in that blog post.
It would be great if someone could provide a complete script to use the downloaded model locally (without internet), or could improve the blog post above.
I've also tried
untarredFilePath = './bert_en_uncased_L-12_H-768_A-12_2'
bert_lyr = hub.load(untarredFilePath)
print(bert_lyr)
Output
<tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x7f05c46e6a10>
This doesn't seem to work. Or is there any other method to do this?
Hmm, I cannot reproduce your problem. What worked for me:
script.sh
# download the model file using the 'wget' program
wget "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2?tf-hub-format=compressed"
# rename the downloaded file name to 'tar_file.tar.gz'
mv 2\?tf-hub-format\=compressed tar_file.tar.gz
# extract tar_file.tar.gz to the local directory
tar -zxvf tar_file.tar.gz
# turn off internet
# run a test script
python3 test.py
# running the last command prints some tensorflow warnings, and then '<tensorflow_hub.keras_layer.KerasLayer object at 0x7fd702a7d8d0>'
test.py
import tensorflow_hub as hub
print(hub.KerasLayer('.'))
I wrote this script using this medium article (https://medium.com/@xianbao.qian/how-to-run-tf-hub-locally-without-internet-connection-4506b850a915) as a reference.

I create a cache directory within my project; the tensorflow model is cached locally in that directory, and I am able to load the model from it. Hope this helps you.
import os

# point tensorflow_hub's cache at a directory inside the project
os.environ["TFHUB_CACHE_DIR"] = r'C:\Users\USERX\PycharmProjects\PROJECTX\tf_hub'

import tensorflow as tf
import tensorflow_hub as hub
import hashlib

# the cache folder for a handle is named after the sha1 of the handle
handle = "https://tfhub.dev/google/universal-sentence-encoder/4"
print(hashlib.sha1(handle.encode("utf8")).hexdigest())

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def get_sentence_embeddings(paragraph_array):
    embeddings = embed(paragraph_array)
    return embeddings
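A quick usage check (hypothetical sentences; version 4 of the Universal Sentence Encoder returns 512-dimensional vectors):

sentences = ["TF Hub caches models locally.", "The second run loads from the cache."]
print(get_sentence_embeddings(sentences).shape)  # expected: (2, 512)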
After getting information from the tf-hub team, they provided this solution.
Say you have downloaded the .tar.gz file from the official tf-hub model page using the download button.
You have extracted it, which gives you a folder containing assets, variables and the model.
You put it in your working directory.
In the script, just add the path to that folder:
import tensorflow_hub as hub

model_path = './bert_en_uncased_L-12_H-768_A-12_2'  # in my case
# note: the path must point to the folder containing assets, variables
# and the saved model, not to the model.pb file itself
lyr = hub.KerasLayer(model_path, trainable=True)
Hopefully it works for you as well. Give it a try.
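For completeness, a hedged usage sketch for wiring the layer into a Keras model. This assumes version 2 of this TF Hub BERT SavedModel, whose signature takes a list of three int32 tensors (input_word_ids, input_mask, segment_ids) and returns pooled and sequence outputs; max_len is an assumed parameter:

import tensorflow as tf

max_len = 128  # assumed maximum sequence length
input_word_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32)
input_mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32)
segment_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32)

# this model version returns (pooled_output, sequence_output)
pooled_output, sequence_output = lyr([input_word_ids, input_mask, segment_ids])
model = tf.keras.Model([input_word_ids, input_mask, segment_ids], pooled_output)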
The tensorflow_hub library caches downloaded and uncompressed models on disk to avoid repeated downloads. The documentation at tensorflow.org/hub/caching has been expanded to discuss this and other cases.
I've built a simple app in Python, with a front-end UI in Dash.
It relies on three files:
a small dataframe, in pickle format, 95 KB
a large scipy sparse matrix, in NPZ format, 12 MB
a large scikit-learn KNN model, in joblib format, 65 MB
I read in the first file, the dataframe, successfully with:
link = 'https://github.com/user/project/raw/master/filteredDF.pkl'
df = pd.read_pickle(link)
But when I try something similar with the others, say the model:
mLink = 'https://github.com/user/project/raw/master/knnModel.pkl'
filehandler = open(mLink, 'rb')
model_knn = pickle.load(filehandler)
I just get an error
Invalid argument: 'https://github.com/user/project/raw/master/knnModel45Percent.pkl'
I also pushed these files using Github LFS, but the same error occurs.
I understand that hosting large static files on GitHub is bad practice, but I haven't been able to figure out how to use PyDrive or AWS S3 with my project. I just need these files to be read by my project, and I plan to host the app on something like Heroku. I don't really need a full-on DB to store the files.
The best case would be if I could read in these large files stored in my repo, but if there is a better approach, I am willing as well. I spent the past few days struggling through Dropbox, Amazon, and Google Cloud APIs and am a bit lost.
Any help appreciated, thank you.
Could you try the following?
from io import BytesIO
import pickle
import requests
mLink = 'https://github.com/aaronwangy/Kankoku/blob/master/filteredAnimeList45PercentAll.pkl?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model_knn = pickle.load(mfile)
Using BytesIO you create a file object out of the response that you get from GitHub. That object can then be used in pickle.load. Note that I have added ?raw=true to the URL of the request.
For those getting a KeyError: 10, try
model_knn = joblib.load(mfile)
instead of
model_knn = pickle.load(mfile)
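A self-contained version of the joblib variant (reusing the example URL from above; joblib.load accepts a file object just like pickle.load):

from io import BytesIO
import joblib
import requests

mLink = 'https://github.com/aaronwangy/Kankoku/blob/master/filteredAnimeList45PercentAll.pkl?raw=true'
mfile = BytesIO(requests.get(mLink).content)
model_knn = joblib.load(mfile)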
I am working on a project where I have an .rds file, generated by R code, that contains a trained model matching my requirements.
Now I need to load the trained model in Python and use it to process records.
Is there any way to do so? If not, what are the alternatives?
Thanks
We can use feather to exchange data frames between R and Python:

import feather

# df is an existing pandas DataFrame; the .feather file it produces
# can be written and read from R as well
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
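Note that feather exchanges data frames, not the fitted model itself. Also, the standalone feather package has been superseded by pyarrow, which reads and writes the same format; an equivalent sketch:

import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({'x': [1, 2, 3]})  # hypothetical data
feather.write_feather(df, 'my_data.feather')
df = feather.read_feather('my_data.feather')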