I was following an online TensorFlow tutorial, and the author used this code:
prepWork = tf.keras.utils.get_file('shakespeare.txt', urlToTextFile)
If I want to use this code for my own project, I need to read a local text file, let's say 'prepWork.txt', from my machine. I can't use get_file, because that only works for online files. How would I do this? Everything I've tried before doesn't work.
You can find a TextLoader class that reads a text file and transforms it into fixed-size batches of consecutive words in the following repository (in the utils.py file): https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/utils.py
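For a simple local file, plain Python file I/O is enough. A minimal self-contained sketch (the filename and sample text are stand-ins; in your project 'prepWork.txt' would already exist on disk):

```python
from pathlib import Path

# Stand-in setup so the snippet runs on its own -- in a real project,
# 'prepWork.txt' already exists and this line is unnecessary.
Path('prepWork.txt').write_text('To be, or not to be', encoding='utf-8')

# Read the local file directly; no tf.keras.utils.get_file needed.
with open('prepWork.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(len(text))  # -> 19, the number of characters read
```

The resulting `text` string can be fed into the same preprocessing pipeline the tutorial applies to the downloaded Shakespeare file.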
I am using the Universal Sentence Encoder to find sentence similarity. Below is the code I use to load the model:
import tensorflow_hub as hub
model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")
Here, instead of pointing at the URL on TensorFlow Hub, is there a way to download the model programmatically to a local location and load it from the local filesystem?
Well, I am not sure and I haven't tried it, but I checked the source of hub.load() and found some interesting facts that may help with your problem.
First of all, the doc says:
This function is roughly equivalent to the TF2 function
tf.saved_model.load() on the result of hub.resolve(handle).
Calling this function requires TF 1.14 or newer. It can be called
both in eager and graph mode.
That means the function can handle both a URL and a saved model on a file system. To confirm this, I checked the documentation of hub.resolve(), which is used internally by hub.load(), and there I found something of interest:
def resolve(handle):
  """Resolves a module handle into a path.

  This function works both for plain TF2 SavedModels and the legacy TF1 Hub
  format.

  Resolves a module handle into a path by downloading and caching in
  location specified by TF_HUB_CACHE_DIR if needed.

  Currently, three types of module handles are supported:
    1) Smart URL resolvers such as tfhub.dev, e.g.:
       https://tfhub.dev/google/nnlm-en-dim128/1.
    2) A directory on a file system supported by Tensorflow containing module
       files. This may include a local directory (e.g. /usr/local/mymodule) or a
       Google Cloud Storage bucket (gs://mymodule).
    3) A URL pointing to a TGZ archive of a module, e.g.
       https://example.com/mymodule.tar.gz.

  Args:
    handle: (string) the Module handle to resolve.

  Returns:
    A string representing the Module path.
  """
  return registry.resolver(handle)
The documentation clearly says it supports a path on the local file system pointing to the module/model files, so you should run some experiments and give it a try. For more details, have a look at the source file.
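A hedged sketch of how that could look in practice, based on the resolve() docs above: downloads are cached under TF_HUB_CACHE_DIR, and hub.load() accepts a local directory as the handle. The directory names here are hypothetical, and the actual hub.load() call is shown commented out since it needs tensorflow_hub and a real SavedModel on disk:

```python
import os

# Pin TF Hub's download cache to a known local directory (hypothetical path).
os.environ["TF_HUB_CACHE_DIR"] = os.path.expanduser("~/tfhub_modules")

# After one online hub.load(url) call populates the cache, later runs can
# point straight at the extracted SavedModel directory instead of the URL:
local_handle = os.path.join(os.environ["TF_HUB_CACHE_DIR"],
                            "universal-sentence-encoder")  # hypothetical dir name
# import tensorflow_hub as hub
# model = hub.load(local_handle)  # loads from disk, no network needed

print(local_handle)
```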
I am a complete newbie, currently working on my final project. I watched a YouTube video that taught me to code abstractive text summarization with Google's Pegasus model. It works fine, but I need it to be more efficient.
So here is the code
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
Every time I run that code, it downloads the google/pegasus-xsum model, which is about 2.2 GB.
Here is a sample of the code in a notebook: https://github.com/nicknochnack/PegasusSummarization/blob/main/Pegasus%20Tutorial.ipynb
When it runs, it downloads the model (screenshot omitted).
Is there any way to download the model first and save it locally, so that every time I run the code it just loads it from disk? Something like caching or saving the model locally, maybe?
Thanks.
Using inspect, you can locate the modules easily.
import inspect
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
print(inspect.getfile(PegasusForConditionalGeneration))
print(inspect.getfile(PegasusTokenizer))
You will get their paths, something like this:
/usr/local/lib/python3.9/site-packages/transformers/models/pegasus/modeling_pegasus.py
/usr/local/lib/python3.9/site-packages/transformers/models/pegasus/tokenization_pegasus.py
Now, if you look inside the tokenization_pegasus.py file, you will notice that the google/pegasus-xsum vocabulary is probably being fetched via the following lines:
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "google/pegasus-xsum": "https://huggingface.co/google/pegasus-xsum/resolve/main/spiece.model"
    }
}
If you open the following URL:
https://huggingface.co/google/pegasus-xsum/resolve/main/spiece.model
that file will be downloaded directly to your machine.
UPDATE
After some searching on Google, I found something important: you can get the models used and all their related files downloaded to your working directory with the following:
tokenizer.save_pretrained("local_pegasus-xsum_tokenizer")
model.save_pretrained("local_pegasus-xsum_tokenizer_model")
Ref:
https://github.com/huggingface/transformers/issues/14561
After running this, you will see the files saved automatically in your working directory, and you can then load the models directly from those local paths.
Also, the 2.2 GB file whose local path you wanted to know is located here online:
https://huggingface.co/google/pegasus-xsum/tree/main
And after downloading the models to your directory, you can see from the screenshot that the file is named pytorch_model.bin, just as it is named online.
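To tie this together, a hedged sketch of the save-once / load-locally pattern using the Hugging Face save_pretrained / from_pretrained pair. The directory name is arbitrary, and the steps are wrapped in functions that are only defined here, since the first one downloads the ~2.2 GB model:

```python
def cache_pegasus_locally(save_dir="./local_pegasus-xsum"):
    """First run: download once from the Hub, then persist everything to disk."""
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
    model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
    tokenizer.save_pretrained(save_dir)
    model.save_pretrained(save_dir)
    return save_dir

def load_pegasus_locally(save_dir="./local_pegasus-xsum"):
    """Later runs: from_pretrained also accepts a local directory, so
    everything is read from disk and no download happens."""
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    tokenizer = PegasusTokenizer.from_pretrained(save_dir)
    model = PegasusForConditionalGeneration.from_pretrained(save_dir)
    return tokenizer, model
```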
I am looking to use a FastText model in a ML pipeline that I made and saved as a .bin file on s3. My hope is to keep this all in a cloud based pipeline, so I don't want local files. I feel like I am really close, but I can't figure out how to make a temporary .bin file. I also am not sure if I am saving and reading the FastText model in the most efficient way. The below code works, but it saves the file locally which I want to avoid.
import smart_open
import fasttext

file = smart_open.smart_open(...)  # S3 location of the .bin model
listed = b''.join([i for i in file])
with open("ml_model.bin", "wb") as binary_file:
    binary_file.write(listed)
model = fasttext.load_model("ml_model.bin")
If you want to use the fasttext wrapper for the official Facebook FastText code, you may need to create a local temporary copy - your troubles make it seem like that code relies on opening a local file path.
You could also try the Gensim package's separate FastText support, which should accept an S3 path via its load_facebook_model() function:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
(Note, though, that Gensim doesn't support all FastText functionality, like the supervised mode.)
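A minimal sketch of that Gensim route. The S3 path is hypothetical, and the call is wrapped in a function that is only defined here, since it needs gensim installed and a real model file to run:

```python
def load_from_s3_with_gensim(path="s3://my-bucket/ml_model.bin"):
    # load_facebook_model() reads the Facebook FastText .bin format from a path.
    from gensim.models.fasttext import load_facebook_model
    return load_facebook_model(path)
```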
As partially answered by the response above, a temporary file was needed. On top of that, the temporary file's path needed to be passed as a string, which is somewhat strange. Working code below:
import tempfile
import fasttext
import smart_open
from pathlib import Path
file = smart_open.smart_open(f's3://{bucket_name}/{key}')
listed = b''.join([i for i in file])
with tempfile.TemporaryDirectory() as tdir:
    tfile = Path(tdir).joinpath('tempfile.bin')
    tfile.write_bytes(listed)
    model = fasttext.load_model(str(tfile))
I'm trying to analyze genome data from a huge (1.75GB compressed) vcf file using Python. The technician suggested I use scikit-allel and gave me this link: http://alimanfoo.github.io/2017/06/14/read-vcf.html. I wasn't able to install the module on my computer; but I successfully installed it on a cluster which I access through vpn. There, I successfully opened the file and have been able to access the data. But I can only access the cluster through a command line interface, and that isn't as friendly as the Spyder I have on my computer; so I've been trying to bring the data back. The GitHub link says I can save the data into a npz file which I can read straight into Python's numpy; so I've been trying to do that.
First, I tried allel.vcf_to_npz('existing_name.vcf','new_name.npz',fields='calldata/GT') on the cluster. This created a (suspiciously small) new npz file on the cluster, which I downloaded. But when I opened up Spyder on my computer and typed genotypes=np.load('real_genotypes.npz'), no new variable called genotypes appeared in the Variable Explorer. Adding the line print(genotypes) produces <numpy.lib.npyio.NpzFile object at 0x00000__________>
Next, thinking that I should copy everything to be sure, I tried allel.vcf_to_npz('existing_name.vcf','new_name.npz',fields='*',overwrite=True)
This created a 2.10GB file. After a lengthy download, I tried the same thing, but got the same results: No new variable when I try to np.load the file, and <numpy.lib.npyio.NpzFile object at 0x000001DB0DEC7F88> when I ask to print it.
When I tried to Google search this problem, I saw this question: Load compressed data (.npz) from file using numpy.load. But my case looks different. I don't get an error message; I just get nothing. So what's wrong?
Thanks
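For what it's worth, np.load() on an .npz archive returns a lazy NpzFile container rather than the arrays themselves, which would explain why printing it shows only `<numpy.lib.npyio.NpzFile object ...>`; the arrays appear once you index the container by key. A self-contained sketch (the tiny array here is a stand-in for the real genotype data):

```python
import numpy as np

# Stand-in archive so the snippet runs on its own.
np.savez('demo.npz', gt=np.array([[0, 1], [1, 1]]))

data = np.load('demo.npz')
print(data.files)    # names stored in the archive, e.g. ['gt']
gt = data['gt']      # indexing by key yields the actual ndarray
print(gt.shape)      # (2, 2)
```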
Hey, I am working on a TensorFlow project. I am using code from this website, and when I run my train.py file I get the error below.
RuntimeError: Did not find any input files matching the glob pattern
['D:\ML\Object-Detection\data\train.record']
Changing the backslashes to '/' worked for me.
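The likely reason: in a plain Python string, backslash sequences like the `\t` in that Windows path are interpreted as escape characters, so the glob never matches anything on disk. A quick illustration (the paths mirror the one in the error):

```python
bad = 'D:\ML\Object-Detection\data\train.record'    # '\t' silently becomes a tab
good = 'D:/ML/Object-Detection/data/train.record'   # forward slashes work on Windows
raw = r'D:\ML\Object-Detection\data\train.record'   # raw string keeps backslashes

print('\t' in bad)   # True -- the path is corrupted
print('\t' in good)  # False
print('\t' in raw)   # False
```

Either forward slashes or a raw string avoids the problem.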
Being careful with the file extension might help others:
I had generated a .record file when a .tfrecord was required.
And vice versa :)
I found what happened.
Instead of using the existing pipeline.config, I downloaded one:
!wget https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/configs/tf2/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.config
In this example I used ssd_mobilenet; substitute your own model name as needed.
You can check the TensorFlow Object Detection API configs here.
If it still does not work, try removing the leading '/', or inserting one in front of the directory path.