Save/Export a custom tokenizer from a Google Colab notebook - python

I have a custom tokenizer and want to use it for prediction in Production API. How do I save/download the tokenizer?
This is my code trying to save it:
import pickle
from tensorflow.python.lib.io import file_io
with file_io.FileIO('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
No error, but I can't find the tokenizer after saving it. So I assume the code didn't work?

Here is the situation, using a simple file to disentangle the issue from irrelevant specifics like pickle, TensorFlow, and tokenizers:
# Run in a new Colab notebook:
%pwd
/content
%ls
sample_data/
Let's save a simple file foo.npy:
import numpy as np
np.save('foo', np.array([1,2,3]))
%ls
foo.npy sample_data/
At this stage, %ls should show tokenizer.pickle in your case instead of foo.npy.
Now, Google Drive and Colab do not communicate by default; you have to mount the drive first (it will ask for authorization):
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
After which, an %ls command will give:
%ls
drive/ foo.npy sample_data/
and you can now navigate (and save) inside drive/ (i.e. actually in your Google Drive), changing the path accordingly. Anything saved there can be retrieved later.
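Tying this back to the original question, a minimal sketch of saving the pickled tokenizer into your Drive after mounting (tokenizer here is the object from the question; MyDrive is the default name of the Drive root folder):
import pickle
# assumes drive.mount('/content/drive') has already been run, as above;
# anything written under /content/drive/MyDrive/ ends up in your Google Drive
with open('/content/drive/MyDrive/tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
Alternatively, you can keep saving to /content and pull the file down with google.colab.files.download('tokenizer.pickle'), keeping in mind that /content is wiped when the runtime is recycled.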

Related

Loading a FastText Model from s3 without Saving Locally

I am looking to use a FastText model, saved as a .bin file on S3, in an ML pipeline that I made. My hope is to keep everything in a cloud-based pipeline, so I don't want local files. I feel like I am really close, but I can't figure out how to make a temporary .bin file. I am also not sure if I am saving and reading the FastText model in the most efficient way. The code below works, but it saves the file locally, which I want to avoid.
import smart_open
import fasttext
# s3_path is the S3 location of the .bin model, e.g. an s3:// URI
file = smart_open.smart_open(s3_path)
listed = b''.join([i for i in file])
with open("ml_model.bin", "wb") as binary_file:
    binary_file.write(listed)
model = fasttext.load_model("ml_model.bin")
If you want to use the fasttext wrapper for the official Facebook FastText code, you may need to create a local temporary copy - your troubles make it seem like that code relies on opening a local file path.
You could also try the Gensim package's separate FastText support, which should accept an S3 path via its load_facebook_model() function:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
(Note, though, that Gensim doesn't support all FastText functionality, like the supervised mode.)
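For instance, a minimal sketch of the Gensim route, assuming smart_open's S3 extras are installed so the s3:// URI can be read directly (the bucket and file names are placeholders):
from gensim.models.fasttext import load_facebook_model
# Gensim reads the file via smart_open, so an s3:// URI can be passed directly
model = load_facebook_model('s3://my-bucket/ml_model.bin')
print(model.wv.most_similar('example'))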
As partially answered by the response above, a temporary file was needed. On top of that, the temporary file's path needed to be passed as a string, which is a bit odd. Working code below:
import tempfile
import fasttext
import smart_open
from pathlib import Path
file = smart_open.smart_open(f's3://{bucket_name}/{key}')
listed = b''.join([i for i in file])
with tempfile.TemporaryDirectory() as tdir:
    tfile = Path(tdir).joinpath('tempfile.bin')
    tfile.write_bytes(listed)
    model = fasttext.load_model(str(tfile))

Downloading Data from Google Drive in Colab

I'm a beginner in TensorFlow and Python in general, so any help would be much appreciated. I'm following this tutorial from TensorFlow, just with my own data.
So I'm trying to download my own data from a link I got for a folder that I uploaded to Google Drive. I will then use that data in an image classifier model. However, when I run the following, I see the images start downloading:
dataset_url_training = "https://drive.google.com/drive/folders/genericid?usp=sharing"
data_dir_training = tf.keras.utils.get_file('flower_photos', origin=dataset_url_training, untar=True)
data_dir_training = pathlib.Path(data_dir_training)
Downloading data from https://drive.google.com/drive/folders/genericid?usp=sharing
106496/Unknown - 2s 14us/step
And then it just stops. And when I try to use the following code:
print(image_count)
The output spits out: 0
I'm really confused and I don't know what to do. Some suggestions have been to make a zip file url, but that only applies to individual files and doesn't work for whole folders like mine. Furthermore, as far as I know, Google Drive doesn't allow you to get links for zip files, just those for sharing (they are my own files, for clarification).
Thank you.
Edit 1: Just want to be clear: I'm NOT looking for a path. I'm looking for a URL, hence the use of a directory. I've also tried using the link of a zip file, but I got the same error message as before.
When you want to use a file from Google Drive in Colab, you can mount your Drive in Colab.
from google.colab import drive
drive.mount('/content/gdrive')
Then you can open files from Google Drive.
For example, if your file is in the directory "folder" at the top level of your Drive:
path = "gdrive/My Drive/folder/flower_photos"
Edit/Addition:
To make it clearer, you change this part of the tutorial
import pathlib
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)
to this
import pathlib
from google.colab import drive
drive.mount('/content/gdrive')
data_dir = "gdrive/My Drive/flower_photos"
data_dir = pathlib.Path(data_dir)
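From there you can continue exactly as in the tutorial. A minimal sketch, assuming a recent TensorFlow version and that flower_photos contains one sub-folder of .jpg images per class:
import tensorflow as tf
# count the images (this is the image_count the question prints)
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)
# build a training dataset straight from the mounted folder
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(180, 180),
    batch_size=32)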

How to update a Google Drive file in Colab without version history?

Hi everyone. I use Colab and a big dataset for my task: about 10 GB, divided into 50 parts (I use Parquet), and I only have about 1 GB of free space left. When I try to update my dataset_parts.parquet files on Google Drive, Drive creates a new version of each file. I don't need that, because I don't have the free space for it.
So, how can I update my files without version history?
# an example of my code
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
df = pd.read_parquet('/content/drive/MyDrive/file_0.parquet', columns=columns)
df['amnt'] = df['coral'] / 10
df.to_parquet('/content/drive/MyDrive/file_0.parquet', compression='gzip')

How to use tf-hub models locally

I've been trying to use a BERT model from tf-hub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2.
import tensorflow_hub as hub
bert_layer = hub.KerasLayer('./bert_en_uncased_L-12_H-768_A-12_2', trainable=True)
But the problem is that it downloads the model again on every run.
So I downloaded the .tar file from tf-hub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2
Now I'm trying to use this downloaded tar file (after extracting it).
I've followed this tutorial: https://medium.com/#xianbao.qian/how-to-run-tf-hub-locally-without-internet-connection-4506b850a915
But it didn't work out well, and no further information or script is provided in that blog post.
Could someone provide a complete script to use the downloaded model locally (without internet), or improve the above Medium post?
I've also tried
untarredFilePath = './bert_en_uncased_L-12_H-768_A-12_2'
bert_lyr = hub.load(untarredFilePath)
print(bert_lyr)
Output
<tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x7f05c46e6a10>
Doesn't seem to work.
Or is there any other method to do this?
Hmm I cannot reproduce your problem. What worked for me:
script.sh
# download the model file using the 'wget' program
wget "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2?tf-hub-format=compressed"
# rename the downloaded file name to 'tar_file.tar.gz'
mv 2\?tf-hub-format\=compressed tar_file.tar.gz
# extract tar_file.tar.gz to the local directory
tar -zxvf tar_file.tar.gz
# turn off internet
# run a test script
python3 test.py
# running the last command prints some tensorflow warnings, and then '<tensorflow_hub.keras_layer.KerasLayer object at 0x7fd702a7d8d0>'
test.py
import tensorflow_hub as hub
print(hub.KerasLayer('.'))
I wrote this script using this Medium article (https://medium.com/#xianbao.qian/how-to-run-tf-hub-locally-without-internet-connection-4506b850a915) as a reference. I create a cache directory within my project; the TensorFlow model is cached locally in that directory, and I am able to load the model from there. Hope this helps you.
import os
# TFHUB_CACHE_DIR must be set before tensorflow_hub resolves any handle
os.environ["TFHUB_CACHE_DIR"] = r'C:\Users\USERX\PycharmProjects\PROJECTX\tf_hub'
import tensorflow as tf
import tensorflow_hub as hub
import hashlib
handle = "https://tfhub.dev/google/universal-sentence-encoder/4"
# the cached copy is stored in a sub-directory named after this hash of the handle
hashlib.sha1(handle.encode("utf8")).hexdigest()
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
def get_sentence_embeddings(paragraph_array):
    embeddings = embed(paragraph_array)
    return embeddings
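A small usage example of the helper above (the sentences are placeholders; universal-sentence-encoder/4 returns 512-dimensional embeddings):
sentences = ["TF-Hub models can be cached locally.",
             "Set TFHUB_CACHE_DIR before loading any handle."]
embeddings = get_sentence_embeddings(sentences)
print(embeddings.shape)  # (2, 512)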
After getting in touch with the tf-hub team, they provided this solution.
Let's say you have downloaded the .tar.gz file from the official tf-hub model page using the download button.
You have extracted it and got a folder containing assets, variables and the saved model.
You put that folder in your working directory.
In your script, just add the path to that folder:
import tensorflow_hub as hub
# the path you provide must be the folder that contains assets, variables
# and saved_model.pb, not the path of saved_model.pb itself
model_path = './bert_en_uncased_L-12_H-768_A-12_2'  # in my case
lyr = hub.KerasLayer(model_path, trainable=True)
Hope it works for you as well. Give it a try.
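If it helps, here is a sketch of wiring that layer into a Keras model, assuming the v2 BERT SavedModel signature (a list of three int32 inputs returning pooled and sequence outputs); the sequence length of 128 is an arbitrary choice:
import tensorflow as tf
import tensorflow_hub as hub
seq_length = 128
input_word_ids = tf.keras.layers.Input(shape=(seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(seq_length,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(seq_length,), dtype=tf.int32, name="segment_ids")
bert_layer = hub.KerasLayer('./bert_en_uncased_L-12_H-768_A-12_2', trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
model = tf.keras.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=pooled_output)
model.summary()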
The tensorflow_hub library caches downloaded and uncompressed models on disk to avoid repeated downloads. The documentation at tensorflow.org/hub/caching has been expanded to discuss this and other cases.

How to save data you've already loaded and processed in a Google Colab notebook so you don't have to reload it every time?

I've read about pickling with the pickle library, but does that only save models you've trained, and not, for instance, the actual dataframe you've loaded into a variable from a massive CSV file?
This example notebook has some examples of different ways to save and load data.
You can actually use pickle to save any Python object, including pandas DataFrames; however, it's more usual to serialize using one of pandas' own methods such as pandas.DataFrame.to_csv, to_feather, etc.
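For example, a minimal sketch of that pandas route, writing to a mounted Drive so the processed frame survives the runtime (the paths are placeholders; to_feather requires pyarrow):
import pandas as pd
# expensive load and processing you don't want to repeat every session
df = pd.read_csv('massive.csv')
# persist the processed frame to your mounted Drive
df.to_feather('/content/drive/MyDrive/massive.feather')
# in a later session, just read it back
df = pd.read_feather('/content/drive/MyDrive/massive.feather')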
I would probably recommend the option that uses the GCS command-line tool, which you can run from inside your notebook by prefixing the command with !:
import pandas as pd
# Create a local file to upload.
df = pd.DataFrame([1,2,3])
df.to_csv("/tmp/to_upload.txt")
# Copy the file to our new bucket.
# Full reference: https://cloud.google.com/storage/docs/gsutil/commands/cp
!gsutil cp /tmp/to_upload.txt gs://my-bucket/
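And to get the file back in a later session, copy it down from the bucket first (same placeholder bucket name as above):
# copy the file back from the bucket and load it
!gsutil cp gs://my-bucket/to_upload.txt /tmp/to_upload.txt
df = pd.read_csv("/tmp/to_upload.txt", index_col=0)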
