I am working on a machine learning task and have saved a Keras model, and I want to deploy it to GitHub (so that I can host a web demo using Streamlit and/or Flask). However, the model file is so large (> 1 GB) that I cannot upload it to GitHub for free.
My thought process regarding an alternative is to upload it to a cloud service such as Google Drive (or Dropbox, Box, etc.) and then use some sort of Python module to access it from there.
So my question is, can I upload a pickle file containing a pickled Keras model to Google Drive and then access that object from a Python script? If so, how would I go about doing so?
Thank you!
I believe you can. You'll need to pip install oauth2client and gspread. To access the data you need to enable the API manager on your Google Drive and get credentials in the form of a JSON file. Then you need to share the file with the email address in the credentials to give it permission. You could then port over the information as you need to; I'm not sure how Keras works, but this would be the first step.
Another important factor is that the Google API is very touchy when it comes to requests that come in too fast. To overcome this, put sleep calls between the requests, but if you do that this method may become far too slow for your idea.
import gspread
from oauth2client.service_account import ServiceAccountCredentials

scope = ["https://spreadsheets.google.com/feeds", "https://www.googleapis.com/auth/spreadsheets",
         "https://www.googleapis.com/auth/drive.file", "https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("Your json file here.json", scope)
client = gspread.authorize(creds)
sheet = client.open("your google sheets name or whatever").sheet1  # Open the spreadsheet
data = sheet.get_all_records()  # you can pull all the information with this.
I understand that you require a way to upload and download large files* from Drive using Python. If I understood your situation correctly, then you can achieve your goals easily by using the Drive API, as @TimothyChen commented. First, I highly recommend that you follow the Drive API Python Quickstart tutorial to create a working example. Later, you can modify it to use Files.create() and Files.get() to upload/download files as needed. Don't hesitate to ask me more questions if you have doubts.
*Please, keep in mind that there is a 5 TB size limit in Drive.
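For illustration only, here is a minimal sketch of how an upload and a download could look with the Drive API v3 Python client once the Quickstart's authentication is in place (token.json comes from the Quickstart's OAuth flow, and the file names are placeholders):

import io
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload

# token.json is produced by the Quickstart's OAuth flow (placeholder here).
creds = Credentials.from_authorized_user_file(
    'token.json', ['https://www.googleapis.com/auth/drive'])
service = build('drive', 'v3', credentials=creds)

# Upload the saved model file and remember its id.
media = MediaFileUpload('model.h5', resumable=True)
uploaded = service.files().create(body={'name': 'model.h5'},
                                  media_body=media, fields='id').execute()
file_id = uploaded['id']

# Download it again by id (Files.get with alt=media).
request = service.files().get_media(fileId=file_id)
buffer = io.BytesIO()
downloader = MediaIoBaseDownload(buffer, request)
done = False
while not done:
    status, done = downloader.next_chunk()
with open('model_downloaded.h5', 'wb') as f:
    f.write(buffer.getvalue())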
I've been using PyDrive for Google Drive automation and it works perfectly locally. I plan to move the code to a remote shared machine, which would mean I'll need to move the secrets too. I am using LoadCredentialsFile and passing in credentials.json. I don't, however, think my issue lies in my own code, but rather in the PyDrive code.
In short: I want to trigger a function in the PyDrive module where it would usually fetch the client_secret and client_id from the credentials.json file, and have this function fetch the client_secret and client_id from Vault instead. With this, I could then delete client_id and client_secret from credentials.json (for security reasons on the shared machine) and just fetch them and store them in memory when the Python script executes.
The problem I am having is this. I can delete the client_secret and client_id from the credentials.json file and hard-code them in the PyDrive client.py, in the class OAuth2Credentials(Credentials), under the function from_json(), where it seems to be fetching the secrets from the JSON file. So instead of this inside that function:
data['access_token'],
data['client_id'],
data['client_secret'],
data['refresh_token'],
I could instead do this (I have tried this and it works):
data['access_token'],
"myhardcodedclient_id.apps.googleusercontent.com",
"myhardcodedclient_secret",
data['refresh_token'],
And then I could (I haven't tried it yet) replace those hard-coded values with functions that fetch the secrets from Vault (HashiCorp Vault). Example:
data['access_token'],
get_client_id(),
get_client_secret(),
data['refresh_token'],
The problem, however, is that when the script runs, it does use the hardcoded values, but it also overwrites the existing credentials.json file (with no secrets, because I manually deleted them) and inserts the hardcoded values into that JSON file, which then ruins the whole idea behind using Vault (not wanting to expose the client secret/ID to other users on the remote machine).
Am I overcomplicating this? I would post the PyDrive code, but there are 5,000+ lines in the client.py file alone, so I'm sure it would be spam, and this isn't an issue with my own script (as that works perfectly as expected). If anyone has had experience doing something similar, please help! Thank you!
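For illustration, a minimal sketch of what the get_client_id()/get_client_secret() helpers mentioned above could look like, assuming the hvac client with token auth and a KV v2 secrets engine (the Vault address, secret path and field names are hypothetical):

import os
import hvac  # HashiCorp Vault client

# Hypothetical Vault address and secret path; adjust to your setup.
VAULT_ADDR = 'https://vault.example.com:8200'
SECRET_PATH = 'pydrive/oauth'

def _read_secret(field):
    client = hvac.Client(url=VAULT_ADDR, token=os.environ['VAULT_TOKEN'])
    secret = client.secrets.kv.v2.read_secret_version(path=SECRET_PATH)
    return secret['data']['data'][field]

def get_client_id():
    return _read_secret('client_id')

def get_client_secret():
    return _read_secret('client_secret')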
I am using Google Colab and I would like to use my custom libraries/scripts that I have stored on my local machine. My current approach is the following:
# (Question 1)
from google.colab import drive
drive.mount("/content/gdrive")
# Annoying chain of granting access to Google Colab
# and entering the OAuth token.
And then I use:
# (Question 2)
!cp /content/gdrive/My\ Drive/awesome-project/*.py .
Question 1:
Is there a way to avoid mounting the drive entirely? Whenever the execution context changes (e.g. when I select "Hardware Acceleration = GPU", or when I wait an hour), I have to re-generate and re-enter the OAuth token.
Question 2:
Is there a way to sync files between my local machine and my Google Colab scripts more elegantly?
A partial (not very satisfying) answer regarding Question 1: I saw that one could install and use Dropbox. Then you can hardcode the API key into the application and mounting is done, regardless of whether or not it is a new execution context. I wonder if a similar approach exists based on Google Drive as well.
Question 1.
Great question, and yes there is. I have been using this workaround, which is particularly useful if you are a researcher and want others to be able to re-run your code, or just 'colab'orate when working with larger datasets. The method below has worked well when working as a team, since there are challenges when each person has their own version of the datasets.
I have used this regularly on 30+ GB of image files downloaded and unzipped to the Colab runtime.
The file ID is in the link provided when you share a file from Google Drive.
You can also select multiple files, share them all, and then generate, for example, a .txt or .json file which you can parse to extract the file IDs.
from google_drive_downloader import GoogleDriveDownloader as gdd

# Some file id / list of file ids parsed from the file urls.
google_fid_id = '1-4PbytN2awBviPS4Brrb4puhzFb555g2'
destination = 'dir/dir/fid'
# If it is a zip file, add the kwarg unzip=True.
gdd.download_file_from_google_drive(file_id=google_fid_id,
                                    dest_path=destination,
                                    unzip=True)
A url parsing function to get file ids from a list of urls might look like this:
def parse_urls():
    with open('/dir/dir/files_urls.txt', 'r') as fb:
        txt = fb.readlines()
    return [url.split('/')[-2] for url in txt[0].split(',')]
One health warning is that you can only repeat this a small number of times in a 24 hour window for the same files.
Here's the gdd git repo:
https://github.com/ndrplz/google-drive-downloader
Here is a working example (my own) of how it works inside a bigger script:
https://github.com/fdsig/image_utils
Question 2.
You can connect to a local runtime, but this also means using local resources (GPU/CPU etc.).
Really hope this helps :-).
F~
If your code isn't secret, you can use git to sync your local code to GitHub. Then git clone it into Colab with no need for any authentication.
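A minimal sketch of that workflow inside a Colab cell (the repository URL is just a placeholder):

# Clone a public repo into the Colab runtime (URL is illustrative).
!git clone https://github.com/your-user/awesome-project.git

# Make the scripts importable from the notebook.
import sys
sys.path.append('/content/awesome-project')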
I have problems with authentication in the Python library for the Google Cloud API.
At first it worked for some days without problems, but suddenly the API calls are not showing up in the API overview of the Google Cloud Platform.
I created a service account and stored the json file locally. Then I set the environment variable GCLOUD_PROJECT to the project ID and GOOGLE_APPLICATION_CREDENTIALS to the path of the json file.
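(For reference, the same variables can also be set from inside the script before the client is created; the values below are purely illustrative:)

import os

# Illustrative values; use the absolute path to your own key file.
os.environ['GCLOUD_PROJECT'] = 'my-project-id'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'C:\keys\service-account.json'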
from google.cloud import speech
client = speech.Client()
print(client._credentials.service_account_email)
prints the correct service account email.
The following code transcribes the audio_file successfully, but the Dashboard for my Google Cloud project doesn't show anything for the activated Speech API Graph.
import io

with io.open(audio_file, 'rb') as f:
    audio = client.sample(f.read(), source_uri=None, sample_rate=48000,
                          encoding=speech.encoding.Encoding.FLAC)

alternatives = audio.sync_recognize(language_code='de-DE')
At some point the code also ran into some errors regarding the usage limit. I guess that, due to the unsuccessful authentication, the free/limited option is being used somehow.
I also tried the alternative option for authentication by installing the Google Cloud SDK and gcloud auth application-default login, but without success.
I have no idea where to start troubleshooting the problem.
Any help is appreciated!
(My system is running Windows 7 with Anaconda)
EDIT:
The error count (Fehler) is increasing with calls to the API. How can I get detailed information about the error?!
Make sure you are using an absolute path when setting the GOOGLE_APPLICATION_CREDENTIALS environment variable. Also, you might want to try inspecting the access token using OAuth2 tokeninfo and make sure it has "scope": "https://www.googleapis.com/auth/cloud-platform" in its response.
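A rough sketch of such a check, assuming the google-auth and requests packages are available (the key-file path is a placeholder):

import requests
from google.oauth2 import service_account
from google.auth.transport.requests import Request

# Load the service account key (placeholder path) with the cloud-platform scope.
credentials = service_account.Credentials.from_service_account_file(
    'service-account.json',
    scopes=['https://www.googleapis.com/auth/cloud-platform'])
credentials.refresh(Request())

# Ask the tokeninfo endpoint what the token is actually scoped to.
info = requests.get(
    'https://www.googleapis.com/oauth2/v3/tokeninfo',
    params={'access_token': credentials.token}).json()
print(info.get('scope'))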
Sometimes you will get different error information if you initialize the client with GRPC enabled:
0.24.0:
speech_client = speech.Client(_use_grpc=True)
0.23.0:
speech_client = speech.Client(use_gax=True)
Usually it's an encoding issue. Can you try with the sample audio, or try generating LINEAR16 samples using something like the Unix rec tool:
rec --channels=1 --bits=16 --rate=44100 audio.wav trim 0 5
...
with io.open(speech_file, 'rb') as audio_file:
    content = audio_file.read()
    audio_sample = speech_client.sample(
        content,
        source_uri=None,
        encoding='LINEAR16',
        sample_rate=44100)
Other notes:
Sync Recognize is limited to 60 seconds of audio; you must use async recognition for longer audio
If you haven't already, set up billing for your account
With regard to the usage problem, the issue is in fact that when you use the new google-cloud library to access the ML APIs, it seems everyone authenticates against a project shared by everyone (hence it says you've used up your limit even though you've not used anything). To check and confirm this, you can call an ML API that you have not enabled using the Python client library, and it will give you a result even though it shouldn't. This problem also occurs with the other language client libraries and operating systems, so I suspect it's an issue with their gRPC layer.
Because of this, to ensure consistency I always use the older googleapiclient that uses my API key. Here is an example to use the translate API:
from googleapiclient import discovery
service = discovery.build('translate', 'v2', developerKey='')
service_request = service.translations().list(q='hello world', target='zh')
result = service_request.execute()
print(result)
For the speech API, it's something along the lines of:
from googleapiclient import discovery
service = discovery.build('speech', 'v1beta1', developerKey='')
# syncrecognize expects a request body with 'config' (encoding, sample rate,
# language) and 'audio' (base64 content or a gs:// uri) fields.
service_request = service.speech().syncrecognize(body={'config': {...}, 'audio': {...}})
result = service_request.execute()
print(result)
You can get the list of the discovery APIs at https://developers.google.com/api-client-library/python/apis/, with the speech one documented at https://developers.google.com/resources/api-libraries/documentation/speech/v1beta1/python/latest/.
One of the other benefits of using the discovery library is that you get a lot more options compared to the current library, although oftentimes it's a bit more of a pain to implement.
I am creating a CSV that contains a report as the result of a cron job. I want to share this CSV via Google Spreadsheets; the report itself is versioned, so I would just dump the CSV contents into the same worksheet of the same spreadsheet every single time.
I have found gspread, which looked very promising but unfortunately gives me NoValidUrlKeyFound errors. The Python example for interacting with the Spreadsheets API v4 (to be found here) requires interactivity because of the OAuth flow.
Can someone point me in the right direction? Ideally I would just do:
client = spreadsheet.client(credential_file)
client.open_spreadsheet(url_or_something).open_worksheet(0).dump(csv)
print("Done")
I have found the answer in a GitHub issue:
Using gspread, follow the official guide to receive an oauth token. You will end up with a JSON file holding the credentials.
In order to access a spreadsheet with these credentials however, you need to share the spreadsheet with the account represented by them. In the JSON file there is a property client_email. Share the document with that e-mail address and voila, you'll be able to open it!
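A short sketch of the resulting flow, assuming a service-account JSON file and gspread's import_csv helper (the file names and spreadsheet key are placeholders):

import gspread
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('credentials.json', scope)
client = gspread.authorize(creds)

# Replace the contents of the first worksheet of the shared spreadsheet
# with the CSV contents.
with open('report.csv', 'r') as f:
    client.import_csv('your-spreadsheet-key', f.read())
print("Done")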
I've been through the newest docs for the GCS client library and went through the example. The sample code shows how to create a file/stream on-the-fly on GCS.
How do I resumably (i.e., in a way that allows resuming after an error) upload existing files and directories from a local directory to a GCS bucket, using the new client library? I.e., this (can't post more than 2 links, so h77ps://cloud.google.com/storage/docs/gspythonlibrary#uploading-objects) is deprecated.
Thanks all
P.S
I do not need GAE functionality - This is going to sit on-premise and upload to GCS
The Python API client can perform resumable uploads. See the documentation for examples. The important bit is:
media = MediaFileUpload('pig.png', mimetype='image/png', resumable=True)
Unfortunately, the library doesn't expose the upload ID itself, so while the upload call will resume uploads if there is an error, there's no way for your application to explicitly resume an upload. If, for instance, your application was terminated and you needed to resume the upload on restart, the library won't help you. If you need that level of retry, you'll have to use another tool or just directly invoke httplib.
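As a rough sketch of how that looks in practice with the JSON API client (the bucket and file names are placeholders, and application-default credentials are assumed):

from googleapiclient import discovery
from googleapiclient.http import MediaFileUpload

service = discovery.build('storage', 'v1')

media = MediaFileUpload('pig.png', mimetype='image/png',
                        resumable=True, chunksize=1024 * 1024)
request = service.objects().insert(bucket='my-bucket', name='pig.png',
                                   media_body=media)

# Upload in chunks; if a chunk fails, calling next_chunk() again resumes
# the same upload session.
response = None
while response is None:
    status, response = request.next_chunk()
    if status:
        print('Uploaded %d%%' % int(status.progress() * 100))
print('Upload complete')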
The Boto library accomplishes this a little differently and DOES support keeping a persistable tracking token, in case your app crashes and needs to resume. Here's a quick example, stolen from Chromium's system tests:
from boto.gs.resumable_upload_handler import ResumableUploadHandler

# Set up other stuff normally (bucket connection, src_file, dst_key,
# tracker_file_name), then attach a resumable upload handler to the key.
res_upload_handler = ResumableUploadHandler(
    tracker_file_name=tracker_file_name, num_retries=3)
dst_key.set_contents_from_file(src_file, res_upload_handler=res_upload_handler)
Since you're interested in the new hotness, the latest and greatest Python library for accessing Google Cloud Storage is probably APITools, which also provides recoverable, resumable uploads and has examples.