Automatically retrieving large files via public HTTP into Google Cloud Storage - python

For weather processing purposes, I would like to automatically retrieve daily weather forecast data into Google Cloud Storage.
The files are available at a public HTTP URL (http://dcpc-nwp.meteo.fr/openwis-user-portal/srv/en/main.home), but they are very large (between 30 and 300 megabytes). The file size is the main issue.
After looking at previous Stack Overflow topics, I have tried two unsuccessful methods:
1/ First attempt via urlfetch in Google App Engine
from google.appengine.api import urlfetch
url = "http://dcpc-nwp.meteo.fr/servic..."
result = urlfetch.fetch(url)
[...] # Code to save in a Google Cloud Storage bucket
But I get the following error message on the urlfetch line:
DeadlineExceededError: Deadline exceeded while waiting for HTTP response from URL
2/ Second attempt via the Cloud Storage Transfer Service
According to the documentation, it is possible to retrieve HTTP data into Cloud Storage directly via the Cloud Storage Transfer Service:
https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#httpdata
But it requires the size and MD5 of each file before the download. This option cannot work in my case because the website does not provide that information.
3/ Any ideas?
Do you see any solution to automatically retrieve large files over HTTP into my Cloud Storage bucket?

3/ Workaround with a Compute Engine instance
Since it was not possible to retrieve large files from an external HTTP server with App Engine or directly with Cloud Storage, I used a workaround with an always-running Compute Engine instance.
This instance regularly checks whether new weather files are available, downloads them and uploads them to a Cloud Storage bucket.
For scalability, maintenance and cost reasons, I would have preferred to use only serverless services, but fortunately:
It works well on a fresh f1-micro Compute Engine instance (no extra package required and only about $4/month if running 24/7).
Network traffic from Compute Engine to Google Cloud Storage is free if the instance and the bucket are in the same region ($0/month).

The MD5 and size of the file can be retrieved easily and quickly with a curl -I command, as mentioned in this link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests.
The Storage Transfer Service can then be configured to use that information.
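For reference, here is a minimal sketch of the same header check in Python with requests (assuming the server exposes Content-Length and, optionally, a Content-MD5 header, which many servers do not):
import requests

def get_remote_file_info(url):
    # Fetch only the HTTP headers to read the size (and the MD5 if the server provides it)
    resp = requests.head(url, allow_redirects=True, timeout=30)
    resp.raise_for_status()
    size = resp.headers.get('Content-Length')  # size in bytes, as a string
    md5 = resp.headers.get('Content-MD5')      # base64-encoded MD5, often absent
    return size, md5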
Another option would be to use a serverless Cloud Function. It could look something like the code below in Python.
import requests

def download_url_file(url):
    try:
        print('[ INFO ] Downloading {}'.format(url))
        req = requests.get(url)
        if req.status_code == 200:
            # Save the payload under /tmp, the only writable directory in Cloud Functions
            output_filepath = '/tmp/{}'.format(url.split('/')[-1])
            output_filename = url.split('/')[-1]
            open(output_filepath, 'wb').write(req.content)
            print('[ INFO ] Successfully downloaded to output_filepath: {} & output_filename: {}'.format(output_filepath, output_filename))
            return output_filename
        else:
            print('[ ERROR ] Status Code: {}'.format(req.status_code))
            return None
    except Exception as e:
        print('[ ERROR ] {}'.format(e))
        return None
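The file saved under /tmp then still has to be copied into the bucket. A minimal sketch of that step, where the bucket name and blob name are placeholders rather than values from the question:
from google.cloud import storage

def upload_to_bucket(local_path, bucket_name, blob_name):
    # Upload a local file (e.g. the one saved in /tmp above) to a Cloud Storage bucket
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return 'gs://{}/{}'.format(bucket_name, blob_name)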

Currently, the MD5 and size are required for Google's Transfer Service; we understand that in cases like yours, this can be difficult to work with, but unfortunately we don't have a great solution today.
Unless you're able to get the size and MD5 by downloading the files yourself (temporarily), I think that's the best you can do.

Related

Python Data Ingestion (start with an API "GET" call to an AWS S3 bucket): how to manage the username/password/api-key and a token that expires in a short time window

The data source is a SaaS server's API endpoints; the aim is to use Python to move the data into an AWS S3 bucket (with the boto3 library).
Access to the API is granted via an authorized username/password combination and a unique api-key.
Then, every time I call the API, I first need to get a token for the further data fetch.
I have 2 questions:
How should I manage those secrets: save them to a config file (*.ini, *.json, *.yaml) or store them in AWS Secrets Manager?
The token is a bit challenging: the basic way is that for each endpoint I fetch a new token and then make the API call.
That ends up in far too many pipelines (for example, if 100 endpoints are needed for downstream business purposes), i.e. crafting a universal template and repeating it 100 times.
I am new to the Python programming world, so feel free to comment and share any use case.
Much appreciated!
I searched and read these show-cases:
Saving from API to S3 bucket (saving-from-api-to-s3-bucket/74648533)
How to write a file or data to an S3 object using boto3
I found this helpful:
python-decouple summary: store parameters in .ini or .env files.
A few options for managing (hiding) sensitive info:
a. IAM role
b. Store secrets using **Parameter Store**
c. Store secrets using **Secrets Manager** - the method currently recommended by AWS
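As an illustration of option c, a hedged sketch of reading the API credentials from Secrets Manager and writing one endpoint's response to S3 with boto3; the secret name, token endpoint and field names are placeholders, not values from your SaaS provider:
import json
import boto3
import requests

def load_api_secrets(secret_name='my-saas-api', region='us-east-1'):
    # Read username/password/api-key from AWS Secrets Manager (placeholder secret name)
    client = boto3.client('secretsmanager', region_name=region)
    return json.loads(client.get_secret_value(SecretId=secret_name)['SecretString'])

def fetch_and_store(endpoint_url, bucket, key):
    # Get a short-lived token, call one endpoint, and write the raw response to S3
    secrets = load_api_secrets()
    token = requests.post('https://api.example.com/auth/token',  # placeholder token endpoint
                          auth=(secrets['username'], secrets['password']),
                          headers={'x-api-key': secrets['api_key']}).json()['access_token']
    data = requests.get(endpoint_url, headers={'Authorization': 'Bearer ' + token}).content
    boto3.client('s3').put_object(Bucket=bucket, Key=key, Body=data)
Because the token is requested inside the function, the same function can simply be looped over a list of endpoint URLs instead of duplicating a pipeline per endpoint.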

How to upload from Digital Ocean Storage to Google Cloud Storage, directly, programmatically, without rclone

I want to migrate files from Digital Ocean Storage into Google Cloud Storage programmatically, without rclone.
I know the exact location of the file that resides in Digital Ocean Storage (DOS), and I have the signed URL for Google Cloud Storage (GCS).
How can I modify the following code so I can copy the DOS file directly into GCS without an intermediate download to my computer?
from google.cloud import storage

def upload_to_gcs_bucket(blob_name, path_to_file, bucket_name):
    """Upload data to a bucket"""
    # Explicitly use service account credentials by specifying the private key file.
    storage_client = storage.Client.from_service_account_json('creds.json')
    # print(list(storage_client.list_buckets()))
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(path_to_file)
    # returns a public url
    return blob.public_url
Google's Storage Transfer Service should be an answer for this type of problem (particularly because DigitalOcean Spaces, like most providers, is S3-compatible). But (!) I think (I'm unfamiliar with it and unsure) it can't be used for this configuration.
There is no way to transfer files from a source to a destination without some form of intermediate transfer, but what you can do is use memory rather than file storage as the intermediary. Memory is generally more constrained than file storage, and if you wish to run multiple transfers concurrently, each will consume some amount of memory.
It's curious that you're using signed URLs. Generally signed URLs are provided by a third party to limit access to third-party buckets. If you own the destination bucket, it will be easier to use Google Cloud Storage buckets directly with one of Google's client libraries, such as the Python client library.
The Python examples include uploading from a file and from memory. Streaming the files into Cloud Storage will likely be best if you'd prefer not to create intermediate files. A sketch of that approach follows below.
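A hedged sketch of that in-memory approach; the Spaces endpoint, credentials and names are placeholders. Since Spaces is S3-compatible, boto3 can read the object while the Google client uploads the bytes without touching local disk:
import io
import boto3
from google.cloud import storage

def copy_spaces_object_to_gcs(spaces_bucket, spaces_key, gcs_bucket_name, gcs_blob_name):
    # Read the object from DigitalOcean Spaces into memory (S3-compatible API)
    spaces = boto3.client('s3',
                          endpoint_url='https://nyc3.digitaloceanspaces.com',  # placeholder region endpoint
                          aws_access_key_id='SPACES_KEY',                      # placeholder credentials
                          aws_secret_access_key='SPACES_SECRET')
    body = spaces.get_object(Bucket=spaces_bucket, Key=spaces_key)['Body'].read()

    # Upload the bytes straight to Google Cloud Storage, no local file involved
    gcs_client = storage.Client.from_service_account_json('creds.json')
    blob = gcs_client.bucket(gcs_bucket_name).blob(gcs_blob_name)
    blob.upload_from_file(io.BytesIO(body), size=len(body))
    return blob.public_url
Keep the caveat above in mind: each concurrent copy holds the whole object in memory.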

Deleting a very big folder in Google Cloud Storage

I have a very big folder in Google Cloud Storage and I am currently deleting it with the following Django/Python code, while using Google App Engine within a 30-second default HTTP timeout.
import logging

def deleteStorageFolder(bucketName, folder):
    from google.cloud import storage
    cloudStorageClient = storage.Client()
    bucket = cloudStorageClient.bucket(bucketName)
    logging.info("Deleting : " + folder)
    try:
        bucket.delete_blobs(blobs=bucket.list_blobs(prefix=folder))
    except Exception as e:
        logging.info(str(e))  # Exception objects have no .message attribute in Python 3
It is really unbelievable that Google Cloud is expecting the application to request the information for the objects inside the folder one by one and then delete them one by one.
Obviously, this fails due to the timeout. What would be the best strategy here?
(There should be a way that we delete the parent object in the bucket, it should delete all the associated child objects somewhere in the background and we remove the associated data from our model. Then Google Storage is free to delete the data whenever it wants. Yet, per my understanding, this is not how things are implemented)
2 simple options in my mind until the client library supports deleting in batch - see https://issuetracker.google.com/issues/142641783:
if the GAE image includes the gsutil CLI, you could execute gsutil -m rm ... in a subprocess
my favorite: use the gcsfs library instead of the Google one. It supports batch-deleting by default - see https://gcsfs.readthedocs.io/en/latest/_modules/gcsfs/core.html#GCSFileSystem.rm
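A short sketch of the gcsfs option (bucket name and folder prefix are placeholders); rm with recursive=True issues the deletions in batches for you:
import gcsfs

def delete_folder(bucket_name, folder):
    # Batch-delete every object under the given prefix
    fs = gcsfs.GCSFileSystem()
    fs.rm('{}/{}'.format(bucket_name, folder), recursive=True)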
There is a workaround. You can do this in 2 steps:
"Move" the files to delete into another bucket with a Transfer
Create a transfer from your bucket, with the filters that you want, to another bucket (create a temporary one if needed). Check the "delete from source after transfer" checkbox.
After the successful transfer, delete the temporary bucket. If that is too slow, you have another workaround:
Go to the bucket page
Click on Lifecycle
Set up a lifecycle rule that deletes files with age > 0 days (a scripted version is sketched after this answer)
In both cases, you rely on a Google Cloud batch feature, because doing it by yourself is far, far too slow!
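If you prefer to script the lifecycle workaround rather than click through the console, here is a hedged sketch with the Python client; it marks every object in the bucket for deletion, so only apply it to the temporary bucket:
from google.cloud import storage

def empty_bucket_via_lifecycle(bucket_name):
    # Attach an "age > 0 days" delete rule; Cloud Storage then removes the objects in the background
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    bucket.add_lifecycle_delete_rule(age=0)
    bucket.patch()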

Alternative to Cloud Functions for Long Duration Tasks on Google

I have been using Google Cloud Functions (GCF) to setup a serverless environment. This works fine and it covers most of the required functionality that I need.
However, for one specific module, extracting data from FTP servers, parsing the files from one provider takes longer than 540s. For this reason, the task gets timed out when deployed as a Cloud Function.
In addition, some FTP servers require whitelisting the IP address that makes these requests. With Cloud Functions this is not possible, unless you somehow reserve a static address or range.
I am therefore looking for an alternative solution to execute a Python script in the cloud on the Google platform. The requirements are:
It needs to support Python 3.7
It has to have the possibility to associate a static IP address to it
One execution should be able to take longer than 540s
Ideally, it should be possible to easily deploy the script (as it is the case with GCF)
What is the best option out there for this kind of need?
The notion of a Cloud Function is primarily that of a microservice ... something that runs for a relatively short period of time. In your story, we seem to have actions that can run for an extended period of time. This would seem to lend itself to the notion of running some form of compute engine. The two that immediately come to mind are Google Compute Engine (CE) and Google Kubernetes Engine (GKE).
Let us think about the Compute Engine. Think of this as a Linux VM where you have 100% control over it. This needn't be a heavyweight thing ... Google provides micro compute engines which are pretty darn tiny. You can have one or more of these, including the ability to dynamically scale out the number of instances if load on the set becomes too high.
On your compute engine, you can create any environment you wish ... including installing a Python environment and running Flask (or other) to process incoming requests. You can associate your compute engine with a static IP address, or associate a static IP address with a load balancer front-ending your engines.
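As a sketch of the "running Flask (or other) to process incoming requests" part of that setup (the route name and handler body are illustrative placeholders, not a prescribed layout):
from flask import Flask

app = Flask(__name__)

@app.route('/run', methods=['POST'])
def run_long_extraction():
    # Kick off the long-running FTP extraction; on a VM there is no 540 s limit
    # ... connect to the FTP server, parse the files, upload the results to Cloud Storage ...
    return 'started', 202

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)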
Here is how I download files from FTP with Google Cloud Functions to Google Cloud Storage. It takes less than 30 secs (depending on the file size).
# import libraries
from google.cloud import storage
import wget

def importFile(request):
    # set storage client
    client = storage.Client()
    # get bucket
    bucket = client.get_bucket('BUCKET-NAME')  # without gs://
    blob = bucket.blob('file-name.csv')
    # see if the file already exists
    if blob.exists() == False:
        # copy file to Google Storage
        try:
            link = 'ftp://account:password@ftp.domain.com/folder/file.csv'  # for non-public ftp files
            ftpfile = wget.download(link, out='/tmp/destination-file-name.csv')  # save downloaded file in the /tmp folder of Cloud Functions
            blob.upload_from_filename(ftpfile)
            print('Copied file to Google Storage!')
        # print error if the file doesn't exist
        except BaseException as error:
            print('An exception occurred: {}'.format(error))
    # print error if the file already exists in Google Storage
    else:
        print('File already exists in Google Storage')

Size Limit On Download From Cloud Storage With App Engine

tldr: Is there a file size limit to send a file from Cloud Storage to my user's web browser as a download? Am I using the Storage Python API wrong, or do I need to increase the resources set by my App Engine YAML file?
This is just an issue with downloads. Uploads work great up to any file size, using chunking.
The symptoms
I created a file transfer application with App Engine Python 3.7 Standard environment. Users are able to upload files of any size, and that is working well. But users are running into what appears to be a size limit on downloading the resulting file from Cloud Storage.
The largest file I have successfully sent and received through my entire upload/download process was 29 megabytes. Then I sent myself a 55 megabyte file, but when I attempted to receive it as a download, Flask gives me this error:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Application Structure
To create my file transfer application, I used Flask to set up two Services, internal and external, each with its own Flask routing file, its own webpage/domain, and its own YAML files.
In order to test the application, I visit the internal webpage which I created. I use it to upload a file in chunks to my application, which successfully composes the chunks in Cloud Storage. I then log into Google Cloud Platform Console as an admin, and when I look at Cloud Storage, it will show me the 55 megabyte file which I uploaded. It will allow me to download it directly through the Cloud Platform Console, and the file is good.
(Up to that point of the process, this even worked for a 1.5 gigabyte file.)
Then I go to my external webpage as a non-admin user. I use the form to attempt to receive that same file as a download. I get the above error. However, this whole process encounters no errors for my 29 megabyte test file, or smaller.
Stacktrace Debugger logs for this service reveal:
logMessage: "The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)"
Possible solutions
I added the following lines to my external service YAML file:
resources:
  memory_gb: 100
  disk_size_gb: 100
The error remained the same. Apparently this is not a limit of system resources?
Perhaps I'm misusing the Python API for Cloud Storage. I'm importing storage from google.cloud. Here is where my application responds to the user's POST request by sending the user the file they requested:
@app.route('/download', methods=['POST'])
def provide_file():
    return external_download()
This part is in external_download:
storage_client = storage.Client()
bucket = storage_client.get_bucket(current_app.cloud_storage_bucket)
bucket_filename = request.form['filename']
blob = bucket.blob(bucket_filename)
return send_file(io.BytesIO(blob.download_as_string()),
                 mimetype="application/octet-stream",
                 as_attachment=True,
                 attachment_filename=bucket_filename)
Do I need to implement chunking for downloads, not just uploads?
I wouldn't recommend using Flask's send_file() method for managing large file transfers; Flask's file handling methods were intended for use by developers or APIs, mostly to exchange system messages such as logs, cookies and other light objects.
Besides, the download_as_string() method might indeed conceal a buffer limitation. I did reproduce your scenario and got the same error message with files larger than 30 MB; however, I couldn't find more information about such a constraint. It might be intentional, induced by the method's purpose (download content as a string, not fitted for large objects).
Proven ways to handle file transfer efficiently with Cloud Storage and Python:
Use methods of the Cloud Storage API directly, to download and upload objects without using Flask. As @FridayPush mentioned, it will offload your application, and you can control access with signed URLs (see the sketch after this list).
Use the Blobstore API, a straightforward, lightweight, and simple solution to handle file transfer, fully integrated with GCS buckets and intended for this type of situation.
Use the Python Requests module, which requires you to create your own handlers to communicate with GCS.
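For the first option, a minimal sketch that lets the user's browser fetch the object straight from Cloud Storage through a short-lived V4 signed URL, instead of proxying the bytes through Flask (bucket and object names are placeholders):
import datetime
from google.cloud import storage

def signed_download_url(bucket_name, blob_name):
    # Return a 15-minute V4 signed URL the browser can download directly
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return blob.generate_signed_url(version='v4',
                                    expiration=datetime.timedelta(minutes=15),
                                    method='GET')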
