Deleting a very big folder in Google Cloud Storage

Deleting a very big folder in Google Cloud Storage - python

I have a very big folder in Google Cloud Storage and I am currently deleting the folder with the following django - python code while using Google App Engine within a 30 seconds default http timeout.
def deleteStorageFolder(bucketName, folder):
from google.cloud import storage
cloudStorageClient = storage.Client()
bucket = cloudStorageClient.bucket(bucketName)
logging.info("Deleting : " + folder)
try:
bucket.delete_blobs(blobs=bucket.list_blobs(prefix=folder))
except Exception as e:
logging.info(str(e.message))
It is really unbelievable that Google Cloud is expecting the application to request the information for the objects inside the folder one by one and then delete them one by one.
Obviously, this fails due to the timeout. What would be the best strategy here ?
(There should be a way that we delete the parent object in the bucket, it should delete all the associated child objects somewhere in the background and we remove the associated data from our model. Then Google Storage is free to delete the data whenever it wants. Yet, per my understanding, this is not how things are implemented)

2 simple options in my mind until the client library supports deleting in batch - see https://issuetracker.google.com/issues/142641783 :
if the GAE image includes the gsutil cli, you could execute gsutil -m rm ... in a subprocess
my favorite, use gcsfs library instead of the G library. It supports batch-deleting by default - see https://gcsfs.readthedocs.io/en/latest/_modules/gcsfs/core.html#GCSFileSystem.rm

There is a workaround. You can do this in 2 steps
"Move" your file to delete into another bucket with Transfert
Create a transfert from your bucket, with the filters that you want to another bucket (create a temporary one if needed). Check "delete from source after transfer" checkbox
After the successful transfer, delete the temporary bucket. If it's too long, you have another workaround.
Go to bucket page
Click on lifecycle
Set up a lifecycle where you delete file with age > 0 day
In both cases, you rely on Google Cloud batch feature because by yourselves is too, too, too long!

Related

S3 to Google Storage using transfer service

In AWS I have folder format like eg : Bucketname/Data/files/abc_01-02-2022.csv
In a increment order I have files for each dates for all the months in year.
In Google Cloud Storage I am trying to create folder structure like eg:Bucketname/data/202202/files/abc_01-02-2022.csv for whole year
So, I am trying to use storage transfer service which will take dynamically or from object itself and create a folder structure automatically by getting trigger automatically 2nd of the month.
Can we achieve this by using transfer service.
what is the best way to achieve this I am trying to make it simple as possible

Storage Transfer Service does not support destination object prefixes, the reason behind it is, Storage Transfer Service doesn’t support remapping, that is, you cannot copy the path Bucketname/Data/files/ to Bucketname/data/202202/files
My recommendation would be to first use the Storage Transfer Service to copy everything from one bucket to another and later use any of the available methods to rename the object in the new bucket to Bucketname/data/202202/files.
Also the Cloud Storage Objects are flat namespaces, that is, Cloud Storage does not have folders and sub folders. There are a few documents that you can refer to for more information on this Object name considerations and Folders

This is possible via using STS API. You can specify "path" at the destination bucket.

Alternative to Cloud Functions for Long Duration Tasks on Google

I have been using Google Cloud Functions (GCF) to setup a serverless environment. This works fine and it covers most of the required functionality that I need.
However, for one specific module, extracting data from FTP servers, the duration of parsing the files from a provider takes longer than 540s. For this reason, the task that I execute gets timed out when deploying it as a cloud function.
In addition, some FTP servers require that they whitelist an ip address that is making these requests. When using Cloud functions, unless you reserve somehow a static address or a range, this is not possible.
I am therefore looking for an alternative solution to execute a Python script in the cloud on the Google platform. The requirements are:
It needs to support Python 3.7
It has to have the possibility to associate a static IP address to it
One execution should be able to take longer than 540s
Ideally, it should be possible to easily deploy the script (as it is the case with GCF)
What is the best option out there for these kind of needs?

The notion of a Cloud Function is primarily that of a Microservice ... something that runs for a relatively short period of time. In your story, we seem to have actions that can run for an extended period of time. This would seem to lend itself to the notion of running some form of compute engine. The two that immediately come to mind are Google Compute Engine (CE) and Google Kubernetes Engine (GKE). Let us think about the Compute Engine. Think of this as a Linux VM where you have 100% control over it. This needn't be a heavyweight thing ... Google provides micro compute engines which are pretty darn tiny. You can have one or more of these including the ability to dynamically scale out the number of instances if load on the set becomes too high. On your compute engine, you can create any environment you wish ... including installing a Python environment and running Flask (or other) to process incoming requests. You can associate your compute engine with a static IP address or associate a static IP address with a load balancer front-ending your engines.

Here is how I download files from FTP with Google Cloud Functions to Google Cloud Storage. It takes less than 30 secs (depending on the file size).
#import libraries
from google.cloud import storage
import wget
def importFile(request):
#set storage client
client = storage.Client()
# get bucket
bucket = client.get_bucket('BUCKET-NAME') #without gs://
blob = bucket.blob('file-name.csv')
#See if file already exists
if blob.exists() == False:
#copy file to google storage
try:
link = 'ftp://account:password#ftp.domain.com/folder/file.csv' #for non-public ftp files
ftpfile = wget.download(link, out='/tmp/destination-file-name.csv') #save downloaded file in /tmp folder of Cloud Functions
blob.upload_from_filename(ftpfile)
print('Copied file to Google Storage!')
#print error if file doesn't exists
except BaseException as error:
print('An exception occurred: {}'.format(error))
#print error if file already exists in Google Storage
else:
print('File already exists in Google Storage')

Using Custom Libraries in Google Colab without Mounting Drive

I am using Google Colab and I would like to use my custom libraries / scripts, that I have stored on my local machine. My current approach is the following:
# (Question 1)
from google.colab import drive
drive.mount("/content/gdrive")
# Annoying chain of granting access to Google Colab
# and entering the OAuth token.
And then I use:
# (Question 2)
!cp /content/gdrive/My\ Drive/awesome-project/*.py .
Question 1:
Is there a way to avoid the mounting of the drive entriely? Whenever the execution context changes (e.g. when I select "Hardware Acceleration = GPU", or when I wait an hour), I have to re-generate and re-enter the OAuth token.
Question 2:
Is there a way to sync files between my local machine and my Google Colab scripts more elegently?
Partial (not very satisfying answer) regarding Question 1: I saw that one could install and use Dropbox. Then you can hardcode the API Key into the application and mounting is done, regardless of whether or not it is a new execution context. I wonder if a similar approach exists based on Google Drive as well.

Question 1.
Great question and yes there is- I have been using this workaround which is particularly useful if you are a researcher and want other to be able to re run your code- or just 'colab'orate when working with larger datasets. The below method has worked well working as a team and there are challenges to each person having their own version of datasets.
I have used this regularly on 30 + Gb of image files downloaded and unzipped to colab run time.
The file id is in the link provided when you share from google drive
you can also select multiple files and select share all and then get a generate for example a .txt or .json file which you can parse and extract the file id's.
from google_drive_downloader import GoogleDriveDownloader as gdd
#some file id/ list of file ids parsed from file urls.
google_fid_id = '1-4PbytN2awBviPS4Brrb4puhzFb555g2'
destination = 'dir/dir/fid'
#if zip file ad kwarg unzip=true
gdd.download_file_from_google_drive(file_id=google_fid_id,
destination, unzip=True)
A url parsing function to get file ids from a list of urls might look like this:
def parse_urls():
with open('/dir/dir/files_urls.txt', 'r') as fb:
txt = fb.readlines()
return [url.split('/')[-2] for url in txt[0].split(',')]
One health warning is that you can only repeat this a small number of times in a 24 hour window for the same files.
Here's the gdd git repo:
https://github.com/ndrplz/google-drive-downloader
here is an working example (my own) of how it works inside bigger script:
https://github.com/fdsig/image_utils
Question 2.
You can connect to a local run time but this also means using local resources gpu/cpu etc.
Really hope this helps :-).
F~

If your code isn't secret, you can use git to sync your local codes to github. Then, git clone to Colab with no need for any authentication.

Automatically retrieving large files via public HTTP into Google Cloud Storage

For weather processing purpose, I am looking to retrieve automatically daily weather forecast data in Google Cloud Storage.
The files are available on public HTTP URL (http://dcpc-nwp.meteo.fr/openwis-user-portal/srv/en/main.home), but they are very large (between 30 and 300 Megabytes). Size of files is the main issue.
After looking at previous stackoverflow topics, I have tried two unsuccessful methods:
1/ First attempt via urlfetch in Google App Engine
from google.appengine.api import urlfetch
url = "http://dcpc-nwp.meteo.fr/servic..."
result = urlfetch.fetch(url)
[...] # Code to save in a Google Cloud Storage bucket
But I get the following error message on the urlfetch line :
DeadlineExceededError: Deadline exceeded while waiting for HTTP response from URL
2/ Second attempt via the Cloud Storage Transfert Service
According to the documentation, it is possible to retrieve HTTP Data into Cloud Storage directly via the Cloud Storage Transfert Service :
https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#httpdata
But it requires the size and md5 of the files before the download. This option cannot work in my case because the website does not provide those information.
3/ Any ideas ?
Do you see any solution to retrieve automatically large file on HTTP into my Cloud Storage bucket?

3/ Workaround with a Compute Engine instance
Since it was not possible to retrieve large files from external HTTP with App Engine or directly with Cloud Storage, I have used a workaround with an always-running Compute Engine instance.
This instance regularly checks if new weather files are available, downloads them and uploads them to a Cloud Storage bucket.
For scalability, maintenance and cost reasons, I would have prefered to use only serverless services, but hopefully :
It works well on a fresh f1-micro Compute Engine instance (no extra package required and only 4$/month if running 24/7)
The network traffic from Compute Engine to Google Cloud Storage is free if the instance and the bucket are in the same region (0$/month)

The md5 and size of the file can be retrieved easily and quickly using curl -I command as mentioned in this link https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests.
The Storage Transfer Service can then be configured to use that information.
Another option would be to use a serverless Cloud Function. It could look like something below in Python.
import requests
def download_url_file(url):
try:
print('[ INFO ] Downloading {}'.format(url))
req = requests.get(url)
if req.status_code==200:
# Download and save to /tmp
output_filepath = '/tmp/{}'.format(url.split('/')[-1])
output_filename = '{}'.format(url.split('/')[-1])
open(output_filepath, 'wb').write(req.content)
print('[ INFO ] Successfully downloaded to output_filepath: {} & output_filename: {}'.format(output_filepath, output_filename))
return output_filename
else:
print('[ ERROR ] Status Code: {}'.format(req.status_code))
except Exception as e:
print('[ ERROR ] {}'.format(e))
return output_filename

Currently, the MD5 and size are required for Google's Transfer Service; we understand that in cases like yours, this can be difficult to work with, but unfortunately we don't have a great solution today.
Unless you're able to get the size and MD5 by downloading the files yourself (temporarily), I think that's the best you can do.

Python using GCS new client library - upload objects/'directories'

I've been through the newest docs for the GCS client library and went through the example. The sample code shows how to create a file/stream on-the-fly on GCS.
How do I resumably (that allows resumes if error) upload existing files and directories from a local directory to a GCS bucket? Using the new client library. IE, this (can't post more than 2 links so h77ps://cloud.google.com/storage/docs/gspythonlibrary#uploading-objects) is deprecated.
Thanks all
P.S
I do not need GAE functionality - This is going to sit on-premise and upload to GCS

The Python API client can perform resumable uploads. See the documentation for examples. The important bit is:
media = MediaFileUpload('pig.png', mimetype='image/png', resumable=True)
Unfortunately, the library doesn't expose the upload ID itself, so while the upload call will resume uploads if there is an error, there's no way for your application to explicitly resume an upload. If, for instance, your application was terminated and you needed to resume the upload on restart, the library won't help you. If you need that level of retry, you'll have to use another tool or just directly invoke httplib.
The Boto library accomplishes this a little differently and DOES support keeping a persistable tracking token, in case your app crashes and needs to resume. Here's a quick example, stolen from Chromium's system tests:
# Set up other stuff normally
res_upload_handler = ResumableUploadHandler(
tracker_file_name=tracker_file_name, num_retries=3
dst_key.set_contents_from_file(src_file, res_upload_handler=res_upload_handler)
Since you're interested in the new hotness, the latest, greatest Python library for accessing Google Cloud Storage is probably APITools, which also provides for recoverable, resumable uploads and also has examples.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Deleting a very big folder in Google Cloud Storage - python

Related

S3 to Google Storage using transfer service

Alternative to Cloud Functions for Long Duration Tasks on Google

Using Custom Libraries in Google Colab without Mounting Drive

Automatically retrieving large files via public HTTP into Google Cloud Storage

Python using GCS new client library - upload objects/'directories'

Categories

Resources