In AWS (S3) my objects follow a layout like Bucketname/Data/files/abc_01-02-2022.csv, with one file per date for every month of the year.
In Google Cloud Storage I am trying to create a folder structure like Bucketname/data/202202/files/abc_01-02-2022.csv for the whole year.
So I am trying to use Storage Transfer Service in a way that derives the month prefix dynamically (or from the object itself) and creates the folder structure automatically, triggered on the 2nd of each month.
Can we achieve this using the transfer service?
What is the best way to achieve this? I am trying to keep it as simple as possible.
Storage Transfer Service does not support destination object prefixes. The reason is that Storage Transfer Service doesn't support remapping; that is, you cannot copy the path Bucketname/Data/files/ to Bucketname/data/202202/files.
My recommendation would be to first use Storage Transfer Service to copy everything from one bucket to the other, and then use any of the available methods to rename the objects in the new bucket under Bucketname/data/202202/files.
Also note that Cloud Storage objects live in a flat namespace; Cloud Storage does not have folders and subfolders. For more information, see the documents Object name considerations and Folders.
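For the rename step, a minimal sketch using the google-cloud-storage Python client; the bucket name, source prefix, and target prefix below are placeholders for illustration:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-destination-bucket")  # placeholder name

# Re-prefix every object copied under Data/files/ to data/202202/files/
# by copying it to the new name and deleting the original.
for blob in client.list_blobs(bucket, prefix="Data/files/"):
    new_name = blob.name.replace("Data/files/", "data/202202/files/", 1)
    bucket.copy_blob(blob, bucket, new_name)  # copy within the same bucket
    blob.delete()                             # remove the old object
```

Bucket.rename_blob does the same copy-and-delete in a single call if you prefer.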
This is possible via the STS API: you can specify a "path" on the destination bucket.
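A rough sketch of what that could look like with the google-cloud-storage-transfer Python client, under the assumption that the path field on the GCS data sink acts as a destination prefix; the project and bucket names are placeholders:

```python
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

# One-off transfer job that writes everything under the
# data/202202/files/ prefix in the destination bucket.
# (An S3 source would go in aws_s3_data_source instead of gcs_data_source.)
job = client.create_transfer_job(
    {
        "transfer_job": {
            "project_id": "my-project",  # placeholder
            "status": storage_transfer.TransferJob.Status.ENABLED,
            "transfer_spec": {
                "gcs_data_source": {"bucket_name": "source-bucket"},
                "gcs_data_sink": {
                    "bucket_name": "destination-bucket",
                    "path": "data/202202/files/",  # destination prefix
                },
            },
        }
    }
)
print(job.name)
```

You would still need a schedule (or a Cloud Scheduler trigger) to run it on the 2nd of each month, and the month prefix itself has to be computed on your side.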
I am trying to set up a pipeline in GCP/Vertex AI and am having a lot of trouble. The pipeline is being written with Kubeflow Pipelines and has many different components; one thing in particular is giving me trouble, however. Eventually I want to launch this from a Cloud Function with the help of Cloud Scheduler.
The part that is giving me issues is fairly simple, and I believe I just need some form of introduction to how I should be thinking about this setup. I simply want to read and write files (might be .csv, .txt or similar). I imagine that the analog of the filesystem on my local machine is Cloud Storage in GCP, so this is where I have been trying to read from for the time being (please correct me if I'm wrong). The component I've built is a blatant rip-off of this post and looks like this.
@component(
    packages_to_install=["google-cloud"],
    base_image="python:3.9"
)
def main():
    import csv
    from io import StringIO
    from google.cloud import storage

    BUCKET_NAME = "gs://my_bucket"
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(BUCKET_NAME)
    blob = bucket.blob('test/test.txt')
    blob = blob.download_as_string()
    blob = blob.decode('utf-8')
    blob = StringIO(blob)  # transform bytes to string here
    names = csv.reader(blob)  # then use the csv library to read the content
    for name in names:
        print(f"First Name: {name[0]}")
The error I'm getting looks like the following:
google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/gs://pipeline_dev?projection=noAcl&prettyPrint=false: Not Found
What's going wrong in my brain? I get the feeling that it shouldn't be this difficult to read and write files. I must be missing something fundamental? Any help is highly appreciated.
Try specifying the bucket name without the gs:// prefix. This should fix the issue. Here is one more Stack Overflow post that says the same thing: Cloud Storage python client fails to retrieve bucket
Any storage bucket you try to access in GCP has a unique address, and that address always starts with gs://, which marks it as a Cloud Storage URL. The GCS APIs, however, are designed to work with just the bucket name, so that is all you pass. If you were accessing the bucket via a browser you would need the complete address, and hence the gs:// prefix as well.
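For completeness, a minimal sketch of the corrected read; the bucket and object names are the same placeholders as in the question:

```python
from google.cloud import storage

BUCKET_NAME = "my_bucket"  # note: no gs:// prefix

storage_client = storage.Client()
bucket = storage_client.get_bucket(BUCKET_NAME)
blob = bucket.blob("test/test.txt")
content = blob.download_as_string().decode("utf-8")  # bytes -> str
print(content)
```

Inside the KFP component you will also most likely want packages_to_install=["google-cloud-storage"] rather than the umbrella "google-cloud" package, so that from google.cloud import storage resolves.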
I have a storage bucket in google cloud. I have a few directories which I created with files in them.
I know that if I want to cycle through all the files in all the directories, I can use the following command:
for file in list(source_bucket.list_blobs()):
    file_path = f"gs://{file.bucket.name}/{file.name}"
    print(file_path)
Is there a way to only cycle through one of the directories?
I suggest studying the Cloud Storage list API in more detail. It looks like you have only experimented with the most basic use of list_blobs(). As you can see from the linked API documentation, you can pass a prefix parameter to limit the scope of the list to some path, e.g. source_bucket.list_blobs(prefix="path"), as sketched below.
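A small sketch of that, assuming a directory named my_directory/ in a placeholder bucket:

```python
from google.cloud import storage

client = storage.Client()
source_bucket = client.bucket("my-bucket")  # placeholder name

# Only objects whose names start with "my_directory/" are returned.
for blob in source_bucket.list_blobs(prefix="my_directory/"):
    print(f"gs://{blob.bucket.name}/{blob.name}")
```

If you also want to keep the listing from descending into nested prefixes, list_blobs additionally accepts a delimiter="/" argument.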
I have a very big folder in Google Cloud Storage and I am currently deleting it with the following Django/Python code, running on Google App Engine within the default 30-second HTTP timeout.
import logging

def deleteStorageFolder(bucketName, folder):
    from google.cloud import storage
    cloudStorageClient = storage.Client()
    bucket = cloudStorageClient.bucket(bucketName)
    logging.info("Deleting : " + folder)
    try:
        bucket.delete_blobs(blobs=bucket.list_blobs(prefix=folder))
    except Exception as e:
        logging.info(str(e))
It is really unbelievable that Google Cloud expects the application to request the information for the objects inside the folder one by one and then delete them one by one.
Obviously, this fails due to the timeout. What would be the best strategy here?
(There should be a way to delete the parent object in the bucket, have it delete all the associated child objects somewhere in the background, and then remove the associated data from our model; Cloud Storage would be free to delete the data whenever it wants. Yet, per my understanding, this is not how things are implemented.)
Two simple options come to mind until the client library supports deleting in batch (see https://issuetracker.google.com/issues/142641783):
if the GAE image includes the gsutil CLI, you could execute gsutil -m rm ... in a subprocess
my favorite: use the gcsfs library instead of the google-cloud-storage library. It supports batch deleting by default (see https://gcsfs.readthedocs.io/en/latest/_modules/gcsfs/core.html#GCSFileSystem.rm). A sketch of both options follows below.
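Roughly, and assuming gsutil is available on the image for the first option and gcsfs is installed for the second (bucket and folder names are placeholders):

```python
import subprocess

import gcsfs


def delete_folder_with_gsutil(bucket_name, folder):
    # Option 1: shell out to gsutil, which parallelizes the deletes itself (-m).
    subprocess.run(
        ["gsutil", "-m", "rm", "-r", f"gs://{bucket_name}/{folder}"],
        check=True,
    )


def delete_folder_with_gcsfs(bucket_name, folder):
    # Option 2: gcsfs batches the delete requests for you.
    fs = gcsfs.GCSFileSystem()
    fs.rm(f"{bucket_name}/{folder}", recursive=True)
```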
There is a workaround. You can do this in two steps.
"Move" the files you want to delete into another bucket with Storage Transfer Service:
Create a transfer from your bucket, with the filters that you want, to another bucket (create a temporary one if needed), and check the "delete from source after transfer" checkbox.
After the transfer succeeds, delete the temporary bucket. If that is too slow, you have another workaround:
Go to the bucket page
Click on Lifecycle
Set up a lifecycle rule that deletes files with age > 0 days
In both cases, you rely on Google Cloud's batch features, because doing it by yourself is far, far too slow!
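The lifecycle rule can also be set programmatically instead of through the console; a small sketch with the google-cloud-storage client, with the temporary bucket name as a placeholder:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-temporary-bucket")  # placeholder name

# Add a lifecycle rule that deletes objects once they are older than
# 0 days, then persist the change on the bucket.
bucket.add_lifecycle_delete_rule(age=0)
bucket.patch()
```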
Overview
I have a GCP storage bucket, which has a .json file and 5 jpeg files. In the .json file the image names match the jpeg file names. I want to know a way in which I can access each of the objects in the bucket based on the image name.
Method 1 (Current Method):
Currently, a Python script is used to get the images from the storage bucket. This is done by looping through the .json file of image names, getting each individual image name, building a URL from the bucket and image name, then retrieving the image and displaying it on a Flask App Engine site.
This method requires the bucket objects to be public, which poses a security issue since the whole internet is granted access to the bucket. Secondly, it is computationally expensive, with each image having to be pulled down from the bucket separately. The bucket will eventually contain 10,000 images, which will make the images slow to load and display on the web page.
Requirement (New Method):
Is there a method by which I can pull down images from the bucket, not all of them at once, and display them on a web page? I want to be able to access individual images from the bucket and display their corresponding image data, retrieved from the .json file.
Lastly, I want to ensure that neither the bucket nor the objects are public, and that they can only be accessed via App Engine.
Thanks
It would be helpful to see the Python code that's doing the work right now. You shouldn't need the storage objects to be public. They should be able to be retrieved using the Google Cloud Storage (GCS) API and a service account token that has view-only permissions on storage (although depending on whether or not you know the object names and need to get the bucket name, it might require more permissions on the service account).
As for the performance, you could either do things on the front end to be smart about how many you're showing and fetch only what you want to display as the user scrolls, or you could paginate your results from the GCS bucket.
Links to the service account and API pieces here:
https://cloud.google.com/iam/docs/service-accounts
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python
Information about pagination for retrieving GCS objects here:
How does paging work in the list_blobs function in Google Cloud Storage Python Client Library
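A minimal sketch of both ideas together, serving images privately and paging through the bucket listing; the bucket name and page size are placeholders, and the client is assumed to run under an App Engine service account that has read access to the bucket:

```python
from google.cloud import storage

client = storage.Client()  # uses the App Engine service account credentials
bucket = client.bucket("my-image-bucket")  # placeholder name


def get_image_bytes(image_name):
    # Fetch a single image by name without making the bucket public.
    return bucket.blob(image_name).download_as_bytes()


def list_image_names(page_size=50, page_token=None):
    # Page through the bucket instead of listing all 10,000 objects at once.
    iterator = client.list_blobs(bucket, max_results=page_size, page_token=page_token)
    names = [blob.name for blob in iterator]
    return names, iterator.next_page_token
```

The returned next_page_token can be passed back in on the next request to fetch the following page.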
I have to write a Python script with which I should be able to list protected buckets owned by other people who are ready to grant me access. For this I created a project and enabled Google Cloud Storage in my Google API console, then installed gsutil and stored my credentials in the '.boto' file. Now I have to list the metadata of all buckets where I have access. My question is: what do I or the other bucket owners have to do to grant access to me/my project so that my script can list the metadata of all buckets and the objects inside them?
I'm following this doc for Python scripting: https://developers.google.com/storage/docs/gspythonlibrary
You can only list all buckets you have access to within a project.
There is no way to list all buckets which you might have access to. The list of buckets you have access to would be extremely large, because it would include all buckets marked as public.
I would suggest that the people who give you access to their buckets tell you their bucket names explicitly.
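As a sketch of that approach, using the modern google-cloud-storage client rather than the older boto-based library from the linked doc, and with the shared bucket names hard-coded as placeholders:

```python
from google.cloud import storage

client = storage.Client()

# Bucket names shared with you by their owners (placeholders).
shared_bucket_names = ["friends-bucket-1", "friends-bucket-2"]

for name in shared_bucket_names:
    # get_bucket requires the storage.buckets.get permission on each bucket.
    bucket = client.get_bucket(name)
    print(bucket.name, bucket.location, bucket.storage_class, bucket.time_created)
```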