I am trying to set up a pipeline in GCP/Vertex AI and am having a lot of trouble. The pipeline is being written using Kubeflow Pipelines and has many different components, one thing in particular is giving me trouble however. Eventually I want to launch this from a Cloud Function with the help of the Cloud Scheduler.
The part that is giving me issues is fairly simple and I believe I just need some form of introduction to how I should be thinking about this setup. I simply want to read and write from files (might be .csv, .txt or similar). I imagine that the analog to the filesystem on my local machine in GCP is the Cloud Storage so this is where I have been trying to read from for the time being (please correct me if I'm wrong). The component I've built is a blatant rip-off of this post and looks like this.
@component(
    packages_to_install=["google-cloud"],
    base_image="python:3.9"
)
def main():
    import csv
    from io import StringIO
    from google.cloud import storage

    BUCKET_NAME = "gs://my_bucket"
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(BUCKET_NAME)
    blob = bucket.blob('test/test.txt')
    blob = blob.download_as_string()
    blob = blob.decode('utf-8')
    blob = StringIO(blob)  # transform bytes to string here
    names = csv.reader(blob)  # then use csv library to read the content
    for name in names:
        print(f"First Name: {name[0]}")
The error I'm getting looks like the following:
google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/gs://pipeline_dev?projection=noAcl&prettyPrint=false: Not Found
What's going wrong in my brain? I get the feeling that it shouldn't be this difficult to read and write files, so I must be missing something fundamental. Any help is highly appreciated.
Try specifying the bucket name without the gs:// prefix. That should fix the issue. Here is one more Stack Overflow post that says the same thing: Cloud Storage python client fails to retrieve bucket
Every storage bucket you access in GCP has a unique address, and that address always starts with gs://, which marks it as a Cloud Storage URL. The GCS APIs, however, are designed to work with just the bucket name, so that is all you pass to the client. If you were accessing the bucket via the browser you would need the complete address, and hence the gs:// prefix as well.
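In other words, the lookup from the question should work once the bucket string is the bare name. A minimal, untested sketch using the same placeholder names as above:

from google.cloud import storage

BUCKET_NAME = "my_bucket"  # bare bucket name, no gs:// prefix

storage_client = storage.Client()
bucket = storage_client.get_bucket(BUCKET_NAME)
blob = bucket.blob('test/test.txt')
content = blob.download_as_string().decode('utf-8')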
Related
I wish to download a file (https://covid.ourworldindata.org/data/owid-covid-data.csv) from the internet using Python and Google Cloud.
Currently, I have this code.
import os
import wget
from google.cloud import storage
url = os.environ['URL']
bucket_name = os.environ['BUCKET'] #without gs://
file_name = os.environ['FILE_NAME']
cf_path = '/tmp/{}'.format(file_name)
def import_file(event, context):
    # set storage client
    client = storage.Client()
    # get bucket
    bucket = client.get_bucket(bucket_name)
    # download the file to Cloud Function's tmp directory
    wget.download(url, cf_path)
    # set Blob
    blob = storage.Blob(file_name, bucket)
    # upload the file to GCS
    blob.upload_from_filename(cf_path)
    print("""This Function was triggered by messageId {} published at {}""".format(context.event_id, context.timestamp))
While this code works beautifully, the Covid-19 data is updated daily with new dates added (so if I access the file on 3/7, it includes data up to 3/6). Instead of re-writing the whole file every time, I would like to extract only the newly added rows into Google Storage each time the function runs on schedule, rather than overwriting the file that was already saved.
I'm fairly weak at programming and would appreciate the help.
While the file is in CSV format, there's also a JSON link (https://covid.ourworldindata.org/data/owid-covid-data.json) if that would make it easier to code.
I can figure out the part about storing it in Cloud Storage, but I need help with the code to extract only the most recently updated rows/data.
The usual best practice is to load the data into BigQuery every day and to partition the table by ingestion date.
Then you can run a query (or create a view) that selects only the most recent data, using a window function (the PARTITION BY ... OVER syntax) to deduplicate.
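A rough sketch of that approach with the google-cloud-bigquery client. The project, dataset, and bucket names below are made up for illustration, so treat this as an outline rather than a drop-in solution:

from google.cloud import bigquery

client = bigquery.Client()

# Load today's CSV into an ingestion-time partitioned table (each daily run appends a new partition).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    time_partitioning=bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://my_bucket/owid-covid-data.csv",   # hypothetical GCS path of the downloaded file
    "my_project.covid.owid_raw",            # hypothetical target table
    job_config=job_config,
)
load_job.result()

# Deduplicate: keep only the most recently ingested row per (iso_code, date).
query = """
SELECT * EXCEPT(rn) FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY iso_code, date ORDER BY _PARTITIONTIME DESC) AS rn
  FROM `my_project.covid.owid_raw`
) WHERE rn = 1
"""
rows = client.query(query).result()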
I'm looking for a Pythonic way to copy a file from AWS S3 to GCS.
I do not want to open/read the file and then use the blob.upload_from_string() method. I want to transfer it 'as-is'.
I cannot use gsutil. The scope of the libraries I'm working with is gcloud and boto3 (I also experimented with s3fs).
Here is a simple example (that seems to work) using the blob.upload_from_string() method, which I'm trying to avoid because I don't want to open/read the file. I fail to make it work using the blob.upload_from_file() method because the GCS API requires an accessible, readable, file-like object, which I fail to provide properly.
What am I missing? Suggestions?
import boto3
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials

GSC_Token_File = 'path/to/GSC_token'

s3 = boto3.client('s3', region_name='MyRegion')  # I'm running from AWS Lambda, no authentication required

# GSC_Token_File is a path, so load it with from_json_keyfile_name
gcs_credentials = ServiceAccountCredentials.from_json_keyfile_name(GSC_Token_File)
gcs_storage_client = storage.Client(credentials=gcs_credentials, project='MyGCP_project')
gcs_bucket = gcs_storage_client.get_bucket('MyGCS_bucket')

s3_file_to_load = str(s3.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')['Body'].read().decode('utf-8'))
blob = gcs_bucket.blob('file_to_copy.txt')
blob.upload_from_string(s3_file_to_load)
So I poked around a bit more and came across this article, which eventually led me to this solution. Apparently the GCS API can be called using the AWS boto3 SDK.
Please mind the HMAC key prerequisite; one can easily be created using these instructions.
import boto3

# I'm using a GCP service account, so my HMAC key was created accordingly.
# An HMAC key for a user account can be created just as well.
service_Access_key = 'YourAccessKey'
service_Secret = 'YourSecretKey'

# Reminder: I am copying from S3 to GCS
s3_client = boto3.client('s3', region_name='MyRegion')
gcs_client = boto3.client(
    "s3",  # yes, "s3": GCS exposes an S3-compatible XML API at this endpoint
    region_name="auto",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id=service_Access_key,
    aws_secret_access_key=service_Secret,
)

file_to_transfer = s3_client.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')
gcs_client.upload_fileobj(file_to_transfer['Body'], 'MyGCS_bucket', 'file_to_copy.txt')
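As for the HMAC key prerequisite mentioned above, it can also be created programmatically rather than in the console. A hedged sketch using the regular google-cloud-storage client (the service account email is a placeholder, and the caller needs permission to create HMAC keys):

from google.cloud import storage

client = storage.Client(project="MyGCP_project")
hmac_key, secret = client.create_hmac_key(
    service_account_email="my-service-account@my-project.iam.gserviceaccount.com",
)
print("Access ID:", hmac_key.access_id)  # use as aws_access_key_id
print("Secret:", secret)                 # use as aws_secret_access_key; only shown once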
I understand you're trying to move files from S3 to GCS using Python in an AWS Lambda function. There is one thing I'd like to clarify about the statement "I don't want to open/read the file": when the file is downloaded from S3 you are indeed reading it and writing it somewhere, be it into an in-memory string or a temporary file. In that sense it doesn't actually matter which of blob.upload_from_file() or blob.upload_from_string() is used, as they're equivalent; the first reads from a file and the second doesn't, because the data has already been read into memory. Therefore my suggestion would be to keep the code as it is; I don't see a benefit in changing it.
Anyway the file approach should be possible doing something along the lines below (untested, I have no S3 to check):
# From the S3 boto docs: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
blob.upload_from_filename('FILE_NAME')
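Put together, a self-contained (and likewise untested) version of that file-based approach could look like the sketch below. It uses the current google-cloud-storage client with default credentials for brevity, which is an assumption; the bucket and key names are the ones from the question:

import tempfile
import boto3
from google.cloud import storage

s3 = boto3.client("s3", region_name="MyRegion")
gcs_bucket = storage.Client(project="MyGCP_project").get_bucket("MyGCS_bucket")

# Download from S3 to a local temporary file, then upload that file to GCS as-is.
with tempfile.NamedTemporaryFile() as tmp:
    s3.download_fileobj("MyS3_bucket", "path/to/file_to_copy.txt", tmp)
    tmp.flush()
    gcs_bucket.blob("file_to_copy.txt").upload_from_filename(tmp.name)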
Finally, it is worth mentioning the Storage Transfer Service, which is intended for moving huge amounts of data from S3 to GCS. If that sounds like your use case, you may take a look at the code samples for Python.
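For completeness, here is a hedged sketch of creating a one-time transfer job with the google-cloud-storage-transfer client; all project, bucket, and key values are placeholders, and the linked samples remain the authoritative reference:

from datetime import datetime
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

now = datetime.utcnow()
one_time = {"day": now.day, "month": now.month, "year": now.year}

# Schedule start == end means the job runs once.
job = client.create_transfer_job({
    "transfer_job": {
        "project_id": "my-gcp-project",
        "description": "One-off S3 -> GCS copy",
        "status": storage_transfer.TransferJob.Status.ENABLED,
        "schedule": {
            "schedule_start_date": one_time,
            "schedule_end_date": one_time,
        },
        "transfer_spec": {
            "aws_s3_data_source": {
                "bucket_name": "MyS3_bucket",
                "aws_access_key": {
                    "access_key_id": "AWS_ACCESS_KEY_ID",
                    "secret_access_key": "AWS_SECRET_ACCESS_KEY",
                },
            },
            "gcs_data_sink": {"bucket_name": "MyGCS_bucket"},
        },
    }
})
print(job.name)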
Let me preface this by saying I am completely new. I code in SQL and SAS.
I need to list all objects stored in a Google Cloud storage bucket. The web GUI is insufficient as I am trying to locate one file among more than 6K files.
I'm in Google Cloud Datalab and using Python 3.6. What is the easiest way to simply create a list (preferably something I can kick out to a local csv) of these objects?
Thank you
As explained here, the following code will list all the objects in a bucket.
from google.cloud import storage
def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    # bucket_name = "your-bucket-name"
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        print(blob.name)
In blob.name you have the name of each object; in the code above we just print it to standard output. Since you said you would like this output in a text/CSV file, you can write each name to a file instead of printing it.
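For example, a small variation (an untested sketch; the output file name and bucket name are arbitrary) that writes the names to a local CSV you can then pull out of Datalab:

import csv
from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs("your-bucket-name")

with open("objects.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "size", "updated"])
    for blob in blobs:
        writer.writerow([blob.name, blob.size, blob.updated])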
The docs for Storage are pretty solid, and have links to the GitHub repos that contain all the client code.
There's also a command-line utility, gsutil, which makes this stuff pretty trivial in bash scripts and the like.
I made a publicly listable bucket on Google Cloud Storage. I can see all the keys if I list the bucket objects in the browser. I was trying to use the create_anonymous_client() function so that I can list the bucket keys in a Python script, but it is giving me an exception. I've looked everywhere and still can't find the proper way to use the function.
from google.cloud import storage
client = storage.Client.create_anonymous_client()
a = client.lookup_bucket('publically_listable_bucket')
a.list_blobs()
Exception I am getting:
ValueError: Anonymous credentials cannot be refreshed.
Additional query: can I list and download the contents of public Google Cloud Storage buckets using boto3? If yes, how do I do it anonymously?
I was also struggling with this and couldn't find an answer anywhere online. It turns out you can access the bucket with just the bucket() method, which builds a reference without calling the buckets.get API the way lookup_bucket() does.
I'm not sure why, but this method can sometimes take several seconds.
from google.cloud import storage

client = storage.Client.create_anonymous_client()
bucket = client.bucket('publically_listable_bucket')
blobs = list(bucket.list_blobs())
This error means the bucket you are attempting to list does not grant the right permission. You must give the "Storage Object Viewer" or "Storage Legacy Bucket Reader" role to "allUsers".
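If you control the bucket, here is a sketch of granting that role programmatically, mirroring the public IAM sample in the docs (the bucket name is a placeholder, and this must be run with an authenticated client that can change bucket IAM, not the anonymous one):

from google.cloud import storage

client = storage.Client()  # authenticated client with permission to edit bucket IAM
bucket = client.bucket("publically_listable_bucket")

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {"allUsers"}}
)
bucket.set_iam_policy(policy)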
I have my data on Google Cloud Platform and I want to be able to download it locally. This is my first time trying that, and eventually I'll use the downloaded data with my Python code.
I have checked the docs, like https://cloud.google.com/genomics/downloading-credentials-for-api-access and https://cloud.google.com/storage/docs/cloud-console. I have successfully got the JSON file for my first link; the second one is where I'm struggling. I'm using Python 3.5 and, assuming my JSON file's name is data.json, I have added the following code:
os.environ["file"] = "data.json"
urllib.request.urlopen('https://storage.googleapis.com/[bucket_name]/[filename]')
First of all, I don't even know what I should call the value in os.environ, so I just called it "file", and I'm not sure how I'm supposed to fill it in. I also got access denied on the second line, and obviously that's not how to download my file anyway, as there is no local destination or anything in that call. Any guidance will be appreciated.
Edit:
from google.cloud.storage import Blob
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials/client_secret.json"
storage_client = storage.Client.from_service_account_json('service_account.json')
client = storage.Client(project='my-project')
bucket = client.get_bucket('my-bucket')
blob = Blob('path/to/my-object', bucket)
download_to_filename('local/path/to/my-file')
I'm getting an unresolved reference for storage and download_to_filename, and should I replace service_account.json with credentials/client_secret.json? Also, I tried to print the content of os.environ["GOOGLE_APPLICATION_CREDENTIALS"]['installed'] like I would with any JSON, but it just said I should give numbers, meaning it read the input path as regular text only.
You should use the idiomatic Google Cloud client library to run operations in GCS.
With the example there, and knowing that the client library will pick up the application default credentials, first we have to set the application default credentials with
gcloud auth application-default login
===EDIT===
That was the old way. Now you should use the instructions in this link.
This means downloading a service account key file from the console, and setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the downloaded JSON.
Also, make sure that this service account has the proper permissions on the project of the bucket.
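With that variable set, the client picks the credentials up automatically; for example (the key path and project ID below are placeholders):

import os
from google.cloud import storage

# Point the client library at the downloaded service account key (hypothetical path).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service_account.json"

client = storage.Client(project="my-project")  # uses the application default credentials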
Or you can create the client with explicit credentials. You'll need to download the key file all the same, but when creating the client, use:
storage_client = storage.Client.from_service_account_json('service_account.json')
==========
And then, following the example code:
from google.cloud import storage
client = storage.Client(project='project-id')
bucket = client.get_bucket('bucket-id')
blob = storage.Blob('path/to/file/in/bucket', bucket)
blob.download_to_filename('/path/to/local/save')
Or, if this is a one-off download, just install the SDK and use gsutil to download:
gsutil cp gs://bucket/file .