Python: Upload file to Google Cloud Storage using URL - python

I have a URL (https://example.com/myfile.txt) of a file and I want to upload it to my bucket (gs://my-sample-bucket) on Google Cloud Storage.
What I am currently doing is:
Downloading the file to my system using the requests library.
Uploading that file to my bucket using a Python function.
Is there any way I can upload the file directly using the URL?

You can use urllib or the requests library to get the file over HTTP, then your existing Python code to upload it to Cloud Storage. Something like this should work (Python 3; on Python 2, use urllib2.urlopen instead):
import urllib.request
from google.cloud import storage
client = storage.Client()
# Fetch the whole file into memory
filedata = urllib.request.urlopen('http://example.com/myfile.txt')
datatoupload = filedata.read()
bucket = client.get_bucket('bucket-id-here')
# Blob lives in the storage module, so reference it as storage.Blob
blob = storage.Blob("myfile.txt", bucket)
blob.upload_from_string(datatoupload)
It still downloads the file into memory on your system, but I don't think there's a way to tell Cloud Storage to do that for you.
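If holding the whole file in memory is a concern, the download can be streamed into the bucket in chunks instead. The following is only a sketch under some assumptions: it relies on Blob.open("wb"), which is available in recent versions of google-cloud-storage, and the function names copy_stream_to_blob and upload_url_to_gcs are made up for this example. The data still passes through your machine, just not all at once.

```python
import shutil
import urllib.request


def copy_stream_to_blob(stream, blob, chunk_size=1024 * 1024):
    """Copy a readable binary stream into a blob, chunk_size bytes at a time."""
    # blob.open("wb") returns a file-like writer that uploads as data is written.
    with blob.open("wb") as writer:
        shutil.copyfileobj(stream, writer, chunk_size)


def upload_url_to_gcs(url, bucket_name, blob_name):
    """Fetch url and stream its body into gs://bucket_name/blob_name."""
    # Imported here so the helper above does not need the library at import time.
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    with urllib.request.urlopen(url) as response:
        copy_stream_to_blob(response, blob)


# Example call, using the placeholder names from the question:
# upload_url_to_gcs("https://example.com/myfile.txt", "my-sample-bucket", "myfile.txt")
```

Only one chunk (1 MiB by default here) is held by the copy loop at a time, although the writer keeps its own upload buffer.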

There is a way to do this, using a Storage Transfer Service job, but whether it is worth the effort depends on your use case. You would need to create a transfer job that transfers a URL list.
I marked this question as a duplicate of this one.
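For reference, here is a rough sketch of how such a URL list could be generated. It assumes the format I understand Storage Transfer Service to expect: a tab-separated file whose first line is TsvHttpData-1.0, followed by one line per file with the URL, the size in bytes, and the base64-encoded MD5 of the content, with entries in lexicographic order. Please check the Storage Transfer Service documentation before relying on it; the helper names below are hypothetical.

```python
import base64
import hashlib


def url_list_line(url, content):
    """Build one TSV line: URL, size in bytes, base64-encoded MD5."""
    md5_b64 = base64.b64encode(hashlib.md5(content).digest()).decode("ascii")
    return f"{url}\t{len(content)}\t{md5_b64}"


def build_url_list(entries):
    """entries maps each URL to the bytes served at that URL."""
    lines = ["TsvHttpData-1.0"]  # format header expected on the first line
    for url in sorted(entries):  # keep the URLs in lexicographic order
        lines.append(url_list_line(url, entries[url]))
    return "\n".join(lines) + "\n"
```

The resulting file has to be hosted somewhere the transfer job can read it, and the job is then pointed at that list rather than at the individual files.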

Related

Download specific rows of data from a file from a website using Python and Google Cloud

I wish to download a file (https://covid.ourworldindata.org/data/owid-covid-data.csv) from the internet using Python and Google Cloud.
Currently, I have this code.
import os
import wget
from google.cloud import storage
url = os.environ['URL']
bucket_name = os.environ['BUCKET'] #without gs://
file_name = os.environ['FILE_NAME']
cf_path = '/tmp/{}'.format(file_name)
def import_file(event, context):
    # set storage client
    client = storage.Client()
    # get bucket
    bucket = client.get_bucket(bucket_name)
    # download the file to Cloud Function's tmp directory
    wget.download(url, cf_path)
    # set Blob
    blob = storage.Blob(file_name, bucket)
    # upload the file to GCS
    blob.upload_from_filename(cf_path)
    print("""This Function was triggered by messageId {} published at {}""".format(context.event_id, context.timestamp))
While this code works beautifully, the Covid-19 data is updated daily with a new date added (so if I access the file on 3/7, it includes data up to 3/6). Instead of rewriting the whole file each time, I wish to extract only the newly added rows into Google Cloud Storage every time the function runs on its schedule, as opposed to overwriting the file that was already saved.
I am fairly weak in programming and would appreciate the help.
While the file is in CSV format, there's also a JSON link (https://covid.ourworldindata.org/data/owid-covid-data.json) if that would make it easier to code.
I can figure out the part that stores the data in Cloud Storage, but I need help with the code that extracts just the most recently added rows.
The usual best practice is to load the data into BigQuery every day and to partition the table by ingestion date.
Then you can run a query (or create a view) that deduplicates the rows and selects only the most recent data for each key, using a window function (the OVER (PARTITION BY ...) syntax).
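If BigQuery is more than you need, the new rows can also be filtered in Python before uploading. This is a minimal sketch, not a tested solution for this dataset: it assumes the CSV has a column named date holding ISO dates (YYYY-MM-DD), and that you keep track of the last date you already stored. The function name rows_after is made up for the example.

```python
import csv
import io


def rows_after(csv_text, last_date, date_column="date"):
    """Return (csv_text_of_new_rows, row_count) for rows dated after last_date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in reader:
        # ISO dates compare correctly as plain strings.
        if row[date_column] > last_date:
            writer.writerow(row)
            count += 1
    return out.getvalue(), count
```

The returned text could then be uploaded with blob.upload_from_string() under a dated object name, so each run adds a new object instead of overwriting the previous one.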

How to upload from Digital Ocean Storage to Google Cloud Storage, directly, programmatically without rclone

I want to migrate files from Digital Ocean Storage into Google Cloud Storage programmatically, without rclone.
I know the exact location of the file in Digital Ocean Storage (DOS), and I have the signed URL for Google Cloud Storage (GCS).
How can I modify the following code so that I can copy the DOS file directly into GCS, without an intermediate download to my computer?
from google.cloud import storage
def upload_to_gcs_bucket(blob_name, path_to_file, bucket_name):
    """Upload data to a bucket."""
    # Explicitly use service account credentials by specifying the private key
    # file.
    storage_client = storage.Client.from_service_account_json(
        'creds.json')
    #print(buckets = list(storage_client.list_buckets())
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(path_to_file)
    # returns a public url
    return blob.public_url
Google's Storage Transfer Service should be an answer for this type of problem (particularly because DigitalOcean Spaces, like most object stores, is S3-compatible). But (!) I think (I'm unfamiliar with it and unsure) it can't be used for this configuration.
There is no way to transfer files from a source to a destination without some form of intermediate transfer, but you can use memory rather than file storage as the intermediary. Keep in mind that memory is generally more constrained than file storage, and if you run multiple transfers concurrently, each one will consume some of it.
It's curious that you're using signed URLs. Generally, signed URLs are provided by a third party to limit access to that party's buckets. If you own the destination bucket, it will be easier to access it directly through one of Google's client libraries, such as the Python client library.
The Python examples include uploading from file and from memory. It will likely be best to stream the files into Cloud Storage if you'd prefer to not create intermediate files. Here's a Python example
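As an illustration of using memory as the intermediary, here is a rough sketch. It assumes the source is reachable through an S3-compatible client (for DigitalOcean Spaces, a boto3 client created with an endpoint_url for your region) and that blob is a google-cloud-storage Blob for the destination. The function name copy_via_memory is hypothetical, and the whole object is held in memory, so this only suits files that fit comfortably in RAM.

```python
import io


def copy_via_memory(s3_client, source_bucket, key, blob):
    """Copy one object from an S3-compatible store to a GCS blob via RAM."""
    buffer = io.BytesIO()
    # download_fileobj writes the object's bytes into a writable file-like object.
    s3_client.download_fileobj(source_bucket, key, buffer)
    size = buffer.getbuffer().nbytes
    # rewind=True makes upload_from_file seek to the start before reading.
    blob.upload_from_file(buffer, rewind=True)
    return size


# Example wiring (assumed names; credentials and region are placeholders):
# import boto3
# from google.cloud import storage
# spaces = boto3.client("s3", region_name="nyc3",
#                       endpoint_url="https://nyc3.digitaloceanspaces.com",
#                       aws_access_key_id="...", aws_secret_access_key="...")
# blob = storage.Client().bucket("my-gcs-bucket").blob("file.txt")
# copy_via_memory(spaces, "my-space", "path/to/file.txt", blob)
```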

Reading File from Vertex AI and Google Cloud Storage

I am trying to set up a pipeline in GCP/Vertex AI and am having a lot of trouble. The pipeline is being written using Kubeflow Pipelines and has many different components, one thing in particular is giving me trouble however. Eventually I want to launch this from a Cloud Function with the help of the Cloud Scheduler.
The part that is giving me issues is fairly simple and I believe I just need some form of introduction to how I should be thinking about this setup. I simply want to read and write from files (might be .csv, .txt or similar). I imagine that the analog to the filesystem on my local machine in GCP is the Cloud Storage so this is where I have been trying to read from for the time being (please correct me if I'm wrong). The component I've built is a blatant rip-off of this post and looks like this.
@component(
    packages_to_install=["google-cloud"],
    base_image="python:3.9"
)
def main(
):
    import csv
    from io import StringIO
    from google.cloud import storage
    BUCKET_NAME = "gs://my_bucket"
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(BUCKET_NAME)
    blob = bucket.blob('test/test.txt')
    blob = blob.download_as_string()
    blob = blob.decode('utf-8')
    blob = StringIO(blob) # transform bytes to string here
    names = csv.reader(blob) # then use csv library to read the content
    for name in names:
        print(f"First Name: {name[0]}")
The error I'm getting looks like the following:
google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/gs://pipeline_dev?projection=noAcl&prettyPrint=false: Not Found
What's going wrong in my brain? I get the feeling that it shouldn't be this difficult to read and write files. I must be missing something fundamental? Any help is highly appreciated.
Try specifying the bucket name without the gs:// prefix. This should fix the issue. Another Stack Overflow post says the same thing: Cloud Storage python client fails to retrieve bucket
Every storage bucket in GCP has a unique address. Written as a URI, that address always starts with gs://, which marks it as a Cloud Storage location. The GCS client APIs, however, are designed to work with the bucket name alone, so you pass just the name. Tools that take a full URI (gsutil, for example) need the complete address, including the gs:// prefix.
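If your configuration stores full gs:// addresses, a small helper can split them into the pieces the client library expects. This is just an illustrative sketch; split_gcs_uri is a made-up name, not part of the google-cloud-storage library.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket_name, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket_name, _, object_name = uri[len(prefix):].partition("/")
    if not bucket_name:
        raise ValueError(f"missing bucket name in {uri!r}")
    return bucket_name, object_name
```

With the component from the question, split_gcs_uri("gs://my_bucket/test/test.txt") yields the bucket name to pass to get_bucket() and the object name to pass to bucket.blob().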

How to send/copy/upload file from AWS S3 to Google GCS using Python

I'm looking for a Pythonic way to copy a file from AWS S3 to GCS.
I do not want to open/read the file and then use the blob.upload_from_string() method. I want to transfer it 'as-is'.
I cannot use gsutil. The libraries I'm working with are limited to gcloud and boto3 (I also experimented with s3fs).
Here is a simple example (that seems to work) using the blob.upload_from_string() method, which I'm trying to avoid because I don't want to open/read the file. I fail to make it work using the blob.upload_from_file() method because the GCS API requires an accessible, readable, file-like object, which I fail to properly provide.
What am I missing? Suggestions?
import boto3
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials
GSC_Token_File = 'path/to/GSC_token'
s3 = boto3.client('s3', region_name='MyRegion') # im running from AWS Lambda, no authentication required
gcs_credentials = ServiceAccountCredentials.from_json_keyfile_name(GSC_Token_File)  # GSC_Token_File is a path, so load it by file name
gcs_storage_client = storage.Client(credentials=gcs_credentials, project='MyGCP_project')
gcs_bucket = gcs_storage_client.get_bucket('MyGCS_bucket')
s3_file_to_load = str(s3.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')['Body'].read().decode('utf-8'))
blob = gcs_bucket.blob('file_to_copy.txt')
blob.upload_from_string(s3_file_to_load)
So I poked around a bit more and came across this article, which eventually led me to this solution. Apparently the GCS API can be called using the AWS boto3 SDK.
Please mind the HMAC key prerequisite; the key can be easily created using these instructions.
import boto3
# im using GCP Service Account so my HMAC was created accordingly.
# HMAC for User Account can be created just as well
service_Access_key = 'YourAccessKey'
service_Secret = 'YourSecretKey'
# Reminder: I am copying from S3 to GCS
s3_client = boto3.client('s3', region_name='MyRegion')
gcs_client = boto3.client(
    "s3", # !just like that
    region_name="auto",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id=service_Access_key,
    aws_secret_access_key=service_Secret,
)
file_to_transfer = s3_client.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')
gcs_client.upload_fileobj(file_to_transfer['Body'], 'MyGCS_bucket', 'file_to_copy.txt')
I understand you're trying to move files from S3 to GCS using Python in an AWS Lambda function. There is one thing I'd like to clarify about the statement "I don't want to open/read the file": when the file is downloaded from S3, you are indeed reading it and writing it somewhere, be it into an in-memory string or a temporary file. In that sense, it doesn't matter whether blob.upload_from_file() or blob.upload_from_string() is used, as they're equivalent; the first reads from a file object, and the second doesn't need to because the data is already in memory. My suggestion would therefore be to keep the code as it is; I don't see a benefit in changing it.
Anyway the file approach should be possible doing something along the lines below (untested, I have no S3 to check):
# From S3 boto docs: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
# upload_from_file expects an open file object, not a path
with open('FILE_NAME', 'rb') as f:
    blob.upload_from_file(f)
Finally it is worth mentioning the Storage Transfer tool which is intended for moving huge amounts of data from S3 to GCS. If that sounds like your use case you may take a look at the code samples for Python.

Get blob_key from an existing file in Cloud Storage

In Google App Engine, is it possible to retrieve the blob_key from a Google Cloud Storage file path? The file was uploaded to Cloud Storage directly.
The documentation only shows how to create a file in Cloud Storage and get the blob_key.
https://developers.google.com/appengine/docs/python/blobstore/#Python_Using_the_Blobstore_API_with_Google_Cloud_Storage
It's possible you don't actually need such a key: if you uploaded the file directly, I assume you know the bucket name and object name/path? With this you can serve the object by constructing a URL (e.g. http://storage.googleapis.com/<bucket-name>/<object-path>) and you can read it using the Cloud Storage Client Library's open() function.
I've not used blob_keys myself, but the use case for them seems to be where an uploaded file needs to be served from a location which is unknown and therefore a GCS URL cannot be formed.
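Constructing the URL from the bucket and object names could look like the sketch below. The helper name gcs_object_url is hypothetical, and the resulting link will only serve the file to anonymous users if the object is publicly readable.

```python
from urllib.parse import quote


def gcs_object_url(bucket_name, object_path):
    """Build the storage.googleapis.com URL for an object in a bucket."""
    # Percent-encode the object path but keep "/" so nested paths stay intact.
    encoded_path = quote(object_path, safe="/")
    return f"https://storage.googleapis.com/{bucket_name}/{encoded_path}"
```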
