Copy files from S3 to Google Cloud storage using boto - python

I'm trying to automate copying of files from S3 to Google Cloud Storage inside a python script.
Everywhere I look people recommend using gsutil as a command line utility.
Does anybody know if this copies files directly? Or does it first download the files to the computer and then upload them to GS?
Can this be done using the boto library and Google's OAuth2 plugin?
This is what I've got from Google's documentation and a little bit of trial and error:
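import StringIO  # Python 2; on Python 3 use io.BytesIO instead
import boto

# Pass the 's3' scheme explicitly; boto.storage_uri defaults to local 'file' URIs.
src_uri = boto.storage_uri('bucket/source_file', 's3')
dst_uri = boto.storage_uri('bucket/destination_file', 'gs')
# Download the S3 object into an in-memory buffer on this host.
object_contents = StringIO.StringIO()
src_uri.get_key().get_file(object_contents)
object_contents.seek(0)
# Upload the buffered bytes to Google Cloud Storage.
dst_uri.set_contents_from_file(object_contents)
object_contents.close()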
From what I understand I'm reading from a file into an object in the host where the script is running, and later uploading such content into a file in GS.
Is this right?

Since this question was asked, Google Cloud's Storage Transfer Service has become available. If you want to copy files from S3 to GCS without routing them through an intermediate machine, this is a great option.
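If the copy has to stay inside a Python script, the bytes do pass through the host running it, but they do not have to be buffered in full. Here is a minimal sketch of a chunked copy, assuming boto3 and a google-cloud-storage version recent enough to provide Blob.open (so not the older boto plus OAuth2 plugin setup from the question). The bucket and object names are whatever you pass in, and the function names are made up for illustration.

```python
import shutil


def copy_stream(src, dst, chunk_size=1024 * 1024):
    """Copy a readable binary stream to a writable one in fixed-size chunks."""
    shutil.copyfileobj(src, dst, chunk_size)


def s3_to_gcs(s3_bucket, s3_key, gcs_bucket, gcs_name):
    """Stream one S3 object into a GCS object through this host."""
    # Imported here so copy_stream stays usable without either SDK installed.
    import boto3
    from google.cloud import storage

    # The "Body" of get_object is a file-like stream that supports read(n).
    body = boto3.client("s3").get_object(Bucket=s3_bucket, Key=s3_key)["Body"]
    blob = storage.Client().bucket(gcs_bucket).blob(gcs_name)
    # Blob.open("wb") returns a writer that uploads as data is written to it.
    with blob.open("wb") as writer:
        copy_stream(body, writer)
```

Only about one chunk is held in memory at a time, but the data still travels S3 -> your host -> GCS, so bandwidth on the host is the limit.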

Related

Best way to write audio files to google cloud storage

I'm processing an input audio file from a google bucket using ffmpeg and writing the output to another google bucket. For this, I'm initially writing it to the /tmp folder and then uploading the file to the google bucket using blob.upload_from_filename(local_path). However I read that writing to /tmp increases RAM usage and I'd like to know of a better way (perhaps writing directly to the google bucket).
I considered signed URLs, but whilst they're useful for reading the audio files from the input bucket, I couldn't find a way to write to a file using its signed URL.
Google's documentation on streaming transfers for Python in GCS suggests using resumable uploads.
Writing to /tmp does not increase RAM consumption unless you have mounted a RAMDISK to /tmp which some distributions do by default. To find out if /tmp is a RAMDISK, see e.g. http://linuxintro.org/wiki/Ramdisk
If you do not want to create a local file at all, I'd use upload_from_string as suggested in Upload file to cloud storage from request without saving it locally
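A rough sketch of the upload_from_string route, assuming ffmpeg can read its input from stdin and write the result to stdout (some container formats need seekable input, in which case this will not work). The helper names, the ffmpeg arguments and the default content type are illustrative assumptions, not something taken from the question.

```python
import subprocess


def run_to_bytes(cmd, input_bytes=None):
    """Run a command and return everything it writes to stdout as bytes."""
    result = subprocess.run(
        cmd, input=input_bytes, stdout=subprocess.PIPE, check=True
    )
    return result.stdout


def upload_bytes(blob, data, content_type="audio/mpeg"):
    """Upload an in-memory bytes object through a google-cloud-storage Blob."""
    blob.upload_from_string(data, content_type=content_type)


# Hypothetical ffmpeg invocation: read from stdin, write MP3 to stdout.
FFMPEG_CMD = ["ffmpeg", "-i", "pipe:0", "-f", "mp3", "pipe:1"]

# Usage (bucket and object names are placeholders):
#   from google.cloud import storage
#   bucket = storage.Client().bucket("output-bucket")
#   source = bucket.blob("in.wav").download_as_bytes()
#   upload_bytes(bucket.blob("out.mp3"), run_to_bytes(FFMPEG_CMD, source))
```

Note that this keeps the whole output in RAM, so it only avoids the local file, not the memory use.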

Is there a way to create a Python script on Google Cloud Function that Downloads a file from a Bucket to your local computer?

Currently, I know of two ways to download a file from a bucket to your computer
1) Manually going to the bucket and clicking download
2) Use gsutil
Is there a way to do this in a program in Google Cloud Functions? You can't execute gsutil in a Python script. I also tried this:
with open("C:\", "wb") as file_obj:
    blob.download_to_file(file_obj)
but I believe that looks for a directory on Google Cloud. Please help!
The code you are using downloads a file to the local file system of whichever machine runs it; this is covered in this document. However, as you mention, inside a Cloud Function the path refers to the machine executing the function, not to your computer.
I would suggest creating a cron job on your local machine that downloads the files with gsutil; doing this through a Cloud Function is not going to be possible.
Hope you find this useful.
You can't directly achieve what you want. The Cloud Function would have to be able to reach your local computer in order to copy the file onto it.
A common way of sending a file to a computer is FTP. You would install an FTP server on your computer, then set up your function to read from your bucket and send the file to your FTP server (you have to get your public IP and make sure that the firewall rules and routers are configured for this). It's not the easiest way.
gsutil with the rsync command works perfectly. Use a scheduled task on Windows if you want to run it on a schedule.
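If you would rather stay in Python than schedule gsutil, the same pull can be done by a script that runs on your own computer (not in the Cloud Function). A minimal sketch, assuming the google-cloud-storage library and credentials are set up locally; download_prefix is a made-up helper name, and the bucket name and paths in the usage comment are placeholders.

```python
from pathlib import Path


def download_prefix(bucket, prefix, dest_dir):
    """Download every object under `prefix` into `dest_dir`; return the local paths."""
    dest = Path(dest_dir)
    saved = []
    for blob in bucket.list_blobs(prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        target = dest / blob.name
        # Recreate the object's "directory" part on the local disk.
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))
        saved.append(target)
    return saved


# Usage, run locally (placeholders throughout):
#   from google.cloud import storage
#   bucket = storage.Client().bucket("my-bucket")
#   download_prefix(bucket, "reports/", r"C:\downloads")
```

Unlike gsutil rsync, this downloads everything on every run; it does not compare against what is already on disk.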

How to download file from website to S3 bucket without having to download to local machine

I'm trying to download a dataset from a website. However, all the files I want to download add up to about 100 GB, which I don't want to download to my local machine and then upload to S3. Is there a way to download directly to an S3 bucket? Or do you have to use EC2, and if so could somebody give brief instructions on how to do this? Thanks
S3's put_object() method supports a Body parameter for Bytes (or file):
Python example:
import boto3

client = boto3.client('s3')
response = client.put_object(
    Body=b'bytes',  # or an open file-like object
    Bucket='string',
    Key='string',
)
So if you download a web page using Python, you'd use the requests.get() method (in .NET, either HttpWebRequest or WebClient) and then upload the content as a byte array, so you never need to save it locally. It can all be done in memory.
Or do you have to use ec2
An EC2 instance is just a VM in the cloud; you can do this task (download 100 GB to S3) programmatically from your desktop PC or laptop. Simply open a Command Prompt window or a terminal and type:
aws configure
Put in an IAM user's credentials and use the AWS CLI or an AWS SDK, as in the Python example above. You can give the S3 bucket a policy document that allows access to the IAM user. Run this way, everything is downloaded to your local machine.
If you want to run this on an EC2 instance and avoid downloading everything to your local PC, modify the role assigned to the instance and give it Put privileges on S3. This will be the easiest and most secure option. If you use the in-memory bytes approach, it will still download all the data, but it won't save it to disk.
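For files too large to hold in memory, a streaming variant of the idea above may be more practical. This is only a sketch: it swaps put_object for boto3's upload_fileobj, which reads from a file-like object in chunks, and uses the standard library's urllib in place of requests. The function name and the opener parameter are assumptions added for illustration.

```python
import urllib.request


def stream_url_to_s3(s3_client, url, bucket, key, opener=urllib.request.urlopen):
    """Stream the body of `url` into s3://bucket/key without writing it to disk."""
    with opener(url) as response:
        # upload_fileobj only needs an object with read() that returns bytes;
        # it reads in chunks rather than loading the whole body at once.
        s3_client.upload_fileobj(response, bucket, key)


# Usage (placeholders throughout):
#   import boto3
#   stream_url_to_s3(boto3.client("s3"),
#                    "https://example.com/data.zip", "my-bucket", "data.zip")
```

Wherever this runs, that machine still pulls all the bytes over its own network connection, so running it on an EC2 instance in the bucket's region is what actually keeps the traffic off your home connection.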

How to copy files from cloud storage to other cloud? For example, Google Drive to OneDrive

I want to copy files from Google Drive to OneDrive using any APIs in Python. Google provides some APIs, but I don't see anything related to copying files to another cloud.
One way to achieve this is to download the files from Google Drive using the Google Drive API and then upload them to OneDrive.
Please share some inputs if there is a way to achieve this.
If you are running both services on your desktop, you could just use python to move the files from one folder to another. See How to move a file in Python. By moving the file from your google drive folder to your onedrive folder, the services should automatically sync and upload the files.
(As an aside, if you don't care how the problem gets solved, a service like https://publist.app might be of use.)
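The folder-to-folder move from the first suggestion could look roughly like this. It assumes both desktop sync clients are installed and that you know where their local folders live; the function name and the paths in the usage comment are placeholders.

```python
import shutil
from pathlib import Path


def move_between_sync_folders(src_dir, dst_dir, pattern="*"):
    """Move files matching `pattern` from one synced folder to another."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(src_dir).glob(pattern)):
        if path.is_file():
            # shutil.move returns the destination path.
            moved.append(Path(shutil.move(str(path), str(dst / path.name))))
    return moved


# Usage (placeholder paths):
#   move_between_sync_folders(r"C:\Users\me\Google Drive\photos",
#                             r"C:\Users\me\OneDrive\photos", "*.jpg")
```

Because this is a move, the files disappear from the Google Drive folder once the sync client catches up; use shutil.copy2 instead if you want them to stay in both places.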

interface between google colaboratory and google cloud

From Google Colaboratory, if I want to read from or write to a folder in a given bucket created in Google Cloud, how do I achieve this?
I have created a bucket and a folder within the bucket, and uploaded a bunch of images into it. Now, from Colaboratory, using a Jupyter notebook, I want to create multiple sub-directories to organise these images into train, validation and test folders,
and subsequently access the respective folders for training, validating and testing the model.
With Google Drive, we just update the path to point to a specific directory with the following commands, after authentication.
import sys
sys.path.append('drive/xyz')
We do something similar on the desktop version as well
import os
os.chdir(local_path)
Does something similar exist for Google Cloud Storage?
The Colaboratory FAQs have a procedure for reading and writing a single file, where we need to set the entire path. That will be tedious for re-organising a main directory into sub-directories and accessing them separately.
In general it's not a good idea to try to mount a GCS bucket on the local machine (which would allow you to use it as you mentioned). From Connecting to Cloud Storage buckets:
Note: Cloud Storage is an object storage system that does not have the
same write constraints as a POSIX file system. If you write data
to a file in Cloud Storage simultaneously from multiple sources, you
might unintentionally overwrite critical data.
Assuming you'd like to continue regardless of the warning, if you use a Linux OS you may be able to mount it using the Cloud Storage FUSE adapter. See related How to mount Google Bucket as local disk on Linux instance with full access rights.
The recommended way to access GCS from Python apps is using the Cloud Storage Client Libraries, but accessing files will be different than in your snippets. You can find some examples at Python Client for Google Cloud Storage:
from google.cloud import storage
client = storage.Client()
# https://console.cloud.google.com/storage/browser/[bucket-id]/
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')
print(blob.download_as_string())
blob.upload_from_string('New contents!')
blob2 = bucket.blob('remote/path/storage.txt')
blob2.upload_from_filename(filename='/local/path.txt')
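For the train/validation/test reorganisation from the question, keep in mind that GCS has no real directories; a "folder" is just a shared name prefix, so reorganising means renaming objects. A sketch of one way to do that with the same client library follows. The 80/10/10 split, the seed and both function names are assumptions made up for illustration, and it should be run only once per prefix, since a second run would pick up the already-moved objects again.

```python
import random


def assign_splits(names, fractions=(0.8, 0.1, 0.1), seed=0):
    """Map each object name to 'train', 'validation' or 'test'."""
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    splits = {}
    for i, name in enumerate(shuffled):
        if i < n_train:
            splits[name] = "train"
        elif i < n_train + n_val:
            splits[name] = "validation"
        else:
            splits[name] = "test"
    return splits


def reorganise(bucket, prefix):
    """Rename objects under `prefix` to `<prefix><split>/<filename>`."""
    blobs = [b for b in bucket.list_blobs(prefix=prefix) if not b.name.endswith("/")]
    splits = assign_splits([b.name for b in blobs])
    for blob in blobs:
        filename = blob.name.rsplit("/", 1)[-1]
        # rename_blob copies the object to the new name and deletes the old one.
        bucket.rename_blob(blob, f"{prefix}{splits[blob.name]}/{filename}")
```

Afterwards each split can be read back with bucket.list_blobs(prefix='images/train/') and so on, where 'images/' stands for whatever prefix you used.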
Update:
The Colaboratory doc recommends another method that I forgot about, based on the Google API Client Library for Python. Note that it also doesn't operate like a regular filesystem; it uses an intermediate file on the local filesystem:
uploading files to GCS
downloading files from GCS:
