Transferring data from GCS to S3 with google-cloud-storage - Python

I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble finding out how to do it in Python.
I have already written the code in Kotlin (because it was easiest for me, and for reasons outside the scope of my question we want it to run in Python). In Kotlin the Google SDK lets me get an InputStream from the Blob object, which I can then pass into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the Python SDK it seems my only options are to download the blob to a file or as a string.
I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content to a file first.
I am in no way a Python pro, so I might have missed an obvious way of doing this.

I ended up with the following solution: download_to_file writes the data into a file-like object that the boto3 S3 client can then handle.
This works just fine for smaller files, but since it buffers everything in memory, it could be problematic for larger files.
import boto3
from google.cloud import storage
from io import BytesIO

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)

    # Download the blob into an in-memory buffer
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)

    # Upload the buffer to S3
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
If anyone has information/knowledge about something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3 without having to buffer it in memory on the host machine), it would be very much appreciated.

google-resumable-media can be used to download the file from GCS in chunks, and smart_open can be used to upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
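For example, here is a minimal, untested sketch of that idea, using blob.open() (which downloads the object in chunks) for the GCS side and smart_open for the S3 side. It assumes a recent google-cloud-storage (>= 1.38, where Blob.open exists) and smart_open >= 5.0; the bucket and object names are placeholders.
import boto3
from google.cloud import storage
from smart_open import open as s3_open

def stream_gcs_to_s3(gcs_bucket, gcs_key, s3_bucket, s3_key, chunk_size=8 * 1024 * 1024):
    blob = storage.Client().bucket(gcs_bucket).blob(gcs_key)
    s3_client = boto3.client("s3")
    # blob.open("rb") reads the object chunk by chunk; smart_open performs a
    # multipart upload, so only one chunk is held in memory at a time.
    with blob.open("rb", chunk_size=chunk_size) as source, \
         s3_open(f"s3://{s3_bucket}/{s3_key}", "wb",
                 transport_params={"client": s3_client}) as dest:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            dest.write(chunk)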

Related

How to send/copy/upload file from AWS S3 to Google GCS using Python

I'm looking for a Pythonic way to copy a file from AWS S3 to GCS.
I do not want to open/read the file and then use the blob.upload_from_string() method. I want to transfer it 'as-is'.
I cannot use 'gsutil'. The scope of the libraries I'm working with is gcloud and boto3 (I also experimented with s3fs).
Here is a simple example (that seems to work) using the blob.upload_from_string() method, which I'm trying to avoid because I don't want to open/read the file. I fail to make it work using the blob.upload_from_file() method, because the GCS API requires an accessible, readable, file-like object which I fail to properly provide.
What am I missing? Suggestions?
import boto3
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials

GSC_Token_File = 'path/to/GSC_token'

s3 = boto3.client('s3', region_name='MyRegion')  # running from AWS Lambda, no authentication required
gcs_credentials = ServiceAccountCredentials.from_json_keyfile_name(GSC_Token_File)
gcs_storage_client = storage.Client(credentials=gcs_credentials, project='MyGCP_project')
gcs_bucket = gcs_storage_client.get_bucket('MyGCS_bucket')

s3_file_to_load = s3.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')['Body'].read().decode('utf-8')
blob = gcs_bucket.blob('file_to_copy.txt')
blob.upload_from_string(s3_file_to_load)
So I poked around a bit more and came across this article, which eventually led me to this solution. Apparently the GCS API can be called using the AWS boto3 SDK.
Please mind the HMAC key prerequisite, which can easily be created using these instructions.
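If you prefer to create the HMAC key programmatically rather than through the console, the google-cloud-storage client can do it; a small sketch, where the project and service account email are placeholders:
from google.cloud import storage

client = storage.Client(project="MyGCP_project")
hmac_key, secret = client.create_hmac_key(
    service_account_email="my-service-account@MyGCP_project.iam.gserviceaccount.com"
)
print(hmac_key.access_id)  # use as aws_access_key_id below
print(secret)              # use as aws_secret_access_key below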
import boto3

# I'm using a GCP Service Account, so my HMAC key was created accordingly.
# An HMAC key for a User Account can be created just as well.
service_Access_key = 'YourAccessKey'
service_Secret = 'YourSecretKey'

# Reminder: I am copying from S3 to GCS
s3_client = boto3.client('s3', region_name='MyRegion')

# GCS exposes an S3-compatible XML API, so a boto3 "s3" client pointed at
# storage.googleapis.com and authenticated with the HMAC key does the job.
gcs_client = boto3.client(
    "s3",  # yes, literally "s3"
    region_name="auto",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id=service_Access_key,
    aws_secret_access_key=service_Secret,
)

file_to_transfer = s3_client.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')
gcs_client.upload_fileobj(file_to_transfer['Body'], 'MyGCS_bucket', 'file_to_copy.txt')
I understand you're trying to move files from S3 to GCS using Python in an AWS Lambda function. There is one thing I'd like to clarify about the statement "I don't want to open/read the file": when the file is downloaded from S3 you are indeed reading it and writing it somewhere, be it into an in-memory string or to a temporary file. In that sense it doesn't actually matter which of blob.upload_from_file() or blob.upload_from_string() is used, as they're equivalent; the first will read from a file and the second won't, because the data has already been read into memory. Therefore my suggestion would be to keep the code as it is; I don't see a benefit in changing it.
Anyway, the file approach should be possible with something along the lines below (untested, I have no S3 to check):
# From the S3 boto3 docs: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
# upload_from_filename() takes a path; upload_from_file() would expect a file object
blob.upload_from_filename('FILE_NAME')
Finally, it is worth mentioning the Storage Transfer Service, which is intended for moving huge amounts of data from S3 to GCS. If that sounds like your use case, you may take a look at the code samples for Python.
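As a rough, untested sketch of what such a transfer job looks like with the googleapiclient discovery client (the project ID, bucket names, and AWS credentials are placeholders, and the job body is only the minimal set of fields I recall the v1 transferJobs API expecting):
import googleapiclient.discovery

def create_s3_to_gcs_transfer(project_id, s3_bucket, gcs_bucket,
                              aws_access_key_id, aws_secret_access_key):
    client = googleapiclient.discovery.build("storagetransfer", "v1")
    job = {
        "description": "One-off S3 -> GCS copy",
        "status": "ENABLED",
        "projectId": project_id,
        "transferSpec": {
            "awsS3DataSource": {
                "bucketName": s3_bucket,
                "awsAccessKey": {
                    "accessKeyId": aws_access_key_id,
                    "secretAccessKey": aws_secret_access_key,
                },
            },
            "gcsDataSink": {"bucketName": gcs_bucket},
        },
    }
    # Without a schedule the job only runs when triggered explicitly,
    # e.g. via transferJobs().run().
    return client.transferJobs().create(body=job).execute()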

Convert s3 video files resolution and store it back to s3

I want to convert the resolution of video files in an S3 bucket to other resolutions and store them back to S3.
I know we can use ffmpeg locally to change the resolution.
Can we use ffmpeg to convert S3 files and store them back using Django?
I am puzzled as to where to start and what the process should be.
Should I hold the video file from the S3 bucket in a buffer, then convert it using ffmpeg and upload it back to S3?
Or is there any way to do it directly without having to keep it in a buffer?
You can break the problem into 3 parts:
1. Get the file from S3: use boto3 & the AWS APIs to download the file.
2. Convert the file locally: ffmpeg has got you covered here. It'll generate a new file.
3. Upload the file back to S3: use boto3 again.
This is fairly simple and robust for a script.
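A minimal sketch of those three steps, assuming ffmpeg is installed on the host; the bucket, keys, target resolution, and /tmp paths are placeholders:
import subprocess
import boto3

def convert_and_reupload(bucket, source_key, target_key, resolution="1280:720"):
    s3 = boto3.client("s3")
    local_in = "/tmp/input.mp4"
    local_out = "/tmp/output.mp4"

    # 1. Download the source video from S3
    s3.download_file(bucket, source_key, local_in)

    # 2. Convert it locally with ffmpeg (the scale filter sets the new resolution)
    subprocess.run(
        ["ffmpeg", "-y", "-i", local_in, "-vf", f"scale={resolution}", local_out],
        check=True,
    )

    # 3. Upload the converted file back to S3
    s3.upload_file(local_out, bucket, target_key)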
Now if you want to optimize, you can try using something like s3fs to mount your S3 bucket locally and do the conversion there.

python upload data, not file, to s3 bucket

I know how to upload a file to an S3 bucket in Python. I am looking for a way to upload data to a file in an S3 bucket directly. That way I do not need to save my data to a local file and then upload the file. Any suggestions? Thanks!
AFAIK the standard Object.put() supports this (where s3 is a boto3.resource('s3')):
resp = s3.Object('bucket_name', 'key/key.txt').put(Body=b'data')
Edit: it was pointed out that you might want the client method (client = boto3.client('s3')), which is just put_object with the kwargs organized differently:
client.put_object(Body=b'data', Bucket='bucket_name', Key='key/key.txt')
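If the data is already bytes in memory, another option is upload_fileobj with an in-memory buffer; a small sketch with placeholder bucket and key names:
import io
import boto3

client = boto3.client('s3')
data = b'data'  # any bytes you already have in memory
client.upload_fileobj(io.BytesIO(data), 'bucket_name', 'key/key.txt')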

Python script to use data from Azure Storage Blob by stream, and update blob by stream without local file reading and uploading

I have Python code for data processing, and I want to use an Azure block blob as the data input for the code; to be specific, a CSV file from block blob storage. Downloading the CSV file from the Azure blob to a local path, and uploading it the other way around, works fine when the code runs locally, but the problem is that my code runs on an Azure virtual machine, because it is quite heavy for my MacBook Air. pandas read_csv from a local path does not work in this case, therefore I have to download, upload, and update the CSV files in Azure Storage by stream, without saving them locally. Both the downloaded and uploaded CSVs are quite small, much less than the block blob limits.
There wasn't much of a tutorial explaining how to do this step by step, and the MS docs are generally bad at explaining it as well. My minimal code is as follows.
For downloading from Azure blob storage:
import pandas as pd
from azure.storage.blob import BlockBlobService

storage = BlockBlobService(account_name='myname', account_key='mykey')
# here I don't know how to make a CSV stream that can be used in the next steps
file = storage.get_blob_to_stream('accountname', 'blobname', 'stream')
df = pd.read_csv(file)
# df for later steps
For uploading and updating a blob by stream from a dataframe generated by the code:
df  # dataframe generated by the code
# I don't know how to do the preparation steps for df and the final upload operation
storage.put_blob_to_list_by_stream('accountname', 'blobname', 'stream')
Can you please make a step-by-step tutorial for me? For people with Azure blob experience this should not be very difficult.
Or if you have a better solution than using blob storage for my case, please drop some hints. Thanks.
So the documentation is still in progress; I think it is getting better and better...
Useful link:
Github - Microsoft Azure Storage SDK for Python
Quickstart: Upload, download, and list blobs using Python
To download a file as a stream from blob storage, you can use BytesIO:
from azure.storage.blob import BlockBlobService
from io import BytesIO
from shutil import copyfileobj

with BytesIO() as input_blob:
    with BytesIO() as output_blob:
        block_blob_service = BlockBlobService(account_name='my_account_name', account_key='my_account_key')

        # Download as a stream, then rewind the buffer so it can be read from the start
        block_blob_service.get_blob_to_stream('mycontainer', 'myinputfilename', input_blob)
        input_blob.seek(0)

        # Do whatever you want to do - here I am just copying the input stream to the output stream
        copyfileobj(input_blob, output_blob)
        ...

        # Rewind again before uploading, then create a new blob
        output_blob.seek(0)
        block_blob_service.create_blob_from_stream('mycontainer', 'myoutputfilename', output_blob)
        # Or update the same blob
        block_blob_service.create_blob_from_stream('mycontainer', 'myinputfilename', output_blob)
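For the asker's pandas use case specifically, something along these lines should work with the same legacy SDK (untested; the account, container, and blob names are placeholders), since small CSVs can simply round-trip through strings:
import pandas as pd
from io import StringIO
from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='myname', account_key='mykey')

# Read a CSV blob straight into a dataframe, no local file involved
csv_text = block_blob_service.get_blob_to_text('mycontainer', 'input.csv').content
df = pd.read_csv(StringIO(csv_text))

# ... process df ...

# Write the dataframe back to (the same or another) blob as CSV
block_blob_service.create_blob_from_text('mycontainer', 'output.csv', df.to_csv(index=False))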

Any way to write files DIRECTLY to S3 using boto3?

I wrote a python script to process very large files (few TB in total), which I'll run on an EC2 instance. Afterwards, I want to store the processed files in an S3 bucket. Currently, my script first saves the data to disk and then uploads it to S3. Unfortunately, this will be quite costly given the extra time spent waiting for the instance to first write to disk and then upload.
Is there any way to use boto3 to write files directly to an S3 bucket?
Edit: to clarify my question, I'm asking whether, if I have an object in memory, I can write that object directly to S3 without first saving it to disk.
You can use put_object for this. Just pass in your file object (or bytes) as the Body.
For example:
import boto3

client = boto3.client('s3')
response = client.put_object(
    Bucket='your-s3-bucket-name',
    Body='bytes or seekable file-like object',
    Key='Object key for which the PUT operation was initiated'
)
It also works with the S3 client's put_object method:
s3 = boto3.client('s3')
key = 'filename'
response = s3.put_object(Bucket='Bucket_Name',
                         Body=json_data,  # json_data: the in-memory string/bytes to store
                         Key=key)
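Since the question mentions files totalling a few TB, it may also be worth noting that boto3's multipart upload API lets you push a large object to S3 piece by piece as your script produces it, without ever holding the whole thing in memory or writing it to disk. A rough sketch; the bucket, key, and chunk iterator are placeholders, and every part except the last must be at least 5 MB:
import boto3

def stream_parts_to_s3(part_iterator, bucket, key):
    """Upload chunks of bytes produced by `part_iterator` as one S3 object."""
    s3 = boto3.client("s3")
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    try:
        for number, chunk in enumerate(part_iterator, start=1):
            resp = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                PartNumber=number, Body=chunk,
            )
            parts.append({"ETag": resp["ETag"], "PartNumber": number})
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the partial upload doesn't keep accruing storage costs
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"])
        raise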
