I wrote a Python script to process very large files (a few TB in total), which I'll run on an EC2 instance. Afterwards, I want to store the processed files in an S3 bucket. Currently, my script first saves the data to disk and then uploads it to S3. Unfortunately, this will be quite costly given the extra time the instance spends first writing to disk and then uploading.
Is there any way to use boto3 to write files directly to an S3 bucket?
Edit: to clarify my question: given an object that is already in memory, I'm asking how to write that object directly to S3 without first saving it to disk.
You can use put_object for this. Just pass your file object (or bytes) in as the Body.
For example:
import boto3

client = boto3.client('s3')
response = client.put_object(
    Bucket='your-s3-bucket-name',                    # target bucket
    Body=b'bytes or a seekable file-like object',    # the data itself
    Key='your/object/key'                            # where the object will live in the bucket
)
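For instance, if the processed data is already sitting in memory, it can be handed to put_object either as raw bytes or wrapped in an io.BytesIO buffer. A minimal sketch; the bucket name, keys and sample data below are placeholders:

import io
import boto3

client = boto3.client('s3')

processed = b'col1,col2\n1,2\n'   # illustrative in-memory result of your processing

# Option 1: pass the raw bytes directly as the Body
client.put_object(Bucket='your-s3-bucket-name', Key='output/part-0001.csv', Body=processed)

# Option 2: wrap the bytes in a seekable file-like object
buffer = io.BytesIO(processed)
client.put_object(Bucket='your-s3-bucket-name', Key='output/part-0002.csv', Body=buffer)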
This works with the S3 client's put_object method:
key = 'filename'
response = s3.put_object(Bucket='Bucket_Name',
                         Body=json_data,
                         Key=key)
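As a rough illustration (assuming s3 is a boto3 client and json_data is a JSON string built in memory; the record and key names are my own placeholders), the data never has to touch local disk:

import json
import boto3

s3 = boto3.client('s3')

record = {'id': 1, 'status': 'processed'}   # illustrative payload
json_data = json.dumps(record)              # serialize in memory, no temp file needed

key = 'results/record-1.json'
response = s3.put_object(Bucket='Bucket_Name',
                         Body=json_data,
                         Key=key,
                         ContentType='application/json')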
Related
I'm looking for a Pythonic way to copy a file from AWS S3 to GCS.
I do not want to open/read the file and then use the blob.upload_from_string() method; I want to transfer it 'as-is'.
I cannot use gsutil. The libraries I'm working with are gcloud and boto3 (I've also experimented with s3fs).
Here is a simple example (that seems to work) using the blob.upload_from_string() method, which I'm trying to avoid because I don't want to open/read the file. I fail to make it work using the blob.upload_from_file() method because the GCS API requires an accessible, readable, file-like object, which I fail to provide properly.
What am I missing? Suggestions?
import boto3
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials

GSC_Token_File = 'path/to/GSC_token'

s3 = boto3.client('s3', region_name='MyRegion')  # I'm running from AWS Lambda, no explicit credentials required

gcs_credentials = ServiceAccountCredentials.from_json_keyfile_name(GSC_Token_File)
gcs_storage_client = storage.Client(credentials=gcs_credentials, project='MyGCP_project')
gcs_bucket = gcs_storage_client.get_bucket('MyGCS_bucket')

# Read the whole S3 object into memory as a string...
s3_file_to_load = s3.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')['Body'].read().decode('utf-8')

# ...and upload it to GCS from that string
blob = gcs_bucket.blob('file_to_copy.txt')
blob.upload_from_string(s3_file_to_load)
So I poked around a bit more and came across this article, which eventually led me to this solution. Apparently the GCS API can be called using the AWS boto3 SDK.
Please mind the HMAC key prerequisite; the key can easily be created using these instructions.
import boto3

# I'm using a GCP Service Account, so my HMAC key was created accordingly.
# An HMAC key for a User Account can be created just as well.
service_Access_key = 'YourAccessKey'
service_Secret = 'YourSecretKey'

# Reminder: I am copying from S3 to GCS
s3_client = boto3.client('s3', region_name='MyRegion')

gcs_client = boto3.client(
    's3',  # yes, 's3': boto3 just needs the S3-compatible endpoint below
    region_name='auto',
    endpoint_url='https://storage.googleapis.com',
    aws_access_key_id=service_Access_key,
    aws_secret_access_key=service_Secret,
)

file_to_transfer = s3_client.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')
gcs_client.upload_fileobj(file_to_transfer['Body'], 'MyGCS_bucket', 'file_to_copy.txt')
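If more than one object needs to be moved, the same pair of clients can be reused. Here is an untested sketch (the helper name, prefix and bucket names are my own placeholders) that walks an S3 prefix and streams each object straight through to GCS:

def copy_prefix(s3_client, gcs_client, s3_bucket, gcs_bucket, prefix=''):
    # List every object under the prefix and push its body straight to GCS
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=s3_bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            body = s3_client.get_object(Bucket=s3_bucket, Key=obj['Key'])['Body']
            gcs_client.upload_fileobj(body, gcs_bucket, obj['Key'])

copy_prefix(s3_client, gcs_client, 'MyS3_bucket', 'MyGCS_bucket', prefix='path/to/')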
I understand you're trying to move files from S3 to GCS using Python in an AWS Lambda function. There is one thing I'd like to clarify about the statement "I don't want to open/read the file": when the file is downloaded from S3, you are indeed reading it and writing it somewhere, be it into an in-memory string or into a temporary file. In that sense it doesn't really matter which of blob.upload_from_file() or blob.upload_from_string() is used; they are equivalent, except that the first reads from a file while the second doesn't because the data has already been read into memory. Therefore my suggestion would be to keep the code as it is; I don't see a benefit in changing it.
Anyway, the file-based approach should be possible with something along the lines below (untested, I have no S3 to check against):
# From the boto3 S3 docs: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
with open('FILE_NAME', 'rb') as local_file:
    blob.upload_from_file(local_file)
Finally, it is worth mentioning the Storage Transfer Service, which is intended for moving huge amounts of data from S3 to GCS. If that sounds like your use case, you may want to take a look at its code samples for Python.
I know how to upload a file to S3 buckets in Python. I am looking for a way to upload data to a file in an S3 bucket directly. That way, I do not need to save my data to a local file and then upload the file. Any suggestions? Thanks!
AFAIK standard Object.put() supports this.
s3 = boto3.resource('s3')   # resource API
resp = s3.Object('bucket_name', 'key/key.txt').put(Body=b'data')
Edit: it was pointed out that you might want the client method, which is just put_object with the kwargs organized differently:
client.put_object(Body=b'data', Bucket='bucket_name', Key='key/key.txt')
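If the in-memory data is wrapped in a file-like object (for example an io.BytesIO), the client also offers upload_fileobj, which manages multipart uploads for larger payloads. A small sketch using the same placeholder bucket and key:

import io
import boto3

client = boto3.client('s3')

buffer = io.BytesIO(b'data generated in memory')   # nothing is written to local disk
client.upload_fileobj(buffer, 'bucket_name', 'key/key.txt')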
I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble finding out how to do it in Python.
I have already written the code in Kotlin (because it was easiest for me and, for reasons outside the scope of my question, we want it to run in Python). In Kotlin, the Google SDK allows me to get an InputStream from the Blob object, which I can then inject into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the Python SDK it seems I only have the options of downloading the blob to a file or as a string.
I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.
I am in no way a Python pro, so I might have missed an obvious way of doing this.
I ended up with the following solution, as apparently download_to_file can write the data into a file-like object that the boto3 S3 client can handle.
This works just fine for smaller files, but as it buffers everything in memory, it could be problematic for larger files.
from io import BytesIO
import boto3
from google.cloud import storage

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)

    # Download the blob into an in-memory buffer instead of a local file
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)

    # Upload the buffer straight to S3
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
If anyone has information about something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3 without having to buffer it in memory on the host machine), it would be very much appreciated.
google-resumable-media can be used to download the file in chunks from GCS, and smart_open to upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
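A hedged sketch of that chunked approach using only smart_open (it can read gs:// and write s3:// URLs when its GCS and S3 extras are installed); the bucket names, object key and chunk size are illustrative assumptions:

from smart_open import open as smart_open

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per read, an arbitrary choice

# Stream from GCS and write to S3 without holding the whole file in memory
with smart_open('gs://my-gcs-bucket/exports/data.avro', 'rb') as src, \
        smart_open('s3://my-s3-bucket/exports/data.avro', 'wb') as dst:
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        dst.write(chunk)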
I have a server which files get uploaded to, and I want to be able to forward these on to S3 using boto; I have to do some processing on the data, basically as it gets uploaded to S3.
The problem I have is that, the way they get uploaded, I need to provide a writable stream that incoming data gets written to, while to upload with boto I need a readable stream. So it's like I have two ends that don't connect. Is there a way to upload to S3 with a writable stream? If so, it would be easy: I could pass the upload stream to S3 and the execution would chain along.
If there isn't, I have two loose ends which need something in between with a sort of buffer that can read from the upload to keep it moving, and expose a read method that I can give to boto so it can read. But doing this I'm sure I'd need to thread the S3 upload part, which I'd rather avoid as I'm using Twisted.
I have a feeling I'm way overcomplicating things, but I can't come up with a simple solution. This has to be a common-ish problem; I'm just not sure how to put it into words well enough to search for it.
boto is a Python library with a blocking API. This means you'll have to use threads to use it while maintaining the concurrent operation that Twisted provides you with (just as you would have to use threads to have any concurrency when using boto without Twisted; i.e., Twisted does not help make boto non-blocking or concurrent).
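For completeness, the thread route could look roughly like this untested sketch, using Twisted's deferToThread to keep the blocking boto call off the reactor thread (the helper names and bucket/key arguments are my own placeholders):

import boto
from boto.s3.key import Key
from twisted.internet.threads import deferToThread

def blocking_upload(bucket_name, key_name, fileobj):
    # Plain blocking boto call; must not run on the reactor thread
    conn = boto.connect_s3()
    bucket = conn.get_bucket(bucket_name)
    k = Key(bucket)
    k.key = key_name
    k.set_contents_from_file(fileobj)

def upload_async(bucket_name, key_name, fileobj):
    # Returns a Deferred that fires when the thread-pool upload finishes
    return deferToThread(blocking_upload, bucket_name, key_name, fileobj)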
Instead, you could use txAWS, a Twisted-oriented library for interacting with AWS. txaws.s3.client provides methods for interacting with S3. If you're familiar with boto or AWS, some of these should already look familiar. For example, create_bucket or put_object.
txAWS would be better if it provided a streaming API so you could upload to S3 as the file is being uploaded to you. I think that this is currently in development (based on the new HTTP client in Twisted, twisted.web.client.Agent) but perhaps not yet available in a release.
You just have to wrap the stream in a file-like object. Essentially, the stream object should have a read method that blocks until the file has been completely uploaded.
After that you just use the S3 API:
from boto.s3.key import Key

bucketname = 'my_bucket'
conn = create_storage_connection()  # however you create your boto S3 connection

buckets = conn.get_all_buckets()
bucket = None
for b in buckets:
    if b.name == bucketname:
        bucket = b
if not bucket:
    raise Exception('Bucket with name ' + bucketname + ' not found')

k = Key(bucket)
k.key = key                                   # your object name
k.set_contents_from_file(MyFileLikeStream)    # the wrapped, readable stream
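The "read method that blocks" part could be sketched as below (untested; the class and its internals are my own invention): the upload handler pushes chunks in with write(), and boto pulls them out through read(). Because read() blocks, the boto upload consuming this object has to run on a different thread from the code calling write(), which is exactly the threading caveat from the previous answer.

import queue

class StreamToFileLike(object):
    """File-like bridge: the upload side calls write(), boto calls read(),
    which blocks until data (or end-of-stream) arrives."""

    def __init__(self):
        self._chunks = queue.Queue()
        self._buffer = b''
        self._eof = False

    def write(self, data):
        self._chunks.put(data)

    def close(self):
        self._chunks.put(None)   # signal end-of-stream to the reader

    def read(self, size=-1):
        while not self._eof and (size < 0 or len(self._buffer) < size):
            chunk = self._chunks.get()   # blocks until the writer supplies data
            if chunk is None:
                self._eof = True
                break
            self._buffer += chunk
        if size < 0:
            data, self._buffer = self._buffer, b''
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data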
I have a general question regarding uploads from a client (in this case an iPhone app) to S3. I'm using Django to write my web service on an EC2 instance. The following method is the bare minimum to upload a file to S3, and it works very well with smaller files (JPGs or PNGs < 1 MB):
def store_in_s3(filename, content):
    conn = S3Connection(settings.ACCESS_KEY, settings.PASS_KEY)  # gets access key and pass key from settings.py
    bucket = conn.create_bucket('somebucket')
    k = Key(bucket)  # create key on this bucket
    k.key = filename
    mime = mimetypes.guess_type(filename)[0]
    k.set_metadata('Content-Type', mime)
    k.set_contents_from_string(content)
    k.set_acl('public-read')

def uploadimage(request, name):
    if request.method == 'PUT':
        store_in_s3(name, request.raw_post_data)
        return HttpResponse("Uploading raw data to S3 ...")
    else:
        return HttpResponse("Upload not successful!")
I'm quite new to all of this, so I still don't understand what happens here. Is it the case that:
Django receives the file and saves it in the memory of my EC2 instance?
Should I avoid using raw_post_data and rather do chunks to avoid memory issues?
Once Django has received the file, will it pass the file to the function store_in_s3?
How do I know if the upload to S3 was successful?
So in general, I wonder whether one line of my code is executed after another. E.g., when does return HttpResponse("Uploading raw data to S3 ...") fire: while the file is still uploading, or only after it was successfully uploaded?
Thanks for your help. Also, I'd be grateful for any docs that treat this subject. I had a look at the chapters in O'Reilly's Python & AWS Cookbook, but as it's only code samples, it doesn't really answer my questions.
Django stores small uploaded files in memory. Over a certain size, it will store the upload in a temporary file on disk instead.
Yes, chunking is helpful for memory savings as well:
for file_chunk in uploaded_file.chunks():
    saved_file.write(file_chunk)
All of these operations are synchronous, so Django will wait until the file is fully uploaded before it attempts to store it in S3. The S3 upload also has to complete before the call returns, so you are guaranteed that the file has been uploaded through Django and on to S3 before you receive your HttpResponse().
Django's File Uploads documentation has a lot of great info on how to handle various upload situations.
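A rough, untested sketch of how that could fit your view, assuming the client sends a regular multipart form POST with a field named 'file' rather than a raw PUT body (the field name, helper name and response texts are placeholders); boto's set_contents_from_file() reads the Django-managed upload object directly, so the content never has to be held as one big string:

from boto.s3.connection import S3Connection
from boto.s3.key import Key
from django.conf import settings
from django.http import HttpResponse

def uploadimage(request, name):
    if request.method == 'POST' and 'file' in request.FILES:
        uploaded_file = request.FILES['file']   # Django has already buffered or spooled this for you
        store_in_s3_from_file(name, uploaded_file)
        return HttpResponse("Uploaded " + name + " to S3.")
    return HttpResponse("Upload not successful!")

def store_in_s3_from_file(filename, fileobj):
    conn = S3Connection(settings.ACCESS_KEY, settings.PASS_KEY)
    bucket = conn.create_bucket('somebucket')
    k = Key(bucket)
    k.key = filename
    k.set_contents_from_file(fileobj)   # boto reads the file-like object, no full string in memory
    k.set_acl('public-read')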
You might want to look at django-storages. It facilitates storing files on S3 and a bunch of other services/platforms.
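With django-storages, the S3 plumbing moves into settings and FileFields save straight to the bucket. A rough sketch of the boto3-backend configuration (the bucket name, credentials and model are placeholders, and exact setting names vary between django-storages versions):

# settings.py
INSTALLED_APPS += ['storages']

DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'
AWS_STORAGE_BUCKET_NAME = 'somebucket'

# models.py -- FileField uploads now land in S3 instead of local disk
from django.db import models

class Upload(models.Model):
    image = models.FileField(upload_to='uploads/')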