I have a server which files get uploaded to, and I want to forward these on to S3 using boto; I also have to do some processing on the data as it gets uploaded to S3.
The problem is that, the way uploads arrive, I need to provide a writable stream that incoming data gets written to, while to upload with boto I need a readable stream. So it's like I have two ends that don't connect. Is there a way to upload to S3 with a writable stream? If so it would be easy: I could pass the upload stream to S3 and the execution would chain along.
If there isn't, I have two loose ends and need something in between with a sort of buffer that can read from the upload to keep that moving, and expose a read method that I can give to boto. But doing this I'm sure I'd need to run the S3 upload part in a thread, which I'd rather avoid as I'm using Twisted.
I have a feeling I'm way over-complicating things, but I can't come up with a simple solution. This has to be a fairly common problem; I'm just not sure how to put it into words well enough to search for it.
boto is a Python library with a blocking API. This means you'll have to use threads to use it while maintaining the concurrent operation that Twisted provides you with (just as you would have to use threads to get any concurrency when using boto without Twisted; i.e., Twisted does not make boto non-blocking or concurrent).
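For illustration, here is a minimal sketch of that threading approach (untested; upload_to_s3 and incoming_stream are hypothetical names of mine, not part of boto or Twisted): twisted.internet.threads.deferToThread runs the blocking boto call in the reactor's thread pool and hands back a Deferred, so the reactor keeps servicing other connections while the upload runs.

from twisted.internet import threads

def upload_to_s3(bucket_name, key_name, fileobj):
    # plain blocking boto code would go here: open an S3Connection,
    # create a Key, and call set_contents_from_file(fileobj)
    ...

def on_done(result):
    print("S3 upload finished")

# incoming_stream: whatever file-like object your upload handler feeds
d = threads.deferToThread(upload_to_s3, 'my-bucket', 'my-key', incoming_stream)
d.addCallback(on_done)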
Instead, you could use txAWS, a Twisted-oriented library for interacting with AWS. txaws.s3.client provides methods for interacting with S3. If you're familiar with boto or AWS, some of these should already look familiar. For example, create_bucket or put_object.
txAWS would be better if it provided a streaming API so you could upload to S3 as the file is being uploaded to you. I think that this is currently in development (based on the new HTTP client in Twisted, twisted.web.client.Agent) but perhaps not yet available in a release.
You just have to wrap the stream in a file-like object. Essentially, the stream object should have a read method that blocks until data from the upload is available (or the upload has completed).
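A rough sketch of such a wrapper (untested; the class name and details are mine, not part of boto): the upload side write()s chunks in, and read() blocks on a queue until data or end-of-stream arrives.

import queue

class BlockingStreamAdapter:
    _EOF = object()

    def __init__(self):
        self._chunks = queue.Queue()
        self._buffer = b""
        self._closed = False

    def write(self, data):
        # called by the incoming-upload side as bytes arrive
        self._chunks.put(data)

    def close(self):
        # called by the incoming-upload side when the upload is finished
        self._chunks.put(self._EOF)

    def read(self, size=-1):
        # called by boto; blocks until enough data (or end-of-stream) has arrived
        while not self._closed and (size < 0 or len(self._buffer) < size):
            chunk = self._chunks.get()
            if chunk is self._EOF:
                self._closed = True
                break
            self._buffer += chunk
        if size < 0:
            data, self._buffer = self._buffer, b""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

Note that boto may also want the object's size (and an MD5) up front, since a one-pass wrapper like this cannot be seeked and re-read.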
After that you just use the S3 API:
from boto.s3.key import Key

bucketname = 'my_bucket'
conn = create_storage_connection()  # helper that returns a boto S3 connection
buckets = conn.get_all_buckets()
bucket = None
for b in buckets:
    if b.name == bucketname:
        bucket = b
if not bucket:
    raise Exception('Bucket with name ' + bucketname + ' not found')

k = Key(bucket)
k.key = key  # the destination key name
# use set_contents_from_file for a file-like object
# (set_contents_from_filename expects a path on disk)
k.set_contents_from_file(MyFileLikeStream)
Related
I'm looking for a Pythonic way to copy a file from AWS S3 to GCS.
I do not want to open/read the file and then use the blob.upload_from_string() method. I want to transfer it 'as-is'.
I cannot use 'gsutil'. The scope of the libraries I'm working with is gcloud and boto3 (I also experimented with s3fs).
Here is a simple example (that seems to work) using the blob.upload_from_string() method, which I'm trying to avoid because I don't want to open/read the file. I fail to make it work using the blob.upload_from_file() method because the GCS API requires an accessible, readable, file-like object, which I fail to provide properly.
What am I missing? Suggestions?
import boto3
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials

GSC_Token_File = 'path/to/GSC_token'
s3 = boto3.client('s3', region_name='MyRegion')  # I'm running from AWS Lambda, no authentication required

# from_json_keyfile_name expects a path; from_json_keyfile_dict would expect an already-parsed dict
gcs_credentials = ServiceAccountCredentials.from_json_keyfile_name(GSC_Token_File)
gcs_storage_client = storage.Client(credentials=gcs_credentials, project='MyGCP_project')
gcs_bucket = gcs_storage_client.get_bucket('MyGCS_bucket')

# read the whole S3 object into memory as a string...
s3_file_to_load = s3.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')['Body'].read().decode('utf-8')

# ...and write it back out to GCS
blob = gcs_bucket.blob('file_to_copy.txt')
blob.upload_from_string(s3_file_to_load)
So I poked around a bit more and came across this article, which eventually led me to this solution. Apparently the GCS API can be called using the AWS boto3 SDK.
Please mind the HMAC key prerequisite; one can easily be created using these instructions.
import boto3

# I'm using a GCP service account, so my HMAC key was created accordingly.
# An HMAC key for a user account can be created just as well.
service_Access_key = 'YourAccessKey'
service_Secret = 'YourSecretKey'

# Reminder: I am copying from S3 to GCS
s3_client = boto3.client('s3', region_name='MyRegion')
gcs_client = boto3.client(
    "s3",  # yes, literally "s3": boto3 talks to GCS through its S3-compatible endpoint
    region_name="auto",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id=service_Access_key,
    aws_secret_access_key=service_Secret,
)

file_to_transfer = s3_client.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')
gcs_client.upload_fileobj(file_to_transfer['Body'], 'MyGCS_bucket', 'file_to_copy.txt')
I understand you're trying to move files from S3 to GCS using Python in an AWS Lambda function. One thing I'd like to clarify about the statement "I don't want to open/read the file": when the file is downloaded from S3 you are indeed reading it and writing it somewhere, be it into an in-memory string or to a temporary file. In that sense it actually doesn't matter whether blob.upload_from_file() or blob.upload_from_string() is used, as they're equivalent; the first reads from a file and the second doesn't, because the data has already been read into memory. Therefore my suggestion would be to keep the code as it is; I don't see a benefit in changing it.
Anyway, the file approach should be possible with something along the lines below (untested, I have no S3 to check):
# From the S3 boto3 docs: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
# upload_from_filename takes a path; upload_from_file would expect a file object
blob.upload_from_filename('FILE_NAME')
Finally, it is worth mentioning the Storage Transfer Service, which is intended for moving huge amounts of data from S3 to GCS. If that sounds like your use case, you may take a look at its code samples for Python.
I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble finding out how to do it in Python.
I have already written the code in Kotlin (because it was easiest for me), but for reasons outside the scope of my question we want it to run in Python. In Kotlin the Google SDK allows me to get an InputStream from the Blob object, which I can then pass into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the Python SDK it seems I only have the options of downloading the blob to a file or as a string.
I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.
I am in no way a Python pro, so I might have missed an obvious way of doing this.
I ended up with the following solution, as download_to_file can apparently write the data into a file-like object (here a BytesIO buffer) that the boto3 S3 client can then handle.
This works just fine for smaller files, but as it buffers it all in memory, it could be problematic for larger files.
import boto3
from google.cloud import storage
from io import BytesIO

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)

    # buffer the whole object in memory
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)

    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
If anyone knows of something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3 without having to buffer it all in memory on the host machine), it would be very much appreciated.
google-resumable-media can be used to download the file in chunks from GCS, and smart_open to upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
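A minimal sketch of that streaming idea (untested; bucket and object names are placeholders), assuming smart_open is installed with both its GCS and S3 extras and that default google-cloud-storage and boto3 credentials are available:

from smart_open import open as smart_open

# Copy gs://source-bucket/big_file.csv to s3://dest-bucket/big_file.csv
# in fixed-size chunks, so only one chunk is held in memory at a time.
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB

with smart_open("gs://source-bucket/big_file.csv", "rb") as src, \
     smart_open("s3://dest-bucket/big_file.csv", "wb") as dst:
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        dst.write(chunk)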
While I was referring to the sample code for uploading a file to S3, I found the following two ways.
Using put_object() on a boto3 resource:
import boto3

s3_resource = boto3.resource('s3')
s3_resource.Bucket(BUCKET).put_object(Key='test', Body=b'some data')
Using boto3.s3.transfer.upload_file():
from boto3.s3.transfer import S3Transfer

client = boto3.client('s3')
transfer = S3Transfer(client)
transfer.upload_file('/my_file', BUCKET, 'test')
I could not figure out the difference between the two ways. Are there any advantages of using one over the other in specific use cases? Can anyone please elaborate? Thank you.
The put_object method maps directly to the low-level S3 API request. It will attempt to send the entire body in one request; there is no multipart support (boto3 docs).
The upload_file method is handled by the S3 Transfer Manager, which means it will automatically handle multipart uploads behind the scenes for you, if necessary.
From the boto3 docs, the transfer manager handles several things for you:
- Automatically switching to multipart transfers when a file is over a specific size threshold
- Uploading/downloading a file in parallel
- Progress callbacks to monitor transfers
- Retries. While botocore handles retries for streaming uploads, it is not possible for it to handle retries for streaming downloads. This module handles retries for both cases so you don't need to implement any retry logic yourself.
It has a reasonable set of defaults and also allows you to configure many aspects of the transfer process, including the multipart threshold size, max parallel downloads, socket timeouts, and retry amounts (boto3 docs).
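If you need to tune those knobs, here is a hedged sketch using boto3.s3.transfer.TransferConfig (file, bucket, and key names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Switch to multipart above 16 MiB, send 8 MiB parts, use up to 8 threads.
config = TransferConfig(
    multipart_threshold=16 * 1024 * 1024,
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=8,
)

s3.upload_file('/path/to/local_file', 'my-bucket', 'my-key', Config=config)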
There is likely no difference - boto3 sometimes has multiple ways to achieve the same thing. See http://boto3.readthedocs.io/en/latest/guide/s3.html#uploads for more details on uploading files.
I'm using boto3 and trying to upload files. It would be helpful if anyone could explain the exact difference between the upload_file() and put_object() S3 methods in boto3.
Is there any performance difference?
Does either of these handle the multipart upload feature behind the scenes?
What are the best use cases for both?
The upload_file method is handled by the S3 Transfer Manager, which means it will automatically handle multipart uploads behind the scenes for you, if necessary.
The put_object method maps directly to the low-level S3 API request. It does not handle multipart uploads for you. It will attempt to send the entire body in one request.
One other difference worth noting is that the upload_file() API allows you to track the upload using a callback function. You can read about it here.
Also, as already mentioned by boto's creator @garnaat, upload_file() uses multipart behind the scenes, so it's not straightforward to check end-to-end file integrity (there is a way, though). put_object(), on the other hand, uploads the whole file in one shot (capped at 5 GB), making it easier to check integrity by passing Content-MD5, which is already provided as a parameter of the put_object() API.
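Hedged sketches of both points (bucket, key, and file names are placeholders of mine): a progress callback passed to upload_file(), and a Content-MD5 check passed to put_object():

import base64
import hashlib
import os

import boto3

s3 = boto3.client('s3')

# 1) Progress callback with upload_file(); the callback receives the number
#    of bytes transferred in each chunk.
class ProgressPercentage:
    def __init__(self, filename):
        self._size = os.path.getsize(filename)
        self._seen = 0

    def __call__(self, bytes_amount):
        self._seen += bytes_amount
        print(f"{self._seen} / {self._size} bytes sent")

s3.upload_file('/tmp/my_file.json', 'my-bucket', 'my_file.json',
               Callback=ProgressPercentage('/tmp/my_file.json'))

# 2) Integrity check with put_object(): S3 rejects the request if the
#    base64-encoded MD5 digest doesn't match the body it received.
body = b'some data'
md5_b64 = base64.b64encode(hashlib.md5(body).digest()).decode()
s3.put_object(Bucket='my-bucket', Key='my_data', Body=body, ContentMD5=md5_b64)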
One other thing to mention is that put_object() takes the data itself (bytes or a file object) as Body, whereas upload_file() takes the path of the file to upload. For example, if I have a JSON file already stored locally, I would use upload_file(Filename='/tmp/my_file.json', Bucket=my_bucket, Key='my_file.json').
Whereas if I had a dict in my job, I could transform the dict into JSON and use put_object() like so:
import json
import boto3

s3_client = boto3.client('s3')
records_to_update = {'Name': 'Sally'}
records_to_update_json = json.dumps(records_to_update, default=str)
s3_client.put_object(Body=records_to_update_json, Bucket=my_bucket, Key='my_records')
I have a general question regarding uploads from a client (in this case an iPhone app) to S3. I'm using Django to write my web service on an EC2 instance. The following code is the bare minimum to upload a file to S3, and it works very well with smaller files (JPGs or PNGs < 1 MB):
import mimetypes

from boto.s3.connection import S3Connection
from boto.s3.key import Key
from django.conf import settings
from django.http import HttpResponse

def store_in_s3(filename, content):
    conn = S3Connection(settings.ACCESS_KEY, settings.PASS_KEY)  # gets access key and pass key from settings.py
    bucket = conn.create_bucket('somebucket')
    k = Key(bucket)  # create key on this bucket
    k.key = filename
    mime = mimetypes.guess_type(filename)[0]
    k.set_metadata('Content-Type', mime)
    k.set_contents_from_string(content)
    k.set_acl('public-read')

def uploadimage(request, name):
    if request.method == 'PUT':
        store_in_s3(name, request.raw_post_data)
        return HttpResponse("Uploading raw data to S3 ...")
    else:
        return HttpResponse("Upload not successful!")
I'm quite new to all of this, so I still don't understand what happens here. Is it the case that:
1. Django receives the file and saves it in the memory of my EC2 instance?
2. Should I avoid using raw_post_data and instead process the data in chunks to avoid memory issues?
3. Once Django has received the file, will it pass the file to the function store_in_s3?
4. How do I know whether the upload to S3 was successful?
So in general, I wonder whether one line of my code will be executed after another. E.g., when will return HttpResponse("Uploading raw data to S3 ...") fire: while the file is still uploading, or only after it was successfully uploaded?
Thanks for your help. Also, I'd be grateful for any docs that cover this subject. I had a look at the chapters in O'Reilly's Python & AWS Cookbook, but as it's only code samples, it doesn't really answer my questions.
Django stores small uploaded files in memory. Over a certain size, it will store the upload in a temporary file on disk instead.
Yes, chunking is helpful for memory savings as well:
# uploaded_file is an UploadedFile from request.FILES; saved_file is any
# writable destination opened beforehand
for file_chunk in uploaded_file.chunks():
    saved_file.write(file_chunk)
All of these operations are synchronous, so Django will wait until the file is fully uploaded before it attempts to store it in S3. The S3 call will also have to complete before it returns, so you are guaranteed that the file has been uploaded through Django and on to S3 before you receive your HttpResponse().
Django's File Uploads documentation has a lot of great info on how to handle various upload situations.
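To make the chunk handling concrete, here is a hedged sketch (untested; it assumes a multipart POST with a form field named "file", and the bucket/settings names simply mirror the question's code) that hands Django's uploaded file object straight to boto instead of reading the raw body into a string:

import mimetypes

from boto.s3.connection import S3Connection
from boto.s3.key import Key
from django.conf import settings
from django.http import HttpResponse

def upload_image(request, name):
    if request.method == 'POST' and 'file' in request.FILES:
        uploaded_file = request.FILES['file']  # in-memory or temp-file backed, per Django's upload handlers
        conn = S3Connection(settings.ACCESS_KEY, settings.PASS_KEY)
        bucket = conn.create_bucket('somebucket')
        k = Key(bucket)
        k.key = name
        k.set_metadata('Content-Type', mimetypes.guess_type(name)[0] or 'application/octet-stream')
        # set_contents_from_file reads the file object itself, so Django's
        # memory-vs-temp-file handling is reused as-is
        k.set_contents_from_file(uploaded_file)
        k.set_acl('public-read')
        return HttpResponse("Uploaded to S3")
    return HttpResponse("Upload not successful!", status=400)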
You might want to look at django-storages. It facilitates storing files on S3 and a bunch of other services/platforms.