I'm using the boto3 library to put objects in Amazon S3. I want to run a Python service on my server that is connected to a bucket in AWS, so that whenever I send it a file path, it puts that file in the bucket:
import boto3

s3_resource = boto3.resource(
    's3',
    endpoint_url='...',
    aws_access_key_id='...',
    aws_secret_access_key='...'
)
bucket = s3_resource.Bucket('name')
For uploading, I send my requests to this method:
def upload(path):
    bucket.put_object(...)
The connection to the bucket should be persistent, so that whenever I call the upload method it quickly puts the object in the bucket and does not need to connect to the bucket every time.
How can I enable long-lived connections on my s3_resource?
Edit
The SDK tries to be an abstraction from the underlying API calls.
Whenever you want to put an object into an S3 bucket, that results in an API call. API calls are sent over the network to AWS and that requires establishing a connection to an AWS server. This connection can be kept open for a longer time, so it doesn't need to be re-established every time you want to make an API call. This helps reduce the network overhead, since establishing connections is relatively costly.
From your perspective these should be implementation details you shouldn't have to worry about, since the SDK (boto3) takes care of that for you. There are some options to tweak how it handles things, but these are considered advanced options and you should know what you're doing ;-)
The lifecycle of the resources in boto3 is more or less independent of the underlying network connection. The way you will see this impact you is through higher latencies when there is no pre-existing connection that can be reused.
What you're looking for are the keep-alive options in boto3.
There are two levels on which these can be enabled:
TCP
You can set the tcp_keepalive option in the SDK config, which is set to false by default.
More detail on that can be found in the documentation.
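For example, applied to the resource from the question, a minimal sketch could look like this (this assumes a botocore version recent enough to support the tcp_keepalive config option):

import boto3
from botocore.config import Config

s3_resource = boto3.resource(
    's3',
    endpoint_url='...',
    aws_access_key_id='...',
    aws_secret_access_key='...',
    config=Config(tcp_keepalive=True)  # off by default
)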
HTTP
For HTTP keep-alive, there is nothing you need to do explicitly - the underlying library handles that implicitly. A common optimization suggestion for aws-sdk-js is to tweak this, but the SDKs behave differently, and that's not necessary in Python. There is a long discussion about this in a GitHub issue.
If you want to configure the setting explicitly, you can use the event system to do that, as this reply suggests:
import boto3

def set_connection_header(request, operation_name, **kwargs):
    request.headers['Connection'] = 'keep-alive'

ddb = boto3.client('dynamodb')
ddb.meta.events.register('request-created.dynamodb', set_connection_header)
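That snippet registers the handler for DynamoDB; for the S3 case in your question, the same pattern should work with the S3 event name (untested sketch):

s3 = boto3.client('s3')
s3.meta.events.register('request-created.s3', set_connection_header)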
Related
I am new to AWS and have to copy a file from an S3 bucket to an on-prem server.
As I understand it, the Lambda would get triggered by an S3 file-upload event notification. But what could be used in the Lambda to send the file securely?
Your best bet may be to create a hybrid network so AWS Lambda can talk directly to an on-prem server. Then you could directly copy the file server-to-server over the network. That's a big topic to cover, with lots of other considerations that probably go well beyond this simple question/answer.
You could send it via an HTTPS web request. You could easily write code to send it like this, but that implies you have something on the other end set up to receive it--some sort of HTTPS web server/API. Again, that could be a big topic, but here's a description of how you might do that in Python.
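As a rough sketch of the "push over HTTPS" idea (the endpoint URL is made up, and the requests library would have to be bundled with the Lambda deployment package):

import boto3
import requests

s3 = boto3.client('s3')

def handler(event, context):
    # Pull the bucket/key out of the S3 event notification.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']  # keys with special characters may need URL-decoding

    # Read the object and push it to a hypothetical on-prem HTTPS endpoint.
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    resp = requests.post(
        'https://files.example.com/ingest',
        files={'file': (key, body)},
        timeout=30,
    )
    resp.raise_for_status()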
Another option would be to use SNS to notify something on-premises whenever a file is uploaded. Then you could write code to pull the file down from S3 on your end. The key thing is that this pull is initiated by code on your side. Maybe it gets triggered in response to an SNS email or something like that, but the network flow is on-premises fetching the file from S3, versus the Lambda pushing it to on-premises.
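As a sketch of that pull-based approach, the on-prem side might run something like this whenever it is notified (the local path is a placeholder):

import boto3

s3 = boto3.client('s3')

def fetch(bucket, key):
    # Download the object to a local path. The connection is initiated from
    # on-prem, so nothing needs to reach into the on-prem network.
    s3.download_file(bucket, key, '/data/incoming/' + key.split('/')[-1])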
There are many other options. You probably need to do more research and decide your architectural approach before getting too deep into the implementation details.
I have a process that is supposed to run forever and needs to update data in an S3 bucket on AWS. I am initializing the session using boto3:
import boto3

session = boto3.session.Session()
my_s3 = session.resource(
    "s3",
    region_name=my_region_name,
    aws_access_key_id=my_aws_access_key_id,
    aws_secret_access_key=my_aws_secret_access_key,
    aws_session_token=my_aws_session_token,
)
Since the process is supposed to run for days, I am wondering how I can make sure that the session is kept alive and working. Do I need to re-initialize the session sometimes?
Note: not sure if it is useful, but I have actually multiple threads each using its own session.
Thanks!
There is no concept of a 'session'. The Session() is simply an in-memory object that contains information about how to connect to AWS.
It does not actually involve any calls to AWS until an action is performed (e.g. ListBuckets). Actions are RESTful and return results immediately. They do not keep a connection open.
A Session is not normally required. If you have stored your AWS credentials in a config file using the AWS CLI aws configure command, you can simply use:
import boto3
s3_resource = boto3.resource('s3')
The Session, however, is useful if you are using temporary credentials returned by an AssumeRole() command, rather than permanent credentials. In such a case, please note that credentials returned by AWS Security Token Service (STS) such as AssumeRole() have time limitations. This, however, is unrelated to the concept of a boto3 Session.
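If you are in that temporary-credentials situation, a rough sketch of refreshing them yourself could look like this (the role ARN and session name are placeholders); you would call it again and rebuild the resource before the credentials expire:

import boto3

def make_s3_resource():
    # Ask STS for fresh temporary credentials via AssumeRole.
    sts = boto3.client('sts')
    creds = sts.assume_role(
        RoleArn='arn:aws:iam::123456789012:role/my-role',  # placeholder
        RoleSessionName='my-session',
    )['Credentials']

    # Build a new Session/resource from the returned credentials.
    session = boto3.session.Session(
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken'],
    )
    return session.resource('s3')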
Right now, to create the Google Cloud Storage client, I'm using:
client = storage.Client.from_service_account_json('creds.json')
But I need to change the client dynamically and would prefer not to deal with storing auth files on the local filesystem.
So, is there another way to connect by passing the credentials as a variable?
Something like for AWS and boto3:
iam_client = boto3.client(
    'iam',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY
)
I guess I'm missing something in the docs and would be happy if someone could point me to where I can find this.
If you want to use built-in methods, an option could be to call the Client() constructor (Cloud Storage) yourself, passing it credentials instead of a key file; the Cloud Storage client library docs and the Client() constructor reference are helpful for that.
Another possible option, to avoid storing auth files locally, is to use an environment variable pointing to credentials kept outside of your application's code, or a secret store such as Cloud Key Management Service. For more context you can take a look at this article.
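Concretely, as far as I know you can build the credentials object from an in-memory dict with google-auth and pass it to the client constructor; a minimal sketch (the environment variable name is made up):

import json
import os

from google.cloud import storage
from google.oauth2 import service_account

# The service account key JSON held in memory, e.g. loaded from an
# environment variable or a secret manager instead of creds.json.
service_account_info = json.loads(os.environ['GCP_SERVICE_ACCOUNT_JSON'])

credentials = service_account.Credentials.from_service_account_info(service_account_info)
client = storage.Client(
    project=service_account_info['project_id'],
    credentials=credentials,
)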
While looking at sample code to upload a file to S3, I found the following two ways.
Using boto3.resource.put_object():
import boto3

s3_resource = boto3.resource('s3')
s3_resource.Bucket(BUCKET).put_object(Key='test', Body=b'some data')
Using boto3.s3.transfer.upload_file():
import boto3
from boto3.s3.transfer import S3Transfer

client = boto3.client('s3')
transfer = S3Transfer(client)
transfer.upload_file('/my_file', BUCKET, 'test')
I could not figure out the difference between the two ways. Are there any advantages to using one over the other in specific use cases? Can anyone please elaborate? Thank you.
The put_object method maps directly to the low-level S3 API request. It will attempt to send the entire body in one request; there is no multipart support (see the boto3 docs).
The upload_file method is handled by the S3 Transfer Manager, which means it will automatically handle multipart uploads behind the scenes for you, if necessary:
- Automatically switching to multipart transfers when a file is over a specific size threshold
- Uploading/downloading a file in parallel
- Progress callbacks to monitor transfers
- Retries. While botocore handles retries for streaming uploads, it is not possible for it to handle retries for streaming downloads. This module handles retries for both cases so you don't need to implement any retry logic yourself.

This module has a reasonable set of defaults. It also allows you to configure many aspects of the transfer process, including: multipart threshold size, max parallel downloads, socket timeouts, and retry amounts (boto3 docs).
There is likely no difference - boto3 sometimes has multiple ways to achieve the same thing. See http://boto3.readthedocs.io/en/latest/guide/s3.html#uploads for more details on uploading files.
I have an example I'm trying to create which, preferably using Django (or some other comparable framework), will immediately compress uploaded content chunk-by-chunk into an unusual compression format (be it LZMA, 7zip, etc.), which is then written out in another upload request to S3.
Essentially, this is what will happen:
1. A user initiates a multipart upload to my endpoint at ^/upload/?$.
2. As chunks are received on the server (could be 1024 bytes or some other number), they are passed through a compression algorithm chunk by chunk.
3. The compressed output is written out over the wire to an S3 bucket.
Step 3 is optional; I could store the file locally and have a message queue do the uploads in a deferred way.
Is step 2 possible using a framework like Django? Is there a low-level way of accessing the incoming data in a file-like object?
The Django Request object provides a file-like interface so you can stream data from it. But, since Django always reads the whole Request into memory (or a temporary File if the file upload is too large) you can only use this API after the whole request is received. If your temporary storage directory is big enough and you do not mind buffering the data on your server you do not need to do anything special. Just upload the data to S3 inside the view. Be careful with timeouts though. If the upload to S3 takes too long the browser will receive a timeout. Therefore I would recommend moving the temporary files to a more permanent directory and initiating the upload via a worker queue like Celery.
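To make that "buffer on the server, then upload inside the view" idea concrete, here is a rough sketch (the form field name and bucket are made up) that compresses the already-received upload chunk-by-chunk with LZMA and sends the result to S3:

import io
import lzma

import boto3
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

s3 = boto3.client('s3')

@csrf_exempt
def upload(request):
    uploaded = request.FILES['file']      # Django has already buffered this upload
    compressor = lzma.LZMACompressor()
    compressed = io.BytesIO()
    for chunk in uploaded.chunks():       # iterate without loading everything at once
        compressed.write(compressor.compress(chunk))
    compressed.write(compressor.flush())
    compressed.seek(0)
    s3.upload_fileobj(compressed, 'my-bucket', uploaded.name + '.xz')
    return JsonResponse({'status': 'ok'})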
If you want to stream directly from the client into Amazon S3 via your server I recommend using gevent. Using gevent you could write a simple greenlet that reads from a queue and writes to S3. This queue is filled by the original greenlet which reads from the request.
You could use a special upload URL like http://upload.example.com/ where you deploy that special server. The Django functions can be used from outside the Django framework if you set the DJANGO_SETTINGS_MODULE environment variable and take care of some things that the middlewares normally do for you (db connect/disconnect, transaction begin/commit/rollback, session handling, etc.).
It is even possible to run your custom WSGI app and Django together in the same WSGI container. Just wrap the Django WSGI app and intercept requests to /upload/. In this case I would recommend using gunicorn with the gevent worker-class as server.
I am not too familiar with the Amazon S3 API, but as far as I know you can also generate a temporary token for file uploads directly from your users. That way you would not need to tunnel the data through your server at all.
Edit: You can indeed allow anonymous uploads to your buckets. See this question which talks about this topic: S3 - Anonymous Upload - Key prefix
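Instead of fully anonymous uploads, you could also hand out pre-signed upload parameters from your view so the browser uploads straight to S3; a minimal sketch (bucket name and key are placeholders):

import boto3

s3 = boto3.client('s3')

# Returns a URL plus form fields the client can POST the file to directly,
# valid for one hour, so the data never passes through your server.
presigned = s3.generate_presigned_post('my-bucket', 'uploads/my_file.xz', ExpiresIn=3600)
# presigned['url'] and presigned['fields'] go into the client-side upload form.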