S3 presigned post urls issue using boto3

S3 presigned post urls issue using boto3 - python

I have a webapp served with apache2 running python-flask in the backend. The app is hosted on Linode and heavily relies on their S3 Object Storage. I'm using boto3 to interact with the S3 storage. My issue is regarding the generate_presigned_url method when used in production. It returns the following structure:
{
'url': 'https://eu-central-1.linodeobjects.com/my-s3-bucket',
'fields': {
'ACL': 'private',
'key': 'foo.bar',
'AWSAccessKeyId': 'FOOBAR',
'policy': 'base64longhash...',
'signature': 'foobar'
}
}
Everytime I use this method on the same python session the policy key returns a longer value (about 1.5x increase in length for every subsequent request). After a few requests the size of the policy gets really large (tens of MB) and the app breaks. If I restart the python service the policy size gets reset.
After digging in the boto3 documentation and some threads in GitHub and here I couldn't find anything that helped me in regards to resetting the S3 connection without having to restart the whole python session. To keep restarting the apache2 service periodically is not a good approach, so my solution was to call the generate_presigned_url from a standalone script using subprocess and parse the string output back to json before using it, which is not ideal, as I wish I didn't have to keep calling bash scripts from inside apache. The main functions I use follow bellow:
AWS_BUCKET_PARAMS = {'ACL': 'private'}
# connect to my linode's s3 bucket
def awsSign():
return boto3.client('s3', aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY, endpoint_url=AWS_ENDPOINT_URL)
# generate presigned post object for uploading files
def awsPostForm(file_path):
s3 = awsSign()
return s3.generate_presigned_post(AWS_BUCKET, file_path, AWS_BUCKET_PARAMS, [AWS_BUCKET_PARAMS], 1800)
# generate post object from external script
def awsPostFormTerminal(file_path):
from subprocess import Popen, PIPE
cmd = [ 'python3', '-c', f'from utils import awsPostForm; print(awsPostForm("{file_path}"))' ]
output = Popen( cmd, stdout=PIPE ).communicate()[0]
return json.loads(output.decode('utf-8').replace('\n', '').replace("'", '"'))
The problem happens regardless of calling awsSign() one or many times for a list of files.
In short, I wish for a better way of retrieving subsequent post forms from generate_presigned_url in the same python session, without increasing the policy on every new request. If there is a proper way to restart the boto3 connection, provide some parameters that I missed when setting the API calls or maybe it's something particular to the Linode's S3 object storage service.
If anyone can point me at the right direction I'll appreciate!

Well, turns out it was a rookie mistake - got the hint from the linode's Q&A. So, answering my own question:
turns out the AWS_BUCKET_PARAMS variable was being updated by reference after passing through generate_presigned_post. Copying the global variable inside the function's scope before sending the request solved the issue.

Related

Cloud Run: endpoint that runs a function as background job

I am trying to deploy a rest api in cloud run where one endpoint launches an async job. The job is defined inside a function in the code.
It seems one way to do it is to use Cloud Task, but this would mean to make a self-call to another endpoint of the deployed api. Specifically, to create an auxiliary endpoint that contains the job code (e.g. /run-my-function) and another one to set the queue to cloud task that launches the /run-my-function?
Is this the right way to do it or I have misunderstand something? In case it's the right way how to specify the url of the /run-my-function endpoint without explicitly hard-code the cloud run deployed uRL name?
The code for the endpoint that launches the endpoint with the run-my-function code would be:
from google.cloud import tasks_v2
client = tasks_v2.CloudTasksClient()
project = 'myproject'
queue = 'myqueue'
location = 'mylocation'
url = 'https://cloudrunservice-abcdefg-ca.b.run.app/run-my-function'
service_account_email = '12345#cloudbuild.gserviceaccount.com'
parent = client.queue_path(project, location, queue)
task = {
"http_request": {
"http_method": tasks_v2.HttpMethod.POST,
'url': url,
"oidc_token": {"service_account_email": service_account_email},
}
}
response = client.create_task(parent=parent, task=task)
However, this requires to hard-code the service name https://cloudrunservice-abcdefg-ca.b.run.app and to define an auxiliary endpoint /run-my-function that can be called via http

In your code you are able to get the Cloud Run URL without hardcoding it or setting it in an environment variable.
You can have a look to a previous article that I wrote, in the gracefull termison part. I provide a working code in Go, not so difficult to re-implement in Python.
Here the principle:
Get the Region and the project Number from the Metadata server. Keep in mind that Cloud Run has specific metadata like the region
Get the K_SERVICE env var (it's a standard Cloud Run env var)
Perform a call to the Cloud Run Rest API to get the service detail and customize the request with the data got previously
Extract the status.url JSON entry from the response.
Now you have it!
Let me know if you have difficulties to achieve that. I'm not good at Python, but I will be able to write that piece of code!

How to make SDK connection to an S3 bucket persistent?

I'm using the boto3 library to put objects in Amazon S3. I want to make a python service on my server, which is connected to the bucket in AWS, and whenever I send it a file path, it puts that in the bucket:
s3_resource = boto3.resource(
's3',
endpoint_url='...',
aws_access_key_id='...',
aws_secret_access_key='...'
)
bucket = s3_resource.Bucket('name')
for uploading, I send my requests to this method:
def upload(path):
bucket.put_object(...)
The connection to the bucket should be persistent so that whenever I call upload method, it quickly puts the object in the bucket and does not need to connect to the bucket every time.
How can I enable long-lived connections on my s3_resource?

Edit
The SDK tries to be an abstraction from the underlying API calls.
Whenever you want to put an object into an S3 bucket, that results in an API call. API calls are sent over the network to AWS and that requires establishing a connection to an AWS server. This connection can be kept open for a longer time, so it doesn't need to be re-established every time you want to make an API call. This helps reduce the network overhead, since establishing connections is relatively costly.
From your perspective these should be implementation details, you shouldn't have to worry about, since the SDK (boto3) takes care of that for you. There are some options to tweak how it handles things, but these are considered advanced options and you should know what you're doing ;-)
The lifecycle of the resources in boto3 is more or less independent from the underlying network connection. The way you will see this impact you, is through higher latencies, when there is no pre-existing connection that can be repurposed.
What you're looking for are the keep-alive options in boto3.
There is two levels on which these can be enabled:
TCP
You can set the tcp_keepalive option in the SDK config, which is set to false by default.
More detail on that can be found in the documentation.
HTTP
For HTTP-Keep alive, there is nothing you need to do explicitly - the underlying library handles that implicitly. There is a common optimization suggestion when using aws-sdk-js to mess with this, but the SDKs behave differently, that's not necessary in Python. There is a long discussion about this in a Github issue.
If you want to set configure the setting explicitly, you can use the event system to do that as this reply suggests:
def set_connection_header(request, operation_name, **kwargs):
request.headers['Connection'] = 'keep-alive'
ddb = boto3.client('dynamodb')
ddb.meta.events.register('request-created.dynamodb', set_connection_header)

Using Boto3 to get AWS configuration option

I am looking for a way to perform the equivalent of the AWS CLI's method aws configure get varname [--profile profile-name] using boto3 in python. Does anyone know if this possible without either:
Parsing the AWS config file myself
Somehow interacting with the AWS CLI itself from my python script
For more context, I am writing a python cli tool that will interact with AWS APIs using boto3. The python tool uses an AWS session token stored in a profile in the ~/.aws/credentials file. I am using the saml2aws cli to fetch AWS credentials from my company's identity provider, which writes the aws_access_key_id, aws_secret_access_key, aws_session_token, aws_security_token, x_principal_arn, and x_security_token_expires parameters to the ~/.aws/credentials file like so:
[saml]
aws_access_key_id = #REMOVED#
aws_secret_access_key = #REMOVED#
aws_session_token = #REMOVED#
aws_security_token = #REMOVED#
x_principal_arn = arn:aws:sts::000000000123:assumed-role/MyAssumedRole
x_security_token_expires = 2019-08-19T15:00:56-06:00
By the nature of my python cli tool, sometimes the tool will execute past the expiration time of the AWS session token, which are enforced to be quite short by my company. I want the python cli tool to check the expiration time before it starts its critical task to verify that it has enough time to complete its task, and if not, alerting the user to refresh their session token.
Using the AWS CLI, I can fetch the expiration time of the AWS session token from the ~/.aws/credentials file using like this:
$ aws configure get x_security_token_expires --profile saml
2019-08-19T15:00:56-06:00
and I am curious if boto3 has a mechanism I was unable to find to do something similar.
As an alternate solution, given an already generated AWS session token, is it possible to fetch the expiration time of it? However, given the lack of answers on questions such as Ways to find out how soon the AWS session expires?, I would guess not.

Since the official AWS CLI is powered by boto3, I was able to dig into the source to find out how aws configure get is implemented. It's possible to read the profile configuration through the botocore Session object. Here is some code to get the config profile and value used in your example:
import botocore.session
# Create an empty botocore session directly
session = botocore.session.Session()
# Get config of desired profile. full_config is a standard python dictionary.
profiles_config = session.full_config.get("profiles", {})
saml_config = profiles_config.get("saml", {})
# Get config value. This will be None if the setting doesn't exist.
saml_security_token_expires = saml_config.get("x_security_token_expires")
I'm using code similar to the above as part of a transparent session cache. It checks for a profile's role_arn so I can identify a cached session to load if one exists and hasn't expired.
As far as the alternate question of knowing how long a given session has before expiring, you are correct in that there is currently no API call that can tell you this. Session expiration is only given when the session is created, either through STS get_session_token or assume_role API calls. You have to hold onto the expiration info yourself after that.

Boto3 - How to keep session alive

I have a process that is supposed to run forever and needs to updates data on a S3 bucket on AWS. I am initializing the session using boto3:
session = boto3.session.Session()
my_s3 = session.resource(
"s3",
region_name=my_region_name,
aws_access_key_id=my_aws_access_key_id,
aws_secret_access_key=my_aws_secret_access_key,
aws_session_token=my_aws_session_token,
)
Since the process is supposed to run for days, I am wondering how I can make sure that the session is kept alive and working. Do I need to re-initialize the session sometimes?
Note: not sure if it is useful, but I have actually multiple threads each using its own session.
Thanks!

There is no concept of a 'session'. The Session() is simply an in-memory object that contains information about how to connect to AWS.
It does not actually involve any calls to AWS until an action is performed (eg ListBuckets). Actions are RESTful, and return results immediately. They do not keep open a connection.
A Session is not normally required. If you have stored your AWS credentials in a config file using the AWS CLI aws configure command, you can simply use:
import boto3
s3_resource = boto3.resource('s3')
The Session, however, is useful if you are using temporary credentials returned by an AssumeRole() command, rather than permanent credentials. In such a case, please note that credentials returned by AWS Security Token Service (STS) such as AssumeRole() have time limitations. This, however, is unrelated to the concept of a boto3 Session.

Upload content from a BIG CSV to CloudSQL using App Engine Python

I'm pretty new with Google App Engine.
What i need to do is to upload a pretty large CSV to CloudSQL.
I've got an HTML page that has a file upload module which when uploaded reaches the Blobstore.
After which i open the CSV with the Blob reader and execute each line to CloudSQL using cursor.execute("insert into table values"). The problem here is that i can only execute the HTTP request for a minute and not all the data gets inserted in that short a time. It also keeps the screen in a loading state throughout which i would like to avoid by making the code run in the back end if that's possible?
I also tried going the "LOAD DATA LOCAL INFILE" way.
"LOAD DATA LOCAL INFILE" works from my local machine when i'm connected to CloudSQL via the terminal. And its pretty quick.
How would i go about using this within App Engine?
Or is there a better way to import a large CSV into CloudSQL through the Blobstore or Google Cloud Storage directly after uploading the CSV from the HTML?
Also, is it possible to use Task Queues with Blob Store and then insert the data into CloudSQL on the backend?

I have used a similar approach for Datastore and not CloudSQL but the same approach can be applied to your scenario.
Setup a non-default module (previously backend, deprecated now) of your application
Send a http request which will trigger the module endpoint through a task queue (to avoid 60 second deadline)
Use mapreduce with CSV as input and do the operation on each line of csv within the map function (to avoid memory errors and resume pipeline from where it left in case of any errors during operation)
EDIT: Elaborating map reduce as per OP request, and also eliminating the use of taskqueue
Read the mapreduce basics from the docs found here
Download the dependency folders for mapreduce to work (simplejson, graphy, mapreduce)
Download this file to your project folder and save as "custom_input_reader.py"
Now copy the code below to your main_app.py file.
main_app.py
from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from custom_input_reader import GoogleStorageLineInputReader
def testMapperFunc(row):
# do process with csv row
return
class TestGCSReaderPipeline(base_handler.PipelineBase):
def run(self):
yield mapreduce_pipeline.MapPipeline(
"gcs_csv_reader_job",
"main_app.testMapperFunc",
"custom_input_reader.GoogleStorageLineInputReader",
params={
"input_reader": {
"file_paths": ['/' + bucketname + '/' + filename]
}
})
Create a http handler which will initiate the map job
main_app.py
class BeginUpload(webapp2.RequestHandler):
# do whatever you want
upload_task = TestGCSReaderPipeline()
upload_task.start()
# do whatever you want
If you want to pass any parameters, add the parameter in "run" method and provide values when creating the pipeline object

You can try importing CSV data via cloud console:
https://cloud.google.com/sql/docs/import-export?hl=en#import-csv

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.