Change directory of xlsx file in s3 bucket using AWS Lambda - python

The goal of my code is to move a file to a different directory every 24 hours (because every day a new one is created by another Lambda function). I want to get the current file from my S3 bucket and write it to another directory in the same bucket. Currently, this line of the code does not work: s3.put_object(Body=response, Bucket=bucket, Key=fileout), and I get this error: "errorMessage": "Parameter validation failed:\nInvalid type for parameter Body", "errorType": "ParamValidationError". What does the error mean, and what is needed in order to store the response in the history directory?
import boto3
import json

s3 = boto3.client('s3')
bucket = "some-bucket"

def lambda_handler(event, context):
    file = 'latest/some_file.xlsx'
    response = s3.get_object(Bucket=bucket, Key=file)
    fileout = 'history/some_file.xlsx'
    s3.put_object(Body=response, Bucket=bucket, Key=fileout)
    return {
        'statusCode': 200,
        'body': json.dumps(data),
    }

The response variable in your code holds the whole GetObject response (metadata included), not just the actual xlsx file. You should take the Body from the response and pass that to the put_object method:
response = s3.get_object(Bucket=bucket, Key=file)['Body']
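Putting that together, a minimal sketch of the corrected handler (reading the streamed body into bytes before re-uploading it; bucket and key names are the ones from the question):

import boto3
import json

s3 = boto3.client('s3')
bucket = "some-bucket"

def lambda_handler(event, context):
    file = 'latest/some_file.xlsx'
    fileout = 'history/some_file.xlsx'
    # get_object returns a dict; 'Body' is a streaming object, so read it into bytes
    body = s3.get_object(Bucket=bucket, Key=file)['Body'].read()
    s3.put_object(Body=body, Bucket=bucket, Key=fileout)
    return {
        'statusCode': 200,
        'body': json.dumps({'moved_to': fileout}),
    }

Since the source and destination are in the same bucket, s3.copy_object(Bucket=bucket, Key=fileout, CopySource={'Bucket': bucket, 'Key': file}) would achieve the same result without pulling the data through the Lambda.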

Related

How to retrieve AWS S3 objects URL using python

I need to write a Lambda function that retrieves an S3 object URL for object preview. I came across this solution, but I have a question about it. In my case, I would like to retrieve the URL of any object in my S3 bucket, hence there is no key name. How can I retrieve the URL of any future object stored in my S3 bucket?
bucket_name = 'aaa'
aws_region = boto3.session.Session().region_name
object_key = 'aaa.png'
s3_url = f"https://{bucket_name}.s3.{aws_region}.amazonaws.com/{object_key}"
return {
    'statusCode': 200,
    'body': json.dumps({'s3_url': s3_url})
}
You have some examples here. But what exactly would you like to do? What do you mean by future objects? You can configure an object-created event notification on your bucket that triggers your Lambda each time a new object is uploaded into that bucket:
import boto3

def lambda_handler(event, context):
    print(event)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    s3 = boto3.client('s3')
    obj = s3.get_object(
        Bucket=bucket,
        Key=key
    )
    print(obj['Body'].read().decode('utf-8'))
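If the goal is to return the URL of each newly uploaded object, a minimal sketch along the same lines could build it from the event record (this assumes the virtual-hosted-style URL format from the question and accounts for S3 event keys being URL-encoded):

import json
import urllib.parse
import boto3

def lambda_handler(event, context):
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    # object keys in S3 event notifications are URL-encoded
    key = urllib.parse.unquote_plus(record['object']['key'])
    region = boto3.session.Session().region_name
    s3_url = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
    return {
        'statusCode': 200,
        'body': json.dumps({'s3_url': s3_url})
    }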

S3 Bucket Lifecycle Configuration with Boto3 Lambda Function and getting MalformedXML format error

I created a Lambda function. In the function, I scan all the S3 buckets and check whether the bucket name contains the word "log". I take those buckets and apply a lifecycle configuration that deletes every kind of file under them once it is older than 30 days. But when I trigger the code I get the format error below, and I can't fix it. Is my code or my format wrong?
This is my Python code:
import logging
import boto3
import datetime, os, json
from botocore.exceptions import ClientError

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    response = s3.list_buckets()
    # Output the bucket names
    print('Existing buckets:')
    for bucket in response['Buckets']:
        #print(f' {bucket["Name"]}')
        if 'log' in bucket['Name']:
            BUCKET = bucket["Name"]
            print(BUCKET)
            try:
                policy_status = s3.put_bucket_lifecycle_configuration(
                    Bucket=BUCKET,
                    LifecycleConfiguration={'Rules': [{'Expiration': {'Days': 30, 'ExpiredObjectDeleteMarker': True}, 'Status': 'Enabled'}]})
            except ClientError as e:
                print("Unable to apply bucket policy. \nReason:{0}".format(e))
The error:
Existing buckets:
sample-log-case
sample-bucket-log-v1
template-log-v3-mv
Unable to apply bucket policy.
Reason:An error occurred (MalformedXML) when calling the PutBucketLifecycleConfiguration operation: The XML you provided was not well-formed or did not validate against our published schema
I took the example below from the boto3 site and only changed the parameters, without touching the format, but I get the same format error for this too:
import boto3

client = boto3.client('s3')
response = client.put_bucket_lifecycle_configuration(
    Bucket='bucket-sample',
    LifecycleConfiguration={
        'Rules': [
            {
                'Expiration': {
                    'Days': 3650,
                },
                'Status': 'Enabled'
            },
        ],
    },
)
print(response)
According to the documentation, the 'Filter' rule is required if the LifecycleRule does not contain a 'Prefix' element (the 'Prefix' element is no longer used). 'Filter', in turn, requires exactly one of 'Prefix', 'Tag' or 'And'.
Adding the following to your 'Rules' in LifecycleConfiguration will solve the problem:
'Filter': {'Prefix': ''}
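For reference, a sketch of the corrected call for one bucket, keeping the 30-day expiration from the question (the bucket name is one of the example buckets; note that Days and ExpiredObjectDeleteMarker cannot be combined in the same Expiration element):

import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='sample-log-case',
    LifecycleConfiguration={
        'Rules': [
            {
                # 'Filter' is required when the legacy 'Prefix' element is absent;
                # an empty prefix matches every object in the bucket
                'Filter': {'Prefix': ''},
                'Expiration': {'Days': 30},
                'Status': 'Enabled',
            },
        ],
    },
)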

Upload files to S3 from multipart/form-data in AWS Lambda (Python)

I'm calling this Lambda function via API Gateway. My issue is that the image file is malformed, meaning that it does not open.
import boto3
import json

def lambda_handler(event, context):
    print(event)
    # removing all the data around the packet
    # this also results in a malformed png
    start = '<?xpacket end="r"?>'
    end = '\r\n------'
    content = str(event['body'])
    content = content[content.index(start) + len(start):content.index(end)].encode('utf-8')
    bucket_name = "bucket-name"
    file_name = "hello1.png"
    lambda_path = "/tmp/" + file_name
    s3_path = file_name
    s3 = boto3.resource("s3")
    s3.Bucket(bucket_name).put_object(Key=s3_path, Body=content)
    return {
        'statusCode': 200,
        'headers': {
            'Access-Control-Allow-Origin': '*',
        },
        'body': json.dumps(event)
    }
Lambda has a payload limit of 6 MB for synchronous invocations and 256 KB for asynchronous invocations:
https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html
API Gateway has a limit of 10 MB for REST APIs and 128 KB for WebSocket messages:
https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html
That may be the primary reason why part of the file is uploaded and part is not.
Even if you do not hit those limits with smaller files, you pay for Lambda execution time while uploading, which is simply a waste of the function's time.
There may also be configuration on API Gateway that modifies the payload before passing it to Lambda. Make sure there is no active mapping template that transforms the request before it reaches Lambda, and check that "Use Lambda Proxy integration" is enabled for this resource in the API Gateway console.
To upload to S3, it is better to use a pre-signed URL for an Amazon S3 PUT operation:
https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/s3-example-presigned-urls.html
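A minimal sketch of generating such a pre-signed PUT URL with boto3 (bucket and key are placeholders): the Lambda returns the URL and the client then uploads the file directly to S3 with an HTTP PUT, so the payload never passes through API Gateway or Lambda.

import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # placeholder bucket/key; in practice derive the key from the request
    url = s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': 'bucket-name', 'Key': 'hello1.png'},
        ExpiresIn=300,  # URL stays valid for 5 minutes
    )
    return {
        'statusCode': 200,
        'headers': {'Access-Control-Allow-Origin': '*'},
        'body': json.dumps({'upload_url': url})
    }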

How can I serve a .zip through an API Gateway with a Python Lambda?

Context: I have some log files in an S3 bucket that I need to retrieve. The permissions on the bucket prevent me from downloading them directly from the S3 bucket console. I need a "backdoor" approach to retrieve the files. I have an API Gateway setup to hit a Lambda, which will figure out what files to retrieve and get them from the S3 bucket. However, the files are over 10 MB and the AWS API Gateway has a maximum payload size of 10 MB. Now, I need a way to compress the files and serve them to the client as a downloadable zip.
import json
import boto3
import zipfile
import zlib
import os

S3 = boto3.resource('s3')
BUCKET = S3.Bucket(name="my-bucket")
TEN_MEGA_BYTES = 10 * 1000 * 1000

def lambda_handler(event, context):
    # utilize Lambda's temporary storage (512 MB)
    retrieved = zipfile.ZipFile("/tmp/retrieved.zip", mode="w", compression=zipfile.ZIP_DEFLATED, compresslevel=9)
    for bucket_obj in BUCKET.objects.all():
        # logic to decide which file I want is done here
        log_file_obj = BUCKET.Object(bucket_obj.key).get()
        # get the object's binary encoded content (a bytes object)
        content = log_file_obj["Body"].read()
        # write the content to a file within the zip
        # writestr() requires a bytes or str object
        retrieved.writestr(bucket_obj.key, content)
    # close the zip
    retrieved.close()
    # visually checking zip size for debugging
    zip_size = os.path.getsize("/tmp/retrieved.zip")
    print("{} bytes".format(zip_size), "{} percent of 10 MB".format(zip_size / TEN_MEGA_BYTES * 100))
    return {
        "header": {
            "contentType": "application/zip, application/octet-stream",
            "contentDisposition": "attachment, filename=retrieved.zip",
            "contentEncoding": "deflate"
        },
        "body": retrieved
    }
    # return retrieved
I have tried returning the zipfile object directly and within a JSON structure with headers that are supposed to be mapped from the integration response to the method response (i.e. the headers I'm setting programmatically in the Lambda should be mapped to the response headers that the client actually receives). In either case, I get a marshal error.
Response:
{
    "errorMessage": "Unable to marshal response: <zipfile.ZipFile [closed]> is not JSON serializable",
    "errorType": "Runtime.MarshalError"
}
I have done a lot of tinkering in the API Gateway in the AWS Console trying to set different combinations of headers and/or content types, but I am stuck. I'm not entirely sure what would be the correct header/content-type combination.
From the above error message, it appears that Lambda can only return JSON-serializable structures, but I am unable to confirm either way.
I got to the point of being able to return a compressed JSON payload, not a downloadable .zip file, but that is good enough for the time being. My team and I are looking into requesting a time-limited pre-signed URL for the S3 bucket that will allow us to bypass the permissions for a limited time to access the files. Even with the compression, we expect to someday reach the point where even the compressed payload is too large for the API Gateway to handle. We also discovered that Lambda has an even smaller payload limit than API Gateway: 6 MB for synchronous invocations (see Lambda Limits).
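As a reference for that approach, a minimal sketch of returning a time-limited pre-signed GET URL instead of the file contents (bucket and key are placeholders):

import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # placeholder bucket/key; in practice pick the log file based on the request
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'my-bucket', 'Key': 'logs/retrieved.zip'},
        ExpiresIn=900,  # URL stays valid for 15 minutes
    )
    # the client downloads directly from S3, so API Gateway never carries the payload
    return {
        'statusCode': 200,
        'body': json.dumps({'download_url': url})
    }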
Solution:
This article was a huge help, Lambda Compression. The main thing that was missing was using "Lambda Proxy Integration." When configuring the API Gateway Integration Request, choose "Use Lambda Proxy Integration." That limits your ability to use mapping templates on your method request and means you have to return a specific structure from your Lambda function. Also, ensure that content encoding is enabled in the API Gateway settings and that 'application/json' is specified as an acceptable Binary Media Type.
Then, when sending the request, use these headers:
Accept-Encoding: application/gzip,
Accept: application/json
import json
import boto3
import gzip
import base64
from io import BytesIO

S3 = boto3.resource('s3')
BUCKET = S3.Bucket(name="my-bucket")

def gzip_b64encode(data):
    compressed = BytesIO()
    with gzip.GzipFile(fileobj=compressed, mode='w') as f:
        json_response = json.dumps(data)
        f.write(json_response.encode('utf-8'))
    return base64.b64encode(compressed.getvalue()).decode('ascii')

def lambda_handler(event, context):
    # event body is JSON string, this is request body
    req_body = json.loads(event["body"])
    retrieved_data = {}
    for bucket_obj in BUCKET.objects.all():
        # logic to decide what file to grab based on request body done here
        log_file_obj = BUCKET.Object(bucket_obj.key).get()
        content = log_file_obj["Body"].read().decode("UTF-8")
        retrieved_data[bucket_obj.key] = json.loads(content)
    # integration response format
    return {
        "isBase64Encoded": True,
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            "Content-Encoding": "gzip",
            "Access-Control-Allow-Origin": "*"
        },
        "body": gzip_b64encode(retrieved_data)
    }

Complete a multipart_upload with boto3?

Tried this:
import boto3
from boto3.s3.transfer import TransferConfig, S3Transfer

path = "/temp/"
fileName = "bigFile.gz"  # this happens to be a 5.9 Gig file
client = boto3.client('s3', region)
config = TransferConfig(
    multipart_threshold=4*1024,  # number of bytes
    max_concurrency=10,
    num_download_attempts=10,
)
transfer = S3Transfer(client, config)
transfer.upload_file(path+fileName, 'bucket', 'key')
Result: 5.9 gig file on s3. Doesn't seem to contain multiple parts.
I found this example, but part is not defined.
import boto3

bucket = 'bucket'
path = "/temp/"
fileName = "bigFile.gz"
key = 'key'
s3 = boto3.client('s3')

# Initiate the multipart upload and send the part(s)
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
with open(path+fileName, 'rb') as data:
    part1 = s3.upload_part(Bucket=bucket,
                           Key=key,
                           PartNumber=1,
                           UploadId=mpu['UploadId'],
                           Body=data)

# Next, we need to gather information about each part to complete
# the upload. Needed are the part number and ETag.
part_info = {
    'Parts': [
        {
            'PartNumber': 1,
            'ETag': part['ETag']
        }
    ]
}

# Now the upload works!
s3.complete_multipart_upload(Bucket=bucket,
                             Key=key,
                             UploadId=mpu['UploadId'],
                             MultipartUpload=part_info)
Question: Does anyone know how to use the multipart upload with boto3?
Your code was already correct. Indeed, a minimal example of a multipart upload just looks like this:
import boto3
s3 = boto3.client('s3')
s3.upload_file('my_big_local_file.txt', 'some_bucket', 'some_key')
You don't need to explicitly ask for a multipart upload, or use any of the lower-level functions in boto3 that relate to multipart uploads. Just call upload_file, and boto3 will automatically use a multipart upload if your file size is above a certain threshold (which defaults to 8MB).
You seem to have been confused by the fact that the end result in S3 wasn't visibly made up of multiple parts:
Result: 5.9 gig file on s3. Doesn't seem to contain multiple parts.
... but this is the expected outcome. The whole point of the multipart upload API is to let you upload a single file over multiple HTTP requests and end up with a single object in S3.
As described in the official boto3 documentation:
The AWS SDK for Python automatically manages retries and multipart and non-multipart transfers.
The management operations are performed by using reasonable default settings that are well-suited for most scenarios.
So all you need to do is set the desired multipart_threshold value, which indicates the minimum file size for which a multipart upload will be handled automatically by the Python SDK:
import boto3
from boto3.s3.transfer import TransferConfig
# Set the desired multipart threshold value (5GB)
GB = 1024 ** 3
config = TransferConfig(multipart_threshold=5*GB)
# Perform the transfer
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)
Moreover, you can also use multithreading for multipart transfers by setting max_concurrency:
# To consume less downstream bandwidth, decrease the maximum concurrency
config = TransferConfig(max_concurrency=5)
# Download an S3 object
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME', Config=config)
And finally, in case you want to perform a multipart transfer in a single thread, just set use_threads=False:
# Disable thread use/transfer concurrency
config = TransferConfig(use_threads=False)
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME', Config=config)
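The two snippets above are the download-side examples from the docs; the same TransferConfig can be passed to upload_file in exactly the same way, which is what the original question needs. A minimal sketch with placeholder names:

import boto3
from boto3.s3.transfer import TransferConfig

# single-threaded transfer; multipart kicks in above the default 8 MB threshold
config = TransferConfig(use_threads=False)
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)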
Complete source code with explanation: Python S3 Multipart File Upload with Metadata and Progress Indicator
I would advise you to use boto3.s3.transfer for this purpose. Here is an example:
import boto3

def upload_file(filename):
    session = boto3.Session()
    s3_client = session.client("s3")
    try:
        print("Uploading file: {}".format(filename))
        tc = boto3.s3.transfer.TransferConfig()
        t = boto3.s3.transfer.S3Transfer(client=s3_client, config=tc)
        t.upload_file(filename, "my-bucket-name", "name-in-s3.dat")
    except Exception as e:
        print("Error uploading: {}".format(e))
In your code snippet, it clearly should be part -> part1 in the dictionary. Typically you would have several parts (otherwise why use multipart upload at all), and the 'Parts' list would contain an element for each part.
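To illustrate the several-parts case, here is a sketch of the same low-level flow with the file read in chunks (assuming a 100 MB chunk size; every part except the last must be at least 5 MB):

import boto3

s3 = boto3.client('s3')
bucket, key = 'bucket', 'key'
CHUNK_SIZE = 100 * 1024 * 1024  # 100 MB per part

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
with open('/temp/bigFile.gz', 'rb') as f:
    part_number = 1
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        part = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=mpu['UploadId'], Body=chunk)
        # record the part number and ETag of every uploaded part
        parts.append({'PartNumber': part_number, 'ETag': part['ETag']})
        part_number += 1

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})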
You may also be interested in the new Pythonic interface for dealing with S3: http://s3fs.readthedocs.org/en/latest/
Why not use just the copy option in boto3?
s3.copy(CopySource={'Bucket': sourceBucket, 'Key': sourceKey},
        Bucket=targetBucket,
        Key=targetKey,
        ExtraArgs={'ACL': 'bucket-owner-full-control'})
Details on how to initialise the s3 object, and further options for the call, are available in the boto3 docs.
copy from boto3 is a managed transfer which will perform a multipart copy in multiple threads if necessary.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.copy
This works with objects greater than 5 GB, and I have already tested it.
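A self-contained sketch of that approach (bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')

# copy() is a managed transfer: it performs a multipart copy in multiple threads when needed
s3.copy(CopySource={'Bucket': 'source-bucket', 'Key': 'bigFile.gz'},
        Bucket='target-bucket',
        Key='bigFile.gz',
        ExtraArgs={'ACL': 'bucket-owner-full-control'})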
Change part to part1:
import boto3

bucket = 'bucket'
path = "/temp/"
fileName = "bigFile.gz"
key = 'key'
s3 = boto3.client('s3')

# Initiate the multipart upload and send the part(s)
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
with open(path+fileName, 'rb') as data:
    part1 = s3.upload_part(Bucket=bucket,
                           Key=key,
                           PartNumber=1,
                           UploadId=mpu['UploadId'],
                           Body=data)

# Next, we need to gather information about each part to complete
# the upload. Needed are the part number and ETag.
part_info = {
    'Parts': [
        {
            'PartNumber': 1,
            'ETag': part1['ETag']
        }
    ]
}

# Now the upload works!
s3.complete_multipart_upload(Bucket=bucket,
                             Key=key,
                             UploadId=mpu['UploadId'],
                             MultipartUpload=part_info)
