Tried this:
import boto3
from boto3.s3.transfer import TransferConfig, S3Transfer
path = "/temp/"
fileName = "bigFile.gz" # this happens to be a 5.9 Gig file
client = boto3.client('s3', region)
config = TransferConfig(
    multipart_threshold=4*1024,  # number of bytes
    max_concurrency=10,
    num_download_attempts=10,
)
transfer = S3Transfer(client, config)
transfer.upload_file(path+fileName, 'bucket', 'key')
Result: 5.9 gig file on s3. Doesn't seem to contain multiple parts.
I found this example, but part is not defined.
import boto3
bucket = 'bucket'
path = "/temp/"
fileName = "bigFile.gz"
key = 'key'
s3 = boto3.client('s3')
# Initiate the multipart upload and send the part(s)
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
with open(path+fileName, 'rb') as data:
    part1 = s3.upload_part(Bucket=bucket,
                           Key=key,
                           PartNumber=1,
                           UploadId=mpu['UploadId'],
                           Body=data)

# Next, we need to gather information about each part to complete
# the upload. Needed are the part number and ETag.
part_info = {
    'Parts': [
        {
            'PartNumber': 1,
            'ETag': part['ETag']
        }
    ]
}

# Now the upload works!
s3.complete_multipart_upload(Bucket=bucket,
                             Key=key,
                             UploadId=mpu['UploadId'],
                             MultipartUpload=part_info)
Question: Does anyone know how to use the multipart upload with boto3?
Your code was already correct. Indeed, a minimal example of a multipart upload just looks like this:
import boto3
s3 = boto3.client('s3')
s3.upload_file('my_big_local_file.txt', 'some_bucket', 'some_key')
You don't need to explicitly ask for a multipart upload, or use any of the lower-level functions in boto3 that relate to multipart uploads. Just call upload_file, and boto3 will automatically use a multipart upload if your file size is above a certain threshold (which defaults to 8MB).
You seem to have been confused by the fact that the end result in S3 wasn't visibly made up of multiple parts:
Result: 5.9 gig file on s3. Doesn't seem to contain multiple parts.
... but this is the expected outcome. The whole point of the multipart upload API is to let you upload a single file over multiple HTTP requests and end up with a single object in S3.
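If you want to verify that a multipart upload actually took place, one practical hint (a well-known S3 behaviour, not something to build logic on) is that the ETag of a multipart-uploaded object ends with -<number of parts>. A minimal sketch, with the bucket and key names as placeholders:
import boto3

s3 = boto3.client('s3')

# 'some_bucket' and 'some_key' are placeholders for your own bucket/key.
head = s3.head_object(Bucket='some_bucket', Key='some_key')

# Objects uploaded via multipart typically have an ETag like '"abc123...-42"',
# where 42 is the number of parts; single-part uploads have a plain MD5 ETag.
print(head['ETag'])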
As described in the official boto3 documentation:
The AWS SDK for Python automatically manages retries and multipart and non-multipart transfers.
The management operations are performed by using reasonable default settings that are well-suited for most scenarios.
So all you need to do is set the desired multipart threshold, which indicates the minimum file size above which multipart upload is handled automatically by the Python SDK:
import boto3
from boto3.s3.transfer import TransferConfig
# Set the desired multipart threshold value (5GB)
GB = 1024 ** 3
config = TransferConfig(multipart_threshold=5*GB)
# Perform the transfer
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)
Moreover, you can also control the multithreading used for multipart transfers by setting max_concurrency:
# To consume less downstream bandwidth, decrease the maximum concurrency
config = TransferConfig(max_concurrency=5)
# Download an S3 object
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME', Config=config)
And finally, if you want to perform the multipart transfer in a single thread, just set use_threads=False:
# Disable thread use/transfer concurrency
config = TransferConfig(use_threads=False)
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME', Config=config)
Complete source code with explanation: Python S3 Multipart File Upload with Metadata and Progress Indicator
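Not the article itself, but a minimal sketch of the same idea using boto3's documented ExtraArgs and Callback parameters of upload_file (the file path, bucket, key and metadata values are placeholders):
import os
import sys
import threading

import boto3
from boto3.s3.transfer import TransferConfig


class ProgressPercentage:
    """Progress callback; boto3 calls it with the number of bytes transferred in each chunk."""

    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            pct = (self._seen_so_far / self._size) * 100
            sys.stdout.write("\r%s  %s / %.0f  (%.2f%%)" % (
                self._filename, self._seen_so_far, self._size, pct))
            sys.stdout.flush()


s3 = boto3.client('s3')
config = TransferConfig(multipart_threshold=8 * 1024 * 1024)  # the default threshold, shown explicitly

s3.upload_file(
    '/temp/bigFile.gz', 'my-bucket', 'bigFile.gz',
    Config=config,
    ExtraArgs={'Metadata': {'uploaded-by': 'example'}},  # arbitrary example metadata
    Callback=ProgressPercentage('/temp/bigFile.gz'),
)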
I would advise you to use boto3.s3.transfer for this purpose. Here is an example:
import boto3
import boto3.s3.transfer


def upload_file(filename):
    session = boto3.Session()
    s3_client = session.client("s3")
    try:
        print("Uploading file: {}".format(filename))
        tc = boto3.s3.transfer.TransferConfig()
        t = boto3.s3.transfer.S3Transfer(client=s3_client, config=tc)
        t.upload_file(filename, "my-bucket-name", "name-in-s3.dat")
    except Exception as e:
        print("Error uploading: {}".format(e))
In your code snippet, part should clearly be part1 in the dictionary. Typically, you would have several parts (otherwise, why use multipart upload at all?), and the 'Parts' list would contain one element for each part.
You may also be interested in the new pythonic interface to dealing with S3: http://s3fs.readthedocs.org/en/latest/
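For example, a minimal s3fs sketch (assuming s3fs is installed and your AWS credentials are configured; the bucket and paths are placeholders, and s3fs handles chunked transfers of large files internally):
import s3fs  # pip install s3fs

fs = s3fs.S3FileSystem(anon=False)  # credentials resolved the same way botocore does

# Upload a local file to S3.
fs.put('/temp/bigFile.gz', 'my-bucket/bigFile.gz')

# Keys can also be treated like file objects.
with fs.open('my-bucket/bigFile.gz', 'rb') as f:
    first_bytes = f.read(128)

print(fs.ls('my-bucket'))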
Why not just use the copy option in boto3?
import boto3

s3 = boto3.client('s3')

s3.copy(
    CopySource={'Bucket': sourceBucket, 'Key': sourceKey},
    Bucket=targetBucket,
    Key=targetKey,
    ExtraArgs={'ACL': 'bucket-owner-full-control'}
)
There are details on how to initialise the s3 object and, obviously, further options for the call available in the boto3 docs.
copy in boto3 is a managed transfer that will perform a multipart copy in multiple threads if necessary.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.copy
This works with objects larger than 5 GB, and I have already tested it.
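If you need to tune how that managed copy is split up, copy also accepts a Config argument; a minimal sketch with placeholder bucket and key names:
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Force multipart copies above 100 MB and use up to 10 threads.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024, max_concurrency=10)

s3.copy(
    CopySource={'Bucket': 'source-bucket', 'Key': 'source-key'},
    Bucket='target-bucket',
    Key='target-key',
    Config=config,
)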
Change part to part1:
import boto3
bucket = 'bucket'
path = "/temp/"
fileName = "bigFile.gz"
key = 'key'
s3 = boto3.client('s3')
# Initiate the multipart upload and send the part(s)
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
with open(path+fileName, 'rb') as data:
    part1 = s3.upload_part(Bucket=bucket,
                           Key=key,
                           PartNumber=1,
                           UploadId=mpu['UploadId'],
                           Body=data)

# Next, we need to gather information about each part to complete
# the upload. Needed are the part number and ETag.
part_info = {
    'Parts': [
        {
            'PartNumber': 1,
            'ETag': part1['ETag']
        }
    ]
}

# Now the upload works!
s3.complete_multipart_upload(Bucket=bucket,
                             Key=key,
                             UploadId=mpu['UploadId'],
                             MultipartUpload=part_info)
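With that fix the single-part version works, but the point of the API is to send several parts. A minimal sketch of a real multi-part loop (the part size and names are placeholders; every part except the last must be at least 5 MB):
import boto3

bucket = 'bucket'
key = 'key'
path = "/temp/"
fileName = "bigFile.gz"
PART_SIZE = 100 * 1024 * 1024  # 100 MB per part

s3 = boto3.client('s3')
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)

parts = []
part_number = 1
with open(path + fileName, 'rb') as data:
    while True:
        chunk = data.read(PART_SIZE)
        if not chunk:
            break
        part = s3.upload_part(Bucket=bucket, Key=key,
                              PartNumber=part_number,
                              UploadId=mpu['UploadId'],
                              Body=chunk)
        parts.append({'PartNumber': part_number, 'ETag': part['ETag']})
        part_number += 1

s3.complete_multipart_upload(Bucket=bucket, Key=key,
                             UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})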
Related
I'm calling this Lambda function via API Gateway. My issue is that the image file is malformed, meaning that it does not open.
import boto3
import json


def lambda_handler(event, context):
    print(event)

    # removing all the data around the packet
    # this also results in a malformed png
    start = '<?xpacket end="r"?>'
    end = '\r\n------'
    content = str(event['body'])
    content = content[content.index(start) + len(start):content.index(end)].encode('utf-8')

    bucket_name = "bucket-name"
    file_name = "hello1.png"
    lambda_path = "/tmp/" + file_name
    s3_path = file_name

    s3 = boto3.resource("s3")
    s3.Bucket(bucket_name).put_object(Key=s3_path, Body=content)

    return {
        'statusCode': 200,
        'headers': {
            'Access-Control-Allow-Origin': '*',
        },
        'body': json.dumps(event)
    }
Lambda has a payload limit of 6 MB for synchronous invocations and 256 KB for asynchronous invocations.
https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html
API Gateway also has a limit of 10 MB for REST APIs and 128 KB for WebSocket messages.
https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html
This may be the primary reason why part of the file is uploaded and part is not.
Even if you do not hit those limits with smaller file sizes, you pay for Lambda execution time while uploading. It is just a waste of the Lambda's time.
There may also be configuration on API Gateway that modifies the payload before passing it to Lambda. Make sure there is no active mapping template that converts the request before it hits Lambda, and check whether "Use Lambda Proxy integration" is enabled for this resource in the API Gateway console.
To upload to S3, it is better to use a pre-signed URL for an Amazon S3 PUT operation:
https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/s3-example-presigned-urls.html
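That link shows the Go SDK; with boto3 the same idea is roughly this (a minimal sketch, where the bucket, key and content type are placeholders):
import boto3

s3 = boto3.client('s3')

# The client uploads directly to S3 with this URL, bypassing the
# Lambda/API Gateway payload limits entirely.
url = s3.generate_presigned_url(
    ClientMethod='put_object',
    Params={'Bucket': 'bucket-name', 'Key': 'hello1.png', 'ContentType': 'image/png'},
    ExpiresIn=3600,  # seconds
)
print(url)
The caller then performs a plain HTTP PUT to that URL with a matching Content-Type header.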
Context: I have some log files in an S3 bucket that I need to retrieve. The permissions on the bucket prevent me from downloading them directly from the S3 bucket console. I need a "backdoor" approach to retrieve the files. I have an API Gateway setup to hit a Lambda, which will figure out what files to retrieve and get them from the S3 bucket. However, the files are over 10 MB and the AWS API Gateway has a maximum payload size of 10 MB. Now, I need a way to compress the files and serve them to the client as a downloadable zip.
import json
import boto3
import zipfile
import zlib
import os

S3 = boto3.resource('s3')
BUCKET = S3.Bucket(name="my-bucket")
TEN_MEGA_BYTES = 1000000000


def lambda_handler(event, context):
    # utilize Lambda's temporary storage (512 MB)
    retrieved = zipfile.ZipFile("/tmp/retrieved.zip", mode="w",
                                compression=zipfile.ZIP_DEFLATED, compresslevel=9)

    for bucket_obj in BUCKET.objects.all():
        # logic to decide which file I want is done here
        log_file_obj = BUCKET.Object(bucket_obj.key).get()
        # get the object's binary encoded content (a bytes object)
        content = log_file_obj["Body"].read()
        # write the content to a file within the zip
        # writestr() requires a bytes or str object
        retrieved.writestr(bucket_obj.key, content)

    # close the zip
    retrieved.close()

    # visually checking zip size for debugging
    zip_size = os.path.getsize("/tmp/retrieved.zip")
    print("{} bytes".format(zip_size), "{} percent of 10 MB".format(zip_size / TEN_MEGA_BYTES * 100))

    return {
        "header": {
            "contentType": "application/zip, application/octet-stream",
            "contentDisposition": "attachment, filename=retrieved.zip",
            "contentEncoding": "deflate"
        },
        "body": retrieved
    }
    # return retrieved
I have tried returning the zipfile object directly and within a JSON structure with headers that are supposed to be mapped from the integration response to the method response (i.e. the headers I'm setting programmatically in the Lambda should be mapped to the response headers that the client actually receives). In either case, I get a marshal error.
Response:
{
  "errorMessage": "Unable to marshal response: <zipfile.ZipFile [closed]> is not JSON serializable",
  "errorType": "Runtime.MarshalError"
}
I have done a lot of tinkering in the API Gateway in the AWS Console trying to set different combinations of headers and/or content types, but I am stuck. I'm not entirely sure what would be the correct header/content-type combination.
From the above error message, it appears that Lambda can only return JSON structures, but I am unable to confirm that either way.
I got to the point of being able to return a compressed JSON payload, not a downloadable .zip file, but that is good enough for the time being. My team and I are looking into requesting a time-limited pre-signed URL for the S3 bucket that will allow us to bypass the permissions for a limited time to access the files. Even with the compression, we expect to someday reach the point where even the compressed payload is too large for the API Gateway to handle. We also discovered that Lambda has an even smaller payload limit than API Gateway: 6 MB for synchronous invocations (see Lambda Limits).
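For reference, the pre-signed URL idea mentioned above is only a few lines with boto3 (a sketch; the bucket and key are placeholders and the Lambda's role must be allowed to read the object):
import boto3

s3 = boto3.client('s3')

# Return this URL to the client and let it download the file from S3
# directly, instead of pushing the bytes through Lambda and API Gateway.
url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={'Bucket': 'my-bucket', 'Key': 'logs/some-log-file.json'},
    ExpiresIn=900,  # 15 minutes
)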
Solution:
This article was a huge help, Lambda Compression. The main thing that was missing was using "Lambda Proxy Integration." When configuring the API Gateway Integration Request, choose "Use Lambda Proxy Integration." That will limit your ability to use any mapping templates on your method request and will mean that you have to return a specific structure from your Lambda function. Also, ensure content encoding is enabled in the API Gateway settings, with 'application/json' specified as an acceptable binary media type.
Then, when sending the request, use these headers:
Accept-Encoding: application/gzip,
Accept: application/json
import json
import boto3
import gzip
import base64
from io import BytesIO

S3 = boto3.resource('s3')
BUCKET = S3.Bucket(name="my-bucket")


def gzip_b64encode(data):
    compressed = BytesIO()
    with gzip.GzipFile(fileobj=compressed, mode='w') as f:
        json_response = json.dumps(data)
        f.write(json_response.encode('utf-8'))
    return base64.b64encode(compressed.getvalue()).decode('ascii')


def lambda_handler(event, context):
    # event body is JSON string, this is request body
    req_body = json.loads(event["body"])
    retrieved_data = {}

    for bucket_obj in BUCKET.objects.all():
        # logic to decide what file to grab based on request body done here
        log_file_obj = BUCKET.Object(bucket_obj.key).get()
        content = log_file_obj["Body"].read().decode("UTF-8")
        retrieved_data[bucket_obj.key] = json.loads(content)

    # integration response format
    return {
        "isBase64Encoded": True,
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            "Content-Encoding": "gzip",
            "Access-Control-Allow-Origin": "*"
        },
        "body": gzip_b64encode(retrieved_data)
    }
I want to update the Content-Type of an existing object in a S3 bucket, using boto3, but how do I do that, without having to re-upload the file?
import boto3

s3 = boto3.resource('s3')
file_object = s3.Object(bucket_name, key)
print(file_object.content_type)
# binary/octet-stream
file_object.content_type = 'application/pdf'
# AttributeError: can't set attribute
Is there a method for this I have missed in boto3?
Related questions:
How to set Content-Type on upload
How to set the content type of an S3 object via the SDK?
There doesn't seem to be any method for this in boto3, but you can copy the file over itself to replace the metadata.
To do this using the low-level AWS API through boto3, do it like this:
s3 = boto3.resource('s3')
api_client = s3.meta.client
response = api_client.copy_object(Bucket=bucket_name,
                                  Key=key,
                                  ContentType="application/pdf",
                                  MetadataDirective="REPLACE",
                                  CopySource=bucket_name + "/" + key)
The MetadataDirective="REPLACE" turns out to be required for S3 to overwrite the file; otherwise you will get an error message saying This copy request is illegal because it is trying to copy an object to itself without changing the object's metadata, storage class, website redirect location or encryption attributes.
Or you can use copy_from, as pointed out by Jordon Phillips in the comments:
s3 = boto3.resource("s3")
object = s3.Object(bucket_name, key)
object.copy_from(CopySource={'Bucket': bucket_name,
'Key': key},
MetadataDirective="REPLACE",
ContentType="application/pdf")
In addition to @leo's answer, be careful if you have custom metadata on your object.
To avoid side effects, I propose adding Metadata=object.metadata to leo's code, otherwise you could lose previous custom metadata:
s3 = boto3.resource("s3")
object = s3.Object(bucket_name, key)
object.copy_from(
CopySource={'Bucket': bucket_name, 'Key': key},
Metadata=object.metadata,
MetadataDirective="REPLACE",
ContentType="application/pdf"
)
You can use the upload_file function from boto3 with the ExtraArgs parameter to specify the content type; this will overwrite the existing object with the new content type. Check out this reference and the example below:
import boto3

client = boto3.client("s3")
temp_file_path = "<path_of_your_file>"
client.upload_file(temp_file_path, "<BUCKET_NAME>", temp_file_path,
                   ExtraArgs={'ContentType': 'application/pdf'})
I am looking to copy files from gcs to my s3 bucket. In boto2, easy as a button.
conn = connect_gs(user_id, password)
gs_bucket = conn.get_bucket(gs_bucket_name)
for obj in gs_bucket:
    s3_key = key.Key(s3_bucket)
    s3_key.key = obj
    s3_key.set_contents_from_filename(obj)
However, in boto3 I am lost trying to find the equivalent code. Any takers?
If all you're doing is a copy:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket-name')

for obj in gcs:
    s3_obj = bucket.Object(obj.key)
    s3_obj.put(Body=obj.data)
Docs: s3.Bucket, s3.Bucket.Object, s3.Bucket.Object.put
Alternatively, if you don't want to use the resource model:
import boto3

s3_client = boto3.client('s3')

for obj in gcs:
    s3_client.put_object(Bucket='bucket-name', Key=obj.key, Body=obj.body)
Docs: s3_client.put_object
Caveat: The gcs bits are pseudocode, I am not familiar with their API.
EDIT:
So it seems gcs supports an old version of the S3 API and with that an old version of the signer. We still have support for that old signer, but you have to opt into it. Note that some regions don't support old signing versions (you can see a list of which S3 regions support which versions here), so if you're trying to copy over to one of those you will need to use a different client.
import boto3
from botocore.client import Config

# Create a resource that signs requests with the old SigV2 ('s3') signer
resource = boto3.resource('s3', config=Config(signature_version='s3'))
gcs_bucket = resource.Bucket('phjordon-test-bucket')
s3_bucket = resource.Bucket('phjordon-test-bucket-tokyo')

for obj in gcs_bucket.objects.all():
    s3_bucket.Object(obj.key).copy_from(
        CopySource=obj.bucket_name + "/" + obj.key)
Docs: s3.Object.copy_from
This, of course, will only work assuming gcs is still S3 compliant.
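If the S3-compatibility route doesn't work out, another approach (sketched here on the assumption that you can use the google-cloud-storage package alongside boto3; bucket names are placeholders) is to read each object with the GCS client and put it into S3:
import boto3
from google.cloud import storage  # pip install google-cloud-storage

gcs_client = storage.Client()
s3_client = boto3.client('s3')

for blob in gcs_client.list_blobs('my-gcs-bucket'):
    s3_client.put_object(
        Bucket='my-s3-bucket',
        Key=blob.name,
        # download_as_bytes() buffers the whole object in memory;
        # older versions of the library call this download_as_string().
        Body=blob.download_as_bytes(),
    )
For very large objects you would want to stream instead of buffering the whole body in memory, for example with boto3's upload_fileobj and a file-like wrapper around the blob.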
I'm going to write a Python program to check if a file is in a certain folder of my Google Cloud Storage. The basic idea is to get the list of all objects in a folder, i.e. a file name list, then check if the file abc.txt is in that list.
Now the problem is, it looks like Google only provides one way to get the object list, which is uri.get_bucket(); see the code below, which is from https://developers.google.com/storage/docs/gspythonlibrary#listing-objects
uri = boto.storage_uri(DOGS_BUCKET, GOOGLE_STORAGE)
for obj in uri.get_bucket():
    print('%s://%s/%s' % (uri.scheme, uri.bucket_name, obj.name))
    print(' "%s"' % obj.get_contents_as_string())
The problem with uri.get_bucket() is that it appears to fetch all of the objects first, which I don't want. I just need to get the object name list of a particular folder (e.g. gs://mybucket/abc/myfolder), which should be much quicker.
Could someone help answer? Appreciate every answer!
Update: the below is true for the older "Google API Client Libraries" for Python, but if you're not using that client, prefer the newer "Google Cloud Client Library" for Python (https://googleapis.dev/python/storage/latest/index.html). For the newer library, the equivalent of the code below is:
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs('bucketname', prefix='abc/myfolder'):
    print(str(blob))
Answer for older client follows.
You may find it easier to work with the JSON API, which has a full-featured Python client. It has a function for listing objects that takes a prefix parameter, which you could use to check for a certain directory and its children in this manner:
import json

from apiclient import discovery

# Auth goes here if necessary. Create authorized http object...
client = discovery.build('storage', 'v1')  # add http=whatever param if auth

request = client.objects().list(
    bucket="mybucket",
    prefix="abc/myfolder")

while request is not None:
    response = request.execute()
    print(json.dumps(response, indent=2))
    request = client.objects().list_next(request, response)
Fuller documentation of the list call is here: https://developers.google.com/storage/docs/json_api/v1/objects/list
And the Google Python API client is documented here:
https://code.google.com/p/google-api-python-client/
This worked for me:
from google.cloud import storage

client = storage.Client()

BUCKET_NAME = 'DEMO_BUCKET'
bucket = client.get_bucket(BUCKET_NAME)

blobs = bucket.list_blobs()
for blob in blobs:
    print(blob.name)
The list_blobs() method will return an iterator used to find blobs in the bucket.
Now you can iterate over blobs and access every object in the bucket. In this example I just print out the name of the object.
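To answer the original question (is abc.txt present under a given folder?), a minimal sketch with the same client; the bucket and folder names are placeholders:
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('mybucket')

# Listing with a prefix only returns objects under that "folder".
names = {blob.name for blob in bucket.list_blobs(prefix='abc/myfolder/')}
print('abc/myfolder/abc.txt' in names)
If you only care about a single key, bucket.blob('abc/myfolder/abc.txt').exists() is even shorter.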
This documentation helped me a lot:
https://googleapis.github.io/google-cloud-python/latest/storage/blobs.html
https://googleapis.github.io/google-cloud-python/latest/_modules/google/cloud/storage/client.html#Client.bucket
I hope I could help!
You might also want to look at gcloud-python and documentation.
from gcloud import storage
connection = storage.get_connection(project_name, email, private_key_path)
bucket = connection.get_bucket('my-bucket')
for key in bucket:
    if key.name == 'abc.txt':
        print('Found it!')
        break
However, you might be better off just checking if the file exists:
if 'abc.txt' in bucket:
    print('Found it!')
Install the Python package google-cloud-storage via pip (or PyCharm) and use the code below:
from google.cloud import storage
client = storage.Client()
for blob in client.list_blobs(BUCKET_NAME, prefix=FOLDER_NAME):
    print(str(blob))
I know this is an old question, but I stumbled over this because I was looking for the exact same answer. Answers from Brandon Yarbrough and Abhijit worked for me, but I wanted to get into more detail.
When you run this:
from google.cloud import storage
storage_client = storage.Client()
blobs = list(storage_client.list_blobs(bucket_name, prefix=PREFIX, fields="items(name)"))
You will get Blob objects, with just the name field of all files in the given bucket, like this:
[<Blob: BUCKET_NAME, PREFIX, None>,
<Blob: xml-BUCKET_NAME, [PREFIX]claim_757325.json, None>,
<Blob: xml-BUCKET_NAME, [PREFIX]claim_757390.json, None>,
...]
If you are like me and you want to 1) filter out the first item in the list because it does NOT represent a file (it's just the prefix), 2) get just the name string value, and 3) remove the PREFIX from the file name, you can do something like this:
blob_names = [blob_name.name[len(PREFIX):] for blob_name in blobs if blob_name.name != PREFIX]
Complete code to get just the string files names from a storage bucket:
from google.cloud import storage
storage_client = storage.Client()
blobs = list(storage_client.list_blobs(bucket_name, prefix=PREFIX, fields="items(name)"))
blob_names = [blob_name.name[len(PREFIX):] for blob_name in blobs if blob_name.name != PREFIX]
print(f"blob_names = {blob_names}")