AWS Lambda open PDF using PyPDF2 - Python

I was trying to open a PDF with the Python library PyPDF2 in AWS Lambda,
but it's giving me an Access Denied error.
Code
from PyPDF2 import PdfFileReader

pdf = PdfFileReader(open('S3 FILE URL', 'rb'))
if pdf.isEncrypted:
    pdf.decrypt('')
width = int(pdf.getPage(0).mediaBox.getWidth())
height = int(pdf.getPage(0).mediaBox.getHeight())
My bucket permissions:
Block all public access: Off
Block public access to buckets and objects granted through new access control lists (ACLs): Off
Block public access to buckets and objects granted through any access control lists (ACLs): Off
Block public access to buckets and objects granted through new public bucket or access point policies: Off
Block public and cross-account access to buckets and objects through any public bucket or access point policies: Off

You're skipping a step by trying to use open() to fetch a URL: open() can only open files on the local filesystem - https://docs.python.org/3/library/functions.html#open
You'll need to use urllib3 (or similar) to fetch the file from S3 first (assuming the bucket is also publicly accessible, as Manish pointed out).
urllib3 usage suggestion: What's the best way to download a file using urllib3
So combining the two:
pdf = PdfFileReader(open('S3 FILE URL', 'rb'))
becomes (something like)
import urllib3

def fetch_file(url, save_as, chunk_size=65536):
    # Stream the response to disk in chunks instead of loading it all into memory
    http = urllib3.PoolManager()
    r = http.request('GET', url, preload_content=False)
    with open(save_as, 'wb') as out:
        while True:
            data = r.read(chunk_size)
            if not data:
                break
            out.write(data)
    r.release_conn()

if __name__ == "__main__":
    pdf_filename = "my_pdf_from_s3.pdf"
    fetch_file(s3_file_url, pdf_filename)  # s3_file_url: the public URL of the S3 object
    pdf = PdfFileReader(open(pdf_filename, 'rb'))
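If you'd rather not make the bucket public, a minimal sketch of the boto3 alternative from inside the Lambda function (the bucket name, key, and the required s3:GetObject permission on the Lambda execution role are assumptions here):
import boto3
from PyPDF2 import PdfFileReader

# 'my-bucket' and 'path/to/file.pdf' are placeholders; /tmp is the only
# writable directory in the Lambda runtime.
s3 = boto3.client('s3')
s3.download_file('my-bucket', 'path/to/file.pdf', '/tmp/file.pdf')
pdf = PdfFileReader(open('/tmp/file.pdf', 'rb'))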

I believe you have to make changes in that section (the Block public access settings) of your S3 bucket in the AWS console; that should solve your issue.

Related

use boto3 session when opening s3 url

I see a lot of code that uses an S3 bucket URL to open a file. I would like to use smart_open to open a compressed file:
session = boto3.Session(ID, pass)
file = open("s3://bucket/file.txt.gz", transport_params=dict(session=session), encoding="utf-8")
However, all the examples I see of smart_open and other boto3 reads from a URL only show how to use the session when pushing data to a new bucket, never when pulling data from the URL. Is there a way to use the URL and the session without needing to create a client from my session and access the bucket and key?
As mentioned, you just need to replace "wb" with "rb"; I was mistaken in thinking that this didn't work:
from smart_open import open
import boto3

url = 's3://bucket/your/keyz'
session = boto3.Session(aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key,
                        region_name=region_name)

with open(url, 'rb', transport_params={'client': session.client('s3')}) as fin:
    file = fin.read()
    print(file)
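If you want the decompressed text directly (as in the question's encoding="utf-8" attempt), smart_open can also open the object in text mode and, as far as I know, decompresses .gz transparently based on the file extension; a small sketch of that variant, reusing the same placeholder credentials:
from smart_open import open
import boto3

session = boto3.Session(aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key,
                        region_name=region_name)

# 'r' plus an encoding yields decoded text; the .gz suffix triggers transparent decompression.
with open('s3://bucket/file.txt.gz', 'r', encoding='utf-8',
          transport_params={'client': session.client('s3')}) as fin:
    for line in fin:
        print(line.rstrip())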

Move all files in an S3 bucket from one AWS account to another using boto3

I'm trying to move the contents of a bucket in account-a to a bucket in account-b; I already have credentials for both.
Here's the code I'm currently using:
import boto3

SRC_AWS_KEY = 'src-key'
SRC_AWS_SECRET = 'src-secret'
DST_AWS_KEY = 'dst-key'
DST_AWS_SECRET = 'dst-secret'

srcSession = boto3.session.Session(
    aws_access_key_id=SRC_AWS_KEY,
    aws_secret_access_key=SRC_AWS_SECRET
)
dstSession = boto3.session.Session(
    aws_access_key_id=DST_AWS_KEY,
    aws_secret_access_key=DST_AWS_SECRET
)

copySource = {
    'Bucket': 'src-bucket',
    'Key': 'test-bulk-src'
}

srcS3 = srcSession.resource('s3')
dstS3 = dstSession.resource('s3')

dstS3.meta.client.copy(CopySource=copySource, Bucket='dst-bucket', Key='test-bulk-dst',
                       SourceClient=srcS3.meta.client)
print('success')
The problem is that when the Key points to a single file (e.g. it ends with /file.csv) it works fine, but when I set it to copy the whole folder, as shown in the code, it fails and throws this exception:
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
What I need is to move the contents in one call rather than by iterating through the contents of the source folder, because that is time-consuming and costly; I may have thousands of files to move.
There is no API call in Amazon S3 to copy folders. (Folders do not actually exist — the Key of each object includes its full path.)
You will need to iterate through each object and copy it.
The AWS CLI (written in Python) provides some higher-level commands that will do this iteration for you:
aws s3 cp --recursive s3://source-bucket/folder/ s3://destination-bucket/folder/
If the buckets are in different accounts, I would recommend:
Use a set of credentials for the destination account (avoids problems with object ownership)
Modify the bucket policy on the source bucket to permit access by the credentials from the destination account (avoids the need to use two sets of credentials)
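For reference, a minimal boto3 sketch of that per-object iteration (the bucket names, the prefix, and the use of destination-account credentials are assumptions here):
import boto3

# Credentials for the destination account; the source bucket's policy must
# grant it s3:ListBucket and s3:GetObject.
session = boto3.session.Session(aws_access_key_id='dst-key',
                                aws_secret_access_key='dst-secret')
s3 = session.resource('s3')

for obj in s3.Bucket('src-bucket').objects.filter(Prefix='test-bulk-src/'):
    copy_source = {'Bucket': 'src-bucket', 'Key': obj.key}
    new_key = obj.key.replace('test-bulk-src/', 'test-bulk-dst/', 1)
    s3.meta.client.copy(copy_source, 'dst-bucket', new_key)
    # obj.delete()  # uncomment to make this a move rather than a copy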

Downloading a file from a Requester Pays bucket in Amazon S3

I need to download some replay files from an API that stores the files in an Amazon S3 bucket with Requester Pays enabled.
The problem is, I set up my AWS account and created an AWSAccessKeyId and AWSSecretKey, but I still can't download a single file; I'm getting an Access Denied response.
I want to automate all this inside a Python script, so I've been trying to do this with the boto3 package. Also, I installed the Amazon AWS CLI, and set up my access ID and secret key.
The file I've been trying to download (I want to download multiple ones, but for now I'm trying with just one) is this: http://hotsapi.s3-website-eu-west-1.amazonaws.com/18e8b4df-6dad-e1f5-bfc7-48899e6e6a16.StormReplay
From what I've found so far on SO, I've tried something like this:
import boto3
import botocore

BUCKET_NAME = 'hotsapi'  # replace with your bucket name
KEY = '18e8b4df-6dad-e1f5-bfc7-48899e6e6a16.StormReplay'  # replace with your object key

s3 = boto3.resource('s3')

try:
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'test.StormReplay')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
And this:
import boto3

s3_client = boto3.Session().client('s3')
response = s3_client.get_object(Bucket='hotsapi',
                                Key='18e8b4df-6dad-e1f5-bfc7-48899e6e6a16.StormReplay',
                                RequestPayer='requester')
response_content = response['Body'].read()

with open('./B01.StormReplay', 'wb') as file:
    file.write(response_content)
But I still can't manage to download the file.
Any help is welcome! Thanks!
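For what it's worth, the Requester Pays flag can also be passed to download_file via ExtraArgs; a minimal sketch combining the two attempts above (assuming the configured credentials are valid and you accept the data-transfer charges):
import boto3

s3 = boto3.resource('s3')
# ExtraArgs forwards RequestPayer to the underlying GetObject calls, so the
# requester rather than the bucket owner is billed for the download.
s3.Bucket('hotsapi').download_file(
    '18e8b4df-6dad-e1f5-bfc7-48899e6e6a16.StormReplay',
    'test.StormReplay',
    ExtraArgs={'RequestPayer': 'requester'}
)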

Can't download from AWS boto3 generated signed url

I'm trying to upload a file to an existing AWS s3 bucket, generate a public URL and use that URL (somewhere else) to download the file.
I'm closely following the example here:
import os
import boto3
import requests
import tempfile

s3 = boto3.client('s3')

with tempfile.NamedTemporaryFile(mode="w", delete=False) as outfile:
    outfile.write("dummycontent")
    file_name = outfile.name

with open(file_name, mode="r") as outfile:
    s3.upload_file(outfile.name, "twistimages", "filekey")
os.unlink(file_name)

url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={
        'Bucket': 'twistimages',
        'Key': 'filekey'
    }
)

response = requests.get(url)
print(response)
I would expect to see a success return code (200) from the requests library.
Instead, stdout is: <Response [400]>
Also, if I navigate to the corresponding URL in a web browser, I get an XML document with the error code InvalidRequest and this error message:
The authorization mechanism you have provided is not supported. Please
use AWS4-HMAC-SHA256.
How can I use boto3 to generate a public URL, which can easily be downloaded by any user by just navigating to the corresponding URL, without generating complex headers?
Why does the example code from the official documentation not work in my case?
AWS S3 still supports the legacy v2 signature in older regions (those launched before 2014), but newer regions only allow AWS4-HMAC-SHA256 (s3v4).
To use it, you must specify it explicitly, either in the ~/.aws/config file or when instantiating the boto3 S3 resource/client, e.g.
# add this entry under ~/.aws/config
[default]
s3 =
    signature_version = s3v4

[other profile]
s3 =
    signature_version = s3v4

Or declare it explicitly when building the client/resource:
s3client = boto3.client('s3', config=boto3.session.Config(signature_version='s3v4'))
s3resource = boto3.resource('s3', config=boto3.session.Config(signature_version='s3v4'))
I solved the issue. I'm using an S3 bucket in the eu-central-1 region and after specifying the region in the config file, everything worked as expected and the script stdout was <Response [200]>.
The configuration file (~/.aws/config) now looks like:
[default]
region=eu-central-1
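If you prefer to keep everything in code instead of ~/.aws/config, both settings can also be passed when creating the client; a small sketch using the bucket and key from the question:
import boto3
from botocore.client import Config

# Explicit region plus SigV4 avoids relying on the config file at all.
s3 = boto3.client('s3', region_name='eu-central-1',
                  config=Config(signature_version='s3v4'))
url = s3.generate_presigned_url('get_object',
                                Params={'Bucket': 'twistimages', 'Key': 'filekey'})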

Python Boto3 AWS Multipart Upload Syntax

I am successfully authenticating with AWS and using the 'put_object' method on the Bucket object to upload a file. Now I want to use the multipart API to accomplish this for large files. I found the accepted answer in this question:
How to save S3 object to a file using boto3
But when trying to implement I am getting "unknown method" errors. What am I doing wrong? My code is below. Thanks!
from boto3.session import Session

## Get an AWS Session
self.awsSession = Session(aws_access_key_id=accessKey,
                          aws_secret_access_key=secretKey,
                          aws_session_token=session_token,
                          region_name=region_type)
...
# Upload the file to S3
s3 = self.awsSession.resource('s3')
s3.Bucket('prodbucket').put_object(Key=fileToUpload, Body=data)                   # WORKS
# s3.Bucket('prodbucket').upload_file(dataFileName, 'prodbucket', fileToUpload)   # DOESN'T WORK
# s3.upload_file(dataFileName, 'prodbucket', fileToUpload)                        # DOESN'T WORK
The upload_file method has not been ported over to the bucket resource yet. For now you'll need to use the client object directly to do this:
client = self.awsSession.client('s3')
client.upload_file(...)
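For illustration, with the variable names from the question that call would look something like this (upload_file performs a managed transfer that switches to multipart automatically for large files):
client = self.awsSession.client('s3')
# upload_file(Filename, Bucket, Key) handles chunking and retries internally.
client.upload_file(dataFileName, 'prodbucket', fileToUpload)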
The Libcloud S3 wrapper transparently handles all the splitting and uploading of the parts for you.
Use the upload_object_via_stream method to do so:
from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver

# Path to a very large file you want to upload
FILE_PATH = '/home/user/myfile.tar.gz'

cls = get_driver(Provider.S3)
driver = cls('api key', 'api secret key')

container = driver.get_container(container_name='my-backups-12345')

# This method blocks until all the parts have been uploaded.
extra = {'content_type': 'application/octet-stream'}

with open(FILE_PATH, 'rb') as iterator:
    obj = driver.upload_object_via_stream(iterator=iterator,
                                          container=container,
                                          object_name='backup.tar.gz',
                                          extra=extra)
For official documentation on S3 Multipart feature, refer to AWS Official Blog.
