Boto - S3 Response Error 403 Forbidden - python

I know there are a lot of questions on here about the same issue; however, I have gone through each and every one of them and have tried the suggestions and answers given there to no avail. That's why I am posting this question here.
I am trying to upload a file to my bucket. Since this file is larger than 100MB, I am trying to upload it using the multipart upload which boto supports. I was able to achieve that. Then I tried to increase upload speeds by using the Pool class from the multiprocessing module, with the code given below. When I run the program, nothing happens. I used `from multiprocessing.dummy import Pool` for debugging purposes, and then the program raises
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>55D423C42E8A9D94</RequestId><HostId>kxxX+UmBlGaT4X8adUAp9XQV/1jiiK83IZKQuKxAIMEmzdC3g9IRqDqIVXGLPAOe</HostId></Error>
at `raise exc` (marked with a '#') under _upload. I don't understand why I get this. I have full read and write access to the bucket, and standard uploads work like a charm without any error. I can also delete any file I want from the bucket. The only issue arises when I try parallel uploads. The code is pasted below and can also be found here.
My code (I have removed my keys and bucket name):
def _upload_part(bucketname, aws_key, aws_secret, multipart_id, part_num,
                 source_path, offset, bytes, amount_of_retries=10):
    """
    Uploads a part with retries.
    """
    def _upload(retries_left=amount_of_retries):
        try:
            logging.info('Start uploading part #%d ...' % part_num)
            conn = S3Connection(aws_key, aws_secret)
            bucket = conn.get_bucket(bucketname)
            for mp in bucket.get_all_multipart_uploads():
                if mp.id == multipart_id:
                    with FileChunkIO(source_path, 'r', offset=offset,
                                     bytes=bytes) as fp:
                        mp.upload_part_from_file(fp=fp, part_num=part_num)
                    break
        except Exception, exc:
            if retries_left:
                _upload(retries_left=retries_left - 1)
            else:
                logging.info('... Failed uploading part #%d' % part_num)
                raise exc  # this line raises the error
        else:
            logging.info('... Uploaded part #%d' % part_num)
    _upload()
def upload(bucketname, aws_key, aws_secret, source_path, keyname,
           acl='private', headers={}, parallel_processes=4):
    """
    Parallel multipart upload.
    """
    conn = S3Connection(aws_key, aws_secret)
    bucket = conn.get_bucket(bucketname)
    mp = bucket.initiate_multipart_upload(keyname, headers=headers)
    source_size = os.stat(source_path).st_size
    bytes_per_chunk = max(int(math.sqrt(5242880) * math.sqrt(source_size)),
                          5242880)
    chunk_amount = int(math.ceil(source_size / float(bytes_per_chunk)))
    pool = Pool(processes=parallel_processes)
    for i in range(chunk_amount):
        offset = i * bytes_per_chunk
        remaining_bytes = source_size - offset
        bytes = min([bytes_per_chunk, remaining_bytes])
        part_num = i + 1
        pool.apply_async(_upload_part, [bucketname, aws_key, aws_secret, mp.id,
                                        part_num, source_path, offset, bytes])
    pool.close()
    pool.join()
    if len(mp.get_all_parts()) == chunk_amount:
        mp.complete_upload()
        key = bucket.get_key(keyname)
        key.set_acl(acl)
    else:
        mp.cancel_upload()
upload(default_bucket, acs_key, sec_key, '/path/to/folder/testfile.txt', 'testfile.txt')
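One likely culprit worth checking: bucket.get_all_multipart_uploads() issues a bucket-level ListMultipartUploads request, which requires its own permission (s3:ListBucketMultipartUploads) on the bucket resource, beyond the object read/write rights that make standard uploads and deletes work. A policy fragment granting it might look like this (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucketMultipartUploads"],
      "Resource": "arn:aws:s3:::your-bucket-name"
    }
  ]
}
```

This would explain why every other operation succeeds while only the parallel path, which is the only one that lists in-progress multipart uploads, returns 403.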

Related

Multipart uploaded s3 file returns incorrect etag

I have a Python function that does a multipart upload, where the chunk size (max_file_size_MB) is set to 1MB:
def upload_file_to_aws(self, file_path, filename):
    MB = 1024 ** 2
    config = TransferConfig(multipart_threshold=self.max_file_size_MB * MB,
                            max_concurrency=10,
                            multipart_chunksize=self.max_file_size_MB * MB,
                            use_threads=True)
    try:
        self.client.upload_file(file_path, self.bucket_name, filename,
                                ExtraArgs={'ContentType': 'application/json'},
                                Config=config,
                                Callback=ProgressPercentage(file_path))
    except ClientError as e:
        logging.error("Error uploading file to AWS. %s", e)
        return False
And I'm using the etag calculation algorithm from here
I have the following unit test, which uses a 1.6MB file, to make sure the etag calculation function works fine:
@mock_s3
def test_calculated_s3_etag_for_multipart_uploads_is_correct(self):
    # Run on a file of 1.6MB
    conn = boto3.resource('s3')
    conn.create_bucket(Bucket=self.client.bucket_name)
    calculated_etag = calculate_s3_etag(SNAPSHOT_DATA_LARGE_FILE)
    self.client.upload_file_to_aws(SNAPSHOT_DATA_LARGE_FILE, METADATA_ID)
    obj = self.client.get_object(self.client.bucket_name, METADATA_ID)
    uploaded_file_etag = obj['ETag']
    assert calculated_etag == uploaded_file_etag
but it fails with:
assert '"bd86ed92459190f807e59b60b6089674-2"' == u'"eaa8602b7154cab4e0787756b2fdc53f-1"'
The calculated etag looks alright, but I'm not sure why the S3 etag looks like the file was uploaded in only 1 chunk (when it should have been 2).
Any ideas what I might be doing wrong?
EDIT: All of the above works fine for any file size as long as the chunk size is more than 5MB. If I want to transfer files in chunks smaller than 5MB, I'm not sure what encoding technique is used (I can't use a plain md5 or the reverse-engineered etag decoding).
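For reference, the multipart ETag S3 produces is not an md5 of the whole file: it is the md5 of the concatenated binary md5 digests of the individual parts, with the part count appended after a dash. A minimal sketch of that calculation (the helper name is mine):

```python
import hashlib


def multipart_etag(data, chunk_size):
    # Split the payload into parts the way a multipart upload would,
    # md5 each part, then md5 the concatenated binary digests and
    # append "-<number of parts>".
    parts = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    digests = b''.join(hashlib.md5(p).digest() for p in parts)
    return '"%s-%d"' % (hashlib.md5(digests).hexdigest(), len(parts))
```

As for the "-1" suffix in the failing assertion: boto3's transfer layer silently raises part sizes below S3's 5MB minimum, so a 1MB multipart_chunksize most likely ends up as a single oversized part, and the two ETags are then computed over different part boundaries.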

How to write parquet file to ECS in Flask python using boto or boto3

I have a Flask Python REST API which is called by another Flask REST API.
The input for my API is one parquet file (a FileStorage object) plus the ECS connection and bucket details.
I want to save the parquet file to a specific folder in ECS using boto or boto3.
The code I have tried:
def uploadFileToGivenBucket(self, inputData, file):
    BucketName = inputData.ecsbucketname
    calling_format = OrdinaryCallingFormat()
    client = S3Connection(inputData.access_key_id, inputData.secret_key,
                          port=inputData.ecsport,
                          host=inputData.ecsEndpoint, debug=2,
                          calling_format=calling_format)
    # client.upload_file(BucketName, inputData.filename, inputData.folderpath)
    bucket = client.get_bucket(BucketName, validate=False)
    key = boto.s3.key.Key(bucket, inputData.filename)
    fileName = NamedTemporaryFile(delete=False, suffix=".parquet")
    file.save(fileName)
    with open(fileName.name) as f:
        key.send_file(f)
but it is not working and gives me an error like:
signature_host = '%s:%d' % (self.host, port)
TypeError: %d format: a number is required, not str
I tried Google but no luck. Can anyone help me with this or share sample code for the same?
After a lot of trial and error, I finally found the solution. I'm posting it for everyone else who is facing the same issue.
You need to use boto3, and here is the code:
def uploadFileToGivenBucket(self, inputData, file):
    BucketName = inputData.ecsbucketname
    # bucket = client.get_bucket(BucketName, validate=False)
    f = NamedTemporaryFile(delete=False, suffix=".parquet")
    file.save(f)
    endpointurl = "<your endpoints>"
    s3_client = boto3.client('s3', endpoint_url=endpointurl,
                             aws_access_key_id=inputData.access_key_id,
                             aws_secret_access_key=inputData.secret_key)
    try:
        newkey = 'yourfolderpath/anotherfolder' + inputData.filename
        response = s3_client.upload_file(f.name, BucketName, newkey)
    except ClientError as e:
        logging.error(e)
        return False
    return True
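One detail worth double-checking in the code above: 'yourfolderpath/anotherfolder' + inputData.filename concatenates with no separator, so the object key comes out as anotherfolder<filename> rather than placing the file inside the folder. A small sketch of a safer join (the helper name and paths are illustrative):

```python
import posixpath


def object_key(folder, filename):
    # S3/ECS object keys always use '/' as the separator; posixpath.join
    # inserts it regardless of the local OS path conventions.
    return posixpath.join(folder, filename)
```

For example, object_key('yourfolderpath/anotherfolder', 'data.parquet') yields the key with the folder and file name properly separated.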

Slow downloads with Google Storage Bucket

I am running into a strange issue downloading files from Google Storage Buckets.
If I am on Linux and run this code, a 64kb PDF file takes like 5 minutes to download.
def generate_document(request):
    if not ensure_valid_user(request):
        return redirect('/?result=0')
    try:
        long_name = request.GET['long_name']
        short_name = request.GET['short_name']
        file_data, size = CloudStorageManager.get_file(long_name)
        response = HttpResponse(file_data, content_type='application/octet-stream')
        response['Content-Disposition'] = 'attachment; filename={}'.format(short_name)
        response['Content-Length'] = size
        return response
    except Exception as ex:
        print(ex)
Here is the method from CloudStorageManager class that is important:
class CloudStorageManager:
    # private key file, used for local testing
    storage_client = storage.Client.from_service_account_json(
        'CloudStorageAPIKey.json')
    bucket = storage_client.get_bucket("my.private.bucket")

    @staticmethod
    def get_file(long_name):
        bucket = CloudStorageManager.bucket
        blob = bucket.blob(long_name)
        file_string = blob.download_as_string()
        return file_string, blob.size
What I am lost on is this: on Linux, if I comment out response['Content-Length'] = size from my generate_document() method, the download occurs at normal speed; yet when I go home and get on Windows with that line commented out, the download takes 5 minutes again, and works normally with the line included.
Can someone help explain where I am going wrong?
Interestingly enough, I fixed the issue by changing the Content-Length of my response from:
response['Content-Length'] = size
to:
response['Content-Length'] = len(response.content)
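A plausible explanation for the fix: when the Content-Length header is larger than the body actually sent, the client keeps waiting for bytes that never arrive until a timeout, which looks like a 5-minute "download". bucket.blob() does not fetch object metadata, so blob.size may be None or stale; measuring the bytes actually being sent, as the fix does, keeps the header honest. A sketch of that idea (the helper name is mine):

```python
def content_length_for(payload):
    # Report the byte count of the payload we are actually sending,
    # encoding text first, instead of trusting possibly stale metadata.
    if isinstance(payload, str):
        payload = payload.encode('utf-8')
    return len(payload)
```

Note that for text payloads the character count and the byte count differ as soon as any non-ASCII character appears, which is another way a metadata-derived length can disagree with the body.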

Urllib2 Python - Reconnecting and Splitting Response

I am moving to Python from another language and I am not sure how to properly tackle this. Using the urllib2 library it is quite easy to set up a proxy and get data from a site:
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
The problem I have is that the retrieved text file is very large (hundreds of MB) and the connection is often problematic. The code also needs to catch connection, server, and transfer errors (it will be part of a small, extensively used pipeline).
Could anyone suggest how to modify the code above so that it automatically reconnects n times (for example, 100 times) and perhaps splits the response into chunks so the data downloads faster and more reliably?
I have already split the requests as much as I could, so now I have to make sure the retrieval code is as good as it can be. Solutions based on core Python libraries are ideal.
Perhaps the library already does the above, in which case: is there any way to improve downloading large files? I am using UNIX and need to deal with a proxy.
Thanks for your help.
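Whatever HTTP library ends up being used, the "reconnect n times" requirement can be isolated into a small bounded-retry wrapper; here is a sketch under the assumption that network failures surface as IOError/OSError (the names are illustrative):

```python
import time


def with_retries(fn, attempts=100, delay=0.0):
    # Call fn() until it succeeds or the attempt budget runs out,
    # then re-raise the last error so callers still see the failure.
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except (IOError, OSError) as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error
```

It would wrap the fetch as, e.g., with_retries(lambda: urllib2.urlopen(req).read(), attempts=100); adding a nonzero delay (or exponential backoff) is usually kinder to a struggling server.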
I'm putting up an example of how you might want to do this with the python-requests library. The script below checks whether the destination file already exists. If it does, it's assumed to be a partially downloaded file, and the script tries to resume the download. If the server claims to support HTTP partial requests (i.e. the response to a HEAD request contains an Accept-Ranges header), then the script resumes based on the size of the partially downloaded file; otherwise it just does a regular download and discards the parts that are already downloaded. I think it should be fairly straightforward to convert this to use just urllib2 if you don't want to use python-requests; it'll probably just be much more verbose.
Note that resuming a download may corrupt the file if the file on the server is modified between the initial download and the resume. This can be detected if the server supports a strong HTTP ETag header, so the downloader can check whether it's resuming the same file.
I make no claim that it is bug-free.
You should probably add checksum logic around this script to detect download errors and retry from scratch if the checksum doesn't match.
import logging
import os
import re

import requests

CHUNK_SIZE = 5 * 1024  # 5KB
logging.basicConfig(level=logging.INFO)


def stream_download(input_iterator, output_stream):
    for chunk in input_iterator:
        output_stream.write(chunk)


def skip(input_iterator, output_stream, bytes_to_skip):
    total_read = 0
    while total_read <= bytes_to_skip:
        chunk = next(input_iterator)
        total_read += len(chunk)
    output_stream.write(chunk[bytes_to_skip - total_read:])
    assert total_read == output_stream.tell()
    return input_iterator


def resume_with_range(url, output_stream):
    dest_size = output_stream.tell()
    headers = {'Range': 'bytes=%s-' % dest_size}
    resp = requests.get(url, stream=True, headers=headers)
    input_iterator = resp.iter_content(CHUNK_SIZE)
    if resp.status_code != requests.codes.partial_content:
        logging.warn('server does not agree to do partial request, skipping instead')
        input_iterator = skip(input_iterator, output_stream, output_stream.tell())
        return input_iterator
    rng_unit, rng_start, rng_end, rng_size = re.match(
        r'(\w+) (\d+)-(\d+)/(\d+|\*)', resp.headers['Content-Range']).groups()
    rng_start, rng_end, rng_size = map(int, [rng_start, rng_end, rng_size])
    assert rng_start <= dest_size
    if rng_start != dest_size:
        logging.warn('server returned different Range than requested')
        output_stream.seek(rng_start)
    return input_iterator


def download(url, dest):
    ''' Download `url` to `dest`, resuming if `dest` already exists

    If `dest` already exists it is assumed to be a partially
    downloaded file for the url.
    '''
    output_stream = open(dest, 'ab+')
    output_stream.seek(0, os.SEEK_END)
    dest_size = output_stream.tell()

    if dest_size == 0:
        logging.info('STARTING download from %s to %s', url, dest)
        resp = requests.get(url, stream=True)
        input_iterator = resp.iter_content(CHUNK_SIZE)
        stream_download(input_iterator, output_stream)
        logging.info('FINISHED download from %s to %s', url, dest)
        return

    remote_headers = requests.head(url).headers
    remote_size = int(remote_headers['Content-Length'])
    if dest_size < remote_size:
        logging.info('RESUMING download from %s to %s', url, dest)
        support_range = 'bytes' in [s.strip() for s in remote_headers['Accept-Ranges'].split(',')]
        if support_range:
            logging.debug('server supports Range request')
            logging.debug('downloading "Range: bytes=%s-"', dest_size)
            input_iterator = resume_with_range(url, output_stream)
        else:
            logging.debug('skipping %s bytes', dest_size)
            resp = requests.get(url, stream=True)
            input_iterator = resp.iter_content(CHUNK_SIZE)
            input_iterator = skip(input_iterator, output_stream, bytes_to_skip=dest_size)
        stream_download(input_iterator, output_stream)
        logging.info('FINISHED download from %s to %s', url, dest)
        return

    logging.debug('NOTHING TO DO')
    return


def main():
    TEST_URL = 'http://mirror.internode.on.net/pub/test/1meg.test'
    DEST = TEST_URL.split('/')[-1]
    download(TEST_URL, DEST)

main()
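The checksum logic suggested above could be sketched like this; the helper names are mine, and the expected digest would have to come from the server or from pipeline metadata:

```python
import hashlib
import os


def file_sha256(path, chunk_size=64 * 1024):
    # Stream the file in chunks so a multi-hundred-MB download never
    # has to fit in memory at once.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


def verify_or_restart(path, expected_hex):
    # On mismatch, remove the partial file so the next download()
    # call starts from scratch instead of resuming corrupt data.
    if file_sha256(path) != expected_hex:
        os.remove(path)
        return False
    return True
```

Calling verify_or_restart(dest, expected) after download() gives the "retry from scratch" behavior: a failed check deletes the file, and the next download() sees an empty destination.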
You can try something like this. It reads the file line by line and appends the lines to a file, checking to make sure it doesn't write the same line twice. I'll write another version that works in chunks as well.
import urllib2

file_checker = None
print("Please Wait...")
while True:
    try:
        req = urllib2.Request('http://www.voidspace.org.uk')
        response = urllib2.urlopen(req, timeout=20)
        print("Connected")
        with open("outfile.html", 'w+') as out_data:
            for data in response.readlines():
                file_checker = open("outfile.html")
                if data not in file_checker.readlines():
                    out_data.write(str(data))
        break
    except urllib2.URLError:
        print("Connection Error!")
        print("Connecting again...please wait")
        file_checker.close()
print("done")
Here's how to read the data in chunks instead of by lines
import urllib2

CHUNK = 16 * 1024
file_checker = None
print("Please Wait...")
while True:
    try:
        req = urllib2.Request('http://www.voidspace.org.uk')
        response = urllib2.urlopen(req, timeout=1)
        print("Connected")
        with open("outdata", 'wb+') as out_data:
            while True:
                chunk = response.read(CHUNK)
                file_checker = open("outdata")
                if chunk and chunk not in file_checker.readlines():
                    out_data.write(chunk)
                else:
                    break
        break
    except urllib2.URLError:
        print("Connection Error!")
        print("Connecting again...please wait")
        file_checker.close()
print("done")

S3ResponseError: 403 Forbidden using boto

I have a script that copies files from one S3 account to another. It was working before, that's for sure. Then I tried it today and it doesn't any more; it gives me the error S3ResponseError: 403 Forbidden. I'm 100% sure the credentials are correct, and I can go and download keys from both accounts manually using the AWS console.
Code
def run(self):
    while True:
        # Remove and return an item from the queue
        key_name = self.q.get()
        k = Key(self.s_bucket, key_name)
        d_key = Key(self.d_bucket, k.key)
        if not d_key.exists() or k.etag != d_key.etag:
            print 'Moving {file_name} from {s_bucket} to {d_bucket}'.format(
                file_name=k.key,
                s_bucket=source_bucket,
                d_bucket=dest_bucket
            )
            # Create a new key in the bucket by copying another existing key
            acl = self.s_bucket.get_acl(k)
            self.d_bucket.copy_key(d_key.key, self.s_bucket.name, k.key,
                                   storage_class=k.storage_class)
            d_key.set_acl(acl)
        else:
            print 'File exist'
        self.q.task_done()
Error:
File "s3_to_s3.py", line 88, in run
    self.d_bucket.copy_key(d_key.key, self.s_bucket.name, k.key, storage_class=k.storage_class)
File "/usr/lib/python2.7/dist-packages/boto/s3/bucket.py", line 689, in copy_key
    response.reason, body)
S3ResponseError: S3ResponseError: 403 Forbidden
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>0729E8ADBD7A9E60</RequestId><HostId>PSbbWCLBtLAC9cjW+52X1fUSVErnZeN79/w7rliDgNbLIdCpc9V0bPi8xO9fp1od</HostId></Error>
Try this: copy the key from the source bucket to the destination bucket using boto's Key class.
source_key_name = 'image.jpg'  # for example

# returns a Key object
source_key = source_bucket.get_key(source_key_name)

# use Key.copy
source_key.copy(destination_bucket, source_key_name)
Regarding the copy function: you can set preserve_acl to True and the ACL will be copied from the source key.
Boto's Key.copy signature:
def copy(self, dst_bucket, dst_key, metadata=None,
         reduced_redundancy=False, preserve_acl=False,
         encrypt_key=False, validate_dst_bucket=True):
