Using multiprocessing with boto3 S3 upload_fileobj causes SSLError - python

In AWS Lambda with the python3.9 runtime and boto3 1.20.32, I run the following code:
import boto3
from multiprocessing import Process
s3_client = boto3.client(service_name="s3")
s3_bucket = "bucket"
s3_other_bucket = "other_bucket"
def multiprocess_s3upload(tar_index: dict):
    def _upload(filename, bytes_range):
        src_key = ...
        # get a single raw file from the tar by byte range
        s3_obj = s3_client.get_object(
            Bucket=s3_bucket,
            Key=src_key,
            Range=f"bytes={bytes_range}"
        )
        # upload the raw file
        # the error occurs here
        s3_client.upload_fileobj(
            s3_obj["Body"],
            s3_other_bucket,
            filename
        )
    def _wait(procs):
        for p in procs:
            p.join()
    processes = []
    proc_limit = 256  # limit concurrent processes to avoid the "too many open files" error
    for filename, bytes_range in tar_index.items():
        # filename = "hello.txt"
        # bytes_range = "1024-2048"
        proc = Process(
            target=_upload,
            args=(filename, bytes_range)
        )
        proc.start()
        processes.append(proc)
        if len(processes) == proc_limit:
            _wait(processes)
            processes = []
    _wait(processes)
This program extracts individual raw files from a tar file in an S3 bucket, then uploads each raw file to another S3 bucket. A tar file may contain thousands of raw files, so I use multiprocessing to speed up the S3 uploads.
However, a subprocess randomly raises an SSLError while processing the same tar file. I tried a different tar file and got the same result. Only the last subprocess threw the exception; the rest worked fine.
Process Process-2:
Traceback (most recent call last):
File "/var/runtime/urllib3/response.py", line 441, in _error_catcher
yield
File "/var/runtime/urllib3/response.py", line 522, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/var/lang/lib/python3.9/http/client.py", line 463, in read
n = self.readinto(b)
File "/var/lang/lib/python3.9/http/client.py", line 507, in readinto
n = self.fp.readinto(b)
File "/var/lang/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/var/lang/lib/python3.9/ssl.py", line 1242, in recv_into
return self.read(nbytes, buffer)
File "/var/lang/lib/python3.9/ssl.py", line 1100, in read
return self._sslobj.read(len, buffer)
ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2633)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lang/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self._target(*self._args, **self._kwargs)
File "/var/task/main.py", line 144, in _upload
s3_client.upload_fileobj(
File "/var/runtime/boto3/s3/inject.py", line 540, in upload_fileobj
return future.result()
File "/var/runtime/s3transfer/futures.py", line 103, in result
return self._coordinator.result()
File "/var/runtime/s3transfer/futures.py", line 266, in result
raise self._exception
File "/var/runtime/s3transfer/tasks.py", line 269, in _main
self._submit(transfer_future=transfer_future, **kwargs)
File "/var/runtime/s3transfer/upload.py", line 588, in _submit
if not upload_input_manager.requires_multipart_upload(
File "/var/runtime/s3transfer/upload.py", line 404, in requires_multipart_upload
self._initial_data = self._read(fileobj, threshold, False)
File "/var/runtime/s3transfer/upload.py", line 463, in _read
return fileobj.read(amount)
File "/var/runtime/botocore/response.py", line 82, in read
chunk = self._raw_stream.read(amt)
File "/var/runtime/urllib3/response.py", line 544, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/var/lang/lib/python3.9/contextlib.py", line 137, in __exit__
self.gen.throw(typ, value, traceback)
File "/var/runtime/urllib3/response.py", line 452, in _error_catcher
raise SSLError(e)
urllib3.exceptions.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2633)
According to a similar question from 10 years ago, Multi-threaded S3 download doesn't terminate, the root cause might be that the boto3 S3 upload uses a non-thread-safe library for sending HTTP requests. But the solution there doesn't work for me.
I found a boto3 issue that matches my question. There, the problem disappeared without any change on the author's part:
Actually, the problem has recently disappeared on its own, without any (!) change on my part. As I thought, the problem was created and fixed by Amazon. I'm only afraid what if it will be a thing again...
Does anyone know how to fix this?

According to the boto3 documentation on multiprocessing (doc):
Resource instances are not thread safe and should not be shared across threads or processes. These special classes contain additional meta data that cannot be shared. It's recommended to create a new Resource for each thread or process:
My modified code:
def multiprocess_s3upload(tar_index: dict):
    def _upload(filename, bytes_range):
        src_key = ...
        # get a single raw file from the tar by byte range
        s3_client = boto3.client(service_name="s3")  # <<<< one client per process
        s3_obj = s3_client.get_object(
            Bucket=s3_bucket,
            Key=src_key,
            Range=f"bytes={bytes_range}"
        )
        # upload the raw file
        s3_client.upload_fileobj(
            s3_obj["Body"],
            s3_other_bucket,
            filename
        )
    def _wait(procs):
        ...
    ...
With this change, the SSLError no longer seems to occur.
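The same idea can be sketched in a form that runs without AWS access: the worker builds its own client through a factory instead of inheriting one created in the parent process. The names `upload_member` and `client_factory` are hypothetical, and buffering the ranged body in memory before uploading is an assumption added here (reasonable only when each member is small), not something the answer requires.

```python
import io


def default_client_factory():
    # Imported lazily so the sketch can be read and tested without boto3 installed.
    import boto3
    return boto3.client("s3")


def upload_member(src_bucket, src_key, dst_bucket, filename, bytes_range,
                  client_factory=default_client_factory):
    """Copy one byte range of src_key into dst_bucket/filename.

    Meant to run inside the child process, so the client (and its
    connection pool) is created there rather than shared with the parent.
    """
    client = client_factory()  # one client per process
    obj = client.get_object(
        Bucket=src_bucket,
        Key=src_key,
        Range=f"bytes={bytes_range}",
    )
    # Assumption: each member is small enough to hold in memory.
    data = obj["Body"].read()
    client.upload_fileobj(io.BytesIO(data), dst_bucket, filename)
    return len(data)
```

It could then be used as the process target, for example `Process(target=upload_member, args=(s3_bucket, src_key, s3_other_bucket, filename, bytes_range))`.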

Related

s3fs timeout on big S3 files

This is similar to dask read_csv timeout on Amazon s3 with big files, but that didn't actually resolve my question.
import s3fs
fs = s3fs.S3FileSystem()
fs.connect_timeout = 18000
fs.read_timeout = 18000 # five hours
fs.download('s3://bucket/big_file','local_path_to_file')
The error I then get is
Traceback (most recent call last):
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/aiobotocore/response.py", line 50, in read
chunk = await self.__wrapped__.read(amt if amt is not None else -1)
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/fsspec/spec.py", line 1113, in download
return self.get(rpath, lpath, recursive=recursive, **kwargs)
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/fsspec/asyn.py", line 281, in get
return sync(self.loop, self._get, rpaths, lpaths)
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/fsspec/asyn.py", line 71, in sync
raise exc.with_traceback(tb)
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/fsspec/asyn.py", line 55, in f
result[0] = await future
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/fsspec/asyn.py", line 266, in _get
return await asyncio.gather(
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/s3fs/core.py", line 701, in _get_file
chunk = await body.read(2**16)
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/aiobotocore/response.py", line 52, in read
raise AioReadTimeoutError(endpoint_url=self.__wrapped__.url,
aiobotocore.response.AioReadTimeoutError: Read timeout on endpoint URL: "https://ptpiskiss.s3.eu-west-1.amazonaws.com/REBUILD%20FOR%20TIME%20SERIES/v30a%20sept%202019.accdb"
Which is strange, because I thought I was setting the appropriate timeouts on the worker copy of the class. It's solely due to my bad internet connection, but is there something I need to do on my s3 end to assist here?
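One thing that may be worth checking: s3fs documents a `config_kwargs` argument that is passed through to the botocore client configuration, so the timeouts can be supplied when the filesystem is constructed rather than assigned as attributes afterwards. A sketch under that assumption follows; `make_timeout_kwargs` and `open_filesystem` are hypothetical helper names, and whether this is enough on a slow connection is not confirmed by the post.

```python
def make_timeout_kwargs(connect_timeout, read_timeout):
    """Build constructor kwargs carrying botocore-style timeouts (seconds)."""
    if connect_timeout <= 0 or read_timeout <= 0:
        raise ValueError("timeouts must be positive")
    return {
        "config_kwargs": {
            "connect_timeout": connect_timeout,
            "read_timeout": read_timeout,
        }
    }


def open_filesystem(connect_timeout=18000, read_timeout=18000):
    # Imported lazily so the helper above can be used without s3fs installed.
    import s3fs
    return s3fs.S3FileSystem(**make_timeout_kwargs(connect_timeout, read_timeout))
```

The download call itself would stay as in the question, for example `open_filesystem().download('s3://bucket/big_file', 'local_path_to_file')`.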

Error trying to connect Celery through SQS using STS

I'm trying to use Celery with SQS as broker. In order to use the SQS from my container I need to assume a role and for that I'm using STS. My code looks like this:
import os
import boto3
from celery import Celery
role_info = {
    'RoleArn': 'arn:aws:iam::xxxxxxx:role/my-role-execution',
    'RoleSessionName': 'roleExecution'
}
sts_client = boto3.client('sts', region_name='eu-central-1')
credentials = sts_client.assume_role(**role_info)
aws_access_key_id = credentials["Credentials"]['AccessKeyId']
aws_secret_access_key = credentials["Credentials"]['SecretAccessKey']
aws_session_token = credentials["Credentials"]["SessionToken"]
os.environ["AWS_ACCESS_KEY_ID"] = aws_access_key_id
os.environ["AWS_SECRET_ACCESS_KEY"] = aws_secret_access_key
os.environ["AWS_DEFAULT_REGION"] = 'eu-central-1'
os.environ["AWS_SESSION_TOKEN"] = aws_session_token
broker = "sqs://"
backend = 'redis://redis-service:6379/0'
celery = Celery('tasks', broker=broker, backend=backend)
celery.conf["task_default_queue"] = 'my-queue'
celery.conf["broker_transport_options"] = {
    'region': 'eu-central-1',
    'predefined_queues': {
        'my-queue': {
            'url': 'https://sqs.eu-central-1.amazonaws.com/xxxxxxx/my-queue'
        }
    }
}
In the same file I have the following task:
@celery.task(name='my-queue.my_task')
def my_task(content) -> int:
    print("hello")
    return 0
When I run this code I get the following error:
[2020-09-24 10:38:03,602: CRITICAL/MainProcess] Unrecoverable error: ClientError('An error occurred (AccessDenied) when calling the ListQueues operation: Access to the resource https://eu-central-1.queue.amazonaws.com/ is denied.',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/kombu/transport/virtual/base.py", line 921, in create_channel
return self._avail_channels.pop()
IndexError: pop from empty list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/worker/worker.py", line 208, in start
self.blueprint.start(self)
File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/usr/local/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/connection.py", line 23, in start
c.connection = c.connect()
File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 405, in connect
conn = self.connection_for_read(heartbeat=self.amqheartbeat)
File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 412, in connection_for_read
self.app.connection_for_read(heartbeat=heartbeat))
File "/usr/local/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 439, in ensure_connected
callback=maybe_shutdown,
File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 422, in ensure_connection
callback, timeout=timeout)
File "/usr/local/lib/python3.6/site-packages/kombu/utils/functional.py", line 341, in retry_over_time
return fun(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 275, in connect
return self.connection
File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 823, in connection
self._connection = self._establish_connection()
File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 778, in _establish_connection
conn = self.transport.establish_connection()
File "/usr/local/lib/python3.6/site-packages/kombu/transport/virtual/base.py", line 941, in establish_connection
self._avail_channels.append(self.create_channel(self))
File "/usr/local/lib/python3.6/site-packages/kombu/transport/virtual/base.py", line 923, in create_channel
channel = self.Channel(connection)
File "/usr/local/lib/python3.6/site-packages/kombu/transport/SQS.py", line 100, in __init__
self._update_queue_cache(self.queue_name_prefix)
File "/usr/local/lib/python3.6/site-packages/kombu/transport/SQS.py", line 105, in _update_queue_cache
resp = self.sqs.list_queues(QueueNamePrefix=queue_name_prefix)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 337, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 656, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListQueues operation: Access to the resource https://eu-central-1.queue.amazonaws.com/ is denied.
If I use boto3 directly without Celery, I'm able to connect to the queue and retrieve data without this error. I don't know why Celery/Kombu tries to list queues when I specify the predefined_queues configuration, which is meant to avoid this behavior (from the docs):
If you want Celery to use a set of predefined queues in AWS, and to never attempt to list SQS queues, nor attempt to create or delete them, pass a map of queue names to URLs using the predefined_queue_urls setting
Source here
Does anyone know what is happening? How should I modify my code to make it work? It seems that Celery is not using the credentials at all.
The versions I'm using:
celery==4.4.7
boto3==1.14.54
kombu==4.5.0
Thanks!
PS: I created an issue on GitHub to track whether this is a library error or not...
I solved the problem by updating the dependencies to the latest versions:
celery==5.0.0
boto3==1.14.54
kombu==5.0.2
pycurl==7.43.0.6
I was able to get celery==4.4.7 and kombu==4.6.11 working by setting the following configuration option:
celery.conf["task_create_missing_queues"] = False
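For reference, the settings discussed in this thread can be gathered in one place. This is only a sketch of the configuration keys that already appear above (`task_default_queue`, `broker_transport_options` with `region` and `predefined_queues`, and `task_create_missing_queues`); the helper name `build_sqs_settings` is hypothetical, and it does not address how the STS credentials reach the transport.

```python
def build_sqs_settings(queue_urls, region, default_queue):
    """Return Celery settings for SQS that rely on predefined queues only."""
    if default_queue not in queue_urls:
        raise KeyError(f"default queue {default_queue!r} has no URL")
    return {
        "task_default_queue": default_queue,
        # Workaround reported above for celery 4.4.7 with kombu 4.6.11.
        "task_create_missing_queues": False,
        "broker_transport_options": {
            "region": region,
            "predefined_queues": {
                name: {"url": url} for name, url in queue_urls.items()
            },
        },
    }
```

The result could be applied with `celery.conf.update(build_sqs_settings(...))`.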

Timeout issue with python google cloud storage libraries when using Python Multithreading

The problem that I'm trying to solve is to unzip a bunch of zip files sitting in a GCS bucket. As part of processing these files, I update the file status accordingly in MongoDB using API endpoints. I have a method that takes a dictionary as an argument and does the following:
Update the status of the file to PROCESSING
Download the file from the GCS bucket.
Unzip the file.
Upload the unzipped file back to the GCS bucket.
Update the file status as FINISHED
Since I have multiple files in the bucket to unzip, I tried using Python multithreading to process each zip file in a separate thread, but after a minute or so I get a requests.exceptions.Timeout error thrown by the Google libraries.
When I process the files without multithreading, it works fine without any error. The code and stack trace are below for reference. Any help is much appreciated.
def process_files_multithreading(self, file_info):
    print("process id: " + str(os.getpid()))
    try:
        print("Updating the file status to PROCESSING")
        update_file_status_url = utils.getUpdateFileStatusUrl(config["host"],
                                                              file_info["fileId"])
        utils.updateFileStatus(update_file_status_url, "PROCESSING")
        client = gcs_utils._getGCSConnection()
        blob_path = file_info["fileLocation"]
        bucket = client.get_bucket(file_info["bucketName"])
        blob = bucket.blob(blob_path)
        zipbytes = io.BytesIO(blob.download_as_string())
        if is_zipfile(zipbytes):
            with ZipFile(zipbytes, 'r') as myzip:
                print(myzip.namelist())
                for contentfilename in myzip.namelist():
                    contentfile = myzip.read(contentfilename)
                    blob = bucket.blob(blob_path + "/" + contentfilename)
                    blob.upload_from_string(contentfile)
            print("processed file {} successfully".format(blob_path))
        print("updating the file status to FINISHED")
        update_file_status_url = utils.getUpdateFileStatusUrl(config["host"], file_info["fileId"])
        utils.updateFileStatus(update_file_status_url, "FINISHED")
    except Exception:
        print("Exception in processing the file")
        print("updating the file status to FAILED")
        update_file_status_url = utils.getUpdateFileStatusUrl(config["host"], file_info["fileId"])
        utils.updateFileStatus(update_file_status_url, "FAILED")
        print(traceback.format_exc())
'''
calling the above function using ThreadPoolExecutor
'''
# files is the list of dictionaries
with ThreadPoolExecutor(len(files)) as pool:
    pool.map(self.process_files_multithreading, files)
Exception:
Traceback (most recent call last):
blob.upload_from_string(contentfile)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1361, in upload_from_string
predefined_acl=predefined_acl,
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1261, in upload_from_file
client, file_obj, content_type, size, num_retries, predefined_acl
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1171, in _do_upload
client, stream, content_type, size, num_retries, predefined_acl
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1118, in _do_resumable_upload
response = upload.transmit_next_chunk(transport)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/google/resumable_media/requests/upload.py", line 425, in transmit_next_chunk
retry_strategy=self._retry_strategy,
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/google/resumable_media/requests/_helpers.py", line 136, in http_request
return _helpers.wait_and_retry(func, RequestsMixin._get_status_code, retry_strategy)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/google/resumable_media/_helpers.py", line 150, in wait_and_retry
response = func()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/google/auth/transport/requests.py", line 287, in request
**kwargs
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/google/auth/transport/requests.py", line 110, in __exit__
raise self._timeout_error_type()
requests.exceptions.Timeout
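The code above starts one thread per file. Whether the timeouts come from too many simultaneous uploads is not established in the post, but if that is the cause, capping the number of worker threads is one thing to try. The sketch below is a generic helper (the name `run_bounded` and the default of 4 workers are assumptions) that also collects per-item exceptions instead of discarding the results.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_bounded(worker, items, max_workers=4):
    """Run worker(item) for each item with at most max_workers threads.

    Returns (results, errors): results maps the item index to the return
    value, errors maps the item index to the exception raised for it.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, item): i for i, item in enumerate(items)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:  # keep going and report per-item failures
                errors[index] = exc
    return results, errors
```

In the question's setting this would be called as `run_bounded(self.process_files_multithreading, files)`.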

Python script timeouts while uploading to an ftp server

I'm creating a CSV file and uploading it to an FTP server.
The upload covers the paths of multiple companies on that FTP server,
like:
COMPANY1/FOO/BAR
But for some companies I get this traceback at the end:
Traceback (most recent call last):
File "exporter.py", line 117, in <module>
Exporter().run()
File "exporter.py", line 115, in run
self.upload(file, vendor['ftp_path'], filename)
File "exporter.py", line 78, in upload
sftp.chdir(dir)
File "/home/johndoe/exports/daily/venv/local/lib/python2.7/site-packages/paramiko/sftp_client.py", line 580, in chdir
if not stat.S_ISDIR(self.stat(path).st_mode):
File "/home/johndoe/exports/daily/venv/local/lib/python2.7/site-packages/paramiko/sftp_client.py", line 413, in stat
t, msg = self._request(CMD_STAT, path)
File "/home/johndoe/exports/daily/venv/local/lib/python2.7/site-packages/paramiko/sftp_client.py", line 730, in _request
return self._read_response(num)
File "/home/johndoe/exports/daily/venv/local/lib/python2.7/site-packages/paramiko/sftp_client.py", line 781, in _read_response
self._convert_status(msg)
File "/home/johndoe/exports/daily/venv/local/lib/python2.7/site-packages/paramiko/sftp_client.py", line 807, in _convert_status
raise IOError(errno.ENOENT, text)
IOError: [Errno 2] The requested file does not exist
I think the stack trace is saying that the FTP path doesn't exist (but it does exist).
But when I run the script on just those exact paths (the ones causing the problem), it passes.
So it's not a logic error, and those FTP paths do exist. Could it be due to some timeout occurring?
Thanks,
Tom
Update:
I call the method like this:
self.upload(file, vendor['ftp_path'], filename)
And the actual method that does the uploading:
def upload(self, buffer, ftp_dir, filename):
    for dir in ftp_dir.split('/'):
        if dir == '':
            continue
        sftp.chdir(dir)
    with sftp.open(filename, 'w') as f:
        f.write(buffer)
        f.close()
    sftp.close()
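One possible explanation, not confirmed in the post: paramiko's `chdir` changes the session's working directory, so if `upload` is called once per company on the same session, each relative `chdir` starts from wherever the previous call ended. That would fit the observation that a path works when run on its own. The sketch below avoids walking directories by opening the full relative path after resetting the working directory with `chdir(None)`. The names `remote_path` and `upload_to` are hypothetical, it is written for Python 3, and it assumes the caller closes the session once all uploads are done.

```python
import posixpath


def remote_path(ftp_dir, filename):
    """Join ftp_dir and filename, dropping empty segments such as '//'."""
    parts = [p for p in ftp_dir.split("/") if p]
    return posixpath.join(*parts, filename)


def upload_to(sftp, data, ftp_dir, filename):
    """Write data to ftp_dir/filename without leaving the session in ftp_dir."""
    sftp.chdir(None)  # reset any working directory left by earlier calls
    path = remote_path(ftp_dir, filename)
    with sftp.open(path, "w") as f:
        f.write(data)
    return path
```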

Python script suddenly throwing timeout exceptions

I have a Python script that downloads product feeds from multiple affiliates in different ways. This didn't give me any problems until last Wednesday, when it started throwing all kinds of timeout exceptions from different locations.
Examples: Here I connect to an FTP service:
ftp = FTP(host=self.host)
threw:
Exception in thread Thread-7:
Traceback (most recent call last):
File "C:\Python27\Lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "C:\Python27\Lib\threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "C:\Users\Administrator\Documents\Crawler\src\Crawlers\LDLC.py", line 23, in main
ftp = FTP(host=self.host)
File "C:\Python27\Lib\ftplib.py", line 120, in __init__
self.connect(host)
File "C:\Python27\Lib\ftplib.py", line 138, in connect
self.welcome = self.getresp()
File "C:\Python27\Lib\ftplib.py", line 215, in getresp
resp = self.getmultiline()
File "C:\Python27\Lib\ftplib.py", line 201, in getmultiline
line = self.getline()
File "C:\Python27\Lib\ftplib.py", line 186, in getline
line = self.file.readline(self.maxline + 1)
File "C:\Python27\Lib\socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
timeout: timed out
Or downloading an XML file:
xmlFile = urllib.URLopener()
xmlFile.retrieve(url, self.feedPath + affiliate + "/" + website + '.' + fileType)
xmlFile.close()
throws:
File "C:\Users\Administrator\Documents\Crawler\src\Crawlers\FeedCrawler.py", line 106, in save
xmlFile.retrieve(url, self.feedPath + affiliate + "/" + website + '.' + fileType)
File "C:\Python27\Lib\urllib.py", line 240, in retrieve
fp = self.open(url, data)
File "C:\Python27\Lib\urllib.py", line 208, in open
return getattr(self, name)(url)
File "C:\Python27\Lib\urllib.py", line 346, in open_http
errcode, errmsg, headers = h.getreply()
File "C:\Python27\Lib\httplib.py", line 1139, in getreply
response = self._conn.getresponse()
File "C:\Python27\Lib\httplib.py", line 1067, in getresponse
response.begin()
File "C:\Python27\Lib\httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "C:\Python27\Lib\httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "C:\Python27\Lib\socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
IOError: [Errno socket error] timed out
These are just two examples but there are other methods, like authenticate or other API specific methods where my script throws these timeout errors. It never showed this behavior until Wednesday. Also, it starts throwing them at random times. Sometimes at the beginning of the crawl, sometimes later on. My script has this behavior on both my server and my local machine. I've been struggling with it for two days now but can't seem to figure it out.
This is what I know might have caused this:
On Wednesday one affiliate script broke down with the following error:
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)>
I didn't change anything in my script, but it suddenly stopped crawling that affiliate and threw that error every time I tried to authenticate. I looked it up and found that it was due to an OpenSSL error (where did that come from?). I fixed it by adding the following before the authenticate method:
if hasattr(ssl, '_create_unverified_context'):
    ssl._create_default_https_context = ssl._create_unverified_context
Little did I know, this was just the start of my problems... At that same time, I changed from Python 2.7.8 to Python 2.7.9. It seems that this is the moment that everything broke down and started throwing timeouts.
I tried changing my script in endless ways but nothing worked and like I said, it's not just one method that throws it. Also I switched back to Python 2.7.8, but this didn't do the trick either. Basically everything that makes a request to an external source can throw an error.
Final note: My script is multi threaded. It downloads product feeds from different affiliates at the same time. It used to run 10 threads per affiliate without a problem. Now I tried lowering it to 3 per affiliate, but it still throws these errors. Setting it to 1 is no option because that will take ages. I don't think that's the problem anyway because it used to work fine.
What could be wrong?
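The post does not establish a cause, so the following is only a mitigation sketch: wrapping each network call in a small retry helper with exponential backoff, so that a transient timeout does not end a crawl thread. `retry_call` is a hypothetical helper, written for Python 3 rather than the Python 2.7 used above, and the attempt count and delays are arbitrary defaults.

```python
import socket
import time


def retry_call(func, attempts=3, base_delay=1.0,
               retry_on=(socket.timeout, IOError), sleep=time.sleep):
    """Call func() until it succeeds or the attempts are exhausted.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A connection could then be opened as `ftp = retry_call(lambda: FTP(host, timeout=30))`, where the explicit `timeout` value is likewise an assumption.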
