s3fs timeout on big S3 files

This is similar to "dask read_csv timeout on Amazon s3 with big files", but the answers there didn't resolve my problem.
import s3fs
fs = s3fs.S3FileSystem()
fs.connect_timeout = 18000
fs.read_timeout = 18000 # five hours
fs.download('s3://bucket/big_file','local_path_to_file')
The error I then get is
Traceback (most recent call last):
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/aiobotocore/response.py", line 50, in read
chunk = await self.__wrapped__.read(amt if amt is not None else -1)
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/fsspec/spec.py", line 1113, in download
return self.get(rpath, lpath, recursive=recursive, **kwargs)
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/fsspec/asyn.py", line 281, in get
return sync(self.loop, self._get, rpaths, lpaths)
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/fsspec/asyn.py", line 71, in sync
raise exc.with_traceback(tb)
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/fsspec/asyn.py", line 55, in f
result[0] = await future
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/fsspec/asyn.py", line 266, in _get
return await asyncio.gather(
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/s3fs/core.py", line 701, in _get_file
chunk = await body.read(2**16)
File "/Users/christopherturnbull/PointTopic/PointTopic/lib/python3.9/site-packages/aiobotocore/response.py", line 52, in read
raise AioReadTimeoutError(endpoint_url=self.__wrapped__.url,
aiobotocore.response.AioReadTimeoutError: Read timeout on endpoint URL: "https://ptpiskiss.s3.eu-west-1.amazonaws.com/REBUILD%20FOR%20TIME%20SERIES/v30a%20sept%202019.accdb"
This is strange, because I thought I was setting the appropriate timeouts on the working copy of the class. The timeout itself is solely due to my bad internet connection, but is there something I need to do on the S3 side to help here?
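One thing worth checking, offered as a hedged sketch rather than a confirmed fix: s3fs prepares its botocore configuration when the filesystem is constructed, so assigning connect_timeout and read_timeout on the instance afterwards may have no effect. The timeouts can instead be passed through config_kwargs, which s3fs forwards to the botocore client config. The helper names build_s3fs_kwargs and download, the connect timeout of 60 seconds, and the retry count are assumptions for illustration.

```python
def build_s3fs_kwargs(connect_timeout=60, read_timeout=18000, max_attempts=10):
    """Return keyword arguments for s3fs.S3FileSystem with explicit timeouts."""
    return {
        "config_kwargs": {
            "connect_timeout": connect_timeout,  # seconds
            "read_timeout": read_timeout,        # seconds (18000 = five hours)
            "retries": {"max_attempts": max_attempts},
        }
    }


def download(remote_path, local_path):
    """Download one object using a filesystem created with the timeouts above."""
    import s3fs  # imported lazily so the helper above works without s3fs

    fs = s3fs.S3FileSystem(**build_s3fs_kwargs())
    fs.get(remote_path, local_path)


# Example call (placeholder paths from the question):
# download("s3://bucket/big_file", "local_path_to_file")
```

On a slow connection, a generous read_timeout combined with a few retries is usually more robust than a single very long timeout, because a stalled socket is abandoned and retried instead of blocking for hours.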

Related

Use multiprocess for boto3 s3 upload_fileobj causes SSLError

In AWS Lambda with the python3.9 runtime and boto3 1.20.32, I run the following code:
import boto3
from multiprocessing import Process
s3_client = boto3.client(service_name="s3")
s3_bucket = "bucket"
s3_other_bucket = "other_bucket"
def multiprocess_s3upload(tar_index: dict):
    def _upload(filename, bytes_range):
        src_key = ...
        # get a single raw file from the tar using a byte range
        s3_obj = s3_client.get_object(
            Bucket=s3_bucket,
            Key=src_key,
            Range=f"bytes={bytes_range}"
        )
        # upload the raw file
        # the error occurs here
        s3_client.upload_fileobj(
            s3_obj["Body"],
            s3_other_bucket,
            filename
        )
    def _wait(procs):
        # block until every process in the batch has finished
        for p in procs:
            p.join()
    processes = []
    proc_limit = 256  # limit concurrent processes to avoid the "too many open files" error
    for filename, bytes_range in tar_index.items():
        # filename = "hello.txt"
        # bytes_range = "1024-2048"
        proc = Process(
            target=_upload,
            args=(filename, bytes_range)
        )
        proc.start()
        processes.append(proc)
        if len(processes) == proc_limit:
            _wait(processes)
            processes = []
    _wait(processes)
This program extracts individual raw files from a tar file in an S3 bucket, then uploads each raw file to another S3 bucket. A tar file may contain thousands of raw files, so I use multiprocessing to speed up the S3 uploads.
However, a subprocess intermittently raises an SSLError while processing the same tar file. I tried different tar files and got the same result. Only the last subprocess threw the exception; the remaining ones worked fine.
Process Process-2:
Traceback (most recent call last):
File "/var/runtime/urllib3/response.py", line 441, in _error_catcher
yield
File "/var/runtime/urllib3/response.py", line 522, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/var/lang/lib/python3.9/http/client.py", line 463, in read
n = self.readinto(b)
File "/var/lang/lib/python3.9/http/client.py", line 507, in readinto
n = self.fp.readinto(b)
File "/var/lang/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/var/lang/lib/python3.9/ssl.py", line 1242, in recv_into
return self.read(nbytes, buffer)
File "/var/lang/lib/python3.9/ssl.py", line 1100, in read
return self._sslobj.read(len, buffer)
ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2633)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lang/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self._target(*self._args, **self._kwargs)
File "/var/task/main.py", line 144, in _upload
s3_client.upload_fileobj(
File "/var/runtime/boto3/s3/inject.py", line 540, in upload_fileobj
return future.result()
File "/var/runtime/s3transfer/futures.py", line 103, in result
return self._coordinator.result()
File "/var/runtime/s3transfer/futures.py", line 266, in result
raise self._exception
File "/var/runtime/s3transfer/tasks.py", line 269, in _main
self._submit(transfer_future=transfer_future, **kwargs)
File "/var/runtime/s3transfer/upload.py", line 588, in _submit
if not upload_input_manager.requires_multipart_upload(
File "/var/runtime/s3transfer/upload.py", line 404, in requires_multipart_upload
self._initial_data = self._read(fileobj, threshold, False)
File "/var/runtime/s3transfer/upload.py", line 463, in _read
return fileobj.read(amount)
File "/var/runtime/botocore/response.py", line 82, in read
chunk = self._raw_stream.read(amt)
File "/var/runtime/urllib3/response.py", line 544, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/var/lang/lib/python3.9/contextlib.py", line 137, in __exit__
self.gen.throw(typ, value, traceback)
File "/var/runtime/urllib3/response.py", line 452, in _error_catcher
raise SSLError(e)
urllib3.exceptions.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2633)
According to a similar question from ten years ago, "Multi-threaded S3 download doesn't terminate", the root cause might be that the boto3 S3 upload uses a non-thread-safe library for sending HTTP requests. However, the solution given there doesn't work for me.
I also found a boto3 issue describing my problem. In that case, the problem disappeared without any change on the author's part:
Actually, the problem has recently disappeared on its own, without any (!) change on my part. As I thought, the problem was created and fixed by Amazon. I'm only afraid what if it will be a thing again...
Does anyone know how to fix this?

According to the boto3 documentation on multithreading and multiprocessing:
Resource instances are not thread safe and should not be shared across threads or processes. These special classes contain additional meta data that cannot be shared. It's recommended to create a new Resource for each thread or process:
My modified code:
def multiprocess_s3upload(tar_index: dict):
    def _upload(filename, bytes_range):
        src_key = ...
        # get a single raw file from the tar using a byte range
        s3_client = boto3.client(service_name="s3")  # <<<< one client per process
        s3_obj = s3_client.get_object(
            Bucket=s3_bucket,
            Key=src_key,
            Range=f"bytes={bytes_range}"
        )
        # upload the raw file
        s3_client.upload_fileobj(
            s3_obj["Body"],
            s3_other_bucket,
            filename
        )
    def _wait(procs):
        ...
    ...
With this change, the SSLError exception no longer seems to occur.

TelegramAPIError: Bad Gateway (aiogram)

I created a bot that notifies the user at certain times, but from time to time it gives this error:
dispatcher.py [LINE:390] ERROR | 2022-10-03 04:10:16,846 : Cause exception while getting updates.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/aiogram/dispatcher/dispatcher.py", line 381, in start_polling
updates = await self.bot.get_updates(
File "/usr/local/lib/python3.8/dist-packages/aiogram/bot/bot.py", line 110, in get_updates
result = await self.request(api.Methods.GET_UPDATES, payload)
File "/usr/local/lib/python3.8/dist-packages/aiogram/bot/base.py", line 231, in request
return await api.make_request(await self.get_session(), self.server, self.__token, method, data, files,
File "/usr/local/lib/python3.8/dist-packages/aiogram/bot/api.py", line 140, in make_request
return check_result(method, response.content_type, response.status, await response.text())
File "/usr/local/lib/python3.8/dist-packages/aiogram/bot/api.py", line 128, in check_result
raise exceptions.TelegramAPIError(description)
aiogram.utils.exceptions.TelegramAPIError: Bad Gateway
I think the problem is on Telegram's side and could be solved by using webhooks, but I don't want to use them.
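Since a Bad Gateway response is typically transient, one common approach that avoids webhooks is to retry the failing call with exponential backoff. The sketch below is generic and is not aiogram's own mechanism; retry_async and all of its parameters are hypothetical names. In real use, retry_on would presumably be narrowed to something like aiogram.utils.exceptions.TelegramAPIError, the exception class shown in the traceback.

```python
import asyncio
import random


async def retry_async(func, *args, retries=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), **kwargs):
    """Await func(*args, **kwargs), retrying with exponential backoff."""
    for attempt in range(retries):
        try:
            return await func(*args, **kwargs)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add a little jitter so that many clients do not retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

A call such as await retry_async(bot.send_message, chat_id, text) would then tolerate a few consecutive failures before giving up; bot, chat_id and text are placeholders here.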

How can i send large message to kafka producer using python?

If I send a very large JSON message to the Kafka server, I get the error shown below. How can I increase message.max.bytes=15728640 and replica.fetch.max.bytes=15728640 in Kafka? I tried increasing the byte limits as follows, but it did not work:
# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=15728640
# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=15728640
Error:
[2022-01-06 12:36:51,281] [9015] [ERROR] [^-App]: Crashed reason=ProducerSendError("Error while sending: MessageSizeTooLargeError('The message is 6677420 bytes when serialized which is larger than the maximum request size you have configured with the max_request_size configuration',)",)
Traceback (most recent call last):
File "/home/twilightuser/faust_library/venv/lib/python3.6/site-packages/faust/transport/drivers/aiokafka.py", line 1059, in send
transactional_id=transactional_id,
File "/home/twilightuser/faust_library/venv/lib/python3.6/site-packages/aiokafka/producer/producer.py", line 310, in send
key_bytes, value_bytes = self._serialize(topic, key, value)
File "/home/twilightuser/faust_library/venv/lib/python3.6/site-packages/aiokafka/producer/producer.py", line 231, in _serialize
" max_request_size configuration" % message_size)
kafka.errors.MessageSizeTooLargeError: [Error 10] MessageSizeTooLargeError: The message is 6677420 bytes when serialized which is larger than the maximum request size you have configured with the max_request_size configuration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/twilightuser/faust_library/venv/lib/python3.6/site-packages/mode/services.py", line 779, in _execute_task
await task
File "/home/twilightuser/faust_library/venv/lib/python3.6/site-packages/faust/app/base.py", line 941, in _wrapped
return await task()
File "/home/twilightuser/faust_library/venv/lib/python3.6/site-packages/faust/app/base.py", line 991, in around_timer
await fun(*args)
File "/home/twilightuser/faust_library/producer.py", line 14, in my_send
await topic.send(value=value)
File "/home/twilightuser/faust_library/venv/lib/python3.6/site-packages/faust/topics.py", line 193, in send
callback=callback,
File "/home/twilightuser/faust_library/venv/lib/python3.6/site-packages/faust/channels.py", line 303, in _send_now
schema, key_serializer, value_serializer, callback))
File "/home/twilightuser/faust_library/venv/lib/python3.6/site-packages/faust/topics.py", line 417, in publish_message
headers=headers,
File "/home/twilightuser/faust_library/venv/lib/python3.6/site-packages/faust/transport/drivers/aiokafka.py", line 1062, in send
raise ProducerSendError(f'Error while sending: {exc!r}') from exc
faust.exceptions.ProducerSendError: Error while sending: MessageSizeTooLargeError('The message is 6677420 bytes when serialized which is larger than the maximum request size you have configured with the max_request_size configuration',)
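The traceback shows that the message is rejected on the producer side, inside aiokafka's _serialize, before it is sent to the broker: the serialized size (6677420 bytes) exceeds the producer's max_request_size. The socket buffer settings are unrelated to this limit. A minimal sketch, assuming Faust's producer_max_request_size app setting is the relevant option; the app name, broker URL, and the helper name fits_request_size are hypothetical, and the size check is only approximate because it ignores the key, headers, and record overhead.

```python
import json

MAX_REQUEST_SIZE = 15728640  # 15 MiB, the limit used in the question


def fits_request_size(value, max_request_size=MAX_REQUEST_SIZE):
    """Return True if the JSON-serialized value is within the producer limit."""
    size = len(json.dumps(value).encode("utf-8"))
    return size <= max_request_size


def make_app():
    """Create a Faust app whose producer accepts larger requests."""
    import faust  # imported lazily so the size check works without Faust

    # Assumption: producer_max_request_size raises the client-side limit.
    # The broker (message.max.bytes) and consumers may need matching limits.
    return faust.App(
        "my-app",
        broker="kafka://localhost:9092",
        producer_max_request_size=MAX_REQUEST_SIZE,
    )
```

Checking the size before sending makes it possible to log or split oversized payloads instead of crashing the app.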

gcsfs async interface does not seem to be working (error using simple example)

I am on Linux/Python3.8.5
python3 -m pip list | grep gcsfs
gcsfs 2021.4.0
I looked at the docs, specifically chapter 5 - Async: https://buildmedia.readthedocs.org/media/pdf/gcsfs/stable/gcsfs.pdf
I also found an example from here: https://github.com/dask/gcsfs/issues/285, which is shown below:
import asyncio
import gcsfs
async def main():
    loop = asyncio.get_event_loop()
    fs = gcsfs.GCSFileSystem(project="my_project", asynchronous=True, loop=loop)
    await fs.set_session()
    async with await fs.open("my_bucket/my_blob") as fp:
        b = await fp.read()
    print(len(b))
asyncio.get_event_loop().run_until_complete(main())
The error is:
Traceback (most recent call last):
File "test_gcsfs_async.py", line 13, in <module>
asyncio.get_event_loop().run_until_complete(main())
File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "test_gcsfs_async.py", line 7, in main
await fs.set_session()
AttributeError: 'GCSFileSystem' object has no attribute 'set_session'
If I simply remove the call to set_session(), then I get this error:
Traceback (most recent call last):
File "test_gcsfs_async.py", line 13, in <module>
asyncio.get_event_loop().run_until_complete(main())
File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "test_gcsfs_async.py", line 9, in main
async with await fs.open("my_bucket/my_blob") as fp:
File "/usr/local/lib/python3.8/dist-packages/fsspec/spec.py", line 942, in open
f = self._open(
File "/usr/local/lib/python3.8/dist-packages/gcsfs/core.py", line 1247, in _open
return GCSFile(
File "/usr/local/lib/python3.8/dist-packages/gcsfs/core.py", line 1378, in __init__
super().__init__(
File "/usr/local/lib/python3.8/dist-packages/fsspec/spec.py", line 1270, in __init__
self.details = fs.info(path)
File "/usr/local/lib/python3.8/dist-packages/fsspec/asyn.py", line 72, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/fsspec/asyn.py", line 40, in sync
raise RuntimeError("Loop is not running")
RuntimeError: Loop is not running

Unfortunately, the file-like interface of GCSFile is not async-compatible. From the issue you linked:
The file-like interface itself is not asynchronous, since there is state (the current buffers and file positions).
This may be implemented in the future, but would require a certain amount of work.
Note that the coroutine method you were after is called _set_session (with the leading underscore). This should be better documented - feel free to raise an issue or submit a PR.
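As a hedged sketch of what an async read could look like: instead of the file-like open(), the underscore-prefixed coroutine methods of the filesystem can be awaited directly. This assumes that _set_session and _cat_file are both available in the installed gcsfs version; read_blob is a hypothetical helper, and the project, bucket, and blob names are the placeholders from the question.

```python
import asyncio


async def read_blob(fs, path):
    """Fetch a whole object as bytes using the async filesystem methods."""
    await fs._set_session()          # the coroutine mentioned in the answer
    return await fs._cat_file(path)  # assumed async counterpart of fs.cat_file


async def main():
    import gcsfs  # imported lazily; requires gcsfs to be installed

    fs = gcsfs.GCSFileSystem(project="my_project", asynchronous=True)
    data = await read_blob(fs, "my_bucket/my_blob")
    print(len(data))


# To run against a real bucket: asyncio.run(main())
```

This reads the entire blob into memory in one call, so it suits small and medium objects better than the streaming access that the file-like interface would provide.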

How to handle "Redis.exceptions.ConnectionError: Connection has data"

I receive the following output:
Traceback (most recent call last):
File "/home/ec2-user/env/lib64/python3.7/site-packages/redis/connection.py", line 1192, in get_connection
raise ConnectionError('Connection has data')
redis.exceptions.ConnectionError: Connection has data
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/env/lib64/python3.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
timer()
File "/home/ec2-user/env/lib64/python3.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
cb(*args, **kw)
File "/home/ec2-user/env/lib64/python3.7/site-packages/eventlet/greenthread.py", line 214, in main
result = function(*args, **kwargs)
File "crawler.py", line 53, in fetch_listing
url = dequeue_url()
File "/home/ec2-user/WebCrawler/helpers.py", line 109, in dequeue_url
return redis.spop("listing_url_queue")
File "/home/ec2-user/env/lib64/python3.7/site-packages/redis/client.py", line 2255, in spop
return self.execute_command('SPOP', name, *args)
File "/home/ec2-user/env/lib64/python3.7/site-packages/redis/client.py", line 875, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/home/ec2-user/env/lib64/python3.7/site-packages/redis/connection.py", line 1197, in get_connection
raise ConnectionError('Connection not ready')
redis.exceptions.ConnectionError: Connection not ready
I couldn't find any issue related to this particular error. I emptied/flushed all Redis databases, so there should be no data there. I assume it has something to do with eventlet and monkey patching. But even when I put the following code right at the beginning of the file, the error still appears.
import eventlet
eventlet.monkey_patch()
What does this error mean?

I finally found the answer to my problem.
When connecting to Redis from Python, I had specified database number 0.
redis = redis.Redis(host="example.com", port=6379, db=0)
After changing the database to number 1, it worked.
redis = redis.Redis(host="example.com", port=6379, db=1)

Another option is to set protected-mode to no in /etc/redis/redis.conf. This is recommended only when running Redis locally.
