dask: How do I avoid timeout for a task?

In my dask-based application (using the distributed scheduler), I'm seeing failures that start with this error text:
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
File "/miniconda/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
They are followed by a second traceback which (I think) indicates which line my task was executing when the timeout occurred. (Exactly how distributed manages to do this is not clear to me -- maybe via a signal?)
Here's the dask portion of the second traceback:
... my code...
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 397, in compute
results = schedule(dsk, keys, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 2308, in get
direct=direct)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1647, in gather
asynchronous=asynchronous)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 665, in sync
return sync(self.loop, func, *args, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1492, in _gather
traceback)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1562, in reify
seq = list(seq)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1722, in map_chunk
yield f(*a)
... my code ....
Does "after timeout" indicate that the task has taken too long, or is there some other timeout triggering the cancellation, such as a nanny or heartbeat timeout? (From what I can tell, there is no explicit timeout on the length of a task in dask, but maybe I'm confused.)
I see that the task was cancelled. But I would like to know why. Is there any easy way to figure out which line of code (in dask or distributed) is cancelling my task, and why?
I expect these tasks to take a long time -- they are uploading large buffers to a cloud store. How can I increase the timeout of a particular task in dask?

Dask does not impose a timeout on tasks by default.
The cancelled future that you're seeing isn't a Dask future; it's a Tornado future (Tornado is the library that Dask uses for network communication). So unfortunately all this is saying is that something failed.
The subsequent traceback hopefully includes information about exactly which code failed. Ideally this points to a line in your functions where the failure occurred. Perhaps that helps?
In general we recommend these steps when debugging code run through Dask: http://docs.dask.org/en/latest/debugging.html
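If the failure turns out to be a communication timeout rather than anything task-related, one thing you could try is raising distributed's comm timeouts through the Dask configuration. This is a minimal sketch, assuming the defaults are what is being hit; the values below are illustrative, and these settings govern network handshakes and idle connections, not how long an individual task may run (Dask has no per-task time limit):
import dask
from dask.distributed import Client

# Raise distributed's communication timeouts before creating the Client.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",   # connection handshake timeout
    "distributed.comm.timeouts.tcp": "120s",      # idle TCP connection timeout
})

client = Client()  # scheduler address omitted; depends on your deployment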

Related

Automatically handling pymongo.errors.AutoReconnect errors?

Recently, I have been very frequently encountering the error pymongo.errors.AutoReconnect. What is strange is that I never really encountered this error before and it just started happening for some reason.
Here is the full stack trace:
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/command_cursor.py", line 259, in next
doc = self._try_next(True)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/command_cursor.py", line 270, in _try_next
self._refresh()
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/command_cursor.py", line 196, in _refresh
self.__send_message(
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/command_cursor.py", line 139, in __send_message
response = client._run_operation_with_response(
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1342, in _run_operation_with_response
return self._retryable_read(
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1464, in _retryable_read
return func(session, server, sock_info, slave_ok)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1334, in _cmd
return server.run_operation_with_response(
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/server.py", line 117, in run_operation_with_response
reply = sock_info.receive_message(request_id)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/pool.py", line 646, in receive_message
self._raise_connection_failure(error)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/pool.py", line 643, in receive_message
return receive_message(self.sock, request_id,
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/network.py", line 196, in receive_message
_receive_data_on_socket(sock, 16))
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/network.py", line 261, in _receive_data_on_socket
raise AutoReconnect("connection closed")
pymongo.errors.AutoReconnect: connection closed
This error is raised by various queries in various situations. I'm really not certain why this issue started when it wasn't a problem before.
I have tried to work around the issue in the one place where the error was most common by using exponential backoff.
retry_count = 0
while retry_count < 10:
    try:
        collection.remove({"index": {"$lt": (theTime - datetime.timedelta(minutes=60)).timestamp()}})
        break
    except:
        wait_t = 0.5 * pow(2, retry_count)
        self.myLogger.error(f"An error occurred while trying to delete entries. Retry : {retry_count}, Wait Time : {wait_t}")
        time.sleep(wait_t)
    finally:
        retry_count = retry_count + 1
So far, this appears to have worked. My question is as follows:
I have a lot of different MongoDB queries, and adding this block of code to each of them would be excessively tedious. Is there a way to apply this handling in all situations where MongoDB encounters this error, so that I don't have to add it everywhere manually?
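One generic approach is to move the backoff into a decorator and apply it to the functions that run queries. This is a minimal sketch, not a drop-in fix: the decorator name, retry count, backoff values, and the example function below are all illustrative.
import time
import functools
from pymongo.errors import AutoReconnect

def retry_on_autoreconnect(max_retries=10):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except AutoReconnect:
                    wait_t = 0.5 * pow(2, attempt)  # exponential backoff
                    time.sleep(wait_t)
            return func(*args, **kwargs)  # final attempt, let the error surface
        return wrapper
    return decorator

@retry_on_autoreconnect()
def delete_old_entries(collection, cutoff_ts):
    # Example query wrapped with the retry behaviour.
    collection.delete_many({"index": {"$lt": cutoff_ts}})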

dictionary get() throwing key error

I am keeping a naive connection pool using a Python dictionary. As reference, I am using asyncio within Sanic, if that matters.
Not often, but at times, I get this error:
Traceback (most recent call last):
File "/Users/Documents/venv/lib/python3.6/site-packages/sanic/app.py", line 556, in handle_request
response = await response
File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
return self.gen.send(None)
File "/Users/Documents/Project/<proj>/<dir>/devices/services.py", line 181, in dev_execute_cmd
return HTTPResponse(output, content_type='application/json')
File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
return self.gen.send(None)
File "/Users/Documents/Project/<proj>/<dir>/devices/services.py", line 132, in dev_execute_cmd
uid, last_cmd, mode)
File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
return self.gen.send(None)
File "/Users/Documents/Project/<proj>/<dir>/devices/managers.py", line 260, in async_dev_execute_cmd
# return False if max connections have been exceeded
File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
return self.gen.send(None)
File "/Users/Documents/Project/<proj>/<dir>/devices/managers.py", line 122, in async_open_connection
ip_addr_conn_count = self.per_dev_conn_count.get(device.ip_addr, 0) + 1
KeyError: '10.32.255.80'
My question is: how is a KeyError possible when using .get()? In what scenarios can this happen? One thing I have noticed is that this error only occurs, albeit rarely, when I'm running concurrent requests.
In my understanding, asyncio uses an event loop so it schedules tasks and pauses tasks as it waits. So in my mind, 2+ concurrent requests should never really hit the same dictionary at the same time.
Thanks in advance!
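As a quick sanity check of the premise (a minimal sketch; the key below is just the one from the traceback): for a plain built-in dict, .get() with a default never raises KeyError for a missing key, so if per_dev_conn_count really is a plain dict, the error must be coming from somewhere else, for example a custom mapping type or traceback source lines that no longer match the code that was actually running.
per_dev_conn_count = {}  # empty plain dict
print(per_dev_conn_count.get('10.32.255.80', 0) + 1)  # prints 1, no KeyError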

Celery upgrade (3.1->4.1) - Connection reset by peer

We have been working with Celery for the last year, with ~15 workers, each one defined with a concurrency between 1 and 4.
Recently we upgraded our Celery from v3.1 to v4.1.
Now we are seeing the following errors in each one of the workers' logs. Any ideas what can cause such an error?
2017-08-21 18:33:19,780 94794 ERROR Control command error: error(104, 'Connection reset by peer') [file: pidbox.py, line: 46]
Traceback (most recent call last):
File "/srv/dy/venv/lib/python2.7/site-packages/celery/worker/pidbox.py", line 42, in on_message
self.node.handle_message(body, message)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 129, in handle_message
return self.dispatch(**body)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 112, in dispatch
ticket=ticket)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 135, in reply
serializer=self.mailbox.serializer)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 265, in _publish_reply
**opts
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/messaging.py", line 181, in publish
exchange_name, declare,
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/messaging.py", line 203, in _publish
mandatory=mandatory, immediate=immediate,
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/channel.py", line 1748, in _basic_publish
(0, exchange, routing_key, mandatory, immediate), msg
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 64, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/method_framing.py", line 178, in write_frame
write(view[:offset])
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/transport.py", line 272, in write
self._write(s)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 104] Connection reset by peer
BTW, our tasks are of the form:
@app.task(name='EXAMPLE_TASK',
          bind=True,
          base=ConnectionHolderTask)
def example_task(self, arg1, arg2, **kwargs):
    # task code
We are also having massive issues with celery... I spend 20% of my time just dancing around weird idle-hang/crash issues with our workers sigh
We had a similar case that was caused by high concurrency combined with a high worker_prefetch_multiplier; as it turns out, fetching thousands of tasks is a good way to frack the connection.
If that's not the case: try to disable the broker pool by setting broker_pool_limit to None.
Just some quick ideas that might (hopefully) help :-)
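For reference, both knobs mentioned above are plain Celery settings; a minimal sketch of where they would go (the app name, broker URL, and values are placeholders, not recommendations):
from celery import Celery

app = Celery('myapp', broker='amqp://guest@localhost//')  # placeholder broker URL

app.conf.update(
    worker_prefetch_multiplier=1,  # fetch fewer tasks per worker at a time
    broker_pool_limit=None,        # disable the broker connection pool entirely
)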

How to handle 60 secs timeout with heavy queries

I have to make some heavy queries in my datastore to obtain some high-level information. When it reaches the 60-second limit I get an error that I suppose is a timeout being enforced:
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 207, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/admin/__init__.py", line 140, in xsrf_required_decorator
method(self)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/admin/__init__.py", line 348, in post
exec(compiled_code, globals())
File "<string>", line 28, in <module>
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2314, in next
return self.__model_class.from_entity(self.__iterator.next())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 1442, in from_entity
return cls(None, _from_entity=entity, **entity_values)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 958, in __init__
if isinstance(_from_entity, datastore.Entity) and _from_entity.is_saved():
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/datastore.py", line 814, in is_saved
self.__key.has_id_or_name())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/datastore_types.py", line 565, in has_id_or_name
elems = self.__reference.path().element_list()
DeadlineExceededError
This is not an application query; I am interacting with my app through the Interactive Console, so this is not a live problem. My problem is that I have to iterate over all my application users, checking large amounts of data that I need to retrieve for each of them. I could do it one by one by hard-coding their user_id, but it would be slow and inefficient.
Can you think of any way I could do this faster? Is there any way to select the users, say, five at a time, like LIMIT=5 to get only the first 5 users, then the next 5, and so on, iterating over all of them but with lighter queries? Can I set a longer timeout?
Any way you can think about I could deal with this problem?
You could use a cursor, in conjunction with a limit, to pick up your search where you left off:
Returns a base64-encoded cursor string denoting the position in the query's result set following the last result retrieved. The cursor string is safe to use in HTTP GET and POST parameters, and can also be stored in the Datastore or Memcache. A future invocation of the same query can provide this string via the start_cursor parameter or the with_cursor() method to resume retrieving results from this position.
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_cursor
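A minimal sketch of that pattern, assuming the old google.appengine.ext.db query API; User is a hypothetical db.Model subclass defined elsewhere, and process() stands in for whatever per-user work you need:
BATCH_SIZE = 5
query = User.all()  # User: hypothetical db.Model subclass
cursor = None

while True:
    if cursor:
        query.with_cursor(cursor)          # resume where the last batch ended
    users = query.fetch(limit=BATCH_SIZE)  # only a small batch per round trip
    if not users:
        break
    for user in users:
        process(user)                      # hypothetical per-user work
    cursor = query.cursor()                # remember where this batch stopped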
I'd write a simple request handler to do the task.
Either write it in a way that it can be run on mapreduce, or launch a backend to run your handler.
First, getting your entities in batches will significantly reduce the communication time between your application and the datastore. For details on this, take a look at 10 things you (probably) didn't know about App Engine.
Then, you can assign this procedure to Task Queues that enable you to execute tasks up to 10 minutes. For more information on Task Queues, take a look at The Task Queue Python API.
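A minimal sketch of handing the work to a task queue via the deferred library, which gets you the longer task-queue deadline instead of the 60-second request limit; process_all_users is a hypothetical function that would contain the batched iteration sketched above:
from google.appengine.ext import deferred

def process_all_users():
    # hypothetical: the batched, cursor-based iteration over users goes here
    pass

deferred.defer(process_all_users)  # runs on a task queue with a longer deadline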
Finally, for tasks that need more time you can also consider the use of Backends. For more information you can take a look at Backends (Python).
Hope this helps.

AppEngine 'explicitly cancelled' error

I'm using Google AppEngine and the deferred library, with the Mapper class, as described here (with some improvements as in here). In some iterations of the mapper I get the following error:
CancelledError: The API call datastore_v3.Put() was explicitly cancelled.
The Mapper usually runs fine. I used to have a higher batch size, so that it would actually hit the DeadlineExceededError, and that was handled correctly.
Just to be sure, I reduced the batch_size to a very low number, so that it never even hits the DeadlineExceededError but I still get the CancelledError.
The stack trace is as follows:
File "utils.py", line 114, in _continue
self._batch_write()
File "utils.py", line 76, in _batch_write
db.put(self.to_put)
File "/google/appengine/ext/db/__init__.py", line 1238, in put
keys = datastore.Put(entities, rpc=rpc)
File "/google/appengine/api/datastore.py", line 255, in Put
'datastore_v3', 'Put', req, datastore_pb.PutResponse(), rpc)
File "/google/appengine/api/datastore.py", line 177, in _MakeSyncCall
rpc.check_success()
File "/google/appengine/api/apiproxy_stub_map.py", line 474, in check_success
self.__rpc.CheckSuccess()
File "/google/appengine/api/apiproxy_rpc.py", line 126, in CheckSuccess
raise self.exception
CancelledError: The API call datastore_v3.Put() was explicitly cancelled.
I can't really find a lot of information about this 'explicitly cancelled' error, so I was wondering what causes it and how to investigate it.
After a DeadlineExceededError, you are allowed a short amount of grace time to handle the exception, e.g. defer the remainder of the computation.
If you run out of grace time, the CancelledError kicks in.
There should be no way to catch/handle the CancelledError.
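A minimal sketch of the pattern described above, in the style of the Mapper code from the question; the method names and the state passed along are hypothetical:
from google.appengine.runtime import DeadlineExceededError
from google.appengine.ext import deferred

class Mapper(object):
    def run(self):
        try:
            self._process_batches()  # hypothetical long-running batch loop
        except DeadlineExceededError:
            # Use the grace period to re-queue the remaining work before the
            # request is killed; overrunning it is when CancelledError appears.
            deferred.defer(self._continue, self._state)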
