I am keeping a naive connection pool in a Python dictionary. For reference, I am using asyncio within Sanic, if that matters.
Not often, but at times, I get this error:
Traceback (most recent call last):
File "/Users/Documents/venv/lib/python3.6/site-packages/sanic/app.py", line 556, in handle_request
response = await response
File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
return self.gen.send(None)
File "/Users/Documents/Project/<proj>/<dir>/devices/services.py", line 181, in dev_execute_cmd
return HTTPResponse(output, content_type='application/json')
File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
return self.gen.send(None)
File "/Users/Documents/Project/<proj>/<dir>/devices/services.py", line 132, in dev_execute_cmd
uid, last_cmd, mode)
File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
return self.gen.send(None)
File "/Users/Documents/Project/<proj>/<dir>/devices/managers.py", line 260, in async_dev_execute_cmd
# return False if max connections have been exceeded
File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/coroutines.py", line 110, in __next__
return self.gen.send(None)
File "/Users/Documents/Project/<proj>/<dir>/devices/managers.py", line 122, in async_open_connection
ip_addr_conn_count = self.per_dev_conn_count.get(device.ip_addr, 0) + 1
KeyError: '10.32.255.80'
My question is: how is a KeyError possible when using .get()? In what scenarios can this happen? One thing I have noticed is that this error only occurs, albeit rarely, when I'm running concurrent requests.
As I understand it, asyncio uses an event loop that schedules tasks and pauses them while they wait, so two or more concurrent requests should never really hit the same dictionary at exactly the same moment.
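To make my mental model concrete, here is a minimal, self-contained sketch (hypothetical names, not my real code): a single operation on a plain dict, such as .get(), is atomic and can never raise KeyError by itself; only a read-modify-write that spans an await point can be interleaved by another task:

import asyncio

per_dev_conn_count = {}  # shared state, as in the connection pool

async def open_connection(ip_addr):
    # Atomic read: on a plain dict, .get() cannot raise KeyError.
    count = per_dev_conn_count.get(ip_addr, 0) + 1
    await asyncio.sleep(0)  # suspension point: another request may run here
    per_dev_conn_count[ip_addr] = count  # both tasks can write 1, losing an update

async def main():
    await asyncio.gather(open_connection('10.0.0.1'),
                         open_connection('10.0.0.1'))
    print(per_dev_conn_count)  # {'10.0.0.1': 1} -- a lost update, but no KeyError

loop = asyncio.get_event_loop()
loop.run_until_complete(main())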
Thanks in advance!
Recently, I have been encountering the error pymongo.errors.AutoReconnect very frequently. What is strange is that I never saw this error before; it just started happening for some reason.
Here is the full stack trace:
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/command_cursor.py", line 259, in next
doc = self._try_next(True)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/command_cursor.py", line 270, in _try_next
self._refresh()
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/command_cursor.py", line 196, in _refresh
self.__send_message(
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/command_cursor.py", line 139, in __send_message
response = client._run_operation_with_response(
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1342, in _run_operation_with_response
return self._retryable_read(
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1464, in _retryable_read
return func(session, server, sock_info, slave_ok)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1334, in _cmd
return server.run_operation_with_response(
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/server.py", line 117, in run_operation_with_response
reply = sock_info.receive_message(request_id)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/pool.py", line 646, in receive_message
self._raise_connection_failure(error)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/pool.py", line 643, in receive_message
return receive_message(self.sock, request_id,
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/network.py", line 196, in receive_message
_receive_data_on_socket(sock, 16))
File "/home/ubuntu/.local/lib/python3.9/site-packages/pymongo/network.py", line 261, in _receive_data_on_socket
raise AutoReconnect("connection closed")
pymongo.errors.AutoReconnect: connection closed
This error is raised by various queries in various situations. I'm really not certain why this issue started when it wasn't a problem before.
I have tried to work around the issue at the one place where the error was most common by using exponential backoff:
from pymongo.errors import AutoReconnect  # plus the existing time/datetime imports

retry_count = 0
while retry_count < 10:
    try:
        collection.remove({"index": {"$lt": (theTime - datetime.timedelta(minutes=60)).timestamp()}})
        break
    except AutoReconnect:  # a bare except here would also swallow unrelated errors
        wait_t = 0.5 * pow(2, retry_count)
        self.myLogger.error(f"An error occurred while trying to delete entries. Retry : {retry_count}, Wait Time : {wait_t}")
        time.sleep(wait_t)
    finally:
        retry_count = retry_count + 1
So far, this appears to have worked. My question is as follows:
I have a lot of different MongoDB queries, and it would be excessively tedious to track down each one and wrap it in the exponential-backoff block above. Is there a way to apply this handling everywhere MongoDB raises this error, so that I don't have to add the code manually to every query?
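One way to avoid editing every call site (a minimal sketch, untested against your codebase; with_autoreconnect_retry and its parameters are hypothetical names) is to put the backoff logic in a decorator and apply it to the small functions that perform each query:

import functools
import logging
import time

from pymongo.errors import AutoReconnect

logger = logging.getLogger(__name__)

def with_autoreconnect_retry(max_retries=10, base_delay=0.5):
    """Retry the wrapped callable with exponential backoff on AutoReconnect."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for retry_count in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except AutoReconnect:
                    wait_t = base_delay * (2 ** retry_count)
                    logger.error("AutoReconnect in %s. Retry: %d, Wait: %.1fs",
                                 func.__name__, retry_count, wait_t)
                    time.sleep(wait_t)
            return func(*args, **kwargs)  # final attempt: let the error propagate
        return wrapper
    return decorator

# Usage, mirroring the deletion query above:
@with_autoreconnect_retry()
def delete_old_entries(collection, cutoff_ts):
    collection.remove({"index": {"$lt": cutoff_ts}})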
In my dask-based application (using the distributed scheduler), I'm seeing failures that start with this error text:
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
File "/miniconda/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
They are followed by a second traceback which (I think) indicates which line my task was running when the timeout occurred. (Exactly how distributed manages to do this is not clear to me -- maybe via a signal?)
Here's the dask portion of the second traceback:
... my code...
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 397, in compute
results = schedule(dsk, keys, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 2308, in get
direct=direct)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1647, in gather
asynchronous=asynchronous)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 665, in sync
return sync(self.loop, func, *args, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1492, in _gather
traceback)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1562, in reify
seq = list(seq)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1722, in map_chunk
yield f(*a)
... my code ....
Does "after timeout" indicate that the task has taken too long, or is there some other timeout triggering the cancellation, such as a nanny or heartbeat timeout? (From what I can tell, there is no explicit timeout on the length of a task in dask, but maybe I'm confused.)
I see that the task was cancelled. But I would like to know why. Is there any easy way to figure out which line of code (in dask or distributed) is cancelling my task, and why?
I expect these tasks to take a long time -- they are uploading large buffers to a cloud store. How can I increase the timeout of a particular task in dask?
Dask does not impose a timeout on tasks by default.
The cancelled future that you're seeing isn't a Dask future; it's a Tornado future (Tornado is the library that Dask uses for network communication). So unfortunately all this is saying is that something failed.
The subsequent traceback hopefully includes information about exactly which code failed. Ideally it points to the line in your functions where the failure occurred. Perhaps that helps?
In general we recommend these steps when debugging code run through Dask: http://docs.dask.org/en/latest/debugging.html
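If the cancellations turn out to be communication timeouts rather than anything task-related, one knob worth experimenting with (a sketch, assuming a dask version that ships the dask.config module and these configuration keys) is the comm timeout settings:

import dask

# Raise the network-level timeouts distributed uses for connections between
# client, scheduler, and workers. These do NOT limit how long a task may
# run; dask imposes no per-task timeout by default.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",
    "distributed.comm.timeouts.tcp": "120s",
})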
I have a Scrapy multi-level spider which works locally but raises GeneratorExit on every request when run in the Cloud.
Here are the parse methods:
def parse(self, response):
    results = list(response.css(".list-group li a::attr(href)"))
    for c in results:
        meta = {}
        for key in response.meta.keys():
            meta[key] = response.meta[key]
        yield response.follow(c,
                              callback=self.parse_category,
                              meta=meta,
                              errback=self.errback_httpbin)

def parse_category(self, response):
    category_results = list(response.css(
        ".item a.link-unstyled::attr(href)"))
    category = response.css(".active [itemprop='title']")
    for r in category_results:
        meta = {}
        for key in response.meta.keys():
            meta[key] = response.meta[key]
        meta["category"] = category
        yield response.follow(r, callback=self.parse_item,
                              meta=meta,
                              errback=self.errback_httpbin)

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))
Here's the traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
GeneratorExit
[stderr] Exception ignored in: <generator object iter_errback at 0x7fdea937a9e8>
File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 1243, in run
self.mainLoop()
File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 1252, in mainLoop
self.runUntilCurrent()
File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 878, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python3.6/site-packages/twisted/internet/task.py", line 671, in _tick
taskObj._oneWorkUnit()
--- <exception caught here> ---
File "/usr/local/lib/python3.6/site-packages/twisted/internet/task.py", line 517, in _oneWorkUnit
result = next(self._iterator)
File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 63, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 183, in _process_spidermw_output
self.crawler.engine.crawl(request=output, spider=spider)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 210, in crawl
self.schedule(request, spider)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 216, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "/usr/local/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 57, in enqueue_request
dqok = self._dqpush(request)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 86, in _dqpush
self.dqs.push(reqd, -request.priority)
File "/usr/local/lib/python3.6/site-packages/queuelib/pqueue.py", line 35, in push
q.push(obj) # this may fail (eg. serialization error)
File "/usr/local/lib/python3.6/site-packages/scrapy/squeues.py", line 15, in push
s = serialize(obj)
File "/usr/local/lib/python3.6/site-packages/scrapy/squeues.py", line 27, in _pickle_serialize
return pickle.dumps(obj, protocol=2)
builtins.TypeError: can't pickle HtmlElement objects
I set an errback, but it doesn't provide any error details. I also pass meta in every request, but that doesn't make any difference. Am I missing something?
Update:
It seems that the error is specific to multi-level spiders. For now, I have rewritten this one with just one parse method.
One of the differences between running a job locally and on Scrapy Cloud is that the JOBDIR setting is enabled, which makes Scrapy serialize requests into a disk queue instead of an in-memory one.
When serializing to disk, pickling fails because your request.meta dict contains a SelectorList object (assigned in the line category = response.css(".active [itemprop='title']")), and those selectors hold instances of lxml.html.HtmlElement, which cannot be pickled (a limitation outside Scrapy's scope); hence the TypeError: can't pickle HtmlElement objects.
There is a merged pull request that addresses this issue. It does not make such requests picklable; instead, it tells the Scheduler not to attempt disk serialization for these kinds of requests and to send them to the memory queue instead.
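On the spider side, the simplest workaround (a sketch, assuming the category text is all you need downstream) is to store a plain string in meta instead of the SelectorList, so the request stays picklable:

def parse_category(self, response):
    category_results = response.css(".item a.link-unstyled::attr(href)")
    # Extract a plain string now; strings pickle fine, HtmlElement does not.
    category = response.css(".active [itemprop='title']::text").extract_first()
    for r in category_results:
        meta = dict(response.meta)
        meta["category"] = category
        yield response.follow(r, callback=self.parse_item,
                              meta=meta,
                              errback=self.errback_httpbin)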
I'm trying to create a web app that communicates with Telegram, using the Sanic web framework together with Telepot; both are asyncio-based. Now I'm getting a very weird error.
This is my code:
import datetime
import telepot.aio
from sanic import Sanic

app = Sanic(__name__, load_env=False)
app.config.LOGO = ''

@app.listener('before_server_start')
async def server_init(app, loop):
    app.bot = telepot.aio.Bot('anything', loop=loop)
    # here we fall
    await app.bot.sendMessage(
        "#test",
        "Wao! {}".format(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),)
    )

if __name__ == "__main__":
    app.run(
        debug=True
    )
The error that I'm getting is:
[2018-01-18 22:41:43 +0200] [10996] [ERROR] Experienced exception while trying to serve
Traceback (most recent call last):
File "/home/mk/Dev/project/venv/lib/python3.6/site-packages/sanic/app.py", line 646, in run
serve(**server_settings)
File "/home/mk/Dev/project/venv/lib/python3.6/site-packages/sanic/server.py", line 588, in serve
trigger_events(before_start, loop)
File "/home/mk/Dev/project/venv/lib/python3.6/site-packages/sanic/server.py", line 496, in trigger_events
loop.run_until_complete(result)
File "uvloop/loop.pyx", line 1364, in uvloop.loop.Loop.run_until_complete
File "/home/mk/Dev/project/sanic-telepot.py", line 14, in server_init
"Wao! {}".format(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),)
File "/usr/lib/python3.6/asyncio/coroutines.py", line 109, in __next__
return self.gen.send(None)
File "/home/mk/Dev/project/venv/lib/python3.6/site-packages/telepot/aio/__init__.py", line 100, in sendMessage
return await self._api_request('sendMessage', _rectify(p))
File "/usr/lib/python3.6/asyncio/coroutines.py", line 109, in __next__
return self.gen.send(None)
File "/home/mk/Dev/project/venv/lib/python3.6/site-packages/telepot/aio/__init__.py", line 78, in _api_request
return await api.request((self._token, method, params, files), **kwargs)
File "/usr/lib/python3.6/asyncio/coroutines.py", line 109, in __next__
return self.gen.send(None)
File "/home/mk/Dev/project/venv/lib/python3.6/site-packages/telepot/aio/api.py", line 139, in request
async with fn(*args, **kwargs) as r:
File "/home/mk/Dev/project/venv/lib/python3.6/site-packages/aiohttp/client.py", line 690, in __aenter__
self._resp = yield from self._coro
File "/home/mk/Dev/project/venv/lib/python3.6/site-packages/aiohttp/client.py", line 221, in _request
with timer:
File "/home/mk/Dev/project/venv/lib/python3.6/site-packages/aiohttp/helpers.py", line 712, in __enter__
raise RuntimeError('Timeout context manager should be used '
RuntimeError: Timeout context manager should be used inside a task
Internally, Telepot uses aiohttp for its HTTP calls, and very similar code works when I build the same functionality with just aiohttp.web.
I'm not sure which project this problem belongs to. Also, all other dependencies (Redis, database connections, and so on) that I wire up with the same approach work perfectly.
Any suggestions on how to fix it?
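One workaround worth trying (a sketch built on an assumption, not a confirmed diagnosis: aiohttp's timeout context manager must execute inside a running asyncio Task, which may not be the case while Sanic is still starting up) is to schedule the call with loop.create_task instead of awaiting it directly in the listener, so it runs once the serving loop is live:

import datetime
import telepot.aio
from sanic import Sanic

app = Sanic(__name__, load_env=False)

@app.listener('before_server_start')
async def server_init(app, loop):
    app.bot = telepot.aio.Bot('anything', loop=loop)
    # Fire-and-forget: the scheduled coroutine only starts once the server's
    # event loop is running, inside a proper asyncio Task.
    loop.create_task(app.bot.sendMessage(
        "#test",
        "Wao! {}".format(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    ))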
We have been working with Celery for the last year, with ~15 workers, each one configured with a concurrency between 1 and 4.
Recently we upgraded Celery from v3.1 to v4.1.
Now we are seeing the following error in each of the workers' logs. Any ideas what can cause such an error?
2017-08-21 18:33:19,780 94794 ERROR Control command error: error(104, 'Connection reset by peer') [file: pidbox.py, line: 46]
Traceback (most recent call last):
File "/srv/dy/venv/lib/python2.7/site-packages/celery/worker/pidbox.py", line 42, in on_message
self.node.handle_message(body, message)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 129, in handle_message
return self.dispatch(**body)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 112, in dispatch
ticket=ticket)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 135, in reply
serializer=self.mailbox.serializer)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 265, in _publish_reply
**opts
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/messaging.py", line 181, in publish
exchange_name, declare,
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/messaging.py", line 203, in _publish
mandatory=mandatory, immediate=immediate,
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/channel.py", line 1748, in _basic_publish
(0, exchange, routing_key, mandatory, immediate), msg
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 64, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/method_framing.py", line 178, in write_frame
write(view[:offset])
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/transport.py", line 272, in write
self._write(s)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 104] Connection reset by peer
BTW, our tasks have the form:

@app.task(name='EXAMPLE_TASK',
          bind=True,
          base=ConnectionHolderTask)
def example_task(self, arg1, arg2, **kwargs):
    # task code
We are also having massive issues with Celery... I spend 20% of my time just dancing around weird idle-hang/crash issues with our workers, sigh.
We had a similar case that was caused by high concurrency combined with a high worker_prefetch_multiplier; as it turns out, fetching thousands of tasks is a good way to frack the connection.
If that's not the case: try disabling the broker pool by setting broker_pool_limit to None (see the sketch below).
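For reference, a minimal sketch of both settings using Celery 4's lowercase config names (the values are illustrative, not recommendations):

from celery import Celery

app = Celery('proj', broker='amqp://guest@localhost//')

# Prefetch fewer tasks per worker process; the default of 4, multiplied by
# high concurrency across many workers, can mean thousands of messages
# in flight on one connection.
app.conf.worker_prefetch_multiplier = 1

# Disable the broker connection pool entirely; connections are then
# established and closed on demand.
app.conf.broker_pool_limit = None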
Just some quick ideas that might (hopefully) help :-)