I'm using Google AppEngine and the deferred library, with the Mapper class, as described here (with some improvements as in here). In some iterations of the mapper I get the following error:
CancelledError: The API call datastore_v3.Put() was explicitly cancelled.
The Mapper usually runs fine. I used to have a higher batch size, so it would actually hit the DeadlineExceededError, and that was handled correctly.
Just to be sure, I reduced the batch_size to a very low number, so that it never even hits the DeadlineExceededError, but I still get the CancelledError.
The stack trace is as follows:
File "utils.py", line 114, in _continue
self._batch_write()
File "utils.py", line 76, in _batch_write
db.put(self.to_put)
File "/google/appengine/ext/db/__init__.py", line 1238, in put
keys = datastore.Put(entities, rpc=rpc)
File "/google/appengine/api/datastore.py", line 255, in Put
'datastore_v3', 'Put', req, datastore_pb.PutResponse(), rpc)
File "/google/appengine/api/datastore.py", line 177, in _MakeSyncCall
rpc.check_success()
File "/google/appengine/api/apiproxy_stub_map.py", line 474, in check_success
self.__rpc.CheckSuccess()
File "/google/appengine/api/apiproxy_rpc.py", line 126, in CheckSuccess
raise self.exception
CancelledError: The API call datastore_v3.Put() was explicitly cancelled.
I can't really find a lot of information about this 'explicitly cancelled' error, so I was wondering what causes it and how to investigate it.
After a DeadlineExceededError, you are allowed a short amount of grace time to handle the exception, e.g. to defer the remainder of the computation.
If you run out of grace time, the CancelledError kicks in.
There should be no way to catch/handle the CancelledError.
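For illustration, here is a minimal sketch of that pattern (method and argument names are assumptions, not the actual Mapper code): catch the deadline error and use the grace period only to re-queue the remaining work.

from google.appengine.ext import deferred
from google.appengine.runtime import DeadlineExceededError

# Sketch of a Mapper-style continuation method (assumed names):
def _continue(self, start_key, batch_size):
    try:
        self._batch_write()
    except DeadlineExceededError:
        # Grace time is short: re-queue the remainder and return immediately,
        # doing nothing else that could run past the deadline.
        deferred.defer(self._continue, start_key, batch_size)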
On a small Flask webserver running on a RaspberryPi with about 10-20 clients, we periodically get this error:
Error on request:
Traceback (most recent call last):
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/werkzeug/serving.py", line 270, in run_wsgi
execute(self.server.app)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/werkzeug/serving.py", line 258, in execute
application_iter = app(environ, start_response)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/flask/app.py", line 2309, in __call__
return self.wsgi_app(environ, start_response)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/flask_socketio/__init__.py", line 43, in __call__
start_response)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/engineio/middleware.py", line 47, in __call__
return self.engineio_app.handle_request(environ, start_response)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/socketio/server.py", line 360, in handle_request
return self.eio.handle_request(environ, start_response)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/engineio/server.py", line 291, in handle_request
socket = self._get_socket(sid)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/engineio/server.py", line 427, in _get_socket
raise KeyError('Session is disconnected')
KeyError: 'Session is disconnected'
The error is generated automatically from inside python-socketio. What does this error really mean and how can I prevent or suppress it?
As far as I can tell, this usually means the server can't keep up with supplying data to all of the clients.
Some possible mitigation techniques include disconnecting inactive clients, reducing the amount of data sent where possible, sending live data in larger chunks, or upgrading the server. If you need a lot of data throughput, there may also be a better option than Socket.IO.
I have been able to reproduce it by setting a really high ping rate and a low timeout in the SocketIO constructor:
from flask_socketio import SocketIO
socketio = SocketIO(engineio_logger=True, ping_timeout=5, ping_interval=5)
This means the server has to do a lot of messaging to all of the clients, and they don't have long to respond. I then open around 10 clients and start to see the KeyError.
Further debugging of our server found a process that was posting lots of live data; it ran fine with only a few clients but started to issue the occasional KeyError once I got up to about a dozen.
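Conversely, giving clients more breathing room seems to help. A sketch with more forgiving values (the specific numbers here are just an assumption, not taken from the question):

from flask_socketio import SocketIO

# Ping less frequently and give clients longer to answer before the
# session is considered disconnected.
socketio = SocketIO(engineio_logger=True, ping_timeout=60, ping_interval=25)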
In my dask-based application (using the distributed scheduler), I'm seeing failures that start with this error text:
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
File "/miniconda/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
They are followed by a second traceback which (I think) indicates which line my task was running when the timeout occurred. (Exactly how distributed manages to do this is not clear to me -- maybe via a signal?)
Here's the dask portion of the second traceback:
... my code...
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 397, in compute
results = schedule(dsk, keys, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 2308, in get
direct=direct)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1647, in gather
asynchronous=asynchronous)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 665, in sync
return sync(self.loop, func, *args, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1492, in _gather
traceback)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1562, in reify
seq = list(seq)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1722, in map_chunk
yield f(*a)
... my code ....
Does "after timeout" indicate that the task has taken too long, or is there some other timeout that is triggering the cancellation, such as a nanny or heartbeat timeout? (From what I can tell, there is no explicit timeout on the length of a task in dask, but maybe I'm confused.)
I see that the task was cancelled. But I would like to know why. Is there any easy way to figure out which line of code (in dask or distributed) is cancelling my task, and why?
I expect these tasks to take a long time -- they are uploading large buffers to a cloud store. How can I increase the timeout of a particular task in dask?
Dask does not impose a timeout on tasks by default.
The cancelled future that you're seeing isn't a Dask future; it's a Tornado future (Tornado is the library that Dask uses for network communication). So unfortunately all this is saying is that something failed.
The subsequent traceback hopefully includes information about exactly what code failed. Ideally it points to a line in your functions where the failure occurred. Perhaps that helps?
In general we recommend these steps when debugging code run through Dask: http://docs.dask.org/en/latest/debugging.html
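One technique from that page is to rerun a smaller version of the computation on the single-threaded scheduler, so the real exception surfaces with an ordinary traceback and you can use pdb. A rough sketch (the function and inputs are placeholders, not your code):

import dask
import dask.bag as db

def my_upload_function(x):
    # Stand-in for the real task that uploads a large buffer to cloud storage.
    return x

# Placeholder inputs; the point is only the scheduler override for debugging.
bag = db.from_sequence(range(10)).map(my_upload_function)
with dask.config.set(scheduler='single-threaded'):
    result = bag.compute()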
I have a text file (say, "X") stored on GCS, created and updated by the GCS Client Library. I use GAE Python. On every addition of some data by a user of my website, I add a Task (taskqueue.Task) to the "default" queue to do some actions, including modification of file ("X").
Sometimes, I get the following error in the logs:
E 2014-07-20 03:19:06.238 500 3KB 430ms /t
0.1.0.2 - - [19/Jul/2014:14:49:06 -0700] "POST /t HTTP/1.1" 500 2569 "http://www.myappdomain.com/p" "AppEngine-Google; (+http://code.google.com/appengine)" "www.myappdomain.com" ms=430 cpu_ms=498 cpm_usd=0.000287 queue_name=default task_name=14629523467445182169 instance=00c61b117c48b4db44a58e0d454310843e7848 app_engine_release=1.9.7 trace_id=3db3eb580b76133e90947539c0446910
I 03:19:05.813 [class TaskQueueWorker] work=[sitemap_index_entry]
I 03:19:05.813 country_id=[US] country_name=[USA] state_id=[CA] state_name=[California] city_id=[SVL] city_name=[Sunnyvale]
I 03:19:05.836 locality_id_old=[-1] locality_id_new=[28]
I 03:19:05.879 locality_name_old=[] locality_name_new=[XYZ]
I 03:19:05.879 command=[ADD]
E 03:19:06.207 File on GCS has changed while reading.
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
rv = self.handle_exception(request, response, e)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
return self.handle_exception(e, self.app.debug)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~myappdomain/1.377368272328585247/main_v3.py", line 15259, in post
gcs_file = gcs.open (index_filename, mode='r')
File "/base/data/home/apps/s~myappdomain/1.377368272328585247/cloudstorage/cloudstorage_api.py", line 94, in open
buffer_size=read_buffer_size)
File "/base/data/home/apps/s~myappdomain/1.377368272328585247/cloudstorage/storage_api.py", line 220, in __init__
check_response_closure()
File "/base/data/home/apps/s~myappdomain/1.377368272328585247/cloudstorage/storage_api.py", line 448, in _checker
self._check_etag(resp_headers.get('etag'))
File "/base/data/home/apps/s~myappdomain/1.377368272328585247/cloudstorage/storage_api.py", line 476, in _check_etag
raise ValueError('File on GCS has changed while reading.')
ValueError: File on GCS has changed while reading.
I 03:19:06.235 Saved; key: __appstats__:045800, part: 144 bytes, full: 74513 bytes, overhead: 0.002 + 0.004; link: http://www.myappdomain.com/_ah/stats/details?time=1405806545812
I suspect that multiple triggered tasks try to open and update the file ("X") at the same time, and that causes the above exception. Please suggest a way to lock access to that file so that only one task is able to modify it at a time (similar to a transaction).
Appreciate your help and guidance.
UPDATE
Another way to prevent the above problem could be to modify one of the following queue.yaml parameters for the queue:
bucket_size
OR
max_concurrent_requests
But I'm not sure which one to modify.
A task queue with max_concurrent_requests = 1 should ensure that only one edit is made to the file at a time.
https://developers.google.com/appengine/docs/python/config/queue#Python_Defining_push_queues_and_processing_rates
If you want to prevent too many tasks from running at once or to prevent datastore contention, you use max_concurrent_requests.
max_concurrent_requests (push queues only)
Sets the maximum number of tasks that can be executed at any given time in the specified queue. The value is an integer. By default, this directive is unset and there is no limit on the maximum number of concurrent tasks. One use of this directive is to prevent too many tasks from running at once or to prevent datastore contention.
Restricting the maximum number of concurrent tasks gives you more control over your queue's rate of execution. For example, you can constrain the number of instances that are running the queue's tasks. Limiting the number of concurrent requests in a given queue allows you to make resources available for other queues or online processing.
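For example, assuming the tasks stay on the default queue (as in the question), the queue.yaml entry might look like this; the rate value is just an assumption:

queue:
- name: default
  rate: 5/s
  max_concurrent_requests: 1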
Of course, you should build in logic that allows failed tasks to retry, etc., or you may end up with worse problems than you have now.
It is also possible to rely on GCS itself by using preconditions.
These allow an update to succeed only if the file has not changed in the meantime. See the docs:
Preconditions are often used in mutating requests — uploads, deletes, copies, or metadata updates — to prevent race conditions. Race conditions can arise when the same request is sent repeatedly or when independent processes interfere with each other. For example, multiple request retries after a network interruption, or users performing a read-modify-write operation on the same object can create race conditions.
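As a rough illustration only, using the modern google-cloud-storage client rather than the GAE cloudstorage library from the question (bucket and object names are made up):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-app-bucket')   # hypothetical bucket name
blob = bucket.get_blob('X')               # loads metadata, including the current generation
contents = blob.download_as_bytes()
new_contents = contents + b'<new sitemap entry>'
# The write succeeds only if nobody else has modified the object since we read it;
# otherwise GCS answers 412 Precondition Failed and we can retry the read-modify-write.
blob.upload_from_string(new_contents, if_generation_match=blob.generation)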
I have to make some heavy queries in my datastore to obtain some high-level information. When it reaches 60 seconds I get an error that I suppose is a timeout cutoff:
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 207, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/admin/__init__.py", line 140, in xsrf_required_decorator
method(self)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/admin/__init__.py", line 348, in post
exec(compiled_code, globals())
File "<string>", line 28, in <module>
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2314, in next
return self.__model_class.from_entity(self.__iterator.next())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 1442, in from_entity
return cls(None, _from_entity=entity, **entity_values)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 958, in __init__
if isinstance(_from_entity, datastore.Entity) and _from_entity.is_saved():
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/datastore.py", line 814, in is_saved
self.__key.has_id_or_name())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/datastore_types.py", line 565, in has_id_or_name
elems = self.__reference.path().element_list()
DeadlineExceededError
This is not an application query; I am interacting with my app through the Interactive Console, so this is not a live problem. My problem is that I have to iterate over all my application users, checking big amounts of data that I need to retrieve for each of them. I could do it one by one by hard-coding their user_id, but it would be slow and inefficient.
Can you guys think of any way I could do this faster? Is there any way to select the users five at a time, like LIMIT=5 to get only the first 5 users, then the next 5, and so on, iterating over all of them but with lighter queries? Can I set a longer timeout?
Is there any other way you can think of to deal with this problem?
You could use a cursor, in conjunction with a limit, to pick up your search where you left off:
Returns a base64-encoded cursor string denoting the position in the query's result set following the last result retrieved. The cursor string is safe to use in HTTP GET and POST parameters, and can also be stored in the Datastore or Memcache. A future invocation of the same query can provide this string via the start_cursor parameter or the with_cursor() method to resume retrieving results from this position.
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_cursor
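A minimal sketch of that pattern with the old db API (UserProfile and saved_cursor are stand-ins, not names from the question):

from google.appengine.ext import db

class UserProfile(db.Model):
    pass  # stand-in for your real user model

saved_cursor = None  # cursor persisted from the previous batch, if any

query = UserProfile.all()
if saved_cursor:
    query.with_cursor(start_cursor=saved_cursor)  # resume where the last batch stopped
users = query.fetch(limit=5)                      # five users at a time
saved_cursor = query.cursor()                     # store this to fetch the next five later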
I'd write a simple request handler to do the task.
Either write it so that it can be run on MapReduce, or launch a backend to run your handler.
First, getting your entities in batches will significantly reduce the communication time between your application and the datastore. For details on this, take a look at 10 things you (probably) didn't know about App Engine.
Then, you can assign this procedure to Task Queues, which enable you to execute tasks for up to 10 minutes. For more information on Task Queues, take a look at The Task Queue Python API.
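A rough sketch combining the two ideas (batching plus task queues); process_users, the model, and the per-user work are assumptions, not code from the question:

from google.appengine.ext import deferred

def process_users(cursor=None):
    query = UserProfile.all()                  # stand-in for your user model
    if cursor:
        query.with_cursor(start_cursor=cursor)
    users = query.fetch(limit=50)
    if not users:
        return                                 # finished iterating over all users
    for user in users:
        aggregate_user_stats(user)             # hypothetical per-user work
    # Re-queue the next batch as its own task; each task gets up to 10 minutes.
    deferred.defer(process_users, query.cursor())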
Finally, for tasks that need even more time you can also consider the use of Backends. For more information, take a look at Backends (Python).
Hope this helps.
Quite often GAE is unable to upload the file, and I get the following error:
ApplicationError: 2
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 636, in __call__
handler.post(*groups)
File "/base/data/home/apps/picasa2vkontakte/1.348093606241250361/picasa2vkontakte.py", line 109, in post
headers=headers
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/urlfetch.py", line 260, in fetch
return rpc.get_result()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 592, in get_result
return self.__get_result_hook(self)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/urlfetch.py", line 355, in _get_fetch_result
raise DownloadError(str(err))
DownloadError: ApplicationError: 2
How should I perform retries in case of such error?
try:
result = urlfetch.fetch(url=self.request.get('upload_url'),
payload=''.join(data),
method=urlfetch.POST,
headers=headers
)
except DownloadError:
# how to retry 2 more times?
# and how to verify result here?
If you can, move this work into the task queue. When tasks fail, they retry automatically. If they continue to fail, the system gradually backs off the retry frequency to as slow as once per hour. This is an easy way to handle API requests to rate-limited services without implementing one-off retry logic.
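For instance, a sketch of enqueuing the upload instead of doing it inline (the worker URL and params are made up, and this assumes it runs inside your request handler):

from google.appengine.api import taskqueue

# Enqueue the upload; the task queue will retry it automatically on failure.
taskqueue.add(url='/upload_worker',
              params={'upload_url': self.request.get('upload_url')})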
If you really need to handle requests synchronously, something like this should work:
import logging
from google.appengine.api import urlfetch
from google.appengine.api.urlfetch import DownloadError

for _ in range(3):
    try:
        result = urlfetch.fetch(...)
        # run success conditions here, then stop retrying
        break
    except DownloadError:
        logging.debug("urlfetch failed, retrying")
You can also pass deadline=10 to urlfetch.fetch to double the default 5-second deadline.
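For example, reusing the call from the question (a sketch, not tested):

result = urlfetch.fetch(url=self.request.get('upload_url'),
                        payload=''.join(data),
                        method=urlfetch.POST,
                        headers=headers,
                        deadline=10)   # default deadline is 5 seconds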