How to handle the 60-second timeout with heavy queries - Python

I have to make some heavy queries against my datastore to obtain some high-level information. When the request reaches 60 seconds I get an error that I assume is a timeout:
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 207, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/admin/__init__.py", line 140, in xsrf_required_decorator
method(self)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/admin/__init__.py", line 348, in post
exec(compiled_code, globals())
File "<string>", line 28, in <module>
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2314, in next
return self.__model_class.from_entity(self.__iterator.next())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 1442, in from_entity
return cls(None, _from_entity=entity, **entity_values)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 958, in __init__
if isinstance(_from_entity, datastore.Entity) and _from_entity.is_saved():
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/datastore.py", line 814, in is_saved
self.__key.has_id_or_name())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/datastore_types.py", line 565, in has_id_or_name
elems = self.__reference.path().element_list()
DeadlineExceededError
This is not an application query; I am interacting with my app through the Interactive Console, so it is not a live problem. My problem is that I have to iterate over all my application users, checking large amounts of data that I need to retrieve for each of them. I could do it one by one by hard-coding each user_id, but that would be slow and inefficient.
Can you think of any way I could do this faster? Is there a way to select the users five at a time, like LIMIT=5 to get only the first 5 users, then the next 5 users, and so on, iterating over all of them but with lighter queries? Can I set a longer timeout?
Is there any other way I could deal with this problem?

You could use a cursor, in conjunction with a limit, to pick up your query where you left off:
Returns a base64-encoded cursor string denoting the position in the query's result set following the last result retrieved. The cursor string is safe to use in HTTP GET and POST parameters, and can also be stored in the Datastore or Memcache. A future invocation of the same query can provide this string via the start_cursor parameter or the with_cursor() method to resume retrieving results from this position.
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_cursor
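A minimal sketch of that pattern with the old db API; User and process() are stand-ins for your own model and per-user work:
BATCH_SIZE = 5

query = User.all()                  # User is your existing db.Model
users = query.fetch(BATCH_SIZE)
cursor = query.cursor()             # base64 string marking where this batch ended

while users:
    for user in users:
        process(user)               # the heavy per-user work goes here
    query = User.all()
    query.with_cursor(cursor)       # resume right after the last result
    users = query.fetch(BATCH_SIZE)
    cursor = query.cursor()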

I'd write a simple request handler to do the task.
Either write it so that it can be run via MapReduce, or launch a backend to run your handler.

First, fetching your entities in batches will significantly reduce the communication time between your application and the datastore. For details on this, take a look at 10 things you (probably) didn't know about App Engine.
Then, you can hand this procedure off to Task Queues, which let a task run for up to 10 minutes (a minimal sketch follows this answer). For more information on Task Queues, take a look at The Task Queue Python API.
Finally, for work that needs even more time you can also consider using Backends. For more information you can take a look at Backends (Python).
Hope this helps.
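For the task queue option, a minimal sketch of enqueuing the work as a push task; '/process-users' is a hypothetical handler URL in your app that does the actual iteration:
from google.appengine.api import taskqueue

# Push tasks get a 10-minute deadline instead of the 60-second request limit.
taskqueue.add(url='/process-users', params={'start_cursor': ''})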

Related

dask: How do I avoid timeout for a task?

In my dask-based application (using the distributed scheduler), I'm seeing failures that start with this error text:
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
File "/miniconda/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
They are followed by a second traceback which (I think) indicates which line my task was running when the timeout occurred. (Exactly how distributed manages to do this is not clear to me -- maybe via a signal?)
Here's the dask portion of the second traceback:
... my code...
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 397, in compute
results = schedule(dsk, keys, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 2308, in get
direct=direct)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1647, in gather
asynchronous=asynchronous)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 665, in sync
return sync(self.loop, func, *args, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1492, in _gather
traceback)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1562, in reify
seq = list(seq)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1722, in map_chunk
yield f(*a)
... my code ....
Does "after timeout" indicate that the task has taken too long, or is there some other timeout triggering the cancellation, such as a nanny or heartbeat timeout? (From what I can tell, there is no explicit timeout on the length of a task in dask, but maybe I'm confused.)
I see that the task was cancelled. But I would like to know why. Is there any easy way to figure out which line of code (in dask or distributed) is cancelling my task, and why?
I expect these tasks to take a long time -- they are uploading large buffers to a cloud store. How can I increase the timeout of a particular task in dask?
Dask does not impose a timeout on tasks by default.
The cancelled future that you're seeing isn't a Dask future, it's a Tornado future (Tornado is the library that Dask uses for network communication). So unfortunately all this is saying is that something failed.
The subsequent traceback hopefully includes information about exactly which code failed. Ideally it points to a line in your own functions where the failure occurred. Perhaps that helps?
In general we recommend these steps when debugging code run through Dask: http://docs.dask.org/en/latest/debugging.html
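One concrete technique from that debugging page: re-run the failing computation on the single-machine synchronous scheduler, so the real exception surfaces as an ordinary traceback you can inspect with pdb. A minimal sketch, assuming your work lives in a dask bag named bag (a stand-in for your own collection):
import dask

# Run in the main thread with no distributed machinery involved;
# slower, but exceptions come out as plain, debuggable tracebacks.
with dask.config.set(scheduler='synchronous'):
    result = bag.compute()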

Fetching a lot of URLs in Python with Google App Engine

In my subclass of RequestHandler, I am trying to fetch a range of URLs:
class GetStats(webapp2.RequestHandler):
    def post(self):
        lastpage = 50
        for page in range(1, lastpage):
            tmpurl = url + str(page)
            response = urllib2.urlopen(tmpurl, timeout=5)
            html = response.read()
            # some parsing html
            heap.append(result_of_parsing)
        self.response.write(heap)
It works with ~30 URLs (the page takes a long time to load, but it works).
With more than 30, I get an error:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Is there any way to fetch a lot of URLs, maybe more efficiently? Up to several hundred pages?
Update:
I am using BeautifulSoup to parse every single page. I found this traceback in the GAE logs:
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 68, in post
heap = get_times(tmp_url, 160)
File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 106, in get_times
soup = BeautifulSoup(html)
File "libs/bs4/__init__.py", line 168, in __init__
self._feed()
File "libs/bs4/__init__.py", line 181, in _feed
self.builder.feed(self.markup)
File "libs/bs4/builder/_htmlparser.py", line 56, in feed
super(HTMLParserTreeBuilder, self).feed(markup)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 114, in feed
self.goahead(0)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 155, in goahead
startswith = rawdata.startswith
DeadlineExceededError
It's failing because you only have 60 seconds to return a response to the user, and I'm going to guess it's taking longer than that.
You will want to use this: https://cloud.google.com/appengine/articles/deferred
to create a task that has a 10-minute timeout. Then you can return instantly to the user and they can "pick up" the results at a later time via another handler (that you create). If collecting all the URLs takes longer than 10 minutes you'll have to split them up into further tasks.
See this: https://cloud.google.com/appengine/articles/deadlineexceedederrors
to understand why you cannot go longer than 60 seconds.
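A minimal sketch of the deferred approach, where fetch_pages is a hypothetical helper holding the loop from the question:
import webapp2
from google.appengine.ext import deferred

def fetch_pages(start, end):
    # Runs as a task queue task with a 10-minute deadline,
    # instead of inside the 60-second user-facing request.
    for page in range(start, end):
        pass  # fetch and parse one page, then store the result somewhere

class GetStats(webapp2.RequestHandler):
    def post(self):
        deferred.defer(fetch_pages, 1, 50)  # enqueue and return immediately
        self.response.write('Job queued; pick up the results from another handler later.')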
Edit:
This might come from App Engine quotas and limits.
Sorry for my previous answer:
This looks like protection on the server side to avoid DDoS or scraping from a single client. You have a few options:
Wait a certain amount of time between batches of requests before continuing.
Make the requests from several clients with different IP addresses and send the information back to your main script (it might be costly to rent different servers for this).
You could also check whether the website has an API to access the data you need.
You should also take care, as the site owner could block/blacklist your IP if they decide your requests are abusive.

How to set an individual datastore per local GAE app instance?

For testing purposes I want to start two instances of a GAE app locally. However, the second instance fails to start because there is already a lock on the local database imposed by the first instance.
INFO 2014-09-28 05:14:22,751 admin_server.py:117] Starting admin server at: http://localhost:8081
OperationalError('database is locked',)
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/cherrypy/cherrypy/wsgiserver/wsgiserver2.py", line 1302, in communicate
req.respond()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/cherrypy/cherrypy/wsgiserver/wsgiserver2.py", line 831, in respond
self.server.gateway(self).respond()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/cherrypy/cherrypy/wsgiserver/wsgiserver2.py", line 2115, in respond
response = self.req.server.wsgi_app(self.env, self.start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/devappserver2/wsgi_server.py", line 266, in __call__
return app(environ, start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/devappserver2/module.py", line 1431, in __call__
return self._handle_request(environ, start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/devappserver2/module.py", line 641, in _handle_request
module=self._module_configuration.module_name)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/apiproxy_stub.py", line 165, in WrappedMethod
return method(self, *args, **kwargs)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/logservice/logservice_stub.py", line 172, in start_request
host, start_time, method, resource, http_version, module))
OperationalError: database is locked
Is there any way I can specify an alternative data store location in the second instance of my app?
It depends on how you start your application.
If you are using Java, you might want to look at this answer.
But keep in mind that your two apps won't be talking to the same datastore, so if you need data to persist between your instances, this won't work.
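For the Python dev server, one approach (an untested sketch; it assumes you can launch from the command line with dev_appserver.py rather than through the Launcher) is to give each instance its own ports and its own --datastore_path:
dev_appserver.py --port=8080 --admin_port=8000 \
    --datastore_path=/tmp/instance1.datastore app.yaml

dev_appserver.py --port=9080 --admin_port=9000 \
    --datastore_path=/tmp/instance2.datastore app.yaml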

How to lock a Google Cloud Storage text file for transaction-like operation

I have a text file (say, "X") stored on GCS, created and updated via the GCS Client Library. I use GAE Python. Every time a user of my website adds some data, I add a Task (taskqueue.Task) to the "default" queue to perform some actions, including modifying file "X".
Sometimes, I get the following error in the logs:
E 2014-07-20 03:19:06.238 500 3KB 430ms /t
0.1.0.2 - - [19/Jul/2014:14:49:06 -0700] "POST /t HTTP/1.1" 500 2569 "http://www.myappdomain.com/p" "AppEngine-Google; (+http://code.google.com/appengine)" "www.myappdomain.com" ms=430 cpu_ms=498 cpm_usd=0.000287 queue_name=default task_name=14629523467445182169 instance=00c61b117c48b4db44a58e0d454310843e7848 app_engine_release=1.9.7 trace_id=3db3eb580b76133e90947539c0446910
I 03:19:05.813 [class TaskQueueWorker] work=[sitemap_index_entry]
I 03:19:05.813 country_id=[US] country_name=[USA] state_id=[CA] state_name=[California] city_id=[SVL] city_name=[Sunnyvale]
I 03:19:05.836 locality_id_old=[-1] locality_id_new=[28]
I 03:19:05.879 locality_name_old=[] locality_name_new=[XYZ]
I 03:19:05.879 command=[ADD]
E 03:19:06.207 File on GCS has changed while reading.
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
rv = self.handle_exception(request, response, e)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
return self.handle_exception(e, self.app.debug)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~myappdomain/1.377368272328585247/main_v3.py", line 15259, in post
gcs_file = gcs.open (index_filename, mode='r')
File "/base/data/home/apps/s~myappdomain/1.377368272328585247/cloudstorage/cloudstorage_api.py", line 94, in open
buffer_size=read_buffer_size)
File "/base/data/home/apps/s~myappdomain/1.377368272328585247/cloudstorage/storage_api.py", line 220, in __init__
check_response_closure()
File "/base/data/home/apps/s~myappdomain/1.377368272328585247/cloudstorage/storage_api.py", line 448, in _checker
self._check_etag(resp_headers.get('etag'))
File "/base/data/home/apps/s~myappdomain/1.377368272328585247/cloudstorage/storage_api.py", line 476, in _check_etag
raise ValueError('File on GCS has changed while reading.')
ValueError: File on GCS has changed while reading.
I 03:19:06.235 Saved; key: __appstats__:045800, part: 144 bytes, full: 74513 bytes, overhead: 0.002 + 0.004; link: http://www.myappdomain.com/_ah/stats/details?time=1405806545812
I suspect that multiple triggered tasks try to open and update file "X" at the same time, and that causes the above exception. Please suggest a way to lock access to that file so that only one task is able to modify it at a time (similar to a transaction).
Appreciate your help and guidance.
UPDATE
Another way to prevent the above problem could be to modify one of the following queue.yaml parameters for the queue:
bucket_size
OR
max_concurrent_requests
But I'm not sure which one to modify.
A task queue with max_concurrent_requests = 1 should ensure that only one task edits the file at a time.
https://developers.google.com/appengine/docs/python/config/queue#Python_Defining_push_queues_and_processing_rates
If you want to prevent too many tasks from running at once or to prevent datastore contention, you use max_concurrent_requests.
max_concurrent_requests (push queues only)
Sets the maximum number of tasks that can be executed at any given time in the specified queue. The value is an integer. By default, this directive is unset and there is no limit on the maximum number of concurrent tasks. One use of this directive is to prevent too many tasks from running at once or to prevent datastore contention.
Restricting the maximum number of concurrent tasks gives you more control over your queue's rate of execution. For example, you can constrain the number of instances that are running the queue's tasks. Limiting the number of concurrent requests in a given queue allows you to make resources available for other queues or online processing.
Of course, you should build in logic that allows failed tasks to retry, or you may end up with worse problems than you have now.
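For reference, a minimal sketch of the relevant queue.yaml entry (the rate value is just an example; only max_concurrent_requests matters for serializing the edits):
queue:
- name: default
  rate: 5/s
  max_concurrent_requests: 1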
It is also possible to rely on GCS itself, using preconditions.
A precondition lets an update succeed only if the object has not changed in the meantime. See the docs:
Preconditions are often used in mutating requests — uploads, deletes, copies, or metadata updates — to prevent race conditions. Race conditions can arise when the same request is sent repeatedly or when independent processes interfere with each other. For example, multiple request retries after a network interruption, or users performing a read-modify-write operation on the same object can create race conditions.

AppEngine 'explicitly cancelled' error

I'm using Google AppEngine and the deferred library, with the Mapper class, as described here (with some improvements as in here). In some iterations of the mapper I get the following error:
CancelledError: The API call datastore_v3.Put() was explicitly cancelled.
The Mapper usually runs fine. I used to have a higher batch size, so that it would actually hit the DeadlineExceededError, and that was handled correctly.
Just to be sure, I reduced the batch_size to a very low number, so that it never even hits the DeadlineExceededError but I still get the CancelledError.
The stack trace is as follows:
File "utils.py", line 114, in _continue
self._batch_write()
File "utils.py", line 76, in _batch_write
db.put(self.to_put)
File "/google/appengine/ext/db/__init__.py", line 1238, in put
keys = datastore.Put(entities, rpc=rpc)
File "/google/appengine/api/datastore.py", line 255, in Put
'datastore_v3', 'Put', req, datastore_pb.PutResponse(), rpc)
File "/google/appengine/api/datastore.py", line 177, in _MakeSyncCall
rpc.check_success()
File "/google/appengine/api/apiproxy_stub_map.py", line 474, in check_success
self.__rpc.CheckSuccess()
File "/google/appengine/api/apiproxy_rpc.py", line 126, in CheckSuccess
raise self.exception
CancelledError: The API call datastore_v3.Put() was explicitly cancelled.
I can't really find a lot of information about this 'explicitly cancelled' error, so I was wondering what caused it and how to investigate.
After a DeadlineExceededError, you are allowed a short grace period to handle the exception, e.g. to defer the remainder of the computation.
If you run out of grace time, the CancelledError kicks in.
There should be no way to catch/handle the CancelledError.
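To make the first point concrete, a minimal sketch of deferring the remainder within the grace period (run_batch and cursor are hypothetical stand-ins for the Mapper's own continuation state):
from google.appengine.ext import deferred
from google.appengine.runtime import DeadlineExceededError

def run_batch(cursor=None):
    try:
        pass  # process one batch of entities, starting from cursor
    except DeadlineExceededError:
        # Use the grace period only to re-queue the remaining work quickly;
        # doing slow work here is what leads to RPCs being explicitly cancelled.
        deferred.defer(run_batch, cursor)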
