Fetching a lot of URLs in Python with Google App Engine

In my subclass of RequestHandler, I am trying to fetch a range of URLs:
import urllib2
import webapp2

class GetStats(webapp2.RequestHandler):
    def post(self):
        lastpage = 50
        heap = []
        for page in range(1, lastpage):
            tmpurl = url + str(page)  # url is defined elsewhere
            response = urllib2.urlopen(tmpurl, timeout=5)
            html = response.read()
            # ... some parsing of the html ...
            heap.append(result_of_parsing)
        self.response.write(heap)
It works with roughly 30 URLs (the page takes a long time to load, but it works).
With more than about 30 I get an error:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Is there any way to fetch a lot of URLs, ideally more efficiently?
Up to several hundred pages?
Update:
I am using BeautifulSoup to parse every single page. I found this traceback in the GAE logs:
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 68, in post
heap = get_times(tmp_url, 160)
File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 106, in get_times
soup = BeautifulSoup(html)
File "libs/bs4/__init__.py", line 168, in __init__
self._feed()
File "libs/bs4/__init__.py", line 181, in _feed
self.builder.feed(self.markup)
File "libs/bs4/builder/_htmlparser.py", line 56, in feed
super(HTMLParserTreeBuilder, self).feed(markup)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 114, in feed
self.goahead(0)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 155, in goahead
startswith = rawdata.startswith
DeadlineExceededError

It's failing because you only have 60 seconds to return a response to the user, and I'm going to guess it's taking longer than that.
You will want to use the deferred library (https://cloud.google.com/appengine/articles/deferred)
to create a task that has a 10-minute timeout. Then you can return instantly to the user, and they can "pick up" the results at a later time via another handler (that you create). If collecting all the URLs takes longer than 10 minutes, you'll have to split them up into further tasks.
See https://cloud.google.com/appengine/articles/deadlineexceedederrors
to understand why you cannot go longer than 60 seconds.
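A minimal sketch of that approach with the deferred library (this assumes the deferred builtin is enabled in app.yaml; fetch_stats_pages, parse_page and save_results are hypothetical helpers, not part of the original code):
import urllib2
import webapp2
from google.appengine.ext import deferred

class GetStats(webapp2.RequestHandler):
    def post(self):
        # Enqueue the slow work instead of doing it in the request;
        # a deferred task gets up to 10 minutes instead of 60 seconds.
        deferred.defer(fetch_stats_pages, url, 1, 50)
        self.response.write('Fetching started, check back later.')

def fetch_stats_pages(base_url, first_page, last_page):
    # Hypothetical worker: fetch and parse each page, then store the
    # results somewhere another handler can pick them up later
    # (datastore or memcache, for example).
    results = []
    for page in range(first_page, last_page):
        html = urllib2.urlopen(base_url + str(page), timeout=5).read()
        results.append(parse_page(html))  # parse_page: your existing parsing code
    save_results(results)                 # hypothetical storage helper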

Edit:
This might come from App Engine quotas and limits.
Sorry for the previous answer:
This looks like server-side protection against DDoS or scraping from a single client. You have a few options:
Wait between batches of requests before continuing (a small throttling sketch follows below).
Make requests from several clients with different IP addresses and send the information back to your main script (renting different servers for this might be costly).
Check whether the website has an API you can use to access the data you need.
You should also take care: the site owner could block or blacklist your IP if they decide your requests are abusive.
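A minimal sketch of the first option, simple client-side throttling (the batch size and delay here are arbitrary assumptions, and on App Engine the sleeping still counts against the request deadline, so this belongs inside a task rather than a user-facing handler):
import time
import urllib2

def fetch_throttled(base_url, last_page, batch_size=10, pause_seconds=5):
    results = []
    for page in range(1, last_page):
        html = urllib2.urlopen(base_url + str(page), timeout=5).read()
        results.append(html)
        # Pause after every batch so the remote server is not hammered.
        if page % batch_size == 0:
            time.sleep(pause_seconds)
    return results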

Related

Python SocketIO KeyError: 'Session is disconnected'

On a small Flask web server running on a Raspberry Pi with about 10-20 clients, we periodically get this error:
Error on request:
Traceback (most recent call last):
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/werkzeug/serving.py", line 270, in run_wsgi
execute(self.server.app)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/werkzeug/serving.py", line 258, in execute
application_iter = app(environ, start_response)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/flask/app.py", line 2309, in __call__
return self.wsgi_app(environ, start_response)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/flask_socketio/__init__.py", line 43, in __call__
start_response)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/engineio/middleware.py", line 47, in __call__
return self.engineio_app.handle_request(environ, start_response)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/socketio/server.py", line 360, in handle_request
return self.eio.handle_request(environ, start_response)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/engineio/server.py", line 291, in handle_request
socket = self._get_socket(sid)
File "/home/pi/3D_printer_control/env/lib/python3.7/site-packages/engineio/server.py", line 427, in _get_socket
raise KeyError('Session is disconnected')
KeyError: 'Session is disconnected'
The error is generated automatically from inside python-socketio. What does this error really mean and how can I prevent or suppress it?
As far as I can tell, this usually means the server can't keep up with supplying data to all of the clients.
Some possible mitigation techniques include disconnecting inactive clients, reducing the amount of data sent where possible, sending live data in larger chunks, or upgrading the server. If you need a lot of data throughput, there may also be a better option than SocketIO.
I have been able to reproduce it by setting a really high ping rate and a low timeout in the SocketIO constructor:
from flask_socketio import SocketIO
socketio = SocketIO(engineio_logger=True, ping_timeout=5, ping_interval=5)
This means the server has to do a lot of messaging to all of the clients and they don't have long to respond. I then open around 10 clients and I start to see the KeyError.
Further debugging of our server found a process that was posting lots of live data which ran fine with only a few clients but starts to issue the occasional KeyError once I get up to about a dozen.
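As a rough illustration of the first mitigation above (disconnecting inactive clients), here is a minimal sketch; the 60-second threshold, the 'message' event, and the idea of stamping activity on each incoming message are assumptions rather than part of the original setup:
import time
from flask import Flask, request
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app, engineio_logger=True)
last_seen = {}  # sid -> timestamp of the client's last message

@socketio.on('message')
def handle_message(data):
    last_seen[request.sid] = time.time()

@socketio.on('disconnect')
def handle_disconnect():
    last_seen.pop(request.sid, None)

def drop_idle_clients(max_idle=60):
    # Call this periodically (e.g. from a background task) to force-disconnect
    # any client that has been silent for more than max_idle seconds.
    now = time.time()
    for sid, seen in list(last_seen.items()):
        if now - seen > max_idle:
            socketio.server.disconnect(sid)  # underlying python-socketio server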

Python: Get queue and emit messages via Flask-socketio, is this possible with a While True?

I have some working code that gets data from a queue, processes it and then emits the data via Flask-SocketIO to the browser.
This works when there aren't many messages to emit; however, when the workload increases it simply can't cope.
Additionally, I have tried processing the queue without transmitting, and this appears to work correctly.
So what I was hoping to do is, rather than emit every time queue.get() fires, simply emit whatever the current dataset is on every ping.
In theory this should mean that even if the queue.get() fires multiple times between ping and pong, the messages being sent should remain constant and won't overload the system.
It means of course, some data will not be sent, but the most up-to date data as of the last ping should be sent, which is sufficient for what I need.
Hopefully that makes sense, so on to the (sort of working) code...
This works when there are not a lot of messages to emit (the socketio sleep needs to be there, otherwise the processing doesn't occur before it tries to emit):
def handle_message(*_args, **_kwargs):
    try:
        order_books = order_queue.get(block=True, timeout=1)
    except Empty:
        order_books = None
    market_books = market_queue.get()
    uo = update_orders(order_books, mkt_runners, profitloss, trading, eo, mb)
    update_market_book(market_books, mb, uo, market_catalogues, mkt_runners, profitloss, trading)
    socketio.sleep(0.2)
    emit('initial_market', {'message': 'initial_market', 'mb': json.dumps(mb), 'ob': json.dumps(eo)})
    socketio.sleep(0.2)
    emit('my_response2', {'message': 'pong'})

def main():
    socketio.run(app, debug=True, port=3000)
    market_stream.stop()
    order_stream.stop()

if __name__ == '__main__':
    main()
This works (but I'm not trying to emit any messages here, this is just the Python script getting from the queue and processing):
while True:
    try:
        order_books = order_queue.get(block=True, timeout=1)
    except Empty:
        order_books = None
    market_books = market_queue.get()
    uo = update_orders(order_books, mkt_runners, profitloss, trading, eo, mb)
    update_market_book(market_books, mb, uo, market_catalogues, mkt_runners, profitloss, trading)
    print(mb)
(The mb at the end is the current dataset, which is returned by the update_market_book function).
Now, I was hoping that by having the while True loop at the end, it could simply run and the handler would return the latest dataset on every ping. However, with the code above, the while True loop only runs when the main() function is taken out, which of course stops the socketio section from working.
So, is there a way I can combine these two to achieve what I am trying to do, and/or is there an alternative approach I haven't considered that might work?
As always, I appreciate your advice and if the question is not clear, please let me know, so I can clarify any bits.
Many thanks!
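For reference, here is a minimal sketch of one way to combine the two snippets above: run the queue-processing loop as a Flask-SocketIO background task and have the ping handler emit only the most recent dataset. It reuses the names from the snippets above; the shared latest dictionary and the 'my_ping' event name are assumptions:
latest = {'mb': None, 'eo': None}  # most recent dataset, shared with the handler

def process_queues():
    while True:
        try:
            order_books = order_queue.get(block=True, timeout=1)
        except Empty:
            order_books = None
        market_books = market_queue.get()
        uo = update_orders(order_books, mkt_runners, profitloss, trading, eo, mb)
        update_market_book(market_books, mb, uo, market_catalogues, mkt_runners, profitloss, trading)
        latest['mb'], latest['eo'] = mb, eo
        socketio.sleep(0)  # yield so the server can keep serving clients

@socketio.on('my_ping')  # hypothetical ping event name
def handle_ping(*_args, **_kwargs):
    if latest['mb'] is not None:
        emit('initial_market', {'message': 'initial_market',
                                'mb': json.dumps(latest['mb']),
                                'ob': json.dumps(latest['eo'])})
    emit('my_response2', {'message': 'pong'})

def main():
    socketio.start_background_task(process_queues)
    socketio.run(app, debug=True, port=3000)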
Just adding a stack trace as requested:
Traceback (most recent call last):
File "D:\Python37\lib\site-packages\betfairlightweight\endpoints\login.py", line 38, in request
response = session.post(self.url, data=self.data, headers=self.client.login_headers, cert=self.client.cert)
File "D:\Python37\lib\site-packages\requests\api.py", line 116, in post
return request('post', url, data=data, json=json, **kwargs)
File "D:\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "D:\Python37\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "D:\Python37\lib\site-packages\requests\sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "D:\Python37\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
File "D:\Python37\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "D:\Python37\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "D:\Python37\lib\site-packages\urllib3\connectionpool.py", line 839, in _validate_conn
conn.connect()
File "D:\Python37\lib\site-packages\urllib3\connection.py", line 332, in connect
cert_reqs=resolve_cert_reqs(self.cert_reqs),
File "D:\Python37\lib\site-packages\urllib3\util\ssl_.py", line 279, in create_urllib3_context
context.options |= options
File "D:\Python37\lib\ssl.py", line 507, in options
super(SSLContext, SSLContext).options.__set__(self, value)
File "D:\Python37\lib\ssl.py", line 507, in options
super(SSLContext, SSLContext).options.__set__(self, value)
File "D:\Python37\lib\ssl.py", line 507, in options
super(SSLContext, SSLContext).options.__set__(self, value)
[Previous line repeated 489 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object

How to set individual data store per local GAE app instance?

For testing purposes I want to start two instances of a GAE app locally. However, the second instance fails to start because there is already a lock on the local database imposed by the first instance.
INFO 2014-09-28 05:14:22,751 admin_server.py:117] Starting admin server at: http://localhost:8081
OperationalError('database is locked',)
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/cherrypy/cherrypy/wsgiserver/wsgiserver2.py", line 1302, in communicate
req.respond()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/cherrypy/cherrypy/wsgiserver/wsgiserver2.py", line 831, in respond
self.server.gateway(self).respond()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/cherrypy/cherrypy/wsgiserver/wsgiserver2.py", line 2115, in respond
response = self.req.server.wsgi_app(self.env, self.start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/devappserver2/wsgi_server.py", line 266, in __call__
return app(environ, start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/devappserver2/module.py", line 1431, in __call__
return self._handle_request(environ, start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/devappserver2/module.py", line 641, in _handle_request
module=self._module_configuration.module_name)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/apiproxy_stub.py", line 165, in WrappedMethod
return method(self, *args, **kwargs)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/logservice/logservice_stub.py", line 172, in start_request
host, start_time, method, resource, http_version, module))
OperationalError: database is locked
Is there any way I can specify an alternative data store location in the second instance of my app?
It depends on how you start your application.
If you are using Java, you might want to look at this answer.
But keep in mind your two apps won't be talking to the same datastores, so if you need data to persist between your instances, this won't work.
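If you are starting the instances with the Python dev server (dev_appserver.py) directly, one sketch of an approach is to give each instance its own --storage_path (and its own ports) so the local datastore files don't collide; the ports and paths below are arbitrary:
dev_appserver.py --port=8080 --admin_port=8000 --storage_path=/tmp/instance1 app.yaml
dev_appserver.py --port=9080 --admin_port=9000 --storage_path=/tmp/instance2 app.yaml
As noted above, the two instances will then be using completely separate local datastores.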

App Engine to GCS | Uploading Issue

I have some files that I'm receiving from the Evernote API (via getResource) and writing to Google Cloud Storage with the following code:
gcs_file = gcs.open(filename, 'w', content_type=res.mime,
                    retry_params=write_retry_params)
# Retrieve the binary data and write to GCS
resource_file = note_store.getResource(res.guid, True, False, False, False)
gcs_file.write(resource_file.data.body)
gcs_file.close()
For some types of documents it still works, but there are certain documents for which GCS throws this in the logs:
Unable to fetch URL: https://storage.googleapis.com/evernoteresources/5db799f1-c03c-4056-812a-6d77bad55261/Sleep Away.mp3
and
Got exception while contacting GCS. Will retry in 0.11 seconds.
There doesn't seem to be any pattern to these errors. It happens with documents, sounds, pictures, whatever; some of these document types work and some don't. It isn't due to size (some small files work and some large ones do too).
Any ideas?
Here's the full stack trace, though I'm not sure it will help.
Encountered unexpected error from ProtoRPC method implementation: TimeoutError (('Request to Google Cloud Storage timed out.', DownloadError('Unable to fetch URL: https://storage.googleapis.com/evernoteresources/78413585-2266-4426-b08c-71d6c224f266/Evernote Snapshot 20130512 124546.jpg',)))
Traceback (most recent call last):
File "/python27_runtime/python27_lib/versions/1/protorpc/wsgi/service.py", line 181, in protorpc_service_app
response = method(instance, request)
File "/python27_runtime/python27_lib/versions/1/google/appengine/ext/endpoints/api_config.py", line 972, in invoke_remote
return remote_method(service_instance, request)
File "/python27_runtime/python27_lib/versions/1/protorpc/remote.py", line 412, in invoke_remote_method
response = method(service_instance, request)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/endpoints.py", line 61, in get_note_details
url = tools.registerResource(note_store, req.note_guid, r)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/GlobalUtilities.py", line 109, in registerResource
retry_params=write_retry_params)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/cloudstorage/cloudstorage_api.py", line 69, in open
return storage_api.StreamingBuffer(api, filename, content_type, options)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/cloudstorage/storage_api.py", line 526, in __init__
status, headers, _ = self._api.post_object(path, headers=headers)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/cloudstorage/rest_api.py", line 41, in sync_wrapper
return future.get_result()
File "/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 325, in get_result
self.check_success()
File "/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 368, in _help_tasklet_along
value = gen.throw(exc.__class__, exc, tb)
File "/base/data/home/apps/s~quinector/2a.368528733040360018/cloudstorage/storage_api.py", line 84, in do_request_async
'Request to Google Cloud Storage timed out.', e)
TimeoutError: ('Request to Google Cloud Storage timed out.', DownloadError('Unable to fetch URL: https://storage.googleapis.com/evernoteresources/78413585-2266-4426-b08c-71d6c224f266/Evernote Snapshot 20130512 124546.jpg',))
This is a bug in the GCS client code. It should properly handle the filename; the fact that it uses an HTTP request to GCS should be "hidden". This will be fixed soon. Thanks!
Note that if you quote the filename yourself to work around this bug, the filename will be double-quoted after the fix. Sorry.
Thank you Brian! The problem was the spaces in the filenames. I just used urllib2.quote() to get those out of there and it works like a charm.
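For reference, a minimal sketch of that workaround, quoting only the object name before passing it to gcs.open (the way the filename is built from the resource GUID and file name here is an assumption based on the URLs in the errors, and note the caveat above about double quoting once the client library is fixed):
import urllib2

# Percent-encode the file name so spaces don't break the URL the GCS
# client builds (e.g. 'Sleep Away.mp3' -> 'Sleep%20Away.mp3').
safe_name = urllib2.quote(res.attributes.fileName)  # res.attributes.fileName assumed
filename = '/evernoteresources/' + res.guid + '/' + safe_name
gcs_file = gcs.open(filename, 'w', content_type=res.mime,
                    retry_params=write_retry_params)
gcs_file.write(resource_file.data.body)
gcs_file.close()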

How to handle 60 secs timeout with heavy queries

I have to make some heavy queries on my datastore to obtain some high-level information. When it reaches 60 seconds I get an error that I suppose is a timeout cut-off:
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 207, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/admin/__init__.py", line 140, in xsrf_required_decorator
method(self)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/admin/__init__.py", line 348, in post
exec(compiled_code, globals())
File "<string>", line 28, in <module>
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2314, in next
return self.__model_class.from_entity(self.__iterator.next())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 1442, in from_entity
return cls(None, _from_entity=entity, **entity_values)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 958, in __init__
if isinstance(_from_entity, datastore.Entity) and _from_entity.is_saved():
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/datastore.py", line 814, in is_saved
self.__key.has_id_or_name())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/datastore_types.py", line 565, in has_id_or_name
elems = self.__reference.path().element_list()
DeadlineExceededError
This is not an application query; I am interacting with my app through the Interactive Console, so this is not a live problem. My problem is that I have to iterate over all my application users, checking large amounts of data that I need to retrieve for each of them. I could do it one by one by hard-coding their user_id, but it would be slow and inefficient.
Can you think of any way I could do this faster? Is there any way to select the users five at a time, like LIMIT=5 to get only the first 5 users, then the next 5, and so on, iterating over all of them but with lighter queries? Can I set a longer timeout?
Any way you can think about I could deal with this problem?
You could use a cursor to pick up your search where you left off in conjunction with limit:
Returns a base64-encoded cursor string denoting the position in the query's result set following the last result retrieved. The cursor string is safe to use in HTTP GET and POST parameters, and can also be stored in the Datastore or Memcache. A future invocation of the same query can provide this string via the start_cursor parameter or the with_cursor() method to resume retrieving results from this position.
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_cursor
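A minimal sketch of that pattern with the old db API (the User model name and the batch size of 5 are assumptions):
from google.appengine.ext import db

def process_users_batch(cursor=None, batch_size=5):
    query = User.all()
    if cursor:
        query.with_cursor(cursor)  # resume where the previous batch stopped
    users = query.fetch(batch_size)
    for user in users:
        do_heavy_work(user)        # hypothetical per-user work
    # Hand this cursor to the next call (or the next task) to continue.
    return query.cursor() if users else None
Each call stays well under the deadline, and the returned cursor can be stored (for example in memcache, or passed along to a task) so the next invocation picks up where the last one stopped.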
I'd write a simple request handler to do the task.
Either write it in a way that it can be run on mapreduce, or launch a backend to run your handler.
First, getting your entities in batches will reduce the communication time between your application and the datastore significantly. For details on this, take a look at 10 things you (probably) didn't know about App Engine.
Then, you can assign this procedure to Task Queues that enable you to execute tasks up to 10 minutes. For more information on Task Queues, take a look at The Task Queue Python API.
Finally, for tasks that need more time you can also consider the use of Backends. For more information you can take a look at Backends (Python).
Hope this helps.
