Handling GAE BlobStore Exception via webapp2 handler - python

I have been banging my head against this issue for a while and have not come up with a solution. I am attempting to trap the UploadEntityTooLargeEntity exception. This exception is raised by GAE when two things happen:
Set the max_bytes_total param in the create_upload_url:
self.template_values['AVATAR_SAVE_URL'] = blobstore.create_upload_url(
    '/saveavatar', max_bytes_total=524288)
Attempt to post an item that exceeds the max_bytes_total.
I expected that, since my class is derived from RequestHandler, my error() method would be called. Instead I get a 413 screen telling me the upload is too large.
My request handler is derived from webapp2.RequestHandler. Is it expected that GAE will call the error method inherited from webapp2.RequestHandler? I'm not seeing this in GAE's code, but I can't imagine there would be such an omission.
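For illustration, a minimal sketch of the kind of handler involved (the route, class name and messages are made up); webapp2's handle_exception hook only runs for exceptions raised inside the handler and, as the answer below explains, is never reached for this 413:
import logging

import webapp2


class SaveAvatarHandler(webapp2.RequestHandler):
    def post(self):
        # hypothetical upload handling for /saveavatar
        self.response.write('ok')

    def handle_exception(self, exception, debug):
        # webapp2 calls this for exceptions raised inside the handler.
        # It is never reached for the oversized-upload 413, because App
        # Engine rejects the request before it reaches the application.
        logging.exception(exception)
        self.error(500)
        self.response.write('Something went wrong')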

The 413 is generated by the App Engine infrastructure; the request never reaches your app, so it's impossible to handle this condition yourself.

Why is python requests not terminating and why are these separate logs printed?

I am running a job which makes many requests to retrieve data from an API. In order to make the requests, I am using the requests module and an iteration over this code:
logger.debug("Some log message")
response = requests.get(
    url=self._url,
    headers=self.headers,
    auth=self.auth,
)
logger.debug("Some other log message")
This usually produces the following logs:
[...] Some log message
[2019-08-27 03:00:57,201 - DEBUG - connectionpool.py:393] https://my.url.com:port "GET /some/important/endpoint?$skiptoken='12345' HTTP/1.1" 401 0
[2019-08-27 03:00:57,601 - DEBUG - connectionpool.py:393] https://my.url.com:port "GET /some/important/endpoint?$skiptoken='12345' HTTP/1.1" 200 951999
[...] Some other log message
In very rare occasions however, the job never terminates and in the logs it says:
[...] Some log message
[2019-08-27 03:00:57,201 - DEBUG - connectionpool.py:393] https://my.url.com:port "GET /some/important/endpoint?$skiptoken='12345' HTTP/1.1" 401 0
The remaining log messages were never printed and the call never returned. I am not able to reproduce the issue: I manually made the request that never returned, but it gave me the desired response.
Questions:
Why does urllib3 always print a log line with status code 401 before printing one with status code 200? Is this always the case, or is it caused by an issue with the authentication or with the API server?
In the rare case of the second log snippet, is my assumption correct that the application is stuck making a request which never returns? Or:
a) Could requests.get throw an exception which results in the other log statements never being printed, and which then "magically" gets caught somewhere in my code?
b) Is there a different possibility which I have not realised?
Additional Information:
Python 2.7.13 (We are already in the middle of upgrading to Python3, but this needs to be solved before that is completed)
requests 2.21.0
urllib3 1.24.3
auth is passed a requests.auth.HTTPDigestAuth(username, password)
My code has no try/except block, which is why I wrote "magically" in Question 2.a. This is because we would prefer the job to fail "loudly".
I am iterating over a generator yielding urls in order to make multiple requests
The job is run by Jenkins 2.95 on a schedule
When everything runs successfully, it makes around 300 requests in about 5 minutes
I am running two Python scripts, both running the same code but against different endpoints, in one job, in parallel
Update
Answer to Q1:
This seems to be expected behaviour for HTTP Digest Auth.
See this github issue and Wikipedia.
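For what it's worth, a minimal standalone sketch of that round trip (the URL and credentials are placeholders): requests first sends the request without credentials, receives the 401 challenge, then retries with a digest Authorization header, which is why urllib3 logs a 401 line followed by a 200 line at DEBUG level.
import logging

import requests
from requests.auth import HTTPDigestAuth

# DEBUG level makes urllib3's connectionpool log every HTTP round trip,
# which is where the paired 401/200 lines come from.
logging.basicConfig(level=logging.DEBUG)

response = requests.get(
    "https://my.url.com/some/important/endpoint",  # placeholder URL
    auth=HTTPDigestAuth("username", "password"),   # placeholder credentials
    timeout=30,
)
print(response.status_code)  # 200, after the internal 401 challenge round trip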
To answer your question:
1. It seems like it's a problem with your API. To make sure, can you run a curl command and check?
curl -i https://my.url.com:port/some/important/endpoint?$skiptoken='12345'
2. It never terminates, maybe because the API is not responding. Add a timeout to avoid this kind of blocking:
response = requests.get(
    url=self._url,
    headers=self.headers,
    auth=self.auth,
    timeout=60
)
Hope this helps your problem.
As Vithulan already answered, you should always set a timeout value when doing network calls - unless you don't care about your process staying stuck forever that is...
Now, with regard to error handling etc.:
a) Could requests.get throw an exception which results in the other log statements never being printed, and which then "magically" gets caught somewhere in my code?
It's indeed possible that some other try/except block higher in the call stack swallows the exception, but only you can tell. Note that if that's the case, you have some very ill-behaved code - a try/except should 1/ only target the exact exception(s) it's supposed to handle, 2/ have the least possible code within the try block to avoid catching similar errors from another part of the code, and 3/ never silence an exception (IOW it should at least log the exception and traceback).
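A tiny sketch of what those three rules look like in practice (parse_response and raw_body are made-up names standing in for whatever risky call your code makes):
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def parse_response(raw_body):
    # stand-in for whatever risky parsing the real code does
    return json.loads(raw_body)


raw_body = '{"not": "quite", "valid"'  # deliberately malformed

try:
    payload = parse_response(raw_body)        # 2/ only the risky call inside the try block
except ValueError as exc:                     # 1/ catch only the exception you expect
    logger.exception("could not parse response: %s", exc)  # 3/ never silence it...
    raise                                     # ...and re-raise so the caller still sees it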
Note that you might as well just have a deactivated logger FWIW ;-)
This being said, and until you've made sure you don't have such an issue, you can still get more debugging info by logging requests exceptions in your function:
logger.debug("Some log message")
try:
response = requests.get(
url=self._url,
headers=self.headers,
auth=self.auth,
timeout=SOME_TIMEOUT_VALUE
)
except Exception as e:
# this will log the full traceback too
logger.exception("oops, call to %s failed : %s", self._url, e)
# make sure we don't swallow the exception
raise
logger.debug("Some other log message")
Now a fact of life is that HTTP requests can fail for such an awful lot of reasons that you should actually expect them to fail sometimes, so you may want to have some retry mechanism. Also, the fact that the call to requests.get didn't raise doesn't mean the call succeeded - you still have to check the response code (or use response.raise_for_status()).
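As an example, one possible retry setup - a sketch only, not necessarily what you should copy verbatim - using requests' HTTPAdapter with urllib3's Retry (both available in the versions you listed), plus raise_for_status() so HTTP error codes become exceptions:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(
    total=3,                           # up to 3 retries per request
    backoff_factor=1,                  # exponential backoff between retries
    status_forcelist=[502, 503, 504],  # also retry on these server errors
)
session.mount("https://", HTTPAdapter(max_retries=retry))

response = session.get(
    "https://my.url.com/some/important/endpoint",  # placeholder URL
    timeout=60,
)
response.raise_for_status()  # a 4xx/5xx becomes an exception instead of passing silently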
EDIT:
As mentioned in my question, my code has no try/except block because we would like the entire job to terminate if any problem occurs.
A try/except block doesn't prevent you from terminating the job - just re-raise the exception (possibly after X retries), raise a new one instead, or call sys.exit() (which actually works by raising an exception) - and it lets you get useful debugging info etc., cf. my example code.
If there is an issue with the logger, this would then only occur on rare occasions. I cannot imagine a scenario where the same code is run but sometimes the loggers are activated and sometimes not.
I was talking about another logger higher up in the call stack. But this was only for completeness; I really think you just have a request that never returns for lack of a timeout.
Do you know why I am noticing the Issue I talk about in Question 1?
Nope, and that's actually something I'd immediately investigate since, AFAICT, for the same request you should have either only the 401 or only the 200.
According to the RFC:
10.4.2 401 Unauthorized
The request requires user authentication. The response MUST include a WWW-Authenticate header field (section 14.47) containing a challenge applicable to the requested resource. The client MAY repeat the request with a suitable Authorization header field (section 14.8).
If the request already included Authorization credentials, then the 401 response indicates that authorization has been refused for those credentials. If the 401 response contains the same challenge as the prior response, and the user agent has already attempted authentication at least once, then the user SHOULD be presented the entity that was given in the response, since that entity might include relevant diagnostic information.
So unless requests does something weird with auth headers (which is not the case as far as I remember, but...), you should only have one single response logged.
EDIT 2:
I wanted to say that if an exception is thrown but not explicitly caught by my code, it should terminate the job (which was the case in some tests I ran)
If an exception reaches the top of the call stack without being handled, the runtime will indeed terminate the process - but you have to be sure that no handler up the call stack kicks in and swallows the exception. Testing the function in isolation will not exhibit this issue, so you'd have to check the full call stack.
This being said:
The fact that it does not terminate suggests to me that no exception is thrown.
That's indeed the most likely, but only you can make sure it's really the case (we don't know the full code, the logger configuration etc).

Django strange behaviour when an exception occurs, using django-rest-framework

I'm developing some REST api for a webservice.
For some days now I've been experiencing a big problem that is blocking me.
When the code raises an exception (during development), the Django server responds only after 5 to 10 minutes... with the error that occurred.
To understand what is happening I started the server in debug mode using PyCharm... and then clicked pause during the long wait. The code is looping here, in python2.7/SocketServer.py:
def _eintr_retry(func, *args):
    """restart a system call interrupted by EINTR"""
    while True:
        try:
            return func(*args)
        except (OSError, select.error) as e:
            if e.args[0] != errno.EINTR:
                raise
print(foo)
What can I do? I'm pretty desperate!
Sometimes that happens in Django debug mode because Django generates a nice page with traceback and a list of all local variables for every stack frame.
When a Django model has an inefficient __str__ (__unicode__, since you're using Python 2) method, this can result in Django loading many thousands of objects from the database to display the traceback.
In my experience this is the only reason for very long pauses on exceptions in Django. Try running with DEBUG = False, or check exactly which model has an inefficient __str__ method that hits the database.
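For illustration, a hypothetical pair of models (the names are made up): a __unicode__ that touches a related object costs one extra query per instance when the debug page renders every stack frame's locals, while one that only uses already-loaded fields does not:
from django.db import models


class Customer(models.Model):
    name = models.CharField(max_length=100)


class Order(models.Model):
    customer = models.ForeignKey(Customer)

    # Expensive: rendering the repr of many Order objects on the debug
    # page issues one extra query per object to fetch the customer.
    def __unicode__(self):
        return u"Order for %s" % self.customer.name

    # Cheap alternative: only uses fields already loaded on the instance.
    # def __unicode__(self):
    #     return u"Order #%d" % self.pk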

How should I access the args of a Splash request when parsing the response on a 400 error?

I'm sending some arguments through to a Splash endpoint from a Scrapy crawler, to be used by a Splash script that will be run. Sometimes errors may occur in that script. For runtime errors I use pcall to wrap the questionable script, so I can gracefully catch and return runtime errors. For syntax errors, however, this doesn't work and a 400 error is thrown instead. I set my crawler to handle such errors with the handle_httpstatus_list attribute, so my parse callback is called on such an error and I can inspect what went wrong gracefully (the setup is sketched below).
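For context, a rough sketch of the setup described (the spider name, URL, arguments and script are illustrative), using scrapy_splash.SplashRequest and the handle_httpstatus_list attribute so 400 responses still reach the callback:
import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash)
    -- the real script wraps questionable code in pcall; a syntax error
    -- here would still produce an HTTP 400 from Splash
    return {custom_arg = splash.args.custom_arg}
end
"""


class ExampleSpider(scrapy.Spider):
    name = "example"                   # illustrative
    handle_httpstatus_list = [400]     # let 400 responses reach the callback

    def start_requests(self):
        yield SplashRequest(
            "http://example.com",
            callback=self.parse_result,
            endpoint="execute",
            args={"lua_source": LUA_SCRIPT, "custom_arg": "value"},
        )

    def parse_result(self, response):
        # On a 400 the script never ran, so the question is how to recover
        # args={"custom_arg": ...} here without relying on a protected method.
        self.logger.info("status: %s", response.status)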
So far so good, except that since I couldn't handle the syntax error gracefully in the Lua script, I wasn't able to return some of the input Splash args that the callback expects to access. Right now I'm calling response._splash_args() on the SplashJsonResponse object that I do have, and this allows me to access those values. However, this is essentially a protected method, which means it might not be a long-term solution.
Is there a better way to access the Splash args in the response when you can't rely on the associated splash script running at all?

Invalid transaction persisting across requests

Summary
One of our threads in production hit an error and is now producing InvalidRequestError: This session is in 'prepared' state; no further SQL can be emitted within this transaction. errors, on every request with a query that it serves, for the rest of its life! It's been doing this for days, now! How is this possible, and how can we prevent it going forward?
Background
We are using a Flask app on uWSGI (4 processes, 2 threads), with Flask-SQLAlchemy providing us DB connections to SQL Server.
The problem seemed to start when one of our threads in production was tearing down its request, inside this Flask-SQLAlchemy method:
@teardown
def shutdown_session(response_or_exc):
    if app.config['SQLALCHEMY_COMMIT_ON_TEARDOWN']:
        if response_or_exc is None:
            self.session.commit()
    self.session.remove()
    return response_or_exc
...and somehow managed to call self.session.commit() when the transaction was invalid. This resulted in sqlalchemy.exc.InvalidRequestError: Can't reconnect until invalid transaction is rolled back getting output to stdout, in defiance of our logging configuration, which makes sense since it happened during the app context tearing down, which is never supposed to raise exceptions. I'm not sure how the transaction got to be invalid without response_or_exc getting set, but that's actually the lesser problem AFAIK.
The bigger problem is, that's when the "'prepared' state" errors started, and haven't stopped since. Every time this thread serves a request that hits the DB, it 500s. Every other thread seems to be fine: as far as I can tell, even the thread that's in the same process is doing OK.
Wild guess
The SQLAlchemy mailing list has an entry about the "'prepared' state" error saying it happens if a session started committing and hasn't finished yet, and something else tries to use it. My guess is that the session in this thread never got to the self.session.remove() step, and now it never will.
I still feel like that doesn't explain how this session is persisting across requests though. We haven't modified Flask-SQLAlchemy's use of request-scoped sessions, so the session should get returned to SQLAlchemy's pool and rolled back at the end of the request, even the ones that are erroring (though admittedly, probably not the first one, since that raised during the app context tearing down). Why are the rollbacks not happening? I could understand it if we were seeing the "invalid transaction" errors on stdout (in uwsgi's log) every time, but we're not: I only saw it once, the first time. But I see the "'prepared' state" error (in our app's log) every time the 500s occur.
Configuration details
We've turned off expire_on_commit in the session_options, and we've turned on SQLALCHEMY_COMMIT_ON_TEARDOWN. We're only reading from the database, not writing yet. We're also using Dogpile-Cache for all of our queries (using the memcached lock since we have multiple processes, and actually, 2 load-balanced servers). The cache expires every minute for our major query.
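In code, that configuration looks roughly like the following sketch (not our exact factory; the DSN is a placeholder):
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'mssql+pyodbc://placeholder-dsn'
app.config['SQLALCHEMY_COMMIT_ON_TEARDOWN'] = True   # commit on successful teardown

# don't expire attributes after commit, as described above
db = SQLAlchemy(app, session_options={'expire_on_commit': False})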
Updated 2014-04-28: Resolution steps
Restarting the server seems to have fixed the problem, which isn't entirely surprising. That said, I expect to see it again until we figure out how to stop it. benselme (below) suggested writing our own teardown callback with exception handling around the commit, but I feel like the bigger problem is that the thread was messed up for the rest of its life. The fact that this didn't go away after a request or two really makes me nervous!
Edit 2016-06-05:
A PR that solves this problem was merged on May 26, 2016: Flask PR 1822.
Edit 2015-04-13:
Mystery solved!
TL;DR: Be absolutely sure your teardown functions succeed, by using the teardown-wrapping recipe in the 2014-12-11 edit!
Started a new job also using Flask, and this issue popped up again, before I'd put in place the teardown-wrapping recipe. So I revisited this issue and finally figured out what happened.
As I thought, Flask pushes a new request context onto the request context stack every time a new request comes down the line. This is used to support request-local globals, like the session.
Flask also has a notion of "application" context which is separate from request context. It's meant to support things like testing and CLI access, where HTTP isn't happening. I knew this, and I also knew that that's where Flask-SQLA puts its DB sessions.
During normal operation, both a request and an app context are pushed at the beginning of a request, and popped at the end.
However, it turns out that when pushing a request context, the request context checks whether there's an existing app context, and if one's present, it doesn't push a new one!
So if the app context isn't popped at the end of a request due to a teardown function raising, not only will it stick around forever, it won't even have a new app context pushed on top of it.
That also explains some magic I hadn't understood in our integration tests. You can INSERT some test data, then run some requests and those requests will be able to access that data despite you not committing. That's only possible since the request has a new request context, but is reusing the test application context, so it's reusing the existing DB connection. So this really is a feature, not a bug.
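The reuse can be seen in a few lines (a sketch using Flask's test helpers, valid for the Flask versions of that era): pushing a request context while an app context for the same app is already active does not push a second app context.
from flask import Flask, _app_ctx_stack

app = Flask(__name__)

with app.app_context():                  # an app context that never got popped
    outer = _app_ctx_stack.top
    with app.test_request_context('/'):  # a new request comes in
        # the request context found an existing app context and reused it
        assert _app_ctx_stack.top is outer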
That said, it does mean you have to be absolutely sure your teardown functions succeed, using something like the teardown-function wrapper below. That's a good idea even without that feature to avoid leaking memory and DB connections, but is especially important in light of these findings. I'll be submitting a PR to Flask's docs for this reason. (Here it is)
Edit 2014-12-11:
One thing we ended up putting in place was the following code (in our application factory), which wraps every teardown function to make sure it logs the exception and doesn't raise further. This ensures the app context always gets popped successfully. Obviously this has to go after you're sure all teardown functions have been registered.
from functools import wraps

# Flask specifies that teardown functions should not raise.
# However, they might not have their own error handling,
# so we wrap them here to log any errors and prevent errors from
# propagating.
def wrap_teardown_func(teardown_func):
    @wraps(teardown_func)
    def log_teardown_error(*args, **kwargs):
        try:
            teardown_func(*args, **kwargs)
        except Exception as exc:
            app.logger.exception(exc)
    return log_teardown_error

if app.teardown_request_funcs:
    for bp, func_list in app.teardown_request_funcs.items():
        for i, func in enumerate(func_list):
            app.teardown_request_funcs[bp][i] = wrap_teardown_func(func)
if app.teardown_appcontext_funcs:
    for i, func in enumerate(app.teardown_appcontext_funcs):
        app.teardown_appcontext_funcs[i] = wrap_teardown_func(func)
Edit 2014-09-19:
Ok, turns out --reload-on-exception isn't a good idea if 1.) you're using multiple threads and 2.) terminating a thread mid-request could cause trouble. I thought uWSGI would wait for all requests for that worker to finish, like uWSGI's "graceful reload" feature does, but it seems that's not the case. We started having problems where a thread would acquire a dogpile lock in Memcached, then get terminated when uWSGI reloads the worker due to an exception in a different thread, meaning the lock is never released.
Removing SQLALCHEMY_COMMIT_ON_TEARDOWN solved part of our problem, though we're still getting occasional errors during app teardown during session.remove(). It seems these are caused by SQLAlchemy issue 3043, which was fixed in version 0.9.5, so hopefully upgrading to 0.9.5 will allow us to rely on the app context teardown always working.
Original:
How this happened in the first place is still an open question, but I did find a way to prevent it: uWSGI's --reload-on-exception option.
Our Flask app's error handling ought to be catching just about anything, so it can serve a custom error response, which means only the most unexpected exceptions should make it all the way to uWSGI. So it makes sense to reload the whole app whenever that happens.
We'll also turn off SQLALCHEMY_COMMIT_ON_TEARDOWN, though we'll probably commit explicitly rather than writing our own callback for app teardown, since we're writing to the database so rarely.
A surprising thing is that there's no exception handling around that self.session.commit. And a commit can fail, for example if the connection to the DB is lost. So the commit fails, the session is not removed, and the next time that particular thread handles a request it still tries to use that now-invalid session.
Unfortunately, Flask-SQLAlchemy doesn't offer any clean way to supply your own teardown function. One way would be to set SQLALCHEMY_COMMIT_ON_TEARDOWN to False and then write your own teardown function.
It should look like this:
@app.teardown_appcontext
def shutdown_session(response_or_exc):
    try:
        if response_or_exc is None:
            sqla.session.commit()
    finally:
        sqla.session.remove()
    return response_or_exc
Now, you will still have your failing commits, and you'll have to investigate that separately... But at least your thread should recover.

Appengine Appstats giving deadlock

I have a Flask app on GAE and it is working correctly. I am trying to add Appstats support, but once I enable it, I get a deadlock.
This deadlock apparently happens when I try to set up a Werkzeug LocalProxy with the logged-in user's ndb model (it is called current_user, as is done in Flask-Login, to give you more details).
The error is:
RuntimeError: Deadlock waiting for <Future 104c02f50 created by get_async(key.py:545) for tasklet get(context.py:612) suspended generator get(context.py:645); pending>
The LocalProxy object is set up using this syntax (as per the Werkzeug docs):
current_user = LocalProxy(lambda: _get_user())
And _get_user() makes a simple synchronous ndb query.
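For completeness, a sketch of that setup (the User model and the session lookup are made up; the real code does something equivalent):
from google.appengine.ext import ndb
from werkzeug.local import LocalProxy


class User(ndb.Model):
    session_id = ndb.StringProperty()
    name = ndb.StringProperty()


def _current_session_id():
    # placeholder for however the real app finds the session id
    return 'some-session-id'


def _get_user():
    # simple synchronous ndb query for the logged-in user
    return User.query(User.session_id == _current_session_id()).get()


current_user = LocalProxy(lambda: _get_user())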
Thanks in advance for any help.
I ran into this issue today. In my case it seems to be that the request to get a user's details is triggering Appstats. Appstats then goes through the call stack and stores details of all the local variables in each stack frame.
The session itself is in one of these stack frames, so Appstats tries to print it out and triggers the user-fetching code again.
I came up with 2 "solutions", though neither of them is great.
Disable appstats altogether.
Disable logging of local variables in appstats.
I've gone for the latter at the moment. Appstats allows you to configure various settings in your appengine_config.py file. I was able to avoid logging of local variable details (which stops the code from triggering the bug) by adding this:
appstats_MAX_LOCALS = 0
