I have a Django application (API) running in production served by uWSGI, which has 8 processes (workers) running. To monitor them I use uwsgitop. Every day, from time to time, one worker falls into the BUSY state, stays there for about five minutes, consumes all of the memory, and kills the whole instance. The problem is that I do not know how to debug what the worker is doing at that particular moment or which function it is executing. Is there a fast and proper way to find out the function and the request that it is handling?
One can send signal SIGUSR2 to a uwsgi worker, and the current request is printed into the log file, along with a native (sadly not Python) backtrace.
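As a minimal sketch, the signal can be sent from Python as well as from a shell. The function name `request_worker_status` is hypothetical, and the worker PID is assumed to be the one uwsgitop shows for the BUSY worker.

```python
import os
import signal


def request_worker_status(worker_pid: int) -> None:
    """Send SIGUSR2 to a single process (here: one uWSGI worker).

    The worker PID is an assumption: take it from uwsgitop for the
    worker that is stuck in the BUSY state.
    """
    os.kill(worker_pid, signal.SIGUSR2)
```

After calling it with the worker's PID, look in the uWSGI log file for the request and backtrace.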
I have a process in which I need to assign long running tasks amongst a pool of workers, in python. So far I have been using RabbitMQ to queue the tasks (input is a nodejs frontend); a python worker subscribes to the queue, obtains a task and executes it. Each task takes several minutes minimum.
After an update this process started breaking, and I eventually discovered this was due to RabbitMQ version 3.6.10 having changed the way it handles timeouts. I now believe I need to rethink my method of assigning tasks, but I want to make sure I do it the right way.
Until now I only had one worker (the task is to control a sequence of actions in a VM - I couldn't afford a new Windows license for a while, so until recently I had no practical way of testing parallel task execution); I suspect if I'd had two before I would have noticed this sooner. The worker attaches to a VM using libvirt to control it. The way my code is written currently implies that I would run one instance of the script per VM that I wish to control.
I suspect that part of my problem is the use of BlockingConnection - I think I need a way for the worker to disconnect from the queue when it has received and validated a task (this part takes less than 1 sec), then reconnect once it has completed the actions, but I haven't figured out how to do this yet. Is this correct? If so, how should I do this, and if not, what should I do instead?
One other idea I've had is that instead of running a script per VM I could have a global control script that on receiving a task would spin off a thread which would handle the task. This would solve the problem of the connection timing out during task execution, but the timeout would just have moved to a different stage: I would potentially receive tasks while there were no idle VMs, and I would have to come up with a way to make the script await an available VM without breaking the RabbitMQ connection.
My current code can be seen here:
https://github.com/scherma/antfarm/blob/master/src/runmanager/runmanager.py#L342
Any thoughts folks?
Is there a good tutorial for how to properly write a backend/module for GAE to handle shutdowns?
This is the error I am getting:
2014-04-09 12:15:44.726 Process terminated because the backend took too long to shutdown.
I have a process that will take a few hours, and I know that I'll have to basically save the state into the memcache, and then restart it.
Are there tutorials for:
1) how to handle a shutdown request
2) how to save to memcache
3) how to restart a module
1) and 2) seem straightforward... restarting a module is something I'm unsure about. My module starts as a cron job, but is there a way to use a shutdown request to trigger another instance of my module to start?
If you have big jobs operating on a large amount of data, then you might look into map reduce.
Anyhow, you should break your large job down into smaller idempotent tasks. Idempotent basically means you can rerun a task and get the same results, with no additional side effects from the rerun.
Once you have smaller tasks you can choose to schedule them via Task Queue or use a map reduce framework.
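To illustrate the idea, here is a small sketch of splitting a job into idempotent tasks. The names `split_into_tasks` and `run_task` are hypothetical, and the `results` dict only stands in for whatever durable store you use. It is not Task Queue or map reduce code.

```python
from typing import Dict, List, Tuple


def split_into_tasks(items: List[int], chunk_size: int) -> List[Tuple[int, List[int]]]:
    """Break one large job into (task_id, chunk) pairs."""
    return [
        (start // chunk_size, items[start:start + chunk_size])
        for start in range(0, len(items), chunk_size)
    ]


def run_task(task_id: int, chunk: List[int], results: Dict[int, int]) -> None:
    """Idempotent task: the result is stored under task_id, so a rerun
    overwrites the same entry instead of adding to it."""
    results[task_id] = sum(chunk)
```

Because each task writes to a key derived from its own id, a task that is retried after a shutdown leaves the results unchanged.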
Things to note:
You are not guaranteed to get a shutdown callback. Our backend instances are restarted every day and our shutdown hooks do not get called.
Memcache is not reliable. Do not treat it as permanent storage.
I'm trying to figure out the best way to keep a zeroMQ listener running forever in my django app.
I'm setting up a zmq server app in my Django project that acts as internal API to other applications in our network (no need to go through http/requests stuff since these apps are internal). I want the zmq listener inside of my django project to always be alive.
I want the zmq listener in my Django project so I have access to all of the project's models (for querying) and other django context things.
I'm currently thinking:
Set up a Django management command that will run the listener and keep it alive forever (aka infinite loop inside the zmq listener code) or
use a celery worker to always keep the zmq listener alive? But I'm not exactly sure how to get a celery worker to restart a task only if it's not running. All the celery docs are about frequency/delayed running. Or maybe I should let celery purge the task at a given interval & restart it anyway.
Any tips, advice on performance implications or alternate approaches?
Setting up a management command is a fine way to do this, especially if you're running on your own hardware.
If you're running in a cloud, where a machine may disappear along with your process, then the latter is a better option. This is how I've done it:
Set up a periodic task that runs every N seconds (you need celerybeat running somewhere)
When the task spawns, it first checks a shared network resource (redis, zookeeper, or a db), to see if another process has an active/valid lease. If one exists, abort.
If there's no valid lease, obtain your lease (beware of concurrency here!), and start your infinite loop, making sure you periodically renew the lease.
Add instrumentation so that you know which process is running the listener, and where.
Start celery workers on multiple boxes, consuming from the same queue your periodic task is designated for.
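The lease steps above can be sketched roughly as follows. This is only an illustration: `LeaseStore` is a hypothetical in-memory stand-in for the shared resource (redis, zookeeper, or a db), and `run_listener` stands in for the body of the periodic task. A real store has to make `acquire` atomic across processes, which this sketch does not attempt.

```python
import time
from typing import Callable, Optional, Tuple


class LeaseStore:
    """In-memory stand-in for a shared store; NOT safe across processes."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._lease: Optional[Tuple[str, float]] = None  # (owner, expiry time)

    def acquire(self, owner: str, ttl: float) -> bool:
        """Take or renew the lease; fail if another owner holds a valid one."""
        now = self._clock()
        if self._lease is not None:
            holder, expires_at = self._lease
            if holder != owner and expires_at > now:
                return False
        self._lease = (owner, now + ttl)
        return True


def run_listener(
    store: LeaseStore,
    owner: str,
    ttl: float,
    handle_one: Callable[[], None],
    max_iterations: Optional[int] = None,
) -> int:
    """Abort if someone else holds the lease; otherwise loop and keep renewing.

    max_iterations exists only so the sketch can be exercised; the real
    listener would loop forever.
    """
    if not store.acquire(owner, ttl):
        return 0
    handled = 0
    while max_iterations is None or handled < max_iterations:
        handle_one()  # e.g. receive and process one zmq message
        handled += 1
        if not store.acquire(owner, ttl):  # renew; stop if the lease was lost
            break
    return handled
```

The ttl should be comfortably longer than one pass through the loop, so that a healthy listener never lets its own lease expire.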
The second solution is more complex and harder to get right, so if you can, stick with a singleton process and consider using something like supervisord to ensure the process gets restarted if it faults for some reason.
Let's say I have 100 servers each running a daemon (let's call it server). That server is responsible for spawning a thread for each user of this particular service (let's say 1000 threads per server). Every N seconds each thread does something and gets information for that particular user (this request/response model cannot be changed). The problem I have is that sometimes a thread hangs and stops doing anything. I need some way to know that the user's data is stale and needs to be refreshed.
The only idea I have is to have each thread update a MySQL record associated with that user every 5N seconds (a last_scanned column in the users table), and to have another process check that table every 15N seconds; if the last_scanned column is not current, restart the thread.
The general way to handle this is to have the threads report their status back to the server daemon. If you haven't seen a status update within the last 5N seconds, then you kill the thread and start another.
You can keep track of the current active threads that you've spun up in a list, then just loop through them occasionally to determine state.
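A rough sketch of that status reporting, assuming each worker thread calls `beat()` after every successful cycle and the daemon periodically calls `stale()`. The class and method names are hypothetical.

```python
import threading
import time
from typing import Callable, Dict, List


class HeartbeatRegistry:
    """Records when each user's thread last reported in."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._lock = threading.Lock()
        self._last_seen: Dict[str, float] = {}

    def beat(self, user_id: str) -> None:
        """Called by a worker thread after each successful cycle."""
        with self._lock:
            self._last_seen[user_id] = self._clock()

    def stale(self, max_age: float) -> List[str]:
        """Return users whose thread has not reported within max_age seconds."""
        now = self._clock()
        with self._lock:
            return sorted(
                user for user, seen in self._last_seen.items()
                if now - seen > max_age
            )
```

With a cycle of N seconds, the daemon would call `registry.stale(5 * N)` and replace the thread for each user it returns. Note that Python offers no supported way to kill a thread from outside, so in Python the hung thread has to be abandoned or made to time out on its own.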
You of course should also fix the errors in your program that are causing threads to exit prematurely.
Premature exits and killing a thread could also leave your program in an unexpected, non-atomic state. You should probably also have the server daemon run a cleanup process that makes sure any items in your queue, or whatever you're using to determine the workload, get reset after a certain period of inactivity.
I'm running a Django application on Apache with mod_wsgi. Will there be any downtime during an upgrade?
Mod_wsgi is running in daemon mode, so I can reload my code by touching the .wsgi script file, as described in the "ReloadingSourceCode" document: http://code.google.com/p/modwsgi/wiki/ReloadingSourceCode. Presumably, that reload requires some non-zero amount of time. What happens if a request comes in during the reload? Will Apache queue the request and then complete it once the wsgi daemon is ready?
The documentation includes the following statement:
So, if you are using Django in daemon mode and needed to change your 'settings.py' file, once you have made the required change, also touch the script file containing the WSGI application entry point. Having done that, on the next request the process will be restarted and your Django application reloaded.
To me, that suggests that Apache will gracefully handle every request, but I thought I would ask to be sure. My app isn't critical (a little downtime wouldn't be disastrous) so the question is mostly academic.
Thank you.
In daemon mode there is no concept of a graceful restart when the WSGI script file is touched to force a reload. That is, unlike Apache itself, which will start new Apache server child processes while waiting for old processes to finish up with current requests, for mod_wsgi daemon processes, the existing process must exit before a new one starts up.
The consequences of this are that mod_wsgi can't wait indefinitely for current requests to complete. If it did, then there is a risk that if all daemon processes are tied up waiting for current requests to finish, that clients would see a noticeable delay in being handled.
At the other end of the scale however, the daemon process can't be immediately killed as that would cause current requests to be interrupted.
A middle ground therefore exists. The daemon process will wait for requests to finish before exiting, but if they haven't completed within the shutdown period, then the daemon process will be forcibly quit and the active requests will be interrupted.
The period of this shutdown timeout defaults to 5 seconds. It can be overridden using the shutdown-timeout option to the WSGIDaemonProcess directive, but due consideration should be given to the effects of changing it.
Thus, in respect of this specific issue, if you have long running requests still active when the first request comes in after you touched the WSGI script file, there is the risk that the active long requests will be interrupted.
The next notable thing you may see is that even if there are no long running requests and processes shut down promptly, it is still necessary to load up the WSGI application again within the new process. The time this takes will be seen as a delay in handling the request. How big that delay is will depend on the framework and your application. The worst offender I know of for startup time is TurboGears. Django is somewhat better, and the quickest to start are lightweight micro frameworks such as Flask.
Do note that any new requests which come in while these shutdown and startup delays occur should not be lost. This is because the HTTP listener socket has a certain depth and connections queue up in that waiting to be accepted. If the number of requests arriving is huge though and that queue fills up, then you will start to see connection refused errors in the browser.
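For reference, touching the WSGI script file only means updating its modification time. A deploy script could do that with something like the sketch below. The function name is hypothetical and the path is whatever your WSGI script file actually is.

```python
from pathlib import Path


def touch_wsgi_script(script_path: str) -> None:
    """Update the file's modification time without changing its contents."""
    Path(script_path).touch(exist_ok=True)
```

The restart then happens on the next request that arrives, as described above.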
No, there will be no downtime. Requests using the old code will complete, and new requests will use the new code.
There will be a small bit more load on the server as the new code loads but unless your application is colossal and your servers are already nearly overloaded this will be unnoticeable.
This is like the apachectl graceful command for Apache as a whole, which tells it to start a new configuration without downtime.