Investigating python process to see what's eating CPU - python

I have a Python process (a Pylons webapp) that is constantly using 10-30% of CPU. I'll improve/tune logging to get some insight into what's going on, but until then, are there any tools/techniques that let me see what a Python process is doing, how many threads it has, how busy they are, etc.?
Update:
configured the access log, which shows that there are no requests going on; the webapp is just idling
no point plugging paste.profile into the middleware chain since there are no requests; the activity must be happening either in the webapp's worker threads or in the paster web server
running paster like this: "python -m cProfile -o outfile /usr/bin/paster serve dev.ini" and inspecting the results shows that most time is spent in "posix.waitpid". Paster runs the webapp in a subprocess, and subprocess activity is not picked up by the profiler
looking into hacking the PasteScript "serve" command so that subprocesses get profiled too
Another update:
After much tinkering, sticking the profiler in various places and getting familiar with PasteScript's internals, I discovered that the constant CPU load goes away if the application is started without the "--reload" parameter (this flag tells paster to restart itself if code changes, which is handy in development), and that is fine for a production environment.

Profiling might help you learn a bit about what it's doing. If you sort the output by "time" you will see which functions are chewing up CPU time, which should give you some good hints.
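For example, the dump produced by the cProfile invocation above can be loaded with pstats and sorted by internal time (the filename "outfile" matches the command in the question; adjust it to whatever you used):

    import pstats

    # "outfile" was produced by: python -m cProfile -o outfile /usr/bin/paster serve dev.ini
    stats = pstats.Stats("outfile")
    stats.strip_dirs().sort_stats("time").print_stats(20)  # top 20 functions by own CPU time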

As you noted, in --reload mode, Paste sweeps the filesystem every second to see if any of the files loaded have changed. If they have, then Paste reloads the process. You can also manually tell Paste to monitor non-Python code modules for changes if desired.
You can change the reload interval with the --reload-interval option; this will reduce the CPU usage when using --reload, since it will sweep less often.

Related

Python web based interpreter security issues

I am making a web-based Python interpreter which will take code, execute it on a Linux-based python3 interpreter, and show the output on the same web page. But this has some serious loopholes: for example, someone can execute bash scripts using Python's os module, inspect directories for the source code of the web application, and a lot more.
Can anyone suggest how to prevent this kind of mishap in my application?
Regards
Short answer: there is no easy "python-only" solution for this.
Some details:
a user can always try to call os, sys, with open(SENSITIVE_PATH, 'rw') as f: ..., etc., and it's hard to detect all those cases simply by analyzing the code
If you allow ANY third-party packages, things become even more complicated; for example, some third-party package may locally create an alias to os.execv (os_ex = os.execv), and after that it becomes possible to write a script like from thirdparty.some_internals import os_ex; os_ex(...).
The more or less reliable solution is to use an "external sandboxing" solution:
Run the interpreter in an unprivileged Docker container. For example:
write the untrusted script to a file that is exposed to the container through a Docker volume
execute that script in Docker (a minimal sketch follows after this list):
a. subprocess.call(['docker', 'exec', 'CONTAINER_ID', '/usr/bin/python', 'PATH_TO_SCRIPT'])
b. subprocess.call(['docker', 'exec', 'CONTAINER_ID', '/usr/bin/python', '-c', UNTRUSTED_SCRIPT_TEXT])
Use PyPy's sandbox.
Search for a "secure" IPython kernel for the Jupyter notebook server, or write your own. Note: existing kernels are not guaranteed to be secure and may still allow calls to subprocess.check_output, os.remove and others, so even with the "default kernel" it's better to run the Jupyter server in an isolated environment.
Run the interpreter in a chroot as an unprivileged user. Different implementations offer different levels of "safety".
Use Jython with finely tuned permissions.
Some exotic solutions like client-side JS Python implementations: Brython, pyjs
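As a minimal sketch of the Docker variant above (the container name and the timeout are assumptions, and the container is expected to be already running and unprivileged):

    import subprocess

    CONTAINER = "py-sandbox"          # hypothetical name of an already-running, unprivileged container
    UNTRUSTED_CODE = "print(2 + 2)"   # code received from the web page

    # Run the untrusted code inside the container; the host only sees the captured stdout/stderr.
    result = subprocess.run(
        ["docker", "exec", CONTAINER, "/usr/bin/python3", "-c", UNTRUSTED_CODE],
        capture_output=True, text=True, timeout=10,
    )
    print(result.stdout or result.stderr)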
In any case, even if you manage to implement or reuse an existing "sandbox", you will still face many potential problems:
If multiprocessing or multithreading is allowed, you might want to monitor how CPU resources are used, because some scripts will want to use EVERYTHING. Even with the GIL it's possible for multithreading to saturate all cores: all the user has to do is call functions backed by C libraries from several threads (see the sketch below).
You might want to monitor memory usage, because some scripts might leak or simply use a lot of memory.
Other candidates for monitoring: disk I/O, network usage, open file descriptors, execution time, etc.
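A rough sketch of such monitoring with psutil (the pid and the limits are placeholders; in practice you would watch the sandboxed interpreter process in a loop):

    import psutil

    SANDBOX_PID = 12345                   # hypothetical pid of the process running the untrusted script
    proc = psutil.Process(SANDBOX_PID)

    cpu = proc.cpu_percent(interval=1.0)  # % of one core over a 1-second window
    rss = proc.memory_info().rss          # resident memory in bytes
    fds = proc.num_fds()                  # open file descriptors (POSIX only)

    if cpu > 90 or rss > 512 * 1024 * 1024 or fds > 256:
        proc.kill()                       # enforce the limit by terminating the run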
Also, you should always watch for security updates to your "sandboxing" solution, because even Docker is occasionally vulnerable in ways that make it possible to execute code on the host machine.
Recommended read: https://softwareengineering.stackexchange.com/questions/191623/best-practices-for-execution-of-untrusted-code

what is a robust way to execute long-running tasks/batches under Django?

I have a Django app that is intended to be run on Virtualbox VMs on LANs. The basic user will be a savvy IT end-user, not a sysadmin.
Part of that app's job is to connect to external databases on the LAN, run some python batches against those databases and save the results in its local db. The user can then explore the systems using Django pages.
Run time for the batches isn't all that long, but it runs to minutes, potentially tens of minutes, not seconds. Refreshes are infrequent at best; I think you could go days without needing one.
This is not celery's normal use case of long tasks which will eventually push the results back into the web UI via ajax and/or polling. It is more similar to a dev's occasional use of the django-admin commands, but this time intended for an end user.
The user should be able to initiate a run of one or several of those batches when they want in order to refresh the calculations of a given external database (the target db is a parameter to the batch).
Until the batches are done for a given db, the app really isn't useable. You can access its pages, but many functions won't be available.
It is very important, from a support point of view, that the batches remain easily runnable at all times. Dropping down to SSH on the VM would probably require frequent handholding, which wouldn't be good - it is best if they can be launched from the Django web pages.
What I currently have:
Each batch is in its own script.
I can run it on the command line (via if __name__ == "__main__":).
The batches are also hooked up as celery tasks and work fine that way.
Given the way I have written them, it would be relatively easy for me to allow running them from subprocess calls in Python. I haven't really looked into it, but I suppose I could make them into django-admin commands as well.
The batches already have their own rudimentary status checks. For example, they can look at the calculated data and tell whether they have been run and display that in Django pages without needing to look at celery task status backends.
The batches themselves are relatively robust and I can make them more so. This is about their launch mechanism.
What's not so great:
In my Mac dev environment I find the celery/celerycam/rabbitmq stack to be somewhat unstable. It seems as if the rabbitmq daemon sometimes balloons in CPU/RAM use and then needs to be terminated. That mightily confuses the celery processes and I find I have to kill -9 various tasks and relaunch them manually. Sometimes celery still works but celerycam doesn't, so there are no task updates. Some of these issues may be OSX-specific or may be due to the DEBUG flag being switched on for now, which celery warns about.
So then I need to run the batches on the command line, which is what I was trying to avoid, until the whole celery stack has been reset.
This might be acceptable on a normal website, with an admin watching over it. But I can't have that happen on a remote VM to which only the user has access.
Given that these are somewhat fire-and-forget batches, I am wondering if celery isn't overkill at this point.
Some options I have thought about:
writing a cleanup shell/Python script to restart rabbitmq/celery/celerycam and generally make the stack more robust, i.e. whatever is required to make celery & co. more stable. I've already used psutil to figure out whether the rabbit/celery processes are running and display their status in Django.
running the batches via subprocess instead and avoiding celery. What about django-admin commands here? Does that make a difference? They would still need to be run from the web pages.
an alternative task/process manager to celery, with less capability but also fewer moving parts?
not using subprocess but relying on Python's multiprocessing module? To be honest, I have no idea how that compares to launching via subprocess.
environment:
nginx, wsgi, ubuntu on virtualbox, chef to build VMs.
I'm not sure how your celery configuration makes it unstable, but it sounds like celery is still the best fit for your problem. I'm using redis as the queue system and, in my experience, it works better than rabbitmq. Maybe you can try it and see if it improves things.
Otherwise, just use cron as a driver to run periodic tasks. Let it run your script periodically and update the database; your UI components will poll the database with no conflict.
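As a minimal sketch of the cron route (the command name, module path and batch function are hypothetical, not taken from the question), each batch can be wrapped in a management command, so the same entry point works from cron, from the shell, or from a subprocess call in a view:

    # myapp/management/commands/run_batch.py  (hypothetical path)
    from django.core.management.base import BaseCommand

    from myapp.batches import refresh_external_db   # hypothetical batch function


    class Command(BaseCommand):
        help = "Refresh calculations for one external database"

        def add_arguments(self, parser):
            parser.add_argument("target_db", help="name of the external database to refresh")

        def handle(self, *args, **options):
            refresh_external_db(options["target_db"])
            self.stdout.write(self.style.SUCCESS("batch finished"))

cron (or a button in the UI that shells out via subprocess) would then simply run python manage.py run_batch <target_db>.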

Pyramid (waitress) behind IIS randomly stops working when using APScheduler

I'm running a Pyramid application on Windows using Waitress as the app server and IIS as the Web Server (proxy). When I run the application, it works for a (seemingly) random amount of time before it just stops. It can go for days, even weeks at a time and then just stop, leaving IIS to throw a 502 error. When it stops, there's no way to restart it short of restarting Windows.
It's a small application which uses APScheduler to hit a couple of APIs to sync inventory between eBay/Amazon. I'm not entirely sure what is causing this problem, as there is no error shown in the logs. I had an older version of the application running (without APScheduler) and I didn't have this issue, so I'm assuming it has to do with APScheduler.
Has anybody else experienced this?
First let my say that I have not used APScheduler myself yet, and have next to no knowledge about running Python servers (or anything else really) on Windows and IIS.
So I can only guess here, but it seems clear that your problem is related to APScheduler in some way. I could imagine that something goes wrong in a thread that APScheduler uses for your background tasks, and the hanging thread brings down your whole application, due to the GIL (global interpreter lock) in Python. This could happen, for example, when your thread(s) run into some kind of race condition.
Maybe it happens that processing starts before the previous iteration has finished processing. Or you get a really big backlog, and that leads to problems when processing starts.
Anyway, I think task queues are much better suited for background processing in web applications, because they run separately and out of the context of your web server.
You can schedule a task as soon as it's triggered by some user action, and it will be processed as soon as a worker is available, and not be deferred until a certain point in time.
I would recommend giving Celery a try, but there are also other solutions available, many based on Redis.
Celery is really powerful, and has advanced features like periodic tasks and crontab style schedules - so you can probably use it to do what you are doing using APScheduler now.
This looked very helpful for setting up Celery under Windows: http://mrtn.me/blog/2012/07/04/django-on-windows-run-celery-as-a-windows-service/
This may also prove helpful:
How to create Celery Windows Service?
Note: I haven't tried any of these myself, since I use Linux if I have a choice.
It's probably also possible to get APScheduler to work correctly, but I think it will be much easier to use Celery, because if a problem occurs in a worker you will be able to debug it much more easily than a problem occurring in a background thread. Celery can also be configured to automatically send you an email in case of an error.
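For illustration, a periodic task roughly equivalent to an APScheduler job could look like this with a recent Celery (the broker URL, schedule and task body are assumptions, not details from the question):

    # tasks.py  (the task name below assumes this module name)
    from celery import Celery
    from celery.schedules import crontab

    app = Celery("inventory", broker="redis://localhost:6379/0")   # assumed broker

    @app.task
    def sync_inventory():
        # placeholder for the eBay/Amazon sync logic
        pass

    # Let celery beat trigger the sync every 15 minutes instead of an in-process scheduler.
    app.conf.beat_schedule = {
        "sync-inventory-every-15-min": {
            "task": "tasks.sync_inventory",
            "schedule": crontab(minute="*/15"),
        },
    }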

Celery tasks profiling

As I can see in the top utility, the celery processes consume a lot of CPU time. So I want to profile them.
I can do it manually on developer machine like so:
python -m cProfile -o test-`date +%Y-%m-%d-%T`.prof ./manage.py celeryd -B
But to have accurate timings I need to profile it on production machine. On that machine (Fedora 14) celery is launched by init scripts. E.g.
service celeryd start
I have figured out that these scripts eventually call manage.py celeryd_multi. So my question is: how can I tell celeryd_multi to start celery with profiling enabled? In my case this means adding the -m cProfile -o out.prof options to python.
Any help is much appreciated.
I think you're confusing two separate issues. You could be processing too many individual tasks or an individual task could be inefficient.
You may know which of these is the problem, but it's not clear from your question which it is.
To track how many tasks are being processed I suggest you look at celerymon. If a particular task appears more often than you would expect, then you can investigate where it is getting called from.
Profiling the whole of celery is probably not helpful as you'll get lots of code that you have no control over. As you say it also means you have a problem running it in production. I suggest you look at adding the profiling code directly into your task definition.
You can use cProfile.run('func()') as a layer of indirection between celery and your code so that each run of the task is profiled. If you generate a unique filename and pass it as the second parameter to run, you'll have a directory full of profile data that you can inspect on a task-by-task basis, or use pstats.add to combine multiple task runs together.
Finally, per-task profiling means you can also turn profiling on or off using a setting in your project code either globally or by task, rather than needing to modify the init scripts on your server.
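A small sketch of that idea as a decorator (the output directory is my own choice, and it uses Profile.runcall rather than the string form of cProfile.run so arguments can be passed through):

    import cProfile
    import os
    import time
    from functools import wraps

    PROFILE_DIR = "/tmp/celery-profiles"   # assumed output location


    def profiled(task_func):
        """Profile each call of the wrapped task and dump stats to a unique file."""
        @wraps(task_func)
        def wrapper(*args, **kwargs):
            os.makedirs(PROFILE_DIR, exist_ok=True)
            outfile = os.path.join(PROFILE_DIR, "%s-%d.prof" % (task_func.__name__, time.time()))
            profiler = cProfile.Profile()
            try:
                return profiler.runcall(task_func, *args, **kwargs)
            finally:
                profiler.dump_stats(outfile)
        return wrapper

Apply it beneath the task decorator, then combine the per-run files later with pstats.Stats(first_file).add(other_file) for an aggregate view.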

Configuring python

I am new to Python and struggling to find out how to control the amount of memory a Python process can use. I am running Python on a CentOS machine with more than 2 GB of main memory. Python is taking up only 128 MB of this and I want to allocate it more. I tried searching all over the internet for the last half hour and found absolutely nothing! Why is it so difficult to find information on Python-related stuff :(
I would be happy if someone could throw some light on how to configure Python for various things like allowed memory size, number of threads, etc.
A link to a site where most of Python's controllable parameters are described would be much appreciated.
Forget all that: Python just allocates more memory as needed; there isn't a myriad of command-line arguments for the VM as in Java. Just let it run. For all command-line switches you can just run python -h or read man python.
Are you sure that the machine does not have a 128 MB process limit? If you are running the Python script as a CGI inside a web server, it is quite likely that there is a process limit set - you will need to look at the web server configuration.
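One quick way to check for such a limit from inside the script is the resource module (POSIX only); this is a diagnostic sketch, not a Python tuning knob:

    import resource

    # RLIMIT_AS is the address-space limit for this process; RLIM_INFINITY means unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print("address space limit: soft=%s hard=%s" % (soft, hard))

    soft, hard = resource.getrlimit(resource.RLIMIT_DATA)
    print("data segment limit:  soft=%s hard=%s" % (soft, hard))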
