Tracing Memory Leaks in Python using Dowser - python

I am running some tests nightly on a VM with a CentOS operating system. Recently the tests have been taking up all the available memory and nearly all the swap memory on the machine. I assigned the VM twice as much memory and it is still happening, which results in the physical host machine of the VM dying. These tests previously ran without needing half as much memory, so I need to use some form of Python memory analyzer to investigate what is going on.
I've looked at Pysizer and Heapy, but after some research Dowser seems to be the one I'm after, as it requires zero changes to code.
So far, from the documentation and googling, I've got this code in its own class:
import cherrypy
import dowser

class MemoryAnalyzer:
    def memoryCheck(self):
        cherrypy.config.update({'server.socket_port': 8080})
        cherrypy.tree.mount(dowser.Root())
        cherrypy.engine.start()
I was hoping this would bring up the web interface shown in the documentation to track all instances of Python running on the host, but it doesn't work. I was also confused by this line in the documentation:
'python dowser __init__.py'.
Is it possible to just run this? I get the error:
/usr/bin/python: can't find '__main__.py' in 'dowser'
Can Dowser run independently from my test suite on the VM, or will I have to add the code above to the main class that runs my tests in order to trace the Python instances?

Dowser is meant to be run as part of your application. Therefore, wherever you initialize the application, add the lines
import dowser
cherrypy.tree.mount(dowser.Root(), '/dowser')
Then you can browse to http://localhost:8080/dowser to view the dowser interface.
Note that the invocation you quoted from the documentation is for testing dowser. The correct invocation for that is python dowser/__init__.py.
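For reference, a minimal standalone sketch of that setup (assuming CherryPy and Dowser are installed and port 8080 is free) could look like this:

import cherrypy
import dowser

# Mount the Dowser UI at /dowser and serve it on port 8080 (assumed free).
cherrypy.config.update({'server.socket_port': 8080})
cherrypy.tree.mount(dowser.Root(), '/dowser')
cherrypy.engine.start()
# Keep the process alive so the Dowser interface stays reachable.
cherrypy.engine.block()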

Managed to get Dowser to work using this blog post http://www.aminus.org/blogs/index.php/2008/06/11/tracking-memory-leaks-with-dowser?blog=2 and by changing the port to 8088 instead of 8080 (which wasn't in use on the machine, but still didn't work!).

Related

Python web based interpreter security issues

I am making a web-based Python interpreter that takes code, executes it on a Linux-based Python 3 interpreter, and shows the output on the same web page. But this has some serious loopholes: for example, someone can execute a bash script using Python's os module, read the directory containing the source code of the web application, and a lot more.
Can anyone suggest how to prevent this kind of mishap in my application?
Regards
Short answer: there is no easy "python-only" solution for this.
Some details:
the user can always try to call os, sys, with open(SENSITIVE_PATH, 'rw') as f: ..., etc., and it's hard to detect all those cases simply by analyzing the code
If you allow ANY third-party packages, things become even more complicated: for example, some third-party package may locally create an alias to os.execv (os_ex = os.execv), after which it becomes possible to write a script like from thirdparty.some_internals import os_ex; os_ex(...).
The more or less reliable solution is to use "external sandboxing" solutions:
Run the interpreter in an unprivileged Docker container (a minimal sketch follows this list). For example:
write the untrusted script to a file that is exposed through a volume in the Docker container
execute that script in Docker:
a. subprocess.call(['docker', 'exec', 'CONTAINER_ID', '/usr/bin/python', 'PATH_TO_SCRIPT'])
b. subprocess.call(['docker', 'exec', 'CONTAINER_ID', '/usr/bin/python', '-c', UNTRUSTED_SCRIPT_TEXT])
Use PyPy's sandbox.
Search for a "secure" IPython kernel for the Jupyter notebook server, or write your own. Note: existing kernels are not guaranteed to be secure and may allow calls to subprocess.check_output, os.remove and others. So even for the "default kernel" it's better to run the Jupyter server in an isolated environment.
Run the interpreter in a chroot as an unprivileged user. Different implementations offer different levels of "safety".
Use Jython with finely tuned permissions.
Some exotic solutions like "client-side JS python implementation": brython, pyjs
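As an illustration of the Docker option above, a minimal sketch might look like the following. The image name, resource limits, timeout and user are assumptions for illustration only, not a hardened setup:

import subprocess

UNTRUSTED_SCRIPT_TEXT = "print('hello from the sandbox')"

# Run the untrusted code in a throwaway, unprivileged container with no
# network access and hard CPU/memory limits (values are illustrative).
result = subprocess.run(
    [
        'docker', 'run', '--rm',
        '--network', 'none',   # no network access from inside the container
        '--memory', '128m',    # cap memory usage
        '--cpus', '0.5',       # cap CPU usage
        '--user', 'nobody',    # run as an unprivileged user inside the container
        'python:3-alpine',
        'python', '-c', UNTRUSTED_SCRIPT_TEXT,
    ],
    capture_output=True,
    text=True,
    timeout=10,                # kill runaway scripts after 10 seconds
)
print(result.stdout, result.stderr)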
In any case, even if you manage to implement or reuse an existing "sandbox", you will still face many potential problems:
If multiprocessing or multithreading is allowed, then you might want to monitor how CPU resources are utilized, because some scripts might want to use EVERYTHING. Even with the GIL it's possible for multithreading to utilize all cores (all the user has to do is call functions that use C libraries from the threads).
You might want to monitor memory usage, because some scripts might leak or simply use a lot of memory
Other candidates for monitoring: disk I/O usage, network usage, open file descriptor usage, execution time, etc. (a rough sketch of host-side resource limits follows this list)
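For a locally spawned (non-Docker) interpreter, one simple way to enforce some of these limits on the host side is the standard resource module. A rough sketch, with arbitrary limit values and a hypothetical untrusted snippet; note that this only limits resources, it is not a security sandbox by itself:

import resource
import subprocess

UNTRUSTED_SCRIPT_TEXT = "print('hello')"

def limit_resources():
    # Runs in the child process just before exec: cap CPU seconds and memory.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                      # 5 s of CPU time
    resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))   # 256 MB address space

subprocess.run(
    ['python3', '-c', UNTRUSTED_SCRIPT_TEXT],
    preexec_fn=limit_resources,   # POSIX only
    timeout=10,                   # wall-clock limit
)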
Also, you should always check for security updates to your "sandboxing solution", because even Docker is sometimes vulnerable and can make it possible to execute code on the host machine.
Recommended read: https://softwareengineering.stackexchange.com/questions/191623/best-practices-for-execution-of-untrusted-code

Suppressing multi-threading in used libraries?

EDIT:
I ended up using a workaround to get the behaviour I wanted.
Disabling threading in the SSHTunnel as suggested in the accepted answer helped me pin down the problem.
I have a Python project that does a few things, mostly ETL.
It works fine when I run it locally, and it works fine when I stuff it into a Docker container and run that locally, but it deadlocks about 80% of the way through when I run that Docker container in the cloud.
When I manually kill the process I get the error linked below, suggesting it is a threading issue. I'm not explicitly using threading anywhere in my code (and am no expert on the subject), so I assume it's one of the libraries I'm using that employs threading internally.
The idea I had to resolve this problem is to somehow suppress all threading that is happening in the function calls of the libraries I use.
Is there a catch-all way to do that in Python?
The steps of the program include moving PostgreSQL data into Google BigQuery, then fetching data from BigQuery (including the new data), creating an Excel report out of that data, and emailing it out.
Pandas data frames are used for the internal representation and for easy upload to GBQ using the to_gbq method.
sqlalchemy and sshtunnel are used to extract data from the PostgreSQL database.
Openpyxl is used for the Excel editing.
The whole thing takes less than a minute to run locally (either in- or outside of a docker container) and manually calling each of the steps separately on the server also works fine.
(The referenced cloud deployment is on a Google Cloud VM instance)
I can't think of any way to globally disable threading; at least not without breaking every piece of code that would use it.
Judging by the traceback, I assume you are using SSHTunnelForwarder from the sshtunnel package. This class takes a boolean argument threaded with True as a default value.
Instantiating SSHTunnelForwarder with threaded=False will disable the use of the _ThreadingForwardServer in favor of the _ForwardServer. This forward server is not using the socketserver.ThreadingMixIn, which is where your block seems to be surfacing. So, that should fix your problem.
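For illustration, a minimal sketch of that change (host, credentials, and bind addresses are placeholders, not your actual configuration):

from sshtunnel import SSHTunnelForwarder

# Placeholder connection details; replace with your own.
tunnel = SSHTunnelForwarder(
    ('ssh.example.com', 22),
    ssh_username='etl_user',
    ssh_password='secret',
    remote_bind_address=('127.0.0.1', 5432),
    threaded=False,  # use the non-threading _ForwardServer instead of _ThreadingForwardServer
)
tunnel.start()
# ... run the ETL steps against ('127.0.0.1', tunnel.local_bind_port) ...
tunnel.stop()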
However, I'd be curious to know why your project blocks in the cloud context. Judging by the output in your screenshot, the whole thing seems to be almost complete and just hangs when shutting down the tunnel forwarder. The maintainers of the sshtunnel package surely made the use of threading a default for a reason. I'd want to stick to that default if in any way possible, but that's just me :)

Slow page loading on apache when using Flask

The Issue
I am using my laptop with Apache to act as a server for a local project involving TensorFlow and Python, which uses an API written in Flask to service GET and POST requests coming from an app and maybe another user on the local network. The problem is that the initial page keeps loading whenever I import tensorflow or the object detection folder within the research folder of the tensorflow GitHub repository, and it never seems to finish doing so, effectively getting stuck. I suspect the issue has to do with the packages being large, but I didn't have any issue with that when running the application on the development server provided with Flask.
Are there any pointers that I should look for when trying to solve this issue? I checked the memory usage, and it doesn't seem to be rising substantially, as well as the CPU usage.
Debugging process
I am able to print a basic hello world to the root page quite quickly, but I isolated the issue to the point where the import takes place, which is where it gets stuck.
The only thing I can think of is to limit the number of threads that are launched, but when I limited the number of threads per child to 5 and number of connections to 5 in the httpd-mpm.conf file, it didn't help.
The error/access logs don't provide much insight to the matter.
A few notes:
Thus far, I have used Flask's development server with multi-threading enabled to serve those requests, but I found it prone to crashing after 5 minutes of continuous running, so I am now trying to use Apache via the WSGI interface in order to run the Python scripts.
I should also note that I am not serving HTML files, just basic GET and POST requests. I am just viewing them using the browser.
If it helps, I also don't use virtual environments.
I am using Windows 10, Apache 2.4 and mod_wsgi 4.5.24
The tensorflow module, being a C extension module, may not be implemented in a way that works properly in Python sub-interpreters. To work around this, force your application to run in the main Python interpreter context. Details in:
http://modwsgi.readthedocs.io/en/develop/user-guides/application-issues.html#python-simplified-gil-state-api
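In practice that usually means setting WSGIApplicationGroup %{GLOBAL} for the application in the Apache configuration; a minimal sketch (the script path is a placeholder):

WSGIScriptAlias / "C:/path/to/app.wsgi"
# Force the application to run in the main (first) interpreter context
# rather than a sub-interpreter.
WSGIApplicationGroup %{GLOBAL}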

Forbid Python from writing anything to disk

Are there any command-line options or configurations that forbids Python from writing to disk?
I know I can hack open but it doesn't sound very safe.
I've hosted some Python tutorials I wrote myself on my website for friends who want to learn Python, and I want them to have access to a Python console so they can try as they learn. This is done by creating a Python subprocess from the http server.
However, I do not want them to accidentally or intentionally damage my server, so I need to forbid the Python process from writing anything to disk.
Also, I'm running the server on Ubuntu Linux, so either a Python-level or a system-level solution is fine.
I doubt there's a way to do this in the interpreter itself: there are way too many things to patch (open, subprocess, os.system, file, and probably others). I'd suggest looking into a way of containerizing the python runtime via something like Docker. The containerization gives some guarantees restricting access, though not as much as virtualization. See here for more discussion about the security implications.
Running a jupyter/ipython notebook in the docker container would probably be the easiest way to expose a web-frontend. jupyter provides a collection of docker containers for this purpose: see https://github.com/jupyter/tmpnb and https://github.com/jupyter/docker-stacks
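As an illustration of the container approach, a sketch of spawning the tutorial console inside a container whose filesystem is read-only might look like this (the image name and tmpfs size are assumptions):

import subprocess

# Spawn the interactive console in a throwaway container: the root filesystem
# is mounted read-only, /tmp is a small in-memory tmpfs, and networking is off.
subprocess.run([
    'docker', 'run', '--rm', '-i',
    '--read-only',
    '--tmpfs', '/tmp:rw,size=16m',
    '--network', 'none',
    'python:3-alpine',
    'python', '-i',
])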

Investigating python process to see what's eating CPU

I have a Python process (a Pylons webapp) that is constantly using 10-30% of the CPU. I'll improve/tune the logging to get some insight into what's going on, but until then, are there any tools/techniques that let me see what the Python process is doing, how many threads it has, how busy they are, etc.?
Update:
configured the access log, which shows that there are no requests going on; the webapp is just idling
there is no point in plugging paste.profile into the middleware chain since there are no requests; the activity must be happening either in the webapp's worker threads or in the paster web server
running paster like this: "python -m cProfile -o outfile /usr/bin/paster serve dev.ini" and inspecting the results shows that most of the time is spent in "posix.waitpid". Paster runs the webapp in a subprocess, and subprocess activity is not picked up by the profiler
looking into hacking the PasteScript "serve" command so that the subprocesses get profiled too
Another update:
After much tinkering, sticking the profiler in various places, and getting familiar with PasteScript's insides, I discovered that the constant CPU load goes away if the application is started without the "--reload" parameter (this flag tells paster to restart itself if the code changes, which is handy in development), and that is fine in a production environment.
Profiling might help you learn a bit about what it's doing. If you sort the output by "time" you will see which functions are chewing up CPU time, which should give you some good hints.
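For example, the dump written by the cProfile invocation mentioned in the question could be inspected with pstats along these lines (the filename "outfile" matches that invocation):

import pstats

# Load the profile dump and show the 20 functions with the most internal time.
stats = pstats.Stats('outfile')
stats.sort_stats('time').print_stats(20)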
As you noted, in --reload mode, Paste sweeps the filesystem every second to see if any of the files loaded have changed. If they have, then Paste reloads the process. You can also manually tell Paste to monitor non-Python code modules for changes if desired.
You can change the reload interval with the --reload-interval option; this will reduce the CPU usage when using --reload, as it will sweep the filesystem less often.
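For example (the dev.ini name matches the question; the 30-second interval is just an illustration):

paster serve --reload --reload-interval=30 dev.ini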
