I am new to Python and struggling to find out how to control the amount of memory a Python process can take. I am running Python on a CentOS machine with more than 2 GB of main memory. Python is taking up only 128 MB of this and I want to allocate it more. I searched all over the internet for the last half an hour and found absolutely nothing! Why is it so difficult to find information on Python-related stuff :(
I would be happy if someone could shed some light on how to configure Python for things like allowed memory size, number of threads, etc.
A link to a site where most controllable parameters of Python are described would be much appreciated.
Forget all that. Python just allocates more memory as needed; there is not a myriad of command-line arguments for the VM as in Java, so just let it run. For all command-line switches you can run python -h or read man python.
Are you sure that the machine does not have a 128M process limit? If you are running the python script as a CGI inside a web server, it is quite likely that there is a process limit set - you will need to look at the web server configuration.
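To check whether a per-process limit applies, you can ask the operating system from inside the script itself. This is only a minimal sketch, assuming a Linux/Unix host (the resource module is not available on Windows); the helper name describe_limit is made up for illustration.

```python
import resource


def describe_limit(name, limit_id):
    """Return a readable line for one resource limit of this process."""
    soft, hard = resource.getrlimit(limit_id)

    def fmt(value):
        # RLIM_INFINITY means no limit is set for this resource.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value} bytes"

    return f"{name}: soft={fmt(soft)}, hard={fmt(hard)}"


# Limits that commonly cap how much memory a process may use.
print(describe_limit("address space (RLIMIT_AS)", resource.RLIMIT_AS))
print(describe_limit("data segment (RLIMIT_DATA)", resource.RLIMIT_DATA))
```

If both lines report "unlimited", the cap is probably coming from somewhere else, for example the web server configuration mentioned above.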
I have a Python script that works fine on my main computer. But when I uploaded it to the Ubuntu server it started crashing. I spent a long time trying to work out what the problem was and looked at the system logs. It turned out that Ubuntu forcibly terminates the script due to lack of memory (the server has 512 MB of RAM). How can I debug the program's memory consumption in its different modes of operation?
Have a look at something like Guppy3, which includes heapy, a 'heap analysis toolset' that can help you find where the memory's being used/held. Some links to information on how to use it are in the project's README.
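If installing a third-party package on the server is awkward, the standard library's tracemalloc module gives a rougher but similar view. Note that this is a different tool from Guppy3/heapy, not its API. A minimal sketch, where build_big_list is a hypothetical stand-in for the real workload:

```python
import tracemalloc


def top_allocations(func, limit=5):
    """Run func() under tracemalloc and return the largest allocation sites."""
    tracemalloc.start()
    try:
        result = func()  # keep a reference so the allocations stay alive
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    del result
    # Statistics grouped by source line, largest total size first.
    return snapshot.statistics("lineno")[:limit]


def build_big_list():
    # Hypothetical workload standing in for the real script.
    return [str(i) * 10 for i in range(100_000)]


for stat in top_allocations(build_big_list):
    print(stat)
```

Each printed entry names a file and line together with the total size and number of blocks allocated there, which is usually enough to see which part of the script is holding the memory.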
If you have a core, consider using https://github.com/vmware/chap, which will allow you to look at both python and native allocations.
Once you have opened the core, probably "summarize used" is a good place to start.
I am making a web-based Python interpreter which takes code, executes it on a Linux-based python3 interpreter and shows the output on the same web page. But this has some serious loopholes: someone can execute a bash script using Python's os module, browse directories for the source code of the web application, and a lot more.
Can anyone suggest how to prevent this kind of mishap in my application?
Regards
Short answer: there is no easy "python-only" solution for this.
Some details:
the user can always try to use os, sys, with open(SENSITIVE_PATH, 'r+') as f: ..., etc., and it's hard to detect all those cases simply by analyzing the code
If you allow ANY third-party packages, things become even more complicated. For example, some third-party package may locally create an alias to os.execv (os_ex = os.execv), and after this it will be possible to write a script like from thirdparty.some_internals import os_ex; os_ex(...).
The more or less reliable solution is to use "external sandboxing" solutions:
Run interpreter in the unprivileged docker container. For example:
write untrusted script to some file that will be exposed through volume in the docker container
execute that script in docker:
a. subprocess.call(['docker', 'exec', 'CONTAINER_ID', '/usr/bin/python', 'PATH_TO_SCRIPT'])
b. subprocess.call(['docker', 'exec', 'CONTAINER_ID', '/usr/bin/python', '-c', UNTRUSTED_SCRIPT_TEXT])
Use PyPy's sandbox.
Search for some "secure" IPython kernel for Jupyter notebook server. Or write your own. Note: existing kernels are not guaranteed to be secure and may allow calls to subprocess.check_output, os.remove and others. So for the "default kernel" it's still better to run the Jupyter server in an isolated environment.
Run the interpreter in a chroot as an unprivileged user. Different implementations have different levels of "safety".
Use Jython with finely tuned permissions.
Some exotic solutions like "client-side JS python implementation": brython, pyjs
In any case, even if you manage to implement or reuse an existing "sandbox", you will still face many potential problems:
If multiprocessing or multithreading is allowed then you might want to monitor how CPU resources are utilized, because
some scripts might want to use EVERYTHING. Even with the GIL it's possible for multi-threading to utilize all cores (all the user has to do is call functions that use C libraries in the threads)
You might want to monitor memory usage, because some scripts might leak or simply use a lot of memory
Other candidates for monitoring: Disk IO usage, Network usage, open file descriptors usage, execution time, etc...
Also you should always check for security updates of your "sandboxing solution", because even docker sometimes is vulnerable and makes it possible to execute code on host machine
Recommended read: https://softwareengineering.stackexchange.com/questions/191623/best-practices-for-execution-of-untrusted-code
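As an illustration of the resource-monitoring points above, here is a rough sketch that runs a script in a child interpreter with CPU, memory and wall-clock limits. It assumes a Linux host, and the names run_limited and _cap are made up for this example. This is NOT a sandbox on its own: it does nothing about file or network access, so it only makes sense inside one of the isolation options listed earlier.

```python
import resource
import subprocess
import sys


def _cap(limit_id, wanted):
    """Lower one resource limit, never asking for more than the hard limit."""
    _soft, hard = resource.getrlimit(limit_id)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    resource.setrlimit(limit_id, (wanted, wanted))


def run_limited(code, cpu_seconds=2, memory_bytes=1024**3, wall_seconds=5):
    """Run code in a child interpreter with CPU, memory and time limits."""

    def apply_limits():
        # Runs in the child just before exec, so only the child is limited.
        _cap(resource.RLIMIT_CPU, cpu_seconds)
        _cap(resource.RLIMIT_AS, memory_bytes)

    # -I starts Python in isolated mode (ignores environment variables
    # and the user site directory). TimeoutExpired is raised if the
    # child exceeds wall_seconds.
    return subprocess.run(
        [sys.executable, "-I", "-c", code],
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=wall_seconds,
    )


result = run_limited("print(1 + 1)")
print(result.returncode, result.stdout.strip())
```

A script that tries to allocate more than memory_bytes should fail with a MemoryError inside the child rather than taking the host down, and one that spins on the CPU is killed by the kernel once the CPU limit is reached.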
The Issue
I am using my laptop with Apache to act as a server for a local project involving tensorflow and python, which uses an API written in Flask to service GET and POST requests coming from an app and maybe another user on the local network. The problem is that the initial page keeps loading when I import tensorflow or the object detection folder within the research folder in the tensorflow GitHub folder, and it never seems to finish doing so, effectively getting stuck. I suspect the issue has to do with the packages being large in size, but I didn't have any issue with that when running the application on the development server provided with Flask.
Are there any pointers that I should look for when trying to solve this issue? I checked the memory usage, and it doesn't seem to be rising substantially, as well as the CPU usage.
Debugging process
I am able to print basic hello world to the root page quite quickly, but I isolated the issue to the point when the importing takes place where it gets stuck.
The only thing I can think of is to limit the number of threads that are launched, but when I limited the number of threads per child to 5 and number of connections to 5 in the httpd-mpm.conf file, it didn't help.
The error/access logs don't provide much insight to the matter.
A few notes:
Thus far, I used Flask's development server with multi-threading enabled to serve those requests, but I found it to be prone to crashing after 5 minutes of continuous run, so I am now trying to use Apache using the wsgi interface in order to use Python scripts.
I should also note that I am not servicing html files, just basic GET and POST requests. I am just viewing them using the browser.
If it helps, I also don't use virtual environments.
I am using Windows 10, Apache 2.4 and mod_wsgi 4.5.24
The tensorflow module is a C extension module and may not be implemented in a way that works properly in Python sub interpreters. To work around this, force your application to run in the main Python interpreter context; in mod_wsgi this is done with the WSGIApplicationGroup %{GLOBAL} directive. Details in:
http://modwsgi.readthedocs.io/en/develop/user-guides/application-issues.html#python-simplified-gil-state-api
I have a large project that runs on an application server. It does pipelined processing of large batches of data and works fine on one Linux system (the old production environment) and one windows system (my dev environment).
However, we're upgrading our infrastructure and moving to a new Linux system for production, based on the same image used for the existing production system (we use AWS). The Python version (2.7) and libraries should be identical because of this; we're also verifying this ourselves using file hashes.
Our issue is that when we attempt to start processing on the new server, we receive a very strange output written to standard out followed by hanging of the server, "Removing descriptor: [some number]". I cannot duplicate this on the dev machine.
Has anyone ever encountered behavior like this in python before? Besides modules in the python standard library we are also using eventlet and beautifulsoup. In the standard library we lean heavily on urllib2, re, cElementTree, and multiprocessing (mostly the pools).
wberry was correct in his comment, I was running into a max descriptors per process issue. This seems highly dependent on operating system. Reducing the size of the batches I was having each processor handle to below the file descriptor limit of the process solved the problem.
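For anyone hitting the same thing, the limit can be read from inside the process and used to size the batches. A rough sketch, assuming a Unix host; the names max_batch_size and chunked and the reserve of 64 descriptors are arbitrary choices for illustration, not what the original code used.

```python
import resource


def max_batch_size(reserve=64):
    """Estimate how many descriptors one batch may hold open at once.

    reserve leaves room for stdin/stdout, log files, pipes to worker
    processes and so on. Returns None when the limit is unlimited.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return max(1, soft - reserve)


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


print("descriptor budget per batch:", max_batch_size())
```

With that, a large list of work items can be split with chunked(items, max_batch_size()) so that no single batch needs more descriptors than the process is allowed to open.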
I have a Python process (Pylons webapp) that is constantly using 10-30% of CPU. I'll improve/tune logging to get some insight into what's going on, but until then, are there any tools/techniques that let me see what the Python process is doing, how many threads it has, how busy they are, etc.?
Update:
configured access log which shows that there are no requests going on, webapp is just idling
no point to plug in paste.profile in middleware chain since there are no requests, activity must be happening either in webapp's worker threads or paster web server
running paster like this: "python -m cProfile -o outfile /usr/bin/paster serve dev.ini" and inspecting results shows that most time is spent in "posix.waitpid". Paster runs webapp in subprocess, subprocess activity is not picked up by profiler
looking into hacking the PasteScript "serve" command so that subprocesses would get profiled
Another update:
After much tinkering, sticking profiler in various places and getting familiar with PasteScript insides, I discovered that the constant CPU load goes away if application is started without "--reload" parameter (this flag tells paster to restart itself if code changes, handy in development), which is fine in production environment.
Profiling might help you learn a bit about what it's doing. If you sort the output by "time" you will see which functions are eating up CPU time, which should give you some good hints.
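A small sketch of sorting profiler output by "time" using the standard cProfile and pstats modules; busy_work is a hypothetical stand-in for whatever code you want to inspect, and profile_report is just a helper name chosen here.

```python
import cProfile
import io
import pstats


def profile_report(func, top=5):
    """Profile func() and return the top entries sorted by internal time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "time" sorts by time spent inside each function itself,
    # excluding time spent in the functions it calls.
    stats.sort_stats("time").print_stats(top)
    return out.getvalue()


def busy_work():
    # Hypothetical workload standing in for the real application code.
    return sum(i * i for i in range(200_000))


print(profile_report(busy_work))
```

For a profile already saved with python -m cProfile -o outfile, the same sorting works by passing the file name instead: pstats.Stats("outfile").sort_stats("time").print_stats(10).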
As you noted, in --reload mode, Paste sweeps the filesystem every second to see if any of the files loaded have changed. If they have, then Paste reloads the process. You can also manually tell Paste to monitor non-Python code modules for changes if desired.
You can change the reload interval with the --reload-interval option; this reduces CPU usage when using --reload because Paste sweeps less often.
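To see why the interval matters, here is a rough sketch of how a polling reloader of this kind works in general. It is not Paste's actual implementation, and the function names are made up: every sweep stats every watched file, so the cost grows with the number of files and shrinks as the interval gets longer.

```python
import os
import time


def snapshot_mtimes(paths):
    """Map each existing path to its last-modified time."""
    mtimes = {}
    for path in paths:
        try:
            mtimes[path] = os.stat(path).st_mtime
        except OSError:
            continue  # file vanished or is unreadable; skip it
    return mtimes


def changed_paths(before, after):
    """Return paths that are new or whose modification time differs."""
    return sorted(p for p in after if before.get(p) != after[p])


def watch(paths, interval=1.0, on_change=print):
    """Sweep the files every interval seconds and report changes."""
    before = snapshot_mtimes(paths)
    while True:
        time.sleep(interval)
        after = snapshot_mtimes(paths)
        changed = changed_paths(before, after)
        if changed:
            on_change(changed)
        before = after
```

A real reloader would restart the server process in on_change instead of printing, but the polling loop is the part that keeps the CPU busy while the app is otherwise idle.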