I'm looking at using inotify to watch about 200,000 directories for new files. On creation, the script watching will process the file and then it will be removed. Because it is part of a more compex system with many processes, I want to benchmark this and get system performance statistics on cpu, memory, disk, etc while the tests are run.
I'm planning on running the inotify script as a daemon and having a second script generating test files in several of the directories (randomly selected before the test).
I'm after suggestions for the best way to benchmark the performance of something like this, especially the impact it has on the Linux server it's running on.
I would try and remove as many other processes as possible in order to get a repeatable benchmark. For example, I would set up a separate, dedicated server with an NFS mount to the directories. This server would only run inotify and the Python script. For simple server measurements, I would use top or ps to monitor CPU and memory.
The real test is how quickly your script "drains" the directories, which depends entirely on your process. You could profile the script and see where it's spending the time.
Related
I am working on a python script that pulls data from an Access database via ODBC, and pulls it into a sqllite database managed by django.
The script takes a fair while to run, and so I was investigating where the bottle necks are and noticed in Task Manager that when running the script python only has a relatively small CPU usage <5% (6 core 12 thread system) but "Antimalware Service Executable" and Windows explorer" jump from virtually nothing to 16% and 10% respectively.
I attempted to add exclusions to windows defender to the python directory, the source code directory, and the location of the Access DB file but this do not make any noticeable effect.
As a small amount of background, the script runs thousands of queries so IO will being accessed quite frequently.
Is there any troubleshooting I can do to diagnose why this is happening and/or if it affecting performance.
I am making a web based python interpreter which will take code executes it on Linux based python3 interpreter and give output on the same web page. But this has some serious loop holes like someone can execute bash script using python's os module, can check directory for source code of the web application and a lot more.
Can anyone suggest me how to prevent this kind of mishaps in my application
Regards
Short answer: there is no easy "python-only" solution for this.
Some details:
user can always try to call os, sys, with open(SENSITIVE_PATH, 'rw') as f: ..., etc, and it's hard to detect all those cases simply by analyzing the code
If you allow ANY third-party, then things become even more complicated, for example some third-party package may locally create an alias to os.execv (os_ex = os.execv), and after this it will be possible to write a script like from thirdparty.some_internals import os_ex; os_ex(...).
The more or less reliable solution is to use "external sandboxing" solutions:
Run interpreter in the unprivileged docker container. For example:
write untrusted script to some file that will be exposed through volume in the docker container
execute that script in docker:
a. subprocess.call(['docker', 'exec', 'CONTAINER_ID', '/usr/bin/python', 'PATH_TO_SCRIPT'])
b. subprocess.call(['docker', 'exec', 'CONTAINER_ID', '/usr/bin/python', '-c', UNTRUSTED_SCRIPT_TEXT])
Use PyPy-s sandbox.
Search for some "secure" IPython kernel for Jupyter notebook server. Or write your own. Note: existing kernels are not guaranteed to be secure and may allow to call subprocess.check_output, os.rm and others. So for "default kernel" it's still better to run Jupyter server in the isolated environment.
Run interpreter in chroot using unprivileged user. Different implementations have different level of "safety".
Use Jython with finely tuned permissions.
Some exotic solutions like "client-side JS python implementation": brython, pyjs
In any case, even if you manage to implement or reuse existing "sandbox" you still will get many potential problems:
If multiprocessing or multithreading is allowed then you might want to monitor how CPU resources are utilized, because
some scripts might want to use EVERYTHING. Even with GIL it's possible for multi-threading to utilize all kernels (all the user has to do is to call functions that use c-libraries in the threads)
You might want to monitor memory usage, because some scripts might leak or simply use a lot of memory
Other candidates for monitoring: Disk IO usage, Network usage, open file descriptors usage, execution time, etc...
Also you should always check for security updates of your "sandboxing solution", because even docker sometimes is vulnerable and makes it possible to execute code on host machine
Recommended read: https://softwareengineering.stackexchange.com/questions/191623/best-practices-for-execution-of-untrusted-code
I need to run some numpy computation on 5000 files in parallel using python. I have the sequential single machine version implemented already. What would be the easiest way to run the code in parallel (say using an ec2 cluster)? Should I write my own task scheduler and job distribution code?
You can have a look pscheduler Python module. It will allow you to queue up your jobs and run them sequentially. The number of concurrent processes will depend upon the available CPU cores. This program can easily scale up and submit your jobs to remote machines but then would require all your remote machines to use NFS.
I'll be happy to help you further.
I have a Django app that is intended to be run on Virtualbox VMs on LANs. The basic user will be a savvy IT end-user, not a sysadmin.
Part of that app's job is to connect to external databases on the LAN, run some python batches against those databases and save the results in its local db. The user can then explore the systems using Django pages.
Run time for the batches isn't all that long, but runs to minutes, tens of minutes potentially, not seconds. Run frequency is infrequent at best, I think you could spend days without needing a refresh.
This is not celery's normal use case of long tasks which will eventually push the results back into the web UI via ajax and/or polling. It is more similar to a dev's occasional use of the django-admin commands, but this time intended for an end user.
The user should be able to initiate a run of one or several of those batches when they want in order to refresh the calculations of a given external database (the target db is a parameter to the batch).
Until the batches are done for a given db, the app really isn't useable. You can access its pages, but many functions won't be available.
It is very important, from a support point of view that the batches remain easily runnable at all times. Dropping down to the VMs SSH would probably require frequent handholding which wouldn't be good - it is best that you could launch them from the Django webpages.
What I currently have:
Each batch is in its own script.
I can run it on the command line (via if __name__ == "main":).
The batches are also hooked up as celery tasks and work fine that way.
Given the way I have written them, it would be relatively easy for me to allow running them from subprocess calls in Python. I haven't really looked into it, but I suppose I could make them into django-admin commands as well.
The batches already have their own rudimentary status checks. For example, they can look at the calculated data and tell whether they have been run and display that in Django pages without needing to look at celery task status backends.
The batches themselves are relatively robust and I can make them more so. This is about their launch mechanism.
What's not so great.
In Mac dev environment I find the celery/celerycam/rabbitmq stack to be somewhat unstable. It seems as if sometime rabbitmqs daemon balloons up in CPU/RAM use and then needs to be terminated. That mightily confuses the celery processes and I find I have to kill -9 various tasks and relaunch them manually. Sometimes celery still works but celerycam doesn't so no task updates. Some of these issues may be OSX specific or may be due to the DEBUG flag being switched for now, which celery warns about.
So then I need to run the batches on the command line, which is what I was trying to avoid, until the whole celery stack has been reset.
This might be acceptable on a normal website, with an admin watching over it. But I can't have that happen on a remote VM to which only the user has access.
Given that these are somewhat fire-and-forget batches, I am wondering if celery isn't overkill at this point.
Some options I have thought about:
writing a cleanup shell/Python script to restart rabbitmq/celery/celerycam and generally make it more robust. i.e. whatever is required to make celery & all more stable. I've already used psutil to figure out rabbit/celery process are running and display their status in Django.
Running the batches via subprocess instead and avoiding celery. What about django-admin commands here? Does that make a difference? Still needs to be run from the web pages.
an alternative task/process manager to celery with less capability but also less moving parts?
not using subprocess but relying on Python multiprocessing module? To be honest, I have no idea how that compares to launches via subprocess.
environment:
nginx, wsgi, ubuntu on virtualbox, chef to build VMs.
I'm not sure how your celery configuration makes it unstable but sounds like it's still the best fit for your problem. I'm using redis as the queue system and it works better than rabbitmq from my own experience. Maybe you can try it see if it improves things.
Otherwise, just use cron as a driver to run periodic tasks. You can just let it run your script periodically and update the database, your UI component will poll the database with no conflict.
I have a web service to which users upload python scripts that are run on a server. Those scripts process files that are on the server and I want them to be able to see only a certain hierarchy of the server's filesystem (best: a temporary folder on which I copy the files I want processed and the scripts).
The server will ultimately be a linux based one but if a solution is also possible on Windows it would be nice to know how.
What I though of is creating a user with restricted access to folders of the FS - ultimately only the folder containing the scripts and files - and launch the python interpreter using this user.
Can someone give me a better alternative? as relying only on this makes me feel insecure, I would like a real sandboxing or virtual FS feature where I could run safely untrusted code.
Either a chroot jail or a higher-order security mechanism such as SELinux can be used to restrict access to specific resources.
You are probably best to use a virtual machine like VirtualBox or VMware (perhaps even creating one per user/session).
That will allow you some control over other resources such as memory and network as well as disk
The only python that I know of that has such features built in is the one on Google App Engine. That may be a workable alternative for you too.
This is inherently insecure software. By letting users upload scripts you are introducing a remote code execution vulnerability. You have more to worry about than just modifying files, whats stopping the python script from accessing the network or other resources?
To solve this problem you need to use a sandbox. To better harden the system you can use a layered security approach.
The first layer, and the most important layer is a python sandbox. User supplied scripts will be executed within a python sandbox. This will give you the fine grained limitations that you need. Then, the entire python app should run within its own dedicated chroot. I highly recommend using the grsecurity kernel modules which improve the strength of any chroot. For instance a grsecuirty chroot cannot be broken unless the attacker can rip a hole into kernel land which is very difficult to do these days. Make sure your kernel is up to date.
The end result is that you are trying to limit the resources that an attacker's script has. Layers are a proven approach to security, as long as the layers are different enough such that the same attack won't break both of them. You want to isolate the script form the rest of the system as much as possible. Any resources that are shared are also paths for an attacker.