Testing command-line utilities - Python

I'm looking for a way to run tests on command-line utilities written in bash, or any other language.
I'd like to find a testing framework that would have statements like
setup:
    command = 'do_awesome_thing'
    filename = 'testfile'
    args = ['--with', 'extra_win', '--file', filename]
    run_command command args

test_output_was_correct:
    assert_output_was 'Creating awesome file "' + filename + '" with extra win.'

test_file_contains_extra_win:
    assert_file_contains filename 'extra win'
Presumably the base test case would set up a temp directory in which to run these commands, and remove it at teardown.
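To make that concrete, here is a rough sketch of what I imagine such a base case doing, using only the standard library (the command name and its output are of course hypothetical placeholders):

import os
import shutil
import subprocess
import tempfile
import unittest

class CommandTestCase(unittest.TestCase):
    """Run a command in a throwaway directory; the directory is removed at teardown."""

    def setUp(self):
        self.tmpdir = tempfile.mkdtemp()

    def tearDown(self):
        shutil.rmtree(self.tmpdir)

    def run_command(self, command, args):
        proc = subprocess.Popen([command] + args, cwd=self.tmpdir,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        self.stdout, self.stderr = proc.communicate()
        self.returncode = proc.returncode

class TestAwesomeThing(CommandTestCase):
    def test_output_and_file(self):
        filename = 'testfile'
        self.run_command('do_awesome_thing',
                         ['--with', 'extra_win', '--file', filename])
        self.assertIn(b'Creating awesome file', self.stdout)
        with open(os.path.join(self.tmpdir, filename), 'rb') as f:
            self.assertIn(b'extra win', f.read())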
I would prefer to use something in Python, since I'm much more familiar with it than with other plausible candidate languages.
I imagine that there could be something using a DSL that would make it effectively language-agnostic (or its own language, depending on how you look at it); however, this might be less than ideal, since my testing techniques usually involve writing code that generates tests.
It's a bit difficult to google for this, as there is a lot of information on utilities which run tests, which is sort of the converse of what I'm looking for.
Support for doctests embedded in the output of command --help would be an extra bonus :)

Check out ScriptTest:

from scripttest import TestFileEnvironment

env = TestFileEnvironment('./scratch')

def test_script():
    env.reset()
    filename = 'testfile'
    result = env.run('do_awesome_thing testfile --with extra_win --file %s' % filename)
    # or use a list like ['do_awesome_thing', 'testfile', ...]
    assert result.stdout.startswith('Creating awesome file')
    assert filename in result.files_created
It's reasonably doctest-usable as well.
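For instance, a doctest version of the same check might read roughly like this (a sketch only; the command and its output are the hypothetical ones from the question):

def test_awesome_thing():
    """
    >>> from scripttest import TestFileEnvironment
    >>> env = TestFileEnvironment('./scratch')
    >>> result = env.run('do_awesome_thing', '--with', 'extra_win',
    ...                  '--file', 'testfile')
    >>> result.stdout.startswith('Creating awesome file')
    True
    >>> 'testfile' in result.files_created
    True
    """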

Well... what we usually do (and one of the wonders of O.O. languages) is to write all the components of an application before actually making the application. Every component can have a standalone way to be executed for testing purposes (usually from the command line), which also lets you think of each one as a complete program in its own right and reuse them in future projects. If what you want is to test the integrity of an existing program... well, I think the best way is to learn in depth how it works, or even deeper: read the source. Or even deeper: develop a bot to force-test it :3
Sorry, that's all I have .-.

I know that this question is old, but since I was looking for an answer, I figured I would add my own for anyone else who happens along.
Full disclaimer: The project I am mentioning is my own, but it is completely free and open source.
I ran into a very similar problem, and ended up rolling my own solution. The test code will look like this:
from CLITest import CLITest, TestSuite
from subprocess import CalledProcessError

class TestEchoPrintsToScreen(CLITest):
    '''Tests whether the string passed in is the string
    passed out'''

    def test_output_contains_input(self):
        self.assertNotIsInstance(self.output, CalledProcessError)
        self.assertIn("test", self.output)

    def test_output_equals_input(self):
        self.assertNotIsInstance(self.output, CalledProcessError)
        self.assertEqual("test", self.output)

suite = TestSuite()
suite.add_test(TestEchoPrintsToScreen("echo test"))
suite.run_tests()
This worked well enough to get me through my issues, but I know it could use some more work to make it as robust as possible (test discovery springs to mind). It may help, and I always love a good pull request.

Outside of any prepackaged testing framework that may exist but that I'm unaware of, I would just point out that expect is an awesome and badly underutilized tool for this kind of automation, especially if you want to support multistage interaction: not just sending a command and checking its output, but responding to output with more input. If you wind up building your own system, it's worth looking into.
There is also a Python reimplementation of expect called pexpect. There may be some direct interfaces to the expect library available as well. I'm not a Python guy, so I couldn't tell you much about them.
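To give a rough flavor of that multistage style with pexpect (an untested illustration; the command, prompts, and output here are hypothetical):

import pexpect

child = pexpect.spawn('do_awesome_thing --interactive')
child.expect(r'Output file name\?')
child.sendline('testfile')
child.expect(r'Add extra win\? \[y/n\]')
child.sendline('y')
child.expect(pexpect.EOF)
# everything printed after the last prompt ends up in child.before
assert b'Creating awesome file' in child.before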


What is the state of the art way to handle what makefiles do for python data analysis?

I have a program that is a DAG: it processes and cleans certain files, combines them, then does additional calculations. I want a way to run the whole analysis pipeline, and to re-run it if anything changes, but without having to re-process every single component.
I read about Makefiles and thought they sounded like the perfect solution. I am also aware that make is probably outdated and that better alternatives probably exist, but I generally only find large lists of workflow-scheduler tools that are not quite suited to this purpose, as far as I can tell (e.g., Airflow, Luigi, Nextflow, Dagobah, etc.).
It seems like many of these are overkill, with schedulers, GUIs, etc. that I don't really need. I just want one file that does the following:
makes it obvious which Python scripts need to run
shows file dependencies, so that a full re-run only redoes parts where something has changed upstream
has the potential for some parallelization (not essential)
doesn't have too much boilerplate
Makefile example:

.PHONY : dats
dats : isles.dat abyss.dat

isles.dat : books/isles.txt
	python countwords.py books/isles.txt isles.dat

abyss.dat : books/abyss.txt
	python countwords.py books/abyss.txt abyss.dat

.PHONY : clean
clean :
	rm -f *.dat
Is this the best procedure to run something like this in python or is there a better way?
DVC (Data Version Control) includes a modern re-implementation and extension of make that is particularly suited to data-science pipelines (see here).
Handling pipelines in DVC has important benefits over make for many scenarios, such as relying on file checksums rather than modification times. On the other hand, make is simpler in some ways and has a powerful macro mechanism. Still, there are elements of makefile syntax that are quite subtle (e.g., multiple outputs, intermediate files), and make generally doesn't support whitespace in filenames.
Is this the best procedure to run something like this in python or is there a better way?
"Best" is surely in the eye of the beholder. However, if the make-based approach presented in the question is satisfactorily representative of the problem then it is a good way. make implementations are very widely available, and their behavior is well understood and generally well-suited to problems such as the one presented.
There are other build tools that compete with make, some written in Python, and there are undoubtedly some more esoteric software frameworks that could be applied to the task. Nevertheless, if you want to focus on doing the work instead of on building the framework to do the work, then I don't see any reason to look past the make-based solution you already have.
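(That said, if you do want one of the Python-based tools, doit is a common choice; the same two rules in its dodo.py file might look roughly like this. Untested sketch.)

# dodo.py -- rough doit equivalent of the Makefile in the question
def task_isles():
    return {
        'file_dep': ['books/isles.txt', 'countwords.py'],
        'targets': ['isles.dat'],
        'actions': ['python countwords.py books/isles.txt isles.dat'],
    }

def task_abyss():
    return {
        'file_dep': ['books/abyss.txt', 'countwords.py'],
        'targets': ['abyss.dat'],
        'actions': ['python countwords.py books/abyss.txt abyss.dat'],
    }

Running doit in that directory then rebuilds only the targets whose dependencies have changed, much like make.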
The way you present the question, I would say snakemake is the way to go. Having said that, GNU make may be old, but it is not going to disappear any time soon and it has been tried and tested to death.
I don't speak make, but I think your example Makefile in snakemake would be something like this:
rule all:
    input:
        ['isles.dat', 'abyss.dat'],

rule make_isles:
    input:
        'books/isles.txt',
    output:
        'isles.dat',
    shell:
        r"""
        python countwords.py {input} {output}
        """

rule make_abyss:
    input:
        'books/abyss.txt',
    output:
        'abyss.dat',
    shell:
        r"""
        python countwords.py {input} {output}
        """
Save this in a file called Snakefile and execute it as:
snakemake # vanilla execution
snakemake -p -n # Print shell commands (-p). Dry-run mode (-n)
snakemake --delete-all-output # Same-ish as .PHONY clean
snakemake is popular in bioinformatics, but it is a pretty general-purpose tool.

Saving temporary files with CGI (Ubuntu)

I am a postdoc and I just finished a cool little scientific application in Python that I want to share with the world. It's a really useful tool for geneticists.
I'd really like to let people run this program through a CGI form interface. Since I'm not a student anymore, I no longer have webspace with a tidy little cgi-bin subdirectory that's hooked up perfectly.
I wrote a simple CGI Python program a few years ago, and was trying to use this as a template.
Here is my question:
My program needs to create temporary files (when run from the command line it saves images to a given path).
I've read a couple of tutorials on Apache, etc. and got lots of things running, but I can't figure out how to let my program write temporary files (I also don't know where these files should live). Any time I try to write to a file (in any manner) from my Python program, the CGI script crashes.
I am not extremely worried about security because the temporary files will only be outputs of the program (not the user input).
And while I'm asking (I'm assuming you're kind of a CGI ninja if you got this far and weren't bored), do you know how my CGI program can take a file argument without making a temporary file?
My previous approach to this was to simply take a list of text as an argument:
try:
    if item.file:
        data = item.file.read()
        if check:
            Tools_file.main(["ExeName", "-d", "-w " + data])
        else:
            Tools_file.main(["ExeName", "-s", "-d", "-w " + data])
    ...
I'd like to do this the right way! Cheers in advance.
Stack overflowingly yours,
Oliver
Well, the "right" way is probably to re-work things using an existing web framework like Django. It's probably overkill in this case. Don't underestimate the security aspects here. They're probably more relevant than you think.
All that said, you probably want to use Python's tempfile module from the standard library:
http://docs.python.org/library/tempfile.html
It'll generally write stuff out to /tmp/whatever if you're on Unix. If your program crashes only when run under Apache (but runs fine when you execute it directly), check your permissions. Make sure the user Apache runs as has permission to write wherever you've decided to store your temp files, and make sure the temp files are written with appropriate permissions (you don't want to write a file that you can't read later on).
As Paul McMillan said, use tempfile:
import os
import tempfile

temp, temp_filename = tempfile.mkstemp(text=True)
temp_output = os.fdopen(temp, 'w')
temp_output.write(something_or_other)
temp_output.close()
My personal opinion is that frameworks are a big time sink unless you really need the prebuilt functionality. CGI is far simpler and can probably work for your application, at least until it gets really popular.
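If you do stick with plain CGI, the skeleton might look roughly like this (a sketch only; the form field name "datafile" and the commented call into your program are placeholders):

#!/usr/bin/env python
import cgi
import os
import tempfile

form = cgi.FieldStorage()
data = form['datafile'].file.read()

fd, out_path = tempfile.mkstemp(suffix='.png')
os.close(fd)
# hypothetical call into the existing tool, telling it to write its image to out_path:
# Tools_file.main(["ExeName", "-d", "-w " + data, "-o", out_path])

print("Content-Type: text/html\n")
print("<p>Result image written to %s</p>" % out_path)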

What is the Python equivalent of Lame MP3 Converter?

I need to convert mp3 audio files to 64kbps on the server side.
Right now, I am using subprocess to call lame, but I wonder if there are any good alternatives?
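For reference, the kind of call I mean is roughly this (paths are placeholders):

import subprocess

subprocess.check_call(['lame', '--mp3input', '-b', '64',
                       'input.mp3', 'output_64k.mp3'])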
There seems to be a slightly old thread on that topic here: http://www.dreamincode.net/forums/topic/72083-lame-mp3-encoder-for-python/
The final conclusion was to create a custom binding to lame_enc.dll via Python->C bindings.
The reason for that conclusion was that the existing binding libraries (pymedia/py-lame) have not been maintained.
Unfortunately the guy didn't get it to work :)
Maybe you should continue to use subprocess. You could take advantage of that choice, abstract your encoding at a slightly higher level, and reuse the code/strategy to optionally execute other command line encoding tools (such as ogg or shn tools).
I've seen several audio ripping tools adopt that strategy.
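A minimal sketch of that abstraction, keeping per-format command templates in one place so the same code can drive lame, oggenc, or whatever else (tool names and flags here are just examples):

import subprocess

ENCODERS = {
    'mp3': ['lame', '--mp3input', '-b', '64', '{src}', '{dst}'],
    'ogg': ['oggenc', '-b', '64', '-o', '{dst}', '{src}'],
}

def encode(fmt, src, dst):
    # fill in the source/destination paths and run the chosen tool
    cmd = [part.format(src=src, dst=dst) for part in ENCODERS[fmt]]
    subprocess.check_call(cmd)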
I've been working with Python Audio Tools, which is capable of making conversions between different audio formats.
I've already used it to convert .wav files into .mp3, .flac, and .m4a.
If you want to use LAME to encode your MP3s (and not PyMedia), you can always use ctypes to wrap the lame encoder DLL (or .so if you are on Linux). The exact wrapper code you'll use is going to be tied to the LAME DLL version (and there are many of these flying around, unfortunately), so I can't really give you any example, but the ctypes docs should be clear enough about wrapping DLLs.
Caveat: relatively new programmer here and I haven't had a need to convert audio files before.
However, if I understand what you mean by server-side correctly, you might be looking for a good approach to manage mass conversions, and your interest in a Python solution might in part be about better managing resource use or integrating into your processing chain. I had a similar problem/goal, which I resolved using a mix of Merlyn's recommendation and Celery. I don't use django-celery, but if this is for a Django-based project, that might appeal to you as well. You can find out more about Celery here:
http://celeryproject.org/community.html
http://ask.github.com/celery/getting-started/introduction.html
Depending on what you have set up already, there may be a little upfront time needed to get going. To take advantage of everything you'll need rabbitmq/erlang installed, but if you follow the guides on the sites above, it's pretty quick.
Here's an example of how I use Celery with subprocess to address a similar issue. Like the poster's suggestion above, I use subprocess to call ffmpeg, which is about as good as it gets for video tools, and probably about as good as it gets for audio tools too. I'm including a bit more than necessary here to give you a feel for how you might configure your own.
# Example of configuring an option; here I'm selecting how much I want to
# adjust bitrate based on my input's format.
def generate_command_line_method(self):
    if self.bitrate:
        compression_dict = {'.mp4': 1.5, '.rm': 1.5, '.avi': 1.2,
                            '.mkv': 1.2, '.mpg': 1, '.mpeg': 1}
        if self.ext.lower() in compression_dict.keys():
            compression_factor = compression_dict[self.ext.lower()]
        # Making a list to send to the command line through subprocess
        # (outpath is assumed to be defined elsewhere in the class)
        ffscript = ['ffmpeg',
                    '-i', self.fullpath,
                    '-b', str(self.bitrate * compression_factor),
                    '-qscale', '3',  # quality factor, based on trial and error
                    '-g', '90',      # iframe roughly per 3 seconds
                    '-intra',
                    outpath
                    ]
        return ffscript

# The celery side of things: I'd have a celeryconfig.py file in the same
# directory as the script that points to the following function, so my task
# queue would know the specifics of the function I'll call through it. You can
# see example configs on the sites above, but it's basically just a tuple that
# says "here are the modules I want you to look in", e.g.
# CELERY_MODULES = ("exciting_asynchronous_module.py",). That module then contains:

from celery.decorators import task
from mymodule import myobject
from subprocess import Popen

@task(time_limit=600)  # say, for example, 10 mins
def run_ffscript(ffscript):
    some_result = Popen(ffscript).wait()
    # Note: we wait here because we don't want to compound the asynchronous
    # aspect (we don't want celery to launch the subprocess and think it has
    # finished immediately).

# Then I start up celery/rabbitmq and go into my interactive shell (ipython shown).
# I'll have some generator feeding these ffscript command lines, then process them
# with something like:

In [1]: for generated_ffscript in generator:
   ...:     run_ffscript.delay(generated_ffscript)
Let me know if this was useful to you. I'm relatively new to answering questions here and not sure if my attempts are helpful or not. Good luck!
Well, GStreamer has the "ugly" plugin lamemp3enc, and there are Python bindings for GStreamer (gst-python 1.2, which supports Python 3.3). I haven't tried going this route myself, so I'm not really in a position to recommend anything... Frankly, a subprocess solution seems a lot simpler, if not "cleaner", to me.
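For completeness, an untested sketch of what that might look like with the GStreamer 1.0 Python bindings (element and property names are the standard GStreamer ones; you would need the relevant plugins installed, and the file names are placeholders):

import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.parse_launch(
    'filesrc location=input.mp3 ! decodebin ! audioconvert ! '
    'lamemp3enc target=bitrate bitrate=64 cbr=true ! '
    'filesink location=output_64k.mp3')
pipeline.set_state(Gst.State.PLAYING)
# block until the transcode finishes or fails
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)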

Dangerous Python Keywords?

I am about to get a bunch of python scripts from an untrusted source.
I'd like to be sure that no part of the code can hurt my system, meaning:
(1) the code is not allowed to import ANY MODULE
(2) the code is not allowed to read or write any data, connect to the network etc
(the purpose of each script is to loop through a list, compute some data from input given to it and return the computed value)
Before I execute such code, I'd like to have a script "examine" it and make sure that there's nothing dangerous there that could hurt my system.
I thought of using the following approach: check that the word 'import' is not used, so we are guaranteed that no modules are imported.
Yet it would still be possible for the user (if desired) to write code to read/write files etc. (say, using open).
Then here comes the question:
(1) Where can I get a 'global' list of Python methods (like open)?
(2) Is there some code that I could add to each script that is sent to me (at the top) that would make some 'global' methods invalid for that script (for example, any use of the keyword open would lead to an exception)?
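For example, for (2) I was picturing something along these lines, whether added at the top of each script or in a wrapper that runs it (a rough sketch only; the sample snippet is just an illustration, and I realize it may be easy to bypass):

# run the untrusted source with a stripped-down builtins mapping
safe_builtins = {'len': len, 'range': range, 'sum': sum, 'min': min, 'max': max}
untrusted_source = "result = sum(x * x for x in data)"   # stand-in for a received script
namespace = {'__builtins__': safe_builtins, 'data': [1, 2, 3]}
exec(untrusted_source, namespace)
print(namespace['result'])   # 14; using open() here would raise NameError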
I know that there are some solutions for Python sandboxing, but please try to answer this question as I feel this is the more relevant approach for my needs.
EDIT: Suppose that I make sure that no import is in the file, and that no possibly harmful methods (such as open, eval, etc.) are in it. Can I conclude that the file is SAFE? (Can you think of any other 'dangerous' ways that built-in methods can be run?)
This point hasn't been made yet, and should be:
You are not going to be able to secure arbitrary Python code.
A VM is the way to go unless you want security issues up the wazoo.
You can still obfuscate import without using eval:
s = '__imp'
s += 'ort__'
f = globals()['__builtins__'].__dict__[s]
# ** BOOM **
Start with the documentation's lists of built-in functions and keywords.
Note that you'll need to do things like look for both "file" and "open", as both can open files.
Also, as others have noted, this isn't 100% certain to stop someone determined to insert malicious code.
An approach that should work better than string matching is to use the ast module: parse the Python code, do your whitelist filtering on the tree (e.g. allow only basic operations), then compile and run the tree; a rough sketch is below.
See this nice example by Andrew Dalke on manipulating ASTs.
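A minimal sketch of the whitelist idea (the node types listed are just an example, not a complete or vetted policy):

import ast

ALLOWED = (ast.Module, ast.Expr, ast.BinOp, ast.UnaryOp,
           ast.Constant, ast.operator, ast.unaryop)

def check_and_run(source):
    tree = ast.parse(source, mode='exec')
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            raise ValueError('disallowed syntax: %s' % type(node).__name__)
    # only compile/execute once every node passed the whitelist
    exec(compile(tree, '<untrusted>', 'exec'), {'__builtins__': {}})

check_and_run('1 + 2 * 3')             # passes
# check_and_run('__import__("os")')    # would raise ValueError (Call/Name not allowed)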
Built-in functions/keywords to watch for:
eval
exec
__import__
open
file
input
execfile
print can be dangerous if you have one of those dumb shells that execute code on seeing certain output
stdin
__builtins__
globals() and locals() must be blocked, otherwise they can be used to bypass your rules
There's probably tons of others that I didn't think about.
Unfortunately, crap like this is possible...
object().__reduce__()[0].__globals__["__builtins__"]["eval"]("open('/tmp/l0l0l0l0l0l0l','w').write('pwnd')")
So it turns out that keywords, import restrictions, and the symbols in scope by default are not enough on their own; you need to verify the entire object graph...
Use a Virtual Machine instead of running it on a system that you are concerned about.
Without a sandboxed environment, it is impossible to prevent a Python file from doing harm to your system aside from not running it.
It is easy to create a Cryptominer, delete/encrypt/overwrite files, run shell commands, and do general harm to your system.
If you are on Linux, you should be able to use docker to sandbox your code.
For more information, see this GitHub issue: https://github.com/raxod502/python-in-a-box/issues/2.
I did come across this on GitHub, so something like it could be used, but that has a lot of limits.
Another approach would be to create another Python file which parses the original one, removes the bad code, and runs the file. However, that would still be hit-and-miss.

Are Python functions "compile" and "compiler.parse" safe (sandboxed)?

I plan to use these functions in a web environment, so my concern is whether they can be exploited and used to execute malicious software on the server.
Edit: I don't execute the result. I parse the AST tree and/or catch SyntaxError.
This is the code in question:
try:
    # compile the code and check for syntax errors
    compile(code_string, filename, "exec")
except SyntaxError, value:
    msg = value.args[0]
    (lineno, offset, text) = value.lineno, value.offset, value.text
    if text is None:
        return [{"line": 0, "offset": 0,
                 "message": u"Problem decoding source"}]
    else:
        line = text.splitlines()[-1]
        if offset is not None:
            offset = offset - (len(text) - len(line))
        else:
            offset = 0
        return [{"line": lineno, "offset": offset, "message": msg}]
else:
    # no syntax errors, check it with pyflakes
    tree = compiler.parse(code_string)
    w = checker.Checker(tree, filename)
    w.messages.sort(lambda a, b: cmp(a.lineno, b.lineno))
checker.Checker is the pyflakes class that checks the AST tree.
I think the more interesting question is what are you doing with the compiled functions? Running them is definitely unsafe.
I've tested the few exploits I could think of, and seeing as it's just a syntax checker (you can't redefine classes/functions, etc.), I don't think there is any way to get Python to execute arbitrary code at compile time.
If the resulting code or AST object is never evaluated, I think you are only subject to denial-of-service attacks.
If you are evaluating user inputed code, it is the same as giving shell access as the webserver user to every user.
They are not, but it's not too hard to find a subset of Python that can be sandboxed to a point. If you want to go down that road you need to parse that subset of Python yourself and intercept all calls, attribute lookups, and everything else involved. You also don't want to give users access to language constructs such as non-terminating loops.
Still interested? Head over to jinja2.sandbox :)
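For a flavor of what that looks like in practice, a minimal sketch (jinja2 templates are an expression language, not full Python):

from jinja2.sandbox import SandboxedEnvironment

env = SandboxedEnvironment()
template = env.from_string('sum is {{ values | sum }} for {{ values }}')
print(template.render(values=[1, 2, 3]))
# attribute tricks like {{ ''.__class__ }} are rejected by the sandbox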
compiler.parse and compile could most definitely be used for an attack if the attacker can control their input and the output is executed. In most cases, you are going to either eval or exec their output to make it run, so those are still the usual suspects; compile and compiler.parse (deprecated, BTW) just add another step between the malicious input and the execution.
EDIT: Just saw that you left a comment indicating that you are actually planning on using these on USER INPUT. Don't do that. Or at least, don't actually execute the result. That's a huge security hole for whoever ends up running that code. And if nobody's going to run it, why compile it? Since you clarified that you only want to check syntax, this should be fine. I would not store the output though as there's no reason to make anything easier for a potential attacker and being able to get arbitrary code onto your system is a first step.
If you do need to store it, I would probably favor a scheme similar to that commonly used for images where they are renamed in a non-predictable manner with the added step of making sure that it is not stored on the import path.
Yes, they can be maliciously exploited.
If you really want safe sandboxing, you could look at PyPy's sandboxing features, but be aware that sandboxing is not easy, and there may be better ways to accomplish whatever you are seeking.
Correction
Since you've updated your question to clarify that you're only parsing the untrusted input to AST, there is no need to sandbox anything: sandboxing is specifically about executing untrusted code (which most people probably assumed your goal was, by asking about sandboxing).
Using compile / compiler only for parsing this way should be safe: Python source parsing does not have any hooks into code execution. (Note that this is not necessarily true of all languages: for example, Perl cannot be (completely) parsed without code execution.)
The only other remaining risk is that someone may be able to craft some pathological Python source code that makes one of the parsers use runaway amounts of memory / processor time, but resource exhaustion attacks affect everything, so you'll just want to manage this as it becomes necessary. (For example, if your deployment is mission-critical and cannot afford a denial of service by an attacker armed with pathological source code, you can execute the parsing in a resource-limited subprocess).
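A sketch of that last idea (Unix-only, since it uses the resource module; the limits chosen here are arbitrary examples):

import resource
import subprocess
import sys

def _limit_resources():
    # example limits: 5 seconds of CPU, 256 MB of address space
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
    resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))

def parse_untrusted(path):
    """Return 0 if the file parses cleanly, non-zero on error or limit hit."""
    return subprocess.call(
        [sys.executable, '-c',
         'import ast, sys; ast.parse(open(sys.argv[1]).read())', path],
        preexec_fn=_limit_resources, timeout=10)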
