How to do a meaningful code-coverage analysis of my unit-tests? - python

I manage the testing for a very large financial pricing system. Recently our HQ have insisted that we verify that every single part of our project has a meaningful test in place. At the very least they want a system which guarantees that when we change something we can spot unintentional changes to other sub-systems. Preferably they want something which validates the correctness of every component in our system.
That's obviously going to be quite a lot of work! It could take years, but for this kind of project it's worth it.
I need to find out which parts of our code are not covered by any of our unit-tests. If I knew which parts of my system were untested then I could set about developing new tests which would eventually bring me towards my goal of complete test coverage.
So how can I go about running this kind of analysis? What tools are available to me?
I use Python 2.4 on 32-bit Windows XP.
UPDATE0:
Just to clarify: We have a very comprehensive unit-test suite (plus a separate and very comprehensive regtest suite which is outside the scope of this exercise). We also have a very stable continuous integration platform (built with Hudson) which is designed to split up and run standard Python unit-tests across our test facility: approx. 20 PCs built to the company spec.
The object of this exercise is to plug any gaps in our Python unittest suite (only) so that every component has some degree of unittest coverage. Other developers will be taking responsibility for non-Python components of the project (which are also outside of scope).
"Component" is intentionally vague: Sometime it will be a class, other time an entire module or assembly of modules. It might even refer to a single financial concept (e.g. a single type of financial option or a financial model used by many types of option). This cake can be cut in many ways.
"Meaningful" tests (to me) are ones which validate that the function does what the developer originally intended. We do not want to simply reproduce the regtests in pure python. Often the developer's intent is not immediatly obvious, hence the need to research and clarify anything which looks vague and then enshrine this knowledge in a unit-test which makes the original intent quite explicit.

For the code coverage alone, you could use coverage.py.
As for coverage.py vs figleaf:
figleaf differs from the gold standard of Python coverage tools ('coverage.py') in several ways. First and foremost, figleaf uses the same criterion for "interesting" lines of code as the sys.settrace function, which obviates some of the complexity in coverage.py (but does mean that your "loc" count goes down). Second, figleaf does not record code executed in the Python standard library, which results in a significant speedup. And third, the format in which the coverage information is saved is very simple and easy to work with.
You might want to use figleaf if you're recording coverage from multiple types of tests and need to aggregate the coverage in interesting ways, and/or control when coverage is recorded. coverage.py is a better choice for command-line execution, and its reporting is a fair bit nicer.
I guess both have their pros and cons.
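To give a concrete feel for the coverage.py side, here is a minimal sketch that runs a test suite under coverage measurement via the library's Python API (the test module name is a placeholder, and minor API details vary between coverage.py versions; the command-line interface is usually even simpler):

    # run_with_coverage.py -- minimal sketch: run the suite, then report coverage.
    import coverage
    import unittest

    cov = coverage.Coverage()   # older releases spell this coverage.coverage()
    cov.start()

    suite = unittest.TestLoader().loadTestsFromNames(["tests.test_pricing"])
    unittest.TextTestRunner(verbosity=2).run(suite)

    cov.stop()
    cov.save()
    cov.report(show_missing=True)   # console summary with untested line numbers
    cov.html_report()               # browsable HTML report (htmlcov/ by default)

The report with missing line numbers is usually the quickest way to spot modules with little or no coverage at all.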

The first step would be writing meaningful tests. If you write tests only meant to reach full coverage, you'll be counter-productive; it will probably mean you'll focus on the unit's implementation details instead of its expectations.
BTW, I'd use nose as the unittest framework (http://somethingaboutorange.com/mrl/projects/nose/0.11.1/); its plugin system is very good and leaves the coverage option to you (--with-coverage for Ned's coverage, --with-figleaf for Titus's one; support for coverage3 should be coming), and you can write plugins for your own build system, too.
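For example, a hedged invocation might look like the following (--cover-package, --cover-erase and --cover-html are options of nose's coverage plugin; mypackage is a placeholder for your own package):

    nosetests --with-coverage --cover-package=mypackage --cover-erase --cover-html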

FWIW, this is what we do. Since I don't know about your Unit-Test and Regression-Test setup, you have to decide yourself whether this is helpful.
Every Python package has UnitTests.
We automatically detect unit tests using nose. Nose automagically detects standard Python unit tests (basically everything that looks like a test), so we don't miss any. Nose also has a plug-in concept, so that you can produce, e.g., nicer output.
We strive for 100% coverage for unit-testing. To this end, we use Coverage to check, because a nose plugin provides the integration.
We have set up Eclipse (our IDE) to automatically run nose whenever a file changes so that the unit-tests always get executed, which shows code-coverage as a by-product.

"every single part of our project has a meaningful test in place"
"Part" is undefined. "Meaningful" is undefined. That's okay, however, since it gets better further on.
"validates the correctness of every component in our system"
"Component" is undefined. But correctness is defined, and we can assign a number of alternatives to component. You only mention Python, so I'll assume the entire project is pure Python.
Validates the correctness of every module.
Validates the correctness of every class of every module.
Validates the correctness of every method of every class of every module.
You haven't asked about line of code coverage or logic path coverage, which is a good thing. That way lies madness.
"guarantees that when we change something we can spot unintentional changes to other sub-systems"
This is regression testing. That's a logical consequence of any unit testing discipline.
Here's what you can do.
Enumerate every module. Create a unittest for that module that is just a unittest.main(). This should be quick -- a few days at most.
Write a nice top-level unittest script that uses a testLoader to load all unit tests in your tests directory and runs them through the text runner. At this point, you'll have a lot of files -- one per module -- but no actual test cases. Getting the testLoader and the top-level script to work will take a few days. It's important to have this overall harness working.
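As a hedged sketch of what those two pieces might look like (the module names and the flat tests/ layout are placeholders, and the tests directory needs an __init__.py so its modules are importable):

    # tests/test_pricing_engine.py -- per-module skeleton; real TestCases come later.
    import unittest

    if __name__ == "__main__":
        unittest.main()

    # run_all_tests.py -- top-level harness: load every tests/test_*.py and run it.
    import glob
    import os
    import unittest

    def build_suite():
        loader = unittest.TestLoader()
        suite = unittest.TestSuite()
        for path in glob.glob(os.path.join("tests", "test_*.py")):
            name = os.path.splitext(os.path.basename(path))[0]
            module = __import__("tests." + name, fromlist=[name])
            suite.addTests(loader.loadTestsFromModule(module))
        return suite

    if __name__ == "__main__":
        unittest.TextTestRunner(verbosity=2).run(build_suite())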
Prioritize your modules. A good rule is "most heavily reused". Another rule is "highest risk from failure". Another rule is "most bugs reported". This takes a few hours.
Start at the top of the list. Write a TestCase per class with no real methods or anything. Just a framework. This takes a few days at most. Be sure the docstring for each TestCase positively identifies the Module and Class under test and the status of the test code. You can use these docstrings to determine test coverage.
At this point you'll have two parallel tracks. You have to actually design and implement the tests. Depending on the class under test, you may have to build test databases, mock objects, all kinds of supporting material.
Testing Rework. Starting with your highest priority untested module, start filling in the TestCases for each class in each module.
New Development. For every code change, a unittest.TestCase must be created for the class being changed.
The test code follows the same rules as any other code. Everything is checked in at the end of the day. It has to run -- even if the tests don't all pass.
Give the test script to the product manager (not the QA manager, the actual product manager who is responsible for shipping product to customers) and make sure they run the script every day and find out why it didn't run or why tests are failing.
The actual running of the master test script is not a QA job -- it's everyone's job. Every manager at every level of the organization has to be part of the daily build script output. All of their jobs have to depend on "all tests passed last night". Otherwise, the product manager will simply pull resources away from testing and you'll have nothing.

Assuming you already have a relatively comprehensive test suite, there are tools for the python part. The C part is much more problematic, depending on tools availability.
For Python unit tests, the coverage tools already mentioned (coverage.py, figleaf) do the job.
For C code, it is difficult on many platforms because gprof, the GNU profiler, cannot handle code built with -fPIC. So you have to build every extension statically in this case, which is not supported by many extensions (see my blog post about numpy, for example). On Windows, there may be better code-coverage tools for compiled code, but that may require you to recompile the extensions with MS compilers.
As for the "right" code coverage, I think a good balance it to avoid writing complicated unit tests as much as possible. If a unit test is more complicated than the thing it tests, then it is a probably not a good test, or a broken test.

What's a more Pythonic way to test parts of my code?

I'm on Windows 10, Python 2.7.13 installed via Anaconda. Recently I've been writing a lot of scripts to read/write data from files to other files, move them around, and do some visualizations with matplotlib. My workflow has been having an Anaconda Prompt open next to Sublime Text, and I copy/paste individual lines into my workspace to test something. This doesn't feel like a "best practice", especially because I can't copy/paste multiple lines with indents, so I have to write them out manually twice. I'd really like to find a better way to work on this. What would you recommend changing?
There are several types of software testing that vary in their complexity and in what they test. Generally speaking, it is good practice to leverage what is known as unit testing. Unit testing is the methodology of writing groups of tests where each test is responsible for testing a small "unit" of code. By only testing individual pieces of your project with each test, you get a very granular idea of which parts of your project are working correctly and which are not. It also allows your tests to be repeatable, source-controlled, and automated. Typically each "unit" that a test is written for is a single callable item such as a function or a method of a class.
In order to get the most out of unit testing, your functions and methods need to be single-responsibility entities. This means they should perform one task and one task only, which makes them much easier to test. Python's standard library has a built-in package, appropriately named unittest, to perform this type of testing.
I would start by looking at the unittest package's documentation. It provides more explanation of unit testing and how to use the package in your Python code. You can also use the coverage package to determine how much of your code is tested via unit tests.
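As a minimal sketch (the function here is made up purely for illustration -- substitute whatever your scripts actually do):

    # test_conversions.py -- a tiny self-contained unittest module.
    import unittest

    def celsius_to_fahrenheit(c):
        # Stand-in for a function your script would normally import.
        return c * 9.0 / 5.0 + 32.0

    class ConversionTests(unittest.TestCase):
        def test_freezing_point(self):
            self.assertEqual(celsius_to_fahrenheit(0), 32.0)

        def test_boiling_point(self):
            self.assertEqual(celsius_to_fahrenheit(100), 212.0)

    if __name__ == "__main__":
        unittest.main()

Running python test_conversions.py executes the tests from the command line, which replaces the copy/paste-into-the-prompt workflow; running the same file under a recent coverage.py (coverage run test_conversions.py, then coverage report) shows which lines were exercised.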
I hope this helps.

How To Determine Which Unit Test Covered a Function or Method

From a Python perspective, how can one determine which unit test(s) covered a function or method -- or, more generally, which test in the suite hit any given line of code when run by the test runner? It seems reasonable that this information should be at hand, given that the coverage tools know the specific code that was hit, but I cannot find any way to get at it (I am using py.test as my test runner with the coverage and pytest-cov modules).
One approach I have found is to just put a pdb.set_trace call into the code, but it would be really helpful if I could find a more elegant way that didn't require modifying the code under test.
Coverage.py doesn't yet provide this feature, but there's an open ticket where we are kicking around ideas: https://github.com/nedbat/coveragepy/issues/170
To read the old history of this issue, check out the old ticket (in the BitBucket tracker)
Smother is a wrapper utility around coverage.py that measures code coverage separately for each test in a test suite. Its main features include:
Fast and reliable coverage tracking using coverage.py.
Ability to lookup which tests visit an arbitrary section of your application code.
Ability to convert version control diffs into a subset of affected tests to rerun.
It supports py.test and nose.
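The underlying idea can also be sketched directly against coverage.py's Python API: measure each test in isolation and record which ones executed a given line. This is only a rough illustration (the test collection and the target file/line are placeholders, and the filename must match the path coverage records, usually an absolute path):

    # which_tests_hit.py -- crude "which tests cover this line?" sketch.
    import coverage

    def tests_covering(filename, line_no, tests):
        """tests is a list of (name, callable) pairs; returns names hitting the line."""
        hits = []
        for name, run_test in tests:
            cov = coverage.Coverage()
            cov.start()
            try:
                run_test()
            finally:
                cov.stop()
            executed = cov.get_data().lines(filename) or []
            if line_no in executed:
                hits.append(name)
            cov.erase()
        return hits

Smother does essentially this bookkeeping for you, far more efficiently.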
I don't know whether this code is still functional in the current ecosystem (in particular with the current coverage version and the nose / nose2 plugin APIs), but the figleaf-sections plugin from the figleaf package provides this feature.
http://darcs.idyll.org/~t/projects/figleaf/doc/
If I recall correctly, it was a nice proof of concept and certainly useful, but I think there were a few rough edges. I'd love it if somebody picked up the idea and really made it work smoothly!

Is it acceptable practice to unit-test a program in a different language?

I have a static library I created in C++, and would like to test it using some driver code.
I noticed that one of my professors likes to do his tests using Python, but he simply executes the program (not a library in this case, but an executable) using random test arguments.
I would like to take this approach, but I realized that this is a library and doesn't have a main function; that would mean I should either write a Driver.cpp, or wrap the library into Python using SWIG or Boost.Python.
I'm planning to do the latter because it seems more fun, but logically I feel that there are going to be more bugs when trying to wrap a library in a different language just to test it, rather than testing it in its native language.
Is testing programs in a different language an accepted practice in the real world, or is this bad practice?
I'd say that it's best to test the API that your users will be exposed to. Other tests are good to have as well, but that's the most important aspect.
If your users are going to write C/C++ code linking to your library, then it would be good to have tests making use of your library the same way.
If you are going to ship a Python wrapper (why not?) then you should have Python tests.
Of course, there is a convenience aspect to this, as well. It may be easier to write tests in Python, and you might have time constraints that make it more appealing, etc.
I guess what I'm saying is: There's nothing inherently wrong with tests being in a different language from the code under test (that's totally normal for testing a REST API, for instance), but make sure you have tests for the public-facing API at a minimum.
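As a hedged illustration of that last point: if the library were wrapped into a Python extension module (say via SWIG, producing a hypothetical module called mylib with an add function), the Python-side tests could be ordinary unittest code exercising the public API exactly as a Python consumer would:

    # test_mylib.py -- tests against the public API of a hypothetical wrapped module.
    import unittest
    import mylib  # the SWIG / Boost.Python wrapper; the name is a placeholder

    class PublicApiTests(unittest.TestCase):
        def test_known_value(self):
            self.assertEqual(mylib.add(2, 3), 5)

        def test_rejects_bad_input(self):
            self.assertRaises(TypeError, mylib.add, "2", 3)

    if __name__ == "__main__":
        unittest.main()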
Aside, on terminology:
I don't think the types of tests you are describing are "unit tests" in the usual sense of the term. Probably "functional test" would be more accurate.
A unit test typically tests a very small component - such as a function call - that might be one piece of larger functionality. Unit tests like these are often "white box" tests, so you can see the inner workings of your code.
Testing something from a user's point of view (such as your professor's command-line tests) is "black box" testing, and in these examples it sits at a more functional level rather than the "unit" level.
I'm sure plenty of people may disagree with that, though - it's not a rigidly-defined set of terms.
A few things to keep in mind:
If you are writing tests as you code, then, by all means, use whatever language works best to give you rapid feedback. This enables fast test-code cycles (and is fun as well). BUT.
Always have well-written tests in the language of the consumer. How is your client/consumer going to call your functions? What language will they be using? Using the same language minimizes integration issues later on in the life-cycle.
It really depends on what it is you are trying to test. It almost always makes sense to write unit tests in the same language as the code you are testing so that you can construct the objects under test or invoke the functions under test, both of which can be most easily done in the same language, and verify that they work correctly. There are, however, cases in which it makes sense to use a different language, namely:
Integration tests that run a number of different components or applications together.
Tests that verify compilation or interpretation failures, which could not be tested in the language itself, since you are validating that an error occurs at the language level.
An example of #1 might be a program that starts up multiple different servers connected to each other, issues requests to the server, and verifies those responses. Or, as a simpler example, a program that simply forks an application under test as a subprocess and verifies that it produces the expected outputs for a given input.
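A hedged sketch of that simpler case, with a made-up executable name and expected output:

    # test_cli_blackbox.py -- drive the application as a subprocess and check output.
    import subprocess
    import unittest

    class ToolOutputTest(unittest.TestCase):
        def test_known_input_produces_expected_output(self):
            # "./mytool" and its arguments are placeholders for the real binary.
            proc = subprocess.Popen(["./mytool", "--input", "fixture.txt"],
                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            out, err = proc.communicate()
            self.assertEqual(proc.returncode, 0)
            self.assertIn("expected result", out.decode())

    if __name__ == "__main__":
        unittest.main()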
An example of #2 might be a program that verifies that a certain piece of C++ code will produce a static assertion failure or that a particular template instantiation which is intentionally disallowed will result in a compilation failure if someone attempts to use it.
To answer your larger question, it is not bad practice per se to write tests in a different language. Whatever makes the tests more convenient to write, easier to understand, more robust to changes in implementation, more sensitive to regressions, or better on any of the other properties that define good testing is a good justification to write the tests one way vs another. If that means writing the tests in another language, then go for it. That being said, small unit tests typically need to be able to invoke the item under test directly, which, in most cases, means writing the unit tests in the same language as the component under test.
I would say it depends on what you're actually trying to test. For true unit testing it is, I think, best to test in the same language, or at least a binary-compatible language (e.g. testing Java with Groovy -- I use Spock, which is Groovy-based, to unit-test my Java code, since I can intermingle the Java with the Groovy), but if you are testing results, then I think it's fair to switch languages.
For example, I have tested the expected results of running a Perl application, given a specific set of data, via nose in Python. This works because I'm not unit testing the Perl code per se, but the outcomes of that Perl code.
In that case, to unit test actual Perl functions that are part of the application, I would use a Perl-based test framework such as Test::More.
Why not? It's an awesome idea, because it forces you to treat the unit as a black box.
Of course there may be technical issues involved: what if you need to mock some parts of the unit under test? That may be difficult in a different language.
This is a common practice for integration tests, though; I've seen lots of programs driven from external tools, such as a website driven by Selenium or an application driven by Cucumber. Both of those can be considered the same as a custom Python script.
If you consider the difference between integration testing and unit testing is the number of things under test at any given time, the only reason why you shouldn't do this is tool support.

How do I test/refactor my tests?

I have a test suite for my app. As the test suite grew organically, the tests have a lot of repeated code which could be refactored.
However, I would like to ensure that the test suite's behaviour doesn't change with the refactor. How can I verify that my tests are invariant under the refactor?
(I am using Python + unittest), but I guess the answer to this can be language-agnostic.
The real test for the tests is the production code.
An effective way to check that a test code refactor hasn't broken your tests would be to do Mutation Testing, in which a copy of the code under test is mutated to introduce errors in order to verify that your tests catch the errors. This is a tactic used by some test coverage tools.
I haven't used it (and I'm not really a python coder), but this seems to be supported by the Python Mutant Tester, so that might be worth looking at.
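A toy sketch of the mutation-testing idea (the function and the hand-made mutant below are purely illustrative): a deliberately introduced bug in the code under test should make at least one test fail; a mutant that survives every test reveals a gap in the suite.

    import unittest

    def is_adult(age):
        return age >= 18          # original code under test

    def is_adult_mutant(age):
        return age > 18           # hand-made mutant: >= replaced with >

    class AdultCheckTests(unittest.TestCase):
        def test_boundary(self):
            # Passes against is_adult, but would fail if pointed at the mutant,
            # so this test "kills" that particular mutation.
            self.assertTrue(is_adult(18))

    if __name__ == "__main__":
        unittest.main()

Tools such as the Python Mutant Tester mentioned above automate generating and evaluating such mutants.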
Coverage.py is your friend.
Move all the tests you want to refactor into "system tests" (or some such tag). Refactor the tests you want (you would be doing unit tests here, right?) and monitor the coverage:
After running your new unit tests but before running the system tests
After running both the new unit tests and the system tests.
In the ideal case, the coverage would be the same or higher, and then you can trash your old system tests.
FWIW, py.test provides a mechanism for easily tagging tests and running only specific ones, and it is compatible with unittest2 tests.
Interesting question - I'm always keen to hear discussions of the type "how do I test the tests?!". And good points from #marksweb above too.
It's always a challenge to check that your tests are actually doing what you want them to do and testing what you intend, but it's good to get this right and do it properly. I always try to follow the rule of thumb that testing should make up about a third of the development effort in any project... regardless of the time constraints, pressures and problems that inevitably crop up.
If you intend to continue and grow your project, have you considered refactoring as you describe, but in a way that creates a proper test framework allowing test-driven development (TDD) of any future additions of functionality or general expansion of the project?
Although you have mentioned Python, I would like to comment on how refactoring is applied in Smalltalk. Most modern Smalltalk implementations include a "Refactoring Browser" integrated into a System Browser to restructure source code. The RB includes a rewrite framework to perform the transformations you asked about dynamically, preserving system behavior and stability. A way to use it is to open a scoped browser, apply refactorings, and review/edit the changes with a diff tool before committing. I don't know about the maturity of Python refactoring tools, but it took the Smalltalk community many iteration cycles (years) to produce such an amazing piece of software.
Don Roberts and John Brant wrote one of the first refactoring browsers, which now serves as the standard for refactoring tools. There are videos demonstrating some of these features. For promoting a method into a superclass, in Pharo you just select the method and the "pull up" refactoring menu item. The rule will detect the duplicated sub-implementors and let you review the proposed deletions before execution. Refactorings can be applied regardless of the test code.
In theory you could write a test for the test, mocking the actual object under test. But I guess that is just way too much work and not worth it.
So what you are left with are some strategies that will help, but won't make this fail-safe.
Work very carefully and slowly. Use the features of your IDE as much as possible in order to limit the chance of human error.
Work in pairs. A partner looking over your shoulder might just spot the glitch that you missed.
Copy the test, then refactor it. When done, introduce errors into the production code to ensure that both tests find the problem in the same (or equivalent) ways. Only then remove the original test.
The last step can be done by tools, although I don't know the python flavors. The keyword to search for is 'mutation testing'.
Having said all that, I'm personally satisfied with steps 1+2.
I can't see an easy way to refactor a test suite, and depending on the extent of your refactor you're obviously going to have to change the test suite. How big is your test suite?
Refactoring properly takes time and attention to detail (and a lot of Ctrl+C Ctrl+V!). Whenever I've refactored my tests I don't try and find any quick ways of doing things, besides find & replace, because there is too much risk involved.
You're best off doing things properly and manually, albeit slowly, if you want to keep the quality of your tests.
Don't refactor the test suite.
The purpose of refactoring is to make it easier to maintain the code, not to satisfy some abstract criterion of "code niceness". Test code doesn't need to be nice, it doesn't need to avoid repetition, but it does need to be thorough. Once you have a test that is valid (i.e. it really does test necessary conditions on the code under test), you should never remove it or change it, so test code doesn't need to be easy to maintain en masse.
If you like, you can rewrite the existing tests to be nice, and run the new tests in addition to the old ones. This guarantees that the new combined test suite catches all the errors that the old one did (and maybe some more, as you expand the new code in future).
There are two ways that a test can be deemed invalid -- you realise that it's wrong (i.e. it sometimes fails falsely for correct code under test), or else the interface under test has changed (to remove the API tested, or to permit behaviour that previously was a test failure). In that case you can remove a test from the suite. If you realise that a whole bunch of tests are wrong (because they contain duplicated code that is wrong), then you can remove them all and replace them with a refactored and corrected version. You don't remove tests just because you don't like the style of their source.
To answer your specific question: to test that your new test code is equivalent to the old code, you would have to ensure (a) all the new tests pass on your currently-correct-as-far-as-you-known code base, which is easy, but also (b) the new tests detect all the errors that the old tests detect, which is usually not possible because you don't have on hand a suite of faulty implementations of the code under test.
Test code can be the best low-level documentation of your API, since it does not go out of date as long as the tests pass and are correct. But messy test code doesn't serve that purpose very well, so refactoring is essential.
Your tested code might also change over time, and so will the tests. If you want that to be smooth, code duplication must be minimized and readability is key.
Tests should be easy to read, should test one thing at a time, and should make the following explicit:
what are the preconditions?
what is being executed?
what is the expected outcome?
If that is considered, it should be pretty safe to refactor the test code. One step at a time and, as #Don Ruby mentioned, let your production code be the test for the test.
For many refactoring you can often safely rely on advanced IDE tooling – if you beware of side effects in the extracted code.
Although I agree that refactoring without proper test coverage should be avoided, I think writing tests for your tests is almost absurd in usual contexts.
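As a small sketch of the precondition / execution / expected-outcome structure mentioned above (the class under test is hypothetical):

    import unittest

    class ShoppingCart(object):
        """Hypothetical class under test."""
        def __init__(self):
            self.items = []
        def add(self, name, price):
            self.items.append((name, price))
        def total(self):
            return sum(price for _, price in self.items)

    class CartTotalTest(unittest.TestCase):
        def test_total_sums_item_prices(self):
            # Preconditions: a cart holding two known items.
            cart = ShoppingCart()
            cart.add("apple", 0.40)
            cart.add("bread", 1.20)
            # Execution: compute the total.
            result = cart.total()
            # Expected outcome: the sum of the two prices.
            self.assertAlmostEqual(result, 1.60)

    if __name__ == "__main__":
        unittest.main()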

Using Python code coverage tool for understanding and pruning back source code of a large library

My project targets a low-cost and low-resource embedded device. I am dependent on a relatively large and sprawling Python code base, of which my use of its APIs is quite specific.
I am keen to prune the code of this library back to its bare minimum, by executing my test suite within a coverage tools like Ned Batchelder's coverage or figleaf, then scripting removal of unused code within the various modules/files. This will help not only with understanding the libraries' internals, but also make writing any patches easier. Ned actually refers to the use of coverage tools to "reverse engineer" complex code in one of his online talks.
My question to the SO community is whether people have experience of using coverage tools in this way that they wouldn't mind sharing? What are the pitfalls if any? Is the coverage tool a good choice? Or would I be better off investing my time with figleaf?
The end-game is to be able to automatically generate a new source tree for the library, based on the original tree, but only including the code actually used when I run nosetests.
If anyone has developed a tool that does a similar job for their Python applications and libraries, it would be terrific to get a baseline from which to start development.
Hopefully my description makes sense to readers...
What you want isn't "test coverage", it is the transitive closure of "can call" from the root of the computation. (In threaded applications, you have to include "can fork").
You want to designate some small set (perhaps only 1) of functions that make up the entry points of your application, and want to trace through all possible callees (conditional or unconditional) of that small set. This is the set of functions you must have.
Python makes this very hard in general (IIRC, I'm not a deep Python expert) because of dynamic dispatch and especially because of "eval". Reasoning about which functions can get called can be pretty tricky for a static analyzer applied to a highly dynamic language.
One might use test coverage as a way to seed the "can call" relation with specific "did call" facts; that could catch a lot of dynamic dispatches (dependent on your test suite coverage). Then the result you want is the transitive closure of "can or did" call. This can still be erroneous, but is likely to be less so.
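A minimal sketch of that "seed with did-call facts, then take the transitive closure" idea, assuming you have already assembled a call graph as a dict mapping each function name to the names it can (or did) call:

    def reachable(call_graph, roots):
        """Return every function reachable from the roots of the call graph."""
        needed = set()
        stack = list(roots)
        while stack:
            fn = stack.pop()
            if fn in needed:
                continue
            needed.add(fn)
            stack.extend(call_graph.get(fn, ()))
        return needed

    # Tiny hypothetical graph: anything not reachable from "main" is a pruning candidate.
    graph = {"main": ["price_option"], "price_option": ["discount"], "unused": []}
    print(sorted(reachable(graph, ["main"])))   # ['discount', 'main', 'price_option']

Everything outside the returned set is a candidate for removal, subject to the caveats about dynamic dispatch above.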
Once you get a set of "necessary" functions, the next problem will be removing the unnecessary functions from the source files you have. If the number of files you start with is large, the manual effort to remove the dead stuff may be pretty high. Worse, you're likely to revise your application, and then the answer as to what to keep changes. So for every change (release), you need to reliably recompute this answer.
My company builds a tool that does this analysis for Java packages (with appropriate caveats regarding dynamic loads and reflection): the input is a set of Java files and (as above) a designated set of root functions. The tool computes the call graph, and also finds all dead member variables and produces two outputs: a) the list of purportedly dead methods and members, and b) a revised set of files with all the "dead" stuff removed. If you believe a), then you use b). If you think a) is wrong, then you add the elements listed in a) to the set of roots and repeat the analysis until you think a) is right. To do this, you need a static analysis tool that parses Java, computes the call graph, and then revises the code modules to remove the dead entries. The basic idea applies to any language.
You'd need a similar tool for Python, I'd expect.
Maybe you can stick to just dropping files that are completely unused, although that may still be a lot of work.
As others have pointed out, coverage can tell you what code has been executed. The trick for you is to be sure that your test suite truly exercises the code fully. The failure case here is over-pruning because your tests skipped some code that will really be needed in production.
Be sure to get the latest version of coverage.py (v3.4): it adds a new feature to indicate files that are never executed at all.
BTW: for a first-cut prune, Python provides a neat trick: remove all the .pyc files in your source tree, then run your tests. Files that still have no .pyc file were clearly not executed!
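A rough sketch of automating that trick (this assumes Python 2-style byte-compilation, where each imported module leaves a .pyc next to its .py source; under Python 3 the bytecode goes into __pycache__ instead, so the check would need adjusting):

    # find_unexecuted.py -- after deleting all .pyc files and running the tests,
    # list the .py files that never got byte-compiled (i.e. were never imported).
    import os

    def unexecuted_modules(root):
        missing = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                if name.endswith(".py") and name + "c" not in filenames:
                    missing.append(os.path.join(dirpath, name))
        return missing

    if __name__ == "__main__":
        for path in unexecuted_modules("path/to/library"):   # placeholder path
            print(path)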
I haven't used coverage for pruning out, but it seems like it should do well. I've used the combination of nosetests + coverage, and it worked better for me than figleaf. In particular, I found the html report from nosetests+coverage to be helpful -- this should be helpful to you in understanding where the unused portions of the library are.
