Looking to improve the quality of a fairly large Python project. I am happy with the types of warnings PyLint gives me. However, they are just too numerous and hard to enforce across a large organization. Also, I believe that some code is more critical/sensitive than other code with respect to where the next bug may come from. For example, I would like to spend more time validating a library method that is used by 100 modules than a script that was last touched 2 years ago and may not be used in production. It would also be interesting to know which modules are frequently updated.
Is anyone familiar with tools for Python or otherwise that help with this type of analysis?
Your problem is similar to the one I answered over at SQA: https://sqa.stackexchange.com/a/3082. That problem involved Java, which made the tooling a bit easier, but I have a number of suggestions below.
A number of other answers suggest that there are no good runtime tools for Python. I disagree, in several ways:
Coverage tools work very well
Based on my experience with tooling in Java, static and dynamic analysis tools in Python are weaker than in a strongly typed, less dynamic language, but they will work more than well enough to give you good heuristics here - unless you use an unusually large, pathological number of dynamic features (adding and removing methods, intercepting method and property invocations, playing with import, manually modifying the namespace), in which case any problems you have may well be associated with this dynamism...
Pylint picks up simpler problems and will not detect problems with dynamic class/instance modifications or decorators - so it doesn't matter that the metric tools don't measure these either.
In any case, where you can usefully focus can be determined by much more than a dependency graph.
Heuristics for selecting code
I find that there are a number of different considerations for selecting code for improvement which work both individually and together. Remember that, at first, all you need to do is find a productive seam of work - you don't need to find the absolutely worst code before you start.
Use your judgement.
After a few cycles through the codebase, you will have a huge amount of information and be much better positioned to continue your work - if indeed more needs to be done.
That said, here are my suggestions:
High value to the business: For example any code that could cost your company a lot of money. Many of these may be obvious or widely known (because they are important), or they may be detected by running the important use cases on a system with the run-time profiler enabled. I use Coverage.
Static code metrics: There are a lot of metrics, but the ones that concern us are:
High afferent coupling. This is code that a lot of other files depend on. While I don't have a tool that directly outputs this, snakefood is a good way to dump the dependencies to a file, one line per dependency, each being a tuple of afferent and efferent file. Computing the afferent coupling value from that dump is then a simple exercise - see the sketch after these metrics.
High McCabe (cyclomatic) complexity: This is more complex code. PyMetrics seems to produce this measure although I have not used the tool.
Size: You can get a surprising amount of information by viewing the size of your project with a visualiser (e.g. https://superuser.com/questions/8248/how-can-i-visualize-the-file-system-usage-on-windows or https://superuser.com/questions/86194/good-program-to-visualize-file-system-usage-on-mac?lq=1; Linux has KDirStat and Filelight). Large files are a good place to start, as fixing one file fixes many warnings.
Note that these tools are file-based. This is probably fine enough resolution, since you mention the project itself has hundreds of modules (files).
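To make that "simple exercise" concrete, here is a minimal sketch. It assumes the dependencies were dumped with something like sfood myproject > deps.txt and that each line is a Python tuple of the form ((root, from_file), (root, to_file)); the file name and the exact output format are assumptions, so adjust them to whatever your snakefood version actually emits.

    # Sketch: compute afferent coupling (how many files depend on each file)
    # from a snakefood dump. deps.txt is an assumed file name.
    import ast
    from collections import defaultdict

    afferent = defaultdict(set)  # target file -> files that depend on it

    with open("deps.txt") as deps:
        for line in deps:
            line = line.strip()
            if not line:
                continue
            (_, from_file), (_, to_file) = ast.literal_eval(line)
            if to_file is not None:  # snakefood may emit unresolved targets
                afferent[to_file].add(from_file)

    # Most depended-upon files first: good candidates for extra scrutiny.
    for target, dependents in sorted(afferent.items(), key=lambda kv: -len(kv[1])):
        print(len(dependents), target)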
Changes frequently: Code that changes frequently is highly suspect. The code may:
Historically have had many defects, and empirically may continue to do so
Be undergoing changes from feature development (high number of revisions in your VCS)
Find areas of change using a VCS visualisation tool such as those discussed later in this answer.
Uncovered code: Code not covered by tests.
If you run (or can run) your unit tests, your other automated tests and typical user tests with coverage, take a look at the packages and files with next to no coverage (a sketch using the coverage API follows the two points below). There are two logical reasons why there is no coverage:
The code is needed (and important) but not tested at all (at least automatically). These areas are extremely high risk
The code may be unused and is a candidate for removal.
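If you would rather drive coverage from a script than from the command line, a minimal sketch looks like this; it assumes the coverage package is installed and that your tests live in a tests directory that unittest can discover (both assumptions - adjust to your layout).

    # Sketch: run the test suite under coverage and print a per-file report.
    import unittest
    import coverage

    cov = coverage.Coverage()
    cov.start()

    suite = unittest.defaultTestLoader.discover("tests")  # assumed test directory
    unittest.TextTestRunner(verbosity=1).run(suite)

    cov.stop()
    cov.save()

    # Files near the bottom of this report are either untested or unused.
    cov.report(show_missing=False)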
Ask other developers
You may be surprised at the 'smell' metrics you can gather by having a coffee with the longer-serving developers. I bet they will be very happy if someone cleans up a dirty area of the codebase where only the bravest souls will venture.
Visibility - detecting changes over time
I am assuming that your environment has a DVCS (such as Git or Mercurial) or at least a VCS (eg SVN). I hope that you are also using an issue or bug tracker of some kind. If so, there is a huge amount of information available. It's even better if developers have reliably checked in with comments and issue numbers. But how do you visualise it and use it?
While you can tackle the problem on a single desktop, it is probably a good idea to set up a Continuous Integration (CI) environment, perhaps using a tool like Jenkins. To keep the answer short, I will assume Jenkins from now on. Jenkins comes with a large number of plugins that really help with code analysis. I use:
py.test with JUnit test output picked up by the JUnit test report Jenkins plugin
Coverage with the Cobertura plugin
SLOCCount and SLOCCount plugin
Pylint and Violations plugin
Apparently there is a plugin for McCabe (cyclomatic) complexity for Python, although I have not used it. It certainly looks interesting.
This gives me visibility of changes over time, and I can drill in from there. For example, suppose PyLint violations start increasing in a module - I have evidence of the increase, and I know the package or file in which this is occurring, so I can find out who's involved and go speak with them.
If you need historic data and you have just installed Jenkins, see if you can run a few manual builds that start at the beginning of the project and take a series of jumps forward in time until the present. You can choose milestone release tags (or dates) from the VCS.
Another important area, as mentioned above, is detecting the loci of changes in the code base. I have really liked Atlassian Fisheye for this. Besides being really good at searching for commit messages (eg bug id) or file contents at any point in time, it allows me to easily see metrics:
Linecount by directory and subdirectory
Committers at any point in time or in specific directories and/or files
Patterns of committal, both by time and also location in the source code
I'm afraid you are mostly on your own.
If you have a decent set of tests, look at code coverage and dead code.
If you have a decent profiling setup, use that to get a glimpse of what's used more.
In the end, it seems you are more interested in fan-in/fan-out analysis. I'm not aware of any good tools for Python, primarily because static analysis is horribly unreliable against such a dynamic language, and so far I haven't seen any statistical analysis tools.
I reckon that this information is sort of available in JIT compilers -- whichever (function, argument types) combinations are in the cache (i.e. compiled) are the ones used the most. Whether or not you can get this data out of e.g. PyPy, I really don't have a clue.
Source control tools can give a good indication of frequently updated modules - often indicating trouble spots.
If you don't have source control but the project is run from a shared location, delete all the __pycache__ folders or .pyc files. Over time, watch which files get recreated; that indicates which ones are being used.
Analysing the Python imports printed when running from particular entry points with
python -v entry_point
may give some insight into which modules are being used. However, if you have known entry points, you should try the coverage module.
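As a rough sketch of that idea (the entry-point file name and the project package name are placeholders), you can capture the -v output programmatically and keep only the import lines that mention your own code:

    # Sketch: run an entry point under `python -v` and list the project modules
    # that actually get imported. entry_point.py and "myproject" are placeholders.
    import subprocess
    import sys

    result = subprocess.run(
        [sys.executable, "-v", "entry_point.py"],
        capture_output=True,
        text=True,
    )

    # `python -v` writes its import trace to stderr.
    for line in result.stderr.splitlines():
        if line.startswith("import ") and "myproject" in line:
            print(line)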
For a more intrusive solution, consider setting up project-wide logging. You can log metrics easily enough, even over distributed programs.
I agree with the others, in that I have yet to come across a good runtime analysis tool for Python that will do this.
There are some ways to deal with it, but none are trivial.
The most robust, I think, would be to get the Python source and recompile the binaries with some sort of built-in runtime logging. That way you could just slide it into the existing environment without any code changes to your project.
Of course, that isn't exactly trivial to do, but it has the bonus that you might some day be able to get that merged back into the trunk for future generations and what not.
For non-recompile approaches, the first place I would look is the profile library's deterministic profiling section.
How you implement it will be heavily dependent on how your environment is set up. Do you have many individual scripts and projects run independently of one another, or just the one main script or module or package that gets used by everybody else, and you just want to know what parts of it can be trimmed out to make maintenance easier?
Is it a load once, run forever kind of set up, or a situation where you just run scripts atomically on some sort of schedule?
You could implement project-wide logging (as mentioned in Hardbyte's answer), but that would require going through the project and adding logging lines to all of your code. If you do that, you may as well just do it using the built-in profiler, I think.
Have a look at sys.setprofile: it allows you to install a profiler function.
Its usage is detailed in the profile module documentation: http://docs.python.org/library/profile.html#profile.
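A minimal sketch of such a profiler hook, counting calls per function (the example() body is just a placeholder for whatever you actually want to observe):

    # Sketch: count how often each Python function is called via sys.setprofile.
    import sys
    from collections import Counter

    call_counts = Counter()

    def tracer(frame, event, arg):
        # 'call' fires each time a Python-level function is entered.
        if event == "call":
            code = frame.f_code
            call_counts[(code.co_filename, code.co_name)] += 1

    def example():  # placeholder for your real entry point
        return [x * x for x in range(10)]

    sys.setprofile(tracer)
    try:
        example()
    finally:
        sys.setprofile(None)

    for (filename, funcname), count in call_counts.most_common(10):
        print(count, funcname, "in", filename)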
If you cannot profile your application, you will be bound to the coverage approach.
Another thing you might have a look at is decorators; you can write a debugging decorator and apply it to a set of functions you suspect. Take a look here to see how to apply the decorator to an entire module.
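A hedged sketch of that approach: a call-logging decorator plus a helper that applies it to every function defined in a module (the module name in the usage comment is hypothetical):

    # Sketch: a debugging decorator that logs each call, applied to a whole module.
    import functools
    import inspect
    import logging

    logging.basicConfig(level=logging.DEBUG)

    def log_calls(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logging.debug("called %s.%s", func.__module__, func.__name__)
            return func(*args, **kwargs)
        return wrapper

    def instrument_module(module):
        # Wrap only functions defined in the module itself, not imported ones.
        for name, obj in inspect.getmembers(module, inspect.isfunction):
            if obj.__module__ == module.__name__:
                setattr(module, name, log_calls(obj))

    # Usage (hypothetical module name):
    # import suspect_module
    # instrument_module(suspect_module)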
You might also take a look at pycallgraph; while it will not generate quite what you want, it shows you how often one function calls another.
If your code runs on user input this will be hard, since you would have to simulate 'typical' usage.
There is not much more to tell you; just remember 'profiling' as a keyword.
Pylint sometimes gives warnings that (after careful consideration) are not justified. In that case, it is useful to make use of the special # pylint: disable=X0123 comments (where X0123 is the actual error/warning message number) if the code cannot be refactored to avoid triggering the warning.
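For example (the message name here is just an illustration of the syntax):

    # Suppress one considered-and-rejected warning on a single line only.
    def callback(event, context):  # pylint: disable=unused-argument
        """`context` is required by the caller's interface but not used here."""
        return event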
I'd like to second Hardbyte's mention of using your source control logs to see which files are most often changed.
If you are working on a system that has find, grep and sort installed, the following is a way to check which file imports what:
find . -name '*.py' -exec grep -EH '^import|^from .* import' {} \+| sort |less
To find the most popular imports across all files;
find . -name '*.py' -exec grep -Eh '^import|^from .* import' {} \+ | sort | uniq -c | sort -rn | less
These two commands should help you find the most-used modules from your project.
Related
So you've got some legacy code lying around in a fairly hefty project. How can you find and delete dead functions?
I've seen these two references: Find unused code and Tool to find unused functions in php project, but they seem specific to C# and PHP, respectively.
Is there a Python tool that'll help you find functions that aren't referenced anywhere else in the source code (notwithstanding reflection/etc.)?
In Python you can find unused code by using dynamic or static code analyzers. Two examples for dynamic analyzers are coverage and figleaf. They have the drawback that you have to run all possible branches of your code in order to find unused parts, but they also have the advantage that you get very reliable results.
Alternatively, you can use static code analyzers that just look at your code, but don't actually run it. They run much faster, but due to Python's dynamic nature the results may contain false positives.
Two tools in this category are pyflakes and vulture. Pyflakes finds unused imports and unused local variables. Vulture finds all kinds of unused and unreachable code. (Full disclosure: I'm the maintainer of Vulture.)
The tools are available in the Python Package Index https://pypi.org/.
I'm not sure if this is helpful, but you might try using the coverage, figleaf or other similar modules, which record which parts of your source code are used as you actually run your scripts/application.
Because of the fairly strict way Python code is laid out, would it be that hard to build a list of functions based on a regex looking for def function_name(..)?
Then search for each name and tot up how many times it features in the code. It wouldn't naturally take comments into account, but as long as you're only looking closely at functions with fewer than two or three occurrences...
It's a bit Spartan, but it sounds like a nice sleepy-weekend task =)
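A rough sketch of that Spartan approach (run from the project root, which is an assumption; it will of course be fooled by comments, strings and dynamic calls):

    # Sketch: list every `def name(...)` and count how often each name appears
    # anywhere in the source tree. A total of 1 means only the definition itself.
    import os
    import re
    from collections import Counter

    DEF_RE = re.compile(r"^\s*def\s+(\w+)\s*\(", re.MULTILINE)

    sources = {}
    for root, _, files in os.walk("."):
        for name in files:
            if name.endswith(".py"):
                path = os.path.join(root, name)
                with open(path, encoding="utf-8", errors="replace") as handle:
                    sources[path] = handle.read()

    defined = set()
    for text in sources.values():
        defined.update(DEF_RE.findall(text))

    counts = Counter()
    for text in sources.values():
        for func in defined:
            counts[func] += len(re.findall(r"\b%s\b" % re.escape(func), text))

    for func, total in sorted(counts.items(), key=lambda kv: kv[1]):
        if total <= 1:
            print("possibly dead:", func)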
Unless you know that your code uses reflection, as you said, I would go for a trivial grep. Do not underestimate the power of the asterisk in vim either (it searches for the word under your cursor in the current file), although this is limited to the file you are currently editing.
Another solution you could implement is to have a very good test suite (which seldom happens, unfortunately) and then wrap the routine with a deprecation warning. If you get the deprecation output, it means that the routine was called, so it's still used somewhere. This works even for reflection-based calls, but of course you can never be sure that you haven't simply failed to trigger the situation in which your routine gets called.
It's not only about searching for function names; you should also look for imported packages that are not in use.
Search the code for all the imported packages (including aliases) and the functions actually used, then create a list of the specific imports needed from each package (for example, instead of import os, replace it with from os import listdir, getcwd, ...).
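For the import side of this, here is a minimal static sketch with the standard ast module; it only checks whether the bound name appears as a plain reference in the same file, so Python's dynamic tricks will still produce false positives:

    # Sketch: flag imports whose bound names are never referenced in the same file.
    import ast
    import sys

    def unused_imports(path):
        with open(path, encoding="utf-8") as handle:
            tree = ast.parse(handle.read(), filename=path)

        imported = {}  # bound name -> line number
        used = set()

        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    imported[alias.asname or alias.name.split(".")[0]] = node.lineno
            elif isinstance(node, ast.ImportFrom):
                for alias in node.names:
                    if alias.name != "*":
                        imported[alias.asname or alias.name] = node.lineno
            elif isinstance(node, ast.Name):
                used.add(node.id)

        return [(name, line) for name, line in imported.items() if name not in used]

    if __name__ == "__main__":
        for filename in sys.argv[1:]:
            for name, line in unused_imports(filename):
                print("%s:%d: '%s' imported but never used" % (filename, line, name))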
I have searched and searched to find the tool that I need but I can't seem to find one. I'm sure it exists though, so I will explain my issue and hopefully someone can help me.
Cookiecutter and other scaffolding tools create an initial project folder structure, filenames, etc. when starting a new project. These automate a boring process but also ensure consistency.
I have an issue whereby I create a new project with multiple people working on it. It can be set up correctly, but not everyone will then follow the convention, which is very frustrating and leads to errors, versioning problems and an inability for people to re-trace or recreate work they have done. I have decided I need to check that people are adhering to the convention once the project is underway. This is another boring, laborious task that could be automated with a clever tool.
I would like a tool that not only sets up the initial project from template, but can also be run at various intervals to determine if a naming structure is being followed or not. I can then get a quick understanding of where the violations have occurred and I can address it with the person involved.
I've looked at cookiecutter which will certainly be suitable for the initial setup, but I wonder if there is another tool I am missing that will subsequently help me check that conventions are being followed? If not I am sure I can modify cookiecutter to get it to do what I want, but an off-the-shelf solution would be ideal.
I should note that these are not necessarily text files with code. There will be versions of datasets, large binary files, some scripting files (python/matlab), images, etc. The naming convention then becomes important because I am using it for versioning, search and other problems that git simply solves.
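To be concrete, this is roughly the sort of check I have in mind; the folder names and patterns below are purely illustrative stand-ins for a real convention:

    # Sketch: walk a project tree and report paths that break a naming convention.
    import os
    import re

    RULES = {  # folder -> required file-name pattern (illustrative only)
        "data": re.compile(r"^\d{8}_[a-z0-9_]+\.(csv|parquet|bin)$"),
        "scripts": re.compile(r"^[a-z][a-z0-9_]*\.(py|m)$"),
        "figures": re.compile(r"^fig\d{2}_[a-z0-9_]+\.(png|svg)$"),
    }

    def audit(project_root):
        violations = []
        for folder, pattern in RULES.items():
            for root, _, files in os.walk(os.path.join(project_root, folder)):
                for name in files:
                    if not pattern.match(name):
                        violations.append(os.path.join(root, name))
        return violations

    if __name__ == "__main__":
        for path in audit("."):
            print("naming violation:", path)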
My project targets a low-cost and low-resource embedded device. I am dependent on a relatively large and sprawling Python code base, of which my use of its APIs is quite specific.
I am keen to prune the code of this library back to its bare minimum, by executing my test suite within a coverage tool like Ned Batchelder's coverage or figleaf, then scripting removal of unused code within the various modules/files. This will help not only with understanding the library's internals, but also make writing any patches easier. Ned actually refers to the use of coverage tools to "reverse engineer" complex code in one of his online talks.
My question to the SO community is whether people have experience of using coverage tools in this way that they wouldn't mind sharing? What are the pitfalls if any? Is the coverage tool a good choice? Or would I be better off investing my time with figleaf?
The end-game is to be able to automatically generate a new source tree for the library, based on the original tree, but only including the code actually used when I run nosetests.
If anyone has developed a tool that does a similar job for their Python applications and libraries, it would be terrific to get a baseline from which to start development.
Hopefully my description makes sense to readers...
What you want isn't "test coverage", it is the transitive closure of "can call" from the root of the computation. (In threaded applications, you have to include "can fork").
You want to designate some small set (perhaps only 1) of functions that make up the entry points of your application, and want to trace through all possible callees (conditional or unconditional) of that small set. This is the set of functions you must have.
Python makes this very hard in general (IIRC, I'm not a deep Python expert) because of dynamic dispatch and especially because of eval. Reasoning about which functions can get called can be pretty tricky for static analyzers applied to highly dynamic languages.
One might use test coverage as a way to seed the "can call" relation with specific "did call" facts; that could catch a lot of dynamic dispatches (dependent on your test suite coverage). Then the result you want is the transitive closure of "can or did" call. This can still be erroneous, but is likely to be less so.
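A minimal sketch of that "can or did call" closure, assuming the call edges have already been collected elsewhere (static analysis seeded with coverage traces); the function names here are invented for illustration:

    # Sketch: transitive closure of the call relation from a set of root functions.
    from collections import deque

    call_edges = {  # caller -> callees (assumed to come from your own analysis)
        "main": {"load_config", "run"},
        "run": {"parse", "render"},
        "parse": {"tokenize"},
        "render": set(),
        "tokenize": set(),
        "orphan_helper": {"tokenize"},  # never reached from the roots
    }

    def reachable(roots, edges):
        seen = set(roots)
        queue = deque(roots)
        while queue:
            current = queue.popleft()
            for callee in edges.get(current, ()):
                if callee not in seen:
                    seen.add(callee)
                    queue.append(callee)
        return seen

    needed = reachable({"main"}, call_edges)
    print("needed:", sorted(needed))
    print("candidates for removal:", sorted(set(call_edges) - needed))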
Once you get a set of "necessary" functions, the next problem will be removing the unnecessary functions from the source files you have. If the number of files you start with is large, the manual effort to remove the dead stuff may be pretty high. Worse, you're likely to revise your application, and then the answer as to what to keep changes. So for every change (release), you need to reliably recompute this answer.
My company builds a tool that does this analysis for Java packages (with appropriate caveats regarding dynamic loads and reflection): the input is a set of Java files and (as above) a designated set of root functions. The tool computes the call graph, also finds all dead member variables, and produces two outputs: a) the list of purportedly dead methods and members, and b) a revised set of files with all the "dead" stuff removed. If you believe a), then you use b). If you think a) is wrong, then you add the elements listed in a) to the set of roots and repeat the analysis until you think a) is right. To do this, you need a static analysis tool that parses Java, computes the call graph, and then revises the code modules to remove the dead entries. The basic idea applies to any language.
You'd need a similar tool for Python, I'd expect.
Maybe you can stick to just dropping files that are completely unused, although that may still be a lot of work.
As others have pointed out, coverage can tell you what code has been executed. The trick for you is to be sure that your test suite truly exercises the code fully. The failure case here is over-pruning because your tests skipped some code that will really be needed in production.
Be sure to get the latest version of coverage.py (v3.4): it adds a new feature to indicate files that are never executed at all.
BTW: for a first-cut prune, Python provides a neat trick: remove all the .pyc files in your source tree, then run your tests. Files that still have no .pyc file were clearly not executed!
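A hedged sketch of that trick, assuming the old layout where each .pyc is written next to its .py (on Python 3 you would look inside __pycache__ instead):

    # Sketch: after deleting all .pyc files and running the tests, list .py files
    # that never got recompiled, i.e. were never imported.
    import os

    for root, _, files in os.walk("."):
        for name in files:
            if name.endswith(".py"):
                source = os.path.join(root, name)
                if not os.path.exists(source + "c"):  # foo.py -> foo.pyc
                    print("never imported:", source)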
I haven't used coverage for pruning out, but it seems like it should do well. I've used the combination of nosetests + coverage, and it worked better for me than figleaf. In particular, I found the html report from nosetests+coverage to be helpful -- this should be helpful to you in understanding where the unused portions of the library are.
I manage the testing for a very large financial pricing system. Recently our HQ have insisted that we verify that every single part of our project has a meaningful test in place. At the very least they want a system which guarantees that when we change something we can spot unintentional changes to other sub-systems. Preferably they want something which validates the correctness of every component in our system.
That's obviously going to be quite a lot of work! It could take years, but for this kind of project it's worth it.
I need to find out which parts of our code are not covered by any of our unit-tests. If I knew which parts of my system were untested then I could set about developing new tests which would eventually approach towards my goal of complete test-coverage.
So how can I go about running this kind of analysis. What tools are available to me?
I use Python 2.4 on Windows 32bit XP
UPDATE0:
Just to clarify: We have a very comprehensive unit-test suite (plus a separate and very comprehensive regtest suite which is outside the scope of this exercise). We also have a very stable continuous integration platform (built with Hudson) which is designed to split up and run standard Python unit tests across our test facility: approx. 20 PCs built to the company spec.
The object of this exercise is to plug any gaps in our Python unittest suite (only) so that every component has some degree of unittest coverage. Other developers will be taking responsibility for non-Python components of the project (which are also out of scope).
"Component" is intentionally vague: sometimes it will be a class, other times an entire module or assembly of modules. It might even refer to a single financial concept (e.g. a single type of financial option or a financial model used by many types of option). This cake can be cut in many ways.
"Meaningful" tests (to me) are ones which validate that the function does what the developer originally intended. We do not want to simply reproduce the regtests in pure Python. Often the developer's intent is not immediately obvious, hence the need to research and clarify anything which looks vague and then enshrine this knowledge in a unit-test which makes the original intent quite explicit.
For the code coverage alone, you could use coverage.py.
As for coverage.py vs figleaf:
figleaf differs from the gold standard of Python coverage tools ('coverage.py') in several ways. First and foremost, figleaf uses the same criterion for "interesting" lines of code as the sys.settrace function, which obviates some of the complexity in coverage.py (but does mean that your "loc" count goes down). Second, figleaf does not record code executed in the Python standard library, which results in a significant speedup. And third, the format in which the coverage data is saved is very simple and easy to work with.
You might want to use figleaf if you're recording coverage from multiple types of tests and need to aggregate the coverage in interesting ways, and/or control when coverage is recorded. coverage.py is a better choice for command-line execution, and its reporting is a fair bit nicer.
I guess both have their pros and cons.
The first step would be writing meaningful tests. If you write tests only meant to reach full coverage, you'll be counter-productive; it will probably mean you'll focus on the unit's implementation details instead of its expectations.
BTW, I'd use nose as the unittest framework (http://somethingaboutorange.com/mrl/projects/nose/0.11.1/); its plugin system is very good and leaves the coverage option to you (--with-coverage for Ned's coverage, --with-figleaf for Titus's figleaf; support for coverage3 should be coming), and you can write plugins for your own build system, too.
FWIW, this is what we do. Since I don't know about your Unit-Test and Regression-Test setup, you have to decide yourself whether this is helpful.
Every Python package has unit tests.
We automatically detect unit tests using nose. Nose automagically detects standard Python unit tests (basically everything that looks like a test), so we don't miss any. Nose also has a plug-in concept, so you can produce nice output, for example.
We strive for 100% coverage for unit-testing. To this end, we use coverage to check, because a nose plugin provides the integration.
We have set up Eclipse (our IDE) to automatically run nose whenever a file changes so that the unit-tests always get executed, which shows code-coverage as a by-product.
"every single part of our project has a meaningful test in place"
"Part" is undefined. "Meaningful" is undefined. That's okay, however, since it gets better further on.
"validates the correctness of every component in our system"
"Component" is undefined. But correctness is defined, and we can assign a number of alternatives to component. You only mention Python, so I'll assume the entire project is pure Python.
Validates the correctness of every module.
Validates the correctness of every class of every module.
Validates the correctness of every method of every class of every module.
You haven't asked about line of code coverage or logic path coverage, which is a good thing. That way lies madness.
"guarantees that when we change something we can spot unintentional changes to other sub-systems"
This is regression testing. That's a logical consequence of any unit testing discipline.
Here's what you can do.
Enumerate every module. Create a unittest for that module that is just a unittest.main(). This should be quick -- a few days at most.
Write a nice top-level unittest script that uses a testLoader to find all unit tests in your tests directory and runs them through the text runner. At this point, you'll have a lot of files -- one per module -- but no actual test cases. Getting the testLoader and the top-level script to work will take a few days. It's important to have this overall harness working; a minimal sketch follows.
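A minimal sketch of that top-level script, assuming the tests live in a tests directory and follow a test_*.py naming pattern (both assumptions; very old Python versions lack discover and would need a hand-rolled loader):

    # Sketch: discover every unittest under ./tests and run it through the text runner.
    import sys
    import unittest

    def main():
        suite = unittest.defaultTestLoader.discover(start_dir="tests", pattern="test_*.py")
        result = unittest.TextTestRunner(verbosity=2).run(suite)
        # Non-zero exit code so the daily run (and anyone watching it) notices failures.
        sys.exit(0 if result.wasSuccessful() else 1)

    if __name__ == "__main__":
        main()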
Prioritize your modules. A good rule is "most heavily reused". Another rule is "highest risk from failure". Another rule is "most bugs reported". This takes a few hours.
Start at the top of the list. Write a TestCase per class with no real methods or anything. Just a framework. This takes a few days at most. Be sure the docstring for each TestCase positively identifies the Module and Class under test and the status of the test code. You can use these docstrings to determine test coverage.
At this point you'll have two parallel tracks. You have to actually design and implement the tests. Depending on the class under test, you may have to build test databases, mock objects, all kinds of supporting material.
Testing Rework. Starting with your highest priority untested module, start filling in the TestCases for each class in each module.
New Development. For every code change, a unittest.TestCase must be created for the class being changed.
The test code follows the same rules as any other code. Everything is checked in at the end of the day. It has to run -- even if the tests don't all pass.
Give the test script to the product manager (not the QA manager, the actual product manager who is responsible for shipping product to customers) and make sure they run the script every day and find out why it didn't run or why tests are failing.
The actual running of the master test script is not a QA job -- it's everyone's job. Every manager at every level of the organization has to be part of the daily build script output. All of their jobs have to depend on "all tests passed last night". Otherwise, the product manager will simply pull resources away from testing and you'll have nothing.
Assuming you already have a relatively comprehensive test suite, there are tools for the Python part. The C part is much more problematic, depending on tool availability.
For the Python unit tests, coverage.py (discussed in the other answers) works well.
For C code, it is difficult on many platforms because gprof, the GNU profiler, cannot handle code built with -fPIC, so you have to build every extension statically in this case, which is not supported by many extensions (see my blog post about numpy, for example). On Windows, there may be better code coverage tools for compiled code, but they may require you to recompile the extensions with MS compilers.
As for the "right" code coverage, I think a good balance is to avoid writing complicated unit tests as much as possible. If a unit test is more complicated than the thing it tests, then it is probably not a good test, or it is a broken test.