In our legacy system, we created an init module that loads information which is then used by various other modules (via import statements). It is a big module that consumes a lot of memory and takes a long time to process, and some of the information it loads is not needed or has never been used so far. There are two proposed solutions.
Can we determine in Python who is using this module? For example:
LoadData.py (the init module): contains 100 data members
A.py:
    import LoadData
    b = LoadData.name
B.py:
    import LoadData
    b = LoadData.width
In the above example, A.py uses name and B.py uses width; the rest of the information is not required (98 data members are never used).
Is there any way to help us determine the usage of the LoadData module along with the usage of its individual data members?
Put simply, we would otherwise have to traverse A.py and B.py and search manually to identify the usage of each object.
I am trying to implement the first solution, but I have more than 1000 modules and it would be painful to determine this by traversing each module. I am open to any tool that can be integrated with Python.
Your question is quite broad, so I can't give you an exact answer. However, what I would generally do here is to run a linter like flake8 over the whole codebase to show you where you have unused imports and if you have references in your files to things that you haven't imported. It won't tell you if a whole file is never imported by anything, but if you remove all unused imports, you can then search your codebase for imports of a particular module and if none are found, you can (relatively) safely delete that module.
You can integrate tools like flake8 with most good text editors, so that they highlight mistakes in real time.
As you're trying to work with legacy code, you'll more than likely have many errors when you run the tool, as it looks out for style issues as well as the kinds of import/usage issues that you mention. I would recommend fixing these as a matter of principle (as they are non-functional in nature), and then making sure that you run flake8 as part of your continuous integration to avoid regressions. You can, however, disable particular warnings with command-line arguments, which might help you stage things.
Another thing you can start to do, though it will take a little longer to yield results, is write and run unit tests with code coverage switched on, so you can see areas of your codebase that are never executed. With a large and legacy project, however, this might be tough going! It will, however, help you gain better insight into the attribute usage you mention in point 1. Because Python is very dynamic, static analysis can only go so far in giving you information about attribute usage.
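As a rough illustration of how far that static side can go, here is a minimal sketch (my own, not from any particular tool) that uses the stdlib ast module to list which attributes of LoadData each file touches. It assumes attributes are always accessed as plain LoadData.<name> expressions and will miss anything dynamic (getattr, aliasing, from LoadData import ...):

import ast
import pathlib
from collections import defaultdict

usage = defaultdict(set)
for path in pathlib.Path(".").rglob("*.py"):
    tree = ast.parse(path.read_text(), filename=str(path))
    for node in ast.walk(tree):
        # record LoadData.<attr> accesses only
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == "LoadData"):
            usage[str(path)].add(node.attr)

for path, attrs in sorted(usage.items()):
    print(path, "->", ", ".join(sorted(attrs)))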
Also, make sure you are using a version control tool (such as git) so that you can track any changes and revert them if you go wrong.
Related
So you've got some legacy code lying around in a fairly hefty project. How can you find and delete dead functions?
I've seen these two references: Find unused code and Tool to find unused functions in php project, but they seem specific to C# and PHP, respectively.
Is there a Python tool that'll help you find functions that aren't referenced anywhere else in the source code (notwithstanding reflection/etc.)?
In Python you can find unused code by using dynamic or static code analyzers. Two examples for dynamic analyzers are coverage and figleaf. They have the drawback that you have to run all possible branches of your code in order to find unused parts, but they also have the advantage that you get very reliable results.
Alternatively, you can use static code analyzers that just look at your code, but don't actually run it. They run much faster, but due to Python's dynamic nature the results may contain false positives.
Two tools in this category are pyflakes and vulture. Pyflakes finds unused imports and unused local variables. Vulture finds all kinds of unused and unreachable code. (Full disclosure: I'm the maintainer of Vulture.)
The tools are available in the Python Package Index https://pypi.org/.
I'm not sure if this is helpful, but you might try using the coverage, figleaf or other similar modules, which record which parts of your source code are used as you actually run your scripts/application.
Because of the fairly strict way Python code is presented, would it be that hard to build a list of functions based on a regex looking for def function_name(..)?
And then search for each name and tot up how many times it features in the code. It wouldn't naturally take comments into account but as long as you're having a look at functions with less than two or three instances...
It's a bit Spartan but it sounds like a nice sleepy-weekend task =)
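A rough sketch of that regex idea, assuming the whole codebase lives under the current directory (a count of 1 means the only occurrence of a name is its own definition):

import pathlib
import re
from collections import Counter

# read every .py file once
sources = {p: p.read_text() for p in pathlib.Path(".").rglob("*.py")}

# collect every defined function name
names = set()
for text in sources.values():
    names.update(re.findall(r"^\s*def\s+(\w+)\s*\(", text, re.MULTILINE))

# count how often each name occurs anywhere (definition included)
counts = Counter()
for name in names:
    pattern = re.compile(r"\b%s\b" % re.escape(name))
    counts[name] = sum(len(pattern.findall(text)) for text in sources.values())

for name, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(n, name)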
Unless you know that your code uses reflection, as you said, I would go for a trivial grep. Do not underestimate the power of the asterisk in vim as well (it performs a search of the word you have under your cursor in the file), albeit this is limited only to the file you are currently editing.
Another solution you could implement is to have a very good test suite (which seldom happens, unfortunately) and then wrap the routine with a deprecation routine. If you get the deprecation output, it means that the routine was called, so it's still used somewhere. This works even for reflection-based behavior, but of course you can never be sure you haven't simply failed to trigger the situation in which your routine would be called.
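A minimal sketch of that wrapping idea, using a plain decorator and the warnings module (old_helper is a made-up stand-in for a routine you suspect is dead):

import functools
import warnings

warnings.simplefilter("always", DeprecationWarning)   # make sure the warning is not suppressed

def maybe_dead(func):
    # wrap a routine so it shouts whenever it is actually called
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn("%s is still being called" % func.__qualname__,
                      DeprecationWarning, stacklevel=2)
        return func(*args, **kwargs)
    return wrapper

@maybe_dead
def old_helper(x):   # made-up routine you suspect is dead
    return x + 1

old_helper(1)        # emits a DeprecationWarning, so the routine is still used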
It's not only about searching function names, but also about finding all the imported packages that are not in use.
You need to search the code for all the imported packages (including aliases) and for the functions that are actually used, then create a list of the specific imports needed from each package (for example, instead of import os, replace it with from os import listdir, getcwd, ...).
I am starting a new Python project that is supposed to run both sequentially and in parallel. However, because the behavior is entirely different, running in parallel would require a completely different set of classes than those used when running sequentially. But there is so much overlap between the two versions that it makes sense to have a unified codebase and defer the parallel/sequential behavior to a certain group of classes.
Coming from a C++ world, I would let the user set a Parallel or Serial class in the main file and use that as a template parameter to instantiate other classes at runtime. In Python there is no compilation time, so I'm looking for the most Pythonic way to accomplish this. Ideally, it would be great if the code determined whether the user is running sequentially or in parallel and selected the classes automatically. So if the user runs mpirun -np 4 python __main__.py the code should behave entirely differently than when the user calls just python __main__.py. Somehow it makes no sense to me to have if statements to determine the type of an object at runtime; there has to be a much more elegant way to do this. In short, I would like to avoid:
if isinstance(a, Parallel):
    m = ParallelObject()
elif isinstance(a, Serial):
    m = SerialObject()
I've been reading about this, and it seems I can use factories (which somewhat have this conditional statement buried in the implementation). Yet, using factories for this problem is not an option because I would have to create too many factories.
In fact, it would be great if I can just "mimic" C++'s behavior here and somehow use Parallel/Serial classes to choose classes properly. Is this even possible in Python? If so, what's the most Pythonic way to do this?
Another idea would be to detect whether the user is running in parallel or sequentially and then load the appropriate module (either from a parallel or sequential folder) with the appropriate classes. For instance, I could have the user type in the main script:
from myPackage.parallel import *
or
from myPackage.serial import *
and then have the parallel or serial folders import all shared modules. This would allow me to keep all classes that differentiate parallel/serial behavior with the same names. This seems to be the best option so far, but I'm concerned about what would happen when I'm running py.test because some test files will load parallel modules and some other test files would load the serial modules. Would testing work with this setup?
You may want to check how a similar issue is solved in the stdlib: https://github.com/python/cpython/blob/master/Lib/os.py - it's not a 100% match to your own problem, nor the only possible solution FWIW, but you can safely assume this to be a rather "pythonic" solution.
wrt/ the "automagic" thing depending on execution context, if you decide to go for it, by all means make sure that 1/ both implementations can still be explicitely imported (like os.ntpath and os.posixpath) so they are truly unit-testable, and 2/ the user can still manually force the choice.
EDIT:
So if I understand it correctly, this file you point out imports modules depending on (...)
What it "depends on" is actually mostly irrelevant (in this case it's a builtin name because the target OS is known when the runtime is compiled, but this could be an environment variable, a command line argument, a value in a config file etc). The point was about both conditional import of modules with same API but different implementations while still providing direct explicit access to those modules.
So in a similar way, I could let the user type from myPackage.parallel import * and then in myPackage/__init__.py I could import all the required modules for the parallel calculation. Is this what you suggest?
Not exactly. I posted this as an example of conditional imports mostly, and eventually as a way to build a "bridge" module that can automagically select the appropriate implementation at runtime (on which basis it does so is up to you).
The point is that the end user should be able to either explicitly select an implementation (by explicitly importing the right submodule, serial or parallel, and using it directly) OR, still explicitly, ask the system to select one or the other depending on the context.
So you'd have myPackage.serial and myPackage.parallel (just as they are now), and an additional myPackage.automagic that dynamically selects either serial or parallel. The "recommended" choice would then be to use the "automagic" module so the same code can be run either serial or parallel without the user having to care about it, but with still the ability to force using one or the other where it makes sense.
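A minimal sketch of what such an "automagic" bridge module could look like; the environment variable used here to choose is purely an assumption, the actual selection mechanism is up to you:

# myPackage/automagic.py
import os

# pick the implementation once, at import time; MYPACKAGE_MODE is an assumed name
if os.environ.get("MYPACKAGE_MODE", "serial") == "parallel":
    from myPackage.parallel import *
else:
    from myPackage.serial import *

Callers that want the automatic behaviour import myPackage.automagic; callers that want a specific implementation keep importing myPackage.serial or myPackage.parallel directly.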
My fear is that py.test will have modules from parallel and serial while testing different files and create a mess
Why and how would this happen? Remember that Python has no "process-global" namespace - "globals" are really "module-level" only - and that Python's import is absolutely nothing like C/C++ includes.
import loads a module object (which can be built directly from Python source code, or from compiled C code, or even dynamically created - remember, at runtime a module is an object, an instance of the module type) and binds this object (or attributes of this object) into the enclosing scope. Also, modules are guaranteed (with a couple of caveats, but those are to be considered error cases) to be imported only once for a given process (and then cached), so importing the same module twice in the same process will yield the same object (IOW a module is a singleton).
All this means that given something like
# module A
def foo():
    return bar(42)

def bar(x):
    return x * 2

and

# module B
def foo():
    return bar(33)

def bar(x):
    return x / 2
It's guaranteed that however you import from A and B, A.foo will ALWAYS call A.bar and NEVER call B.bar, and B.foo will only ever call B.bar (unless you explicitly monkeypatch them of course, but that's not the point).
Also, this means that within a module you cannot have access to the importing namespace (the module or function that's importing your module), so you cannot have a module depending on "global" names set by the importer.
To make a long story short, you really need to forget about C++ and learn how Python works, as those are wildly different languages with wildly different object models, execution models and idioms. A couple interesting reads are http://effbot.org/zone/import-confusion.htm and https://nedbatchelder.com/text/names.html
EDIT 2:
(about the 'automagic' module)
I would do that based on whether the user runs mpirun or just python. However, it seems it's not possible (see for instance this or this) in a portable way without a hack. Any ideas in that direction?
I've never ever had anything to do with mpi so I can't help with this - but if the general consensus is that there's no reliable portable way to detect this then obviously there's your answer.
This being said, simple stupid solutions are sometimes overlooked. In your case, explicitly setting an environment variable or passing a command-line switch to your main script would JustWork(tm), ie the user should for example use
SOMEFLAG=serial python main.py
vs
SOMEFLAG=parallel mpirun -np 4 python main.py
or
python main.py serial
vs
mpirun -np 4 python main.py parallel
(whichever works best for your needs / is most easily portable).
This of course requires a bit more documentation and some more effort from the end-user but well...
I'm not really sure what you're asking here. Python classes are just (callable/instantiable) objects themselves, so you can of course select and use them conditionally. If multiple classes within multiple modules are involved, you can also make the imports conditional.
if user_says_parallel:
    from myPackage.parallel import ParallelObject
    ObjectClass = ParallelObject
else:
    from myPackage.serial import SerialObject
    ObjectClass = SerialObject

my_abstract_object = ObjectClass()
Whether that's very useful depends on your classes and the effort it takes to make sure they have the same API, so they're compatible when replacing each other. Maybe even inheritance à la ParallelObject => SerialObject is possible, or at least a common (virtual) base class to hold all the shared code. But that's just the same as in C++.
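A small sketch of that common-base-class idea (all names here are illustrative, not from the question):

from abc import ABC, abstractmethod

class BaseObject(ABC):
    # code shared by both modes lives here
    def shared_logic(self):
        return "common setup"

    @abstractmethod
    def run(self):
        ...

class SerialObject(BaseObject):
    def run(self):
        return "running sequentially"

class ParallelObject(BaseObject):
    def run(self):
        return "running in parallel"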
I'm looking to improve the quality of a fairly large Python project. I am happy with the types of warnings PyLint gives me. However, they are just too numerous and hard to enforce across a large organization. I also believe that some code is more critical/sensitive than other code with respect to where the next bug may come from. For example, I would like to spend more time validating a library method that is used by 100 modules rather than a script that was last touched 2 years ago and may not be used in production. It would also be interesting to know which modules are frequently updated.
Is anyone familiar with tools for Python or otherwise that help with this type of analysis?
Your problem is similar to the one I answered over at SQA https://sqa.stackexchange.com/a/3082. That problem was associated with Java, which made the tooling a bit easier, but I have a number of suggestions below.
A number of other answers suggest that there are no good runtime tools for Python. I disagree with this in several ways:
Coverage tools work very well
Based on my experience in tooling in Java, static and dynamic analysis tools in Python are weaker than in a strongly typed less dynamic language but will work more than well enough to give good heuristics for you here. Unless you use an unusually large pathological number of dynamic features (including adding and removing methods, intercepting method and property invocations, playing with import, manually modifying the namespace) - in which case any problems you have may well be associated with this dynamism...
Pylint picks up simpler problems, and will not detect problems with dynamic class/instance modifications and decorators - so it doesn't matter that the metric tools don't measure these
In any case, where you can usefully focus can be determined by much more than a dependency graph.
Heuristics for selecting code
I find that there are a number of different considerations for selecting code for improvement which work both individually and together. Remember that, at first, all you need to do is find a productive seam of work - you don't need to find the absolutely worst code before you start.
Use your judgement.
After a few cycles through the codebase, you will have a huge amount of information and be much better positioned to continue your work - if indeed more needs to be done.
That said, here are my suggestions:
High value to the business: For example any code that could cost your company a lot of money. Many of these may be obvious or widely known (because they are important), or they may be detected by running the important use cases on a system with the run-time profiler enabled. I use Coverage.
Static code metrics: There are a lot of metrics, but the ones that concern us are:
High afferent couplings. This is code that a lot of other files depend on. While I don't have a tool that directly outputs this, snakefood is a good way to dump the dependencies directly to a file, one line per dependency, each being a tuple of afferent and efferent file. I hate to say it, but computing the afferent coupling value from this file is a simple exercise left to the reader (a small sketch of that computation appears after this list).
High McCabe (cyclomatic) complexity: This is more complex code. PyMetrics seems to produce this measure although I have not used the tool.
Size: You can get a surprising amount of information by viewing the size of your project using a visualiser (eg https://superuser.com/questions/8248/how-can-i-visualize-the-file-system-usage-on-windows or https://superuser.com/questions/86194/good-program-to-visualize-file-system-usage-on-mac?lq=1; on Linux there are KDirStat and Filelight). Large files are a good place to start as fixing one file fixes many warnings.
Note that these tools are file-based. This is probably fine-enough resolution since you mention the project itself has hundreds of modules (files).
Changes frequently: Code that changes frequently is highly suspect. The code may:
Historically have had many defects, and empirically may continue to do so
Be undergoing changes from feature development (high number of revisions in your VCS)
Find areas of change using a VCS visualisation tool such as those discussed later in this answer.
Uncovered code: Code not covered by tests.
If you run (or can run) your unit tests, your other automated tests and typical user tests with coverage, take a look at the packages and files with next to no coverage. There are two logical reasons why there is no coverage:
The code is needed (and important) but not tested at all (at least automatically). These areas are extremely high risk
The code may be unused and is a candidate for removal.
Ask other developers
You may be surprised at the 'smell' metrics you can gather by having a coffee with the longer-serving developers. I bet they will be very happy if someone cleans up a dirty area of the codebase where only the bravest souls will venture.
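As promised above, here is a small sketch of that afferent-coupling exercise. It assumes each line of the snakefood dump can be parsed as a tuple of (root, file) pairs, with the depended-upon file second; check a line of your own dump and adjust the unpacking accordingly:

import ast
from collections import Counter

pairs = set()
with open("dependencies.txt") as deps:        # e.g. produced by: sfood <srcdir> > dependencies.txt
    for line in deps:
        line = line.strip()
        if not line:
            continue
        # format assumption: ((root, source_file), (root, target_file)) per line
        (_, src_file), (_, dst_file) = ast.literal_eval(line)
        if dst_file:                          # skip entries with no resolved target
            pairs.add((src_file, dst_file))

# afferent coupling: how many distinct files depend on each target
afferent = Counter(dst for _, dst in pairs)
for target, count in afferent.most_common(25):
    print(count, target)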
Visibility - detecting changes over time
I am assuming that your environment has a DVCS (such as Git or Mercurial) or at least a VCS (eg SVN). I hope that you are also using an issue or bug tracker of some kind. If so, there is a huge amount of information available. It's even better if developers have reliably checked in with comments and issue numbers. But how do you visualise it and use it?
While you can tackle the problem on a single desktop, it is probably a good idea to set up a Continuous Integration (CI) environment, perhaps using a tool like Jenkins. To keep the answer short, I will assume Jenkins from now on. Jenkins comes with a large number of plugins that really help with code analysis. I use:
py.test with JUnit test output picked up by the JUnit test report Jenkins plugin
Coverage with the Cobertura plugin
SLOCCount and SLOCCount plugin
Pylint and Violations plugin
Apparently there is a plugin for McCabe (cyclomatic) complexity for Python, although I have not used it. It certainly looks interesting.
This gives me visibility of changes over time, and I can drill in from there. For example, suppose PyLint violations start increasing in a module - I have evidence of the increase, and I know the package or file in which this is occurring, so I can find out who's involved and go speak with them.
If you need historic data and you have just installed Jenkins, see if you can run a few manual builds that start at the beginning of the project and take a series of jumps forward in time until the present. You can choose milestone release tags (or dates) from the VCS.
Another important area, as mentioned above, is detecting the loci of changes in the code base. I have really liked Atlassian Fisheye for this. Besides being really good at searching for commit messages (eg bug id) or file contents at any point in time, it allows me to easily see metrics:
Linecount by directory and subdirectory
Committers at any point in time or in specific directories and/or files
Patterns of committal, both by time and also location in the source code
I'm afraid you are mostly on your own.
If you have decent set of tests, look at code coverage and dead code.
If you have a decent profiling setup, use that to get a glimpse of what's used more.
In the end, it seems you are more interested in fan-in/fan-out analysis. I'm not aware of any good tools for Python, primarily because static analysis is horribly unreliable against a dynamic language, and so far I haven't seen any statistical analysis tools.
I reckon that this information is sort of available in JIT compilers -- whatever (function, argument types) combinations are in the cache (compiled) are the ones used the most. Whether or not you can get this data out of e.g. PyPy, I really don't have a clue.
Source control tools can give a good indication of frequently updated modules - often indicating trouble spots.
If you don't have source control but the project is run from a shared location, delete all the __pycache__ folders or .pyc files. Over time, as the project is used, watch which files get recreated to indicate their use.
Analysing the Python imports printed when running from particular entry points with
python -v entry_point
may give some insight into which modules are being used. That said, if you have known entry points, you should try the coverage module.
For a more intrusive solution, consider setting up project-wide logging. You can log metrics easily enough, even over distributed programs.
I agree with the others, in that I have yet to come across a good runtime analysis tool for Python that will do this.
There are some ways to deal with it, but none are trivial.
The most robust, I think, would be to get the Python source and recompile the binaries with some sort of built-in runtime logging. That way you could just slide it into the existing environment without any code changes to your project.
Of course, that isn't exactly trivial to do, but it has the bonus that you might some day be able to get that merged back into the trunk for future generations and what not.
For non-recompile approaches, the first place I would look is the profile library's deterministic profiling section.
How you implement it will be heavily dependent on how your environment is set up. Do you have many individual scripts and projects run independently of one another, or just the one main script or module or package that gets used by everybody else, and you just want to know what parts of it can be trimmed out to make maintenance easier?
Is it a load once, run forever kind of set up, or a situation where you just run scripts atomically on some sort of schedule?
You could implement project-wide logging (as mentioned in Hardbyte's answer), but that would require going through the project and adding the logging lines to all of your code. If you do that, you may as well just do it using the built-in profiler, I think.
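For reference, a minimal sketch of the built-in deterministic profiler mentioned above; legacy_entry_point and main are placeholders for your real entry point:

import cProfile
import pstats

import legacy_entry_point                        # placeholder for your real entry-point module

cProfile.runctx("legacy_entry_point.main()", globals(), locals(), "profile.out")

stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(30)   # show the 30 heaviest entries by cumulative time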
Have a look at sys.setprofile: it allows you to install a profiler function.
Its usage is detailed in http://docs.python.org/library/profile.html#profile, for a jumpstart go here.
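A minimal sketch of such a profiler function that simply records which modules' functions get called (legacy_entry_point is a placeholder):

import sys

seen_modules = set()

def recorder(frame, event, arg):
    # record the module of every Python function call
    if event == "call":
        seen_modules.add(frame.f_globals.get("__name__"))

sys.setprofile(recorder)
try:
    import legacy_entry_point                 # placeholder: run the code you want to observe
    legacy_entry_point.main()
finally:
    sys.setprofile(None)                      # always uninstall the profiler

print(sorted(m for m in seen_modules if m))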
If you cannot profile your application, you will be bound to the coverage approach.
Another thing you might have a look at is decorators: you can write a debugging decorator and apply it to the set of functions you suspect. Take a look here to see how to apply the decorator to an entire module.
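One possible shape for such a debugging decorator, logging-based and with made-up function names:

import functools
import logging

logging.basicConfig(level=logging.INFO)

def log_calls(func):
    # log every call so you can tell whether the function is really used
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logging.info("called %s.%s", func.__module__, func.__qualname__)
        return func(*args, **kwargs)
    return wrapper

@log_calls
def suspect_function(x):     # made-up function you are unsure about
    return x * 2

suspect_function(21)         # logs: INFO:root:called __main__.suspect_function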
You might also take a look at python call graph; while it will not generate quite what you want, it shows you how often one function calls another.
If your code runs on user input this will be hard, since you would have to simulate 'typical' usage.
There is not much more to tell you; just remember profiling as the keyword.
Pylint sometimes gives warnings that (after careful consideration) are not justified. In which case it is useful to make use of the special #pylint: disable=X0123 comments (where X0123 is the actual error/warning message number) if the code cannot be refactored to not trigger the warning.
I'd like to second Hardbyte's mention of using your source control logs to see which files are most often changed.
If you are working on a system that has find, grep and sort installed, the following is a way to check which file imports what:
find . -name '*.py' -exec grep -EH '^import|^from .* import' {} \+| sort |less
To find the most popular imports across all files:
find . -name '*.py' -exec grep -Eh '^import|^from .* import' {} \+ | sort | less
These two commands should help you find the most-used modules from your project.
I have a bunch of Python modules I want to clean up, reorganize and refactor (there's some duplicate code, some unused code ...), and I'm wondering if there's a tool to make a map of which module uses which other module.
Ideally, I'd like a map like this:
main.py
-> task_runner.py
-> task_utils.py
-> deserialization.py
-> file_utils.py
-> server.py
-> (deserialization.py)
-> db_access.py
checkup_script.py
re_test.py
main_bkp0.py
unit_tests.py
... so that I could tell which files I can start moving around first (file_utils.py, db_access.py), which files are not used by my main.py and so could be deleted, etc. (I'm actually working with around 60 modules)
Writing a script that does this probably wouldn't be very complicated (though there are different syntaxes for import to handle), but I'd also expect that I'm not the first one to want to do this (and if someone made a tool for this, it might include other neat features such as telling me which classes and functions are probably not used).
Do you know of any tools (even simple scripts) that assist code reorganization?
Do you know of a more exact term for what I'm trying to do? Code reorganization?
Python's modulefinder does this. It is quite easy to write a script that will turn this information into an import graph (which you can render with e.g. graphviz): here's a clear explanation. There's also snakefood which does all the work for you (and using ASTs, too!)
You might want to look into pylint or pychecker for more general maintenance tasks.
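A quick sketch of the modulefinder route, using only the standard library (main.py stands in for your own entry point):

from modulefinder import ModuleFinder

finder = ModuleFinder()
finder.run_script("main.py")                  # the entry point from the question

for name, module in sorted(finder.modules.items()):
    print(name, "->", module.__file__ or "<builtin>")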
Writing a script that does this probably wouldn't be very complicated (though there are different syntaxes for import to handle),
It's trivial. There's import and from module import. Two syntaxes to handle.
Do you know of a more exact term for what I'm trying to do? Code reorganization?
Design. It's called design. Yes, you're refactoring an existing design, but...
Rule One
Don't start a design effort with what you have. If you do, you'll only "nibble around the edges" making small and sometimes inconsequential changes.
Rule Two
Start a design effort with what you should have had if you'd only been smarter. Think broadly and clearly about what you're really supposed to be doing. Ignore what you did.
Rule Three
Design from the ground up (or de novo as some folks say) with the correct package and module architecture.
Create a separate project for this.
Rule Four
Test First. Write unit tests for your new architecture. If you have existing unit tests, copy them into the new project. Modify the imports to reflect the new architecture and rewrite the tests to express your glorious new simplification.
All the tests fail, because you haven't moved any code. That's a good thing.
Rule Five
Move code into the new structure last. Stop moving code when the tests pass.
You don't need to analyze imports to do this, BTW. You're just using grep to find modules and classes. The old imports and the tangled relationships among the old imports doesn't matter, and doesn't need to be analyzed. You're throwing it away. You don't need tools smarter than grep.
If you feel an urge to move code, you must be very disciplined: (1) you must have test(s) which fail, and then (2) you can move some code to pass the failing test(s).
chuckmove is a tool that lets you recursively rewrite imports in your entire source tree to refer to a new location of a module.
chuckmove --old sound.utils --new media.sound.utils src
...this descends into src, and rewrites statements that import sound.utils to import media.sound.utils instead. It supports the whole range of Python import formats. I.e. from x import y, import x.y.z as w etc.
Modulefinder may not work with Python 3.5*, but pydeps worked very well:
Installation:
sudo apt install python-pygraphviz
pip install pydeps
Then, in the directory where you want to map from,
pydeps --max-bacon=0 .
...to create a map of maximum depth.
*An issue in Python 3.5 but not 3.6 caused the problems with modulefinder, similar to this