Strategies for avoiding python scope errors

Strategies for avoiding python scope errors - python

I use IPython Notebooks extensively in my research. I find them to be a wonderful tool.
However, on more than one occasion, I have been bitten by subtle bugs stemming from variable scope. For example, I will be doing some exploratory analysis:
foo = 1
bar = 2
foo + bar
And I decide that foo + bar is a useful algorithm for my purposes, so I encapsulate it in a function to make it easier to apply to a wider range of inputs:
def the_function(foo, bar):
return foo + bar
Inevitably, somewhere down the line, after building a workflow from the ground up, I will have a typo somewhere (e.g. def the_function(fooo, bar):) that causes a global variable to be used (and/or modified) in a function call. This causes unseen side effects and leads to spurious results. But because it typically returns a result, it can be difficult to find where the problem actually occurs.
Now, I recognize that this behavior is a feature, which I deliberately use often (for convenience, or for necessity i.e. function closures or decorators). But as I keep running into bugs, I'm thinking I need a better strategy for avoiding such problems (current strategy = "be careful").
For example, one strategy might be to always prepend '_' to local variable names. But I'm curious if there are not other strategies - even "pythonic" strategies, or community encouraged strategies.
I know that python 2.x differs in some regards to python 3.x in scoping - I use python 3.x.
Also, strategies should consider the interactive nature of scientific computing, as would be used in an IPython Notebook venue.
Thoughts?
EDIT: To be more specific, I am looking for IPython Notebook strategies.

I was tempted to flag this question as too broad, but perhaps the following will help you.
When you decide to wrap some useful code in a function, write some tests. If you think the code is useful, you must have used it with some examples. Write the test first lest you 'forget'.
My personal policy for a library module is to run the test in an if __name__
== '__main__': statement, whether the test code is in the same file or a different file. I also execute the file to run the tests multiple times during a programming session, after every small unit of change (trivial in Idle or similar IDE).
Use a code checker program, which will catch some typo-based errors. "'fooo' set but never used".
Keep track of the particular kinds of errors you make, analyze them and think about personal countermeasures, or at least learn to recognize the symptoms.
Looking at your example, when you do write a function, don't use the same names for both global objects and parameters. In your example, delete or change the global 'foo' and 'bar' or use something else for parameter names.

I would suggest that you separate your concerns. For your exploratory analysis, write your code in the iPython notebook, but when you've decided that there are some functions that are useful, instead, open up an editor and put your functions into a python file which you can then import.
You can use iPython magics to auto reload things you've imported. So once you've tested them in iPython, you can simply copy them to your module. This way, the scope of your functions is isolated from your notebook. An additional advantage is that when you're ready to run things in a headless environment, you already have your entire codebase in one place.

In the end, I made my own solution to the problem. It builds on both answers given so far.
You can find my solution, which is a cell magic extension, on github: https://github.com/brazilbean/modulemagic
In brief, this extension gives you the ability to create %%module cells in the notebook. These cells are saved as a file and imported back into your session. It effectively accomplishes what #shadanan had suggested, but allows you to keep all your work in the same place (convenient, and in line with the Notebook philosophy of providing code and results in the same place).
Because the import process sandboxes the code, it solves all of the scope shadowing errors that motivated my original question. It also involves little to no overhead to use - no renaming of variables, having other editors open, etc.

Related

How To Remove Unused Python Function Automatically [duplicate]

So you've got some legacy code lying around in a fairly hefty project. How can you find and delete dead functions?
I've seen these two references: Find unused code and Tool to find unused functions in php project, but they seem specific to C# and PHP, respectively.
Is there a Python tool that'll help you find functions that aren't referenced anywhere else in the source code (notwithstanding reflection/etc.)?

In Python you can find unused code by using dynamic or static code analyzers. Two examples for dynamic analyzers are coverage and figleaf. They have the drawback that you have to run all possible branches of your code in order to find unused parts, but they also have the advantage that you get very reliable results.
Alternatively, you can use static code analyzers that just look at your code, but don't actually run it. They run much faster, but due to Python's dynamic nature the results may contain false positives.
Two tools in this category are pyflakes and vulture. Pyflakes finds unused imports and unused local variables. Vulture finds all kinds of unused and unreachable code. (Full disclosure: I'm the maintainer of Vulture.)
The tools are available in the Python Package Index https://pypi.org/.

I'm not sure if this is helpful, but you might try using the coverage, figleaf or other similar modules, which record which parts of your source code is used as you actually run your scripts/application.

Because of the fairly strict way python code is presented, would it be that hard to build a list of functions based on a regex looking for def function_name(..) ?
And then search for each name and tot up how many times it features in the code. It wouldn't naturally take comments into account but as long as you're having a look at functions with less than two or three instances...
It's a bit Spartan but it sounds like a nice sleepy-weekend task =)

unless you know that your code uses reflection, as you said, I would go for a trivial grep. Do not underestimate the power of the asterisk in vim as well (performs a search of the word you have under your cursor in the file), albeit this is limited only to the file you are currently editing.
Another solution you could implement is to have a very good testsuite (seldomly happens, unfortunately) and then wrap the routine with a deprecation routine. if you get the deprecation output, it means that the routine was called, so it's still used somewhere. This works even for reflection behavior, but of course you can never be sure if you don't trigger the situation when your routine call is performed.

its not only searching function names, but also all the imported packages not in use.
you need to search the code for all the imported packages (including aliases) and search used functions, then create a list of the specific imports from each package (example instead of import os, replace with from os import listdir, getcwd,......)

Determine usage/creation of object and data member into another module

In legacy system, We have created init module which load information and used by various module(import statement). It's big module which consume more memory and process longer time and some of information is not needed or not used till now. There is two propose solution.
Can we determine in Python who is using this module.Fox Example
LoadData.py ( init Module)
contain 100 data member
A.py
import LoadData
b = LoadData.name
B.py
import LoadData
b = LoadData.width
In above example, A.py using name and B.py is using width and rest information is not required (98 data member is not required).
is there anyway which help us to determine usage of LoadData module along with usage of data member.
In simple, we need to traverse A.py and B.py and find manually to identify usage of object.
I am trying to implement first solution as I have more than 1000 module and it will be painful to determine by traversing each module. I am open to any tool which can integrate into python

Your question is quite broad, so I can't give you an exact answer. However, what I would generally do here is to run a linter like flake8 over the whole codebase to show you where you have unused imports and if you have references in your files to things that you haven't imported. It won't tell you if a whole file is never imported by anything, but if you remove all unused imports, you can then search your codebase for imports of a particular module and if none are found, you can (relatively) safely delete that module.
You can integrate tools like flake8 with most good text editors, so that they highlight mistakes in real time.
As you're trying to work with legacy code, you'll more than likely have many errors when you run the tool, as it looks out for style issues as well as the kinds of import/usage issues that youre mention. I would recommend fixing these as a matter of principle (as they they are non-functional in nature), and then making sure that you run flake8 as part of your continuous integration to avoid regressions. You can, however, disable particular warnings with command-line arguments, which might help you stage things.
Another thing you can start to do, though it will take a little longer to yield results, is write and run unit tests with code coverage switched on, so you can see areas of your codebase that are never executed. With a large and legacy project, however, this might be tough going! It will, however, help you gain better insight into the attribute usage you mention in point 1. Because Python is very dynamic, static analysis can only go so far in giving you information about atttribute usage.
Also, make sure you are using a version control tool (such as git) so that you can track any changes and revert them if you go wrong.

Best practices for turning jupyter notebooks into python scripts

Jupyter (iPython) notebook is deservedly known as a good tool for prototyping the code and doing all kinds of machine learning stuff interactively. But when I use it, I inevitably run into the following:
the notebook quickly becomes too complex and messy to be maintained and improved further as notebook, and I have to make python scripts out of it;
when it comes to production code (e.g. one that needs to be re-run every day), the notebook again is not the best format.
Suppose I've developed a whole machine learning pipeline in jupyter that includes fetching raw data from various sources, cleaning the data, feature engineering, and training models after all. Now what's the best logic to make scripts from it with efficient and readable code? I used to tackle it several ways so far:
Simply convert .ipynb to .py and, with only slight changes, hard-code all the pipeline from the notebook into one python script.
'+': quick
'-': dirty, non-flexible, not convenient to maintain
Make a single script with many functions (approximately, 1 function for each one or two cell), trying to comprise the stages of the pipeline with separate functions, and name them accordingly. Then specify all parameters and global constants via argparse.
'+': more flexible usage; more readable code (if you properly transformed the pipeline logic to functions)
'-': oftentimes, the pipeline is NOT splittable into logically completed pieces that could become functions without any quirks in the code. All these functions are typically needed to be only called once in the script rather than to be called many times inside loops, maps etc. Furthermore, each function typically takes the output of all functions called before, so one has to pass many arguments to each function.
The same thing as point (2), but now wrap all the functions inside the class. Now all the global constants, as well as outputs of each method can be stored as class attributes.
'+': you needn't to pass many arguments to each method -- all the previous outputs already stored as attributes
'-': the overall logic of a task is still not captured -- it is data and machine learning pipeline, not just class. The only goal for the class is to be created, call all the methods sequentially one-by-one and then be removed. On top of this, classes are quite long to implement.
Convert a notebook into python module with several scripts. I didn't try this out, but I suspect this is the longest way to deal with the problem.
I suppose, this overall setting is very common among data scientists, but surprisingly I cannot find any useful advice around.
Folks, please, share your ideas and experience. Have you ever encountered this issue? How have you tackled it?

Life saver: as you're writing your notebooks, incrementally refactor your code into functions, writing some minimal assert tests and docstrings.
After that, refactoring from notebook to script is natural. Not only that, but it makes your life easier when writing long notebooks, even if you have no plans to turn them into anything else.
Basic example of a cell's content with "minimal" tests and docstrings:
def zip_count(f):
"""Given zip filename, returns number of files inside.
str -> int"""
from contextlib import closing
with closing(zipfile.ZipFile(f)) as archive:
num_files = len(archive.infolist())
return num_files
zip_filename = 'data/myfile.zip'
# Make sure `myfile` always has three files
assert zip_count(zip_filename) == 3
# And total zip size is under 2 MB
assert os.path.getsize(zip_filename) / 1024**2 < 2
print(zip_count(zip_filename))
Once you've exported it to bare .py files, your code will probably not be structured into classes yet. But it is worth the effort to have refactored your notebook to the point where it has a set of documented functions, each with a set of simple assert statements that can easily be moved into tests.py for testing with pytest, unittest, or what have you. If it makes sense, bundling these functions into methods for your classes is dead-easy after that.
If all goes well, all you need to do after that is to write your if __name__ == '__main__': and its "hooks": if you're writing script to be called by the terminal you'll want to handle command-line arguments, if you're writing a module you'll want to think about its API with the __init__.py file, etc.
It all depends on what the intended use case is, of course: there's quite a difference between converting a notebook to a small script vs. turning it into a full-fledged module or package.
Here's a few ideas for a notebook-to-script workflow:
Export the Jupyter Notebook to Python file (.py) through the GUI.
Remove the "helper" lines that don't do the actual work: print statements, plots, etc.
If need be, bundle your logic into classes. The only extra refactoring work required should be to write your class docstrings and attributes.
Write your script's entryways with if __name__ == '__main__'.
Separate your assert statements for each of your functions/methods, and flesh out a minimal test suite in tests.py.

We are having the similar issue. However we are using several notebooks for prototyping the outcomes which should become also several python scripts after all.
Our approach is that we put aside the code, which seams to repeat across those notebooks. We put it into the python module, which is imported by each notebook and also used in the production. We iteratively improve this module continuously and add tests of what we find during prototyping.
Notebooks then become rather like the configuration scripts (which we just plainly copy into the end resulting python files) and several prototyping checks and validations, which we do not need in the production.
Most of all we are not afraid of the refactoring :)

I made a module recently (NotebookScripter) to help address this issue. It allows you to invoke a jupyter notebook via a function call. Its as simple to use as
from NotebookScripter import run_notebook
run_notebook("./path/to/Notebook.ipynb", some_param="Provided Exteranlly")
Keyword parameters can be passed to the function call. Its easy to adapt a notebook to be parameterizable externally.
Within a .ipynb cell
from NotebookScripter import receive_parameter
some_param = receive_parameter(some_param="Return's this value by default when matching keyword not provided by external caller")
print("some_param={0} within the invocation".format(some_param))
run_notebook() supports .ipynb files or .py files -- allowing one to easily use .py files as might be generated by nbconvert of vscode's ipython. You can keep your code organized in a way that makes sense for interactive use, and also reuse/customize it externally when needed.

You should breakdown the logic in small steps, that way your pipeline will be easier to maintain. Since you already have a working codebase, you want to keep your code running, so make small changes, test and repeat.
I'd go this way:
Add some tests to your pipeline, for ML pipelines this is a bit hard, but if your notebook trains a model, you can use performance metrics to test if your pipeline still works (your test can be accuracy = 0.8, but make sure you define a tolerable range since the number hardly be the exact same for each run)
Break apart your single notebook into smaller ones, the output from one should the input for the other. As soon as you create a split, make sure you add a few tests for each notebook individually. To manage this sequential execution, you can use papermill to execute your notebooks or a workflow management tool such as ploomber which integrates with papermill, is able to resolve complex dependencies and has a hook to run tests upon notebook execution (Disclaimer: I'm ploomber's author)
Once you have a pipeline composed of several notebooks that passes all your tests you can decide whether you want to keep using the ipynb format or not. My recommendation would be to only keep as notebooks the tasks that have rich output (such as tables or plots), the rest can be refactored into Python functions, which are more maintainable

How can you find unused functions in Python code?

So you've got some legacy code lying around in a fairly hefty project. How can you find and delete dead functions?
I've seen these two references: Find unused code and Tool to find unused functions in php project, but they seem specific to C# and PHP, respectively.
Is there a Python tool that'll help you find functions that aren't referenced anywhere else in the source code (notwithstanding reflection/etc.)?

In Python you can find unused code by using dynamic or static code analyzers. Two examples for dynamic analyzers are coverage and figleaf. They have the drawback that you have to run all possible branches of your code in order to find unused parts, but they also have the advantage that you get very reliable results.
Alternatively, you can use static code analyzers that just look at your code, but don't actually run it. They run much faster, but due to Python's dynamic nature the results may contain false positives.
Two tools in this category are pyflakes and vulture. Pyflakes finds unused imports and unused local variables. Vulture finds all kinds of unused and unreachable code. (Full disclosure: I'm the maintainer of Vulture.)
The tools are available in the Python Package Index https://pypi.org/.

I'm not sure if this is helpful, but you might try using the coverage, figleaf or other similar modules, which record which parts of your source code is used as you actually run your scripts/application.

Because of the fairly strict way python code is presented, would it be that hard to build a list of functions based on a regex looking for def function_name(..) ?
And then search for each name and tot up how many times it features in the code. It wouldn't naturally take comments into account but as long as you're having a look at functions with less than two or three instances...
It's a bit Spartan but it sounds like a nice sleepy-weekend task =)

unless you know that your code uses reflection, as you said, I would go for a trivial grep. Do not underestimate the power of the asterisk in vim as well (performs a search of the word you have under your cursor in the file), albeit this is limited only to the file you are currently editing.
Another solution you could implement is to have a very good testsuite (seldomly happens, unfortunately) and then wrap the routine with a deprecation routine. if you get the deprecation output, it means that the routine was called, so it's still used somewhere. This works even for reflection behavior, but of course you can never be sure if you don't trigger the situation when your routine call is performed.

its not only searching function names, but also all the imported packages not in use.
you need to search the code for all the imported packages (including aliases) and search used functions, then create a list of the specific imports from each package (example instead of import os, replace with from os import listdir, getcwd,......)

How can I make sure all my Python code "compiles"?

My background is C and C++. I like Python a lot, but there's one aspect of it (and other interpreted languages I guess) that is really hard to work with when you're used to compiled languages.
When I've written something in Python and come to the point where I can run it, there's still no guarantee that no language-specific errors remain. For me that means that I can't rely solely on my runtime defense (rigorous testing of input, asserts etc.) to avoid crashes, because in 6 months when some otherwise nice code finally gets run, it might crack due to some stupid typo.
Clearly a system should be tested enough to make sure all code has been run, but most of the time I use Python for in-house scripts and small tools, which ofcourse never gets the QA attention they need. Also, some code is so simple that (if your background is C/C++) you know it will work fine as long as it compiles (e.g. getter-methods inside classes, usually a simple return of a member variable).
So, my question is the obvious - is there any way (with a special tool or something) I can make sure all the code in my Python script will "compile" and run?

Look at PyChecker and PyLint.
Here's example output from pylint, resulting from the trivial program:
print a
As you can see, it detects the undefined variable, which py_compile won't (deliberately).
in foo.py:
************* Module foo
C: 1: Black listed name "foo"
C: 1: Missing docstring
E: 1: Undefined variable 'a'
...
|error |1 |1 |= |
Trivial example of why tests aren't good enough, even if they cover "every line":
bar = "Foo"
foo = "Bar"
def baz(X):
return bar if X else fo0
print baz(input("True or False: "))
EDIT: PyChecker handles the ternary for me:
Processing ternary...
True or False: True
Foo
Warnings...
ternary.py:6: No global (fo0) found
ternary.py:8: Using input() is a security problem, consider using raw_input()

Others have mentioned tools like PyLint which are pretty good, but the long and the short of it is that it's simply not possible to do 100%. In fact, you might not even want to do it. Part of the benefit to Python's dynamicity is that you can do crazy things like insert names into the local scope through a dictionary access.
What it comes down to is that if you want a way to catch type errors at compile time, you shouldn't use Python. A language choice always involves a set of trade-offs. If you choose Python over C, just be aware that you're trading a strong type system for faster development, better string manipulation, etc.

I think what you are looking for is code test line coverage. You want to add tests to your script that will make sure all of your lines of code, or as many as you have time to, get tested. Testing is a great deal of work, but if you want the kind of assurance you are asking for, there is no free lunch, sorry :( .

If you are using Eclipse with Pydev as an IDE, it can flag many typos for you with red squigglies immediately, and has Pylint integration too. For example:
foo = 5
print food
will be flagged as "Undefined variable: food". Of course this is not always accurate (perhaps food was defined earlier using setattr or other exotic techniques), but it works well most of the time.
In general, you can only statically analyze your code to the extent that your code is actually static; the more dynamic your code is, the more you really do need automated testing.

Your code actually gets compiled when you run it, the Python runtime will complain if there is a syntax error in the code. Compared to statically compiled languages like C/C++ or Java, it does not check whether variable names and types are correct – for that you need to actually run the code (e.g. with automated tests).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.