The Jupyter (IPython) notebook is deservedly known as a good tool for prototyping code and doing all kinds of machine learning work interactively. But when I use it, I inevitably run into the following:
the notebook quickly becomes too complex and messy to be maintained and improved further as a notebook, and I have to turn it into Python scripts;
when it comes to production code (e.g. one that needs to be re-run every day), the notebook again is not the best format.
Suppose I've developed a whole machine learning pipeline in Jupyter that includes fetching raw data from various sources, cleaning the data, feature engineering, and finally training models. What's the best way to turn it into scripts with efficient and readable code? I have tackled it several ways so far:
Simply convert .ipynb to .py and, with only slight changes, hard-code the whole pipeline from the notebook into one Python script.
'+': quick
'-': dirty, inflexible, not convenient to maintain
Make a single script with many functions (roughly one function per one or two cells), trying to capture the stages of the pipeline in separate functions, and name them accordingly. Then specify all parameters and global constants via argparse.
'+': more flexible usage; more readable code (if you properly transformed the pipeline logic to functions)
'-': oftentimes, the pipeline is NOT splittable into logically complete pieces that could become functions without quirks in the code. All these functions typically need to be called only once in the script, rather than many times inside loops, maps, etc. Furthermore, each function typically takes the output of all the functions called before it, so one has to pass many arguments to each function.
The same thing as point (2), but now wrap all the functions inside a class. Now all the global constants, as well as the outputs of each method, can be stored as class attributes.
'+': you don't need to pass many arguments to each method -- all the previous outputs are already stored as attributes
'-': the overall logic of the task is still not captured -- it is a data and machine learning pipeline, not just a class. The class's only purpose is to be created, have all its methods called sequentially one by one, and then be discarded. On top of this, classes take quite a while to implement.
Convert the notebook into a Python module with several scripts. I haven't tried this, but I suspect it is the longest way to deal with the problem.
I suppose this overall situation is very common among data scientists, but surprisingly I cannot find any useful advice on it.
Folks, please, share your ideas and experience. Have you ever encountered this issue? How have you tackled it?
Life saver: as you're writing your notebooks, incrementally refactor your code into functions, writing some minimal assert tests and docstrings.
After that, refactoring from notebook to script is natural. Not only that, but it makes your life easier when writing long notebooks, even if you have no plans to turn them into anything else.
Basic example of a cell's content with "minimal" tests and docstrings:
import os
import zipfile
from contextlib import closing

def zip_count(f):
    """Given a zip filename, return the number of files inside.
    str -> int"""
    with closing(zipfile.ZipFile(f)) as archive:
        num_files = len(archive.infolist())
    return num_files

zip_filename = 'data/myfile.zip'

# Make sure `myfile` always has three files
assert zip_count(zip_filename) == 3
# And total zip size is under 2 MB
assert os.path.getsize(zip_filename) / 1024**2 < 2

print(zip_count(zip_filename))
Once you've exported it to bare .py files, your code will probably not be structured into classes yet. But it is worth the effort to have refactored your notebook to the point where it has a set of documented functions, each with a set of simple assert statements that can easily be moved into tests.py for testing with pytest, unittest, or what have you. If it makes sense, bundling these functions into methods for your classes is dead-easy after that.
If all goes well, all you need to do after that is to write your if __name__ == '__main__': block and its "hooks": if you're writing a script to be called from the terminal, you'll want to handle command-line arguments; if you're writing a module, you'll want to think about its API and the __init__.py file, etc.
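For the command-line case, a minimal sketch of such an entry point (the script name and the arguments are hypothetical):

import argparse

def main():
    # Hypothetical arguments for a pipeline script
    parser = argparse.ArgumentParser(description='Run the pipeline.')
    parser.add_argument('--data', default='data/myfile.zip', help='path to the input data')
    parser.add_argument('--epochs', type=int, default=10, help='number of training epochs')
    args = parser.parse_args()
    print('Running on {0} for {1} epochs'.format(args.data, args.epochs))

if __name__ == '__main__':
    main()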
It all depends on what the intended use case is, of course: there's quite a difference between converting a notebook to a small script vs. turning it into a full-fledged module or package.
Here are a few ideas for a notebook-to-script workflow:
Export the Jupyter Notebook to a Python file (.py) through the GUI.
Remove the "helper" lines that don't do the actual work: print statements, plots, etc.
If need be, bundle your logic into classes. The only extra refactoring work required should be to write your class docstrings and attributes.
Write your script's entry points with if __name__ == '__main__'.
Separate out your assert statements for each of your functions/methods, and flesh out a minimal test suite in tests.py (see the sketch below).
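For instance, the asserts from the zip_count cell above translate almost verbatim into a tests.py runnable with pytest (a minimal sketch; the module name pipeline is hypothetical):

import os

from pipeline import zip_count  # hypothetical module exported from the notebook

ZIP_FILENAME = 'data/myfile.zip'

def test_zip_count():
    # Make sure `myfile` always has three files
    assert zip_count(ZIP_FILENAME) == 3

def test_zip_size():
    # And the total zip size is under 2 MB
    assert os.path.getsize(ZIP_FILENAME) / 1024**2 < 2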
We had a similar issue. However, we use several notebooks for prototyping, and the outcomes should in the end become several Python scripts as well.
Our approach is to set aside the code that seems to repeat across those notebooks. We put it into a Python module, which is imported by each notebook and also used in production. We continuously improve this module and add tests for whatever we find during prototyping.
Notebooks then become more like configuration scripts (which we simply copy into the final resulting Python files), plus some prototyping checks and validations that we do not need in production.
Most of all, we are not afraid of refactoring :)
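A minimal sketch of what such a shared module might look like (the module path and the function are hypothetical):

# common/cleaning.py -- shared module: imported by each prototyping notebook
# and by the production scripts, improved iteratively, and covered by tests
# as prototyping reveals new cases.

def normalize_columns(df):
    """Lower-case and underscore-join the column names of a DataFrame."""
    df = df.copy()
    df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
    return df

Each notebook then just does from common.cleaning import normalize_columns, and production uses the exact same code path.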
I made a module recently (NotebookScripter) to help address this issue. It allows you to invoke a Jupyter notebook via a function call. It's as simple to use as:
from NotebookScripter import run_notebook
run_notebook("./path/to/Notebook.ipynb", some_param="Provided Exteranlly")
Keyword parameters can be passed to the function call. It's easy to adapt a notebook to be parameterized externally.
Within a .ipynb cell
from NotebookScripter import receive_parameter
some_param = receive_parameter(some_param="Returns this value by default when a matching keyword is not provided by the external caller")
print("some_param={0} within the invocation".format(some_param))
run_notebook() supports .ipynb files or .py files -- allowing one to easily use .py files as might be generated by nbconvert or VS Code's IPython support. You can keep your code organized in a way that makes sense for interactive use, and also reuse/customize it externally when needed.
You should break down the logic into small steps; that way your pipeline will be easier to maintain. Since you already have a working codebase, you want to keep your code running, so make small changes, test, and repeat.
I'd go this way:
Add some tests to your pipeline. For ML pipelines this is a bit hard, but if your notebook trains a model, you can use performance metrics to test whether your pipeline still works (your test can be accuracy == 0.8, but make sure you define a tolerable range, since the number will hardly be exactly the same on each run; see the sketch after this list)
Break apart your single notebook into smaller ones; the output of one should be the input to the next. As soon as you create a split, make sure you add a few tests for each notebook individually. To manage this sequential execution, you can use papermill to execute your notebooks, or a workflow management tool such as ploomber, which integrates with papermill, is able to resolve complex dependencies, and has a hook to run tests upon notebook execution (disclaimer: I'm ploomber's author)
Once you have a pipeline composed of several notebooks that passes all your tests, you can decide whether you want to keep using the .ipynb format or not. My recommendation would be to keep as notebooks only the tasks that have rich output (such as tables or plots); the rest can be refactored into Python functions, which are more maintainable.
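A minimal sketch of such a metric-based test (train_model is a hypothetical function refactored out of the notebook, and the accuracy bounds are placeholders):

from mymodule import train_model  # hypothetical module and function

def test_model_accuracy():
    accuracy = train_model('data/train.csv')
    # Scores drift slightly between runs, so assert a tolerable range
    # rather than an exact value.
    assert 0.75 <= accuracy <= 0.85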
Related
I have started a new job and there are 100k lines of code written in Python 2.7 across four different repos.
The code is sometimes quite nested, with many library imports and a complex class structure, and no documentation.
I want to create a graph of the dependencies in order to understand the code better.
I have not found anything on the internet except https://pypi.org/project/pydeps/ but that is not working for some unknown reason.
The solution should either query all Python files in the four repos automatically, or it should take a single Python file with some function call I have saved, and then go through all dependencies and display them graphically.
A good solution would also display which arguments (or keyword arguments) are passed on, or how often a function is used within the 100k lines of code, to understand which methods are more important, etc. This is not a strong requirement, however.
If someone could post one or more python libraries (or VSCode extensions) that would be much appreciated.
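Not a full dependency graph, but for the call-frequency part, a rough sketch using only the standard library's ast module is possible (note: files with Python-2-only syntax won't parse under a Python 3 interpreter and are simply skipped here):

import ast
import collections
import os

def count_calls(root):
    """Count how often each function/method name is called under root."""
    counts = collections.Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.py'):
                continue
            path = os.path.join(dirpath, name)
            with open(path) as f:
                try:
                    tree = ast.parse(f.read(), filename=path)
                except SyntaxError:
                    continue  # skip files that only parse under Python 2
            for node in ast.walk(tree):
                if isinstance(node, ast.Call):
                    func = node.func
                    if isinstance(func, ast.Name):
                        counts[func.id] += 1
                    elif isinstance(func, ast.Attribute):
                        counts[func.attr] += 1
    return counts

for name, n in count_calls('path/to/repo').most_common(20):
    print(name, n)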
I'm generating Python code using another Python file, which I then execute using subprocess.
The issue is that the generated code includes quite a few imports, which makes it slow. I often work in IPython cells (which have memory), so the generator's imports are retained. But the imports of the generated files are re-run every time, because the new subprocess does not retain the imports.
Hence, I was wondering if there is a way to supply your subprocess with imports. This way the imports can be loaded once in my generation script and forwarded to the generated code. I did a little googling and couldn't find much on the matter, so any input would be greatly appreciated!
Edit: To add some context. The reason I generate Python code is that I automatically create classes from input. These classes are required for the underlying model. Using exec might be possible, but would probably be relatively difficult to implement; plus, I like having the ability to run the generated file on its own.
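For what it's worth, the exec route mentioned in the edit could look roughly like this sketch (generated_model.py is a hypothetical generated file). It keeps everything in one interpreter, so any imports in the generated code hit the sys.modules cache instead of being paid for again in a fresh subprocess:

# generator.py -- run the generated file in-process instead of via subprocess.
# The generated file remains a standalone script that can still be run on its own.
with open('generated_model.py') as f:
    source = f.read()

exec(compile(source, 'generated_model.py', 'exec'), {'__name__': '__main__'})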
I'm on Windows 10, with Python 2.7.13 installed via Anaconda. Recently I've been writing a lot of scripts to read/write data from files to other files, move them around, and do some visualizations with matplotlib. My workflow has been to have an Anaconda Prompt open next to Sublime Text, and I copy/paste individual lines into my workspace to test something. This doesn't feel like a "best practice", especially because I can't copy/paste multiple lines with indents, so I have to write them out manually twice. I'd really like to find a better way to work. What would you recommend changing?
There are several types of software testing that vary in their complexity and in what they test. Generally speaking, it is good practice to leverage what is known as unit testing. Unit testing is the methodology of writing groups of tests where each test is responsible for testing a small "unit" of code. By testing only individual pieces of your project with each test, you get a very granular idea of which parts of your project are working correctly and which are not. It also allows your tests to be repeatable, source-controlled, and automated. Typically, each "unit" that a test is written for is a single callable item such as a function or a method of a class.
In order to get the most out of unit testing, your functions and methods need to be single-responsibility entities. This means they should perform one task and one task only, which makes them much easier to test. Python's standard library has a built-in package, appropriately named unittest, for performing this type of testing.
I would start by looking at the unittest package's documentation. It provides more explanation of unit testing and of how to use the package in your Python code. You can also use the coverage package to determine how much of your code is exercised by your unit tests.
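A minimal sketch of what this looks like (the add function is a hypothetical stand-in for one of your file-handling helpers):

import unittest

def add(a, b):
    """A single-responsibility function: it does one thing, so it is easy to test."""
    return a + b

class TestAdd(unittest.TestCase):
    def test_integers(self):
        self.assertEqual(add(2, 3), 5)

    def test_strings_concatenate(self):
        self.assertEqual(add('foo', 'bar'), 'foobar')

if __name__ == '__main__':
    unittest.main()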
I hope this helps.
I use IPython Notebooks extensively in my research. I find them to be a wonderful tool.
However, on more than one occasion, I have been bitten by subtle bugs stemming from variable scope. For example, I will be doing some exploratory analysis:
foo = 1
bar = 2
foo + bar
And I decide that foo + bar is a useful algorithm for my purposes, so I encapsulate it in a function to make it easier to apply to a wider range of inputs:
def the_function(foo, bar):
    return foo + bar
Inevitably, somewhere down the line, after building a workflow from the ground up, I will have a typo somewhere (e.g. def the_function(fooo, bar):) that causes a global variable to be used (and/or modified) in a function call. This causes unseen side effects and leads to spurious results. But because the call typically still returns a result, it can be difficult to find where the problem actually occurs.
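For instance, a minimal reproduction of the failure mode:

foo = 1  # global left over from exploratory work

def the_function(fooo, bar):  # typo: the parameter is 'fooo'
    return foo + bar          # silently reads the global foo instead

print(the_function(100, 2))   # prints 3, not 102 -- and raises no error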
Now, I recognize that this behavior is a feature, which I deliberately use often (for convenience, or for necessity i.e. function closures or decorators). But as I keep running into bugs, I'm thinking I need a better strategy for avoiding such problems (current strategy = "be careful").
For example, one strategy might be to always prepend '_' to local variable names. But I'm curious whether there are other strategies -- even "pythonic" or community-encouraged strategies.
I know that python 2.x differs in some regards to python 3.x in scoping - I use python 3.x.
Also, strategies should consider the interactive nature of scientific computing, as would be used in an IPython Notebook venue.
Thoughts?
EDIT: To be more specific, I am looking for IPython Notebook strategies.
I was tempted to flag this question as too broad, but perhaps the following will help you.
When you decide to wrap some useful code in a function, write some tests. If you think the code is useful, you must have used it with some examples. Write the test first lest you 'forget'.
My personal policy for a library module is to run the tests in an if __name__ == '__main__': block, whether the test code is in the same file or a different file. I also execute the file to run the tests multiple times during a programming session, after every small unit of change (trivial in IDLE or a similar IDE).
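A minimal sketch of that policy, using the function from the question:

def the_function(foo, bar):
    """Return the sum of foo and bar."""
    return foo + bar

if __name__ == '__main__':
    # Quick self-tests, re-run after every small change.
    assert the_function(1, 2) == 3
    assert the_function(0, 0) == 0
    print('All tests passed.')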
Use a code checker program, which will catch some typo-based errors, e.g. "'fooo' set but never used".
Keep track of the particular kinds of errors you make, analyze them and think about personal countermeasures, or at least learn to recognize the symptoms.
Looking at your example, when you do write a function, don't use the same names for both global objects and parameters. In your example, delete or change the global 'foo' and 'bar' or use something else for parameter names.
I would suggest that you separate your concerns. For your exploratory analysis, write your code in the IPython notebook; but when you've decided that some functions are useful, open up an editor and put those functions into a Python file which you can then import.
You can use IPython magics to auto-reload things you've imported. So once you've tested the functions in IPython, you can simply copy them to your module. This way, the scope of your functions is isolated from your notebook. An additional advantage is that when you're ready to run things in a headless environment, you already have your entire codebase in one place.
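Concretely, the auto-reload setup looks like this (my_helpers is a hypothetical module sitting next to the notebook):

# In the first notebook cell: reload edited modules automatically before
# running code, so changes saved to my_helpers.py are picked up immediately.
%load_ext autoreload
%autoreload 2

from my_helpers import the_function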
In the end, I made my own solution to the problem. It builds on both answers given so far.
You can find my solution, which is a cell magic extension, on github: https://github.com/brazilbean/modulemagic
In brief, this extension gives you the ability to create %%module cells in the notebook. These cells are saved as a file and imported back into your session. It effectively accomplishes what @shadanan had suggested, but allows you to keep all your work in the same place (convenient, and in line with the Notebook philosophy of providing code and results in the same place).
Because the import process sandboxes the code, it solves all of the scope-shadowing errors that motivated my original question. It also involves little to no overhead to use -- no renaming of variables, no other editors to keep open, etc.
This question is semi-based on this one:
How can you profile a python script?
I thought it would be a great idea to run this on some of my programs. Although profiling from a batch file, as explained in the aforementioned answer, is possible, I think it would be even better to have this option in Eclipse. At the same time, wouldn't making my entire program a function and profiling it mean I have to alter the source code?
How can I configure eclipse such that I have the ability to run the profile command on my existing programs?
Any tips or suggestions are welcome!
If you follow the common Python idiom of making all your code, even the "existing programs", importable as modules, you can do exactly what you describe without any additional hassle.
Here is the specific idiom I am talking about. It turns your program's flow "upside down", since the __name__ == '__main__' check is placed at the bottom of the file, once all your defs are done:
# program.py file

def foo():
    """ analogous to a main(). do something here """
    pass

# ... fill in rest of function def's here ...

# here is where the code execution and control flow will
# actually originate for your code, when program.py is
# invoked as a program. a very common Pythonism...
if __name__ == '__main__':
    foo()
In my experience, it is quite easy to retrofit any existing scripts you have to follow this form, probably a couple minutes at most.
Since there are other benefits to having your program also be a module, you'll find most Python scripts out there actually do it this way. One benefit of this structure: anything Python you write is potentially usable in module form, including cProfile-ing your foo().
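For example, once program.py is importable, profiling it from another file (or from an Eclipse run configuration) takes just a couple of lines:

import cProfile
import program  # the program.py module from the example above

# Profile the main entry point and sort the report by cumulative time.
cProfile.run('program.foo()', sort='cumulative')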
You can always make separate modules that do nothing but profile specific parts of your other modules. You can organize modules like these in a separate package; that way you don't change your existing code.
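A minimal sketch of such a dedicated profiling module (the package layout and names are hypothetical):

# profiling/profile_program.py -- lives apart from the code being measured,
# so the existing modules stay untouched.
import cProfile
import pstats

import program  # the existing module to profile

profiler = cProfile.Profile()
profiler.enable()
program.foo()
profiler.disable()

# Print the ten most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)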