When you use Jupyter, you get these "numbered inputs & outputs" you can reference like this: _3. I'm the kind of guy who uses Jupyter like a nicer REPL with persistent code blocks & comments.
As time goes on, in long sessions, these cached outputs start eating up memory, and then I have to restart the notebook and start all over.
I have NEVER in my life needed to reference these numbered variables (OK, maybe twice; nothing I could not live without), so my question is: is there a way to disable them? Just to be clear: I still want to see my HTML-ified DataFrame, but I don't want the actual df to be saved into a variable named e.g. _143.
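A minimal sketch of one possible approach, assuming a reasonably recent IPython: the InteractiveShell.cache_size option controls this output cache, and setting it to 0 is documented to disable caching entirely (results should still be displayed, just not stored in _N variables).
# In ~/.ipython/profile_default/ipython_config.py (the exact path depends on your profile)
c = get_config()

# 0 disables the Out[] / _N output cache completely,
# so cell results are still shown but no longer kept alive in memory
c.InteractiveShell.cache_size = 0
In an already running session, the %reset out magic should clear whatever is currently stored in the output cache without touching your own variables.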
Background:
I have a very long Jupyter notebook storing a lot of large NumPy arrays.
As I use it for documenting a project, the notebook consists of several independent blocks and one import block (necessary for all the other blocks). The notebook gets very slow after many cells have been calculated, so I want to find a way to speed things up. The question below seems like the most solid and convenient solution to me at the moment, but I am open to other ideas.
My Question:
Is there a convenient way to define independent blocks of a Jupyter notebook and execute them separately from each other with just a few clicks?
Ideas I had so far:
Always put the latest block at the top of my notebook (after the import statements) and write a raise statement at the end of it to prevent the execution of further blocks: this is somewhat messy, and I cannot execute blocks further down in the document with just a few clicks.
Split the notebook into separate notebook documents: this helps, but I want to keep a better overview of my work.
Delete all variables that were used in the current block after its execution (see the sketch after this list): for whatever reason, this did not bring a considerable speedup. Is it possible that I did something wrong here?
Start the browser I use for the notebook with some nice value (I am using Linux): this does not improve the performance of the notebook itself, but at least the computer keeps running fast and I can do something else on it while waiting for the notebook.
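A minimal sketch of what I mean by deleting a block's variables (the array here is just a stand-in for a block's large intermediate result); note that IPython's own Out[] output cache can also keep references alive if a large object was ever shown as a cell output, which may be one reason explicit deletion does not free as much memory as expected:
import gc
import numpy as np

big_array = np.ones((5_000, 5_000))   # stand-in for a block's large intermediate result
# ... work with big_array ...
del big_array    # drop our own reference to the array
gc.collect()     # ask Python to reclaim unreachable objects right away
# In a notebook, %reset out additionally clears the Out[] cache in case the
# object was ever displayed as a cell output.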
The workaround I will end up with, if I don't find a better solution here, is to define variables
actBlock1=False
actBlock2=True
actBlock3=False
and put if statements in all cells of a block (sketched below). But I would prefer something that produces fewer unnecessary ifs and indents, to keep my work clean.
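For reference, a minimal sketch of that guard pattern (the flags are the ones defined above; the work inside the guard is a placeholder):
actBlock1 = False
actBlock2 = True
actBlock3 = False

if actBlock2:
    # everything belonging to block 2 sits inside this guard,
    # so the cell is effectively skipped when actBlock2 is False
    result = sum(range(1_000_000))   # placeholder for the block's real work
    print(result)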
Thank you very much in advance,
You can take a look at the Jupyter Notebook Extensions package and, in particular, at the Freeze extension. It allows you to mark cells as "frozen", which means they cannot be executed (until you "unfreeze" them, that is).
For example, frozen cells are shaded blue (you can toggle this with the asterisk button in the toolbar), and after clicking "Run All" only the non-frozen cells are executed.
Imagine that you are working with a large dataset distributed over a bunch of CSV files. You open an IPython notebook and explore stuff, do some transformations, and reorder and clean up the data.
Then you start doing some experiments with the data, create some more notebooks, and in the end find yourself with a heap of different notebooks that have data transformation pipelines buried in them.
How do you organize the data exploration/transformation/learning process in such a way that:
complexity doesn't blow up, but grows gradually;
your codebase stays manageable and navigable;
you are able to reproduce and adjust data transformation pipelines?
Well, I have this problem now and then when working with a big set of data. Complexity is something I learned to live with; sometimes it's hard to keep things simple.
What I think helps me a lot is putting everything in a Git repository: if you manage it well and make frequent commits with well-written messages, you can track the transformations to your data easily.
Every time I run some experiment, I create a new branch and do my work on it. If it goes nowhere, I just go back to my master branch and keep working from there, but the work I did is still available for reference if I need it.
If it leads to something useful, I just merge it into my master branch and keep working on new experiments, making new branches as needed.
I don't think this answers all of your questions, and I don't know whether you already use some sort of version control for your notebooks, but it is something that helps me a lot and I really recommend it when working with Jupyter notebooks.
Usually the main reason I'm using a Jupyter notebook with Python is the possibility to initialize once (and only once) objects (or generally "data") that tend to have long loading times (let's say more than 30 seconds). When my work is iterative, i.e. I run minimally changed versions of some algorithm multiple times, the accumulated cost of repeated initialization can get large by the end of a day.
I'm seeking an alternative approach (one that avoids the cost of repeated initialization without using a notebook) for the following reasons:
No "out of the box" version control when using notebook.
Occasional problems of "I forgot to rename variable in a single place". Everything keeps working OK until the notebook is restarted.
Usually I want to have usable python module at the end anyway.
Somehow when using a notebook I tend to get code that if far from "clean" (I guess this is more self discipline problem...).
Ideal workflow should allow to perform whole development inside IDE (e.g. pyCharm; BTW linux is the only option). Any ideas?
I'm thinking of implementing a simple (local) execution server that keeps the problematic objects pre-initialized as global variables and runs code on demand (code that uses those globals instead of performing the initialization) by spawning a new process each time. This way the objects are protected from modification, and at the same time, because the variables are global, there is no pickle/unpickle penalty when spawning a new process.
But before I start implementing this: maybe there is already a known working solution or workflow for this?
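A minimal sketch of the mechanism that idea relies on, assuming Linux and the fork start method (BIG_DATA and run_experiment are placeholder names): the parent process initializes the expensive object once as a global, and forked worker processes see it through copy-on-write memory, so nothing is pickled when a worker starts and changes made in a worker never touch the parent's copy.
import multiprocessing as mp

BIG_DATA = None  # will hold the expensive-to-initialize object

def run_experiment(scale):
    # The forked worker reads BIG_DATA via copy-on-write memory,
    # so the object is never pickled when the process is started.
    return scale * sum(BIG_DATA[:100])

if __name__ == "__main__":
    mp.set_start_method("fork")            # the default on Linux
    BIG_DATA = list(range(1_000_000))      # stand-in for the slow initialization
    with mp.Pool(processes=2) as pool:     # workers inherit BIG_DATA at fork time
        print(pool.map(run_experiment, [1, 2, 3]))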
Visual Studio Code + the Python extension works fine (both Windows and Mac, not sure about Linux). It is very fast and lightweight, with Git integration, debugging, refactoring, etc.
There is also an IDE called Spyder that is more Python-specific. It also works fine, but is more heavyweight.
I've been searching everywhere for an answer to this, but to no avail. I want to be able to run my code and have the variables stored in memory, so that I can set a "checkpoint" that I can run from in the future. The reason is that I have a fairly expensive function that takes some time to compute (and requires user input), and it would be nice if I didn't have to wait for it to finish every time I rerun after changing something downstream.
I'm sure a feature like this exists in PyCharm, but I have no idea what it's called, and the documentation isn't very clear to me at my level of experience. It would save me a lot of time if someone could point me in the right direction.
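For concreteness, a minimal sketch of one way to approximate such a checkpoint by caching an expensive result to disk (expensive_step and the cache file name are placeholders, and this assumes the result is picklable):
import os
import pickle

CACHE_FILE = "expensive_result.pkl"       # hypothetical cache location

def expensive_step():
    # placeholder for the slow computation that should only run once
    return sum(i * i for i in range(10_000_000))

if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, "rb") as f:
        result = pickle.load(f)           # restore the "checkpoint"
else:
    result = expensive_step()
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(result, f)            # save the "checkpoint" for future runs

# downstream code can now change and be rerun without recomputing expensive_step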
Turns out this is (more or less) possible by using the PyCharm console. I guess I should have realized this earlier because it seems so simple now (though I've never used a console in my life so I guess I should learn).
Anyway, the console lets you run blocks of your code, provided the required variables, functions, libraries, etc. have been defined beforehand. You can actually highlight a block of your code in the PyCharm editor, right-click, and select "Run in console" to execute it.
This feature is not implemented in PyCharm (see the PyCharm forum) but seems to be implemented in Spyder.
I am using Python, but recently I have been running into memory errors a lot.
One is related to saving plots in .png format. As soon as I save them in .pdf format instead, I don't have this problem anymore. How can I still use .png for multiple files?
Secondly, I am reading quite big data files, and after a while I run out of memory. I try closing them each time, but perhaps something is still left open. Is there a way to close all the opened files in Python without having handles to them?
And finally, Python should release all unused variables, but I think it's not doing so. If I run just one function I have no problem, but if I run two unrelated functions in a row (after finishing the first and before starting the second, all the variables should, to my understanding, be released), I run into the memory error again during the second one. Therefore I believe the variables are not released after the first run. How can I force Python to release all of them? (I don't want to use del, because there are loads of variables and I don't want to specify every single one of them.)
Thanks for your help!
Looking at the code would probably bring more clarity.
You can also try the following:
import gc
f()  # function that eats lots of memory while executing
gc.collect()
This will call the garbage collector, so you can be sure that all abandoned objects are deleted. If that doesn't solve the problem, take a look at the objgraph library (http://mg.pov.lt/objgraph/objgraph.html) to detect what is leaking memory, or to find the places where you've forgotten to remove a reference to a memory-consuming object.
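For example, a quick way to get an overview of what is keeping memory alive (a minimal sketch using two objgraph helpers, assuming objgraph has been installed with pip):
import objgraph

objgraph.show_most_common_types(limit=10)  # print the most common live object types and their counts
objgraph.show_growth()                     # show which object types have grown since the last call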
As for the second question ("Is there a way to close all the opened files in Python without having handles to them?"):
If you use with open(myfile1) as f1: ..., you don't need to worry about closing files or about accidentally leaving them open.
See here for a good explanation.
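For instance, a minimal sketch (the file name is a placeholder): the file is guaranteed to be closed when the with block exits, even if an exception is raised inside it.
with open("data_part1.csv") as f1:              # hypothetical file name
    for line in f1:
        fields = line.rstrip("\n").split(",")   # stand-in for the real per-line work
# At this point f1 is closed automatically, even if an exception occurred inside the block.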
As for the other questions, I agree with alex_jordan that it would help if you showed some of your code.