PyCharm's console hangs with "big" objects

I'm using PyCharm, and typically run parts of my scripts in its Python console for debugging purposes. However, when I have to run something on a "big" variable (that consumes a lot of memory), the console becomes very slow.
Say df is a huge pandas DataFrame: as soon as I type df. in the console, it stops responding for 10-15 seconds. I can't tell whether this is pandas-specific, since the only "big" variables I use come from pandas.
I'm running PyCharm Community Edition 3.4.1, pandas 0.14, Python 2.7.3, on Mac OS X 10.9.4 (with 8 GB of RAM).
Size of df:
In[94]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[94]: 2229198184
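As an aside, newer pandas versions (later than the 0.14 used here, as far as I recall the release history) can report this directly via DataFrame.memory_usage. A minimal sketch with an illustrative frame:

import pandas as pd

df = pd.DataFrame({"a": range(1000)})    # stand-in for the real frame
print(df.memory_usage(deep=True).sum())  # total bytes, including the index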

I've been having the same problem and have yet to find a proper solution, but here is a workaround I've been using, from PyCharm hanging for a long time in iPython console with big data.
Instead of typing df.columns, type d.columns and then go back and add the f. This at least prevents the hanging, although it renders the auto-completion somewhat useless.

Have you tried changing the "Variable Loading Policy" to "On Demand"?

Related

Migrating Python code away from Jupyter and over to PyCharm

I'd like to take advantage of a number of features in PyCharm, hence I'm looking to port code over from my notebooks. I've installed everything, but am now faced with issues such as:
The display function appears to fail, hence DataFrame outputs (I used print instead) are not so nicely formatted. Is there an equivalent function?
I'd like to replicate the n code cells of a Jupyter notebook. The Jupyter code is split over 9 cells in the one Jupyter file, and Shift+Enter is an easy way to check outputs and then move on. Now I've had to place all the code in one project/Python file and have 1200 lines of code. Is there a way to section the code like it is in Jupyter? My VBA background envisions 9 routines and one additional calling routine to get the same result.
Each block of code imports data from SQL Server and some flat files, so there is some validation in between running them. I was hoping there was an alternative to manually selecting large chunks of code and executing them, and/or setting breakpoints every time it's run.
Any thoughts/links would be appreciated. I spent some $$ on a Udemy PyCharm course, but it does not help me with this one.
Peter
The migration part is solved in this question: convert json ipython notebook(.ipynb) to .py file, but perhaps you already knew that.
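For reference, that conversion can also be done from the command line with Jupyter's own tooling (assuming nbconvert is installed; the notebook name is a placeholder):

jupyter nbconvert --to script my_notebook.ipynb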
The code-splitting part is harder. One reason why Jupyter is so widespread is the ability to split the output and run each cell separately. I would recommend @Andrew's answer, though.
If you are using classes, put each class in a new file.
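On the cell question: PyCharm's Scientific Mode (Professional edition, so possibly not an option here) and some other editors such as VS Code recognize # %% markers that split an ordinary .py file into runnable cells, which gets close to the Shift+Enter workflow. A minimal sketch with illustrative contents:

# %% Cell 1: expensive initialization, run once
import pandas as pd
df = pd.DataFrame({"value": range(10)})

# %% Cell 2: iterate on this part without re-running Cell 1
print(df.describe())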

Python: Date comparing works in Spyder but not in Console

I have written a little CSV parser based on pandas.
It works like a charm in Spyder 3.
Yesterday I tried to put it into production and run it with a .bat file, like:
python my_parser.py
In the console it doesn't work at all.
Pandas behaves differently: the read_csv method seemed to have lost the "quotechar" keyword argument, for example.
Date comparisons in particular break all the time.
I read the dates with pandas as per
pd.read_csv(my_file, parse_dates=["col3", "col5", "col8"])
Then I try a date calculation by subtracting pd.to_datetime('now').
I tested everything, and as said, in Spyder no failure is thrown, it works and produces results as it should be.
As soon as I start it in the console, it throws type errors.
Most often, one of the two dates is a mere string while the other stays a datetime, so the minus operation fails.
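A defensive pattern worth trying while hunting this down: coerce the column explicitly before subtracting, so any value that was silently left as a string fails loudly. A minimal sketch; the column name and values are illustrative:

import pandas as pd

df = pd.DataFrame({"col3": ["2024-01-01", "2024-06-15"]})  # stand-in for the parsed CSV column
df["col3"] = pd.to_datetime(df["col3"], errors="raise")    # fail loudly if a value cannot be parsed
delta = pd.Timestamp("now") - df["col3"]
print(delta)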
I could now rewrite the code and find a procedure that works in both Spyder and the console.
However, I prefer to ask you guys here:
What could be a possible reason that Spyder and console Python behave completely differently from each other?
It's really annoying to debug code that does not throw any failures, so I would really like to understand the cause.
The problem was related to having several Python installations on my PC. After removing all of them and installing a single instance, it was working well. Thanks for the tip, Carlos Cordoba!
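For anyone hitting the same symptom, a quick check that two environments really run the same interpreter and the same pandas is to run this snippet in both Spyder and the console and compare the output:

import sys
import pandas as pd

print(sys.executable)   # which interpreter is actually running
print(sys.version)
print(pd.__version__)   # which pandas each environment imports
print(pd.__file__)      # and where it was imported from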

Alternative workflow to using jupyter notebook (aka how to avoid repetitive initialization delay)?

Usually the main reason I'm using Jupyter notebook with Python is the possibility to initialize, once and only once, objects (or generally "data") that tend to have long loading times (let's say more than 30 seconds). When my work is iterative, i.e. I run minimally changed versions of some algorithm multiple times, the accumulated cost of repeated initialization can get large by the end of the day.
I'm seeking an alternative approach (one that avoids the cost of repeated initialization without using a notebook) for the following reasons:
No "out of the box" version control when using notebook.
Occasional problems of "I forgot to rename variable in a single place". Everything keeps working OK until the notebook is restarted.
Usually I want to have usable python module at the end anyway.
Somehow when using a notebook I tend to get code that if far from "clean" (I guess this is more self discipline problem...).
Ideal workflow should allow to perform whole development inside IDE (e.g. pyCharm; BTW linux is the only option). Any ideas?
I'm thinking of implementing a simple (local) execution server that keeps the problematic objects pre-initialized as global variables and runs code on demand (code that uses those globals instead of performing its own initialization) by spawning a new process each time. This way the objects are protected from modification, and at the same time, because they are globals in the server, there is no pickle/unpickle penalty when spawning the new process.
But before I start implementing this - maybe there is some working solution or workflow already known?
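For what it's worth, here is a minimal sketch of the preload-server idea above, assuming Linux (as stated) and using a Unix-domain socket plus a fork per request, so each child inherits the preloaded globals copy-on-write with no pickling. All names, paths, and the init function are placeholders:

import os
import signal
import socket
import time

def load_big_object():
    time.sleep(1)                  # stands in for the real 30s+ initialization
    return list(range(10**6))

BIG = load_big_object()            # loaded once, lives in the server process

signal.signal(signal.SIGCHLD, signal.SIG_IGN)  # don't leave zombie children

SOCK = "/tmp/preload.sock"         # hypothetical socket path
if os.path.exists(SOCK):
    os.unlink(SOCK)

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(SOCK)
server.listen(1)

while True:
    conn, _ = server.accept()
    script = conn.makefile().read()    # client sends the code to run, then closes
    if os.fork() == 0:                 # child inherits BIG copy-on-write, no pickling
        try:
            exec(script, {"BIG": BIG})  # the script sees the preloaded data as BIG
        finally:
            conn.close()
            os._exit(0)
    conn.close()

A client can then pipe a script into the socket, e.g. echo 'print(len(BIG))' | nc -U /tmp/preload.sock.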
Visual Studio Code + the Python extension works fine (on both Windows and Mac; not sure about Linux). It is very fast and lightweight, with Git integration, debugging, refactoring, etc.
There is also an IDE called Spyder that is more Python-specific. It also works fine but is more heavyweight.

Python crashes in rare cases when running code - how to debug?

I have a problem that I have seriously spent months on now!
Essentially I am running code that needs to read from and save to HDF5 files. I am using h5py for this.
It's very hard to debug because the problem (whatever it is) only occurs in about 5% of cases (each run takes several hours), and when it happens it crashes Python completely, so debugging with Python itself is impossible. Using simple logs it's also impossible to pinpoint the exact crashing situation - it appears to be very random, crashing at different points within the code, or with a lag.
I tried using OllyDbg to figure out what's happening and can safely conclude that it consistently crashes at the following location: http://i.imgur.com/c4X5W.png
It seems to crash shortly after calling the Python-native PyObject_ClearWeakRefs, with an access violation error message. The weird thing is that the file is successfully written to. What would cause the access violation error? Or is that Python-internal (e.g. the stack?) and not related to the file (i.e. my code)?
Does anyone have an idea what's happening here? If not, is there a smarter way of finding out what exactly is happening? Maybe some hidden Python logs or something I don't know about?
Thank you
PyObject_ClearWeakRefs is in the Python interpreter itself. But if the crash only happens in a small number of runs, it could be hardware-related. Things you could try:
Run your program on a different machine. If it doesn't crash there, it is probably a hardware issue.
Reinstall Python, in case the installed version has somehow become corrupted.
Run a memory test program.
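One more diagnostic that may help here: the faulthandler module (built into Python 3.3+, and available as a PyPI backport for 2.7) can dump the Python-level traceback when the interpreter dies from a segfault or access violation. A minimal sketch:

import faulthandler

log = open("crash.log", "w")   # keep the handle open for the lifetime of the run
faulthandler.enable(file=log)  # on a hard crash, the Python traceback is dumped here
# ... run the long h5py job ...

On Python 3 the same can be enabled without code changes by running python -X faulthandler my_script.py.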
Thanks for all the answers. I ran two versions this time: one with a fresh Python install and the same program, and another on my original computer/install but with all HDF5 read/write procedures replaced by numpy read/write procedures.
The program continued to crash on my second computer at odd times, but on my primary computer I had zero crashes with the changed code. I think it is thus safe to conclude that the problems were HDF5-related, or more specifically h5py-related. It appears that more people have encountered issues with h5py in that respect. Given that any error in my application translates to potentially large financial losses, I decided to drop HDF5 completely in favor of other, more stable solutions.
Use a try/except statement. This can be put into the program to stop it from crashing when erroneous data is entered.
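A sketch of that suggestion using h5py, with the caveat that a hard access violation inside the HDF5 C library kills the process outright and cannot be caught this way; the file name and data are illustrative:

import numpy as np
import h5py

values = np.arange(10)
try:
    with h5py.File("results.h5", "w") as f:   # hypothetical output file
        f.create_dataset("data", data=values)
except (OSError, ValueError) as exc:          # h5py surfaces bad input as ordinary exceptions
    print("write failed:", exc)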

Why is debugging in eclipse/pydev so slow for my python program?

I have a relatively simple (no classes) Python 2.7 program. The first thing the program does is read an SQLite database into a dictionary. The database is large but not huge, around 90 MB on disk, and takes about 20 seconds to read in. After reading in the database I initialize some variables, e.g.
localMax = 0
localMin = 0
firstTime = True
When I debug this program in Eclipse 3.7.0/PyDev - even on these simple lines - each single-step in the debugger eats up 100% of a core and takes between 5 and 10 seconds. I can see the Python process go to 100% CPU for 10 seconds. Single-step... wait 10 seconds... single-step... wait 10 seconds... If I debug at the command line just using pdb, there are no problems. If I'm not debugging at all, the program runs at "normal" speed, nothing strange like in Eclipse.
I've reproduced this on a dual-core Win7 PC with 4 GB of memory, my 8-core Ubuntu box with 8 GB of memory, and even my MacBook Air. How's that for multi-platform development! I kept thinking it would work somewhere. I'm never even close to running out of memory at any time.
On each Eclipse single-step, why does the Python process jump to 100% CPU and take 10 seconds?
Here is a good-enough workaround, based on Mikko Ohtamaa's hint. I just verified the following on my MacBook Air:
If I simply close the 'Variables' window in the Eclipse GUI, I can single-step through the code at normal speed. Which is great, but, uh, then I don't have the Variables window.
For any variable I want to see, I can hover my cursor over it and see its value. I didn't attempt to hover over the large dictionary that is the culprit here.
I can also right-click on any variable and add a 'Watch', which brings up an 'Expressions' window. In this case the variable is just a degenerate (very simple) case of an 'expression'.
So the workaround for me is to close the Eclipse Variables window and use the Expressions window to selectively view variables. A pain, but for the debugging I'm doing it is better than pdb.
I simply commented this line out:
np.set_printoptions(threshold = 'nan')
It seems Eclipse is trying to keep up with too much information.
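If full-array printing is still wanted outside the debugger, a bounded threshold keeps the IDE from rendering millions of elements (the values here are illustrative; 1000 is numpy's default):

import numpy as np

np.set_printoptions(threshold=1000, edgeitems=3)  # summarize large arrays instead of printing every element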
