Merge CSV into HDF5 in pandas leads to crash - python

I have about 700 CSV files. Each is typically a few MB and a few thousand rows, so the whole folder is ~1 GB. I want to merge them into a single HDF5 file.
I first defined a function read_file(file) that reads a single file, parses it with pd.read_csv(), and returns a DataFrame.
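The question doesn't include read_file() itself; a minimal sketch of what such a helper might look like, assuming nothing beyond pd.read_csv (the real version presumably has log-specific parsing options):

import pandas as pd

def read_file(file):
    # Hypothetical stand-in for the poster's helper: parse one CSV into a DataFrame.
    return pd.read_csv(file)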
I then use this code to convert:
for file in files:
    print(file + " Num: " + str(file_num) + " of: " + str(len(files)))
    file_num = file_num + 1
    in_pd = read_file(file)
    in_pd.to_hdf('AllFlightLogs.h5', 'flights', mode='a', append=True)
It works just fine for about 202 files, and then Python crashes with: Abort trap: 6
I don't know what this error means. I have also seen it pop up a window showing a stack error.
I have tried using complib='lzo', and that doesn't seem to make any difference. I have tried saving to a different HDF5 file every 100 reads, and that does change the exact number of files before the crash, but it still happens.
There doesn't seem to be anything special about that particular file. Is there any way to find out anything else about this particular error? I know that the crash happens when I call in_pd.to_hdf() (I added print statements before and after).
I am running on a Mac and using pandas 0.16.2.
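As an aside that is not part of the original post: one way to get more detail out of a hard crash such as Abort trap: 6 is the faulthandler module (built into Python 3.3+, available as a backport package for Python 2), which prints a Python-level traceback when the process receives a fatal signal:

import faulthandler

faulthandler.enable()  # dump a Python traceback on SIGSEGV, SIGABRT, SIGBUS, etc.

# ... then run the conversion loop as before ...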

I upgraded PyTables to 3.2.1 and that seems to have fixed it. So it was not a problem with my code (which was driving me crazy) but a PyTables problem.
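For anyone checking whether they are on the affected versions, a quick sketch (not from the original answer) to print what is actually installed:

import pandas as pd
import tables  # PyTables, the backend used by to_hdf

print(pd.__version__)      # 0.16.2 in the question above
print(tables.__version__)  # upgrading this to 3.2.1 resolved the crash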

Adam's answer solved my problem on my iMac.
But as of 1 Sep 2015, while that version of PyTables is available for Linux and OS X, it is still not available for Windows - I use the Anaconda distribution (very good in every other respect). Does anybody know why? Is there a specific reason for that?

Related

When I copy and paste multiple chunks of Python code into iTerm, it often is not read correctly unless I repeat the paste several times - how can I fix this?

When I copy and paste Python code for evaluation into iTerm or the Mac Terminal, it often comes up with an error if I am pasting in many functions. I have to repeat the process, often many times, until it reads correctly.
I suspect this is related to the speed at which iTerm reads the pasted lines. Is there a reason for this problem, and is there a way for me to fix it? Thanks

Migrating Python code away from Jupyter and over to PyCharm

I'd like to take advantage of a number of features in PyCharm, so I'm looking to port code over from my notebooks. I've installed everything but am now faced with issues such as:
The display() function appears to fail, so DataFrame outputs (I used print) are not so nicely formatted. Is there an equivalent function?
I'd like to replicate the n code cells of a Jupyter notebook. The Jupyter code is split over 9 cells in the one Jupyter file, and Shift+Enter is an easy way to check outputs and then move on. Now I've had to place all the code in one project/Python file and have 1200 lines of code. Is there a way to section the code like it is in Jupyter? My VBA background envisions 9 routines and one additional calling routine to get the same result.
Each block of code imports data from SQL Server and some flat files, so there is some validation in between running them. I was hoping there was an alternative to manually selecting and executing large chunks of code and/or setting breakpoints every time it's run.
Any thoughts/links would be appreciated. I spent some $$ on a Udemy PyCharm course but it does not help me with this one.
Peter
The migration part is solved in this question: convert json ipython notebook(.ipynb) to .py file, but perhaps you already knew that.
The code-splitting part is harder. One reason Jupyter is so widely used is the ability to split the output and run each cell separately. I would recommend #Andrew's answer, though.
If you are using classes, put each class in a new file.
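Not part of the original answers, but a sketch of how both issues are commonly handled when a notebook is moved into a plain .py file: PyCharm's scientific mode (like VS Code and Spyder) treats comments starting with "# %%" as runnable cells, and DataFrame.to_string() gives readable console output where IPython's display() is unavailable. The DataFrame contents below are placeholders:

# %% Cell 1: load data (placeholder for the SQL Server / flat-file imports)
import pandas as pd

df = pd.DataFrame({"flight": ["A1", "B2"], "hours": [3.5, 4.2]})

# %% Cell 2: inspect the result
# In Jupyter this would be display(df); in a console, to_string() keeps the layout readable.
print(df.to_string(index=False))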

Python: Date comparing works in Spyder but not in Console

I have written a little CSV parser based on pandas.
It works like a charm in Spyder 3.
Yesterday I tried to put it into production and run it with a .bat file, like:
python my_parser.py
In the console it doesn't work at all.
Pandas behaves differently: the read_csv method has lost the "quotechar" keyword argument, for example.
Date comparisons, especially, break all the time.
I read the dates with pandas as per
pd.read_csv(parse_dates=[col3, col5, col8])
Then I try a date calculation by subtracting pd.to_datetime('now').
I tested everything, and as said, in Spyder no error is thrown; it works and produces results as it should.
As soon as I start it in the console, it throws type errors.
Most often, one of the two dates is a mere string while the other stays a datetime, so the minus operation fails.
I could now rewrite the code and find a procedure that works in both Spyder and the console.
However, I'd prefer to ask you guys here:
What could be a possible reason that Spyder and the console Python behave completely differently from each other?
It's really annoying to debug code that does not throw any errors, so I would really like to understand the cause.
The problem was related to having several Python installations on my PC. After removing all of them and installing a single instance, it worked well. Thanks for the tip, Carlos Cordoba!
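A small sketch (not from the original thread) of how to confirm which interpreter and pandas a script actually runs under - the usual way to spot this kind of multiple-installation mismatch. Run it once from Spyder and once from the .bat file and compare the output:

import sys
import pandas as pd

print(sys.executable)  # path of the interpreter executing the script
print(pd.__version__)  # pandas version that interpreter sees
print(pd.__file__)     # where that pandas is installed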

Python crashes in rare cases when running code - how to debug?

I have a problem that I have seriously spent months on now!
Essentially I am running code that needs to read from and save to HDF5 files. I am using h5py for this.
It's very hard to debug because the problem (whatever it is) only occurs in about 5% of the cases (each run takes several hours), and when it happens it crashes Python completely, so debugging with Python itself is impossible. Using simple logs it's also impossible to pinpoint the exact crashing situation - it appears to be very random, crashing at different points within the code, or with a lag.
I tried using OllyDbg to figure out what's happening and can safely conclude that it consistently crashes at the following location: http://i.imgur.com/c4X5W.png
It seems to be shortly after calling the Python-native PyObject_ClearWeakRefs, with an access violation error message. The weird thing is that the file is successfully written. What would cause the access violation error? Or is that Python-internal (e.g. the stack?) and not file (i.e. my code) related?
Does anyone have an idea what's happening here? If not, is there a smarter way of finding out what exactly is happening? Maybe some hidden Python logs or something I don't know about?
Thank you
PyObject_ClearWeakRefs is in the Python interpreter itself. But if it only happens in a small number of runs, it could be hardware-related. Things you could try:
Run your program on a different machine. If it doesn't crash there, it is probably a hardware issue.
Reinstall Python, in case the installed version has somehow become corrupted.
Run a memory test program.
Thanks for all the answers. I ran two versions this time: one with a fresh Python install and the same program, and another on my original computer/install but with all HDF5 read/write procedures replaced by numpy read/write procedures.
The program continued to crash on my second computer at odd times, but on my primary computer I had zero crashes with the changed code. I think it is thus safe to conclude that the problems were HDF5, or more specifically h5py, related. It appears that other people have encountered issues with h5py in this respect. Given that any error in my application translates into potentially large financial losses, I decided to drop HDF5 completely in favour of other, stable solutions.
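The update above doesn't show code; a minimal sketch of what swapping an h5py write for a plain numpy write might look like (the file and dataset names here are made up for illustration):

import numpy as np
import h5py

data = np.random.rand(1000, 28)

# h5py version (the approach the update moved away from)
with h5py.File('results.h5', 'w') as f:
    f.create_dataset('results', data=data)

# numpy version (the replacement described in the update)
np.save('results.npy', data)
loaded = np.load('results.npy')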
Use a try/except statement. It can be put into the program to stop it from crashing when erroneous data is entered.

writing large netCDF4 file with python?

I am trying to use the netCDF4 package with Python. I am ingesting close to 20 million records of data, 28 bytes each, and then I need to write the data to a netCDF4 file. Yesterday, I tried doing it all at once, and after an hour or so of execution, Python stopped running the code with the very helpful error message:
Killed.
Anyway, doing this with subsections of the data, it becomes apparent that somewhere between 2,560,000 and 5,120,000 records, the code doesn't have enough memory and has to start swapping. Performance is, of course, greatly reduced. So, two questions:
1) Does anyone know how to make this work more efficiently? One thing I am thinking of is to somehow write subsections of the data incrementally, instead of doing it all at once. Does anyone know how to do that? 2) I presume the "Killed" message happened when memory finally ran out, but I don't know. Can anyone shed any light on this?
Thanks.
Addendum: netCDF4 provides an answer to this problem, which you can see in the answer I have given to my own question. So for the moment, I can move forward. But here's another question: the netCDF4 answer will not work with netCDF3, and netCDF3 is not gone by a long shot. Does anyone know how to resolve this problem in the framework of netCDF3? Thanks again.
It's hard to tell what you are doing without seeing code, but you could try using the sync command to flush the data in memory to disk after some amount of data has been written to the file:
http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4.Dataset-class.html
There is a ready answer in netCDF4: declare the netCDF4 variable with a specified "chunksizes" value. I used 10000, and everything proceeded very nicely. As I indicated in the addendum to my question, I would also like to find a way to resolve this in netCDF3, since netCDF3 is far from dead.
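A small sketch of the approach described above (variable names, shapes, and the sync interval are made up): create the variable with explicit chunksizes, write the records slice by slice instead of all at once, and call sync() periodically so buffered data is flushed to disk:

from netCDF4 import Dataset
import numpy as np

ds = Dataset('big_file.nc', 'w', format='NETCDF4')
ds.createDimension('record', None)  # unlimited dimension
var = ds.createVariable('values', 'f8', ('record',), chunksizes=(10000,))

block = 10000
for start in range(0, 20000000, block):
    data = np.random.rand(block)      # placeholder for the real ingested records
    var[start:start + block] = data   # write one slice at a time
    if (start // block) % 100 == 0:
        ds.sync()                     # flush buffered data to disk

ds.close()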
