I am trying to use the netCDF4 package with Python. I am ingesting close to 20 million records of data, 28 bytes each, and then I need to write the data to a netCDF4 file. Yesterday, I tried doing it all at once, and after an hour or so of execution, Python stopped running the code with the very helpful error message:
Killed.
Anyway, doing this with subsections of the data, it becomes apparent that somewhere between 2,560,000 records and 5,120,000 records, the code doesn't have enough memory and has to start swapping. Performance is, of course, greatly reduced. So two questions:
1) Anyone know how to make this work more efficiently? One thing I am thinking is to somehow put subsections of the data in incrementally, instead of doing it all at once. Anyone know how to do that? 2) I presume the "Killed" message happened when memory finally ran out, but I don't know. Can anyone shed any light on this?
Thanks.
Addendum: netCDF4 provides an answer to this problem, which you can see in the answer I have given to my own question. So for the moment, I can move forward. But here's another question: The netCDF4 answer will not work with netCDF3, and netCDF3 is not gone by a long shot. Anyone know how to resolve this problem in the framework of netCDF3? Thanks again.
It's hard to tell what you are doing without seeing code, but you could try calling the Dataset's sync method to flush the data in memory to disk after some amount of data has been written to the file:
http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4.Dataset-class.html
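For example, something along these lines (a rough sketch only; the file name, variable name and batch size are made up for illustration):

from netCDF4 import Dataset
import numpy as np

nc = Dataset('records.nc', 'w')
nc.createDimension('record', None)  # unlimited dimension
var = nc.createVariable('value', 'f8', ('record',))

batch = 100000
for start in range(0, 20000000, batch):
    var[start:start + batch] = np.random.rand(batch)  # stand-in for the real records
    nc.sync()  # flush what has been written so far to disk
nc.close()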
There is a ready answer in netCDF4: declare the netCDF4 variable with a specified "chunksize". I used 10000, and everything proceeded very nicely. As I indicated in the edit to my question, I would like to find a way to resolve this in netCDF3 also, since netCDF3 is far from dead.
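A minimal sketch of what that declaration might look like (file and variable names are made up; chunksizes is a keyword argument of createVariable in netCDF4-python):

from netCDF4 import Dataset

nc = Dataset('records_chunked.nc', 'w')  # netCDF4 format by default
nc.createDimension('record', None)  # unlimited dimension
# chunk the variable in blocks of 10000 records along the unlimited dimension
var = nc.createVariable('value', 'f8', ('record',), chunksizes=(10000,))
nc.close()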
Related
I am running into some problems when collaborating with my teammate on a Python project. There are some differences in how we use file paths. For example, we want to read an xlsx file: pd.read_excel(f'D:\\Financial\\Data\\TRD_Dalyr.xlsx') works on my computer but not on my teammate's, and pd.read_excel('.\Data\TRD_Dalyr.xlsx') works on his computer but not mine. So every time he modifies the file and sends it back to me, I have to fix those paths. Is there a more efficient way to deal with this kind of problem?
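One common approach (a sketch only, assuming the Data folder lives next to the script) is to build the path relative to the script's own location instead of hard-coding a drive letter:

import os
import pandas as pd

# resolve the path relative to this script so it works on both machines
base_dir = os.path.dirname(os.path.abspath(__file__))
data_path = os.path.join(base_dir, 'Data', 'TRD_Dalyr.xlsx')
df = pd.read_excel(data_path)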
scipy.fft seems to hang when running this simple script:
import scipy
from scipy.io import wavfile
sound = 'sounds/silence/iPhone5.wav'
fs, data = wavfile.read(sound)
print scipy.fft(data)
on certain files. Try this file for example.
A few things I noticed:
Running the individual commands from the interactive interpreter does not hang.
Running with other sound files does not always hang the script (it's not just this file that isn't working though)
Sometimes I get WavFileWarning: chunk not understood, but it doesn't seem to be related to when the hang happens.
If I terminate the script with Ctrl+C I get the result as if it never got stuck.
Opening the file with wave or audiolab leads to the same result.
Is this a bug or am I doing something wrong?
Check the value of data.shape for the files that hang up the system. If your data length happens to be a prime number, or the product of several large prime numbers, there isn't much that the FFT algorithm can do to speed up calculation of the DFT. If you pad with zeros, or trim your data to the nearest power of 2, everything should run much, much faster.
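As a rough sketch (using the file path from the question; it keeps only one channel for simplicity), padding to the next power of two might look like this:

import numpy as np
from scipy.io import wavfile

fs, data = wavfile.read('sounds/silence/iPhone5.wav')  # path from the question
if data.ndim > 1:
    data = data[:, 0]  # keep one channel for simplicity
n_fast = 1 << (len(data) - 1).bit_length()  # next power of two >= len(data)
padded = np.zeros(n_fast)
padded[:len(data)] = data
spectrum = np.fft.fft(padded)  # radix-2 FFT, much faster than a prime-length DFT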
This should have been a comment, but there's just not enough space there...
You could do a bit more debugging, which might help a bit.
(Assuming you're on some sort of unix-like OS)
When the program gets stuck, does it idle or use a lot of CPU? You could use "top" or similar to check.
What is the program doing when it appears stuck? Can you get a stack trace? Either using a debugger like gdb or some other tool.
And I guess what really should be step one. Search the net for your symptoms. If it is a bug, it is likely already found and reported. It might even be fixed already.
By looking at a stack trace it should be possible to see if the program is stuck waiting for something, stuck in a loop somewhere or just doing lots of work.
It might also be able to tell you if the problem is in Python code, C extensions or somewhere else. Being used to reading stack traces is of course a plus. :)
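If attaching gdb is inconvenient, one alternative sketch (assuming Python 3.3+ on a Unix-like OS) is to register the standard-library faulthandler module so you can dump every thread's Python stack while the script hangs:

import faulthandler
import signal

# after this, `kill -USR1 <pid>` makes the process print all Python
# thread stacks to stderr without stopping it
faulthandler.register(signal.SIGUSR1)

# ... then run the code that appears to hang ...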
I am using Python, but recently I have been running into memory errors a lot.
One is related to saving the plots in .png format. As soon as I try to save them in .pdf format I don't have this problem anymore. How can I still use .png for multiple files?
Secondly, I am reading quite big data files, and after a while, I run out of memory. I try closing them each time, but perhaps there is still something left open. Is there a way to close all the opened files in Python without having handles to them?
And finally, Python should release all the unused variables, but I think it's not doing so. If I run just one function, I have no problem, but if I run two unrelated functions in a row (after finishing the first and before starting the second, as I understand it, all the variables should be released), I run into the memory error again during the second one. Therefore I believe the variables are not released after the first run. How can I force Python to release all of them? (I don't want to use del, because there are loads of variables and I don't want to specify every single one of them.)
Thanks for your help!
Looking at your code would probably bring more clarity.
You can also try doing
import gc
f() #function that eats lots of memory while executing
gc.collect()
This will call the garbage collector and make sure that all abandoned objects are deleted. If that doesn't solve the problem, take a look at the objgraph library (http://mg.pov.lt/objgraph/objgraph.html) in order to detect what is leaking memory, or to find the places where you've forgotten to remove a reference to a memory-consuming object.
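For instance, a quick check with objgraph (assuming the library is installed) could be:

import objgraph

# list the object types with the most live instances after the
# memory-hungry function has returned
objgraph.show_most_common_types(limit=10)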
Secondly, I am reading quite big data files, and after a while, I run out of memory. I try closing them each time, but perhaps there is still something left open. Is there a way to close all the opened files in Python without having handles to them?
If you use with open(myfile1) as f1: ..., you don't need to worry about closing files or about accidentally leaving files opened.
See here for a good explanation.
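A minimal sketch of that pattern (the file name and the processing step are just placeholders):

with open('myfile1.txt') as f1:
    for line in f1:
        print(line.strip())  # stand-in for whatever processing you do
# f1 is closed automatically here, even if an exception was raised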
As for the other questions, I agree with alex_jordan that it would help if you showed some of your code.
I was given a task to process a large collection of audiofiles. Each file must be processed in four steps:
conversion from .wav to raw PCM,
resampling,
quantization
coding with one of three speech codecs.
Each step corresponds to a program taking a file as input and returning a file as output. Processing the files one by one seems to take a long time. How can I optimize the procedure, e.g. with parallel programming or something? I tried to make use of a ramdisk to reduce the time spent on file reading/writing, but it didn't give any improvement. (Why?)
I'm writing in Python under Ubuntu Linux. Thanks in advance.
Reading and writing to disk is pretty slow. If each program's result is being written to disk, then it would be better to stop that from happening. Sockets seem like a good fit to me. Read more here: http://docs.python.org/library/ipc.html
Parallel programming is nice... I need more info before I can say much more on this topic. I remember reading a while ago that Python doesn't handle threading very efficiently: it mostly just switches between tasks really quickly rather than running them truly in parallel, so threads may not be the best bet for CPU-bound work. This may have changed since I've worked with threading. Extra processes, on the other hand, sound like a good idea (a sketch using multiprocessing is below).
If you need a less-vague answer please supply specifics in your question.
EDIT
The thing I read a while ago about threads looks like this: http://docs.python.org/2/glossary.html#term-global-interpreter-lock
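As a sketch of the multiple-processes idea (the tool name and file pattern are made up; each worker would run the four-step pipeline for one input file):

import glob
import subprocess
from multiprocessing import Pool

def run_pipeline(wav_path):
    pcm_path = wav_path.replace('.wav', '.pcm')
    # hypothetical external tool; substitute the real conversion command here,
    # then the resampling, quantization and coding steps
    subprocess.check_call(['wav_to_pcm', wav_path, pcm_path])
    return pcm_path

if __name__ == '__main__':
    wav_files = glob.glob('audio/*.wav')
    pool = Pool()  # one worker process per CPU core by default
    pool.map(run_pipeline, wav_files)
    pool.close()
    pool.join()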
I have a problem that I seriously spent months on now!
Essentially I am running code that requires reading from and saving to HDF5 files. I am using h5py for this.
It's very hard to debug because the problem (whatever it is) only occurs in about 5% of the cases (each run takes several hours), and when it gets there it crashes Python completely, so debugging with Python itself is impossible. Using simple logs it's also impossible to pinpoint the exact crashing situation - it appears to be very random, crashing at different points within the code, or with a lag.
I tried using OllyDbg to figure out whats happening and can safely conclude that it consistently crashes at the following location: http://i.imgur.com/c4X5W.png
It seems to happen shortly after calling the Python-native PyObject_ClearWeakRefs, with an access violation error message. The weird thing is that the file is successfully written to. What would cause the access violation error? Or is that Python-internal (e.g. the stack?) and not file (i.e. my code) related?
Does anyone have an idea what's happening here? If not, is there a smarter way of finding out what exactly is happening? Maybe some hidden Python logs or something I don't know about?
Thank you
PyObject_ClearWeakRefs is in the Python interpreter itself. But if it only happens in a small number of runs, it could be hardware related. Things you could try:
Run your program on a different machine. If it doesn't crash there, it is probably a hardware issue.
Reinstall python, in case the installed version has somehow become corrupted.
Run a memory test program.
Thanks for all the answers. I ran two versions this time: one with a new Python install and the same program, and another on my original computer/install, but replacing all HDF5 read/write procedures with numpy read/write procedures.
The program continued to crash on my second computer at odd times, but on my primary computer I had zero crashes with the changed code. I think it is therefore safe to conclude that the problems were HDF5, or more specifically h5py, related. It appears that other people have encountered issues with h5py in that respect. Given that any error in my application translates to potentially large financial losses, I decided to dump HDF5 completely in favor of other, more stable solutions.
Use a try/except statement. This can be put into the program in order to stop the program from crashing when erroneous data is encountered.
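For example, a minimal sketch (the records and the conversion are just placeholders; note that try/except only catches Python-level exceptions, not a hard crash inside a C extension):

records = [1, 'bad', 3]
results = []
for rec in records:
    try:
        results.append(float(rec))  # stand-in for the real processing step
    except ValueError as exc:
        print('skipping record %r: %s' % (rec, exc))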