I am trying to work with data from very large netCDF files (~400 GB each). Each file has a few variables, all much larger than the system memory (e.g. 180 GB vs 32 GB RAM). I am trying to use numpy and netCDF4-python to do some operations on these variables by copying a slice at a time and operating on that slice. Unfortunately, it is taking a really long time just to read each slice, which is killing the performance.
For example, one of the variables is an array of shape (500, 500, 450, 300). I want to operate on the slice [:,:,0], so I do the following:
import netCDF4 as nc
f = nc.Dataset('myfile.ncdf','r+')
myvar = f.variables['myvar']
myslice = myvar[:,:,0]
But the last step takes a really long time (~5 min on my system). If, for example, I saved a variable of shape (500, 500, 300) in the netCDF file, then a read operation of the same size takes only a few seconds.
Is there any way I can speed this up? An obvious path would be to transpose the array so that the indices I am selecting come first. But in such a large file this is not possible to do in memory, and attempting it seems even slower given that a simple read already takes so long. What I would like is a quick way to read a slice of a netCDF file, in the fashion of the Fortran interface's get_vara function, or some way of efficiently transposing the array.
You can transpose netCDF variables too large to fit in memory by using the nccopy utility, which is documented here:
http://www.unidata.ucar.edu/netcdf/docs/guide_nccopy.html
The idea is to "rechunk" the file by specifying what shapes of chunks (multidimensional tiles) you want for the variables. You can specify how much memory to use as a buffer and how much to use for chunk caches, but it's not clear how to divide memory optimally between these uses, so you may have to just try some examples and time them. Rather than completely transpose a variable, you probably want to "partially transpose" it, by specifying chunks that have a lot of data along the two big dimensions of your slice and only a few values along the other dimensions.
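For what it's worth, the same rechunking can be sketched from Python with the chunksizes keyword of netCDF4-python's createVariable; the chunk shape, slab size, and the dimension names in the nccopy comment below are only illustrative guesses, not a definitive recipe:

import netCDF4 as nc

src = nc.Dataset('myfile.ncdf', 'r')
dst = nc.Dataset('rechunked.ncdf', 'w')

var = src.variables['myvar']                      # e.g. shape (500, 500, 450, 300)
for name in var.dimensions:
    dst.createDimension(name, len(src.dimensions[name]))

# Small chunks along the dimension you slice on, large chunks along the others.
# Roughly what nccopy would do with something like: nccopy -c x/50,y/50,z/1,t/300
out = dst.createVariable('myvar', var.dtype, var.dimensions,
                         chunksizes=(50, 50, 1, 300))

# One-off copy in slabs along the first (contiguous) axis; pick a slab size
# that fits in RAM, ideally a multiple of the chunk size along that axis.
slab = 50
for i in range(0, var.shape[0], slab):
    out[i:i + slab] = var[i:i + slab]

dst.close()
src.close()

After this one-off pass, reads of the form myvar[:, :, k] only touch the thin chunks along that axis instead of striding through the whole variable.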
This is a comment, not an answer, but I can't comment on the above, sorry.
I understand that you want to process myvar[:,:,i], with i in range(450). In that case, you are going to do something like:
for i in range(450):
    myslice = myvar[:,:,i]
    do_something(myslice)
and the bottleneck is in accessing myslice = myvar[:,:,i]. Have you tried comparing how long it takes to access moreslices = myvar[:,:,0:n]? It would be contiguous data, and maybe you can save time with that. You would choose n as large as your memory affords it, process that block, and then move on to the next chunk of data, moreslices = myvar[:,:,n:2*n], and so on, as in the sketch below.
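Something along these lines is what I mean (n is just an illustrative block size to tune to your RAM, and do_something stands for your per-slice processing):

n = 10   # number of slices per read; tune so one block fits comfortably in memory

for start in range(0, 450, n):
    stop = min(start + n, 450)
    moreslices = myvar[:, :, start:stop]          # one larger, more contiguous read
    for i in range(start, stop):
        do_something(moreslices[:, :, i - start])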
Got a big data-set that I want to shuffle. The entire set won't fit into RAM, so it would be good if I could open several files (e.g. hdf5, numpy) simultaneously, loop through my data chronologically and randomly assign each data point to one of the piles (then afterwards shuffle each pile).
I'm really inexperienced with working with data in Python, so I'm not sure if it's possible to write to files without holding the rest of their contents in RAM (I've been using np.save and savez with little success).
Is this possible in h5py or numpy and, if so, how could I do it?
Memory-mapped files will allow for what you want. They create a numpy array which leaves the data on disk, only loading data as needed. The complete manual page is here. However, the easiest way to use them is by passing the argument mmap_mode='r+' or mmap_mode='w+' in the call to np.load, which leaves the file on disk (see here).
I'd suggest using advanced indexing. If you have data in a one-dimensional array arr, you can index it using a list, so arr[[0, 3, 5]] will give you the 0th, 3rd, and 5th elements of arr. That will make selecting the shuffled versions much easier. Since this would overwrite the data, you'll need to open the files on disk read-only, and create copies (using mmap_mode='w+') to put the shuffled data in.
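A minimal sketch of that idea, assuming the data already lives in a single .npy file; the file names, shape, and batch size are made up, and np.lib.format.open_memmap is just one way to create a writable .npy-backed memmap for the output:

import numpy as np

src = np.load('data.npy', mmap_mode='r')              # read-only, data stays on disk
dst = np.lib.format.open_memmap('shuffled.npy', mode='w+',
                                dtype=src.dtype, shape=src.shape)

perm = np.random.permutation(len(src))                # shuffled row order

batch = 10000                                         # rows held in RAM at once
for start in range(0, len(src), batch):
    idx = perm[start:start + batch]
    dst[start:start + batch] = src[idx]               # advanced indexing pulls only these rows

dst.flush()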
Zarr saves an array on disk in chunks; each chunk is a separate file. Is there a way to access only one chosen chunk (file)?
Can it be determined which chunks are empty without loading the whole array into memory?
I'm not aware of any way to find chunk size except hitting the FS yourself. Zarr abstracts over that. Maybe you'll have to explain what you're up to.
The project I'm currently working on uses Zarr to store meteorological data. We keep the data in a 3 dimensional array of shape (t, x, y). Alongside the data, we have an array of shape (t), effectively a bitmask to record which slots are filled. So when data comes in, we write
data[t] = [...]
ready[t] = 1
So when querying for data we know at what timeslots to expect data, and which slots are empty.
It's possible to see which chunks are filled by looking at the keys method of the underlying chunk_store; only chunks that have actually been written will have a key.
The corresponding values of these keys contain the data of that chunk, but it will be compressed. If you want more than that, I would encourage you to raise an issue over at the Zarr repo.
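For instance, with a zarr v2 array in a DirectoryStore (the path here is hypothetical), something like this should list only the chunks that were actually written:

import zarr

z = zarr.open('example.zarr', mode='r')   # hypothetical path

# Chunk keys look like '0.0.0', '1.0.2', ...; metadata keys start with '.'.
written = [k for k in z.chunk_store.keys() if not k.startswith('.')]
print(written)   # only chunks that hold data appear here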
I don't think there is a general solution for knowing which chunks are initialized for any storage type, but for a DirectoryStore it is possible to list the filesystem to find out which chunks are initialized. This is how zarr does it to compute the nchunks_initialized property.
I suppose you could get some inspiration from there to list all initialized chunks and then compute which slice it corresponds to in the array.
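A small illustration of that, assuming a DirectoryStore at a hypothetical path:

import os
import zarr

z = zarr.open('example.zarr', mode='r')   # hypothetical DirectoryStore path
print(z.nchunks_initialized, 'of', z.nchunks, 'chunks written')

# With a DirectoryStore, every initialized chunk is just a file on disk, so
# listing the directory shows which chunks exist (metadata files start with '.').
print([name for name in os.listdir('example.zarr') if not name.startswith('.')])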
While there is no object for a chunk in zarr, you can compute each chunk's beginning and end along each axis from the array dimensions and chunk dimensions. If you want to load the chunks one by one for efficiency reasons, you can compute their indices and slice the zarr Array to get a numpy array as a working area.
Since I had similar needs, I built some functions as helpers to do just that; you can look them up at https://github.com/maxime915/msi_zarr_analysis/blob/126c1115bd43e8813d2f002673491c6ef25e37db/msi_zarr_analysis/utils/iter_chunks.py if you want some inspiration.
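As a rough, generic illustration of the idea (not taken from the linked helpers; the store path is hypothetical):

import itertools
import zarr

z = zarr.open('example.zarr', mode='r')   # hypothetical path to a zarr array

def chunk_slices(shape, chunks):
    # Yield, for every chunk, the tuple of slices it covers in the array.
    grids = [range(0, size, c) for size, c in zip(shape, chunks)]
    for starts in itertools.product(*grids):
        yield tuple(slice(b, min(b + c, size))
                    for b, size, c in zip(starts, shape, chunks))

for sl in chunk_slices(z.shape, z.chunks):
    block = z[sl]   # each read is aligned to one chunk, so only that file is touched
    # ... process the numpy array `block` here ...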
I am using NumPy memmap to load a small amount of data from various locations throughout a large binary file (memmap'd, reshaped, flipped around, and then around 2000x1000 points loaded from a roughly 2 GB binary file). There are five 2 GB files, each with its own memory-map object.
The memory maps are all created very quickly, and the slice of data from the first several files pulls out very quickly. But then it suddenly stalls on the fourth and fifth file. Memory usage remains low, so it does not appear to be reading the whole file into memory, but I/O access from the process is high. It can easily take ten or fifteen minutes for this to clear, and then everything proceeds as expected. Subsequent access through all of the memory maps is extremely rapid, including loading data that was not previously touched. Memory usage remains low throughout. After closing Python and re-running, the problem does not reoccur until reboot (caching maybe?).
I'm on Windows 10 with Python 2.7. Any thoughts for troubleshooting?
EDIT: There was a request in the comments for the file format and example code. Unfortunately, I cannot provide exact details; however, I can say this much. The file format contains just int16 binary values for a 3D array which can be reshaped as [n1, n2, n3], where n* is the length of each dimension. However, the files are split at 2 GB. So they are loaded in like this:
memmaps = []
for filename in filelist:
    memmaps.append(np.memmap(filename, dtype=np.int16, mode='r'))
    memmaps[-1] = memmaps[-1].reshape([len(memmaps[-1])/n2/n3, n2, n3])
    memmaps[-1] = np.transpose(memmaps[-1], [2,1,0])
This certainly isn't the cleanest code in the world, but it generally works, except for this seemingly random slow down. The user has a slider which allows them to plot a slice from this array as
image = np.zeros([n2, n1], dtype=np.int16)
#####
c = 0
for d in memmaps:
    image[:,c:(c+d.shape[2])] = d[slice,:,:]
    c = c + d.shape[2]
I'm leaving out a lot of detail, but I think this captures the most relevant information.
EDIT 2: Also, I am open to alternative approaches to handling this problem. My end goal is real time interactive plotting of an arbitrary and relatively small chunk of 2D data as an image from a large 3D dataset that may be split across multiple binary files. I'm presently using pyqtgraph with fairly reasonable results, except for this random problem.
I have RAM concerns, and I want to downsize the data I load (with read_stata() you cannot read only a few rows, sadly). Can I change the code below to use only some rows for X and y, but without making a copy? That would, even if only temporarily, defeat the purpose: I want to save memory, not add ever more to my footprint. Or should I downsize the data first (does `reshape` do that without a copy if you specify a smaller size than the original?) and then pick some columns?
data = pd.read_stata('S:/data/controls/notreat.dta')
X = data.iloc[:,1:]
y = data.iloc[:,0]
I feel your pain. Pandas is not a memory-friendly library, and 500 MB can quickly turn into >16 GB, shredding performance.
However, one thing that's worked for me is memmap. You can use memmap to page in numpy arrays and matrices just about as fast as your data bus permits. And as an added benefit, unused pages may be unloaded.
See here for details. With some work, these memmapped numpy arrays can be used to back a pd.Series or a pd.DataFrame without copying. However, you may find that Pandas later copies your data as you proceed. So, my advice: create a memmap file, and stay in numpy-land.
Your other alternative is to use HDFS.
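A minimal sketch of that workflow, assuming the columns are all numeric so they fit in a single array; the output file name and the row count are made up, and the one-off conversion still needs enough RAM to hold the frame once:

import numpy as np
import pandas as pd

# One-off conversion: dump the numeric values to a .npy file.
data = pd.read_stata('S:/data/controls/notreat.dta')
np.save('notreat.npy', data.to_numpy())
del data

# From then on, memory-map the file; basic slicing gives views, not copies.
arr = np.load('notreat.npy', mmap_mode='r')
X = arr[:100000, 1:]   # some rows, all but the first column
y = arr[:100000, 0]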
I am working with .h5 files with little experience.
In a script I wrote, I load data from an .h5 file. The shape of the resulting array is [3584, 3584, 75], where the value 3584 denotes the number of pixels and 75 denotes the number of time frames. Loading the data and printing the shape takes 180 ms; I obtain this time using os.times().
If I now want to look at the data at a specific time frame I use the following piece of code:
data_1 = data[:, :, 1]
The slicing takes up a lot of time (1.76 s). I understand that my 2D array is huge, but at some point I would like to loop over time, which will take very long since I'm performing this slice inside the for loop.
Is there a more effective/less time consuming way of slicing the time frames or handling this type of data?
Thank you!
Note: I'm making assumptions here since I'm unfamiliar with .h5 files and the Python code that accesses them.
I think that what is happening is that when you "load" the array, you're not actually loading an array. Instead, I think that an object is constructed on top of the file. It probably reads in dimensions and information related to how the file is organized, but it doesn't read the whole file.
That object mimics an array so well that when you later perform the slice operation, the normal Python slice operation can be executed, but at that point the actual data is being read. That's why the slice takes so long compared to "loading" all the data.
I arrive at this conclusion because of the following.
If you're reading 75 frames of 3584x3584 pixels, I'm assuming they're uncompressed (H5 seems to be just raw dumps of data), and in that case 75 * 3584 * 3584 = 963,379,200, which is around 918 MB of data. Couple that with you "reading" this in 180 ms, and we get this calculation:
918 MB / 180 ms ≈ 5.1 GB/second reading speed
Note, this number is for 1-byte pixels, which is also unlikely.
This speed thus seems highly unlikely, as even the best SSDs today reach way below 1GB/sec.
It seems much more plausible that an object is just constructed on top of the file and the slice operation incurs the cost of reading at least 1 frame worth of data.
If we divide the speed by 75 to get the per-frame rate, we get 68 MB/sec for 1-byte pixels, and with 24- or 32-bit pixels we get up to 270 MB/sec reading speeds. Much more plausible.
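One way to check this hypothesis, assuming the file is being read with h5py (the file and dataset names below are guesses), is to look at what h5py actually hands back before any pixels are read, and at the dataset's chunk layout:

import h5py

# Hypothetical file and dataset names.
with h5py.File('movie.h5', 'r') as f:
    dset = f['frames']
    print(type(dset), dset.shape, dset.dtype)   # an h5py Dataset: nothing read yet
    print(dset.chunks)                          # chunk layout, if the dataset is chunked

    frame = dset[:, :, 1]                       # the actual disk read happens here
    print(type(frame), frame.nbytes / 1e6, 'MB')

If dset.chunks shows chunks that span many time frames, every [:, :, i] read has to touch a large part of the file, which would also help explain the 1.76 s slice.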