How can I mmap HDF5 data into multiple Python processes?

I am trying to load HDF5 data from a memory cache (memcached) or the network, and then query it (read only) from multiple Python processes, without making a separate copy of the whole data set. Intuitively I would like to mmap the image (as it would appear on disk) into the multiple processes, and then query it from Python.
I am finding this difficult to achieve, hence the question. Pointers/corrections appreciated.
Ideas I have explored so far
pytables - This looks the most promising: it supports a rich interface for querying HDF5 data and, unlike numpy, it seems to work with the data without making a (process-local) copy. It even supports a method File.get_file_image() which would seem to get the file image. What I don't see is how to construct a new File / FileNode from a memory image rather than a disk file.
h5py - Another way to get at HDF5 data; as with pytables, it seems to require a disk file. The option driver='core' looks promising, but I can't see how to pass an existing mmap'd region into it, rather than have it allocate its own.
numpy - A lower-level approach: if I share my raw data via mmap, then I might be able to construct a numpy ndarray which can access this data. But the relevant constructor, ndarray.__new__(buffer=...), says it will copy the data, and numpy views only seem to be constructible from existing ndarrays, not raw buffers.
ctypes - The very lowest-level approach (multiprocessing's Value wrapper could possibly help a little). If I use ctypes directly I can read my mmap'd data without issue, but I would lose all the structural information and help from numpy/pandas/pytables for querying it.
Allocate disk space - I could just allocate a file, write out all the data, and then share it via pytables in all my processes. My understanding is that this would be memory efficient, because pytables doesn't copy (until required) and the processes would share the OS disk cache of the underlying file image. My objection is that it is ugly and brings disk I/O into what I would like to be a purely in-memory system.

I think the situation should be updated now.
If a disk file is desirable, Numpy now has a standard, dedicated ndarray subclass:
numpy.memmap
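For illustration, a minimal sketch of sharing a read-only array between processes with numpy.memmap (the file name, dtype, and shape here are made up):
    import numpy as np

    # Writer process: create the file-backed array once (hypothetical shape/dtype).
    data = np.memmap('shared.dat', dtype=np.float64, mode='w+', shape=(1000, 1000))
    data[:] = np.random.rand(1000, 1000)
    data.flush()

    # Reader processes: map the same file read-only. The OS page cache backs all
    # mappings, so no per-process copy of the whole data set is made.
    view = np.memmap('shared.dat', dtype=np.float64, mode='r', shape=(1000, 1000))
    print(view[0, :5])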
UPDATE:
After looking into the implementation of multiprocessing.sharedctypes (CPython 3.6.2's shared memory block allocation code), I found that it always creates temporary files to be mmap'd, so it is not really a file-less solution.
If only pure RAM-based sharing is needed, someone has demonstrated it with multiprocessing.RawArray:
test of shared memory array / numpy integration
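A minimal sketch of that idea, assuming a fork-based start method (e.g. the Linux default); the array size is arbitrary:
    import multiprocessing as mp
    import numpy as np

    N = 10**6
    raw = mp.RawArray('d', N)          # lock-free shared block of doubles

    def worker(i):
        # Forked children inherit the mapping; reads hit the same physical
        # pages rather than a per-process copy.
        view = np.frombuffer(raw, dtype=np.float64)
        print(i, view[:3])

    if __name__ == '__main__':
        shared = np.frombuffer(raw, dtype=np.float64)   # numpy view, no copy
        shared[:] = np.arange(N)
        procs = [mp.Process(target=worker, args=(i,)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()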

mmap + the core driver w/ H5py for in-memory, read-only access. I submitted a patch for H5py to work with file images a while ago for scenarios like this. Unfortunately it got rejected because upstream didn't want to give users the ability to shoot themselves in the foot; safe buffer management (via the C buffer protocol that Python 2.7 introduced) would have required changes on HDF's side, which I haven't gotten around to. Still, if this is important to you and you are careful and capable of building pyHDF yourself, take a look at the patch/pull request here
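Absent that patch, two partial workarounds exist with stock h5py; a sketch, assuming a reasonably recent h5py (2.9+) that accepts Python file-like objects, and a hypothetical dataset name:
    import io
    import h5py

    # Option 1: open an HDF5 file image held in RAM (e.g. fetched from memcached)
    # via a file-like object. Note each process still holds its own copy of the
    # image bytes unless the buffer itself lives in shared memory.
    image = open('data.h5', 'rb').read()        # stand-in for the cached image
    f = h5py.File(io.BytesIO(image), 'r')
    print(f['/some_dataset'][:10])              # '/some_dataset' is made up

    # Option 2: the core (in-memory) driver; h5py loads the whole file into this
    # process's private memory, so again nothing is shared between processes.
    g = h5py.File('data.h5', 'r', driver='core')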

Related

Loading large files into memory with Python

When doing work with large files and datasets (usually 1 or 2 GB+), the process is killed due to running out of RAM. What tools and methods are available to save memory while still allowing the necessary operations, such as iterating over the entire file and accessing and assigning other large variables? Because I need read access to the entire file, I am unsure of solutions for the given problem. Thanks for any help.
For reference, the project I am currently encountering this problem in is right here (dev branch).
Generally, you can use memory-mapped files to map a section of virtual memory onto a file in a storage device. This enables you to operate on a mapped region of memory that would not fit in RAM. Note that this is significantly slower than RAM though (there is no free lunch). You can use NumPy to do that quite transparently with numpy.memmap. Alternatively, there is the mmap module. For the sake of performance, you can operate on chunks and read/write each of them from the memory-mapped section only once.
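As a concrete sketch of the chunked approach (the file name, dtype, and chunk size are arbitrary):
    import numpy as np

    # Map a large on-disk array without loading it; pages are read on demand.
    arr = np.memmap('big.dat', dtype=np.float32, mode='r')

    chunk = 10**6
    total = 0.0
    for start in range(0, arr.shape[0], chunk):
        # Each slice is a view; its pages are read, used, and can be evicted,
        # so the working set stays far smaller than the file.
        total += float(arr[start:start + chunk].sum())
    print(total)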

Python Shared Memory using mmap and empty files

I'm trying to make a fast library for interprocess communication between any combination of Python and C/C++ processes. (i.e. Python <-> Python, Python <-> C++, or C++ <-> Python)
In the hopes of having the fastest implementation, I'm trying to utilize shared memory using mmap. The plan is for two processes to share memory by mmap-ing the same file, and to read from and write to this shared memory to communicate.
I want to avoid any actual writes to a real file, and instead simply want to use a filename as a handle for the two processes to connect. However, I get hung up on the following call to mmap:
self.memory = mmap.mmap(fileno, self.maxlen)
where I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'shared_memory_file'
or if I make an empty file:
ValueError: mmap length is greater than file size
Do I need to simply make an empty file filled with nulls in order to be able to use shared memory like this?
How can I use mmap for shared memory in Python between unrelated processes (not parent<->child communication) in a way which C++ can also play along? (not using multiprocessing.shared_memory)
To answer the questions directly as best I can:
The file needs to be sized appropriately before it can be mapped. If you need more space, there are different ways to do it ... but the most portable is likely to unmap the file, resize the file on disk, and then remap the file. See: How to portably extend a file accessed using mmap()
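A sketch of the file-backed route (the file name and size are placeholders): size the file first with ftruncate, then map the same region in each process; this also accounts for the two errors above.
    import mmap
    import os

    PATH = 'shared_memory_file'       # the agreed-upon handle; name is made up
    SIZE = 4096

    # Process A: create the file if needed and give it the full length up front.
    # An empty file triggers "mmap length is greater than file size", and a
    # missing one triggers FileNotFoundError, so this step is required.
    fd = os.open(PATH, os.O_CREAT | os.O_RDWR)
    os.ftruncate(fd, SIZE)
    mem = mmap.mmap(fd, SIZE)
    mem[:5] = b'hello'

    # Process B (Python or C++): open the same file and map the same region.
    fd2 = os.open(PATH, os.O_RDWR)
    mem2 = mmap.mmap(fd2, SIZE)
    print(mem2[:5])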
You might be able to mmap with MAP_ANONYMOUS|MAP_SHARED, then fork, then run with the same shared memory in both processes. See: Sharing memory between processes through the use of mmap()
Alternatively, you could create a ramdisk, create a file there of a specific size, and then mmap into both processes.
Keep in mind that you'll need to deal with synchronization between the two processes - different platforms might have different approaches to this, but they traditionally involve using a semaphore of some kind (e.g. on Linux: https://man7.org/linux/man-pages/man7/sem_overview.7.html).
All that being said, traditional shared memory will probably do better than mmap for this use-case. In general, OS-level IPC mechanisms are likely to do better out of the box than hand-rolled solutions - there's a lot of tuning that goes into something to make it perform well, and mmap isn't always an automatic win.
Good luck with the project!

How to store easily python usable read-only data structures in shared memory

I have a python process serving as a WSGI-apache server. I have many copies of this process running on each of several machines. About 200 megabytes of my process is read-only python data. I would like to place these data in a memory-mapped segment so that the processes could share a single copy of those data. Best would be to be able to attach to those data so they could be actual python 2.7 data objects rather than parsing them out of something like pickle or DBM or SQLite.
Does anyone have sample code or pointers to a project that has done this to share?
This post by @modelnine on StackOverflow provides a really great comprehensive answer to this question. As he mentioned, using threads rather than process-forking in your webserver can significantly lessen the impact of this. I ran into a similar problem trying to share extremely large NumPy arrays between CLI Python processes using some type of shared memory a couple of years ago, and we ended up using the sharedmem Python extension to share data between the workers (which proved to leak memory in certain cases, but is probably fixable). A read-only mmap() technique might work for you, but I'm not sure how to do that in pure Python (NumPy has a memmapping technique explained here). I've never found any clear and simple answers to this question, but hopefully this can point you in some new directions. Let us know what you end up doing!
It's difficult to share actual python objects because they are bound to the process address space. However, if you use mmap, you can create very usable shared objects. I'd create one process to pre-load the data, and the rest could use it. I found quite a good blog post that describes how it can be done: http://blog.schmichael.com/2011/05/15/sharing-python-data-between-processes-using-mmap/
Since it's read-only data, you won't need to share any updates between processes (since there won't be any updates), so I propose you just keep a local copy of it in each process.
If memory constraints are an issue, you can have a look at using multiprocessing.Value or multiprocessing.Array without locks for this: https://docs.python.org/2/library/multiprocessing.html#shared-ctypes-objects
Other than that you'll have to rely on an external process and some serialising to get this done, I'd have a look at Redis or Memcached if I were you.
One possibility is to create a C- or C++-extension that provides a Pythonic interface to your shared data. You could memory map 200MB of raw data, and then have the C- or C++-extension provide it to the WSGI-service. That is, you could have regular (unshared) python objects implemented in C, which fetch data from some kind of binary format in shared memory. I know this isn't exactly what you wanted, but this way the data would at least appear pythonic to the WSGI-app.
However, if your data consists of many, many very small objects, then it becomes important that even the "entrypoints" are located in the shared memory (otherwise they will waste too much memory). That is, you'd have to make sure that the PyObject* pointers that make up the interface to your data actually themselves point to the shared memory, i.e. the Python objects themselves would have to be in shared memory. As far as I can tell from the official docs, this isn't really supported. However, you could always try "handcrafting" Python objects in shared memory and see if it works. I'm guessing it would work, until the Python interpreter tries to free the memory. But in your case it won't, since the data is long-lived and read-only.
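The same shape of solution can be sketched in pure Python for illustration (a real implementation would be the C/C++ extension described above; the record layout and file name are invented): small per-process wrapper objects that unpack fields on demand from a shared, memory-mapped binary blob.
    import mmap
    import struct

    RECORD = struct.Struct('<qd32s')    # invented fixed-size record layout

    class Record(object):
        """Thin per-process wrapper; the payload stays in the shared mapping."""
        __slots__ = ('_buf', '_index')
        def __init__(self, buf, index):
            self._buf, self._index = buf, index
        @property
        def fields(self):
            return RECORD.unpack_from(self._buf, self._index * RECORD.size)

    with open('table.bin', 'rb') as f:                       # invented data file
        shared = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    rec = Record(shared, 0)    # only this tiny wrapper is private to the process
    print(rec.fields)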

Efficiently write gigabytes of data to disk in Python

On Python v2.7 in Windows and Linux, what is the most efficient and quick way to sequentially write 5 GB of data to a local disk (fixed or removable)? This data will not soon be read and does not need to be cached.
It seems the normal ways of writing use the OS disk cache (because the system assumes it may re-read this data soon). This clears useful data out of the cache, making the system slower.
Right now I am using f.write() with 65535 bytes of data at a time.
The real reason your OS uses the disk cache isn't that it assumes the data will be re-read -- it's that it wants to speed up the writes. You want to use the OS's write cache as aggressively as you possibly can.
That being said, the "standard" way to do high-performance, high-volume I/O in any language (and probably the most aggressive way to use the OS's read/write caches) is to use memory-mapped I/O. The mmap module (https://docs.python.org/2/library/mmap.html) will provide that, and depending on how you generate your data in the first place, you might even be able to gain more performance by dumping it to the buffer earlier.
Note that with a dataset as big as yours, it'll only work on a 64-bit machine (Python's mmap on 32-bit is limited to 4GiB buffers).
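A rough sketch of that (the output path is a placeholder, and produce_chunks() stands in for however the data is generated):
    import mmap
    import os

    SIZE = 5 * 1024**3                 # 5 GiB as in the question; needs 64-bit Python

    fd = os.open('out.bin', os.O_CREAT | os.O_RDWR)
    os.ftruncate(fd, SIZE)             # reserve the full length up front
    buf = mmap.mmap(fd, SIZE)

    offset = 0
    for chunk in produce_chunks():     # hypothetical generator of byte strings
        buf[offset:offset + len(chunk)] = chunk
        offset += len(chunk)

    buf.flush()                        # let the OS write back at its own pace
    buf.close()
    os.close(fd)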
If you want more specific advice, you'll have to give us more info on how you generate your data.
This answer is relevant for Windows code, I have no idea about the Linux equivalent though I imagine the advice is similar.
If you want to write the fastest code possible, then write using the Win32 API and make sure you read the relevant section of CreateFile. Specifically, make sure you do not make the classic mistake of using the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags to open a file; for more explanation see Raymond Chen's classic blog post.
If you insist on writing at some multiple of the sector or cluster size, then don't be beholden to the magic number of 65535 (why this number? It's not a real multiple). Instead, use GetDiskFreeSpace to figure out the appropriate sector size, though even this is no real guarantee (some data may be kept with the NTFS file information).

How to deserialize 1GB of objects into Python faster than cPickle?

We've got a Python-based web server that unpickles a number of large data files on startup using cPickle. The data files (pickled using HIGHEST_PROTOCOL) are around 0.4 GB on disk and load into memory as about 1.2 GB of Python objects -- this takes about 20 seconds. We're using Python 2.6 on 64-bit Windows machines.
The bottleneck is certainly not disk (it takes less than 0.5s to actually read that much data), but memory allocation and object creation (there are millions of objects being created). We want to reduce the 20s to decrease startup time.
Is there any way to deserialize more than 1GB of objects into Python much faster than cPickle (like 5-10x)? Because the execution time is bound by memory allocation and object creation, I presume using another unpickling technique such as JSON wouldn't help here.
I know some interpreted languages have a way to save their entire memory image as a disk file, so they can load it back into memory all in one go, without allocation/creation for each object. Is there a way to do this, or achieve something similar, in Python?
Try the marshal module - it's internal (used by the byte-compiler) and intentionally not advertised much, but it is much faster. Note that it doesn't serialize arbitrary instances like pickle, only builtin types (don't remember the exact constraints, see docs). Also note that the format isn't stable.
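A tiny sketch of the swap (the file and object names are invented; this assumes the data is limited to builtin containers/types):
    import marshal

    # At build time: marshal only handles builtin types (dicts, lists, tuples,
    # strings, numbers, ...), and its format may change between Python versions.
    with open('data.marshal', 'wb') as f:
        marshal.dump(big_builtin_structure, f)   # big_builtin_structure: invented

    # At server startup:
    with open('data.marshal', 'rb') as f:
        data = marshal.load(f)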
If you need to initialize multiple processes and can tolerate one process always loaded, there is an elegant solution: load the objects in one process, and then do nothing in it except forking processes on demand. Forking is fast (copy on write) and shares the memory between all processes. [Disclaimers: untested; unlike Ruby, Python ref counting will trigger page copies so this is probably useless if you have huge objects and/or access a small fraction of them.]
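A sketch of that trick with multiprocessing (POSIX fork semantics assumed; the file name is invented):
    import multiprocessing as mp
    import cPickle as pickle           # Python 2.6 as in the question

    # Pay the ~20 s unpickling cost exactly once, in the parent.
    with open('big.pkl', 'rb') as f:
        DATA = pickle.load(f)

    def worker(task_id):
        # Forked children share the parent's pages copy-on-write; as long as the
        # objects are mostly left untouched, little extra memory is used (though
        # CPython refcount updates do dirty some pages).
        return task_id, len(DATA)      # stand-in for real work against DATA

    if __name__ == '__main__':
        pool = mp.Pool(4)              # the pool forks after DATA is loaded
        print(pool.map(worker, range(8)))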
If your objects contain lots of raw data like numpy arrays, you can memory-map them for much faster startup. pytables is also good for these scenarios.
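For the numpy part, one common pattern (array and file names invented):
    import numpy as np

    # At build time: keep the big arrays out of the pickle entirely.
    np.save('features.npy', features_array)          # features_array: invented

    # At startup: map instead of read; this returns almost instantly, and pages
    # are only pulled from disk when the array is actually touched.
    features = np.load('features.npy', mmap_mode='r')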
If you'll only use a small part of the objects, then an OO database (like Zope's) can probably help you. Though if you need them all in memory, you will just waste lots of overhead for little gain. (never used one, so this might be nonsense).
Maybe other python implementations can do it? Don't know, just a thought...
Are you load()ing the pickled data directly from the file? What about trying to load the file into memory first and then doing the load?
I would start by trying cStringIO(); alternatively you may try to write your own version of StringIO that uses buffer() to slice the memory, which would reduce the needed copy() operations (cStringIO may still be faster, but you'll have to try).
There are sometimes huge performance bottlenecks when doing these kinds of operations, especially on the Windows platform; the Windows system is somehow very unoptimized for doing lots of small reads, while UNIXes cope quite well. If load() does a lot of small reads or you are calling load() several times to read the data, this would help.
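A sketch of the whole-file-read variant (Python 2 as in the question; the file name is invented):
    import cPickle as pickle
    import cStringIO

    # One big sequential read, then unpickle from memory; this avoids the
    # many-small-reads pattern that Windows handles poorly.
    with open('data.pkl', 'rb') as f:
        blob = f.read()
    objs = pickle.load(cStringIO.StringIO(blob))
    # pickle.loads(blob) is equivalent and skips the StringIO wrapper entirely.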
I haven't used cPickle (or Python), but in cases like this I think the best strategy is to
avoid unnecessary loading of the objects until they are really needed - say, load after startup on a different thread. It's usually better to avoid unnecessary loading/initialization at any time, for obvious reasons. Google 'lazy loading' or 'lazy initialization'; a small sketch follows below. If you really need all the objects to do some task before server startup, then maybe you can try to implement a manual custom deserialization method; in other words, implement something yourself if you have intimate knowledge of the data you will deal with, which can help you 'squeeze' better performance out of it than the general tool.
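A minimal lazy-initialization sketch (the path is a placeholder):
    import cPickle as pickle

    class DataStore(object):
        """Defer the expensive unpickle until the data is first needed."""
        def __init__(self, path):
            self._path = path
            self._data = None                 # nothing loaded at startup

        @property
        def data(self):
            if self._data is None:            # first access pays the cost
                with open(self._path, 'rb') as f:
                    self._data = pickle.load(f)
            return self._data

    store = DataStore('big.pkl')              # server starts instantly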
Did you try sacrificing efficiency of pickling by not using HIGHEST_PROTOCOL? It isn't clear what performance costs are associated with using this protocol, but it might be worth a try.
Impossible to answer this without knowing more about what sort of data you are loading and how you are using it.
If it is some sort of business logic, maybe you should try turning it into a pre-compiled module;
If it is structured data, can you delegate it to a database and only pull what is needed?
Does the data have a regular structure? Is there any way to divide it up and decide what is required and only then load it?
I'll add another answer that might be helpful - if you can, try to define __slots__ on the class that is most commonly created. This may be a little limiting and sometimes impossible; however, in my test it seems to have cut the time needed for initialization roughly in half.
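For example (the class is hypothetical):
    class Point(object):
        # No per-instance __dict__; instances are smaller and cheaper to create,
        # which matters when millions of them are unpickled at startup.
        __slots__ = ('x', 'y', 'z')

        def __init__(self, x, y, z):
            self.x, self.y, self.z = x, y, z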
