I'm trying to make a fast library for interprocess communication between any combination of Python and C/C++ processes. (i.e. Python <-> Python, Python <-> C++, or C++ <-> Python)
In the hopes of having the fastest implementation, I'm trying to utilize shared memory using mmap. The plan is for two processes to share memory by "mmap-ing" the same file and read from and write to this shared memory to communicate.
I want to avoid any actual writes to a real file, and instead simply want to use a filename as a handle for the two processes to connect. However, I get hung up on the following call to mmap:
self.memory = mmap.mmap(fileno, self.maxlen)
where I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'shared_memory_file'
or if I make an empty file:
ValueError: mmap length is greater than file size
Do I need to simply make an empty file filled with nulls in order to be able to use shared memory like this?
How can I use mmap for shared memory in Python between unrelated processes (not parent<->child communication) in a way which C++ can also play along? (not using multiprocessing.shared_memory)
To answer the questions directly as best I can:
The file needs to be sized appropriately before it can be mapped. If you need more space, there are different ways to do it ... but most portable is likely unmap the file, resize the file on disk, and then remap the file. See: How to portably extend a file accessed using mmap()
You might be able to mmap with MAP_ANONYMOUS|MAP_SHARED, then fork, then run with the same shared memory in both processes. See: Sharing memory between processes through the use of mmap()
Alternatively, you could create a ramdisk, create a file there of a specific size, and then mmap into both processes.
Keep in mind that you'll need to deal with synchronization between the two processes - different platforms might have different approaches to this, but they traditionally involve using a semaphore of some kind (e.g. on Linux: https://man7.org/linux/man-pages/man7/sem_overview.7.html).
All that being said, traditional shared memory will probably do better than mmap for this use-case. In general, OS-level IPC mechanisms are likely to do better out of the box than hand-rolled solutions - there's a lot of tuning that goes into something to make it perform well, and mmap isn't always an automatic win.
Good luck with the project!
Related
Everywhere I see shared memory implementations for python (e.g. in multiprocessing), creating shared memory always allocates new memory. Is there a way to create a shared memory object and have it refer to existing memory? The purpose would be to pre-initialize the data values, or rather, to avoid having to copy into the new shared memory if we already have, say, an array in hand. In my experience, allocating a large shared array is much faster than copying values into it.
The short answer is no.
I'm the author of the Python extensions posix_ipc1 and sysv_ipc2. Like Python's multiprocessing module from the standard library, my modules are just wrappers around facilities provided by the operating system, so what you really need to know is what the OS allows when allocating shared memory. That differs a little for SysV IPC and POSIX IPC, but in this context the difference isn't really important. (I think multiprocessing uses POSIX IPC where possible.)
For SysV IPC, the OS-level call to allocate shared memory is shmget(). You can see on that call's man page that it doesn't accept a pointer to existing memory; it always allocates new memory for you. Ditto for the POSIX IPC version of the same call (shm_open()). POSIX IPC is interesting because it implements shared memory to look like a memory mapped file, so it behaves a bit differently from SysV IPC.
Regardless, whether one is calling from Python or C, there's no option to ask the operating system to turn an existing piece of private memory into shared memory.
If you think about it, you'll see why. Suppose you could pass a pointer to a chunk of private memory to shmget() or shm_open(). Now the operating system is stuck with the job of keeping that memory where it is until all sharing processes are done with it. What if it's in the middle of your stack? Suddenly this big chunk of your stack can't be allocated because other processes are using it. It also means that when your process dies, the OS can't release all its memory because some of it is now being used by other processes.
In short, what you're asking for from Python isn't offered because the underlying OS calls don't allow it, and the underlying OS calls don't allow it (probably) because it would be really messy for the OS.
I have many files on disk need to read, the 1st option is use multi-threads, it perform very well on SSD. (when threads blocked by IO, it will release GIL)
But I wanna achieve similar or faster speed without SSD, so I pre-load them into memory(like store in a dict), and every threads will read each file content from memory. Unfortunately, maybe because of the GIL, there is a lock in the dict, hence its speeds is even slower than loading files from SSD!
So my question is that is there any solution can create a read-only memory buffer without lock/GIL? like ramdisk or something else>
In short, no.
Even though Python (CPython in particular) is a multithread language, at any instant the interpreter can run only one piece of python code. Therefore if your pure python program does not contain blocking I/O (e.g. access lock-free memory buffer), it will degrade a single-threaded program no matter what you do. In face the performance will be worse than an actual single-threaded program because there is overhead in synchronizing with other threads.
(Special thanks Graham Dumpleton!) One of the solution is to write C extensions for CPython. And release GIL when enter the "realm of C". Just be careful that you can't access python stuff without the GIL protection otherwise it will cause subtle bugs, or crash directly.
There are several implementations that do not use GIL, for example, Jython and Cython (Not CPython). You can try using them. But keep in mind that writing a correct multithread program is hard. Writing a fast multithread program is even harder. My suggestion is to write multi-process program instead of multithread. And pass data via IPC or so (let's say, ZeroMQ, it's easy to use and lightweight).
Let me add few points to #HKTonyLee answer.
So Python has this GIL. But it is released when doing for example file I/O. This means that you can parallely read files. Since from processes point of view there is no such thing as file but only file descriptors (assuming posix) then whatever you read it does not have to be stored on the disk.
All in all, if you move your file to (for example) tmpfs or ramdisk or any equivalent then you should obtain even better performance then with SSD. Note however the risk: if you need to modify the file you may lose the update.
[I'm using Python 3.5.2 (x64) in Windows.]
I'm reading binary data in large blocks (on the order of megabytes) and would like to efficiently share that data into 'n' concurrent Python sub-processes (each process will deal with the data in a unique and computationally expensive way).
The data is read-only, and each sequential block will not be considered to be "processed" until all the sub-processes are done.
I've focused on shared memory (Array (locked / unlocked) and RawArray): Reading the data block from the file into a buffer was quite quick, but copying that block to the shared memory was noticeably slower.
With queues, there will be a lot of redundant data copying going on there relative to shared memory. I chose shared memory because it involved one copy versus 'n' copies of the data).
Architecturally, how would one handle this problem efficiently in Python 3.5?
Edit: I've gathered two things so far: memory mapping in Windows is cumbersome because of the pickling involved to make it happen, and multiprocessing.Queue (more specifically, JoinableQueue) is faster though not (yet) optimal.
Edit 2: One other thing I've gathered is, if you have lots of jobs to do (particularly in Windows, where spawn() is the only option and is costly too), creating long-running parallel processes is better than creating them over and over again.
Suggestions - preferably ones that use multiprocessing components - are still very welcome!
In Unix this might be tractable because fork() is used for multiprocessing, but in Windows the fact that spawn() is the only way it works really limits the options. However, this is meant to be a multi-platform solution (which I'll use mainly in Windows) so I am working within that constraint.
I could open the data source in each subprocess, but depending on the data source that can be expensive in terms of bandwidth or prohibitive if it's a stream. That's why I've gone with the read-once approach.
Shared memory via mmap and an anonymous memory allocation seemed ideal, but to pass the object to the subprocesses would require pickling it - but you can't pickle mmap objects. So much for that.
Shared memory via a cython module might be impossible or it might not but it's almost certainly prohibitive - and begs the question of using a more appropriate language to the task.
Shared memory via the shared Array and RawArray functionality was costly in terms of performance.
Queues worked the best - but the internal I/O due to what I think is pickling in the background is prodigious. However, the performance hit for a small number of parallel processes wasn't too noticeable (this may be a limiting factor on faster systems though).
I will probably re-factor this in another language for a) the experience! and b) to see if I can avoid the I/O demands the Python Queues are causing. Fast memory caching between processes (which I hoped to implement here) would avoid a lot of redundant I/O.
While Python is widely applicable, no tool is ideal for every job and this is just one of those cases. I learned a lot about Python's multiprocessing module in the course of this!
At this point it looks like I've gone as far as I can go with standard CPython, but suggestions are still welcome!
I have a python process serving as a WSGI-apache server. I have many copies of this process running on each of several machines. About 200 megabytes of my process is read-only python data. I would like to place these data in a memory-mapped segment so that the processes could share a single copy of those data. Best would be to be able to attach to those data so they could be actual python 2.7 data objects rather than parsing them out of something like pickle or DBM or SQLite.
Does anyone have sample code or pointers to a project that has done this to share?
This post by #modelnine on StackOverflow provides a really great comprehensive answer to this question. As he mentioned, using threads rather than process-forking in your webserver can significantly lesson the impact of this. I ran into a similar problem trying to share extremely-large NumPy arrays between CLI Python processes using some type of shared memory a couple of years ago, and we ended up using a combination of a sharedmem Python extension to share data between the workers (which proved to leak memory in certain cases, but, it's fixable probably). A read-only mmap() technique might work for you, but I'm not sure how to do that in pure-python (NumPy has a memmapping technique explained here). I've never found any clear and simple answers to this question, but hopefully this can point you in some new directions. Let us know what you end up doing!
It's difficult to share actual python objects because they are bound to the process address space. However, if you use mmap, you can create very usable shared objects. I'd create one process to pre-load the data, and the rest could use it. I found quite a good blog post that describes how it can be done: http://blog.schmichael.com/2011/05/15/sharing-python-data-between-processes-using-mmap/
Since it's read-only data you won't need to share any updates between processes (since there won't be any updates) I propose you just keep a local copy of it in each process.
If memory constraints is an issue you can have a look at using multiprocessing.Value or multiprocessing.Array without locks for this: https://docs.python.org/2/library/multiprocessing.html#shared-ctypes-objects
Other than that you'll have to rely on an external process and some serialising to get this done, I'd have a look at Redis or Memcached if I were you.
One possibility is to create a C- or C++-extension that provides a Pythonic interface to your shared data. You could memory map 200MB of raw data, and then have the C- or C++-extension provide it to the WSGI-service. That is, you could have regular (unshared) python objects implemented in C, which fetch data from some kind of binary format in shared memory. I know this isn't exactly what you wanted, but this way the data would at least appear pythonic to the WSGI-app.
However, if your data consists of many many very small objects, then it becomes important that even the "entrypoints" are located in the shared memory (otherwise they will waste too much memory). That is, you'd have to make sure that the PyObject* pointers that make up the interface to your data, actually themselves point to the shared memory. I.e, the python objects themselves would have to be in shared memory. As far as I can read the official docs, this isn't really supported. However, you could always try "handcrafting" python objects in shared memory, and see if it works. I'm guessing it would work, until the Python interpreter tries to free the memory. But in your case, it won't, since it's long-lived and read-only.
I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns) To improve performance in-process I can generated sorted/structured lists, maps and trees once, and hit those repeatedly, rather than hit the file system each time.
However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.
Requirements:
Must be fast - any sort of per-execution processing should be minimized, this includes disk IO and object construction.
Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough
Does not need to be up-to-date - (briefly) stale data is perfectly fine, this is just a cache, the actual data is being written to disk separately.
Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.
Preferences:
I'd like to avoid any sort long running daemon process at all, if possible.
While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world, fast reads are much more important than fast writes.
In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.
Candidates:
SQLite in memory
SQLite on it's own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems necessary to persistently store the database in memory. SQLite allows for DBs to use memory as storage but these DBs are destroyed upon program exit, and cannot be shared between instances.
Flat file database loaded into memory with mmap
On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap, can share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit however. It's ok if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another, and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory mapped files get dropped from memory and need to be reloaded from disk.
I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
Share Python objects between processes using Processing
The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
Store data on a RAM disk
My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.
I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?
I know there's a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.
I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.
Thank you so much for your help!
I would just do the simplest thing that might possibly work. ...which in your case would likely just be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). Donald Knuth says "Premature optimization is the root of all evil"!