Everywhere I see shared memory implementations for python (e.g. in multiprocessing), creating shared memory always allocates new memory. Is there a way to create a shared memory object and have it refer to existing memory? The purpose would be to pre-initialize the data values, or rather, to avoid having to copy into the new shared memory if we already have, say, an array in hand. In my experience, allocating a large shared array is much faster than copying values into it.
The short answer is no.
I'm the author of the Python extensions posix_ipc and sysv_ipc. Like Python's multiprocessing module from the standard library, my modules are just wrappers around facilities provided by the operating system, so what you really need to know is what the OS allows when allocating shared memory. That differs a little for SysV IPC and POSIX IPC, but in this context the difference isn't really important. (I think multiprocessing uses POSIX IPC where possible.)
For SysV IPC, the OS-level call to allocate shared memory is shmget(). You can see on that call's man page that it doesn't accept a pointer to existing memory; it always allocates new memory for you. Ditto for the POSIX IPC version of the same call (shm_open()). POSIX IPC is interesting because it implements shared memory to look like a memory mapped file, so it behaves a bit differently from SysV IPC.
Regardless, whether one is calling from Python or C, there's no option to ask the operating system to turn an existing piece of private memory into shared memory.
If you think about it, you'll see why. Suppose you could pass a pointer to a chunk of private memory to shmget() or shm_open(). Now the operating system is stuck with the job of keeping that memory where it is until all sharing processes are done with it. What if it's in the middle of your stack? Suddenly this big chunk of your stack can't be freed or reused because other processes are using it. It also means that when your process dies, the OS can't release all of its memory, because some of it is now being used by other processes.
In short, what you're asking for from Python isn't offered because the underlying OS calls don't allow it, and the underlying OS calls don't allow it (probably) because it would be really messy for the OS.
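For what it's worth, the practical pattern from Python is therefore "allocate new shared memory, then copy once". A minimal sketch, assuming Python 3.8+ for multiprocessing.shared_memory (newer than most of this discussion) and numpy for the array handling:

```python
import numpy as np
from multiprocessing import shared_memory  # Python 3.8+

existing = np.arange(1_000_000, dtype=np.float64)   # the array you already have in hand

# There is no way to "adopt" existing memory as shared; allocate fresh...
shm = shared_memory.SharedMemory(create=True, size=existing.nbytes)
shared_view = np.ndarray(existing.shape, dtype=existing.dtype, buffer=shm.buf)
shared_view[:] = existing[:]                        # ...and pay for exactly one copy

# Other processes attach by name: shared_memory.SharedMemory(name=shm.name)
# Each attacher calls .close(); the creator also calls .unlink() when finished.
shm.close()
shm.unlink()
```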
Related
According to Python's multiprocessing documentation:
Data can be stored in a shared memory map using Value or Array.
Is shared memory treated differently than memory that is typically allocated to a process? Why does Python only support two data structures?
I'm guessing it has to do with garbage collection, and is perhaps related to the same reasons the GIL exists. If this is the case, how/why are Value and Array implemented to be an exception to this?
I'm not remotely an expert on this, so this is definitely not a complete answer. There are a couple of things I think come into play here:
Processes have their own memory space, so if we share "normal" variables between processes and then try to write to them, each process ends up with its own copy (perhaps via copy-on-write semantics).
Shared memory needs some sort of abstraction or primitive as it exists outside of process memory (SOURCE)
Value and Array are, by default, thread- and process-safe for concurrent use because access is guarded by a lock; they handle the allocation in shared memory AND the protection of it :)
So the linked documentation effectively answers "yes" to:
is shared memory treated differently than memory that is typically allocated to a process?
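To make those points concrete, here's a minimal sketch (standard library only) of Value and Array allocating typed shared memory and guarding it with their built-in locks:

```python
from multiprocessing import Process, Value, Array

def worker(counter, data):
    with counter.get_lock():       # the lock a Value carries by default
        counter.value += 1
    data[0] = 42.0                 # Array element access is also lock-protected by default

if __name__ == "__main__":
    counter = Value('i', 0)        # one C int living in shared memory
    data = Array('d', 10)          # ten C doubles living in shared memory
    p = Process(target=worker, args=(counter, data))
    p.start()
    p.join()
    print(counter.value, data[0])  # 1 42.0
```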
I have many files on disk that I need to read. The first option is to use multiple threads; this performs very well on an SSD (when a thread blocks on I/O, it releases the GIL).
But I want to achieve similar or better speed without an SSD, so I pre-load the files into memory (e.g. store them in a dict), and every thread reads each file's contents from memory. Unfortunately, perhaps because of the GIL, there seems to be a lock around the dict, and this is even slower than loading the files from the SSD!
So my question is: is there any way to create a read-only memory buffer without a lock/the GIL? Like a ramdisk, or something else?
In short, no.
Even though Python (CPython in particular) supports multiple threads, at any instant the interpreter can run only one piece of Python code. Therefore, if your pure Python program contains no blocking I/O (e.g. it only accesses a lock-free memory buffer), it will degrade to a single-threaded program no matter what you do. In fact the performance will be worse than an actual single-threaded program, because there is overhead in synchronizing with the other threads.
(Special thanks to Graham Dumpleton!) One solution is to write a C extension for CPython and release the GIL when entering the "realm of C". Just be careful: you can't touch Python objects without holding the GIL, otherwise you will get subtle bugs or outright crashes.
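As an adjacent illustration (not a hand-written extension): ctypes already releases the GIL while a foreign call is in progress, so blocking C calls made from several threads can overlap. This sketch assumes Linux/glibc for the "libc.so.6" name:

```python
import ctypes
import threading
import time

libc = ctypes.CDLL("libc.so.6")    # platform-specific; adjust the library name elsewhere

def sleep_in_c():
    libc.usleep(500_000)           # 0.5 s spent inside C with the GIL released

start = time.perf_counter()
threads = [threading.Thread(target=sleep_in_c) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"elapsed: {time.perf_counter() - start:.2f}s")   # ~0.5 s total, not ~2 s
```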
There are Python implementations that do not have a GIL, for example Jython and IronPython (unlike CPython). You can try using them. But keep in mind that writing a correct multithreaded program is hard, and writing a fast multithreaded program is even harder. My suggestion is to write a multi-process program instead of a multithreaded one, and pass data via IPC (say, ZeroMQ; it's easy to use and lightweight).
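A rough sketch of that multi-process-plus-IPC suggestion, assuming pyzmq is installed and port 5555 is free (both assumptions; the real work per message is elided):

```python
import multiprocessing as mp
import zmq

def worker():
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PULL)
    sock.connect("tcp://127.0.0.1:5555")
    while True:
        msg = sock.recv()          # each process has its own GIL, so no contention
        if msg == b"STOP":
            break
        # ... do the real work on msg here ...

if __name__ == "__main__":
    procs = [mp.Process(target=worker) for _ in range(4)]
    for p in procs:
        p.start()

    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.bind("tcp://127.0.0.1:5555")
    for chunk in (b"job-1", b"job-2", b"job-3"):   # placeholder payloads
        push.send(chunk)
    for _ in procs:                                # one STOP per worker (round-robined)
        push.send(b"STOP")
    for p in procs:
        p.join()
```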
Let me add a few points to HKTonyLee's answer.
So Python has this GIL, but it is released when doing, for example, file I/O. This means that you can read files in parallel. And since, from a process's point of view, there is no such thing as a file, only file descriptors (assuming POSIX), whatever you read does not have to be stored on a disk.
All in all, if you move your files to (for example) tmpfs, a ramdisk, or any equivalent, you should get even better performance than with an SSD. Note the risk, however: since such filesystems are volatile, if you need to modify a file you may lose the update.
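A minimal sketch of that combination: threads doing plain file reads (which release the GIL) against a directory that happens to live on tmpfs or a ramdisk. The /mnt/ramdisk path is a placeholder:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def read_file(path):
    return path.read_bytes()       # the GIL is released while the OS performs the read

paths = list(Path("/mnt/ramdisk").glob("*.bin"))   # placeholder tmpfs/ramdisk mount
with ThreadPoolExecutor(max_workers=8) as pool:
    contents = dict(zip(paths, pool.map(read_file, paths)))
```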
[I'm using Python 3.5.2 (x64) in Windows.]
I'm reading binary data in large blocks (on the order of megabytes) and would like to efficiently share that data into 'n' concurrent Python sub-processes (each process will deal with the data in a unique and computationally expensive way).
The data is read-only, and each sequential block will not be considered to be "processed" until all the sub-processes are done.
I've focused on shared memory (Array (locked / unlocked) and RawArray): Reading the data block from the file into a buffer was quite quick, but copying that block to the shared memory was noticeably slower.
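For reference, a sketch of one way to trim that copy: read the block straight into the shared buffer rather than into a temporary first (the block size and file name are placeholders):

```python
from multiprocessing import RawArray

BLOCK_SIZE = 4 * 1024 * 1024                     # placeholder block size
shared_block = RawArray('B', BLOCK_SIZE)         # unlocked shared bytes

with open("data.bin", "rb") as f:                # placeholder file name
    n = f.readinto(memoryview(shared_block))     # the OS writes directly into shared memory
```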
With queues, there would be a lot of redundant data copying going on relative to shared memory. I chose shared memory because it involves one copy versus 'n' copies of the data.
Architecturally, how would one handle this problem efficiently in Python 3.5?
Edit: I've gathered two things so far: memory mapping in Windows is cumbersome because of the pickling involved to make it happen, and multiprocessing.Queue (more specifically, JoinableQueue) is faster though not (yet) optimal.
Edit 2: One other thing I've gathered is, if you have lots of jobs to do (particularly in Windows, where spawn() is the only option and is costly too), creating long-running parallel processes is better than creating them over and over again.
Suggestions - preferably ones that use multiprocessing components - are still very welcome!
In Unix this might be tractable because fork() is used for multiprocessing, but in Windows the fact that spawn() is the only way it works really limits the options. However, this is meant to be a multi-platform solution (which I'll use mainly in Windows) so I am working within that constraint.
I could open the data source in each subprocess, but depending on the data source that can be expensive in terms of bandwidth or prohibitive if it's a stream. That's why I've gone with the read-once approach.
Shared memory via mmap and an anonymous memory allocation seemed ideal, but passing the object to the subprocesses would require pickling it, and you can't pickle mmap objects. So much for that.
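(That said, on Windows an anonymous mmap created with a tagname can be re-attached by name in the child, so only the tag string needs to be passed; a rough sketch, with the tag and size as placeholders:)

```python
import mmap
from multiprocessing import Process

SIZE = 4 * 1024 * 1024        # placeholder size
TAG = "my_shared_block"       # placeholder tag name (Windows-only feature)

def child(tag, size):
    view = mmap.mmap(-1, size, tagname=tag)   # attaches to the parent's mapping by name
    print(view[:5])                           # b'hello'
    view.close()

if __name__ == "__main__":
    parent_view = mmap.mmap(-1, SIZE, tagname=TAG)   # anonymous mapping, named by its tag
    parent_view[:5] = b"hello"
    p = Process(target=child, args=(TAG, SIZE))
    p.start()
    p.join()
    parent_view.close()
```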
Shared memory via a Cython module may or may not be possible, but it is almost certainly prohibitive in effort - and it raises the question of whether a more appropriate language should be used for the task.
Shared memory via the shared Array and RawArray functionality was costly in terms of performance.
Queues worked the best - but the internal I/O due to what I think is pickling in the background is prodigious. However, the performance hit for a small number of parallel processes wasn't too noticeable (this may be a limiting factor on faster systems though).
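A sketch of the pattern that ended up working best: a few long-lived workers pulling blocks off a JoinableQueue, with q.join() marking each batch as fully processed (the blocks here are placeholders):

```python
import multiprocessing as mp

def worker(q):
    while True:
        block = q.get()
        if block is None:          # sentinel: shut down
            q.task_done()
            break
        # ... unique, computationally expensive processing of `block` ...
        q.task_done()

if __name__ == "__main__":
    q = mp.JoinableQueue(maxsize=4)
    workers = [mp.Process(target=worker, args=(q,)) for _ in range(3)]
    for w in workers:
        w.start()

    for block in (b"block-1", b"block-2", b"block-3"):   # placeholder blocks
        q.put(block)
    q.join()                       # every block fully processed before moving on

    for _ in workers:
        q.put(None)
    for w in workers:
        w.join()
```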
I will probably re-factor this in another language for a) the experience! and b) to see if I can avoid the I/O demands the Python Queues are causing. Fast memory caching between processes (which I hoped to implement here) would avoid a lot of redundant I/O.
While Python is widely applicable, no tool is ideal for every job and this is just one of those cases. I learned a lot about Python's multiprocessing module in the course of this!
At this point it looks like I've gone as far as I can go with standard CPython, but suggestions are still welcome!
I'm working with Python multiprocessing to spawn some workers. Each of them should return an array that's a few MB in size.
Is it correct that since my return array is created in the child process, it needs to be copied back to the parent's memory when the process ends? (this seems to take a while, but it might be a pypy issue)
Is there a mechanism to allow the parent and child to access the same in-memory object? (synchronization is not an issue since only one child would access each object)
I'm afraid I have a few gaps in my understanding of how Python implements multiprocessing, and trying to persuade PyPy to play nice is not making things any easier. Thanks!
Yes, if the return array is created in the child process, it must be sent to the parent by pickling it, sending the pickled bytes back to the parent via a Pipe, and then unpickling the object in the parent. For a large object, this is pretty slow in CPython, so it's not just a PyPy issue. It is possible that performance is worse in PyPy, though; I haven't tried comparing the two, but this PyPy bug seems to suggest that multiprocessing in PyPy is slower than in CPython.
In CPython, there is a way to allocate ctypes objects in shared memory, via multiprocessing.sharedctypes. PyPy seems to support this API, too. The limitation (obviously) is that you're restricted to ctypes objects.
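A minimal sketch of that approach, assuming the data can be expressed as a flat ctypes array (sizes here are kept small for the example):

```python
from multiprocessing import Process
from multiprocessing.sharedctypes import RawArray

def child(shared):
    print("child sees", sum(shared))           # reads straight from shared memory

if __name__ == "__main__":
    shared = RawArray('d', range(1000))        # C doubles in shared memory, no lock
    p = Process(target=child, args=(shared,))  # passed at creation; the buffer is shared, not copied
    p.start()
    p.join()
```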
There is also multiprocessing.Manager, which would allow you to create a shared array/list object in a Manager process; both the parent and child can then access the shared list via a Proxy object. The downside is that read/write performance is much slower than for a local object, or even than for a roughly equivalent object created using multiprocessing.sharedctypes.
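For comparison, a sketch of the Manager approach; every append and read goes through the proxy to the manager process, which is where the slowdown comes from:

```python
from multiprocessing import Manager, Process

def child(shared_list):
    shared_list.append(sum(range(1000)))   # every call is a round trip to the manager

if __name__ == "__main__":
    with Manager() as mgr:
        shared_list = mgr.list([1, 2, 3])  # lives in the manager process
        p = Process(target=child, args=(shared_list,))
        p.start()
        p.join()
        print(list(shared_list))           # [1, 2, 3, 499500]
```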
We have a system that has only one interpreter. Many user scripts come through this interpreter. We want to put a cap on each script's memory usage. There is only one process, and that process invokes a tasklet for each script. Since we have only one interpreter and one process, we don't know of a way to cap each script's memory usage. What is the best way to do this?
I don't think it's possible at all. Your question implies that the memory used by your tasklets is completely separate, which is probably not the case. Python optimizes small objects like integers: as far as I know, every 3 in your code uses the same object, which is not a problem because it is immutable. So if two of your tasklets use the same (small) integer, they are already sharing memory. ;-)
Memory is separated at OS process level. There's no easy way to tell to which tasklet and even to which thread does a particular object belong.
Also, there's no easy way to add a custom bookkeeping allocator that would analyze which tasklet or thread is allocating a piece of memory and prevent it from allocating too much. It would also need to plug into the garbage-collection code to discount objects that are freed.
Unless you're keen to write a custom Python interpreter, using a process per task is your best bet.
You don't even need to kill and respawn the interpreters every time you need to run another script. Pool several interpreters and only kill the ones that grow past a certain memory threshold after running a script. Limit the interpreters' memory consumption by means provided by the OS if you need to.
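A sketch of that idea on POSIX (the resource module is POSIX-only), using a multiprocessing.Pool as a stand-in for the interpreter pool and resource.setrlimit to cap each worker's address space; the 512 MB figure and the run_script body are placeholders:

```python
import resource                     # POSIX-only
from multiprocessing import Pool

LIMIT = 512 * 1024 * 1024           # placeholder: 512 MB per pooled interpreter

def cap_memory():
    # Allocations beyond the cap fail, surfacing as MemoryError in the script.
    resource.setrlimit(resource.RLIMIT_AS, (LIMIT, LIMIT))

def run_script(source):
    exec(source, {})                # stand-in for however the user script is invoked
    return "ok"

if __name__ == "__main__":
    with Pool(processes=4, initializer=cap_memory) as pool:
        print(pool.map(run_script, ["x = [0] * 10", "y = 1 + 1"]))
```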
If you need to share large amounts of common data between the tasks, use shared memory; for smaller interactions, use sockets (with a messaging level above them as needed).
Yes, this might be slower than your current setup. But from your use of Python I suppose that in these scripts you don't do any time-critical computing anyway.