In my Python C extension I am performing actions on an iterable of strings. As a first step I call PySequence_Fast to convert it to a list, and then I iterate over the elements. For each string I use PyUnicode_DATA and then compare the strings using some criteria. So I only read from the PyObjects, but never modify them.
Now I would like to process the list in parallel, which would require me to release the GIL. However, I do not know what effects this has on my use case. Here are my current thoughts:
1. I can still use those APIs, since they are only macros that directly read from the PyObjects without modifying them.
2. I have to use the APIs beforehand and store an array of structs that hold the kind, length, and data pointer of each string.
3. I have to use the APIs beforehand and store a copy of each string in an array.
Case 1 would be the most performant and memory-efficient. However, the documentation states that without acquiring the GIL it is not allowed to operate on Python objects (does this include read access?) or use Python/C API functions.
Case 2 would be the next most efficient, since at least I would not have to copy all the strings. However, if I am not allowed to read from Python objects while the GIL is released, I wonder whether I would even be allowed to dereference a pointer to the data inside a PyObject.
Case 3 would require me to copy all the strings. In my case this might make the multithreaded solution slower than a sequential one.
I hope someone can help me understand what I am allowed to do while the GIL is released.
I think the official answer is that you should not use method 1 and should use methods 2 or 3. While method 1 might work now, it could change in the future and break. This is especially important if you want to support things like PyPy's C API wrapper (which may well use a different representation than CPython does internally). There are increasing moves to hide implementation details, and you risk getting caught out by them.
Practically, I think method 1 would work fine provided you only use the macro forms with no error checking - the GIL is mainly about stopping simultaneous writes from putting Python objects in an undefined state, and you aren't doing that. Where I'd be slightly careful is if you ever have (deprecated) "non-canonical" unicode objects - things that look "macro-y", like PyUnicode_READY, can cause them to be modified into the canonical state. Again, be especially wary of alternative (non-CPython) implementations of the C API.
One alternative to consider would be to use the buffer protocol instead. Although I can't find it explicitly stated in the docs, the idea is that PyObject_GetBuffer and PyBuffer_Release require the GIL but reading/writing to the buffer doesn't. Here I have two sub-suggestions:
Can you have a single object, like a NumPy array, that exposes all your strings as one buffer?
You can also get a buffer from a unicode object (as a UTF-8 C string) - the thing to do would be to create all the buffers while holding the GIL, do your parallel processing without it, and then free them with the GIL held again. It's possible that the overhead of this makes it inefficient. This is basically an "official" version of method 2.
In short, you'd probably get away with it, but if it ever breaks I doubt a bug report to Python would be well received (since it's technically wrong).
According to Python's multiprocessing documentation:
Data can be stored in a shared memory map using Value or Array.
Is shared memory treated differently than memory that is typically allocated to a process? And why does Python only support these two data structures in shared memory?
I'm guessing it has to do with garbage collection, perhaps for the same reasons the GIL exists. If this is the case, how and why are Value and Array implemented as exceptions to this?
I'm not remotely an expert on this, so this is definitely not a complete answer. There are a couple of things I think are at play here:
Processes have their own memory space, so if we share "normal" variables between processes and try to write to them, each process will end up with its own copy (perhaps using copy-on-write semantics).
Shared memory needs some sort of abstraction or primitive, as it exists outside of process memory (SOURCE).
Value and Array are, by default, thread- and process-safe for concurrent use: they guard access with locks, handling allocation in shared memory AND protecting it :)
So the attached documentation does answer yes to:
is shared memory treated differently than memory that is typically allocated to a process?
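For what it's worth, here is a minimal sketch of Value and Array in action (the worker function and the values are purely illustrative):

from multiprocessing import Process, Value, Array

def worker(counter, data):
    # both objects live in shared memory; the lock they carry
    # serializes concurrent access across processes
    with counter.get_lock():
        counter.value += 1
    for i in range(len(data)):
        data[i] = data[i] * 2

if __name__ == "__main__":
    counter = Value("i", 0)        # a shared C int
    data = Array("d", [1.0, 2.0])  # a shared array of C doubles
    p = Process(target=worker, args=(counter, data))
    p.start()
    p.join()
    print(counter.value, list(data))  # 1 [2.0, 4.0]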
I wrote a Python program that acts on a large input file to create a few million objects representing triangles. The algorithm is:
1. read an input file
2. process the file and create a list of triangles, represented by their vertices
3. output the vertices in the OFF format: a list of vertices followed by a list of triangles, where the triangles are represented by indices into the list of vertices
The OFF requirement that I print out the complete list of vertices before I print out the triangles means that I have to hold the whole list of triangles in memory before I write the output to file. In the meantime I'm getting memory errors because of the sizes of the lists.
What is the best way to tell Python that I no longer need some of the data, and it can be freed?
According to the official Python documentation, you can explicitly invoke the garbage collector to release unreferenced memory with gc.collect(). Example:
import gc
gc.collect()
You should do that after marking what you want to discard using del:
del my_array
del my_object
gc.collect()
Unfortunately (depending on your version and release of Python) some types of objects use "free lists" which are a neat local optimization but may cause memory fragmentation, specifically by making more and more memory "earmarked" for only objects of a certain type and thereby unavailable to the "general fund".
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.
In your use case, it seems that the best way for the subprocesses to accumulate some results and yet ensure those results are available to the main process is to use semi-temporary files (by semi-temporary I mean, NOT the kind of files that automatically go away when closed, just ordinary files that you explicitly delete when you're all done with them).
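As a rough sketch of that pattern (the file names and the work itself are made up): the memory-hungry step runs in its own process and writes its results to an ordinary file that the parent picks up afterwards.

import multiprocessing as mp

def crunch(input_path, output_path):
    # do the memory-hungry work here; everything this process
    # allocates is returned to the OS when it exits
    results = [line.upper() for line in open(input_path)]
    with open(output_path, "w") as out:
        out.writelines(results)

if __name__ == "__main__":
    p = mp.Process(target=crunch, args=("input.txt", "results.tmp"))
    p.start()
    p.join()
    # the parent now reads results.tmp and explicitly deletes it
    # once it is done with the contents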
The del statement might be of use, but IIRC it isn't guaranteed to free the memory. The docs are here ... and an explanation of why the memory isn't released is here.
I have heard of people on Linux and Unix-type systems forking a Python process to do some work, getting the results, and then killing it.
This article has notes on the Python garbage collector, but I think this lack of memory control is the downside of managed memory.
Python is garbage-collected, so if you reduce the size of your list, it will reclaim memory. You can also use the "del" statement to get rid of a variable completely:
biglist = [blah,blah,blah]
#...
del biglist
(del can be your friend, as it marks objects as deletable when there are no other references to them. Now, the CPython interpreter often keeps this memory for later use, so your operating system might not see the "freed" memory.)
You might not run into any memory problem in the first place if you used a more compact structure for your data. Lists of numbers are much less memory-efficient than the format used by the standard array module or the third-party numpy module. You would save memory by putting your vertices in a NumPy 3xN array and your triangles in an N-element array.
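A minimal sketch of that layout (the sizes are made up):

import numpy as np

n_vertices, n_triangles = 1_000_000, 2_000_000

# one contiguous block of 8-byte floats instead of millions of
# small Python float objects in nested lists
vertices = np.zeros((3, n_vertices), dtype=np.float64)

# each triangle stored as three 4-byte indices into the vertex array
triangles = np.zeros((n_triangles, 3), dtype=np.int32)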
You can't explicitly free memory. What you need to do is to make sure you don't keep references to objects. They will then be garbage collected, freeing the memory.
In cases like yours, where you need large lists, you typically need to reorganize the code, using generators/iterators instead. That way you don't need to have the large lists in memory at all.
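For example, a generator can yield one triangle at a time instead of materializing the whole list (the file format here is hypothetical):

def read_triangles(path):
    # only one line of the file is ever held in memory
    with open(path) as f:
        for line in f:
            yield tuple(map(float, line.split()))

for triangle in read_triangles("input.txt"):
    print(triangle)  # or whatever per-triangle processing you need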
I had a similar problem when reading a graph from a file. The processing included the computation of a 200,000 x 200,000 float matrix (one line at a time) that did not fit into memory. Trying to free the memory between computations with gc.collect() fixed the memory-related aspect of the problem, but it caused performance issues: I don't know why, but even though the amount of used memory remained constant, each new call to gc.collect() took more time than the previous one, so quite quickly the garbage collection took up most of the computation time.
To fix both the memory and performance issues I switched to a multithreading trick I read once somewhere (I'm sorry, I cannot find the related post anymore). Before, I was reading each line of the file in a big for loop, processing it, and running gc.collect() every once in a while to free memory. Now I call a function that reads and processes a chunk of the file in a new thread. Once the thread ends, the memory is automatically freed, without the strange performance issue.
Practically it works like this:
from dask import delayed  # this module wraps the multithreading

def f(storage, index, chunk_size):  # the processing function
    # read the chunk of size chunk_size starting at index in the file
    # process it using data in storage if needed
    # append data needed for further computations to storage
    return storage

partial_result = delayed([])  # put the constructor for your data structure into delayed()
# I personally use "delayed(nx.Graph())" since I am creating a networkx Graph

chunk_size = 100  # ideally as big as possible while still letting the computations fit in memory

for index in range(0, len(file), chunk_size):
    # this tells dask that we will want to apply f to the parameters
    # partial_result, index, chunk_size
    partial_result = delayed(f)(partial_result, index, chunk_size)
    # no computations are done yet!
    # dask will spawn a thread to run f(partial_result, index, chunk_size)
    # once we call partial_result.compute()
    # passing the previous "partial_result" variable in the parameters ensures
    # that a chunk is only processed after the previous one is done
    # it also lets you use the results of the previous chunks if needed

# this launches all the computations
result = partial_result.compute()
# one thread is spawned for each "delayed", one at a time, to compute its result
# dask then closes the thread, which solves the memory-freeing issue
# the strange performance issue with gc.collect() is also avoided
Others have posted some ways that you might be able to "coax" the Python interpreter into freeing the memory (or otherwise avoid having memory problems). Chances are you should try their ideas out first. However, I feel it important to give you a direct answer to your question.
There isn't really any way to directly tell Python to free memory. The fact of that matter is that if you want that low a level of control, you're going to have to write an extension in C or C++.
That said, there are some tools to help with this:
Cython
SWIG
Boost.Python
As other answers have already said, Python can avoid releasing memory to the OS even if it's no longer in use by Python code (so gc.collect() doesn't free anything), especially in a long-running program. In any case, if you're on Linux you can try to release memory by directly invoking the libc function malloc_trim (man page).
Something like:
import ctypes
libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)  # ask glibc to return free heap memory to the OS
If you don't care about vertex reuse, you could have two output files--one for vertices and one for triangles. Then append the triangle file to the vertex file when you are done.
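A sketch of that stitching, assuming an ASCII OFF file (the two generator bodies are placeholders for the real streams):

import shutil

def vertices():   # placeholder for the real vertex stream
    yield from [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

def triangles():  # placeholder for the real triangle stream
    yield from [(0, 1, 2)]

n_v = n_t = 0
with open("verts.tmp", "w") as vf, open("tris.tmp", "w") as tf:
    for x, y, z in vertices():
        vf.write(f"{x} {y} {z}\n")
        n_v += 1
    for a, b, c in triangles():
        tf.write(f"3 {a} {b} {c}\n")  # OFF face lines start with the vertex count
        n_t += 1

# stitch the two streams into a single OFF file
with open("mesh.off", "w") as out:
    out.write(f"OFF\n{n_v} {n_t} 0\n")
    for tmp in ("verts.tmp", "tris.tmp"):
        with open(tmp) as part:
            shutil.copyfileobj(part, out)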
Here's the situation: I have a massive object that needs to be loaded into memory. So big that if it is loaded in twice it will go beyond the available memory on my machine (and no, I can't upgrade the memory). I also can't divide it up into any smaller pieces. For simplicity's sake, let's just say the object is 600 MB and I only have 1 GB of RAM.

I need to use this object from a web app, which is running in multiple processes, and I don't control how they're spawned (a third-party load balancer does that), so I can't rely on just creating the object in some master thread/process and then spawning off children. This also eliminates the possibility of using something like POSH, because that relies on its own custom fork call. I also can't use something like a SQLite memory database, mmap, or the posix_ipc, sysv_ipc, and shm modules, because those act as a file in memory, and this data has to be an object for me to use it. Using one of those, I would have to read it as a file and then turn it into an object in each individual process - and BAM, segmentation fault from going over the machine's memory limit, because I just tried to load in a second copy.
There must be some way to store a Python object in memory (and not as a file/string/serialized/pickled) and have it be accessible from any process. I just don't know what it is. I've looked all over StackOverflow and Google and can't find the answer, so I'm hoping somebody can help me out.
http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes
Look for shared memory, or a server process. After re-reading your post, a server process sounds closer to what you want.
http://en.wikipedia.org/wiki/Shared_memory
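As a rough sketch of the server-process approach (the dict contents are just for illustration): a Manager holds the one real object, and every process talks to it through a proxy.

from multiprocessing import Manager, Process

def worker(shared, i):
    # the proxy forwards this operation to the manager's server
    # process, which holds the only real copy of the dict
    shared[i] = i * i

if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.dict()  # lives in the manager process
        procs = [Process(target=worker, args=(shared, i)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(shared))  # e.g. {0: 0, 1: 1, 2: 4, 3: 9}

Bear in mind that every proxy access serializes data between processes, so this suits many small lookups far better than pulling a 600 MB object across in one go.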
There must be some way to store a Python object in memory (and not as a file/string/serialized/pickled) and have it be accessible from any process.
That isn't the way it works. Python object reference counting and an object's internal pointers do not make sense across multiple processes.
If the data doesn't have to be an actual Python object, you can try working on the raw data stored in mmap() or in a database or somesuch.
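A small sketch of the mmap route (data.bin is hypothetical): every process maps the same file read-only, and the OS keeps a single physical copy in the page cache.

import mmap

with open("data.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    record = mm[0:16]  # slicing copies only the bytes you ask for
    mm.close()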
I would implement this as a C module that gets imported into each Python script. Then the interface to this large object would be implemented in C, or some combination of C and Python.
I need to call a function in a C library from Python, and that function free()s its parameter.
So I tried create_string_buffer(), but it seems that this buffer would later be freed by Python, which would make the buffer be freed twice.
I read on the web that Python refcounts the buffers and frees them when there is no reference left. So how can I create a buffer that Python will not touch afterwards? Thanks.
Example:
I load the shared library with lib = cdll.LoadLibrary("libxxx.so") and then call the function with path = create_string_buffer(topdir) and lib.load(path). However, the load function in libxxx.so frees its argument, and later "path" is freed by Python as well, so it is freed twice.
Try the following in the given order:
Try by all means to manage your memory in Python, for example using create_string_buffer(). If you can control the behaviour of the C function, modify it to not free() the buffer.
If the library function you call frees the buffer after using it, there must be some library function that allocates the buffer (or the library is broken).
Of course you could call malloc() via ctypes, but this would break all good practices of memory management. Use it as a last resort; almost certainly, it will introduce hard-to-find bugs at some later time.
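If you do end up there, a minimal sketch might look like this (libxxx.so and lib.load come from the question; everything else is an assumption about how the library expects its argument): allocate with the same allocator the library's free() will use, so Python's memory management never owns the buffer.

import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.malloc.restype = ctypes.c_void_p  # avoid pointer truncation on 64-bit

topdir = b"/some/path\0"           # NUL-terminated C string
buf = libc.malloc(len(topdir))     # Python will never free this
ctypes.memmove(buf, topdir, len(topdir))

lib = ctypes.cdll.LoadLibrary("libxxx.so")  # from the question
lib.load(ctypes.c_void_p(buf))  # the library now owns buf and may free() it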