Given a numpy.memmap object created with mode='r' (i.e. read-only), is there a way to force it to purge all loaded pages out of physical RAM, without deleting the object itself?
In other words, I'd like the reference to the memmap instance to remain valid, but all physical memory that's being used to cache the on-disk data to be uncommitted. Any views onto the memmap array must also remain valid.
I am hoping to use this as a diagnostic tool, to help separate "real" memory requirements of a script from "transient" requirements induced by the use of memmap.
I'm using Python 2.7 on RedHat.
If you run "pmap SCRIPT-PID", the "real" memory shows as "[ anon ]" blocks, and all memory-mapped files show up with the file name in the last column.
Purging the pages is possible at the C level, if you manage to get hold of the pointer to the beginning of the mapping and call madvise(ptr, length, MADV_DONTNEED) on it, but it's going to be kludgy.
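If you want to experiment with that from Python, here is a rough, hedged sketch using ctypes on Linux; the file name is a placeholder, it assumes the array starts at offset 0 of the mapping (so its base address is page-aligned), and MADV_DONTNEED uses the Linux constant value:
import ctypes
import ctypes.util
import numpy as np

MADV_DONTNEED = 4  # value from <sys/mman.h> on Linux
libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)

def purge_memmap_pages(mm):
    # Ask the kernel to drop the resident pages backing this memmap; the
    # mapping stays valid and pages are re-faulted from disk on next access.
    addr = ctypes.c_void_p(mm.ctypes.data)
    if libc.madvise(addr, ctypes.c_size_t(mm.nbytes), MADV_DONTNEED) != 0:
        raise OSError(ctypes.get_errno(), 'madvise failed')

mm = np.memmap('data.bin', dtype='uint8', mode='r')  # hypothetical file
mm[:4096].sum()         # touch some pages so they get cached in RAM
purge_memmap_pages(mm)  # resident size for the mapping should drop (check with pmap)
mm[0]                   # still works; the page is simply read back from disk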
Related
Though I would imagine that append mode is "smart" enough to only insert the new bytes being appended, I want to make absolutely sure that Python doesn't handle it by re-writing the entire file along with the new bytes.
I am attempting to keep a running backup of a program log, and it could reach several thousand records in a CSV format.
Python file operations are convenience wrappers over operating system file operations. The operating system either implements these file system operations internally, forwards them to a loadable module (plugin), or forwards them to an external server (NFS, SMB). Most operating systems since about 1971 have been capable of appending data to an existing file, at least all the ones that claim to be even remotely POSIX compliant.
The POSIX append mode simply opens the file for writing and moves the file pointer to the end of the file. This means that all the write operations will just write past the end of the file.
There might be a few exceptions to that; for example, some routine might use low-level system calls to move the file pointer backwards, or the underlying file system might not be POSIX compliant and might use some form of transactional object storage like AWS S3. But for any standard scenario I wouldn't worry about such cases.
However, since you mentioned backup as your use case, you need to be extra careful. Backups are not as easy as they seem on the surface. Things to worry about: various caches that might hold data in memory before it is written to disk; what happens if the power goes out right after you appended new records; and what happens if somebody starts several copies of your program.
And one last thing: unless you are running on a 1980s 8-bit computer, a few thousand CSV lines is nothing to modern hardware. Even if the file were loaded and written back in full, you wouldn't notice any difference.
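To make that concrete, here is a minimal sketch (file name and record layout are made up) of appending records in 'a' mode, with an explicit flush/fsync to address the caching concern above:
import os

def append_record(path, line):
    # 'a' opens for appending: every write lands past the current end of file
    with open(path, 'a') as f:
        f.write(line + '\n')
        f.flush()              # push Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its buffers to disk

append_record('backup_log.csv', '2024-01-01T12:00:00,event,details')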
I have some data in memory that I want to store in a HDF file.
My data are not huge (<100 MB, so they fit in memory very comfortably), so for performance it seems to make sense to keep them there. At the same time, I also want to store it on disk. It is not critical that the two are always exactly in sync, as long as they are both valid (i.e. not corrupted), and that I can trigger a synchronization manually.
I could just keep my data in a separate container in memory, and shovel it into an HDF object on demand. If possible I would like to avoid writing this layer. It would require me to keep track of what parts have been changed, and selectively update those. I was hoping HDF would take care of that for me.
I know about the driver='core' with backing store functionality, but AFAICT it only syncs the backing store when closing the file. I can flush the file, but does that guarantee the object is written to storage?
From looking at the HDF5 source code, it seems that the answer is yes. But I'd like to hear a confirmation.
Bonus question: Is driver='core' actually faster than normal filesystem back-ends? What do I need to look out for?
What the H5Fflush command does is request that the file system transfer all buffers to the file.
The documentation has a specific note about it:
HDF5 does not possess full control over buffering. H5Fflush flushes the internal HDF5 buffers then asks the operating system (the OS) to flush the system buffers for the open files. After that, the OS is responsible for ensuring that the data is actually flushed to disk.
In practice, I have noticed that most of the time I can read the data from an HDF5 file that has been flushed (even if the process was subsequently killed), but this is not guaranteed by HDF5: there is no safety in relying on the flush operation to leave you with a valid HDF5 file, as further operations (on the metadata, for instance) can corrupt the file if the process is then interrupted. You have to close the file completely to have this consistency.
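For illustration, a minimal h5py sketch of that flush call (file and dataset names are made up); as the quoted documentation says, flush() only hands the data to the OS, and closing the file is still what guarantees a consistent file on disk:
import h5py

with h5py.File('data.h5', 'w') as f:
    f['array'] = list(range(1000))
    f.flush()  # wraps H5Fflush: HDF5 buffers -> OS buffers; the OS writes to disk later
    # ... further writes; only closing the file (end of the with block) gives full consistency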
If you need consistency and want to avoid corrupted HDF5 files, you may want to:
1) use a write-ahead log: append to a log what's being added/updated each time, with no need to write to the HDF5 file at that moment.
2) periodically, or when you need to shut down, replay the logs and apply them one by one, writing to the HDF5 file.
3) if your process goes down during 1), you won't lose data; after you start up next time, just replay the logs and write them to the HDF5 file.
4) if your process goes down during 2), you won't lose data either; just remove the corrupted HDF5 file, replay the logs and write it again.
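A minimal sketch of that write-ahead-log idea, assuming h5py, made-up file names, and a JSON-lines record format (none of these details are prescribed by the answer above):
import json
import os
import h5py

LOG_PATH = 'updates.log'    # the write-ahead log
HDF5_PATH = 'data.h5'       # the file you actually care about

def log_update(name, values):
    # Step 1): append the update to the log; cheap, and safe if we crash here.
    with open(LOG_PATH, 'a') as log:
        log.write(json.dumps({'name': name, 'values': values}) + '\n')
        log.flush()
        os.fsync(log.fileno())

def replay_log():
    # Step 2): periodically, or at shutdown, apply the logged updates to HDF5.
    if not os.path.exists(LOG_PATH):
        return
    with h5py.File(HDF5_PATH, 'a') as f:
        with open(LOG_PATH) as log:
            for line in log:
                rec = json.loads(line)
                if rec['name'] in f:
                    del f[rec['name']]
                f[rec['name']] = rec['values']
    os.remove(LOG_PATH)  # every record has been applied; start a fresh log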
I have processes on several servers that send data to my local port 2222 via UDP every second.
I want to read this data and write it to shared memory so there can be other processes to read the data from shared memory and do things to it.
I've been reading about mmap and it seems I have to use a file... which I can't seem to understand why.
I have an a.py that reads the data from the socket, but how can I write it to shared memory?
Once it's written, I need to write b.py, c.py, d.py, etc., to read the very same shared memory and do things with it.
Any help or snippet of codes would greatly help.
First, note that what you're trying to build will require more than just shared memory: it's all well if a.py writes to shared memory, but how will b.py know when the memory is ready and can be read from? All in all, it is often simpler to solve this problem by connecting the multiple processes not via shared memory, but through some other mechanism.
(The reason for why mmap usually needs a file name is that it needs a name to connect the several processes. Indeed, if a.py and b.py both call mmap(), how would the system know that these two processes are asking for memory to be shared between them, and not some unrelated z.py? Because they both mmaped the same file. There are also Linux-specific extensions to give a name that doesn't correspond to a file name, but it's more a hack IMHO.)
Maybe the most basic alternative mechanism is pipes: they are usually connected with the help of the shell when the programs are started. That's how the following works (on Linux/Unix): python a.py | python b.py. Any output that a.py sends goes to the pipe, whose other end is the input for b.py. You'd write a.py so that it listens to the UDP socket and writes the data to stdout, and b.py so that it reads from stdin to process the data received. If the data needs to go to several processes, you can use e.g. named pipes, which have a nice (but Bash-specific) syntax: python a.py >(python b.py) >(python c.py) will start a.py with two arguments, which are names of pseudo-files that can be opened and written to. Whatever is written to the first pseudo-file goes as input for b.py, and similarly what is written to the second pseudo-file goes as input for c.py.
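A rough sketch of the a.py side of such a pipe (the port number comes from the question, everything else is made up); b.py would then simply read sys.stdin line by line:
import socket
import sys

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 2222))
while True:
    data, addr = sock.recvfrom(65535)
    sys.stdout.write(data.decode('utf-8', 'replace') + '\n')
    sys.stdout.flush()  # don't let buffering delay the downstream reader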
mmap doesn't take a file name but rather a file descriptor. It performs the so-called memory mapping, i.e. it associates pages in the virtual memory space of the process with portions of the file-like object, represented by the file descriptor. This is a very powerful operation since it allows you:
to access the content of a file simply as an array in memory;
to access the memory of special I/O hardware, e.g. the buffers of a sound card or the framebuffer of a graphics adapter (this is possible since file descriptors in Unix are abstractions and they can also refer to device nodes instead of regular files);
to share memory between processes by performing shared maps of the same object.
The old pre-POSIX way to use shared memory on Unix was to use the System V IPC shared memory. First a shared memory segment had to be created with shmget(2) and then attached to the process with shmat(2). SysV shared memory segments (as well as other IPC objects) have no names but rather numeric IDs, so the special hash function ftok(3) is provided, which converts the combination of a pathname string and a project ID integer into a numeric key ID, but collisions are possible.
The modern POSIX way to use shared memory is to open a file-like memory object with shm_open(2), resize it to the desired size with ftruncate(2) and then to mmap(2) it. Memory-mapping in this case acts like the shmat(2) call from the SysV IPC API and truncation is necessary since shm_open(2) creates objects with an initial size of zero.
(these are part of the C API; what Python modules provide is more or less thin wrappers around those calls and often have nearly the same signature)
It is also possible to get shared memory by memory mapping the same regular file in all processes that need to share memory. As a matter of fact, Linux implements the POSIX shared memory operations by creating files on a special tmpfs file system. The tmpfs driver implements very lightweight memory mapping by directly mapping the pages that hold the file content into the address space of the process that executes mmap(2). Since tmpfs behaves as a normal filesystem, you can examine its content using ls, cat and other shell tools. You can even create shared memory objects this way or modify the content of existing ones. The difference between a file in tmpfs and a regular filesystem file is that the latter is persisted to storage media (hard disk, network storage, flash drive, etc.) and changes are occasionally flushed to this storage media, while the former lives entirely in RAM. Solaris also provides a similar RAM-based filesystem, also called tmpfs.
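As a hedged illustration of that last point in Python: two processes can share memory by mapping the same file, for example one living on tmpfs (the path, size and contents here are made up):
import mmap
import os

PATH = '/dev/shm/my_shared_region'   # tmpfs-backed on Linux, so effectively RAM
SIZE = 4096

# writer (e.g. a.py): create the file, size it, and map it shared
fd = os.open(PATH, os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, SIZE)               # analogous to shm_open + ftruncate
shared = mmap.mmap(fd, SIZE, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)
shared[:5] = b'hello'

# reader (e.g. b.py): map the same file and see the same bytes
fd2 = os.open(PATH, os.O_RDONLY)
view = mmap.mmap(fd2, SIZE, mmap.MAP_SHARED, mmap.PROT_READ)
print(view[:5])                      # b'hello'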
In modern operating systems memory mapping is used extensively. Executable files are memory-mapped in order to supply the content of the pages that hold the executable code and the static data, and shared libraries are memory-mapped as well. This saves physical memory since these mappings are shared, i.e. the same physical memory that holds the content of an executable file or a shared library is mapped into the virtual memory space of each process.
I have a dictionary object with about 60,000 keys that I cache and access in my Django view. The view provides basic search functionality where I look for a search term in the dictionary like so:
projects_map = cache.get('projects_map')
projects_map.get('search term')
However, just grabbing the cached object (in line 1) causes a giant spike in memory usage on the server - upwards of 100 MB sometimes - and the memory isn't released even after the values are returned and the template is rendered.
How can I keep the memory from jacking up like this? Also, I've tried explicitly deleting the object after I grab the value but even that doesn't release the memory spike.
Any help is greatly appreciated.
Update: Solution I ultimately implemented
I decided to implement my own indexing table in which I store the keys and their pickled value. Now, instead of using get() on a dictionary, I use:
ProjectsIndex.objects.get(index_key=<search term>)
and unpickle the value. This seems to take care of the memory issue as I'm no longer loading a giant object into memory. It adds another small query to the page but that's about it. Seems to be the perfect solution...for now.
What about using an appropriate service for caching, such as Redis or memcached, instead of loading the huge object into memory Python-side? This way, you'll even have the ability to scale onto extra machines, should the dictionary grow more.
Anyway, the 100 MB of memory contains all the data plus the hash index plus miscellaneous overhead; I noticed myself the other day that memory often doesn't get deallocated until you quit the Python process (I filled up a couple of gigs of memory from the Python interpreter loading a huge JSON object :)); it would be interesting if anybody has a solution for that.
Update: caching with very little memory
Your options with only 512 MB of RAM are:
Use Redis, and have a look at http://redis.io/topics/memory-optimization (but I suspect 512 MB isn't enough, even with optimization)
Use a separate machine (or a cluster of them, since both memcached and Redis support sharding) with way more RAM to keep the cache
Use the database cache backend, much slower but less memory-consuming, as it saves everything on the disk
Use filesystem cache (although I don't see the point of preferring this over database cache)
and, in the latter two cases, try splitting up your objects, so that you never retrieve megabytes of objects from the cache at once.
Update: lazy dict spanning over multiple cache keys
You can replace your cached dict with something like this; this way, you can continue treating it as you would with a normal dictionary, but data will be loaded from cache only when you really need it.
from django.core.cache import cache
from UserDict import DictMixin
class LazyCachedDict(DictMixin):
    def __init__(self, key_prefix):
        self.key_prefix = key_prefix

    def __getitem__(self, name):
        return cache.get('%s:%s' % (self.key_prefix, name))

    def __setitem__(self, name, value):
        return cache.set('%s:%s' % (self.key_prefix, name), value)

    def __delitem__(self, name):
        return cache.delete('%s:%s' % (self.key_prefix, name))

    def has_key(self, name):
        # Check the prefixed key, matching how items are stored above
        return cache.has_key('%s:%s' % (self.key_prefix, name))

    def keys(self):
        # Just fill the gap, as the cache object doesn't provide
        # a method to list cache keys.
        return []
And then replace this:
projects_map = cache.get('projects_map')
projects_map.get('search term')
with:
projects_map = LazyCachedDict('projects_map')
projects_map.get('search term')
I don't know how Windows works, but on Linux a process practically cannot return memory to the system. That's because the process heap is contiguous and the classic system call allocators use to grow it is brk(), which just moves a pointer that marks the last address available to the process.
All allocators which applications use (malloc etc.) are implemented in user space as a library. They operate at the level of byte blocks and use brk() only to grow their internal memory pool. In a running application this memory pool is cluttered with requested blocks. The only way to return memory to the system is when the last part of the pool has no blocks in use (and it's very unlikely that this will be of significant size, because even simple applications allocate and deallocate thousands of objects).
So the bloat caused by a memory spike will stay until the process ends. Solutions:
avoid the spike by optimizing memory usage, even if it is caused by temporary objects (e.g. process a file line by line instead of reading the whole contents at once)
put the cache in another process (memcached, as suggested in the first answer)
use a serialized dictionary (gdbm) or some other storage detached from the process's private memory (mmap, shared memory); see the sketch below
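For the last option, a hedged sketch using Python's dbm module (gdbm is one of its backends; the file name and the stored payload are made up, and on Python 2 the module is called anydbm):
import dbm
import pickle

# build step: write the 60,000 entries once to an on-disk key/value store
db = dbm.open('projects.db', 'c')
db['search term'] = pickle.dumps({'id': 42, 'name': 'Example project'})
db.close()

# in the view: only the single requested entry is read into memory
db = dbm.open('projects.db', 'r')
value = pickle.loads(db['search term'])
db.close()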
If a get on a particular key is the only operation you perform, why not keep each key separately in the cache? That way all entries will end up in separate files and Django will be able to access them quickly.
The update will be much more painful, of course, but you can abstract it nicely. The first thing I can think of is some cache key prefix.
The code could look like cache.get('{prefix}search_term') then.
Edit:
We're trying to solve the wrong problem here. You don't need caching: the data gets updated, not dumped (after 5 minutes or so).
You need to create a database table with all your entries.
If you don't have access to any database server in your setup, try SQLite. It's file-based and should serve your purpose well.
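A hedged sketch of what such a table could look like as a Django model (field names are made up; this roughly mirrors the ProjectsIndex solution described in the question's update):
from django.db import models

class ProjectsIndex(models.Model):
    # one row per former dictionary key; lookups hit the index, not a 100 MB object
    index_key = models.CharField(max_length=255, unique=True, db_index=True)
    value = models.TextField()   # e.g. a pickled or JSON-serialized payload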
What is the most efficient way to delete an arbitrary chunk of a file, given the start and end offsets? I'd prefer to use Python, but I can fall back to C if I have to.
Say the file is this
..............xxxxxxxx----------------
I want to remove a chunk of it:
..............[xxxxxxxx]----------------
After the operation it should become:
..............----------------
Reading the whole thing into memory and manipulating it in memory is not a feasible option.
The best performance will almost invariably be obtained by writing a new version of the file and then atomically renaming it over the old one, because filesystems are strongly optimized for such sequential access, and so is the underlying hardware (with the possible exception of some of the newest SSDs, but, even then, it's an iffy proposition). In addition, this avoids destroying data in the case of a system crash at any time -- you're left with either the old version of the file intact, or the new one in its place. Since every system could always crash at any time (and by Murphy's Law, it will choose the most unfortunate moment;-), integrity of data is generally considered very important (often data is more valuable than the system on which it's kept -- hence, "mirroring" RAID solutions to ensure against disk crashes losing precious data;-).
If you accept this sane approach, the general idea is: open the old file for reading, the new one for writing (creation); copy N1 bytes over from the old file to the new one; then skip N2 bytes of the old file; then copy the rest over; close both files; atomically rename new to old. (Windows apparently has no "atomic rename" system call usable from Python -- to keep integrity in that case, instead of the atomic rename, you'd do three steps: rename the old file to a backup name, rename the new file to the old name, delete the backup-named file -- in case of a system crash during the second of these three very fast operations, one rename is all it takes to restore data integrity.)
N1 and N2, of course, are the two parameters saying where the deleted piece starts, and how long it is. For the part about opening the files, with open('old.dat', 'rb') as oldf: and with open('NEWold.dat', 'wb') as newf: statements, nested inside each other, are clearly best (the rest of the code, up to the rename step, must of course be nested in both of them).
For the "copy the rest over" step, shutil.copyfileobj is best (be sure to specify a buffer length that's comfortably going to fit in your available RAM, but a large one will tend to give better performance). The "skip" step is clearly just a seek on the oldf open-for-reading file object. For copying exactly N1 bytes from oldf to newf, there is no direct support in Python's standard library, so you have to write your own, e.g:
def copyN1(oldf, newf, N1, buflen=1024*1024):
    # Copy exactly N1 bytes from oldf to newf, one buffer at a time.
    while N1 > 0:
        data = oldf.read(min(N1, buflen))
        if not data:
            break  # unexpected EOF in the source file
        newf.write(data)
        N1 -= len(data)
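Putting all the steps together, a hedged sketch (file names, offsets and the temporary-file naming are placeholders, not a canonical implementation):
import os
import shutil

def delete_chunk(path, start, length, buflen=1024*1024):
    tmp_path = path + '.new'
    with open(path, 'rb') as oldf:
        with open(tmp_path, 'wb') as newf:
            copyN1(oldf, newf, start, buflen)       # copy the first N1 bytes
            oldf.seek(length, os.SEEK_CUR)          # skip the N2 bytes being deleted
            shutil.copyfileobj(oldf, newf, buflen)  # copy the rest
    os.rename(tmp_path, path)  # atomic on POSIX; see the Windows caveat above

delete_chunk('old.dat', start=14, length=8)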
I'd suggest memory mapping. Though it is actually manipulating the file in memory, it is more efficient than plainly reading the whole file into memory.
Well, you have to manipulate the file contents in memory one way or another, as there's no system call for such an operation in either *nix or Windows (at least not that I'm aware of).
Try mmaping the file. This won't necessarily read it all into memory at once.
If you really want to do it by hand, choose some chunk size and do back-and-forth reads and writes. But the seeks are going to kill you...