I have processes on several servers that send data to my local port 2222 via UDP every second.
I want to read this data and write it to shared memory, so that other processes can read it from shared memory and do things with it.
I've been reading about mmap and it seems I have to use a file, which I can't understand why.
I have an a.py that reads the data from the socket, but how can I write it to shared memory?
Once it's written, I need to write b.py, c.py, d.py, etc., that read the very same shared memory and do things with it.
Any help or code snippets would be greatly appreciated.
First, note that what you're trying to build will require more than just shared memory: it's all well and good if a.py writes to shared memory, but how will b.py know when the memory is ready and can be read from? All in all, it is often simpler to solve this problem by connecting the processes not via shared memory, but through some other mechanism.
(The reason mmap usually needs a file name is that it needs a name to connect the several processes. Indeed, if a.py and b.py both call mmap(), how would the system know that these two processes are asking for memory to be shared between them, and not some unrelated z.py? Because they both mmapped the same file. There are also Linux-specific extensions that give a name which doesn't correspond to a file name, but it's more of a hack IMHO.)
Maybe the most basic alternative mechanism is pipes: they are usually connected with the help of the shell when the programs are started. That's how the following works (on Linux/Unix): python a.py | python b.py. Any output that a.py sends goes into the pipe, whose other end is the input of b.py. You'd write a.py so that it listens to the UDP socket and writes the data to stdout, and b.py so that it reads from stdin to process the data received.

If the data needs to go to several processes, you can use e.g. named pipes, which have a nice (but Bash-specific) syntax: python a.py >(python b.py) >(python c.py) will start a.py with two arguments, which are the names of pseudo-files that can be opened and written to. Whatever is written to the first pseudo-file goes as input to b.py, and similarly whatever is written to the second pseudo-file goes as input to c.py.
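For example, a minimal sketch of the pipe approach, assuming one datagram per line of output (the exact record format is an assumption):

# a.py -- listen on UDP port 2222 and forward each datagram to stdout
import socket
import sys

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 2222))

while True:
    data, addr = sock.recvfrom(65535)
    sys.stdout.write(data.decode(errors="replace").rstrip("\n") + "\n")
    sys.stdout.flush()                 # pipes are block-buffered, so flush after every record

# b.py -- read the forwarded records from stdin and process them
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    print("b.py received: %r" % record)

Run them as python a.py | python b.py, or use the process-substitution form above to fan the data out to several readers.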
mmap doesn't take a file name but rather a file descriptor. It performs the so-called memory mapping, i.e. it associates pages in the virtual memory space of the process with portions of the file-like object, represented by the file descriptor. This is a very powerful operation since it allows you:
to access the content of a file simply as an array in memory;
to access the memory of special I/O hardware, e.g. the buffers of a sound card or the framebuffer of a graphics adapter (this is possible since file descriptors in Unix are abstractions and they can also refer to device nodes instead of regular files);
to share memory between processes by performing shared maps of the same object.
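In Python, the mmap module is a thin wrapper around this call. As a minimal sketch of the third point, two processes can share memory by mapping the same file (the path /tmp/shared.dat is just an example):

# writer side
import mmap
import os

SIZE = 4096
fd = os.open("/tmp/shared.dat", os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, SIZE)                 # the backing object must be large enough to map
buf = mmap.mmap(fd, SIZE, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)
buf[:5] = b"hello"                     # visible to every process that maps this file

# reader side (a separate process)
import mmap
import os

fd = os.open("/tmp/shared.dat", os.O_RDONLY)
buf = mmap.mmap(fd, 4096, mmap.MAP_SHARED, mmap.PROT_READ)
print(bytes(buf[:5]))                  # b'hello'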
The old pre-POSIX way to use shared memory on Unix was to use the System V IPC shared memory. First a shared memory segment had to be created with shmget(2) and then attached to the process with shmat(2). SysV shared memory segments (as well as other IPC objects) have no names but rather numeric IDs, so the special hash function ftok(3) is provided, which converts the combination of a pathname string and a project ID integer into a numeric key ID, but collisions are possible.
The modern POSIX way to use shared memory is to open a file-like memory object with shm_open(2), resize it to the desired size with ftruncate(2) and then to mmap(2) it. Memory-mapping in this case acts like the shmat(2) call from the SysV IPC API and truncation is necessary since shm_open(2) creates objects with an initial size of zero.
(these are part of the C API; what the Python modules provide are more or less thin wrappers around those calls, often with nearly the same signatures)
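For instance, since Python 3.8 the multiprocessing.shared_memory module wraps shm_open()/mmap() for you; a minimal sketch, where the name "datablock" is just an example:

from multiprocessing import shared_memory

# process A: create the object and write to it
shm = shared_memory.SharedMemory(name="datablock", create=True, size=1024)
shm.buf[:5] = b"hello"

# process B: attach to the same object by name and read
shm_b = shared_memory.SharedMemory(name="datablock")
print(bytes(shm_b.buf[:5]))            # b'hello'
shm_b.close()

# process A, when everyone is done:
shm.close()
shm.unlink()                           # removes the underlying shared memory object

On Linux the object typically shows up under /dev/shm while it exists.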
It is also possible to get shared memory by memory-mapping the same regular file in all processes that need to share memory. As a matter of fact, Linux implements the POSIX shared memory operations by creating files on a special tmpfs file system. The tmpfs driver implements very lightweight memory mapping by directly mapping the pages that hold the file content into the address space of the process that executes mmap(2). Since tmpfs behaves as a normal filesystem, you can examine its content using ls, cat and other shell tools. You can even create shared memory objects this way or modify the content of existing ones. The difference between a file in tmpfs and a file on a regular filesystem is that the latter is persisted to storage media (hard disk, network storage, flash drive, etc.), with changes flushed to that storage from time to time, while the former lives entirely in RAM. Solaris also provides a similar RAM-based filesystem, also called tmpfs.
In modern operating systems memory mapping is used extensively. Executable files are memory-mapped in order to supply the content of the pages that hold the executable code and the static data. Shared libraries are also memory-mapped. This saves physical memory since these mappings are shared: the same physical memory that holds the content of an executable file or a shared library is mapped into the virtual memory space of each process.
What I want to achieve has the following features:
A Python program (or process, thread, ...) creates a memory file that can be read and written.
As long as the program is alive, the file data exists only in memory (NOT on disk). Once the program is no longer alive, no data is left behind.
However, there is an interface on disk that has a filename and is linked to the memory file. Read and write operations on this interface are possible.
Why not use IO
The memory file will be an input of another program (not Python). So a file name is needed.
Why not use tempfile?
The major reason is security. The finalization of tempfile differs between operating systems (right?), and in some occasional cases, such as interruptions of the OS, data may remain on disk. So having the program hold the data seems more secure (at least to an extent).
Anyway, I just want to try and see whether tempfile can be avoided.
You could consider using a named pipe (created with mkfifo). Another option is to create an actual file which the two programs open. Once both have opened it, you can unlink it so that it's no longer accessible on disk.
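A minimal sketch of the FIFO approach (the path /tmp/myfifo is just an example):

import os

PATH = "/tmp/myfifo"
if not os.path.exists(PATH):
    os.mkfifo(PATH)                    # creates the named pipe; no file data is ever stored on disk

# writer side: open() blocks until the other program opens the FIFO for reading
with open(PATH, "w") as fifo:
    fifo.write("some data\n")

The other (non-Python) program simply opens /tmp/myfifo by name and reads from it; the data only ever passes through kernel memory.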
Though I would imagine that append mode is "smart" enough to only insert the new bytes being appended, I want to make absolutely sure that Python doesn't handle it by re-writing the entire file along with the new bytes.
I am attempting to keep a running backup of a program log, and it could reach several thousand records in a CSV format.
Python file operations are convenience wrappers over operating system file operations. The operating system either implements these file system operations internally, forwards them to a loadable module (plugin), or hands them off to an external server (NFS, SMB). Most operating systems since as far back as 1971 have been capable of appending data to an existing file, at least all the ones that claim to be even remotely POSIX compliant.
POSIX append mode simply opens the file for writing and positions every write at the current end of the file. This means that write operations only ever add new bytes past the existing content; nothing that is already in the file gets rewritten.
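In Python terms, appending a record only writes the new bytes (the file name backup.csv is just an example):

# mode "a" maps to open(2) with O_WRONLY | O_CREAT | O_APPEND under the hood
with open("backup.csv", "a") as f:
    f.write("2024-01-01,some_event,42\n")   # only this line is written; the rest of the file is untouched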
There might be a few exceptions to that, for example some routine might use low level system calls to move the file pointer backwards. Or the underlying file system might be not POSIX compliant and use some form of object transactional storage like AWS S3. But for any standard scenario I wouldn't worry about such cases.
However, since you mentioned backup as your use case, you need to be extra careful. Backups are not as easy as they seem on the surface. Things to worry about: various caches that might hold data in memory before it is written to disk; what happens if the power goes out right after you appended new records; and what happens if somebody starts several copies of your program.
One last thing: unless you are running on a 1980s 8-bit computer, a few thousand CSV lines are nothing to modern hardware. Even if the file were loaded and written back in full, you wouldn't notice any difference.
If you open a file for reading, read from it, close it, and then repeat the process (in a loop) does python continually access the hard-drive? Because it doesn't seem to from my experimentation, and I'd like to understand why.
An (extremely) simple example:
import time

while True:
    file = open('/var/log/messages', 'r')
    stuff = file.read()
    file.close()
    time.sleep(2)
If I run this, my hard-drive access LED lights up once and then the hard-drive remains dormant. How is this possible? Is python somehow keeping the contents of the file stored in RAM? Because logic would tell me that it should be accessing the hard-drive every two seconds.
The answer depends on your operating system and the type of hard drive you have. Most of the time, when you access something off the drive, the information is cached in main memory in case you need it again soon. Depending on the replacement strategy used by your OS, the data may stay in main memory for a while or be replaced relatively soon.
In some cases, your hard drive will cache frequently accessed information in its own memory, and then while the drive might be accessed, it will retrieve the information faster and send it to the processor faster than if it had to search the drive platters.
Likely your operating system or file-system is smart enough to serve the file from the OS cache if the file has not changed in between.
Python does not cache; the operating system does. You can find out the size of these caches with top. In the third line, it will say something like:
Mem: 5923332k total, 5672832k used, 250500k free, 43392k buffers
That means about 43 MB are being used by the OS to cache data recently read from or written to the hard disk. You can drop these caches by writing 1 or 3 to /proc/sys/vm/drop_caches; note that this clears them once rather than turning caching off.
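You can see the effect of that cache from Python by timing two consecutive reads; the exact result depends, of course, on whether the file was already cached:

import time

def timed_read(path):
    start = time.time()
    with open(path, "rb") as f:
        f.read()
    return time.time() - start

print(timed_read("/var/log/messages"))   # may actually hit the disk
print(timed_read("/var/log/messages"))   # almost certainly served from the page cache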
Given a numpy.memmap object created with mode='r' (i.e. read-only), is there a way to force it to purge all loaded pages out of physical RAM, without deleting the object itself?
In other words, I'd like the reference to the memmap instance to remain valid, but all physical memory that's being used to cache the on-disk data to be uncommitted. Any views onto the memmap array must also remain valid.
I am hoping to use this as a diagnostic tool, to help separate "real" memory requirements of a script from "transient" requirements induced by the use of memmap.
I'm using Python 2.7 on RedHat.
If you run "pmap SCRIPT-PID", the "real" memory shows as "[ anon ]" blocks, and all memory-mapped files show up with the file name in the last column.
Purging the pages is possible at the C level, if you manage to get hold of the pointer to the beginning of the mapping and call madvise(ptr, length, MADV_DONTNEED) on it, but it's going to be kludgy.
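If you want to attempt that from Python anyway, a rough sketch using ctypes might look like the following. It is Linux-specific (it assumes MADV_DONTNEED is 4) and it assumes the memmap was created with offset=0, so the array's data pointer is the page-aligned start of the mapping; purge_memmap and data.bin are hypothetical names.

import ctypes
import ctypes.util

import numpy as np

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
MADV_DONTNEED = 4                      # Linux value; not portable

def purge_memmap(arr):
    # drop the resident pages of a read-only memmap; they are re-read from disk on next access
    addr = ctypes.c_void_p(arr.ctypes.data)      # page-aligned only when the memmap offset is 0
    length = ctypes.c_size_t(arr.nbytes)
    if libc.madvise(addr, length, MADV_DONTNEED) != 0:
        raise OSError(ctypes.get_errno(), "madvise(MADV_DONTNEED) failed")

mm = np.memmap("data.bin", dtype=np.float64, mode="r")
_ = mm[:1024].sum()                    # fault some pages in
purge_memmap(mm)                       # pages are dropped, but mm and any views stay valid

Newer Python versions (3.8+) expose madvise() directly on mmap objects, which would avoid the ctypes detour, but that doesn't help on 2.7.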
The title could probably have been put better, but anyway: I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. The reason is that I would like to make sure the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: a dhcp leases file), and
you're on a POSIX system, you can write your data to a temporary file and rename the temporary file onto your target. On POSIX-compliant systems this is guaranteed to be an atomic operation; it shouldn't even matter whether the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. The key concept here is that the temporary file protects your data: if the system reboots, the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of a dhcp.leases file this isn't a big performance hit, but larger files might prove more cumbersome. (A minimal sketch of this pattern follows at the end of this answer.)
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
A couple of other things to consider -- if your filesystem is mounted async, writes will appear to be complete as soon as write() returns, but the data might not have been flushed to disk yet. The rename approach protects you in this case, and so does sqlite3.
If your filesystem is mounted async, it is also possible for the rename to land on disk before the data it points to has been written. So if you're on a Unix system it might be safest to mount sync. That's 'people might die if this fails' levels of paranoia, though. But if it's an embedded system and it dies, 'I might lose my job if this fails' is also a good rationalization for the extra protection.
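Here is a minimal sketch of the write-to-a-temporary-file-then-rename pattern from the first option (the file names are just examples):

import os
import tempfile

def atomic_write(path, data):
    # create the temp file in the same directory, so the rename never crosses filesystems
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # make sure the data is on disk before the rename
        os.rename(tmp, path)           # atomic on POSIX filesystems
    except Exception:
        os.unlink(tmp)
        raise

atomic_write("state.json", '{"leases": []}')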
The ZODB is an ACID-compliant database storage written (mostly) in Python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been fully written (which may mean you need to throw away some partially written data). ZODB, for example, implements this by ending each record with the size of the record itself; if you can read this size back and it matches, you know the record was written completely.
And, of course, you always need to append records and not rewrite the entire file.
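A sketch of that record idea, loosely in the spirit of what's described above (not ZODB's actual on-disk format): each record carries its own size before and after the payload, so a torn final record can be detected and discarded when reading. The files are assumed to be opened in binary mode ('ab' for appending, 'rb' for reading) and payloads are bytes.

import struct

def append_record(f, payload):
    header = struct.pack("<I", len(payload))     # 4-byte little-endian length
    f.write(header + payload + header)           # length repeated after the record
    f.flush()

def read_records(f):
    records = []
    data = f.read()
    pos = 0
    while pos + 4 <= len(data):
        (size,) = struct.unpack("<I", data[pos:pos + 4])
        end = pos + 4 + size + 4
        if end > len(data):
            break                                # last record was not fully written: discard it
        (trailer,) = struct.unpack("<I", data[pos + 4 + size:end])
        if trailer != size:
            break                                # the two sizes disagree: record is torn/corrupt
        records.append(data[pos + 4:pos + 4 + size])
        pos = end
    return records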
It looks to me like your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to consider when doing this:
Ensure that your data is written to disk when you close a file. Even after you close it, some of the data may sit in the OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush() followed by os.fsync(f.fileno()) (see the sketch after this list).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use a file format that helps you verify the integrity of the data (e.g. use checksums).
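A minimal sketch combining the first and third points: append a record with a per-line CRC, then flush and fsync before returning (the file name events.csv is just an example):

import os
import zlib

def append_durably(path, record):
    checksum = format(zlib.crc32(record.encode()) & 0xffffffff, "08x")
    with open(path, "a") as f:
        f.write(record + "," + checksum + "\n")
        f.flush()                      # push Python's buffer down to the OS
        os.fsync(f.fileno())           # ask the OS to push its cache to the disk

append_durably("events.csv", "2024-01-01,some_event,42")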
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".