I have a large list of files (10-1000) that I want to lazily load the contents into memory on access. I have enough memory to load the contents of each individual file into memory, but not enough to load the contents of all the files into memory simultaneously.
(To be more specific, these are pickle files around 1-8GB each, containing all sorts of data that I want to use in a Jupyter Notebook. But the solution should ideally be general to any type of file, and not be limited to Jupyter Notebooks)
In the end, I want to be able to write code similar to this:
a = file_contents[0]['property1'] # transparently loads file0 into memory
b = file_contents[0]['property2'] # keeps file0 in memory, no disk access
c = file_contents[5]['property1'] # unloads file0, then loads file5 into memory
d = file_contents[0]['property3'] # unloads file5, then loads file0 back into memory
where the interface transparently loads and unloads files as necessary to get the requested data structures, while keeping memory usage reasonable. Ideally the interface should also look like a normal array or dictionary access.
I can of course write my own class, but I'm interested in whether there is already a more robust implementation or a better idiom for this kind of behavior.
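To make the idea concrete, here is a rough sketch of the sort of wrapper I could write myself, using functools.lru_cache(maxsize=1) so that only the most recently accessed file stays loaded (the class name and the assumption that the files are pickles are just my own illustration):

import pickle
from functools import lru_cache

class LazyFileList:
    """Index into a list of pickle files; only the most recently
    used file is kept in memory (maxsize=1 evicts the previous one)."""

    def __init__(self, paths):
        self._paths = list(paths)

    @lru_cache(maxsize=1)
    def _load(self, index):
        # First access loads the whole pickle; repeated access to the
        # same index hits the cache, a different index evicts and reloads.
        with open(self._paths[index], 'rb') as f:
            return pickle.load(f)

    def __getitem__(self, index):
        return self._load(index)

# file_contents = LazyFileList(['file0.pkl', ..., 'file5.pkl'])  # hypothetical paths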
What I want to realize has the following feature:
A Python program (or process, thread, ...) creates a memory file that can be read and written.
As long as the program is alive, the file data exists only in memory (NOT on disk). Once the program is no longer alive, no data is left behind.
However, there is an interface on disk with a filename. This interface is linked to the memory file, and read and write operations on the interface are possible.
Why not use IO?
The memory file will be an input of another program (not Python). So a file name is needed.
Why not use tempfile?
The major reason is security. The finalization of a tempfile differs between OSes (right?), and in some occasional cases, such as OS interruptions, data may remain on disk. So having the program hold the data seems more secure (at least to an extent).
Anyway, I just want to try and see whether tempfile can be avoided.
You could consider using a named pipe (using mkfifo). Another option is to create an actual file which the two programs open. Once both open it, you can unlink it so that it's no longer accessible on disk.
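For example, a minimal Unix-only sketch of the named-pipe approach (the pipe path and payload are placeholders):

import os

fifo_path = '/tmp/my_pipe'   # placeholder path; the other program opens this name
os.mkfifo(fifo_path)         # creates the name on disk, but the data never touches disk
try:
    # open() blocks until the consumer opens the FIFO for reading
    with open(fifo_path, 'wb') as pipe:
        pipe.write(b'data that exists only in memory')
finally:
    os.unlink(fifo_path)     # remove the on-disk name when finished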
I am working with some large audio files (~500MB), with a lot of processing and conversion involved. One of the steps involves writing a file, sending it though a network, then reading the file at arrival, then saving the file based on some logic.
As the network part is irrelevant for me, I wonder what is faster or more efficient, reading and writing actual files, or io file like object.
Also, how significant is the performance difference, if at all.
My intuition would say io object would be more efficient, but I do not know how either process works.
io file-like objects were created to avoid writing temporary files that you don't actually want to store, just so you can pass them to other modules and "fool" them into believing they're actual file handles (there are limitations, but for most usages it's okay).
So yes, using an io.BytesIO object will be faster; even with an SSD drive, reading/writing to RAM wins.
class io.BytesIO([initial_bytes])
A stream implementation using an in-memory bytes buffer.
Now if the data is very big, you're going to run out of memory or swapping will occur. So there's a limit to the amount of data you can store in memory (I remember that old audio-editing software did "direct-to-disk" for that very reason: memory was limited at the time, and it was not possible to hold several minutes of audio data in memory).
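For illustration, a minimal sketch of the in-memory round trip (the payload is a placeholder; any code expecting a readable file object can consume buf directly):

import io

buf = io.BytesIO()
buf.write(b'...processed audio bytes...')  # stays entirely in RAM, no disk involved
buf.seek(0)                                # rewind before handing it to a reader
data = buf.read()                          # works with anything that calls .read()/.seek()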
So I'm using Google Cloud Datalab, and I use the %%storage read command to read a large file (2,000,000 rows) into the text variable, which I then process into a pandas DataFrame using BytesIO, e.g. df_new = pd.read_csv(BytesIO(text)).
Now I don't need the text variable or its contents around (all further processing is done on df_new). How can I delete it (text) and free up memory? I sure don't need two copies of a 2-million-record dataset hanging around...
Use del followed by forced garbage collection.
import gc
# Remove text variable
del text
# Force gc collection - this is not actually necessary, but may be useful.
gc.collect()
Note that you may not see process size decreasing and memory returning to OS, depending on memory allocator used (depends on OS, core libraries used and python compilation options).
I currently have the following csv writer class:
import csv

class csvwriter():
    writer = None
    writehandler = None

    @classmethod
    def open(cls, file):
        cls.writehandler = open(file, 'wb')
        cls.writer = csv.writer(cls.writehandler, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_NONNUMERIC)

    @classmethod
    def write(cls, arr):
        cls.writer.writerow(arr)

    @classmethod
    def close(cls):
        cls.writehandler.close()
which can generate proper csv files without ever having to hold the full array in memory at once.
However, the files created through this code can be quite large, so I'm looking to compress them rather than writing them uncompressed, in order to save on disk usage. I can't effectively store a whole file in memory either, as I expect files of well over 20 GB to be a regular occurrence.
The recipients of the resulting files are generally not sysadmins of their PCs, nor do they all use Linux, so I'm constrained in the types of algorithms I'm able to use for this task. Preferably, the solution would use a compression scheme that's natively readable (no executables required) on Windows, OS X and any Linux distribution.
I've found that gzip provides a very handy interface in Python, but reading gzipped files on Windows seems like quite a hassle. Ideally I'd put them in a zip archive, but zip archives don't allow you to append data to files already present in the archive, which forces me either to store the whole file in memory or to write the data out to several smaller files that would fit in memory.
My question: Is there a solution that would benefit from the best of both worlds? Widespread availability of tools to read the target format on the end-user's machine, and also the ability to append, rather than write the whole file in one go?
Thanks in advance for your consideration!
gzlog may provide the functionality you're looking for. It efficiently appends short strings to a gzip file, intended for applications where short messages are appended to a long log.
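gzlog itself ships as a C example with zlib, so using it from Python would need a small extension. If you want to stay in pure Python, a comparable (though not identical) approach relies on the gzip format allowing concatenated members, so append mode keeps the file readable by standard gzip tools; a sketch, with the function name and parameters being my own:

import csv
import gzip

def append_rows(path, rows):
    # Each call appends a new gzip member; standard readers decompress
    # the members back-to-back as a single stream.
    with gzip.open(path, 'at', newline='') as gz:
        writer = csv.writer(gz, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_NONNUMERIC)
        writer.writerows(rows)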
I have processes on several servers that send data to my local port 2222 via UDP every second.
I want to read this data and write it to shared memory so there can be other processes to read the data from shared memory and do things to it.
I've been reading about mmap, and it seems I have to use a file... I can't understand why.
I have an a.py that reads the data from the socket, but how can I write it to shm?
Once it's written, I need to write b.py, c.py, d.py, etc., to read the very same shm and do things to it.
Any help or snippet of codes would greatly help.
First, note that what you're trying to build will require more than just shared memory: it's all well if a.py writes to shared memory, but how will b.py know when the memory is ready and can be read from? All in all, it is often simpler to solve this problem by connecting the multiple processes not via shared memory, but through some other mechanism.
(The reason for why mmap usually needs a file name is that it needs a name to connect the several processes. Indeed, if a.py and b.py both call mmap(), how would the system know that these two processes are asking for memory to be shared between them, and not some unrelated z.py? Because they both mmaped the same file. There are also Linux-specific extensions to give a name that doesn't correspond to a file name, but it's more a hack IMHO.)
Maybe the most basic alternative mechanism is pipes: they are usually connected with the help of the shell when the programs are started. That's how the following works (on Linux/Unix): python a.py | python b.py. Any output that a.py sends goes to the pipe, whose other end is the input for b.py. You'd write a.py so that it listens to the UDP socket and writes the data to stdout, and b.py so that it reads from stdin to process the data received. If the data needs to go to several processes, you can use e.g. named pipes, which have a nice (but Bash-specific) syntax: python a.py >(python b.py) >(python c.py) will start a.py with two arguments, which are names of pseudo-files that can be opened and written to. Whatever is written to the first pseudo-file goes as input for b.py, and similarly what is written to the second pseudo-file goes as input for c.py.
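A rough sketch of the two ends (the port number is from the question; the chunk size and process() are placeholders):

# a.py - listen on UDP port 2222 and forward each datagram to stdout
import socket
import sys

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 2222))
while True:
    data, addr = sock.recvfrom(65535)   # one datagram per call
    sys.stdout.buffer.write(data)       # raw bytes go down the pipe
    sys.stdout.buffer.flush()

# b.py - read the stream from stdin and do something with it
import sys

while True:
    chunk = sys.stdin.buffer.read(4096)  # arbitrary chunk size
    if not chunk:
        break
    process(chunk)                       # placeholder for the actual processing

You would then connect them with: python a.py | python b.py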
mmap doesn't take a file name but rather a file descriptor. It performs the so-called memory mapping, i.e. it associates pages in the virtual memory space of the process with portions of the file-like object, represented by the file descriptor. This is a very powerful operation since it allows you:
to access the content of a file simply as an array in memory;
to access the memory of special I/O hardware, e.g. the buffers of a sound card or the framebuffer of a graphics adapter (this is possible since file descriptors in Unix are abstractions and can also refer to device nodes instead of regular files);
to share memory between processes by performing shared maps of the same object.
The old pre-POSIX way to use shared memory on Unix was to use the System V IPC shared memory. First a shared memory segment had to be created with shmget(2) and then attached to the process with shmat(2). SysV shared memory segments (as well as other IPC objects) have no names but rather numeric IDs, so the special hash function ftok(3) is provided, which converts the combination of a pathname string and a project ID integer into a numeric key ID, but collisions are possible.
The modern POSIX way to use shared memory is to open a file-like memory object with shm_open(2), resize it to the desired size with ftruncate(2) and then to mmap(2) it. Memory-mapping in this case acts like the shmat(2) call from the SysV IPC API and truncation is necessary since shm_open(2) creates objects with an initial size of zero.
(these are part of the C API; what Python modules provide are more or less thin wrappers around those calls, often with nearly the same signatures)
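For instance, on Python 3.8+ the multiprocessing.shared_memory module wraps this sequence (shm_open, ftruncate, mmap under the hood on Linux); the segment name and size below are made up for illustration:

from multiprocessing import shared_memory

# In the writer process: create a named segment
shm = shared_memory.SharedMemory(name='udp_data', create=True, size=4096)
shm.buf[:5] = b'hello'

# In a reader process: attach to the same segment by name
shm_r = shared_memory.SharedMemory(name='udp_data')
print(bytes(shm_r.buf[:5]))   # b'hello'

# Cleanup: close() in every process, unlink() exactly once
shm_r.close()
shm.close()
shm.unlink()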
It is also possible to get shared memory by memory mapping the same regular file in all processes that need to share memory. As a matter of fact, Linux implements the POSIX shared memory operations by creating files on a special tmpfs file system. The tmpfs driver implements very lightweight memory mapping by directly mapping the pages that hold the file content into the address space of the process that executes mmap(2). Since tmpfs behaves as a normal filesystem, you can examine its content using ls, cat and other shell tools. You can even create shared memory objects this way or modify the content of existing ones. The difference between a file in tmpfs and a regular filesystem file is that the latter is persisted to storage media (hard disk, network storage, flash drive, etc.), with changes occasionally flushed to that storage, while the former lives entirely in RAM. Solaris also provides a similar RAM-based filesystem, also called tmpfs.
In modern operating systems memory mapping is used extensively. Executable files are memory-mapped in order to supply the content of the pages that hold the executable code and the static data. Shared libraries are also memory-mapped. This saves physical memory since the mappings are shared: the same physical memory that holds the content of an executable file or a shared library is mapped into the virtual memory space of each process.