Loading large files into memory with Python - python

When working with large files and datasets (usually 1-2 GB or more), the process is killed due to running out of RAM. What tools and methods are available to save memory while still allowing the necessary operations, such as iterating over the entire file and accessing and assigning other large variables? Because I need read access to the whole file, I am unsure what solutions exist for this problem. Thanks for any help.
For reference, the project I am currently encountering this problem in is right here (dev branch).

Generally, you can use memory-mapped files so as to map a region of virtual memory onto a file in a storage device. This enables you to operate on a memory-mapped space that would not fit in RAM. Note that this is significantly slower than RAM (there is no free lunch). You can do this quite transparently with Numpy using numpy.memmap. Alternatively, there is the lower-level mmap module. For the sake of performance, operate on chunks, reading and writing each chunk of the memory-mapped section only once.
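For illustration, a minimal sketch of that chunked approach with numpy.memmap; the file name and dtype are assumptions, and the OS pages data in on demand rather than loading the whole file:

    import numpy as np

    # Map the file as a 1-D float64 array; nothing is read into RAM up front.
    arr = np.memmap("data.bin", dtype=np.float64, mode="r")

    chunk = 1_000_000
    total = 0.0
    for start in range(0, arr.shape[0], chunk):
        total += arr[start:start + chunk].sum()  # only this slice gets paged in

    print(total)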

Related

Which one is faster in production? file on disk vs file in memory(StringIO,BytesIO)

I am converting a dictionary to a pandas object and writing it out with to_csv. I have two ways of doing this:
1 - by writing the file to disk (with an open statement)
2 - by writing in memory (StringIO, BytesIO)
I have used both approaches: creating a file on disk, and using StringIO to hold the output for the pandas object. I tried to read comparisons between these, but I am still confused about which one is faster, so that I can use it in production to process tons of data.
Writing to and reading from memory is fast. But keep in mind that you have tons of data, so storing all of it in memory might take up all your RAM and make the system slow, or raise out-of-memory errors. So analyze and decide which data should be kept in memory and which should be written to files.
In general - writing to RAM (memory) will be faster.
However, you might want to use iterators (saving memory using iterators) if you have too much data, because your machine might run out of memory, or it will write a lot to your swap file (in short, an "extension" of your RAM on your hard drive; you can read about it here), which will hurt your performance a lot.
For benchmarking, if your code is fairly simple I would recommend using timeit, but there are even better resources for that, such as this one from scipy.
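As a rough illustration of such a benchmark, here is a hedged sketch comparing to_csv into an in-memory StringIO buffer against a file on disk with timeit; the DataFrame contents, sizes, and file name are made up:

    import timeit
    from io import StringIO

    import pandas as pd

    df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})

    def to_memory():
        buf = StringIO()
        df.to_csv(buf)        # write the CSV into RAM
        return buf.getvalue()

    def to_disk():
        df.to_csv("out.csv")  # write the CSV to the filesystem

    print("memory:", timeit.timeit(to_memory, number=20))
    print("disk:  ", timeit.timeit(to_disk, number=20))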

How can I mmap HDF5 data into multiple Python processes?

I am trying to load HDF5 data from a memory cache (memcached) or the network, and then query it (read only) from multiple Python processes, without making a separate copy of the whole data set. Intuitively I would like to mmap the image (as it would appear on disk) into the multiple processes, and then query it from Python.
I am finding this difficult to achieve, hence the question. Pointers/corrections appreciated.
Ideas I have explored so far
pytables - This looks the most promising; it supports a rich interface for querying HDF5 data and it (unlike numpy) seems to work with the data without making a (process-local) copy of it. It even supports a method File.get_file_image() which seems to get the file image. What I don't see is how to construct a new File / FileNode from a memory image rather than a disk file.
h5py - Another way to get at HDF5 data; as with pytables, it seems to require a disk file. The option driver='core' looks promising, but I can't see how to provide an existing mmap'd region to it, rather than having it allocate its own.
numpy - A lower level approach, if I share my raw data via mmap, then I might be able to construct a numpy ndarray which can access this data. But the relevant constructor ndarray.__new__(buffer=...) says it will copy the data, and numpy views can only seem to be constructed from existing ndarrays, not raw buffers.
ctypes - The very lowest-level approach (it could possibly use multiprocessing's Value wrapper to help a little). If I use ctypes directly I can read my mmap'd data without issue, but I would lose all the structural information and the help from numpy/pandas/pytables to query it.
Allocate disk space - I could just allocate a file, write out all the data, and then share it via pytables in all my processes. My understanding is that this would be memory efficient, because pytables doesn't copy (until required) and the processes would obviously share the OS disk cache of the underlying file image. My objection is that it is ugly and brings disk I/O into what I would like to be a purely in-memory system.
I think the situation should be updated now.
If a disk file is desirable, Numpy now has a standard, dedicated ndarray subclass:
numpy.memmap
UPDATE:
After looking into the implementation of multiprocessing.sharedctypes (CPython 3.6.2 shared memory block allocation code), I found that it always creates temporary files to be mmap'd, so it is not really a file-less solution.
If only pure RAM-based sharing is needed, someone has demoed it with multiprocessing.RawArray:
test of shared memory array / numpy integration
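For illustration, a minimal sketch of that RawArray approach; the sizes and names are arbitrary, and the shared block is re-wrapped as a numpy array in each process without copying:

    import multiprocessing as mp
    import numpy as np

    def worker(raw, n):
        # Re-wrap the shared buffer as an ndarray; no copy is made.
        arr = np.frombuffer(raw, dtype=np.float64, count=n)
        print("worker sees sum:", arr.sum())

    if __name__ == "__main__":
        n = 1_000_000
        raw = mp.RawArray("d", n)                        # shared, lock-free memory block
        arr = np.frombuffer(raw, dtype=np.float64, count=n)
        arr[:] = np.arange(n)                            # fill it in the parent

        p = mp.Process(target=worker, args=(raw, n))
        p.start()
        p.join()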
mmap + the core driver with h5py for in-memory, read-only access. I submitted a patch for h5py a while ago to make it work with file images for scenarios like this. Unfortunately it was rejected, because upstream didn't want to give users the ability to shoot themselves in the foot; safe buffer management (via the buffer protocol introduced in Python 2.7) would have required changes on HDF's side, which I haven't gotten around to. Still, if this is important to you and you are careful and capable of building pyHDF yourself, take a look at the patch/pull request here.
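For what it's worth, a hedged sketch of in-memory, read-only access with the core driver; if I recall correctly, h5py 2.9+ also accepts Python file-like objects, which covers the open-from-a-memory-image case (the file name and dataset layout are assumptions):

    import io
    import h5py

    # core driver: the whole file is loaded into memory on open; with
    # backing_store=False nothing is ever written back to disk.
    with h5py.File("data.h5", "r", driver="core", backing_store=False) as f:
        print(list(f.keys()))

    # Newer h5py: open an HDF5 image that already lives in a bytes buffer.
    with open("data.h5", "rb") as fh:
        image = fh.read()
    with h5py.File(io.BytesIO(image), "r") as f:
        print(list(f.keys()))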

What is the advantage of setting zip_safe to True when packaging a Python project?

The setuptools documentation only states:
For maximum performance, Python packages are best installed as zip files. Not all packages, however, are capable of running in compressed form, because they may expect to be able to access either source code or data files as normal operating system files. So, setuptools can install your project as a zipfile or a directory, and its default choice is determined by the project's zip_safe flag (reference).
In practical terms, what is the performance benefit gained? Is it worth investigating if my projects are zip-safe, or are the benefits generally minimal?
Zip files take up less space on disk, which also means they are read from disk more quickly. Since most things are I/O-bound, the overhead of decompressing the package may be less than the overhead of reading a larger file from disk. Moreover, a single, small-ish zip file is likely stored sequentially on disk, while a collection of smaller files may be more spread out. On rotational media, this also increases read performance by cutting down the number of seeks. So you generally optimize your disk usage at the cost of some CPU time, which may dramatically improve your import and load times.
There are several advantages, in addition to the ones already mentioned.
Reading a single large .egg file (and unzipping it) may be significantly faster than loading many smaller .py files, depending on the storage medium/filesystem it resides on.
Some filesystems have a large block size (e.g., 1 MB), which means that dealing with small files can be expensive. Even though your file is small (say, 10 KB), you may actually be loading a 1 MB block from disk when reading it. Typically, filesystems combine multiple small files into a large block to mitigate this a bit.
On filesystems where access to file metadata is slow (which sometimes happens with shared filesystems, like NFS), accessing a large number of files may be very expensive too.
Of course, zipping the whole bunch also helps, since that means that less data will have to be read in total.
Long story short: it may matter a lot if your filesystem is better suited to a small number of large files.
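For reference, the flag itself is just a keyword argument to setup(); a minimal sketch, with placeholder package metadata:

    from setuptools import setup, find_packages

    setup(
        name="example_project",      # placeholder name
        version="0.1",
        packages=find_packages(),
        zip_safe=True,               # allow installation as a zipped .egg
    )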

Is there a way to borrow hard drive memory as needed?

This might be a stupid question, but if, for example, I'm working with very large arrays that take up 2.1 GB of RAM on my 2 GB computer, is there a way to borrow the extra 0.1 GB from my hard drive as needed?
Your operating system already does that (Windows, *nix). It's called Virtual Memory.
Yes. It is part of virtual memory, specifically swapping. Most modern operating systems do this without the programmer having to worry about it.
When physical RAM becomes exhausted, the hard disk is used as extra memory. This is very slow, because the access time of a hard disk (milliseconds) is on the order of a million times slower than that of DRAM (nanoseconds).
It would be advisable to increase your RAM, if possible.
As others have stated, yes, the system already has virtual memory.
However you can take advantage of this in another way. You can use memory mapped files to allow the system to directly map the arrays to disk.
Used this way, as you write to memory (the arrays), the system uses its virtual memory management to back the data with the disk. You might ask how this is different from the standard VMM that the OS already does. The advantage is that it won't use the standard swap space (the page file on Windows), so that space remains available for the rest of the system to use.
You still have large resource usage, but you gain by freeing up swap space and, in effect, borrowing more virtual memory. The other advantage is that there is no duplication of data, i.e. if you're loading large datasets, you just map the disk space into memory, and vice versa for writes.
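A minimal sketch of what that looks like with numpy.memmap; the file name and size are arbitrary, and writes go to file-backed pages rather than to swap:

    import numpy as np

    # Roughly 2.1 GB of float64, backed by a file instead of RAM/swap.
    arr = np.memmap("big_array.dat", dtype=np.float64, mode="w+",
                    shape=(280_000_000,))

    arr[:1000] = 1.0   # writes touch only the mapped pages involved
    arr.flush()        # force dirty pages out to the file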

How to deserialize 1GB of objects into Python faster than cPickle?

We've got a Python-based web server that unpickles a number of large data files on startup using cPickle. The data files (pickled using HIGHEST_PROTOCOL) are around 0.4 GB on disk and load into memory as about 1.2 GB of Python objects -- this takes about 20 seconds. We're using Python 2.6 on 64-bit Windows machines.
The bottleneck is certainly not disk (it takes less than 0.5s to actually read that much data), but memory allocation and object creation (there are millions of objects being created). We want to reduce the 20s to decrease startup time.
Is there any way to deserialize more than 1GB of objects into Python much faster than cPickle (like 5-10x)? Because the execution time is bound by memory allocation and object creation, I presume using another unpickling technique such as JSON wouldn't help here.
I know some interpreted languages have a way to save their entire memory image as a disk file, so they can load it back into memory all in one go, without allocation/creation for each object. Is there a way to do this, or achieve something similar, in Python?
Try the marshal module - it's internal (used by the byte-compiler) and intentionally not advertised much, but it is much faster. Note that it doesn't serialize arbitrary instances like pickle does, only builtin types (I don't remember the exact constraints; see the docs). Also note that the format isn't stable. (A sketch follows after this list of suggestions.)
If you need to initialize multiple processes and can tolerate one process always being loaded, there is an elegant solution: load the objects in one process, and then do nothing in it except forking processes on demand. Forking is fast (copy-on-write) and shares the memory between all processes; see the second sketch after this list. [Disclaimers: untested; unlike Ruby, Python's reference counting will touch the objects and trigger page copies, so this is probably useless if you have huge objects and/or access only a small fraction of them.]
If your objects contain lots of raw data like numpy arrays, you can memory-map them for much faster startup. pytables is also good for these scenarios.
If you'll only use a small part of the objects, then an OO database (like Zope's) can probably help you. Though if you need them all in memory, you will just add a lot of overhead for little gain. (I have never used one, so this might be nonsense.)
Maybe other Python implementations can do it? I don't know, just a thought...
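As a sketch of the marshal suggestion above (the file name is arbitrary; only builtin types survive the round trip):

    import marshal

    data = {"ids": list(range(1_000_000)), "name": "example"}

    with open("data.marshal", "wb") as f:
        marshal.dump(data, f)     # much faster than pickle, builtin types only

    with open("data.marshal", "rb") as f:
        loaded = marshal.load(f)

    assert loaded["name"] == "example"

And a hedged sketch of the fork-and-share idea under the "fork" start method (POSIX only); load_objects() here is a placeholder for the expensive unpickling step:

    import multiprocessing as mp

    def load_objects():
        # Placeholder: imagine the 20-second cPickle.load() happening here.
        return {"big": list(range(1_000_000))}

    def handle_request(shared, request_id):
        # The forked child sees the parent's objects via copy-on-write pages.
        print(request_id, len(shared["big"]))

    if __name__ == "__main__":
        mp.set_start_method("fork")   # required for the copy-on-write trick
        shared = load_objects()       # load once, in the parent

        workers = [mp.Process(target=handle_request, args=(shared, i))
                   for i in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()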
Are you load()ing the pickled data directly from the file? What about trying to load the file into memory first and then doing the load?
I would start by trying cStringIO(); alternatively, you may try to write your own version of StringIO that uses buffer() to slice the memory, which would reduce the number of copy() operations needed (cStringIO may still be faster, but you'll have to try).
There are sometimes huge performance bottlenecks when doing these kinds of operations, especially on the Windows platform; Windows is somehow very unoptimized for doing lots of small reads, while UNIXes cope quite well. If load() does lots of small reads, or you are calling load() several times to read the data, this will help.
I haven't used cPickle (or Python), but in cases like this I think the best strategy is to
avoid unnecessary loading of the objects until they are really needed - say, load them after startup on a different thread. It's usually better to avoid unnecessary loading/initialization at any time, for obvious reasons. Google 'lazy loading' or 'lazy initialization'. If you really need all the objects to do some task before the server starts up, then maybe you can try to implement a custom deserialization method; in other words, implement something yourself using your intimate knowledge of the data, which can help you 'squeeze' better performance than a general-purpose tool.
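As a hedged sketch of the lazy-loading idea (the path and contents are placeholders), the expensive load can be deferred until the data is first used:

    import pickle

    _dataset = None

    def get_dataset():
        # Load the pickle only on first access; later calls reuse the cached object.
        global _dataset
        if _dataset is None:
            with open("big_data.pkl", "rb") as f:   # placeholder path
                _dataset = pickle.load(f)
        return _dataset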
Did you try sacrificing efficiency of pickling by not using HIGHEST_PROTOCOL? It isn't clear what performance costs are associated with using this protocol, but it might be worth a try.
Impossible to answer this without knowing more about what sort of data you are loading and how you are using it.
If it is some sort of business logic, maybe you should try turning it into a pre-compiled module;
If it is structured data, can you delegate it to a database and only pull what is needed?
Does the data have a regular structure? Is there any way to divide it up and decide what is required and only then load it?
I'll add another answer that might be helpful - if you can, try to define __slots__ on the class that is most commonly created. This may be a little limiting, and sometimes impossible, but in my test it seems to have cut the time needed for initialization roughly in half.
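A minimal sketch of that suggestion; Point here is a stand-in for whatever class is created millions of times during unpickling:

    class Point(object):
        __slots__ = ("x", "y")   # no per-instance __dict__, so creation is cheaper

        def __init__(self, x, y):
            self.x = x
            self.y = y

    p = Point(1.0, 2.0)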
