load and operate on matrices bigger than RAM - python - numpy - pandas

my tasks:
load from the database matrices whose dimensions are bigger than my
RAM, using pandas.read_sql(...) (the database is PostgreSQL)
operate on the numpy representation of such matrices (bigger than my RAM) using numpy
the problem: I get a memory error even when loading the data from the database.
my temporary quick-and-dirty solution: loop over chunks of the aforementioned data (importing parts of it at a time), which lets the RAM handle the workload. The issue at play here is speed: run time goes up significantly. Before delving into Cython optimization and the like, I wanted to know whether there are existing solutions (either in the form of data structures, such as the shelve library, or the HDF5 format) to solve the issue.
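For what it's worth, pandas can do this chunked import itself: read_sql accepts a chunksize argument and then returns an iterator of DataFrames, so only one chunk sits in RAM at a time. A minimal sketch, assuming a SQLAlchemy engine with a placeholder connection string and table name:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")  # placeholder

# read_sql with chunksize returns an iterator instead of one huge DataFrame
for chunk in pd.read_sql("SELECT * FROM big_matrix", engine, chunksize=100_000):
    block = chunk.to_numpy()   # numpy view of this chunk only
    # ... operate on block, then let it be garbage-collected before the next one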

Related

Memory management in numpy arrays, Python

I get a memory error when processing a very large (>50 GB) file (the problem: RAM gets full).
My solution is: I would like to read only 500 kilobytes of data at a time, process it, delete it from memory, and move on to the next 500 KB. Is there any better solution? Or, if this approach seems right, how do I do it with a numpy array?
This is just a quarter of the code (just to give an idea):
import h5py
import numpy as np
import sys
import time
import os
hdf5_file_name = r"test.h5"
dataset_name = 'IMG_Data_2'
file = h5py.File(hdf5_file_name,'r+')
dataset = file[dataset_name]
data = dataset.value  # reads the entire dataset into RAM at once (.value is deprecated; dataset[()] does the same)
dec_array = data.flatten()
........
I get a memory error at this point itself, as it tries to put all the data into memory.
Quick answer
numpy.memmap allows presenting a large file on disk as a numpy array. I don't know if it allows mapping files larger than RAM + swap, though. Worth a shot.
[Presentation about out-of-memory work with Python](http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html)
Longer answer
A key question is how much RAM you have (<10 GB or >10 GB) and what kind of processing you're doing (do you need to look at each element in the dataset only once, or do you need to look at the whole dataset at once?).
If it's <10 GB and you only need to look at each element once, then your approach seems like the most sensible one. It's a standard way to deal with datasets that are larger than main memory. What I'd do is increase the size of a chunk from 500 KB to something closer to the amount of memory you have: perhaps half of physical RAM, but in any case something in the GB range, yet not so large that it causes swapping to disk and interferes with your algorithm. A nice optimisation would be to hold two chunks in memory at a time: one is being processed while the other is being loaded from disk in parallel. This works because loading from disk is relatively expensive but doesn't require much CPU work; the CPU is basically waiting for data to arrive. It's harder to do in Python because of the GIL, but numpy and friends should not be affected by that, since they release the GIL during math operations. The threading package might be useful here; a sketch of this two-chunk pattern follows below.
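A minimal sketch of that two-chunk idea, assuming the data is a flat binary file of float32 values; the path, dtype, and chunk size are placeholders. One background thread loads the next block from disk while the main thread reduces the current one:

import queue
import threading
import numpy as np

def load_chunks(path, items_per_chunk, out_queue):
    # Producer: read fixed-size blocks from disk and hand them to the consumer.
    with open(path, "rb") as f:
        while True:
            block = np.fromfile(f, dtype=np.float32, count=items_per_chunk)
            if block.size == 0:
                break
            out_queue.put(block)
    out_queue.put(None)  # sentinel: no more data

def chunked_sum(path, items_per_chunk=250_000_000):  # ~1 GB of float32 per chunk
    q = queue.Queue(maxsize=1)  # holds at most one extra chunk besides the one in use
    threading.Thread(target=load_chunks, args=(path, items_per_chunk, q), daemon=True).start()
    total = 0.0
    while True:
        block = q.get()
        if block is None:
            break
        total += block.sum()   # numpy releases the GIL here, so the next read overlaps
    return total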
If you have low RAM AND need to look at the whole dataset at once (perhaps when computing some quadratic-time ML algorithm, or doing random accesses into the dataset), things get more complicated, and you probably won't be able to use the previous approach. Either upgrade your algorithm to a linear one, or you'll need to implement some logic to make the algorithms in numpy and friends work on data that stays on disk rather than in RAM.
If you have >10 GB of RAM, you might let the operating system do the hard work for you and increase the swap size enough to hold the whole dataset. That way everything is loaded into virtual memory, but only a subset is loaded into physical memory, and the operating system handles the transitions between them, so everything looks like one giant block of RAM. How to increase swap is OS-specific, though.
The memmap object can be used anywhere an ndarray is accepted. Given a memmap fp, isinstance(fp, numpy.ndarray) returns True.
Memory-mapped files cannot be larger than 2GB on 32-bit systems.
When a memmap causes a file to be created or extended beyond its current size in the filesystem, the contents of the new part are unspecified. On systems with POSIX filesystem semantics, the extended part will be filled with zero bytes.
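A small sketch of the memmap route, with a placeholder filename, dtype, and shape (for mode='r' the file must already exist and match the given size):

import numpy as np

rows, cols = 1_000_000, 513   # placeholders
fp = np.memmap("big_matrix.dat", dtype=np.float32, mode="r", shape=(rows, cols))

print(isinstance(fp, np.ndarray))   # True, as noted above
col_means = fp.mean(axis=0)         # pages are streamed in from disk as needed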

Memory Errors in Python

I created a non-sparse matrix in Python with the shape 156,100,000 x 513 and therefore 8,007,930,000 entries. The entries are float32 numbers. I obviously have trouble now using this data on my 8 GB RAM system.
What are common approaches to address this problem? I'm not sure how to use the data chunk-wise, for example, since all those numpy and scipy functions always operate on the whole dataset. Sparse matrices won't work because of the data structure, and I also have trouble understanding whether HDF5 is really suited to my problem (which is to use a big dataset with standard numpy functions without crashing my memory).
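As a sketch of how HDF5 can fit this (with assumed dataset names and a scaled-down shape): h5py keeps the float32 matrix on disk, and ordinary numpy functions are applied to one row block at a time, so only a single block is ever in RAM.

import h5py
import numpy as np

rows, cols, block = 1_000_000, 513, 100_000   # stand-ins for the real sizes

# create and fill the on-disk dataset block by block
with h5py.File("matrix.h5", "w") as f:
    dset = f.create_dataset("m", shape=(rows, cols), dtype="float32", chunks=True)
    for start in range(0, rows, block):
        stop = min(start + block, rows)
        dset[start:stop] = np.random.rand(stop - start, cols).astype("float32")

# apply a standard numpy function block-wise, writing results back to disk
with h5py.File("matrix.h5", "r+") as f:
    dset = f["m"]
    for start in range(0, rows, block):
        chunk = dset[start:start + block]            # a plain ndarray in RAM
        dset[start:start + block] = np.log1p(chunk)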

How to tell pytables the amount of RAM to be used (using it for paging)?

I am storing several numpy arrays in a pytables file. Each individual array (~1 MB to ~100 MB in size) fits into RAM, but not all of the (N ~ 10 to 1000) arrays together.
In the application I operate repeatedly on these arrays, also changing their shapes etc. So I want to use pytables to swap currently unneeded arrays to disk and reload them when needed (paging). The swapping is supposed to work on a "Least Recently Used" basis.
How can I tell pytables how much RAM it can use?
I tried playing around with parameters.NODE_CACHE_SLOTS, but it had basically no effect. In a test script, I store ~200 random arrays of shape (~1000, ~1000) in a table. No matter what I chose for parameters like NODE_CACHE_SLOTS, the RAM used stayed the same -- about 80 MB, while several GB would be available.
Especially in cases where all nodes fit into RAM, programs using pytables would not require any disk I/O and would hence, of course, be much faster. In general, one wants to exploit the available RAM.
[Of course it is also interesting if you know a better option than pytables for such paging purposes.]
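A sketch of the kind of test script described, with placeholder names. One thing worth noting: NODE_CACHE_SLOTS controls how many nodes (metadata objects) PyTables keeps in its LRU cache, not a RAM budget for array data, which may be why changing it barely moved the memory figure. open_file accepts such parameters as keyword overrides:

import numpy as np
import tables

# per-file override of the node cache size (a global default also exists
# in tables.parameters.NODE_CACHE_SLOTS)
with tables.open_file("cache_test.h5", mode="w", NODE_CACHE_SLOTS=1024) as h5:
    for i in range(200):
        h5.create_array(h5.root, "arr_%d" % i, np.random.rand(1000, 1000))

with tables.open_file("cache_test.h5", mode="r") as h5:
    first = h5.root.arr_0.read()   # array data is only read from disk on demand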

What are the options to compute statistical models on out-of-memory data sets in Python?

I'm referring to sub-Hadoop-size data, but bigger than RAM.
Must these be coded by hand?
I'd try pytables; it's based on HDF5 and numpy, so you can use the same good statistical packages in Python (which are mostly based on numpy in some manner) while not having to put everything in memory.
http://www.pytables.org/moin/MainFeatures
* Unlimited dataset size
Allows working with tables and/or arrays with a very large number of rows (up to 2**63), i.e. datasets that don't fit in memory.
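As an illustration of that out-of-core style, here is a sketch with hypothetical file and node names: a PyTables array is streamed through RAM in blocks to compute per-column means, so the statistics code only ever sees one block-sized numpy array at a time.

import numpy as np
import tables

with tables.open_file("big_data.h5", mode="r") as h5:
    data = h5.root.measurements          # assumed 2-D array node (rows x cols)
    block = 100_000

    total = np.zeros(data.shape[1], dtype=np.float64)
    for start in range(0, data.shape[0], block):
        chunk = data[start:start + block]    # only this slice is read from disk
        total += chunk.sum(axis=0)

    column_means = total / data.shape[0]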

Python - storing integer array to disk for efficient retrieval

I have a large integer array that I need to store in a file. What is the most efficient way to store it so that retrieval is fast? I'm not concerned with the efficiency of writing to disk, only with reading.
Is there a good solution other than JSON and pickle?
JSON/pickle are very low-efficiency solutions, as they require at best several memory copies to get your data in or out.
Keep your data binary if you want the best efficiency. The pure Python approach would involve struct.unpack, but this is a little kludgy, as you still need a memory copy.
Even better is something like numpy.memmap, which directly maps your file to a numpy array. Very fast, very memory efficient. Problem solved. You can also write your file using the same approach.
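A sketch of that numpy binary route with placeholder data: write the array once with np.save, then open it memory-mapped, so a read only touches the slices you actually index.

import numpy as np

arr = np.arange(100_000_000, dtype=np.int64)   # placeholder for your integer array
np.save("data.npy", arr)                       # one-off binary write

loaded = np.load("data.npy", mmap_mode="r")    # near-instant; nothing is read yet
print(loaded[5_000_000:5_000_010])             # only these pages are pulled from disk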
msgpack will probably beat JSON in terms of performance when loading data; at least, msgpack beat JSON in my tests loading many large files. Yet another possibility is to try HDF5 for Python:
HDF5 is an open-source library and file format for storing large amounts of numerical data, originally developed at NCSA. It is widely used in the scientific community for everything from NASA’s Earth Observing System to the storage of data from laboratory experiments and simulations. Over the past few years, HDF5 has rapidly emerged as the de-facto standard technology in Python to store large numerical datasets.
In your case I would go for HDF5.
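And a sketch of the HDF5 option with h5py, again with placeholder names: the integer array is written once, and later reads pull back only the requested slice instead of deserializing the whole file.

import h5py
import numpy as np

arr = np.arange(50_000_000, dtype=np.int32)    # placeholder data
with h5py.File("ints.h5", "w") as f:
    f.create_dataset("values", data=arr, chunks=True)

with h5py.File("ints.h5", "r") as f:
    part = f["values"][1_000_000:1_000_100]    # only this slice is read from disk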
