I get a memory error when processing a very large (>50 GB) file: RAM fills up completely.
My solution would be to read only 500 KB of data at a time, process it, delete it from memory, and move on to the next 500 KB. Is there a better solution? Or, if this approach is reasonable, how do I do it with a numpy array?
This is only about a quarter of the code, just to give an idea:
import h5py
import numpy as np
import sys
import time
import os
hdf5_file_name = r"test.h5"
dataset_name = 'IMG_Data_2'
file = h5py.File(hdf5_file_name,'r+')
dataset = file[dataset_name]
data = dataset.value        # this loads the ENTIRE dataset into memory
dec_array = data.flatten()
........
I get the memory error at this point itself, since it tries to load all the data into memory.
Quick answer
numpy.memmap allows presenting a large file on disk as a numpy array. I don't know whether it allows mapping files larger than RAM + swap, though. Worth a shot.
[Presentation about out-of-memory work with Python](http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html)
Longer answer
A key question is how much RAM you have (less or more than ~10 GB) and what kind of processing you're doing (whether you need to look at each element of the dataset only once, or at the whole dataset at once).
If you have less than 10 GB and only need to look at each element once, then your approach seems like a decent one. It's the standard way to deal with datasets that are larger than main memory. What I'd do is increase the chunk size from 500 KB to something closer to the amount of memory you have - perhaps half of physical RAM - in any case something in the GB range, but not so large that it causes swapping to disk and interferes with your algorithm.
A nice optimisation would be to hold two chunks in memory at once: one being processed while the other is loaded from disk in parallel. This works because loading data from disk is relatively expensive but doesn't require much CPU work - the CPU is basically waiting for the data to arrive. It's harder to do in Python because of the GIL, but numpy and friends should not be affected, since they release the GIL during math operations. The threading package might be useful here.
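A minimal sketch of the chunked approach, reusing the file and dataset names from your snippet; process() is a placeholder for whatever per-chunk work you actually do:

import h5py

CHUNK_ROWS = 1_000_000   # tune so one chunk is a sizeable fraction of your RAM

with h5py.File("test.h5", "r") as f:
    dset = f["IMG_Data_2"]
    for start in range(0, dset.shape[0], CHUNK_ROWS):
        chunk = dset[start:start + CHUNK_ROWS]   # only this slice is read from disk
        process(chunk)                           # placeholder for your per-chunk processing
        del chunk                                # drop the reference so memory can be reclaimed

The double-buffering variant would load the next slice in a background thread (e.g. with concurrent.futures.ThreadPoolExecutor) while the current one is being processed.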
If you have low RAM AND need to look at the whole dataset at once (perhaps when computing some quadratic-time ML algorithm, or even doing random accesses in the dataset), things get more complicated, and you probably won't be able to use the previous approach. Either upgrade your algorithm to a linear one, or you'll need to implement some logic to make the algorithms in numpy etc work with data on disk directly rather than have it in RAM.
If you have >10GB of RAM, you might let the operating system do the hard work for you and increase swap size enough to capture all the dataset. This way everything is loaded into virtual memory, but only a subset is loaded into physical memory, and the operating system handles the transitions between them, so everything looks like one giant block of RAM. How to increase it is OS specific though.
The memmap object can be used anywhere an ndarray is accepted. Given a memmap fp, isinstance(fp, numpy.ndarray) returns True.
Memory-mapped files cannot be larger than 2GB on 32-bit systems.
When a memmap causes a file to be created or extended beyond its current size in the filesystem, the contents of the new part are unspecified. On systems with POSIX filesystem semantics, the extended part will be filled with zero bytes.
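For reference, a minimal sketch of numpy.memmap on a hypothetical binary file of float64 values (the file name and shape are made up; the file must already contain rows * cols * 8 bytes):

import numpy as np

arr = np.memmap("big_data.bin", dtype=np.float64, mode="r", shape=(1_000_000, 1000))

# Only the pages touched by this slice are actually read from disk.
col_means = arr[:100_000].mean(axis=0)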
Related
I have a rather large rectangular (>1G rows, 1K columns) Fortran-style NumPy matrix, which I want to transpose to C-style.
So far, my approach has been relatively trivial, using the following Rust snippet with memory-mapped (mmapped) slices of the source and destination matrices, where both original_matrix and target_matrix are mmapped PyArray2, with Rayon handling the parallelization.
Since the target_matrix has to be modified by multiple threads, I wrap it in an UnsafeCell.
let shared_target_matrix = std::cell::UnsafeCell::new(target_matrix);

original_matrix.as_ref().par_chunks(number_of_nodes).enumerate().for_each(|(j, feature)| {
    feature.iter().copied().enumerate().for_each(|(i, feature_value)| unsafe {
        // Write the (i, j) element of the transposed matrix.
        *(shared_target_matrix.uget_mut([i, j])) = feature_value;
    });
});
This approach transposes a matrix with shape (~1G, 100), ~120 GB, in about 3 hours on an HDD. Transposing a (~1G, 1000), ~1200 GB matrix does not scale linearly to 30 hours, as one might naively expect, but explodes to several weeks. As it stands, I have managed to transpose roughly 100 features in 2 days, and it keeps slowing down.
There are several aspects, such as the employed file system, the HDD fragmentation, and how mmap handles page loading, which my solution is currently ignoring.
Are there known, more holistic solutions that take into account these issues?
Note on sequential and parallel approaches
While intuitively this sort of operation should be limited only by IO, and therefore should not benefit from parallelization, we have observed experimentally that the parallel approach is indeed around three times faster (on a machine with 12 cores and 24 threads) than a sequential approach when transposing a matrix with shape (1G, 100). We are not sure why this is the case.
Note on using two HDDs
We also experimented with using two devices, one providing the Fortran-style matrix and a second one where we write the target matrix. Both HDDs were connected through SATA cables directly to the computer motherboard. We expected at least a doubling of the performance, but they remained unchanged.
While intuitively this sort of operation should be limited only by IO, and therefore should not benefit from parallelization, we have observed experimentally that the parallel approach is indeed around three times faster
This may be due to poor IO queue utilization. With an entirely sequential workload without prefetching you'll be alternating the device between working and idle. If you keep multiple operations in flight it'll be working all the time.
Check with iostat -x <interval>
But parallelism is a suboptimal way to achieve best utilization of a HDD because it'll likely cause more head-seeks than necessary.
We also experimented with using two devices, one providing the Fortran-style matrix and a second one where we write the target matrix. Both HDDs were connected through SATA cables directly to the computer motherboard. We expected at least a doubling of the performance, but they remained unchanged.
This may be due to the operating system's write cache which means it can batch writes very efficiently and you're mostly bottlenecked on reads. Again, check with iostat.
There are several aspects, such as the employed file system, the HDD fragmentation, and how mmap handles page loading, which my solution is currently ignoring.
Are there known, more holistic solutions that take into account these issues?
Yes, if the underlying filesystem supports it you can use FIEMAP to get the physical layout of the data on disk and then optimize your read order to follow the physical layout rather than the logical layout. You can use the filefrag CLI tool to inspect the fragmentation data manually, but there are rust bindings for that ioctl so you can use it programmatically too.
Additionally, you can use madvise(MADV_WILLNEED) to tell the kernel to prefetch data in the background for the next few loop iterations. For HDDs this should ideally be done in batches of a few megabytes at a time, and the next batch should be issued when you're half-way through the current one.
Issuing them in batches minimizes syscall overhead and starting the next one half-way through ensures there's enough time left to actually complete the IO before you reach the end of the current one.
And since you'll be issuing prefetches manually in physical instead of logical order, you can also disable the default readahead heuristics (which would only get in the way) via madvise(MADV_RANDOM).
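For illustration, here is roughly what that prefetching pattern looks like in Python, whose mmap module exposes madvise on Python 3.8+ on Linux (in Rust you would reach the same calls through e.g. libc or the memmap2 crate). The file name is a placeholder and the batch size is just a starting point:

import mmap
import os

BATCH = 8 * 1024 * 1024          # prefetch a few megabytes at a time (page-aligned)

fd = os.open("matrix.bin", os.O_RDONLY)
size = os.fstat(fd).st_size
m = mmap.mmap(fd, size, prot=mmap.PROT_READ)

m.madvise(mmap.MADV_RANDOM)                          # turn off the default readahead heuristics
m.madvise(mmap.MADV_WILLNEED, 0, min(BATCH, size))   # prefetch the first batch

offset = 0
while offset < size:
    end = min(offset + BATCH, size)
    # ... process the first half of m[offset:end] ...
    if end < size:
        # half-way through the current batch: ask the kernel for the next one
        m.madvise(mmap.MADV_WILLNEED, end, min(BATCH, size - end))
    # ... process the second half of m[offset:end] ...
    offset = end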
If you have enough free diskspace you could also try a simpler approach: defragmenting the file before operating on it. But even then you should still use madvise to ensure that there always are IO requests in flight.
I am reading a 15 GB .csv file using the pandas read_csv() function with its iterator/chunksize functionality, because I only need a subset of about 20% of the file.
I am doing this in PyCharm, where I set the max heap size to 18 GB (although I have 16 GB of RAM) and the minimum allocated memory to half of that, 9 GB. Throughout this process PyCharm indicates I am using around 100-200 MB of RAM, while the Windows Task Manager reports approximately 2.5 GB, including both the PyCharm and Python processes. According to the Task Manager, about 45% of my memory is still free.
As far as I can see, nothing indicates that I am running out of memory. Still, while reading in this data I get a MemoryError:
MemoryError: Unable to allocate array with shape (4, 8193780) and data type float64
Is there someone that can clarify this for me? I would suspect that maybe the final dataframe is larger than my RAM can handle? That would be:
(4 * 8193780 * 8 bytes (float64)) / 1024**3 ≈ 0.24 GB < 1 GB
So the above also does not seem to be the problem, or am I missing something here?
I think you are using 15 GB of memory just to read your file, since I suspect read_csv() scans the whole file even when you specify the chunk/iterator options to use only 20% of it. On top of that, you are running Windows and PyCharm, which need at least 1 GB of memory, so adding everything up, my guess is that you are simply out of memory.
But here are some ways to approach your problem:
Verify the dtype of your array and try to find the best one for your purpose. For example, you are using float64; consider whether float32 or even float16 might be appropriate (see the sketch after this list).
Consider whether your computation can be done on a subset of the data. This is called subsampling. Maybe using subsampling you get a good-enough model (this may be the case for a clustering algorithm like k-means).
You may also search for out-of-core solutions. This may mean either rethinking your algorithm (can you split the work?) or trying a solution that does it transparently.
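As a rough sketch combining the dtype and chunking points (the file name, column names, and the filter are made up; adapt them to your schema):

import numpy as np
import pandas as pd

chunks = pd.read_csv(
    "big_file.csv",
    chunksize=1_000_000,                       # read about 1M rows at a time
    dtype={"x": np.float32, "y": np.float32},  # half the footprint of the default float64
)

kept = []
for chunk in chunks:
    kept.append(chunk[chunk["y"] > 0])         # keep only the ~20% you actually need

df = pd.concat(kept, ignore_index=True)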
I would like to know how Python actually manages memory allocation for ndarrays.
I loaded a file that contains 32K floating-point values using numpy's loadtxt, so the ndarray should hold 256 KB of data.
Indeed, ndarray.nbytes reports the right size.
However, the memory occupation after loading the data increased by 2 MB: I don't understand this difference.
I'm not sure exactly how you measure memory occupation, but when looking at the memory footprint of your entire app, a lot more could be happening that causes this kind of increase.
In this case, I suspect that the loadtxt function uses some buffering or otherwise copies the data which wasn't cleared yet by the GC.
But other things could be happening as well. Maybe the numpy back-end loads some extra stuff the first time it initialises an ndarray. Either way, you can only truly figure this out by reading the numpy source code, which is freely available on GitHub. The implementation of loadtxt can be found here: https://github.com/numpy/numpy/blob/5b22ee427e17706e3b765cf6c65e924d89f3bfce/numpy/lib/npyio.py#L797
I am trying to calculate the cosine similarity of 100,000 vectors, and each of these vectors has 200,000 dimensions.
From reading other questions I know that memmap, PyTables and h5py are my best bets for handling this kind of data, and I am currently working with two memmaps; one for reading the vectors, the other for storing the matrix of cosine similarities.
Here is my code:
import numpy as np
import scipy.spatial.distance as dist
xdim = 200000
ydim = 100000
wmat = np.memmap('inputfile', dtype = 'd', mode = 'r', shape = (xdim,ydim))
dmat = np.memmap('outputfile', dtype = 'd', mode = 'readwrite', shape = (ydim,ydim))
for i in np.arange(ydim):
    for j in np.arange(i + 1, ydim):
        dmat[i, j] = dist.cosine(wmat[:, i], wmat[:, j])
    dmat.flush()
Currently, htop reports that I am using 224G of VIRT memory, and 91.2G of RES memory which is climbing steadily. It seems to me as if, by the end of the process, the entire output matrix will be stored in memory, which is something I'm trying to avoid.
QUESTION:
Is this a correct usage of memmaps? Am I writing to the output file in a memory-efficient manner (by which I mean that only the necessary parts of the in- and output files, i.e. dmat[i,j] and wmat[:,i/j], are kept in memory)?
If not, what did I do wrong, and how can I fix this?
Thanks for any advice you may have!
EDIT: I just realized that htop is reporting total system memory usage at 12G, so it seems it is working after all... anyone out there who can enlighten me? RES is now at 111G...
EDIT2: The memmap is created from a 1D array consisting of lots and lots of long decimals quite close to 0, which is shaped to the desired dimensions. The memmap then looks like this.
memmap([[ 9.83721223e-03, 4.42584107e-02, 9.85033578e-03, ...,
-2.30691545e-07, -1.65070799e-07, 5.99395837e-08],
[ 2.96711345e-04, -3.84307391e-04, 4.92968462e-07, ...,
-3.41317722e-08, 1.27959347e-09, 4.46846438e-08],
[ 1.64766260e-03, -1.47337747e-05, 7.43660202e-07, ...,
7.50395136e-08, -2.51943163e-09, 1.25393555e-07],
...,
[ -1.88709000e-04, -4.29454722e-06, 2.39720287e-08, ...,
-1.53058717e-08, 4.48678211e-03, 2.48127260e-07],
[ -3.34207882e-04, -4.60275148e-05, 3.36992876e-07, ...,
-2.30274532e-07, 2.51437794e-09, 1.25837564e-01],
[ 9.24923862e-04, -1.59552854e-03, 2.68354822e-07, ...,
-1.08862665e-05, 1.71283316e-07, 5.66851420e-01]])
In terms of memory usage, there's nothing particularly wrong with what you're doing at the moment. Memmapped arrays are handled at the level of the OS - data to be written is usually held in a temporary buffer, and only committed to disk when the OS deems it necessary. Your OS should never allow you to run out of physical memory before flushing the write buffer.
I'd advise against calling flush on every iteration since this defeats the purpose of letting your OS decide when to write to disk in order to maximise efficiency. At the moment you're only writing individual float values at a time.
In terms of IO and CPU efficiency, operating on a single line at a time is almost certainly suboptimal. Reads and writes are generally quicker for large, contiguous blocks of data, and likewise your calculation will probably be much faster if you can process many lines at once using vectorization. The general rule of thumb is to process as big a chunk of your array as will fit in memory (including any intermediate arrays that are created during your computation).
Here's an example showing how much you can speed up operations on memmapped arrays by processing them in appropriately-sized chunks.
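Something along these lines (a sketch only, reusing the shapes from your question; the block size is a guess, and for simplicity it fills whole blocks rather than just the strict upper triangle):

import numpy as np
import scipy.spatial.distance as dist

xdim, ydim = 200000, 100000
wmat = np.memmap('inputfile', dtype='d', mode='r', shape=(xdim, ydim))
dmat = np.memmap('outputfile', dtype='d', mode='readwrite', shape=(ydim, ydim))

BLOCK = 100   # vectors per block; tune so two blocks plus intermediates fit in RAM

for i0 in range(0, ydim, BLOCK):
    # Copy one block of vectors into an ordinary in-memory array (vectors as rows).
    block_i = np.ascontiguousarray(wmat[:, i0:i0 + BLOCK].T)
    for j0 in range(i0, ydim, BLOCK):
        block_j = np.ascontiguousarray(wmat[:, j0:j0 + BLOCK].T)
        # One vectorised cdist call replaces BLOCK*BLOCK scalar dist.cosine() calls.
        dmat[i0:i0 + BLOCK, j0:j0 + BLOCK] = dist.cdist(block_i, block_j, metric='cosine')
    dmat.flush()   # flush once per block row rather than once per element

Note that the column slices still touch scattered locations of a C-ordered file; the layout point below matters at least as much as the vectorisation.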
Another thing that can make a huge difference is the memory layout of your input and output arrays. By default, np.memmap gives you a C-contiguous (row-major) array. Accessing wmat by column will therefore be very inefficient, since you're addressing non-adjacent locations on disk. You would be much better off if wmat was F-contiguous (column-major) on disk, or if you were accessing it by row.
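If regenerating the input file in column-major order is an option, np.memmap accepts order='F', which makes each column a contiguous run of bytes on disk. A small sketch (the file name is hypothetical):

import numpy as np

wmat_f = np.memmap('inputfile_f', dtype='d', mode='r',
                   shape=(200000, 100000), order='F')

vec = wmat_f[:, 42]   # one contiguous ~1.6 MB read instead of 200000 scattered values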
The same general advice applies to using HDF5 instead of memmaps, although bear in mind that with HDF5 you will have to handle all the memory management yourself.
Memory maps are exactly what the name says: mappings of (virtual) disk sectors into memory pages. The memory is managed by the operating system on demand. If there is enough memory, the system keeps parts of the files in memory, maybe filling up the whole memory; if there is not enough left, the system may discard pages read from the file or swap them out to swap space. Normally you can rely on the OS being as efficient as possible.
Background
I am working on a fairly computationally intensive computational-linguistics project, but the problem I have is quite general, and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
Process the data on each line.
From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and a data-analysis pipeline. The big problem is operation 3 (and, by extension, 4).
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have its basis weights updated.
If you think of them in terms of Python lists: each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of those lists at one or more indices by some value (which may differ based on the index); see the sketch after this list.
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
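A minimal sketch of that update step, with made-up helper names (parse_line() stands in for the project's own parsing; at this dimensionality a dense array per vector is exactly what blows up memory, which is what the rest of the question is about):

import numpy as np

DIM = 4_000_000
vectors = {}                     # vector id -> array of basis weights

def update_from_line(line):
    # parse_line() is assumed to yield (vector_id, index, increment) triples.
    for vec_id, idx, inc in parse_line(line):
        if vec_id not in vectors:
            vectors[vec_id] = np.zeros(DIM, dtype=np.float32)
        vectors[vec_id][idx] += inc

with open("corpus.txt") as corpus:   # placeholder corpus path
    for line in corpus:
        update_from_line(line)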
Attempted Solutions
There are three extrema when it comes to how to do this:
I could build all the vectors in memory. Then write them to disk.
I could build all the vectors directly on disk, using shelve, pickle, or some such library.
I could build the vectors in memory one at a time and write each to disk, passing through the corpus once per vector.
All these options are fairly intractable. Option 1 just uses up all the system memory, so the machine panics and slows to a crawl. Option 2 is way too slow, as IO operations aren't fast. Option 3 is possibly even slower than 2, for the same reasons.
Goals
A good solution would involve:
Building as much as possible in memory.
Once memory is full, dump everything to disk.
If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
Go back to step 1 until all vectors are built.
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done in memory (within reason) while minimising how many times data must be read from, or written to, the disk?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?
Take a look at PyTables. One of the advantages is that you can work with very large amounts of data, stored on disk, as if it were in memory.
Edit: Because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seek times. The size of your project is perfect for today's affordable SSD drives.
A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.
Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose the first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then do another pass over all the data with the second 1M vectors, and so on (see the sketch below).
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly, RAM) is available to you for this task. If there are 100k vectors, each of them with 1M ints, this gives ~370GB. If multiple passes method is viable and you've got a machine with 16GB RAM, then it is about ~25 passes -- should be easy to parallelize if you've got a cluster.
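A rough sketch of the multi-pass idea, with placeholder names and sizes (parse_line() and save_batch() stand in for your own parsing and persistence code):

import numpy as np

DIM = 1_000_000                 # dimensionality used in the estimate above
PER_PASS = 2_000                # vectors held in RAM per pass; tune to your machine
TOTAL_VECTORS = 100_000         # placeholder total

for first in range(0, TOTAL_VECTORS, PER_PASS):
    wanted = set(range(first, min(first + PER_PASS, TOTAL_VECTORS)))
    batch = {i: np.zeros(DIM, dtype=np.int32) for i in wanted}

    with open("corpus.txt") as corpus:                   # placeholder corpus path
        for line in corpus:
            for vec_id, idx, inc in parse_line(line):    # placeholder parser
                if vec_id in batch:
                    batch[vec_id][idx] += inc

    save_batch(batch, first)    # placeholder: persist this slice, then free it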
Think about using an existing in-memory DB solution like Redis. The problem of switching to disk once RAM is gone and tricks to tweak this process should already be in place. Python client as well.
Moreover this solution could scale vertically without much effort.
You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of your code will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much bigger than the available RAM.
Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.
I'd suggest doing it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB (Redis and MongoDB are good candidates)
3) Determine how much memory this procedure consumes and parallelize accordingly (or, even better, use a map/reduce approach, or a distributed task queue like Celery)
Plus all the tips mentioned before (NumPy etc.).
Hard to say exactly, because a few details are missing, e.g. is this a dedicated box? Does the process run on several machines? Does the available memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache and see what happens. With plenty of cache performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps disk, memcached, etc), and then look there first before calculating again. The Django docs have a good introduction.
From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Find a method to have your corpus in memory. This might be a sort of RAM disk with a file system, or a database. No idea which one is best for you.
Have a smallish shell script monitor RAM usage and spawn another process of the following every second, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwidth to disk):
iterate through the corpus and build and write some vectors
in the end you can collect and combine all vectors, if needed (this would be the reduce part)
Split the corpus evenly in size between parallel jobs (one per core) - process in parallel, ignoring any incomplete line (or, if you cannot tell whether it is incomplete, ignore the first and last line that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to lose information from 2*N lines, where N is the number of parallel processes, but you gain by not adding complicated logic to try to capture these lines for processing.
Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example, say I want to have a while-loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The following pseudocode will do the trick:
import psutil

while someCondition:
    # psutil.virtual_memory() is the current API; older psutil versions
    # exposed the same figure via psutil.phymem_usage()
    if psutil.virtual_memory().percent > 80.0:
        dumpQueue(myQueue, somefile)
    else:
        addSomeStuffToQueue(myQueue, stuff)
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS. Props to Sean for suggesting this module.