I would like to know how Python actually manages memory allocation for ndarrays.
I loaded a file containing 32K floating-point values using numpy's loadtxt, so the ndarray should hold 256 KB of data.
Indeed, ndarray.nbytes reports exactly that size.
However, the memory occupation of the process increases by 2 MB after loading the data, and I don't understand this difference.
I'm not sure exactly how you measure memory occupation, but when looking at the memory footprint of your entire app there's a lot more going on that can cause this kind of increase.
In this case, I suspect that the loadtxt function uses some buffering or otherwise copies the data, and that the copy hasn't yet been reclaimed by the GC.
But other things could be happening as well. Maybe the numpy back-end loads some extra stuff the first time it initialises an ndarray. Either way, you can only truly figure this out by reading the numpy source code, which is freely available on GitHub. The implementation of loadtxt can be found here: https://github.com/numpy/numpy/blob/5b22ee427e17706e3b765cf6c65e924d89f3bfce/numpy/lib/npyio.py#L797
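If you want to separate what the array itself occupies from what the whole process occupies, a rough sketch is to compare ndarray.nbytes against the change in the process's resident set size. This assumes the third-party psutil package and a hypothetical data.txt containing the 32K values:
import os
import numpy as np
import psutil  # third-party, assumed to be installed

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss          # whole-process footprint before loading

a = np.loadtxt("data.txt")                   # hypothetical file with the 32K values

rss_after = proc.memory_info().rss
print(a.nbytes)                              # bytes owned by the array itself (~256 KB)
print(rss_after - rss_before)                # change in process footprint (often larger)
The gap between the two numbers is where buffering, temporary copies, and allocator overhead show up.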
Related
I get a memory error when processing a very large (>50 GB) file: RAM gets completely full.
My solution is to read only 500 kilobytes of data at a time, process it, delete it from memory, and move on to the next 500 KB. Is there a better solution? And if this approach seems reasonable, how do I do it with a numpy array?
This is just a quarter of the code (just to give an idea):
import h5py
import numpy as np
import sys
import time
import os
hdf5_file_name = r"test.h5"
dataset_name = 'IMG_Data_2'
file = h5py.File(hdf5_file_name,'r+')
dataset = file[dataset_name]
data = dataset.value   # this reads the entire dataset into memory at once
dec_array = data.flatten()
........
I get a memory error at this point itself, as it tries to load all the data into memory.
Quick answer
numpy.memmap allows presenting a large file on disk as a numpy array. I don't know if it allows mapping files larger than RAM+swap, though. Worth a shot.
[Presentation about out-of-memory work with Python](http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html)
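A minimal sketch of the memmap idea, assuming the data is available as a raw binary file of float64 values (big_data.bin is a placeholder name); only the slice being processed is actually paged into RAM:
import numpy as np

mm = np.memmap("big_data.bin", dtype="float64", mode="r")  # shape inferred from file size

chunk = 64 * 1024 * 1024 // 8        # roughly 64 MB worth of float64 per chunk
total = 0.0
for start in range(0, mm.shape[0], chunk):
    block = mm[start:start + chunk]  # a view; pages are read in as they are touched
    total += float(block.sum())      # stand-in for the real processing
print(total)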
Longer answer
A key question is how much RAM you have (<10GB, >10GB) and what kind of processing you're doing (need to look at each element in the dataset once or need to look at the whole dataset at once).
If it's <10GB and you only need to look at the data once, then your approach seems like the most decent one. It's a standard way to deal with datasets that are larger than main memory. What I'd do is increase the size of a chunk from 500 KB to something closer to the amount of memory you have: perhaps half of physical RAM, but in any case something in the GB range, yet not large enough to cause swapping to disk and interfere with your algorithm. A nice optimisation would be to hold two chunks in memory at a time: one is being processed while the other is being loaded from disk in parallel. This works because loading data from disk is relatively expensive but doesn't require much CPU work; the CPU is basically waiting for the data to arrive. It's harder to do in Python because of the GIL, but numpy and friends should not be affected by that, since they release the GIL during math operations. The threading package might be useful here.
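As a rough sketch of the chunked approach applied to the HDF5 code from the question (the file and dataset names come from there; chunk_rows is a guess that should be tuned to the available RAM):
import h5py
import numpy as np

chunk_rows = 4096   # tune so that one slab fits comfortably in RAM

with h5py.File("test.h5", "r") as f:
    dset = f["IMG_Data_2"]
    result = 0.0
    for start in range(0, dset.shape[0], chunk_rows):
        block = dset[start:start + chunk_rows]   # only this slab is read from disk
        result += float(np.asarray(block, dtype=np.float64).sum())  # stand-in processing

print(result)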
If you have low RAM AND need to look at the whole dataset at once (perhaps when computing some quadratic-time ML algorithm, or doing random accesses into the dataset), things get more complicated, and you probably won't be able to use the previous approach. Either upgrade your algorithm to a linear one, or implement some logic to make the algorithms in numpy etc. work with data on disk directly rather than having it all in RAM.
If you have >10GB of RAM, you might let the operating system do the hard work for you and increase swap size enough to capture all the dataset. This way everything is loaded into virtual memory, but only a subset is loaded into physical memory, and the operating system handles the transitions between them, so everything looks like one giant block of RAM. How to increase it is OS specific though.
The memmap object can be used anywhere an ndarray is accepted. Given a memmap fp, isinstance(fp, numpy.ndarray) returns True.
Memory-mapped files cannot be larger than 2GB on 32-bit systems.
When a memmap causes a file to be created or extended beyond its current size in the filesystem, the contents of the new part are unspecified. On systems with POSIX filesystem semantics, the extended part will be filled with zero bytes.
I'm currently embedding Python in my C++ program using Boost.Python in order to use matplotlib. Now I'm stuck at a point where I have to construct a large data structure, let's say a dense 10000x10000 matrix of doubles. I want to plot columns of that matrix, and I figured that I have multiple options to do so:
Iterating and copying every value into a numpy array --> I don't want to do that, for the obvious reason of doubled memory consumption
Iterating and exporting every value into a file, then importing it in Python --> I could do that completely without Boost.Python, and I don't think this is a nice way
Allocate and store the matrix in Python and just update the values from C++ --> But as stated here it's not a good idea to switch back and forth between the Python interpreter and my C++ program
Somehow expose the matrix to python without having to copy it --> All I can find on that matter is about extending Python with C++ classes and not embedding
Which of these is the best option concerning performance and, of course, memory consumption, or is there an even better way of handling this kind of task?
To prevent copying in Boost.Python, one can either:
Use policies to return internal references
Allocate on the free store and use policies to have Python manage the object
Allocate the Python object then extract a reference to the array within C++
Use a smart pointer to share ownership between C++ and Python
If the matrix has a C-style contiguous memory layout, then consider using the Numpy C-API. The PyArray_SimpleNewFromData() function can be used to create an ndarray object that wraps memory allocated elsewhere. This allows one to expose the data to Python without copying or transferring each element between the languages. The how-to-extend documentation is a great resource for dealing with the Numpy C-API:
Sometimes, you want to wrap memory allocated elsewhere into an ndarray object for downstream use. This routine makes it straightforward to do that. [...] A new reference to an ndarray is returned, but the ndarray will not own its data. When this ndarray is deallocated, the pointer will not be freed.
[...]
If you want the memory to be freed as soon as the ndarray is deallocated then simply set the OWNDATA flag on the returned ndarray.
Also, while the plotting function may create copies of the array, it can do so within the C-API, allowing it to take advantage of the memory layout.
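I can't show the Boost.Python embedding itself here, but as a rough Python-level analogue of the same wrap-without-copy idea, numpy can wrap a ctypes buffer (standing in for the C++-allocated matrix) without touching the elements. This illustrates the concept only; it is not the PyArray_SimpleNewFromData() call from the C-API:
import ctypes
import numpy as np

# Stand-in for a buffer that was allocated outside numpy (e.g. on the C++ side).
buf = (ctypes.c_double * 6)(*range(6))

# Wrap the existing memory in an ndarray; no element is copied.
arr = np.ctypeslib.as_array(buf).reshape(2, 3)

arr[0, 0] = 42.0
print(buf[0])   # 42.0 -- the ndarray and the buffer share the same memory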
If performance is a concern, it may be worth considering the plotting itself:
taking a sample of the data and plotting it may be sufficient depending on the data distribution
using a raster-based backend, such as Agg, will often outperform vector-based backends on large datasets (a short sketch combining this with sampling follows this list)
benchmarking other tools that are designed for large data, such as Vispy
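A small sketch of the first two suggestions combined: plotting a random sample of a large column with the Agg backend. The array here is synthetic stand-in data and the sample size is arbitrary:
import matplotlib
matplotlib.use("Agg")            # raster backend, no GUI needed
import matplotlib.pyplot as plt
import numpy as np

column = np.random.randn(10000000)                                    # stand-in for one matrix column
idx = np.sort(np.random.choice(column.size, 100000, replace=False))   # ~1% sample

fig, ax = plt.subplots()
ax.plot(idx, column[idx], linewidth=0.5)
fig.savefig("column.png", dpi=150)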
Although Tanner's answer brought me a big step forward, I ended up using Boost.NumPy, an unofficial extension to Boost.Python that can easily be added. It wraps the NumPy C API and makes it safer and easier to use.
I'm using multiprocessing.Queue to pass numpy arrays of float64 between python processes. This is working fine, but I'm worried it may not be as efficient as it could be.
According to the documentation of multiprocessing, objects placed on the Queue will be pickled. Calling pickle on a numpy array results in a text representation of the data, so null bytes get replaced by the string "\\x00".
>>> pickle.dumps(numpy.zeros(10))
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I10\ntp6\ncnumpy\ndtype\np7\n(S'f8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb."
I'm concerned that this means my arrays are being expensively converted into something 4x the original size and then converted back in the other process.
Is there a way to pass the data through the queue in a raw unaltered form?
I know about shared memory, but if that is the correct solution, I'm not sure how to build a queue on top of it.
Thanks!
The issue isn't with numpy, but with the default settings for how pickle represents data (as strings, so the output is human-readable). You can change the default settings for pickle to produce binary data instead:
import numpy
import cPickle as pickle

N = 1000
a0 = pickle.dumps(numpy.zeros(N))               # default protocol 0: ASCII text
a1 = pickle.dumps(numpy.zeros(N), protocol=-1)  # -1 selects the highest (binary) protocol

print "a0", len(a0)  # 32155
print "a1", len(a1)  # 8133
Also, note that if you want to decrease processor work and time, you should probably use cPickle instead of pickle (but the space savings from the binary protocol apply regardless of which pickle module you use).
On shared memory:
On the question of shared memory, there are a few things to consider. Shared data typically adds a significant amount of complexity to code. Basically, for every line of code that uses that data, you will need to worry about whether some other line of code in another process is simultaneously using that data. How hard this will be will depend on what you're doing. The advantages are that you save time sending the data back and forth. The question that Eelco cites is for a 60GB array, and for this there's really no choice, it has to be shared. On the other hand, for most reasonably complex code, deciding to share data simply to save a few microseconds or bytes would probably be one of the worst premature optimizations one could make.
Share Large, Read-Only Numpy Array Between Multiprocessing Processes
That should cover it all. Pickling of uncompressible binary data is a pain regardless of the protocol used, so this solution is much to be preferred.
I am working on a Python program which reads a lot of images in batches (let's say 500 images) and stores them in a numpy array.
Right now it is single-threaded: the IO is very fast, and the part that takes a lot of time is creating the numpy array and doing something with it.
By using the multiprocessing module, I am able to read and create the array in another process. But I am having trouble letting the main thread access that data.
I have tried:
1: Using multiprocessing.Queue: very slow. I believe the pickling and unpickling waste a lot of time; pickling and unpickling a large numpy array takes quite a while.
2: Using Manager.list(): faster than queues, but when I try to access it in the main thread it's still very slow. Even just iterating over the list and doing nothing takes 2 seconds per item.
I don't understand why it takes so much time.
Any suggestions? Thanks.
Looks like I have to answer my own question.
The problem I was facing could be solved by using shared memory with numpy.
More details could be found at
Use numpy array in shared memory for multiprocessing
The idea is basically to create the shared memory in the main process and assign that memory to a numpy array. Later, in the other process, you can either read from it or write to it.
This approach works pretty well for me; it speeds up my program by a factor of 10. I went this way because I am processing a large amount of data and pickling is not an option for me.
The most critical code is :
shared_arr = mp.Array(ctypes.c_double, N)
arr = tonumpyarray(shared_arr)
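tonumpyarray is a small helper from the linked answer, not part of numpy itself; under that assumption, it is typically something along these lines:
import ctypes
import multiprocessing as mp
import numpy as np

def tonumpyarray(mp_arr):
    # Wrap the shared ctypes buffer in an ndarray without copying it.
    return np.frombuffer(mp_arr.get_obj())

N = 1000
shared_arr = mp.Array(ctypes.c_double, N)   # lives in shared memory, comes with its own lock
arr = tonumpyarray(shared_arr)
arr[:] = 1.0                                # writes go straight into the shared buffer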
I am trying to debug a memory problem with my large Python application. Most of the memory is in numpy arrays managed by Python classes, so Heapy etc. are useless, since they do not account for the memory in the numpy arrays. So I tried to manually track the memory usage using the MacOSX (10.7.5) Activity Monitor (or top if you will). I noticed the following weird behavior. On a normal python interpreter shell (2.7.3):
import numpy as np # 1.7.1
# Activity Monitor: 12.8 MB
a = np.zeros((1000, 1000, 17)) # a "large" array
# 142.5 MB
del a
# 12.8 MB (so far so good, the array got freed)
a = np.zeros((1000, 1000, 16)) # a "small" array
# 134.9 MB
del a
# 134.9 MB (the system didn't get back the memory)
import gc
gc.collect()
# 134.9 MB
No matter what I do, the memory footprint of the Python session will never go below 134.9 MB again. So my question is:
Why are the resources of arrays larger than 1000x1000x17x8 bytes (found empirically on my system) properly given back to the system, while the memory of smaller arrays appears to be stuck with the Python interpreter forever?
This does appear to ratchet up, since in my real-world applications, I end up with over 2 GB of memory I can never get back from the Python interpreter. Is this intended behavior that Python reserves more and more memory depending on usage history? If yes, then Activity Monitor is just as useless as Heapy for my case. Is there anything out there that is not useless?
Reading Numpy's policy for releasing memory, it seems that numpy does not have any special handling of memory allocation/deallocation: it simply calls free() when the reference count goes to zero. In fact, it's pretty easy to replicate the issue with any built-in Python object. The problem lies at the OS level.
Nathaniel Smith has written an explanation of what is happening in one of his replies in the linked thread:
In general, processes can request memory from the OS, but they cannot give it back. At the C level, if you call free(), then what actually happens is that the memory management library in your process makes a note for itself that that memory is not used, and may return it from a future malloc(), but from the OS's point of view it is still "allocated". (And python uses another similar system on top for malloc()/free(), but this doesn't really change anything.) So the OS memory usage you see is generally a "high water mark", the maximum amount of memory that your process ever needed.

The exception is that for large single allocations (e.g. if you create a multi-megabyte array), a different mechanism is used. Such large memory allocations can be released back to the OS. So it might specifically be the non-numpy parts of your program that are producing the issues you see.
So it seems there is no general solution to the problem. Allocating many small objects will lead to "high memory usage" as reported by profiling tools, even though that memory will be reused when needed, while allocating big objects won't show high memory usage after deallocation, because the memory is reclaimed by the OS.
You can verify this by allocating built-in Python objects:
In [1]: a = [[0] * 100 for _ in range(1000000)]
In [2]: del a
After this code I can see that memory is not reclaimed, while doing:
In [1]: a = [[0] * 10000 for _ in range(10000)]
In [2]: del a
the memory is reclaimed.
To avoid memory problems you should either allocate big arrays and work with them (maybe using views to "simulate" small arrays?), or avoid having many small arrays alive at the same time. If you have a loop that creates small objects, you might explicitly deallocate the ones that are no longer needed at every iteration instead of doing this only at the end.
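As a small sketch of the "one big array plus views" idea (the sizes here are arbitrary): allocate a single large block once, then hand out slices of it instead of creating many small arrays.
import numpy as np

big = np.empty((1000, 1000))   # one large allocation; can go back to the OS when freed
for i in range(1000):
    row = big[i]               # a view into big: no new allocation
    row[:] = i                 # fill the view in place (stand-in for real work)

del big                        # the whole block can now be released at once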
I believe Python Memory Management gives good insight into how memory is managed in Python. Note that, on top of the "OS problem", Python adds another layer to manage memory arenas, which can contribute to high memory usage with small objects.