I'm using multiprocessing.Queue to pass numpy arrays of float64 between python processes. This is working fine, but I'm worried it may not be as efficient as it could be.
According to the documentation of multiprocessing, objects placed on the Queue will be pickled. calling pickle on a numpy array results in a text representation of the data, so null bytes get replaced by the string "\\x00".
>>> pickle.dumps(numpy.zeros(10))
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I10\ntp6\ncnumpy\ndtype\np7\n(S'f8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb."
I'm concerned that this means my arrays are being expensively converted into something 4x the original size and then converted back in the other process.
Is there a way to pass the data through the queue in a raw unaltered form?
I know about shared memory, but if that is the correct solution, I'm not sure how to build a queue on top of it.
Thanks!
The issue isn't with numpy, but the default settings for how pickle represents data (as strings so the output is human readable). You can change the default settings for pickle to produce binary data instead.
import numpy
import cPickle as pickle
N = 1000
a0 = pickle.dumps(numpy.zeros(N))
a1 = pickle.dumps(numpy.zeros(N), protocol=-1)
print "a0", len(a0) # 32155
print "a1", len(a1) # 8133
Also, note, that if you want to decrease processor work and time, you should probably use cPickle instead of pickle (but the space savings due to using the binary protocol work regardless of pickle version).
On shared memory:
On the question of shared memory, there are a few things to consider. Shared data typically adds a significant amount of complexity to code. Basically, for every line of code that uses that data, you will need to worry about whether some other line of code in another process is simultaneously using that data. How hard this will be will depend on what you're doing. The advantages are that you save time sending the data back and forth. The question that Eelco cites is for a 60GB array, and for this there's really no choice, it has to be shared. On the other hand, for most reasonably complex code, deciding to share data simply to save a few microseconds or bytes would probably be one of the worst premature optimizations one could make.
Share Large, Read-Only Numpy Array Between Multiprocessing Processes
That should cover it all. Pickling of uncompressible binary data is a pain regardless of the protocol used, so this solution is much to be preferred.
Related
I would like to know how does Python actually manages memory allocation for ndarrays.
I loaded a file that contains 32K floating value using numpy loadtxt, so the ndarray size should be 256KB data.
Actually, ndarray.nbytes gives the right size.
However, the memory occupation after loading data is increased by 2MB: I don't unserstand why this difference.
I'm not sure exactly how you measure memory occupation, but when looking at the memory footprint of your entire app there's a lot more that could be happening that causes these kind of memory occupation increases.
In this case, I suspect that the loadtxt function uses some buffering or otherwise copies the data which wasn't cleared yet by the GC.
But other things could be happening as well. Maybe the numpy back-end loads some extra stuff the first time it initialises a ndarray. Either way, you could only truly figure this stuff out by reading the numpy source could which is available freely on github. The implementation of loadtxt can be found here: https://github.com/numpy/numpy/blob/5b22ee427e17706e3b765cf6c65e924d89f3bfce/numpy/lib/npyio.py#L797
I am trying to run a Python (2.7) script with PyPy but I have encountered the following error:
TypeError: sys.getsizeof() is not implemented on PyPy.
A memory profiler using this function is most likely to give results
inconsistent with reality on PyPy. It would be possible to have
sys.getsizeof() return a number (with enough work), but that may or
may not represent how much memory the object uses. It doesn't even
make really sense to ask how much *one* object uses, in isolation
with the rest of the system. For example, instances have maps,
which are often shared across many instances; in this case the maps
would probably be ignored by an implementation of sys.getsizeof(),
but their overhead is important in some cases if they are many
instances with unique maps. Conversely, equal strings may share
their internal string data even if they are different objects---or
empty containers may share parts of their internals as long as they
are empty. Even stranger, some lists create objects as you read
them; if you try to estimate the size in memory of range(10**6) as
the sum of all items' size, that operation will by itself create one
million integer objects that never existed in the first place.
Now, I really need to check the size of one nested dict during the execution of the program, is there any alternative to sys.getsizeof() I can use in PyPy? If not, how would I check for the size of a nested object in PyPy?
Alternatively you can gauge the memory usage of your process using
import resource
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
As your program is executing, getrusage will give total memory consumption of the process in number of bytes or kilobytes. Using this information you can estimate the size of your data structures, and if you begin to use say 50% of your machine's total memory, then you can do something to handle it.
I get a memory error when processing very large(>50Gb) file (problem: RAM memory gets full).
My solution is: I would like to read only 500 kilo bytes of data once and process( and delete it from memory and go for next 500 kb). Is there any other better solution? or If this solution seems better , how to do it with numpy array?
It is just 1/4th the code(just for an idea)
import h5py
import numpy as np
import sys
import time
import os
hdf5_file_name = r"test.h5"
dataset_name = 'IMG_Data_2'
file = h5py.File(hdf5_file_name,'r+')
dataset = file[dataset_name]
data = dataset.value
dec_array = data.flatten()
........
I get memory error at this point itsef as it trys to put in all the data to memory.
Quick answer
Numpuy.memmap allows presenting a large file on disk as a numpy array. Don't know if it allows mapping files larger than RAM+swap though. Worth a shot.
[Presentation about out-of-memory work with Python] (http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html)
Longer answer
A key question is how much RAM you have (<10GB, >10GB) and what kind of processing you're doing (need to look at each element in the dataset once or need to look at the whole dataset at once).
If it's <10GB and need to look once, then your approach seems like the most decent one. It's a standard way to deal with datasets which are larger than main memory. What I'd do is increase the size of a chunk from 500kb to something closer to the amount of memory you have - perhaps half of physical RAM, but anyway, something in the GB range, but not large enough to cause swapping to disk and interfere with your algorithm. A nice optimisation would be to hold two chunks in memory at one time. One is being processes, while the other is being loaded in parallel from disk. This works because loading stuff from disk is relatively expensive, but it doesn't require much CPU work - the CPU is basically waiting for data to load. It's harder to do in Python, because of the GIL, but numpy and friends should not be affected by that, since they release the GIL during math operations. The threading package might be useful here.
If you have low RAM AND need to look at the whole dataset at once (perhaps when computing some quadratic-time ML algorithm, or even doing random accesses in the dataset), things get more complicated, and you probably won't be able to use the previous approach. Either upgrade your algorithm to a linear one, or you'll need to implement some logic to make the algorithms in numpy etc work with data on disk directly rather than have it in RAM.
If you have >10GB of RAM, you might let the operating system do the hard work for you and increase swap size enough to capture all the dataset. This way everything is loaded into virtual memory, but only a subset is loaded into physical memory, and the operating system handles the transitions between them, so everything looks like one giant block of RAM. How to increase it is OS specific though.
The memmap object can be used anywhere an ndarray is accepted. Given a memmap fp, isinstance(fp, numpy.ndarray) returns True.
Memory-mapped files cannot be larger than 2GB on 32-bit systems.
When a memmap causes a file to be created or extended beyond its current size in the filesystem, the contents of the new part are unspecified. On systems with POSIX filesystem semantics, the extended part will be filled with zero bytes.
I am working on a python program which read a lot of images in batches (let's say 500 images) and store it in a numpy array.
Now it's single thread, and IO is very fast, the part which take a lot of time is creating numpy array and doing something on it.
By using multiprocessing module, I am able to read and create the array in other process. But I am having problem let the main thread access those data.
I have tried:
1: Using multiprocessing.queues: Very slow, I believe it's the pickle and unpickle waste a lot of time. Pickling and unpickling a large numpy array take quite some time.
2: Using Manager.list(): Faster than queues, but when try to access it in main thread, it 's still very slow. Even just iterate over the list and do nothing takes 2 seconds per item.
I don't understand why it take so much time.
Any suggestions ? Thanks.
Looks like I have to answer my own question.
The problem I was facing could be solved by using shared memory with numpy.
More details could be found at
Use numpy array in shared memory for multiprocessing
The idea is basically create the shared memory in the main process, and assign the memory to a numpy array. Later in other process, you can either read from it or write to it.
This approach works pretty well for me, it speeds up my program by a factor of 10.
Because I am processing a large amount of data and pickling is not an option for me.
The most critical code is :
shared_arr = mp.Array(ctypes.c_double, N)
arr = tonumpyarray(shared_arr)
In a recent question I asked about the fastest way to convert a large numpy array to a delimited string. My reason for asking was because I wanted to take that plain text string and transmit it (over HTTP for instance) to clients written in other programming languages. A delimited string of numbers is obviously something that any client program can work with easily. However, it was suggested that because string conversion is slow, it would be faster on the Python side to do base64 encoding on the array and send it as binary. This is indeed faster.
My question now is, (1) how can I make sure my encoded numpy array will travel well to clients on different operating systems and different hardware, and (2) how do I decode the binary data on the client side.
For (1), my inclination is to do something like the following
import numpy as np
import base64
x = np.arange(100, dtype=np.float64)
base64.b64encode(x.tostring())
Is there anything else I need to do?
For (2), I would be happy to have an example in any programming language, where the goal is to take the numpy array of floats and turn them into a similar native data structure. Assume we have already done base64 decoding and have a byte array, and that we also know the numpy dtype, dimensions, and any other metadata which will be needed.
Thanks.
You should really look into OPeNDAP to simplify all aspects of scientific data networking. For Python, check out Pydap.
You can directly store your NumPy arrays into HDF5 format via h5py (or NetCDF), then stream the data to clients over HTTP using OPeNDAP.
For something a little lighter-weight than HDF (though admittedly also more ad-hoc), you could also use JSON:
import json
import numpy as np
x = np.arange(100, dtype=np.float64)
print json.dumps(dict(data=x.tostring(),
shape=x.shape,
dtype=str(x.dtype)))
This would free your clients from needing to install HDF wrappers, at the expense of having to deal with a nonstandard protocol for data exchange (and possibly also needing to install JSON bindings !).
The tradeoff would be up to you to evaluate for your situation.
I'd recommend using an existing data format for interchange of scientific data/arrays, such as NetCDF or HDF. In Python, you can use the PyNIO library which has numpy bindings, and there are several libraries for other languages. Both formats are built for handling large data and take care of language, machine representation problems, etc. They also work well with message passing, for example in parallel computing, so I suspect your use case is satisfied.
What the tostring method of numpy arrays does is basically give you a dump of the memory used by the array's data (not the object wrapper for Python, but just the data of the array.) This is similar to the struct stdlib module. Base64-encoding that string and sending it across should be quite good enough, although you may also need to send across the actual datatype used, as well as the dimensions if it's a multidimensional array, as you won't be able to tell those just from the data.
On the other side, how to read the data depends a little on the language. Most languages have a way of addressing such a block of memory as a particular type of array. For example, in C, you could simply base64-decode the string, assign it to (in the case of your example) a float64 * and index away. This doesn't give you any of the built-in safeguards and functions and other operations that numpy arrays have in Python, but that's because C is quite a different language in that respect.