I'm running a process with mpirun and 2 cores, and it gets killed at the point where I'm mixing values between the two processes. Both processes use about 15% of the machine's memory, and even though memory usage will increase when mixing, there should still be plenty of memory left. So I'm assuming that there is a limit on the amount of memory that can be used for passing messages between the processes. How do I find out what this limit is, and how do I remove it?
The error message that I'm getting when mpirun dies is this:
File "Comm.pyx", line 864, in mpi4py.MPI.Comm.bcast (src/mpi4py.MPI.c:67787)
File "pickled.pxi", line 564, in mpi4py.MPI.PyMPI_bcast (src/mpi4py.MPI.c:31462)
File "pickled.pxi", line 93, in mpi4py.MPI._p_Pickle.alloc (src/mpi4py.MPI.c:26327)
SystemError: Negative size passed to PyBytes_FromStringAndSize
And this is the bit of the code that leads to the error:
sum_updates_j_k = numpy.zeros((self.col.J_total, self.K), dtype=numpy.float64)
comm.Reduce(self.updates_j_k, sum_updates_j_k, op=MPI.SUM)
sum_updates_j_k = comm.bcast(sum_updates_j_k, root=0)
The code usually works; it only runs into problems with larger amounts of data, which increase the size of the matrix that I'm exchanging between the processes.
The culprit is probably the following lines found in the code of PyMPI_bcast():
cdef int count = 0
...
if dosend: smsg = pickle.dump(obj, &buf, &count) # <----- (1)
with nogil: CHKERR( MPI_Bcast(&count, 1, MPI_INT, # <----- (2)
root, comm) )
cdef object rmsg = None
if dorecv and dosend: rmsg = smsg
elif dorecv: rmsg = pickle.alloc(&buf, count)
...
What happens here is that the object is first serialised at (1) using pickle.dump() and then the length of the pickled stream is broadcasted at (2).
There are two problems here and they both have to do with the fact that int is used for the length. The first problem is an integer cast inside pickle.dump and the other problem is that MPI_INT is used to transmit the length of the pickled stream. This limits the amount of data in your matrix to a certain size - namely the size that would result in a pickled object no bigger than 2 GiB (2^31 - 1 bytes). Any bigger object would result in an integer overflow and thus negative values in count.
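As a rough sanity check, you can estimate whether a given matrix will push the pickled payload past that ceiling without allocating anything; the shape below is hypothetical:

import numpy

# The pickled stream of a float64 array is roughly its raw bytes plus a small
# header, so the raw size is a good estimate of the payload size.
shape = (60000, 5000)                                   # hypothetical problem size
nbytes = shape[0] * shape[1] * numpy.dtype(numpy.float64).itemsize
print(nbytes, 'bytes; over the 2 GiB limit:', nbytes > 2**31 - 1)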
This is clearly not an MPI issue but rather a bug in (or a feature of?) mpi4py.
I had the same problem with mpi4py recently. As pointed out by Hristo Iliev in his answer, it's a pickle problem.
This can be avoided by using the upper-case methods comm.Reduce(), comm.Bcast(), etc., which do not resort to pickle, as opposed to lower-case methods like comm.reduce(). As a bonus, upper case methods should be a bit faster as well.
Actually, you're already using comm.Reduce(), so I expect that switching to comm.Bcast() should solve your problem - it did for me.
NB: The syntax of upper-case methods is slightly different, but this tutorial can help you get started.
For example, instead of:
sum_updates_j_k = comm.bcast(sum_updates_j_k, root=0)
you would use:
comm.Bcast(sum_updates_j_k, root=0)
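For completeness, here is a minimal self-contained sketch of the buffer-based pattern; the shapes stand in for self.col.J_total and self.K, and every rank has to allocate the receive buffer itself:

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD

J_total, K = 1000, 50                                   # placeholders for self.col.J_total, self.K
updates_j_k = numpy.random.rand(J_total, K)

# Upper-case collectives operate on numpy buffers directly; no pickling is
# involved, so the 2 GiB limit on the pickled stream does not apply.
sum_updates_j_k = numpy.zeros((J_total, K), dtype=numpy.float64)
comm.Reduce(updates_j_k, sum_updates_j_k, op=MPI.SUM, root=0)
comm.Bcast(sum_updates_j_k, root=0)

# comm.Allreduce(updates_j_k, sum_updates_j_k, op=MPI.SUM) would combine both steps.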
For such a case it is useful to have a function that can send numpy arrays in parts, e.g.:
from mpi4py import MPI
import sys, math, numpy
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
def bcast_array_obj(obj = None, dtype = numpy.float64, root = 0):
"""Function for broadcasting of a numpy array object"""
reporter = 0 if root > 0 else 1
if rank == root:
for exp in range(11):
parts = pow(2, exp)
err = False
part_len = math.ceil(len(obj) / parts)
for part in range(parts):
part_begin = part * part_len
part_end = min((part + 1) * part_len, len(obj))
try:
comm.bcast(obj[part_begin: part_end], root = root)
except:
err = True
err *= comm.recv(source = reporter, tag = 2)
if err:
break
if err:
continue
comm.bcast(None, root = root)
print('The array was successfully sent in {} part{}'.\
format(parts, 's' if parts > 1 else ''))
return
sys.stderr.write('Failed to send the array even in 1024 parts')
sys.stderr.flush()
else:
obj = numpy.zeros(0, dtype = dtype)
while True:
err = False
try:
part_obj = comm.bcast(root = root)
except:
err = True
obj = numpy.zeros(0, dtype = dtype)
if rank == reporter:
comm.send(err, dest = root, tag = 2)
if err:
continue
if type(part_obj) != type(None):
frags = len(obj)
obj.resize(frags + len(part_obj))
obj[frags: ] = part_obj
else:
break
return obj
This function automatically determines the optimal number of parts into which to break the input array.
For example,
if rank != 0:
z = bcast_array_obj(root = 0)
else:
z = numpy.zeros(1000000000, dtype = numpy.float64)
bcast_array_obj(z, root = 0)
outputs
The array was successfully sent in 4 parts
Apparently this is an issue in MPI itself and not in MPI4py. The actual variable which holds the size of the data being communicated is a signed 32-bit integer, which will overflow to a negative value for around 2 GB of data.
Maximum amount of data that can be sent using MPI::Send
It's been raised as an issue with MPI4py previously as well here.
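To see the wrap-around concretely, a count of 2^31 bytes simply does not fit in a signed 32-bit integer:

import ctypes

# 2**31 is one past the maximum signed 32-bit value, so it wraps to a negative number.
print(ctypes.c_int(2**31 - 1).value)   # 2147483647
print(ctypes.c_int(2**31).value)       # -2147483648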
I'm reading the book Hands-On Machine Learning for Algorithmic Trading and I came across a script that is supposed to parse a large .bin binary file and convert it to .h5. The file consists of something called ITCH data; you can find the technical documentation of the data here. The script is very inefficient: it reads a 12 GB (12,952,050,754 bytes) file 2 bytes at a time, which is ultra slow (it might take up to 4 hours on a decent 4-CPU GCP instance), and that's not very surprising. You can find the whole notebook here.
My problem is that I don't understand how this .bin file is being read. I don't see why it is necessary to read the file 2 bytes at a time; I think there is a way to read it with a larger buffer size, but I'm not sure how to do that. I could even convert the script to C++ if it is still slow after optimizing it, which I can do once I understand the inner workings of this I/O process. Does anyone have suggestions?
Here's a link to the source of the ITCH data; you can find small files (300 MB or less) covering shorter time periods if you need to experiment with the code.
The bottleneck:
with file_name.open('rb') as data:
while True:
# determine message size in bytes
message_size = int.from_bytes(data.read(2), byteorder='big', signed=False)
# get message type by reading first byte
message_type = data.read(1).decode('ascii')
message_type_counter.update([message_type])
# read & store message
record = data.read(message_size - 1)
message = message_fields[message_type]._make(unpack(fstring[message_type], record))
messages[message_type].append(message)
# deal with system events
if message_type == 'S':
seconds = int.from_bytes(message.timestamp, byteorder='big') * 1e-9
print('\n', event_codes.get(message.event_code.decode('ascii'), 'Error'))
print(f'\t{format_time(seconds)}\t{message_count:12,.0f}')
if message.event_code.decode('ascii') == 'C':
store_messages(messages)
break
message_count += 1
if message_count % 2.5e7 == 0:
seconds = int.from_bytes(message.timestamp, byteorder='big') * 1e-9
d = format_time(time() - start)
print(f'\t{format_time(seconds)}\t{message_count:12,.0f}\t{d}')
res = store_messages(messages)
if res == 1:
print(pd.Series(dict(message_type_counter)).sort_values())
break
messages.clear()
And here's the store_messages() function:
def store_messages(m):
"""Handle occasional storing of all messages"""
with pd.HDFStore(itch_store) as store:
for mtype, data in m.items():
# convert to DataFrame
data = pd.DataFrame(data)
# parse timestamp info
data.timestamp = data.timestamp.apply(int.from_bytes, byteorder='big')
data.timestamp = pd.to_timedelta(data.timestamp)
# apply alpha formatting
if mtype in alpha_formats.keys():
data = format_alpha(mtype, data)
s = alpha_length.get(mtype)
if s:
s = {c: s.get(c) for c in data.columns}
dc = ['stock_locate']
            if mtype == 'R':
dc.append('stock')
try:
store.append(mtype,
data,
format='t',
min_itemsize=s,
data_columns=dc)
except Exception as e:
print(e)
print(mtype)
print(data.info())
print(pd.Series(list(m.keys())).value_counts())
data.to_csv('data.csv', index=False)
return 1
return 0
According to the code, the file format looks like this: 2 bytes of message size, one byte of message type, and then n bytes of actual message (where n is defined by the previously read message size).
Low-hanging fruit for optimizing this is to read the 3-byte header in one call, convert bytes [0:2] to the message size and byte [2] to the message type, and then read the message; see the sketch below.
To further reduce the number of required reads, you could read a fixed amount of data from the file into a buffer and start extracting messages from it. While extracting, keep an index of the already-processed bytes, and once that index (or the index plus the amount of data to be read) goes past the end of the buffer, refill the buffer. This could lead to huge memory requirements if not done properly, though.
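Here is a minimal sketch of the first idea, assuming the layout described above (2 bytes of big-endian size, 1 byte of type, then size - 1 bytes of payload); the file name is a placeholder:

from collections import Counter

message_type_counter = Counter()

with open('itch.bin', 'rb') as data:                    # hypothetical file name
    while True:
        header = data.read(3)                           # size (2 bytes) + type (1 byte) in one call
        if len(header) < 3:
            break                                       # end of file
        message_size = int.from_bytes(header[0:2], byteorder='big', signed=False)
        message_type = chr(header[2])
        message_type_counter[message_type] += 1
        record = data.read(message_size - 1)            # rest of the message
        # ... unpack `record` with struct.unpack(fstring[message_type], record) as before ...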
I am accessing a function through ctypes which returns a pointer to a NULL-terminated string (array/vector of chars). The memory is allocated by the function (not under my control). The problem is, it does not return any information on its length. What I came up with (and what works), loosely inspired by what I would do in C, looks a bit wacky:
import ctypes
def get_length_of_null_terminated_string(in_pointer):
datatype = ctypes.c_char
datatype_size = ctypes.sizeof(datatype)
terminator = b'\x00'
length = 0
char_pointer = ctypes.cast(in_pointer, ctypes.POINTER(datatype))
STRING_MAX = 1024
while True:
if char_pointer.contents.value == terminator:
break
        if length > STRING_MAX:
            raise RuntimeError('no NUL terminator found within STRING_MAX bytes')
void_pointer = ctypes.cast(char_pointer, ctypes.c_void_p)
void_pointer.value += datatype_size
char_pointer = ctypes.cast(void_pointer, ctypes.POINTER(datatype))
length += 1
return length
def test():
test_string = b'Hello World!'
print('Actual length: %d' % len(test_string))
test_buffer = ctypes.create_string_buffer(test_string)
test_pointer = ctypes.cast(test_buffer, ctypes.c_void_p)
print('Measured length: %d' % get_length_of_null_terminated_string(test_pointer))
if __name__ == '__main__':
test()
Is there a better way of doing it?
In particular, I cannot find a way to get rid of the two cast statements. It appears that I can only increment the address of a c_void_p object (through its value attribute), while the same seems to be impossible for a pointer to c_char.
Thanks @abarnert for suggesting this in the comments.
Working with ctypes.POINTER(ctypes.c_char) is sort of the wrong approach here. Casting my pointer to ctypes.c_char_p instead helps. Python's built-in len() can simply be applied to the value attribute of a c_char_p instance.
Along the lines of my original example, a solution would look like this:
def get_length_of_null_terminated_string_better(in_pointer):
return len(ctypes.cast(in_pointer, ctypes.c_char_p).value)
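For reference, a quick check against the test setup from the question; ctypes.string_at should give the same answer, since it also reads bytes up to the first NUL:

import ctypes

def get_length_of_null_terminated_string_better(in_pointer):
    return len(ctypes.cast(in_pointer, ctypes.c_char_p).value)

test_buffer = ctypes.create_string_buffer(b'Hello World!')
test_pointer = ctypes.cast(test_buffer, ctypes.c_void_p)

print(get_length_of_null_terminated_string_better(test_pointer))   # 12
print(len(ctypes.string_at(test_pointer)))                          # 12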
Is it possible to share gmpy2 multiprecision integers (https://pypi.python.org/pypi/gmpy2) between processes (created by multiprocessing) without creating copies in memory?
Each integer has about 750,000 bits. The integers are not modified by the processes.
Thank you.
Update: Tested code is below.
I would try the following untested approach:
Create a memory mapped file using Python's mmap library.
Use gmpy2.to_binary() to convert a gmpy2.mpz instance into binary string.
Write both the length of the binary string and binary string itself into the memory mapped file. To allow for random access, you should begin every write at a multiple of a fixed value, say 94000 in your case.
Populate the memory mapped file with all your values.
Then in each process, use gmpy2.from_binary() to read the data from the memory mapped file.
You need to read both the length of the binary string and the binary string itself. You should be able to pass a slice of the memory mapped file directly to gmpy2.from_binary().
It may be simpler to create a list of (start, end) values for the position of each byte string in the memory mapped file and then pass that list to each process.
Update: Here is some sample code that has been tested on Linux with Python 3.4.
import mmap
import struct
import multiprocessing as mp
import gmpy2
# Number of mpz integers to place in the memory buffer.
z_count = 40000
# Maximum number of bits in each integer.
z_bits = 750000
# Total number of bytes used to store each integer.
# Size is rounded up to a multiple of 4.
z_size = 4 + (((z_bits + 31) // 32) * 4)
def f(instance):
global mm
s = 0
for i in range(z_count):
mm.seek(i * z_size)
t = struct.unpack('i', mm.read(4))[0]
z = gmpy2.from_binary(mm.read(t))
s += z
print(instance, z % 123456789)
def main():
global mm
mm = mmap.mmap(-1, z_count * z_size)
rs = gmpy2.random_state(42)
for i in range(z_count):
z = gmpy2.mpz_urandomb(rs, z_bits)
b = gmpy2.to_binary(z)
mm.seek(i * z_size)
mm.write(struct.pack('i', len(b)))
mm.write(b)
ctx = mp.get_context('fork')
pool = ctx.Pool(4)
pool.map_async(f, range(4))
pool.close()
pool.join()
if __name__ == '__main__':
main()
I am using mpi4py to spread a processing task over a cluster of cores.
My code looks like this:
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
'''Perform processing operations with each processor returning
two arrays of equal size, array1 and array2'''
all_data1 = comm.gather(array1, root = 0)
all_data2 = comm.gather(array2, root = 0)
This is returning the following error:
SystemError: Negative size passed to PyString_FromStringAndSize
I believe this error means that the array of data stored in all_data1 exceeds the maximum size of an array in Python, which is quite possible.
I tried doing it in smaller pieces, as follows:
comm.isend(array1, dest = 0, tag = rank+1)
comm.isend(array2, dest = 0, tag = rank+2)
if rank == 0:
for proc in xrange(size):
partial_array1 = comm.irecv(source = proc, tag = proc+1)
partial_array2 = comm.irecv(source = proc, tag = proc+2)
but this is returning the following error.
[node10:20210] *** Process received signal ***
[node10:20210] Signal: Segmentation fault (11)
[node10:20210] Signal code: Address not mapped (1)
[node10:20210] Failing at address: 0x2319982b
followed by a whole load of unintelligible path-like information and a final message:
mpirun noticed that process rank 0 with PID 0 on node node10 exited on signal 11 (Segmentation fault).
This seems to happen regardless of how many processors I am using.
For similar questions in C, the solution seems to be subtly changing the way the arguments in the recv call are passed. With Python the syntax is different, so I would be grateful if someone could clarify why this error is appearing and how to fix it.
I managed to resolve the problem I was having by doing the following.
if rank != 0:
comm.Isend([array1, MPI.FLOAT], dest = 0, tag = 77)
# Non-blocking send; allows code to continue before data is received.
if rank == 0:
final_array1 = array1
for proc in xrange(1,size):
partial_array1 = np.empty(len(array1), dtype = float)
comm.Recv([partial_array1, MPI.FLOAT], source = proc, tag = 77)
# A blocking receive is necessary here to avoid a Segfault.
final_array1 += partial_array1
if rank != 0:
comm.Isend([array2, MPI.FLOAT], dest = 0, tag = 135)
if rank == 0:
final_array2 = array2
for proc in xrange(1,size):
partial_array2 = np.empty(len(array2), dtype = float)
comm.Recv([partial_array2, MPI.FLOAT], source = proc, tag = 135)
final_array2 += partial_array2
comm.barrier() # This barrier call resolves the Segfault.
if rank == 0:
return final_array1, final_array2
else:
return None
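Since what the receive loop above computes is just an element-wise sum of the per-rank arrays at rank 0, the buffer-based comm.Reduce collective does the same thing in one call. A minimal sketch, assuming array1 and array2 are float64 numpy arrays of the same length on every rank:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

array1 = np.random.rand(1000)        # placeholders for the per-rank results
array2 = np.random.rand(1000)

# Element-wise sum across all ranks, delivered to rank 0; no pickling and no
# hand-rolled send/receive loop or barrier is needed.
final_array1 = np.empty_like(array1) if rank == 0 else None
final_array2 = np.empty_like(array2) if rank == 0 else None
comm.Reduce([array1, MPI.DOUBLE], final_array1, op=MPI.SUM, root=0)
comm.Reduce([array2, MPI.DOUBLE], final_array2, op=MPI.SUM, root=0)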
I'm trying to use python to create a random binary file. This is what I've got already:
f = open(filename,'wb')
for i in xrange(size_kb):
for ii in xrange(1024/4):
f.write(struct.pack("=I",random.randint(0,sys.maxint*2+1)))
f.close()
But it's terribly slow (0.82 seconds for size_kb=1024 on my 3.9GHz SSD disk machine). A big bottleneck seems to be the random int generation (replacing the randint() with a 0 reduces running time from 0.82s to 0.14s).
Now I know there are more efficient ways of creating random data files (namely dd if=/dev/urandom), but I'm trying to figure this out for the sake of curiosity... is there an obvious way to improve this?
IMHO - the following is completely redundant:
f.write(struct.pack("=I",random.randint(0,sys.maxint*2+1)))
There's absolutely no need to use struct.pack, just do something like:
import os
fileSizeInBytes = 1024
with open('output_filename', 'wb') as fout:
    fout.write(os.urandom(fileSizeInBytes))  # one call is fine as long as the size is not unreasonably large
Then, if you need to read the file back as integers, use struct.unpack (see the sketch below).
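A minimal sketch of reading such a file back as unsigned 32-bit integers, assuming the same '=I' layout used in the question:

import struct

with open('output_filename', 'rb') as fin:
    data = fin.read()

# Truncate to a whole number of 4-byte integers, then unpack them all at once.
count = len(data) // 4
ints = struct.unpack('={}I'.format(count), data[:count * 4])
print(len(ints), ints[:5])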
(my use case is generating a file for a unit test so I just need a
file that isn't identical with other generated files).
Another option is to just write a UUID4 to the file, but since I don't know the exact use case, I'm not sure that's viable.
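A one-line sketch of that idea; uuid4().bytes is 16 random bytes, which is enough to make small test files differ (the file name is a placeholder):

import uuid

with open('unique_test_file.bin', 'wb') as fout:   # hypothetical file name
    fout.write(uuid.uuid4().bytes)                 # 16 random bytes from a version-4 UUID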
The python code you should write completely depends on the way you intend to use the random binary file. If you just need a "rather good" randomness for multiple purposes, then the code of Jon Clements is probably the best.
However, on Linux OS at least, os.urandom relies on /dev/urandom, which is described in the Linux Kernel (drivers/char/random.c) as follows:
The /dev/urandom device [...] will return as many bytes as are
requested. As more and more random bytes are requested without giving
time for the entropy pool to recharge, this will result in random
numbers that are merely cryptographically strong. For many
applications, however, this is acceptable.
So the question is, is this acceptable for your application? If you prefer a more secure RNG, you could read bytes from /dev/random instead. The main inconvenience of this device is that it can block indefinitely if the Linux kernel is not able to gather enough entropy. There are also other cryptographically secure RNGs like EGD.
Alternatively, if your main concern is execution speed and you just need some "light" randomness for a Monte-Carlo method (i.e. unpredictability doesn't matter, uniform distribution does), you could consider generating your random binary file once and reusing it many times, at least during development.
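If you do want /dev/random, here is a minimal sketch (Linux only; the file name and target size are placeholders, and the read may block while the kernel gathers entropy):

target_size = 1024 * 1024                          # hypothetical target size in bytes
chunk_size = 4096

with open('/dev/random', 'rb') as rng, open('secure_random.bin', 'wb') as fout:
    remaining = target_size
    while remaining > 0:
        data = rng.read(min(chunk_size, remaining))  # may return fewer bytes than requested
        fout.write(data)
        remaining -= len(data)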
Here's a complete script, based on the accepted answer, that creates random files.
import sys, os
def help(error: str = None) -> None:
if error and error != "help":
print("***",error,"\n\n",file=sys.stderr,sep=' ',end='');
sys.exit(1)
print("""\tCreates binary files with random content""", end='\n')
print("""Usage:""",)
print(os.path.split(__file__)[1], """ "name1" "1TB" "name2" "5kb"
Accepted units: MB, GB, KB, TB, B""")
sys.exit(2)
# https://stackoverflow.com/a/51253225/1077444
def convert_size_to_bytes(size_str):
"""Convert human filesizes to bytes.
ex: 1 tb, 1 kb, 1 mb, 1 pb, 1 eb, 1 zb, 3 yb
To reverse this, see hurry.filesize or the Django filesizeformat template
filter.
:param size_str: A human-readable string representing a file size, e.g.,
"22 megabytes".
:return: The number of bytes represented by the string.
"""
multipliers = {
'kilobyte': 1024,
'megabyte': 1024 ** 2,
'gigabyte': 1024 ** 3,
'terabyte': 1024 ** 4,
'petabyte': 1024 ** 5,
'exabyte': 1024 ** 6,
'zetabyte': 1024 ** 7,
'yottabyte': 1024 ** 8,
'kb': 1024,
'mb': 1024**2,
'gb': 1024**3,
'tb': 1024**4,
'pb': 1024**5,
'eb': 1024**6,
'zb': 1024**7,
'yb': 1024**8,
}
for suffix in multipliers:
size_str = size_str.lower().strip().strip('s')
if size_str.lower().endswith(suffix):
return int(float(size_str[0:-len(suffix)]) * multipliers[suffix])
else:
if size_str.endswith('b'):
size_str = size_str[0:-1]
elif size_str.endswith('byte'):
size_str = size_str[0:-4]
return int(size_str)
if __name__ == "__main__":
input = {} #{ file: byte_size }
if (len(sys.argv)-1) % 2 != 0:
print("-- Provide even number of arguments --")
print(f'--\tGot: {len(sys.argv)-1}: "' + r'" "'.join(sys.argv[1:]) +'"')
sys.exit(2)
elif len(sys.argv) == 1:
help()
try:
for file, size_str in zip(sys.argv[1::2], sys.argv[2::2]):
input[file] = convert_size_to_bytes(size_str)
except ValueError as ex:
print(f'Invalid size: "{size_str}"', file=sys.stderr)
sys.exit(1)
for file, size_bytes in input.items():
print(f"Writing: {file}")
#https://stackoverflow.com/a/14276423/1077444
with open(file, 'wb') as fout:
while( size_bytes > 0 ):
wrote = min(size_bytes, 1024) #chunk
fout.write(os.urandom(wrote))
size_bytes -= wrote