numpy.memmap: bogus memory allocation - python

I have a python3 script that operates with numpy.memmap arrays. It writes an array to a newly generated temporary file located in /tmp:
import numpy, tempfile
size = 2 ** 37 * 10
tmp = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp.name, dtype = 'i8', mode = 'w+', shape = size)
array[0] = 666
array[size-1] = 777
del array
array2 = numpy.memmap(tmp.name, dtype = 'i8', mode = 'r+', shape = size)
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.\
    format(tmp.name, len(array2), array2[0], array2[size-1]))
while True:
    pass
The size of the HDD is only 250G. Nevertheless, the script can somehow generate a 10T file in /tmp, and the corresponding array still seems to be accessible. The output of the script is the following:
File: /tmp/tmptjfwy8nr. Array size: 1374389534720. First cell value: 666. Last cell value: 777
The file really exists and is displayed as being 10T large:
$ ls -l /tmp/tmptjfwy8nr
-rw------- 1 user user 10995116277760 Dec 1 15:50 /tmp/tmptjfwy8nr
However, the whole size of /tmp is much smaller:
$ df -h /tmp
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 235G 5.3G 218G 3% /
The process also appears to be using 10T of virtual memory, which should likewise be impossible. The output of the top command:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31622 user 20 0 10.000t 16592 4600 R 100.0 0.0 0:45.63 python3
As far as I understand, this means that the memory needed for the whole array is not actually allocated during the call to numpy.memmap, and the displayed file size is therefore bogus. This in turn means that when I start to gradually fill the whole array with my data, at some point my program will crash or my data will be corrupted.
Indeed, if I introduce the following in my code:
for i in range(size):
    array[i] = i
I get the error after a while:
Bus error (core dumped)
Therefore, the question: how to check at the beginning, if there is really enough memory for the data and then indeed reserve the space for the whole array?

There's nothing 'bogus' about the fact that you are generating 10 TB files
You are asking for arrays of size
2 ** 37 * 10 = 1374389534720 elements
A dtype of 'i8' means an 8 byte (64 bit) integer, therefore your final array will have a size of
1374389534720 * 8 = 10995116277760 bytes
or
10995116277760 / 1E12 = 10.99511627776 TB
If you only have 250 GB of free disk space then how are you able to create a "10 TB" file?
Assuming that you are using a reasonably modern filesystem, your OS will be capable of generating almost arbitrarily large sparse files, regardless of whether or not you actually have enough physical disk space to back them.
For example, on my Linux machine I'm allowed to do something like this:
# I only have about 50GB of free space...
~$ df -h /
Filesystem Type Size Used Avail Use% Mounted on
/dev/sdb1 ext4 459G 383G 53G 88% /
~$ dd if=/dev/zero of=sparsefile bs=1 count=0 seek=10T
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000236933 s, 0.0 kB/s
# ...but I can still generate a sparse file that reports its size as 10 TB
~$ ls -lah sparsefile
-rw-rw-r-- 1 alistair alistair 10T Dec 1 21:17 sparsefile
# however, this file uses zero bytes of "actual" disk space
~$ du -h sparsefile
0 sparsefile
Try calling du -h on your np.memmap file after it has been initialized to see how much actual disk space it uses.
As you start actually writing data to your np.memmap file, everything will be OK until you exceed the physical capacity of your storage, at which point the process will terminate with a Bus error. This means that if you needed to write < 250GB of data to your np.memmap array then there might be no problem (in practice this would probably also depend on where you are writing within the array, and on whether it is row or column major).
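If you'd rather do the du -h check from within Python, here is a minimal sketch (assuming a Linux/POSIX filesystem, where st_blocks is reported in 512-byte units):
import os

def allocated_bytes(path):
    # blocks actually allocated on disk (512-byte units on Linux)
    return os.stat(path).st_blocks * 512

# compare the apparent size with what is really allocated, e.g.:
# print(os.stat(tmp.name).st_size, allocated_bytes(tmp.name))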
How is it possible for a process to use 10 TB of virtual memory?
When you create a memory map, the kernel allocates a new block of addresses within the virtual address space of the calling process and maps them to a file on your disk. The amount of virtual memory that your Python process is using will therefore increase by the size of the file that has just been created. Since the file can also be sparse, then not only can the virtual memory exceed the total amount of RAM available, but it can also exceed the total physical disk space on your machine.
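On Linux you can watch this happen from inside the process itself; a minimal sketch (Linux-only, reads /proc):
# Linux-only: print this process's virtual memory size as reported by the kernel
with open('/proc/self/status') as f:
    for line in f:
        if line.startswith('VmSize'):
            print(line.strip())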
How can you check whether you have enough disk space to store the full np.memmap array?
I'm assuming that you want to do this programmatically in Python.
1. Get the amount of free disk space available. There are various methods given in the answers to this previous SO question. One option is os.statvfs:
import os
def get_free_bytes(path='/'):
    st = os.statvfs(path)
    return st.f_bavail * st.f_bsize
print(get_free_bytes())
# 56224485376
2. Work out the size of your array in bytes:
import numpy as np
def check_asize_bytes(shape, dtype):
    return np.prod(shape) * np.dtype(dtype).itemsize
print(check_asize_bytes((2 ** 37 * 10,), 'i8'))
# 10995116277760
3. Check whether 2. > 1. (if the array size exceeds the free space, you don't have enough room).
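Putting steps 1 and 2 together, a minimal sketch of such a check (the function name is just for illustration):
import os
import numpy as np

def enough_space_for_memmap(shape, dtype, path='/'):
    st = os.statvfs(path)
    free_bytes = st.f_bavail * st.f_bsize
    needed_bytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    return needed_bytes <= free_bytes

print(enough_space_for_memmap((2 ** 37 * 10,), 'i8', '/tmp'))
# -> False on the 250 GB disk above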
Update: Is there a 'safe' way to allocate an np.memmap file, which guarantees that sufficient disk space is reserved to store the full array?
One possibility might be to use fallocate to pre-allocate the disk space, e.g.:
~$ fallocate -l 1G bigfile
~$ du -h bigfile
1.1G bigfile
You could call this from Python, for example using subprocess.check_call:
import subprocess
def fallocate(fname, length):
    return subprocess.check_call(['fallocate', '-l', str(length), fname])

def safe_memmap_alloc(fname, dtype, shape, *args, **kwargs):
    nbytes = np.prod(shape) * np.dtype(dtype).itemsize
    fallocate(fname, nbytes)
    return np.memmap(fname, dtype, *args, shape=shape, **kwargs)
mmap = safe_memmap_alloc('test.mmap', np.int64, (1024, 1024))
print(mmap.nbytes / 1E6)
# 8.388608
print(subprocess.check_output(['du', '-h', 'test.mmap']))
# 8.0M test.mmap
I'm not aware of a platform-independent way to do this using the standard library, but there is a fallocate Python module on PyPI that should work for any Posix-based OS.
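For what it's worth, on Python 3.3+ the standard library does expose os.posix_fallocate on POSIX systems, so a similar allocation can be done without shelling out to the fallocate binary. A minimal sketch, assuming a POSIX OS:
import os
import numpy as np

def safe_memmap_alloc_posix(fname, dtype, shape):
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    fd = os.open(fname, os.O_RDWR | os.O_CREAT)
    try:
        # reserves the disk space up front; raises OSError (ENOSPC) if it can't
        os.posix_fallocate(fd, 0, nbytes)
    finally:
        os.close(fd)
    # mode='r+' so the pre-allocated file is not truncated and re-created sparse
    return np.memmap(fname, dtype=dtype, mode='r+', shape=shape)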

Based on the answer of @ali_m I finally arrived at this solution:
# must be called with an argument giving the array size in GB
import sys, numpy, tempfile, subprocess

size = (2 ** 27) * int(sys.argv[1])
tmp_primary = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp_primary.name, dtype = 'i8', mode = 'w+', shape = size)
tmp = tempfile.NamedTemporaryFile('w+')
# copy the sparse file into a fully allocated one; this fails if there is not enough space
check = subprocess.Popen(['cp', '--sparse=never', tmp_primary.name, tmp.name],
                         stderr = subprocess.PIPE)
stdout, stderr = check.communicate()
if stderr:
    sys.stderr.write(stderr.decode('utf-8'))
    sys.exit(1)
del array
tmp_primary.close()
array = numpy.memmap(tmp.name, dtype = 'i8', mode = 'r+', shape = size)
array[0] = 666
array[size-1] = 777
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.\
    format(tmp.name, len(array), array[0], array[size-1]))
while True:
    pass
The idea is to copy the initially generated sparse file to a new, normal one. For this, cp with the option --sparse=never is employed.
When the script is called with a manageable size parameter (say, 1 GB), the array gets mapped to a non-sparse file. This is confirmed by the output of the du -h command, which now shows a ~1 GB size. If there is not enough space, the script exits with the error:
cp: ‘/tmp/tmps_thxud2’: write failed: No space left on device

Related

Problems when I write np array to binary file, new file is only half of the original one

I am trying to remove the top 24 rows of a raw file, so I opened the original raw file (let's call it raw1.raw), converted it to a numpy array, then initialized a new array and removed the top 24 rows. But after writing the new array to a new binary file (raw2.raw), I found raw2 is only 15.2 MB while the original file raw1.raw is about 30.6 MB. My code:
import numpy as np
import imageio
import rawpy
import cv2
def ave():
    fd = open('raw1.raw', 'rb')
    rows = 3000  # around 3000, not the real rows
    cols = 5100  # around 5100, not the real cols
    f = np.fromfile(fd, dtype=np.uint8, count=rows*cols)
    I_array = f.reshape((rows, cols))  # notice row, column format
    #print(I_array)
    fd.close()
    im = np.zeros((rows - 24, cols))
    for i in range(len(I_array) - 24):
        for j in range(len(I_array[i])):
            im[i][j] = I_array[i + 24][j]
    #print(im)
    newFile = open("raw2.raw", "wb")
    im.astype('uint8').tofile(newFile)
    newFile.close()

if __name__ == "__main__":
    ave()
I tried to use im.astype('uint16') when writing to the binary file, but the values would be wrong if I use uint16.
There must clearly be more data in your 'raw1.raw' file that you are not using. Are you sure that file wasn't created from 'uint16' data, with you just pulling out the first half as 'uint8' data? I just checked by writing random data:
import os, numpy as np
x = np.random.randint(0,256,size=(3000,5100),dtype='uint8')
x.tofile(open('testfile.raw','wb'))
print(os.stat('testfile.raw').st_size) #I get 15.3MB.
So, 'uint8' for a 3000 by 5100 array clearly takes up 15.3 MB. I don't know how you got 30+.
############################ EDIT #########
Just to add more clarification: do you realize that dtype does nothing more than change the "view" of your data? It doesn't affect the actual data that is saved in memory. This also goes for data that you read from a file. Take for example:
import numpy as np
#The way to understand x, is that x is taking 12 bytes in memory and using
#that information to hold 3 values. The first 4 bytes are the first value,
#the second 4 bytes are the second, etc.
x = np.array([1,2,3],dtype='uint32')
#Change x to display those 12 bytes at 6 different values. Doing this does
#NOT change the data that the array is holding. You are only changing the
#'view' of the data.
x.dtype = 'uint16'
print(x)
In general (there are a few special cases), changing the dtype doesn't change the underlying data. However, the conversion function .astype() does change the underlying data. If you have an array of 12 bytes viewed as 'int32', then running .astype('uint8') will take each entry (4 bytes) and convert it (known as casting) to a uint8 entry (1 byte). The new array will only have 3 bytes for the 3 entries. You can see this literally:
x = np.array([1,2,3],dtype='uint32')
print(x.tobytes())
y = x.astype('uint8')
print(y.tobytes())
So, when we say that a file is 30 MB, we mean that the file holds (minus some header information) 30,000,000 bytes, each of which is exactly one uint8. 1 uint8 is 1 byte. If an array has 6000 by 5100 uint8s (bytes), then the array has 30,600,000 bytes of information in memory.
Likewise, if you read a file (IT DOES NOT MATTER WHICH FILE) and call np.fromfile(fd, dtype=np.uint8, count=15_300_000), then you told Python to read EXACTLY 15_300_000 bytes (again, 1 byte is 1 uint8) of information (15 MB). Whether your file is 100 MB, 40 MB, or even 30 MB is completely irrelevant, because you told Python to read only the first 15 MB of data.
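If you want to sanity-check which element size your file was written with, here is a quick sketch (hypothetical helper; it assumes the file holds only pixel data, with no header):
import os

def bytes_per_pixel(path, rows=3000, cols=5100):
    # 1.0 suggests uint8 data, 2.0 suggests uint16 data
    return os.path.getsize(path) / (rows * cols)

print(bytes_per_pixel('raw1.raw'))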

Struct unpack MemoryError

I'm trying to read an image binary file into RAM with struct unpack. The binary file is 120 MB and every pixel is represented by 16 bits.
For precision purposes later in the computation, I need to cast the 16-bit data into a float64 numpy array...
According to my computation, I need 524 MB of RAM to read all the data. My PC has 8 GB of RAM with 4 GB free, so I think that's not the problem.
I read here (Memory error) in hgrecco's comment that maybe there is a struct unpack limit.
So here is an extra question: what's that limit? It's not specified in the official documentation...
Here is the code:
PS: here nrows and ncols, giving the total image size, are provided as default parameters for simplicity:
def read_BIL_img(filename, nrows = 8196, ncols = 8000):
    # Open and read entire BIL data into str type named "data"
    fi = open(filename, "rb")
    data = fi.read()
    fi.close()
    # Unpack all binary data into a flat tuple, according to a defined format.
    # It's read as unsigned short integers as in https://docs.python.org/2.7/library/struct.html#format-characters.
    format = "=%dH" % (int(nrows*ncols),)
    img_tuple = struct.unpack(format, data)
    # Convert flat tuple img into a numpy array of nrows*ncols.
    img_array = np.asarray(img_tuple).reshape((nrows, ncols))
    return img_array.astype(float)
I get the following error:
img_tuple = struct.unpack(format, data)
MemoryError
PS 2: I'm using the Python 2.7 interpreter and numpy version 1.9.2 on a Windows 10 machine.
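For what it's worth, one way to sidestep the huge intermediate tuple that struct.unpack builds is to let numpy read the unsigned shorts directly; a minimal sketch under the same assumptions (no header, native byte order):
import numpy as np

def read_BIL_img_np(filename, nrows=8196, ncols=8000):
    # read the 16-bit pixels straight into an array, then upcast once
    img = np.fromfile(filename, dtype=np.uint16, count=nrows * ncols)
    return img.reshape((nrows, ncols)).astype(np.float64)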

sharing gmpy2 multi-precision integer between processes without copying

Is it possible to share gmpy2 multiprecision integers (https://pypi.python.org/pypi/gmpy2) between processes (created by multiprocessing) without creating copies in memory?
Each integer has about 750,000 bits. The integers are not modified by the processes.
Thank you.
Update: Tested code is below.
I would try the following untested approach:
Create a memory mapped file using Python's mmap library.
Use gmpy2.to_binary() to convert a gmpy2.mpz instance into binary string.
Write both the length of the binary string and binary string itself into the memory mapped file. To allow for random access, you should begin every write at a multiple of a fixed value, say 94000 in your case.
Populate the memory mapped file with all your values.
Then in each process, use gmpy2.from_binary() to read the data from the memory mapped file.
You need to read both the length of the binary string and the binary string itself. You should be able to pass a slice from the memory mapped file directly to gmpy2.from_binary().
It may be simpler to create a list of (start, end) values for the position of each byte string in the memory mapped file and then pass that list to each process.
Update: Here is some sample code that has been tested on Linux with Python 3.4.
import mmap
import struct
import multiprocessing as mp
import gmpy2
# Number of mpz integers to place in the memory buffer.
z_count = 40000
# Maximum number of bits in each integer.
z_bits = 750000
# Total number of bytes used to store each integer.
# Size is rounded up to a multiple of 4.
z_size = 4 + (((z_bits + 31) // 32) * 4)
def f(instance):
    global mm
    s = 0
    for i in range(z_count):
        mm.seek(i * z_size)
        t = struct.unpack('i', mm.read(4))[0]
        z = gmpy2.from_binary(mm.read(t))
        s += z
    print(instance, z % 123456789)

def main():
    global mm
    mm = mmap.mmap(-1, z_count * z_size)
    rs = gmpy2.random_state(42)
    for i in range(z_count):
        z = gmpy2.mpz_urandomb(rs, z_bits)
        b = gmpy2.to_binary(z)
        mm.seek(i * z_size)
        mm.write(struct.pack('i', len(b)))
        mm.write(b)
    ctx = mp.get_context('fork')
    pool = ctx.Pool(4)
    pool.map_async(f, range(4))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

Is there a faster way (than this) to calculate the hash of a file (using hashlib) in Python?

My current approach is this:
def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1024*func.block_size), b''):
            func.update(block)
    return func.hexdigest()
It takes about 3.5 seconds to calculate the md5sum of an 842MB iso file on an i5 @ 1.7 GHz. I have tried different methods of reading the file, but all of them yield slower results. Is there, perhaps, a faster solution?
EDIT: I replaced 2**16 (inside the f.read()) with 1024*func.block_size, since the default block_size for most hashing functions supported by hashlib is 64 (except for 'sha384' and 'sha512' - for them, the default block_size is 128). Therefore, the block size is still the same (65536 bytes).
EDIT(2): I did something wrong. It takes 8.4 seconds instead of 3.5. :(
EDIT(3): Apparently Windows was using the disk at +80% when I ran the function again. It really takes 3.5 seconds. Phew.
Another solution (slightly faster, by about 0.5 s) is to use os.open():
def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    f = os.open(path, (os.O_RDWR | os.O_BINARY))
    for block in iter(lambda: os.read(f, 2048*func.block_size), b''):
        func.update(block)
    os.close(f)
    return func.hexdigest()
Note that these results are not final.
Using an 874 MiB random data file, which required 2 seconds with the openssl md5 tool, I was able to improve speed as follows.
Using your first method required 21 seconds.
Reading the entire file (21 seconds) to buffer and then updating required 2 seconds.
Using the following function with a buffer size of 8096 required 17 seconds.
Using the following function with a buffer size of 32767 required 11 seconds.
Using the following function with a buffer size of 65536 required 8 seconds.
Using the following function with a buffer size of 131072 required 8 seconds.
Using the following function with a buffer size of 1048576 required 12 seconds.
import hashlib
import time

def md5_speedcheck(path, size):
    pts = time.process_time()
    ats = time.time()
    m = hashlib.md5()
    with open(path, 'rb') as f:
        b = f.read(size)
        while len(b) > 0:
            m.update(b)
            b = f.read(size)
    print("{0:.3f} s".format(time.process_time() - pts))
    print("{0:.3f} s".format(time.time() - ats))
The times I noted above are human (wall-clock) time, whereas the processor time for all of these is about the same, with the difference being spent blocked on IO.
The key determinant here is to have a buffer size that is big enough to mitigate disk latency, but small enough to avoid VM page swaps. For my particular machine it appears that 64 KiB is about optimal.
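For reference, a minimal driver for the timing function above (the file path is a placeholder):
for buf_size in [8096, 32767, 65536, 131072, 1048576]:
    print("buffer size:", buf_size)
    md5_speedcheck('/path/to/big.file', buf_size)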

Creating random binary files

I'm trying to use python to create a random binary file. This is what I've got already:
f = open(filename,'wb')
for i in xrange(size_kb):
    for ii in xrange(1024/4):
        f.write(struct.pack("=I",random.randint(0,sys.maxint*2+1)))
f.close()
But it's terribly slow (0.82 seconds for size_kb=1024 on my 3.9GHz SSD disk machine). A big bottleneck seems to be the random int generation (replacing the randint() with a 0 reduces running time from 0.82s to 0.14s).
Now I know there are more efficient ways of creating random data files (namely dd if=/dev/urandom) but I'm trying to figure this out for sake of curiosity... is there an obvious way to improve this?
IMHO - the following is completely redundant:
f.write(struct.pack("=I",random.randint(0,sys.maxint*2+1)))
There's absolutely no need to use struct.pack, just do something like:
import os
fileSizeInBytes = 1024
with open('output_filename', 'wb') as fout:
    fout.write(os.urandom(fileSizeInBytes)) # replace 1024 with a size in kilobytes if it is not unreasonably large
Then, if you need to read the file back as integers, you can use struct.unpack.
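For example, a minimal sketch of reading those bytes back as 4-byte unsigned ints (assuming the file length is a multiple of 4):
import struct

with open('output_filename', 'rb') as fin:
    data = fin.read()
# one unsigned int per 4 bytes
ints = struct.unpack('=%dI' % (len(data) // 4), data)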
(my use case is generating a file for a unit test so I just need a file that isn't identical with other generated files).
Another option is to just write a UUID4 to the file, but since I don't know the exact use case, I'm not sure that's viable.
The Python code you should write completely depends on the way you intend to use the random binary file. If you just need "rather good" randomness for multiple purposes, then the code from Jon Clements is probably the best.
However, on Linux OS at least, os.urandom relies on /dev/urandom, which is described in the Linux Kernel (drivers/char/random.c) as follows:
The /dev/urandom device [...] will return as many bytes as are
requested. As more and more random bytes are requested without giving
time for the entropy pool to recharge, this will result in random
numbers that are merely cryptographically strong. For many
applications, however, this is acceptable.
So the question is, is this acceptable for your application? If you prefer a more secure RNG, you could read bytes from /dev/random instead. The main inconvenience of this device is that it can block indefinitely if the Linux kernel is not able to gather enough entropy. There are also other cryptographically secure RNGs, such as EGD.
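Reading from /dev/random in Python is just a file read; a minimal sketch (Linux-only, and it may block until enough entropy is available):
# may block for a long time if the entropy pool is low
with open('/dev/random', 'rb') as f:
    stronger_random_bytes = f.read(1024)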
Alternatively, if your main concern is execution speed and you just need some "light" randomness for a Monte-Carlo method (i.e. unpredictability doesn't matter, but a uniform distribution does), you could consider generating your random binary file once and reusing it many times, at least for development.
Here's a complete script, based on the accepted answer, that creates random files.
import sys, os
def help(error: str = None) -> None:
    if error and error != "help":
        print("***", error, "\n\n", file=sys.stderr, sep=' ', end='')
        sys.exit(1)
    print("""\tCreates binary files with random content""", end='\n')
    print("""Usage:""",)
    print(os.path.split(__file__)[1], """ "name1" "1TB" "name2" "5kb"
    Accepted units: MB, GB, KB, TB, B""")
    sys.exit(2)
# https://stackoverflow.com/a/51253225/1077444
def convert_size_to_bytes(size_str):
    """Convert human filesizes to bytes.

    ex: 1 tb, 1 kb, 1 mb, 1 pb, 1 eb, 1 zb, 3 yb

    To reverse this, see hurry.filesize or the Django filesizeformat template
    filter.

    :param size_str: A human-readable string representing a file size, e.g.,
        "22 megabytes".
    :return: The number of bytes represented by the string.
    """
    multipliers = {
        'kilobyte': 1024,
        'megabyte': 1024 ** 2,
        'gigabyte': 1024 ** 3,
        'terabyte': 1024 ** 4,
        'petabyte': 1024 ** 5,
        'exabyte': 1024 ** 6,
        'zetabyte': 1024 ** 7,
        'yottabyte': 1024 ** 8,
        'kb': 1024,
        'mb': 1024 ** 2,
        'gb': 1024 ** 3,
        'tb': 1024 ** 4,
        'pb': 1024 ** 5,
        'eb': 1024 ** 6,
        'zb': 1024 ** 7,
        'yb': 1024 ** 8,
    }
    for suffix in multipliers:
        size_str = size_str.lower().strip().strip('s')
        if size_str.lower().endswith(suffix):
            return int(float(size_str[0:-len(suffix)]) * multipliers[suffix])
    else:
        if size_str.endswith('b'):
            size_str = size_str[0:-1]
        elif size_str.endswith('byte'):
            size_str = size_str[0:-4]
    return int(size_str)
if __name__ == "__main__":
    input = {}  # { file: byte_size }
    if (len(sys.argv)-1) % 2 != 0:
        print("-- Provide even number of arguments --")
        print(f'--\tGot: {len(sys.argv)-1}: "' + r'" "'.join(sys.argv[1:]) + '"')
        sys.exit(2)
    elif len(sys.argv) == 1:
        help()

    try:
        for file, size_str in zip(sys.argv[1::2], sys.argv[2::2]):
            input[file] = convert_size_to_bytes(size_str)
    except ValueError as ex:
        print(f'Invalid size: "{size_str}"', file=sys.stderr)
        sys.exit(1)

    for file, size_bytes in input.items():
        print(f"Writing: {file}")
        # https://stackoverflow.com/a/14276423/1077444
        with open(file, 'wb') as fout:
            while size_bytes > 0:
                wrote = min(size_bytes, 1024)  # chunk
                fout.write(os.urandom(wrote))
                size_bytes -= wrote
