Synchronization when reading a file using multiprocessing in Python - python

I have a python function that reads random snippets from a large file and does some processing on it. I want the processing to happen in multiple processes and so make use of multiprocessing. I open the file (in binary mode) in the parent process and pass the file descriptor to each child process then use a multiprocessing.Lock() to synchronize access to the file. With a single worker things work as expected, but with more workers, even with the lock, the file reads will randomly return bad data (usually a bit from one part of the file and a bit from the another part of the file). In addition, the position within the file (as returned by file.tell()) will often get messed up. This all suggests a basic race condition accessing the descriptor, but my understanding is the multiprocessing.Lock() should prevent concurrent access to it. Does file.seek() and/or file.read() have some kind of asynchronous operations that don't get contained within the lock/unlock barriers? What is going here?
An easy workaround is to have each process open the file individually and get its own file descriptor (I've confirmed this does work), but I'd like to understand what I'm missing. Opening the file in text mode also prevents the issue from occurring, but doesn't work for my use case and doesn't explain what is happening in the binary case.
I've run the following reproducer on a number of Linux systems and OS X and on various local and remote file systems. I always get quite a few bad file positions and at least a couple of checksum errors. I know the read isn't guaranteed to read the full amount of data requested, but I've confirmed that is not what is happening here and omitted that code in an effort to keep things concise.
import argparse
import multiprocessing
import random
import string
def worker(worker, args):
rng = random.Random(1234 + worker)
for i in range(args.count):
block = rng.randrange(args.blockcount)
start = block * args.blocksize
with args.lock:
args.fd.seek(start)
data = args.fd.read(args.blocksize)
pos = args.fd.tell()
if pos != start + args.blocksize:
print(i, "bad file position", start, start + args.blocksize, pos)
cksm = sum(data)
if cksm != args.cksms[block]:
print(i, "bad checksum", cksm, args.cksms[block])
args = argparse.Namespace()
args.file = '/tmp/text'
args.count = 1000
args.blocksize = 1000
args.blockcount = args.count
args.filesize = args.blocksize * args.blockcount
args.num_workers = 4
args.cksms = multiprocessing.Array('i', [0] * args.blockcount)
with open(args.file, 'w') as f:
for i in range(args.blockcount):
data = ''.join(random.choice(string.ascii_lowercase) for x in range(args.blocksize))
args.cksms[i] = sum(data.encode())
f.write(data)
args.fd = open(args.file, 'rb')
args.lock = multiprocessing.Lock()
procs = []
for i in range(args.num_workers):
p = multiprocessing.Process(target=worker, args=(i, args))
procs.append(p)
p.start()
Example output:
$ python test.py
158 bad file position 969000 970000 741000
223 bad file position 908000 909000 13000
232 bad file position 679000 680000 960000
263 bad file position 959000 960000 205000
390 bad file position 771000 772000 36000
410 bad file position 148000 149000 42000
441 bad file position 677000 678000 21000
459 bad file position 143000 144000 636000
505 bad file position 579000 580000 731000
505 bad checksum 109372 109889
532 bad file position 962000 963000 243000
494 bad file position 418000 419000 2000
569 bad file position 266000 267000 991000
752 bad file position 732000 733000 264000
840 bad file position 801000 802000 933000
799 bad file position 332000 333000 989000
866 bad file position 150000 151000 248000
866 bad checksum 109116 109375
887 bad file position 39000 40000 974000
937 bad file position 18000 19000 938000
969 bad file position 20000 21000 24000
953 bad file position 542000 543000 767000
977 bad file position 694000 695000 782000

This seems to be caused by buffering: using open(args.file, 'rb', buffering=0) I can't reproduce anymore.
https://docs.python.org/3/library/functions.html#open
buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off [...] When no buffering argument is given, the default buffering policy works as follows: [...] Binary files are buffered in fixed-size chunks; the size of the buffer [...] will typically be 4096 or 8192 bytes long. [...]

I've checked, only using multiprocessing.Lock (without buffering = 0), still met the bad data. with both multiprocessing.Lock and buffering=0, all things goes well

Related

How to fix this IO bound python operation on 12GB .bin file?

I'm reading this book Hands-On Machine Learning for Algorithmic Trading and I came across a script that is supposed to parse a large .bin binary file and convert it to .h5. This file consists of something called ITCH data, you can find the technical documentation of the data here. The script is very inefficient, it reads a 12GB(12952050754 bytes) file 2 bytes at a time which is ultra slow(might take up to 4 hours on some decent 4cpu GCP instance) which is not very surprising. You can find the whole notebook here.
My problem is I don't understand how this .bin file is being read I mean I don't see where is the necessity of reading the file 2 bytes at a time, I think there is a way to read at a large buffer size but I'm not sure how to do it, or even convert the script to c++ if after optimizing this script, it is still being slow which I can do if I understand the inner workings of this I/O process, does anyone have suggestions?
here's a link to the file source of ITCH data, you can find small files(300 mb or less) which are for less time periods if you need to experiment with the code.
The bottleneck:
with file_name.open('rb') as data:
while True:
# determine message size in bytes
message_size = int.from_bytes(data.read(2), byteorder='big', signed=False)
# get message type by reading first byte
message_type = data.read(1).decode('ascii')
message_type_counter.update([message_type])
# read & store message
record = data.read(message_size - 1)
message = message_fields[message_type]._make(unpack(fstring[message_type], record))
messages[message_type].append(message)
# deal with system events
if message_type == 'S':
seconds = int.from_bytes(message.timestamp, byteorder='big') * 1e-9
print('\n', event_codes.get(message.event_code.decode('ascii'), 'Error'))
print(f'\t{format_time(seconds)}\t{message_count:12,.0f}')
if message.event_code.decode('ascii') == 'C':
store_messages(messages)
break
message_count += 1
if message_count % 2.5e7 == 0:
seconds = int.from_bytes(message.timestamp, byteorder='big') * 1e-9
d = format_time(time() - start)
print(f'\t{format_time(seconds)}\t{message_count:12,.0f}\t{d}')
res = store_messages(messages)
if res == 1:
print(pd.Series(dict(message_type_counter)).sort_values())
break
messages.clear()
And here's the store_messages() function:
def store_messages(m):
"""Handle occasional storing of all messages"""
with pd.HDFStore(itch_store) as store:
for mtype, data in m.items():
# convert to DataFrame
data = pd.DataFrame(data)
# parse timestamp info
data.timestamp = data.timestamp.apply(int.from_bytes, byteorder='big')
data.timestamp = pd.to_timedelta(data.timestamp)
# apply alpha formatting
if mtype in alpha_formats.keys():
data = format_alpha(mtype, data)
s = alpha_length.get(mtype)
if s:
s = {c: s.get(c) for c in data.columns}
dc = ['stock_locate']
if m == 'R':
dc.append('stock')
try:
store.append(mtype,
data,
format='t',
min_itemsize=s,
data_columns=dc)
except Exception as e:
print(e)
print(mtype)
print(data.info())
print(pd.Series(list(m.keys())).value_counts())
data.to_csv('data.csv', index=False)
return 1
return 0
According to the code, file format looks like its 2 bytes of message size, one byte of message type and then n bytes of actual message (defined by the previously read message size).
Low hanging fruit to optimize this is to read 3 bytes first into list, convert [0:1] to message size int and [2] to message type and then read the message ..
To further eliminate amount of required reads, you could read a fixed amount of data from the file into a list of and start extracting from it. While extracting, keep a index of already processed bytes stored and once that index or index+amount of data to be read goes over the size of the list, you prepopulate the list. This could lead to huge memory requirements if not done properly thought..

"No space left on device" error while fitting Sklearn model

I'm fitting a LDA model with lots of data using scikit-learn. Relevant code piece looks like this:
lda = LatentDirichletAllocation(n_topics = n_topics,
max_iter = iters,
learning_method = 'online',
learning_offset = offset,
random_state = 0,
evaluate_every = 5,
n_jobs = 3,
verbose = 0)
lda.fit(X)
(I guess the only possibly relevant detail here is that I'm using multiple jobs.)
After some time I'm getting "No space left on device" error, even though there is plenty of space on the disk and plenty of free memory. I tried the same code several times, on two different computers (on my local machine and on a remote server), first using python3, then using python2, and each time I ended up with the same error.
If I run the same code on a smaller sample of data everything works fine.
The entire stack trace:
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 271, in save
obj, filename = self._write_array(obj, filename)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 231, in _write_array
self.np.save(filename, array)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 491, in save
pickle_kwargs=pickle_kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/lib/format.py", line 584, in write_array
array.tofile(fp)
IOError: 275500 requested and 210934 written
IOErrorTraceback (most recent call last)
<ipython-input-7-6af7e7c9845f> in <module>()
7 n_jobs = 3,
8 verbose = 0)
----> 9 lda.fit(X)
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in fit(self, X, y)
509 for idx_slice in gen_batches(n_samples, batch_size):
510 self._em_step(X[idx_slice, :], total_samples=n_samples,
--> 511 batch_update=False, parallel=parallel)
512 else:
513 # batch update
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in _em_step(self, X, total_samples, batch_update, parallel)
403 # E-step
404 _, suff_stats = self._e_step(X, cal_sstats=True, random_init=True,
--> 405 parallel=parallel)
406
407 # M-step
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in _e_step(self, X, cal_sstats, random_init, parallel)
356 self.mean_change_tol, cal_sstats,
357 random_state)
--> 358 for idx_slice in gen_even_slices(X.shape[0], n_jobs))
359
360 # merge result
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
808 # consumption.
809 self._iterating = False
--> 810 self.retrieve()
811 # Make sure that we get a last message telling us we are done
812 elapsed_time = time.time() - self._start_time
/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
725 job = self._jobs.pop(0)
726 try:
--> 727 self._output.extend(job.get())
728 except tuple(self.exceptions) as exception:
729 # Stop dispatching any new job in the async callback thread
/home/ubuntu/anaconda2/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
565 return self._value
566 else:
--> 567 raise self._value
568
569 def _set(self, i, obj):
IOError: [Errno 28] No space left on device
Had the same problem with LatentDirichletAllocation. It seems, that your are running out of shared memory (/dev/shm when you run df -h). Try setting JOBLIB_TEMP_FOLDER environment variable to something different: e.g., to /tmp. In my case it has solved the problem.
Or just increase the size of the shared memory, if you have the appropriate rights for the machine you are training the LDA on.
This problem occurs when shared memory is consumed and no I/O operation is permissible. This is a frustrating problem that occurs to most of the Kaggle users while fitting machine learning models.
I overcame this problem by setting JOBLIB_TEMP_FOLDER variable using following code.
%env JOBLIB_TEMP_FOLDER=/tmp
The solution of #silterser solved the problem for me.
If you want to set the environment variable in the code do this:
import os
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'
This is because you have set n_jobs=3. You could set it to 1, then shared memory will not be used, even though learning will take longer time. You can chose to select a joblib cache dir as per above answer, but bear in mind that this cache can quickly fill up your disk as well, depending on the dataset? and disk transactions can slow your job down.
I know it's kind of late, but I got over this problem by setting learning_method = 'batch'.
This could present other issues, such as extending training times, but it alleviated the problem of not having enough space on the shared memory.
Or maybe a smaller batch_size can be set. Although I have not tested this myself.

How convert entire hex file to bin?

I'm hoping to make use of intelhex, but am not fully understanding what it needs to be passed to simply convert an entire file.
In test-usage.py:
import os
import sys
from intelhex import hex2bin
with open("foo.hex", "r") as fin:
start = 0
end = 218
size = None
pad = None
print("start: {}\nend: {}\nsize: {}\npad: {}".format(start, end, size, pad))
hex2bin(fin, sys.stdout, start, end, size, pad)
Error:
start: 0
end: 218
size: None
pad: None
Traceback (most recent call last):
File "./test-usage.py", line 19, in <module>
hex2bin(fin, sys.stdout, start, end, size, pad)
File "/home/me/myvenv/lib/python3.4/site-packages/intelhex/__init__.py", line 1001, in hex2bin
h.tobinfile(fout, start, end)
File "/home/me/myvenv/lib/python3.4/site-packages/intelhex/__init__.py", line 412, in tobinfile
fobj.write(self._tobinstr_really(start, end, pad, size))
TypeError: must be str, not bytes
foo.hex borrowed from here:
:10001300AC12AD13AE10AF1112002F8E0E8F0F2244
:10000300E50B250DF509E50A350CF5081200132259
:03000000020023D8
:0C002300787FE4F6D8FD7581130200031D
:10002F00EFF88DF0A4FFEDC5F0CEA42EFEEC88F016
:04003F00A42EFE22CB
:00000001FF
http://python-intelhex.readthedocs.org/en/latest/part3-1.html
Usage of hex2bin in the example script source.
Support doesn't seem quite yet there for python3. Creating a new virtual environment with python2 (a little outside the scope of this question), I'm able to execute without error.
ENV_NAME=testenv
virtualenv -p /usr/bin/python2 ${ENV_NAME}
source ${ENV_NAME}/bin/activate
pip install IntelHex==2.0
./test-usage.py
start: 0
end: 218
size: None
pad: None
#�
� %
5
"����/"x����u����������Τ.�����.�"��������������������������������������������������������������������������������������������������������������������������������������������������������
I should also note that to simply convert the entire file, you should leave all non-file parameters as default. For example:
hex2bin(fin, sys.stdout)

numpy.memmap: bogus memory allocation

I have a python3 script that operates with numpy.memmap arrays. It writes an array to newly generated temporary file that is located in /tmp:
import numpy, tempfile
size = 2 ** 37 * 10
tmp = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp.name, dtype = 'i8', mode = 'w+', shape = size)
array[0] = 666
array[size-1] = 777
del array
array2 = numpy.memmap(tmp.name, dtype = 'i8', mode = 'r+', shape = size)
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.\
format(tmp.name, len(array2), array2[0], array2[size-1]))
while True:
pass
The size of the HDD is only 250G. Nevertheless, it can somehow generate 10T large files in /tmp, and the corresponding array still seems to be accessible. The output of the script is following:
File: /tmp/tmptjfwy8nr. Array size: 1374389534720. First cell value: 666. Last cell value: 777
The file really exists and is displayed as being 10T large:
$ ls -l /tmp/tmptjfwy8nr
-rw------- 1 user user 10995116277760 Dec 1 15:50 /tmp/tmptjfwy8nr
However, the whole size of /tmp is much smaller:
$ df -h /tmp
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 235G 5.3G 218G 3% /
The process also is pretending to use 10T virtual memory, which is also not possible. The output of top command:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31622 user 20 0 10.000t 16592 4600 R 100.0 0.0 0:45.63 python3
As far as I understand, this means that during the call of numpy.memmap the needed memory for the whole array is not allocated and therefore displayed file size is bogus. This in turn means that when I start to gradually fill the whole array with my data, at some point my program will crash or my data will be corrupted.
Indeed, if I introduce the following in my code:
for i in range(size):
array[i] = i
I get the error after a while:
Bus error (core dumped)
Therefore, the question: how to check at the beginning, if there is really enough memory for the data and then indeed reserve the space for the whole array?
There's nothing 'bogus' about the fact that you are generating 10 TB files
You are asking for arrays of size
2 ** 37 * 10 = 1374389534720 elements
A dtype of 'i8' means an 8 byte (64 bit) integer, therefore your final array will have a size of
1374389534720 * 8 = 10995116277760 bytes
or
10995116277760 / 1E12 = 10.99511627776 TB
If you only have 250 GB of free disk space then how are you able to create a "10 TB" file?
Assuming that you are using a reasonably modern filesystem, your OS will be capable of generating almost arbitrarily large sparse files, regardless of whether or not you actually have enough physical disk space to back them.
For example, on my Linux machine I'm allowed to do something like this:
# I only have about 50GB of free space...
~$ df -h /
Filesystem Type Size Used Avail Use% Mounted on
/dev/sdb1 ext4 459G 383G 53G 88% /
~$ dd if=/dev/zero of=sparsefile bs=1 count=0 seek=10T
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000236933 s, 0.0 kB/s
# ...but I can still generate a sparse file that reports its size as 10 TB
~$ ls -lah sparsefile
-rw-rw-r-- 1 alistair alistair 10T Dec 1 21:17 sparsefile
# however, this file uses zero bytes of "actual" disk space
~$ du -h sparsefile
0 sparsefile
Try calling du -h on your np.memmap file after it has been initialized to see how much actual disk space it uses.
As you start actually writing data to your np.memmap file, everything will be OK until you exceed the physical capacity of your storage, at which point the process will terminate with a Bus error. This means that if you needed to write < 250GB of data to your np.memmap array then there might be no problem (in practice this would probably also depend on where you are writing within the array, and on whether it is row or column major).
How is it possible for a process to use 10 TB of virtual memory?
When you create a memory map, the kernel allocates a new block of addresses within the virtual address space of the calling process and maps them to a file on your disk. The amount of virtual memory that your Python process is using will therefore increase by the size of the file that has just been created. Since the file can also be sparse, then not only can the virtual memory exceed the total amount of RAM available, but it can also exceed the total physical disk space on your machine.
How can you check whether you have enough disk space to store the full np.memmap array?
I'm assuming that you want to do this programmatically in Python.
Get the amount of free disk space available. There are various methods given in the answers to this previous SO question. One option is os.statvfs:
import os
def get_free_bytes(path='/'):
st = os.statvfs(path)
return st.f_bavail * st.f_bsize
print(get_free_bytes())
# 56224485376
Work out the size of your array in bytes:
import numpy as np
def check_asize_bytes(shape, dtype):
return np.prod(shape) * np.dtype(dtype).itemsize
print(check_asize_bytes((2 ** 37 * 10,), 'i8'))
# 10995116277760
Check whether 2. > 1.
Update: Is there a 'safe' way to allocate an np.memmap file, which guarantees that sufficient disk space is reserved to store the full array?
One possibility might be to use fallocate to pre-allocate the disk space, e.g.:
~$ fallocate -l 1G bigfile
~$ du -h bigfile
1.1G bigfile
You could call this from Python, for example using subprocess.check_call:
import subprocess
def fallocate(fname, length):
return subprocess.check_call(['fallocate', '-l', str(length), fname])
def safe_memmap_alloc(fname, dtype, shape, *args, **kwargs):
nbytes = np.prod(shape) * np.dtype(dtype).itemsize
fallocate(fname, nbytes)
return np.memmap(fname, dtype, *args, shape=shape, **kwargs)
mmap = safe_memmap_alloc('test.mmap', np.int64, (1024, 1024))
print(mmap.nbytes / 1E6)
# 8.388608
print(subprocess.check_output(['du', '-h', 'test.mmap']))
# 8.0M test.mmap
I'm not aware of a platform-independent way to do this using the standard library, but there is a fallocate Python module on PyPI that should work for any Posix-based OS.
Based on the answer of #ali_m I finally came to this solution:
# must be called with the argumant marking array size in GB
import sys, numpy, tempfile, subprocess
size = (2 ** 27) * int(sys.argv[1])
tmp_primary = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp_primary.name, dtype = 'i8', mode = 'w+', shape = size)
tmp = tempfile.NamedTemporaryFile('w+')
check = subprocess.Popen(['cp', '--sparse=never', tmp_primary.name, tmp.name])
stdout, stderr = check.communicate()
if stderr:
sys.stderr.write(stderr.decode('utf-8'))
sys.exit(1)
del array
tmp_primary.close()
array = numpy.memmap(tmp.name, dtype = 'i8', mode = 'r+', shape = size)
array[0] = 666
array[size-1] = 777
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.\
format(tmp.name, len(array), array[0], array[size-1]))
while True:
pass
The idea is to copy initially generated sparse file to a new normal one. For this cp with the option --sparse=never is employed.
When the script is called with a manageable size parameter (say, 1 GB) the array is getting mapped to a non-sparse file. This is confirmed by the output of du -h command, which now shows ~1 GB size. If the memory is not enough, the scripts exits with the error:
cp: ‘/tmp/tmps_thxud2’: write failed: No space left on device

Creating random binary files

I'm trying to use python to create a random binary file. This is what I've got already:
f = open(filename,'wb')
for i in xrange(size_kb):
for ii in xrange(1024/4):
f.write(struct.pack("=I",random.randint(0,sys.maxint*2+1)))
f.close()
But it's terribly slow (0.82 seconds for size_kb=1024 on my 3.9GHz SSD disk machine). A big bottleneck seems to be the random int generation (replacing the randint() with a 0 reduces running time from 0.82s to 0.14s).
Now I know there are more efficient ways of creating random data files (namely dd if=/dev/urandom) but I'm trying to figure this out for sake of curiosity... is there an obvious way to improve this?
IMHO - the following is completely redundant:
f.write(struct.pack("=I",random.randint(0,sys.maxint*2+1)))
There's absolutely no need to use struct.pack, just do something like:
import os
fileSizeInBytes = 1024
with open('output_filename', 'wb') as fout:
fout.write(os.urandom(fileSizeInBytes)) # replace 1024 with a size in kilobytes if it is not unreasonably large
Then, if you need to re-use the file for reading integers, then struct.unpack then.
(my use case is generating a file for a unit test so I just need a
file that isn't identical with other generated files).
Another option is to just write a UUID4 to the file, but since I don't know the exact use case, I'm not sure that's viable.
The python code you should write completely depends on the way you intend to use the random binary file. If you just need a "rather good" randomness for multiple purposes, then the code of Jon Clements is probably the best.
However, on Linux OS at least, os.urandom relies on /dev/urandom, which is described in the Linux Kernel (drivers/char/random.c) as follows:
The /dev/urandom device [...] will return as many bytes as are
requested. As more and more random bytes are requested without giving
time for the entropy pool to recharge, this will result in random
numbers that are merely cryptographically strong. For many
applications, however, this is acceptable.
So the question is, is this acceptable for your application ? If you prefer a more secure RNG, you could read bytes on /dev/random instead. The main inconvenient of this device: it can block indefinitely if the Linux kernel is not able to gather enough entropy. There are also other cryptographically secure RNGs like EGD.
Alternatively, if your main concern is execution speed and if you just need some "light-randomness" for a Monte-Carlo method (i.e unpredictability doesn't matter, uniform distribution does), you could consider generate your random binary file once and use it many times, at least for development.
Here's a complete script based on accepted answer that creates random files.
import sys, os
def help(error: str = None) -> None:
if error and error != "help":
print("***",error,"\n\n",file=sys.stderr,sep=' ',end='');
sys.exit(1)
print("""\tCreates binary files with random content""", end='\n')
print("""Usage:""",)
print(os.path.split(__file__)[1], """ "name1" "1TB" "name2" "5kb"
Accepted units: MB, GB, KB, TB, B""")
sys.exit(2)
# https://stackoverflow.com/a/51253225/1077444
def convert_size_to_bytes(size_str):
"""Convert human filesizes to bytes.
ex: 1 tb, 1 kb, 1 mb, 1 pb, 1 eb, 1 zb, 3 yb
To reverse this, see hurry.filesize or the Django filesizeformat template
filter.
:param size_str: A human-readable string representing a file size, e.g.,
"22 megabytes".
:return: The number of bytes represented by the string.
"""
multipliers = {
'kilobyte': 1024,
'megabyte': 1024 ** 2,
'gigabyte': 1024 ** 3,
'terabyte': 1024 ** 4,
'petabyte': 1024 ** 5,
'exabyte': 1024 ** 6,
'zetabyte': 1024 ** 7,
'yottabyte': 1024 ** 8,
'kb': 1024,
'mb': 1024**2,
'gb': 1024**3,
'tb': 1024**4,
'pb': 1024**5,
'eb': 1024**6,
'zb': 1024**7,
'yb': 1024**8,
}
for suffix in multipliers:
size_str = size_str.lower().strip().strip('s')
if size_str.lower().endswith(suffix):
return int(float(size_str[0:-len(suffix)]) * multipliers[suffix])
else:
if size_str.endswith('b'):
size_str = size_str[0:-1]
elif size_str.endswith('byte'):
size_str = size_str[0:-4]
return int(size_str)
if __name__ == "__main__":
input = {} #{ file: byte_size }
if (len(sys.argv)-1) % 2 != 0:
print("-- Provide even number of arguments --")
print(f'--\tGot: {len(sys.argv)-1}: "' + r'" "'.join(sys.argv[1:]) +'"')
sys.exit(2)
elif len(sys.argv) == 1:
help()
try:
for file, size_str in zip(sys.argv[1::2], sys.argv[2::2]):
input[file] = convert_size_to_bytes(size_str)
except ValueError as ex:
print(f'Invalid size: "{size_str}"', file=sys.stderr)
sys.exit(1)
for file, size_bytes in input.items():
print(f"Writing: {file}")
#https://stackoverflow.com/a/14276423/1077444
with open(file, 'wb') as fout:
while( size_bytes > 0 ):
wrote = min(size_bytes, 1024) #chunk
fout.write(os.urandom(wrote))
size_bytes -= wrote

Categories

Resources