Python 3.8 lzma decompress huge file incremental input and output - python

I am looking to do, in Python 3.8, the equivalent of:
xz --decompress --stdout < hugefile.xz > hugefile.out
where neither the input nor the output may fit in memory.
As I read the documentation at https://docs.python.org/3/library/lzma.html#lzma.LZMADecompressor
I could use LZMADecompressor to process incrementally available input, and I could use its decompress() function to produce output incrementally.
However it seems that LZMADecompressor puts its entire decompressed output into a single memory buffer, and decompress() reads its entire compressed input from a single input memory buffer.
Granted, the documentation confuses me as to when the input and/or output can be incremental.
So I figure I will have to spawn a separate child process to execute the "xz" binary.
Is there any way of using the lzma Python module for this task?

Instead of using the low-level LZMADecompressor, use lzma.open to get a file object. Then you can copy data into another file object with the shutil module:
import lzma
import shutil

with lzma.open("hugefile.xz", "rb") as fsrc:
    with open("hugefile.out", "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst)
Internally, shutil.copyfileobj reads and writes data in chunks, and the LZMA decompression is done on the fly. This avoids decompressing the whole file into memory.
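For completeness, the low-level LZMADecompressor can also be driven incrementally: decompress() accepts a max_length argument (Python 3.5+) that bounds the output per call, and the needs_input attribute tells you when to feed more compressed data. A minimal sketch (the helper name and chunk size are mine):

```python
import lzma

def decompress_incremental(fsrc, fdst, chunk=64 * 1024):
    """Decompress fsrc into fdst, bounding both input and output memory."""
    d = lzma.LZMADecompressor()
    while not d.eof:
        if d.needs_input:
            data = fsrc.read(chunk)
            if not data:
                break  # truncated input
        else:
            data = b""  # still draining output buffered inside the decompressor
        out = d.decompress(data, max_length=chunk)
        if out:
            fdst.write(out)
```

This is essentially what lzma.open's file object does for you internally, so the shutil approach above is usually the better choice.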

Related

Parallel bzip2 decompression in Python [duplicate]

I am using Python's bz2 module to generate (and compress) a large jsonl file (17 GB after bzip2 compression).
However, when I later try to decompress it using pbzip2, it only seems to use one CPU core, which is quite slow.
When I compress it with pbzip2, it can leverage multiple cores on decompression. Is there a way to compress within Python in a pbzip2-compatible format?
import bz2
import sys
import traceback
from queue import Empty

# ... (queue and path are defined elsewhere)
compressor = bz2.BZ2Compressor(9)
f = open(path, 'ab')
try:
    while True:
        m = queue.get(True, 1 * 60)
        f.write(compressor.compress(m + b"\n"))
except Empty:
    pass
except Exception:
    traceback.print_exc()
finally:
    sys.stderr.write("flushing\n")
    f.write(compressor.flush())
    f.close()
A pbzip2 stream is nothing more than the concatenation of multiple bzip2 streams.
An example using the shell:
bzip2 < /usr/share/dict/words > words_x_1.bz2
cat words_x_1.bz2{,,,,,,,,,} > words_x_10.bz2
time bzip2 -d < words_x_10.bz2 > /dev/null
time pbzip2 -d < words_x_10.bz2 > /dev/null
I've never used python's bz2 module, but it should be easy to close/reopen a stream in 'a'ppend mode, every so-many bytes, to get the same result. Note that if BZ2File is constructed from an existing file-like object, closing the BZ2File will not close the underlying stream (which is what you want here).
I haven't measured how many bytes is optimal for chunking, but I would guess every 1-20 megabytes - it definitely needs to be larger than the bzip2 block size (900k) though.
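To sketch that idea in Python (the helper name and default chunk size are mine, not from the answer): flush the current compressor and start a fresh one every so-many uncompressed bytes, so the output file is a concatenation of independent bzip2 streams that a parallel decompressor can hand to separate cores:

```python
import bz2

def write_pbzip2_compatible(lines, path, chunk_bytes=10 * 1024 * 1024):
    """Write an iterable of byte strings as a concatenation of
    independent bzip2 streams, each covering roughly chunk_bytes
    of uncompressed data."""
    with open(path, "wb") as f:
        compressor = bz2.BZ2Compressor(9)
        written = 0
        for line in lines:
            f.write(compressor.compress(line))
            written += len(line)
            if written >= chunk_bytes:
                f.write(compressor.flush())      # finish this stream...
                compressor = bz2.BZ2Compressor(9)  # ...and begin a new one
                written = 0
        f.write(compressor.flush())
```

Note that Python's own bz2.decompress and BZ2File handle such multi-stream files transparently (since Python 3.3), so the result stays readable from Python too.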
Note also that if you record the compressed and uncompressed offsets of each chunk, you can do fairly efficient random access. This is how the dictzip program works, though that is based on gzip.
If you absolutely must use pbzip2 on decompression this won't help you, but the alternative lbzip2 can perform multicore decompression of "normal" .bz2 files, such as those generated by Python's BZ2File or a traditional bzip2 command. This avoids the limitation of pbzip2 you're describing, where it can only achieve parallel decompression if the file is also compressed using pbzip2. See https://lbzip2.org/.
As a bonus, benchmarks suggest lbzip2 is substantially faster than pbzip2, both on decompression (by 30%) and compression (by 40%) while achieving slightly superior compression ratios. Further, its peak RAM usage is less than 50% of the RAM used by pbzip2. See https://vbtechsupport.com/1614/.

os.read() gives OSError: [Errno 22] Invalid argument when reading large data

I use the following method to read binary data from any given offset in the binary file. The binary file I have is huge (10 GB), so I usually read a portion of it when needed by specifying the offset to start reading from (start_read) and how many bytes to read (num_to_read). I am using Python 3.6.4 :: Anaconda, Inc., platform Darwin-17.6.0-x86_64-i386-64bit, and the os module:
import os
import numpy as np

def read_from_disk(path, start_read, num_to_read, dim):
    fd = os.open(path, os.O_RDONLY)
    os.lseek(fd, start_read, 0)  # seek to start_read bytes from the beginning (0)
    raw_data = os.read(fd, num_to_read)  # how many bytes to read
    C = np.frombuffer(raw_data, dtype=np.int64).reshape(-1, dim).astype(np.int8)
    os.close(fd)
    return C
This method works very well when the chunk of data to be read is less than about 2 GB. When num_to_read > 2 GB, I get this error:
raw_data = os.read(fd, num_to_read) # How many to read (num_to_read)
OSError: [Errno 22] Invalid argument
I am not sure why this issue appears and how to fix it. Any help is highly appreciated.
The os.read function is just a thin wrapper around the platform's read function.
On some platforms, this is an unsigned or signed 32-bit int,[1] which means the largest you can read in a single go on these platforms is, respectively, 4 GB or 2 GB.
So, if you want to read more than that, and you want to be cross-platform, you have to write code to handle this, and to buffer up multiple reads.
This may be a bit of a pain, but you are intentionally using the lowest-level directly-mapping-to-the-OS-APIs function here. If you don't like that:
Use io module objects (Python 3.x) or file objects (2.7) that you get back from open instead.
Just let NumPy read the files—which will have the added advantage that NumPy is smart enough to not try to read the whole thing into memory at once in the first place.
Or, for files this large, you may want to go lower level and use mmap (assuming you're on a 64-bit platform).
The right thing to do here is almost certainly a combination of the first two. In Python 3, it would look like this:
with open(path, 'rb', buffering=0) as f:
    f.seek(start_read)
    count = num_to_read // 8  # how many int64s to read
    return np.fromfile(f, dtype=np.int64, count=count).reshape(-1, dim).astype(np.int8)
[1] For Windows, the POSIX-emulation library's _read function uses int for the count argument, which is signed 32-bit. For every other modern platform, see POSIX read, and then look up the definitions of size_t, ssize_t, and off_t on your platform. Notice that many POSIX platforms have separate 64-bit types, and corresponding functions, instead of changing the meaning of the existing types to 64-bit. Python will use the standard types, not the special 64-bit types.
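If you do want to stay with os.read, the buffering workaround mentioned above can be sketched like this (the helper name and the 1 GiB per-call cap are mine):

```python
import os

CHUNK = 1 << 30  # 1 GiB per call, comfortably below the 2 GB platform limit

def read_exact(fd, num_to_read):
    """Read up to num_to_read bytes from fd using several smaller os.read calls."""
    parts = []
    remaining = num_to_read
    while remaining > 0:
        part = os.read(fd, min(remaining, CHUNK))
        if not part:
            break  # hit EOF before num_to_read bytes arrived
        parts.append(part)
        remaining -= len(part)
    return b"".join(parts)
```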

Piping from Python's ftplib without blocking

Ideally what I'd like to do is replicate this bash pipeline in python (I'm using cut here to represent some arbitrary transformation of the data. I actually want to use pandas to do this):
curl ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz | gunzip | cut -f 1,2,4
I can write the following code in python, which achieves the same goal
# Download the zip file into memory
file = io.BytesIO()
ftp = ftplib.FTP('hgdownload.cse.ucsc.edu')
ftp.retrbinary(f'RETR goldenPath/{args.reference}/database/refFlat.txt.gz', file.write)
# Unzip the gzip file
table = gzip.GzipFile(fileobj=file)
# Read into pandas
df = pd.read_csv(table)
However, the ftp.retrbinary() call blocks, and waits for the whole download. What I want is to have one long binary stream, with the FTP file as the source, with a gunzip as a filter, and with pd.read_csv() as a sink, all simultaneously processing data, as in my bash pipeline. Is there a way to stop retrbinary() from blocking?
I realise this may be impossible because Python can't use more than one thread. Is this true? If so, can I use multiprocessing, async, or some other language feature to achieve this simultaneous pipeline?
edit: changed storbinary to retrbinary, this was a typo and the problem still stands
You should be able to download the file directly to the GZipFile:
gzipfile = gzip.GzipFile()
ftp.retrbinary(f'RETR goldenPath/{args.reference}/database/refFlat.txt.gz', gzipfile.write)
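Note that retrbinary will still block the thread it is called from. One way to get download, decompression, and parsing running concurrently (a sketch; the helper name and chunk size are mine) is to run the download in a background thread that writes into an os.pipe, while the main thread gunzips and consumes the read end:

```python
import gzip
import os
import threading

def stream_decompress(download, sink, chunk=64 * 1024):
    """Run download(write) in a background thread, where download is a
    stand-in for something like ftp.retrbinary that repeatedly calls
    write() with gzip-compressed chunks; this thread decompresses the
    pipe concurrently and passes plain-text chunks to sink()."""
    read_fd, write_fd = os.pipe()

    def writer():
        with os.fdopen(write_fd, "wb") as w:
            download(w.write)  # blocks here, off the main thread

    t = threading.Thread(target=writer)
    t.start()
    with os.fdopen(read_fd, "rb") as r, gzip.GzipFile(fileobj=r) as gz:
        for piece in iter(lambda: gz.read(chunk), b""):
            sink(piece)
    t.join()
```

With FTP this might be invoked as stream_decompress(lambda write: ftp.retrbinary('RETR …', write), sink); pandas could similarly read straight from the GzipFile wrapped around the pipe's read end.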

Load data to a numpy array by piping the output of an external program

I have a program that outputs a huge array of float32 values preceded by a 40-byte header. The program writes to stdout.
I can dump the output into a file, open it in python, skip the 40 bytes of the header, and load it into numpy using numpy.fromfile(). However that takes a lot of time.
So what I would like to do is to load the array directly into numpy by reading the stdout of the program that generates it. However I'm having a hard time figuring this out.
Thanks!
You can memory-map the file instead of reading it all. This will take almost no time:
np.memmap(filename, np.float32, offset=40)
Of course actually reading data from the result will take some time, but probably this will be hidden by interleaving the I/O with computation.
If you really don't want the data to ever be written to disk, you could use subprocess.Popen() to run your program with stdout=subprocess.PIPE and pass the resulting stdout file-like object directly to numpy.fromfile().
Many thanks to all that responded/commented.
I followed downshift's advice and looked into the link he provides...
And came up with the following:
import subprocess as sp
import numpy as np

nLayers = <number of layers in program output>
nRows = <number of rows per layer>
nCols = <number of columns per layer>
nBytes = <number of bytes for each value>
noDataValue = <value used to code no data in program output>
DataType = <appropriate numpy data type for the values>

perLayer = nRows * nCols * nBytes
proc = sp.Popen(cmd + args, stdout=sp.PIPE, shell=True)
data = bytearray()
for i in range(nLayers):
    dump40 = proc.stdout.read(40)  # skip the 40-byte header of each layer
    data += proc.stdout.read(perLayer)
ndata = np.frombuffer(data, dtype=DataType)
ndata[ndata == noDataValue] = np.nan
ndata.shape = (nLayers, nRows, nCols)
The key here is using numpy.frombuffer, which builds the ndarray on top of the same read buffer and thus avoids duplicating the data in memory.

Python decompressing gzip chunk-by-chunk

I've a memory- and disk-limited environment where I need to decompress the contents of a gzip file sent to me in string-based chunks (over xmlrpc binary transfer). However, both zlib.decompress() and zlib.decompressobj().decompress() barf on the gzip header. I've tried offsetting past the gzip header (documented here), but still haven't managed to avoid the barf. The gzip library itself only seems to support decompressing from files.
The following snippet gives a simplified illustration of what I would like to do (except in real life the buffer will be filled from xmlrpc, rather than reading from a local file):
#!/usr/bin/env python
import zlib

CHUNKSIZE = 1000

d = zlib.decompressobj()
f = open('23046-8.txt.gz', 'rb')
buffer = f.read(CHUNKSIZE)
while buffer:
    outstr = d.decompress(buffer)
    print(outstr)
    buffer = f.read(CHUNKSIZE)
outstr = d.flush()
print(outstr)
f.close()
Unfortunately, as I said, this barfs with:
Traceback (most recent call last):
File "./test.py", line 13, in <module>
outstr = d.decompress(buffer)
zlib.error: Error -3 while decompressing: incorrect header check
Theoretically, I could feed my xmlrpc-sourced data into a StringIO and then use that as a fileobj for gzip.GzipFile(), however, in real life, I don't have memory available to hold the entire file contents in memory as well as the decompressed data. I really do need to process it chunk-by-chunk.
The fall-back would be to change the compression of my xmlrpc-sourced data from gzip to plain zlib, but since that impacts other sub-systems I'd prefer to avoid it if possible.
Any ideas?
gzip and zlib use slightly different headers.
See How can I decompress a gzip stream with zlib?
Try d = zlib.decompressobj(16+zlib.MAX_WBITS).
And you might try changing your chunk size to a power of 2 (say CHUNKSIZE=1024) for possible performance reasons.
I've got a more detailed answer here: https://stackoverflow.com/a/22310760/1733117
d = zlib.decompressobj(zlib.MAX_WBITS|32)
per documentation this automatically detects the header (zlib or gzip).
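Putting that fix into the question's chunked loop (the generator wrapper is mine): construct the decompressobj with the widened wbits and the rest of the code works unchanged:

```python
import zlib

def gunzip_chunks(chunks):
    """Decompress gzip- or zlib-wrapped data incrementally,
    yielding decompressed chunks as they become available."""
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)  # auto-detect gzip/zlib header
    for chunk in chunks:
        out = d.decompress(chunk)
        if out:
            yield out
    tail = d.flush()
    if tail:
        yield tail
```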
