I have a big compressed file and I want to know the size of its content without uncompressing it. I've tried this:
import gzip
import os
with gzip.open(data_file) as f:
    f.seek(0, os.SEEK_END)
    size = f.tell()
but I get this error
ValueError: Seek from end not supported
How can I do that?
Thx.
It is not possible in principle to definitively determine the size of the uncompressed data in a gzip file without decompressing it. You do not need to have the space to store the uncompressed data -- you can discard it as you go along. But you have to decompress it all.
If you control the source of the gzip file and can assure that a) there are no concatenated members in the gzip file, b) the uncompressed data is less than 4 GB in length, and c) there is no extraneous junk at the end of the gzip file, then and only then you can read the last four bytes of the gzip file to get a little-endian integer that has the length of the uncompressed data.
See this answer for more details.
Here is Python code to read a gzip file and print the uncompressed length, without having to store or save the uncompressed data. It limits the memory usage to small buffers. This requires Python 3.3 or greater:
#!/usr/local/bin/python3.4
import sys
import zlib
import warnings

f = open(sys.argv[1], "rb")
total = 0
buf = f.read(1024)
while True:                  # loop through concatenated gzip streams
    z = zlib.decompressobj(15 + 16)
    while True:              # loop through one gzip stream
        while True:          # go through all output from one input buffer
            total += len(z.decompress(buf, 4096))
            buf = z.unconsumed_tail
            if buf == b"":
                break
        if z.eof:
            break            # end of a gzip stream found
        buf = f.read(1024)
        if buf == b"":
            warnings.warn("incomplete gzip stream")
            break
    buf = z.unused_data
    z = None
    if buf == b"":
        buf = f.read(1024)
        if buf == b"":
            break
print(total)
Unfortunately, the Python 2.x gzip module doesn't appear to support any way of determining uncompressed file size.
However, gzip does store the uncompressed size as a little-endian 32-bit unsigned integer at the very end of the file: http://www.abeel.be/content/determine-uncompressed-size-gzip-file
Unfortunately, this only works for files < 4 GB, because the gzip format stores the size in only a 32-bit integer (i.e. the size modulo 2^32); see the manual.
import os
import struct

with open(data_file, "rb") as f:
    f.seek(-4, os.SEEK_END)
    size, = struct.unpack("<I", f.read(4))
print size
To summarize, I need to open huge compressed files (> 4 GB), so the technique of Dan won't work, and I want the length (number of lines) of the file, so the technique of Mark Adler is not appropriate.
Eventually, I found a solution for uncompressed files (not the most optimized, but it works!) which can easily be transposed to compressed files:
size = 0
with gzip.open(data_file) as f:
    for line in f:
        size += 1
return size
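If the line count is all you need, iterating over the decompressed stream in binary chunks and counting newline bytes is usually faster than line-by-line iteration. A minimal sketch (the helper name is made up, and it assumes the file ends with a newline):

import gzip

def count_lines(data_file, chunk_size=1024 * 1024):
    # read the decompressed stream in fixed-size chunks and count newline bytes;
    # only one chunk is held in memory at a time
    size = 0
    with gzip.open(data_file, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            size += chunk.count(b'\n')
    return size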
Thank you all, people in this forum are very effective!
I have made a simple app that encrypts and decrypts files, but when I load a large file, like a 2 GB one, my program uses 100% of the memory. I use multiprocessing and multithreading.
poolSize = min(cpu_count(), len(fileList))
process_pool = Pool(poolSize)
thread_pool = ThreadPool(len(fileList))
lock = Lock()
worker = partial(encfile, process_pool, lock)
thread_pool.map(worker, fileList)
def encfile(process_pool, lock, file):
    with open(file, 'rb') as original_file:
        original = original_file.read()
    encrypted = process_pool.apply(encryptfn, args=(key, original,))
    with open(file, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)
This is my general idea:
Since memory is a problem, you have to read the file in smaller chunks, say 64K pieces, encrypt each 64K block, and write those out. Of course, the encrypted block will have a length other than 64K, so the problem becomes how to decrypt. So each encrypted block must be prefixed with a fixed-length header that is nothing more than the length of the following encrypted block, encoded as a 4-byte unsigned integer (which should be way more than adequate). The decryption loop first reads the next 4-byte length and then knows from that how many bytes long the encrypted block that follows is.
By the way, there is no need to pass a lock to encfile if you are not using it to, for example, count the files processed.
from tempfile import mkstemp
from os import fdopen, replace

BLOCKSIZE = 64 * 1024
ENCRYPTED_HEADER_LENGTH = 4

def encfile(process_pool, lock, file):
    """
    Encrypt file in place.
    """
    fd, path = mkstemp()  # make a temporary file
    with open(file, 'rb') as original_file, \
         fdopen(fd, 'wb') as encrypted_file:
        while True:
            original = original_file.read(BLOCKSIZE)
            if not original:
                break
            encrypted = process_pool.apply(encryptfn, args=(key, original))
            l = len(encrypted)
            l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
            encrypted_file.write(l_bytes)
            encrypted_file.write(encrypted)
    replace(path, file)
def decfile(file):
    """
    Decrypt file in place.
    """
    fd, path = mkstemp()  # make a temporary file
    with open(file, 'rb') as encrypted_file, \
         fdopen(fd, 'wb') as original_file:
        while True:
            l_bytes = encrypted_file.read(ENCRYPTED_HEADER_LENGTH)
            if not l_bytes:
                break
            l = int.from_bytes(l_bytes, 'big')
            encrypted = encrypted_file.read(l)
            decrypted = decryptfn(key, encrypted)
            original_file.write(decrypted)
    replace(path, file)
Explanation
The larger the block size, the more memory is required (your original program read the entire file; this program will only read 64K at a time). But I am assuming that too small a block size results in too many calls to the encryption, which is done by multiprocessing and would therefore incur more CPU overhead -- so it's a tradeoff. 64K was arbitrary. Increase it by a lot if you have the memory. You can even try 1024 * 1024 (1M).
I attempted to explain this before, but let me elaborate:
So let's say that when you encrypt a 64K block, the encrypted size for that one particular block ends up being 67,986 bytes (a different 64K block will in general encrypt to a different length, unless its unencrypted content happened to be the same). If I just write out the data with no other information, I would need some way to know that, to decrypt the file, it is first necessary to read back exactly 67,986 bytes and pass them to the decrypt method (with the correct key, of course), because you have to decrypt the precise result of what was encrypted, no fewer and no more bytes. In other words, you can't just read back the encrypted file in arbitrary chunks and pass those chunks to the decrypt method. So the only way to know how big each encrypted chunk is, is to prefix each chunk with a header that gives the length of the chunk that follows.
l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big') takes the integer length stored in variable l and encodes it as a byte array of size ENCRYPTED_HEADER_LENGTH in "big endian" order, meaning that the bytes are arranged from high-order byte to low-order byte:
>>> ENCRYPTED_HEADER_LENGTH = 4
>>> l = 67986
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
>>> l_bytes
b'\x00\x01\t\x92'
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'little')
>>> l_bytes
b'\x92\t\x01\x00'
>>>
\t is the tab character with a value of \x09, so we would be writing out the bytes 00 01 09 92, i.e. 0x00010992, which is the 4-byte big-endian hexadecimal representation of 67986.
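For what it's worth, you can check the round trip in the interpreter:
>>> int.from_bytes(b'\x00\x01\t\x92', 'big')
67986
>>> 0x00010992
67986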
Is it possible to gzip data via some amount of streaming, i.e. without loading all of the compressed data in memory at once?
For example, can I gzip a file that will be 10gb gzipped, on a machine with 2gb of memory?
At https://docs.python.org/3/library/gzip.html#gzip.compress, the gzip.compress function returns the bytes of the gzip, so they must all be loaded in memory. But... it's not clear how gzip.open works internally: whether the zipped bytes will all be in memory at once. Does the gzip format itself make it particularly tricky to achieve streaming gzip?
[This question is tagged with Python, but non-Python answers welcome as well]
You don't have to compress all 10gb at once. You can read the input data in chunks, and compress each chunk separately, so it doesn't have to all fit in memory at once.
import gzip

chunksize = 100 * 1024 * 1024  # 100 MB chunks

with open("bigfile.txt", "rb") as infile:  # binary mode: gzip.compress needs bytes
    while True:
        chunk = infile.read(chunksize)
        if not chunk:
            break
        compressed = gzip.compress(chunk)
        # do something with compressed
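Note that each gzip.compress() call produces a complete gzip member, and the gzip format allows members to be simply concatenated, so one possible "do something" is to append them to a single output file; gzip.open (and gunzip) will then read them back as one stream. A rough sketch, reusing the names above:

import gzip

chunksize = 100 * 1024 * 1024  # 100 MB chunks

with open("bigfile.txt", "rb") as infile, open("bigfile.txt.gz", "wb") as outfile:
    while True:
        chunk = infile.read(chunksize)
        if not chunk:
            break
        # each compressed chunk is an independent gzip member; appending them
        # still yields a valid .gz file (at a small cost in compression ratio)
        outfile.write(gzip.compress(chunk))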
If you're creating a compressed file, you can write the chunks directly to the gzip file.
with open("bigfile.txt") as infile, gzip.open("bigfile.txt.gz", "w") as gzipfile:
while True:
chunk = infile.read(chunksize)
if not chunk:
break
gzipfile.write(chunk)
[This is based on @Barmar's answer and comments]
You can achieve streaming gzip compression. The gzip module uses zlib which is documented to achieve streaming compression, and peeking into the gzip module source, it doesn't appear to load all the output bytes into memory.
You can also do this directly with the zlib module, for example with a small pipeline of generators:
import zlib

def yield_uncompressed_bytes():
    # In a real case, would yield bytes pulled from the filesystem or the network
    chunk = b'*' * 65000
    for _ in range(0, 10000):
        print('In: ', len(chunk))
        yield chunk

def yield_compressed_bytes(_uncompressed_bytes):
    compress_obj = zlib.compressobj(wbits=zlib.MAX_WBITS + 16)
    for chunk in _uncompressed_bytes:
        if compressed_bytes := compress_obj.compress(chunk):
            yield compressed_bytes
    if compressed_bytes := compress_obj.flush():
        yield compressed_bytes

uncompressed_bytes = yield_uncompressed_bytes()
compressed_bytes = yield_compressed_bytes(uncompressed_bytes)

for chunk in compressed_bytes:
    # In a real case, could save to the filesystem, or send over the network
    print('Out:', len(chunk))
You can see that the In: are interspersed with the Out:, suggesting that the zlib compressobj is indeed not storing the whole output in memory.
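The same pipeline idea works in the other direction; here is a minimal, untested sketch of the reverse generator using zlib.decompressobj with the same wbits value (so it expects gzip-wrapped data), meant to be chained onto the generators above:

import zlib

def yield_decompressed_bytes(_compressed_bytes):
    decompress_obj = zlib.decompressobj(wbits=zlib.MAX_WBITS + 16)  # expects a gzip header
    for chunk in _compressed_bytes:
        if decompressed_bytes := decompress_obj.decompress(chunk):
            yield decompressed_bytes
    if decompressed_bytes := decompress_obj.flush():
        yield decompressed_bytes

# e.g. chain it onto the generators above:
# for chunk in yield_decompressed_bytes(yield_compressed_bytes(yield_uncompressed_bytes())):
#     print('Back:', len(chunk))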
I'm trying to replicate this Bash command in Python, which splits the gzipped file into 50MB parts:
split -b 50m "file.dat.gz" "file.dat.gz.part-"
My attempt at the Python equivalent:
import gzip

infile_name = "file.dat.gz"
chunk = 50 * 1024 * 1024  # 50MB

with gzip.open(infile_name, 'rb') as infile:
    for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), b'')):
        print(n, chunk)
        with gzip.open('{}.part-{}'.format(infile_name[:-3], n), 'wb') as outfile:
            outfile.write(raw_bytes)
This returns files of about 15MB each, gzipped. When I gunzip the files, they are 50MB each.
How do I split the gzipped file in Python so that the split-up files are each 50MB before gunzipping?
I don't believe that split works the way you think it does. It doesn't split the gzip file into smaller gzip files. I.e. you can't call gunzip on the individual files it creates. It literally breaks up the data into smaller chunks and if you want to gunzip it, you have to concatenate all the chunks back together first. So, to emulate the actual behavior with Python, we'd do something like:
infile_name = "file.dat.gz"
chunk = 50*1024*1024 # 50MB
with open(infile_name, 'rb') as infile:
for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), b'')):
print(n, chunk)
with open('{}.part-{}'.format(infile_name[:-3], n), 'wb') as outfile:
outfile.write(raw_bytes)
In reality we'd read multiple smaller input chunks to make one output chunk to use less memory.
We might be able to break the file into smaller files that we can individually gunzip, and still hit our target size. Using something like a BytesIO stream, we could gunzip the file and gzip it into that memory stream until it reached the target size, then write it out and start a new BytesIO stream.
With compressed data, you have to measure the size of the output, not the size of the input, as we can't predict how well the data will compress.
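A rough, untested sketch of that idea (the function name and part naming are made up; it checks buf.tell(), i.e. the compressed output written so far, so each part will overshoot the target by up to one compressed block plus the gzip trailer):

import gzip
import io

def split_gzip(infile_name, target_size=50 * 1024 * 1024, read_size=1024 * 1024):
    """Recompress a .gz file into parts that can each be gunzipped on their own."""
    part = 0
    wrote_data = False
    buf = io.BytesIO()
    gz_out = gzip.GzipFile(fileobj=buf, mode='wb')
    with gzip.open(infile_name, 'rb') as infile:
        while True:
            data = infile.read(read_size)
            if not data:
                break
            gz_out.write(data)
            wrote_data = True
            if buf.tell() >= target_size:  # measure compressed output, not input
                gz_out.close()             # flushes the gzip trailer into buf
                with open('{}.part-{}'.format(infile_name, part), 'wb') as outfile:
                    outfile.write(buf.getvalue())
                part += 1
                wrote_data = False
                buf = io.BytesIO()
                gz_out = gzip.GzipFile(fileobj=buf, mode='wb')
    gz_out.close()
    if wrote_data:                         # write whatever is left over
        with open('{}.part-{}'.format(infile_name, part), 'wb') as outfile:
            outfile.write(buf.getvalue())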
Here's a solution for emulating something like the split -l (split on lines) command option that will allow you to open each individual file with gunzip.
import io
import os
import shutil
from xopen import xopen

def split(infile_name, num_lines):
    infile_name_fp = infile_name.split('/')[-1].split('.')[0]  # get first part of file name
    cur_dir = '/'.join(infile_name.split('/')[0:-1])
    out_dir = f'{cur_dir}/{infile_name_fp}_split'
    if os.path.exists(out_dir):
        shutil.rmtree(out_dir)
    os.makedirs(out_dir)  # create in same folder as the original .csv.gz file

    m = 0
    part = 0
    buf = io.StringIO()  # initialize buffer
    with xopen(infile_name, 'rt') as infile:
        for line in infile:
            if m < num_lines:  # fill up buffer
                buf.write(line)
                m += 1
            else:  # write buffer to file
                with xopen(f'{out_dir}/{infile_name_fp}.part-{str(part).zfill(5)}.csv.gz', mode='wt', compresslevel=6) as outfile:
                    outfile.write(buf.getvalue())
                part += 1
                buf = io.StringIO()  # flush buffer -> faster than seek(0); truncate(0)
                buf.write(line)      # keep the line that triggered the rollover
                m = 1

    # write whatever is left in buffer to file
    with xopen(f'{out_dir}/{infile_name_fp}.part-{str(part).zfill(5)}.csv.gz', mode='wt', compresslevel=6) as outfile:
        outfile.write(buf.getvalue())
    buf.close()
Usage:
split('path/to/myfile.csv.gz', num_lines=100000)
Outputs a folder with split files at path/to/myfile_split.
Discussion: I've used xopen here for additional speed, but you may choose to use gzip.open if you want to stay with Python native packages. Performance-wise, I've benchmarked this to take about twice as long as a solution combining pigz and split. It's not bad, but could be better. The bottleneck is the for loop and the buffer, so maybe rewriting this to work asynchronously would be more performant.
I want to compress files and compute the checksum of the compressed file using python. My first naive attempt was to use 2 functions:
def compress_file(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    f_out = gzip.open(output_filename, 'wb')
    f_out.writelines(f_in)
    f_out.close()
    f_in.close()

def md5sum(filename):
    with open(filename) as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    return md5
However, it leads to the compressed file being written and then re-read. With many files (> 10 000), each several MB when compressed, on an NFS-mounted drive, this is slow.
How can I compress the file in a buffer and then compute the checksum from this buffer before writing the output file?
The files are not that big, so I can afford to store everything in memory. However, a nice incremental version would be welcome too.
The last requirement is that it should work with multiprocessing (in order to compress several files in parallel).
I have tried to use zlib.compress, but the returned string is missing the header of a gzip file.
Edit: following @abarnert's suggestion, I used Python 3's gzip.compress:
def compress_md5(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    # Read in buffer
    buff = f_in.read()
    f_in.close()
    # Compress this buffer
    c_buff = gzip.compress(buff)
    # Compute MD5
    md5 = hashlib.md5(c_buff).hexdigest()
    # Write compressed buffer
    f_out = open(output_filename, 'wb')
    f_out.write(c_buff)
    f_out.close()
    return md5
This produces a correct gzip file, but the output is different at each run (the md5 is different):
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'0d0eb6a5f3fe2c1f3201bc3360201f71'
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'8e4954ab5914a1dd0d8d0deb114640e5'
The gzip program doesn't have this problem:
$ gzip -c 4327_010.pdf | md5sum
8965184bc4dace5325c41cc75c5837f1 -
$ gzip -c 4327_010.pdf | md5sum
8965184bc4dace5325c41cc75c5837f1 -
I guess it's because the gzip module uses the current time by default when creating a file (the gzip program, I assume, uses the modification time of the input file). There is no way to change that with gzip.compress.
I was thinking of creating a gzip.GzipFile in read/write mode, controlling the mtime, but there is no such mode for gzip.GzipFile.
Inspired by @zwol's suggestion, I wrote the following function, which correctly sets the filename and the OS (Unix) in the header:
def compress_md5(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    # Read data in buffer
    buff = f_in.read()
    # Create output buffer
    c_buff = cStringIO.StringIO()
    # Create gzip file
    input_file_stat = os.stat(input_filename)
    mtime = input_file_stat[8]
    gzip_obj = gzip.GzipFile(input_filename, mode="wb", fileobj=c_buff, mtime=mtime)
    # Compress data in memory
    gzip_obj.write(buff)
    # Close files
    f_in.close()
    gzip_obj.close()
    # Retrieve compressed data
    c_data = c_buff.getvalue()
    # Change OS value
    c_data = c_data[0:9] + '\003' + c_data[10:]
    # Really write compressed data
    f_out = open(output_filename, "wb")
    f_out.write(c_data)
    # Compute MD5
    md5 = hashlib.md5(c_data).hexdigest()
    return md5
The output is the same at each run. Moreover, the output of file is the same as with gzip:
$ gzip -9 -c 4327_010.pdf > ref_max/4327_010.pdf.gz
$ file ref_max/4327_010.pdf.gz
ref_max/4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May 5 14:28:16 2015, max compression
$ file 4327_010.pdf.gz
4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May 5 14:28:16 2015, max compression
However, md5 is different:
$ md5sum 4327_010.pdf.gz ref_max/4327_010.pdf.gz
39dc3e5a52c71a25c53fcbc02e2702d5 4327_010.pdf.gz
213a599a382cd887f3c4f963e1d3dec4 ref_max/4327_010.pdf.gz
gzip -l is also different:
$ gzip -l ref_max/4327_010.pdf.gz 4327_010.pdf.gz
         compressed        uncompressed  ratio uncompressed_name
            7286404             7600522   4.1% ref_max/4327_010.pdf
            7297310             7600522   4.0% 4327_010.pdf
I guess it's because the gzip program and the Python gzip module (which is based on the C library zlib) use slightly different algorithms.
Wrap a gzip.GzipFile object around an io.BytesIO object. (In Python 2, use cStringIO.StringIO instead.) After you close the GzipFile, you can retrieve the compressed data from the BytesIO object (using getvalue), hash it, and write it out to a real file.
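A minimal sketch of that approach (the function name is made up; it mirrors what the question does, but keeps the compressed data in memory until the hash is computed):

import gzip
import hashlib
import io

def compress_and_hash(input_filename, output_filename):
    buf = io.BytesIO()
    with open(input_filename, 'rb') as f_in, \
         gzip.GzipFile(input_filename, mode='wb', fileobj=buf) as gz_out:
        gz_out.write(f_in.read())
    data = buf.getvalue()                    # the complete gzip stream, trailer included
    digest = hashlib.md5(data).hexdigest()   # or hashlib.sha256, per the note below
    with open(output_filename, 'wb') as f_out:
        f_out.write(data)
    return digest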
Incidentally, you really shouldn't be using MD5 at all anymore.
I have tried to use zlib.compress but the returned string is missing the header of a gzip file.
Of course. That's the whole difference between the zlib module and the gzip module; zlib just deals with zlib-deflate compression without gzip headers, gzip deals with zlib-deflate data with gzip headers.
So, just call gzip.compress instead, and the code you wrote but didn't show us should just work.
As a side note:
with open(filename) as f:
    md5 = hashlib.md5(f.read()).hexdigest()
You almost certainly want to open the file in 'rb' mode here. You don't want to convert '\r\n' into '\n' (if on Windows), or decode the binary data as sys.getdefaultencoding() text (if on Python 3), so open it in binary mode.
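In other words, something along these lines:

import hashlib

def md5sum(filename):
    with open(filename, 'rb') as f:  # binary mode: hash the raw bytes, not decoded text
        return hashlib.md5(f.read()).hexdigest()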
Another side note:
Don't use line-based APIs on binary files. Instead of this:
f_out.writelines(f_in)
… do this:
f_out.write(f_in.read())
Or, if the files are too large to read into memory all at once (partial here is functools.partial):
for buf in iter(partial(f_in.read, 8192), b''):
    f_out.write(buf)
And one last point:
With many files (> 10 000), each several MB when compressed, on an NFS-mounted drive, this is slow.
Does your system not have a tmp directory mounted on a faster drive?
In most cases, you don't need a real file. Either there's a string-based API (zlib.compress, gzip.compress, json.dumps, etc.), or the file-based API only requires a file-like object, like a BytesIO.
But when you do need a real temporary file, with a real file descriptor and everything, you almost always want to create it in the temporary directory.* In Python, you do this with the tempfile module.
For example:
import gzip
import hashlib
import tempfile

def compress_and_md5(filename):
    with tempfile.NamedTemporaryFile() as f_out:
        with open(filename, 'rb') as f_in:
            with gzip.open(f_out, 'wb') as g_out:  # 'wb': write mode; closing flushes the gzip trailer
                g_out.write(f_in.read())
        f_out.seek(0)
        md5 = hashlib.md5(f_out.read()).hexdigest()
        return md5
If you need an actual filename, rather than a file object, you can use f_out.name (the temporary file's name).
* The one exception is when you only want the temporary file to eventually rename it to a permanent location. In that case, of course, you usually want the temporary file to be in the same directory as the permanent location. But you can do that with tempfile just as easily. Just remember to pass delete=False.
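For instance, a small sketch of that pattern (the function and parameter names are made up; os.replace needs Python 3.3+, and os.rename behaves the same way on POSIX):

import os
import tempfile

def write_atomically(target_path, data):
    # create the temp file next to the target so the final rename stays on one filesystem
    with tempfile.NamedTemporaryFile(dir=os.path.dirname(target_path) or '.',
                                     delete=False) as tmp:
        tmp.write(data)
        tmp_name = tmp.name
    os.replace(tmp_name, target_path)  # atomic on POSIX, and it overwrites an existing target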
I use the following simple Python script to compress a large text file (say, 10GB) on an EC2 m3.large instance. However, I always get a MemoryError:
import gzip

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        f_out.writelines(f_in)
        # or the following:
        # for line in f_in:
        #     f_out.write(line)
The traceback I got is:
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    f_out.writelines(f_in)
MemoryError
I have read some discussion about this issue, but I'm still not quite clear on how to handle it. Can someone give me a more understandable answer about how to deal with this problem?
The problem here has nothing to do with gzip, and everything to do with reading line by line from a 10GB file with no newlines in it:
As an additional note, the file I used to test the Python gzip functionality is generated by fallocate -l 10G bigfile_file.
That gives you a 10GB sparse file made entirely of 0 bytes. Meaning there are no newline bytes. Meaning the first line is 10GB long. Meaning it will take 10GB to read the first line. (Or possibly even 20 or 40GB, if you're using pre-3.3 Python and trying to read it as Unicode.)
If you want to copy binary data, don't copy line by line. Whether it's a normal file, a GzipFile that's decompressing for you on the fly, a socket.makefile(), or anything else, you will have the same problem.
The solution is to copy chunk by chunk. Or just use copyfileobj, which does that for you automatically.
import gzip
import shutil

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
By default, copyfileobj uses a chunk size optimized to be often very good and never very bad. In this case, you might actually want a smaller size, or a larger one; it's hard to predict which a priori.* So, test it by using timeit with different bufsize arguments (say, powers of 4 from 1KB to 8MB) to copyfileobj. But the default 16KB will probably be good enough unless you're doing a lot of this.
* If the buffer size is too big, you may end up alternating long chunks of I/O and long chunks of processing. If it's too small, you may end up needing multiple reads to fill a single gzip block.
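One rough way to run the timeit comparison mentioned above, assuming the same file names as before (copyfileobj's buffer size is its third argument, called length):

import gzip
import shutil
import timeit

def compress_once(bufsize):
    with open('test_large.csv', 'rb') as f_in, \
         gzip.open('test_out.csv.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out, bufsize)

for bufsize in (1024, 4096, 16 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    seconds = timeit.timeit(lambda: compress_once(bufsize), number=1)
    print(bufsize, round(seconds, 2))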
That's odd. I would expect this error if you tried to compress a large binary file that didn't contain many newlines, since such a file could contain a "line" that was too big for your RAM, but it shouldn't happen on a line-structured .csv file.
But anyway, it's not very efficient to compress files line by line. Even though the OS buffers disk I/O, it's generally much faster to read and write larger blocks of data, e.g. 64 kB.
I have 2GB of RAM on this machine, and I just successfully used the program below to compress a 2.8GB tar archive.
#! /usr/bin/env python
import gzip
import sys

blocksize = 1 << 16  # 64kB

def gzipfile(iname, oname, level):
    with open(iname, 'rb') as f_in:
        f_out = gzip.open(oname, 'wb', level)
        while True:
            block = f_in.read(blocksize)
            if block == '':
                break
            f_out.write(block)
        f_out.close()
    return

def main():
    if len(sys.argv) < 3:
        print "gzip compress in_file to out_file"
        print "Usage:\n%s in_file out_file [compression_level]" % sys.argv[0]
        exit(1)
    iname = sys.argv[1]
    oname = sys.argv[2]
    level = int(sys.argv[3]) if len(sys.argv) > 3 else 6
    gzipfile(iname, oname, level)

if __name__ == '__main__':
    main()
I'm running Python 2.6.6 and gzip.open() doesn't support with.
As Andrew Bay notes in the comments, if block == '': won't work correctly in Python 3, since block contains bytes, not a string, and an empty bytes object doesn't compare as equal to an empty text string. We could check the block length, or compare to b'' (which will also work in Python 2.6+), but the simple way is if not block:.
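In other words, something like this version of the core loop, which also uses with for the gzip file (supported on Python 2.7+ and 3):

import gzip

def gzipfile(iname, oname, level=6, blocksize=1 << 16):
    # same loop as above, but `if not block:` works for both '' (Python 2) and b'' (Python 3)
    with open(iname, 'rb') as f_in, gzip.open(oname, 'wb', level) as f_out:
        while True:
            block = f_in.read(blocksize)
            if not block:
                break
            f_out.write(block)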
It is weird to get a memory error even when reading a file line by line. I suppose it is because you have very little available memory and very large lines. You should then use binary reads:
import gzip

# adapt the size value: small values will take more time, a high value could cause memory errors
size = 8096

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        while True:
            data = f_in.read(size)
            if not data:
                break
            f_out.write(data)