I have a compressed file which I can uncompress at the Ubuntu command prompt using zlib-flate, as below:
zlib-flate -uncompress < inputfile > outfile
Here inputfile is the compressed file and outfile is the uncompressed version.
The compressed file contains byte data.
I could not find a way to do the same using Python.
Please advise.
If the entire file fits in memory, zlib can do exactly this in a very straightforward manner:
import zlib
with open("input_file", "rb") as input_file:
input_data = input_file.read()
decompressed_data = zlib.decompress(input_data)
with open("output_file", "wb") as output_file:
output_file.write(decompressed_data)
If the file is too large to fit in memory, you may want to instead use zlib.decompressobj(), which can do streaming but isn't quite as straightforward.
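For the streaming case, a minimal sketch using zlib.decompressobj (untested, with the same placeholder file names as above):

import zlib

CHUNK_SIZE = 64 * 1024  # read 64KB of compressed data at a time

decompressor = zlib.decompressobj()
with open("input_file", "rb") as input_file, open("output_file", "wb") as output_file:
    while True:
        chunk = input_file.read(CHUNK_SIZE)
        if not chunk:
            break
        output_file.write(decompressor.decompress(chunk))
    # Flush any data still buffered inside the decompressor.
    output_file.write(decompressor.flush())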
I searched how to compress a file in python, and found an answer that was basically as described below:
import gzip

with open(input_file, 'rb') as f_in, gzip.open(output_file, 'wb') as f_out:
    f_out.write(f_in.read())
It works readily with a 1GB file. But I plan on compressing files up to 200 GB.
Are there any considerations I need to take into account? Is there a different way I should be doing it with large files like that?
The files are binary .img files (exports of a block device; usually with empty space at the end, thus the compression works wonderfully).
This will read the entire file into memory, causing problems for you if you don't have 200G available!
You may be able to simply pipe the file through gzip, avoiding Python entirely; gzip will handle doing the work in chunks:
% gzip -c myfile.img > myfile.img.gz
Otherwise you should read the file in chunks (picking a large block size may provide some benefit)
import gzip

BLOCK_SIZE = 8192

with open(myfile, "rb") as f_in, gzip.open(output_file, 'wb') as f_out:
    while True:
        content = f_in.read(BLOCK_SIZE)
        if not content:
            break
        f_out.write(content)
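As an aside (not from the original answer), shutil.copyfileobj does the same chunked copy with less code, using the same myfile / output_file names as above:

import gzip
import shutil

# copyfileobj copies in fixed-size chunks, so the whole file is never
# held in memory at once.
with open(myfile, "rb") as f_in, gzip.open(output_file, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)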
I'm trying to replicate in Python this Bash command, which returns each file gzipped at 50MB each:
split -b 50m "file.dat.gz" "file.dat.gz.part-"
My attempt at the Python equivalent:
import gzip
infile_name = "file.dat.gz"
chunk = 50*1024*1024 # 50MB
with gzip.open(infile_name, 'rb') as infile:
    for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), b'')):
        print(n, chunk)
        with gzip.open('{}.part-{}'.format(infile_name[:-3], n), 'wb') as outfile:
            outfile.write(raw_bytes)
This returns files of about 15MB each, gzipped. When I gunzip the files, they are 50MB each.
How do I split the gzipped file in Python so that the split-up files are each 50MB before gunzipping?
I don't believe that split works the way you think it does. It doesn't split the gzip file into smaller gzip files. I.e. you can't call gunzip on the individual files it creates. It literally breaks up the data into smaller chunks and if you want to gunzip it, you have to concatenate all the chunks back together first. So, to emulate the actual behavior with Python, we'd do something like:
infile_name = "file.dat.gz"
chunk = 50*1024*1024 # 50MB
with open(infile_name, 'rb') as infile:
    for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), b'')):
        print(n, chunk)
        with open('{}.part-{}'.format(infile_name[:-3], n), 'wb') as outfile:
            outfile.write(raw_bytes)
In reality we'd read multiple smaller input chunks to make one output chunk to use less memory.
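To gunzip the data later, the parts first have to be concatenated back into a single .gz file; a minimal sketch, assuming the file.dat.part-N naming produced above:

import glob

# Concatenate the parts, in numeric order, back into one gzip file
# that gunzip can read.
parts = sorted(glob.glob('file.dat.part-*'),
               key=lambda name: int(name.rsplit('-', 1)[-1]))
with open('file.dat.gz', 'wb') as whole:
    for part_name in parts:
        with open(part_name, 'rb') as part:
            whole.write(part.read())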
We might be able to break the file into smaller files that we can individually gunzip, and still hit our target size. Using something like a BytesIO stream, we could gunzip the file and gzip it into that memory stream until it reached the target size, then write it out and start a new BytesIO stream, as in the sketch below.
With compressed data, you have to measure the size of the output, not the size of the input, as we can't predict how well the data will compress.
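A rough sketch of that idea (recompress_split, TARGET_SIZE and READ_SIZE are illustrative names; parts will usually come out somewhat over the target because the compressor buffers data internally):

import gzip
import io

TARGET_SIZE = 50 * 1024 * 1024  # aim for ~50MB of compressed output per part
READ_SIZE = 1024 * 1024         # read 1MB of decompressed data at a time

def recompress_split(infile_name, prefix):
    part = 0
    buf = io.BytesIO()
    gz_out = gzip.GzipFile(fileobj=buf, mode='wb')
    with gzip.open(infile_name, 'rb') as infile:
        while True:
            data = infile.read(READ_SIZE)
            if data:
                gz_out.write(data)
            # Close out the current part when the compressed output reaches
            # the target size, or when the input is exhausted.
            if not data or buf.tell() >= TARGET_SIZE:
                gz_out.close()
                with open('{}.part-{}'.format(prefix, part), 'wb') as outfile:
                    outfile.write(buf.getvalue())
                part += 1
                if not data:
                    break
                buf = io.BytesIO()
                gz_out = gzip.GzipFile(fileobj=buf, mode='wb')

Each part written this way is an independent, valid gzip file, so the pieces can be gunzipped individually.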
Here's a solution for emulating something like the split -l (split on lines) command option that will allow you to open each individual file with gunzip.
import io
import os
import shutil
from xopen import xopen
def split(infile_name, num_lines):
    infile_name_fp = infile_name.split('/')[-1].split('.')[0]  # get first part of file name
    cur_dir = '/'.join(infile_name.split('/')[0:-1])
    out_dir = f'{cur_dir}/{infile_name_fp}_split'
    if os.path.exists(out_dir):
        shutil.rmtree(out_dir)
    os.makedirs(out_dir)  # create in same folder as the original .csv.gz file

    m = 0
    part = 0
    buf = io.StringIO()  # initialize buffer
    with xopen(infile_name, 'rt') as infile:
        for line in infile:
            if m < num_lines:  # fill up buffer
                buf.write(line)
                m += 1
            else:  # write buffer to file
                with xopen(f'{out_dir}/{infile_name_fp}.part-{str(part).zfill(5)}.csv.gz', mode='wt', compresslevel=6) as outfile:
                    outfile.write(buf.getvalue())
                part += 1
                buf = io.StringIO()  # fresh buffer -> faster than seek(0); truncate(0)
                buf.write(line)  # keep the line that triggered the flush
                m = 1
    # write whatever is left in the buffer to a final file
    with xopen(f'{out_dir}/{infile_name_fp}.part-{str(part).zfill(5)}.csv.gz', mode='wt', compresslevel=6) as outfile:
        outfile.write(buf.getvalue())
    buf.close()
Usage:
split('path/to/myfile.csv.gz', num_lines=100000)
Outputs a folder with split files at path/to/myfile_split.
Discussion: I've used xopen here for additional speed, but you may choose to use gzip.open if you want to stay with Python native packages. Performance-wise, I've benchmarked this to take about twice as long as a solution combining pigz and split. It's not bad, but could be better. The bottleneck is the for loop and the buffer, so maybe rewriting this to work asynchronously would be more performant.
So the task was to compress a .txt file, which I did; here is the code for that:
import gzip
import shutil
with open('dictionary.txt', 'rb') as f_input, gzip.open('dictionary.txt.gz', 'wb') as f_output:
    shutil.copyfileobj(f_input, f_output)
Now my task is to open that compressed file and recreate the full text, including punctuation and capitalization.
Here is what I've got at the moment, but it's not quite working :/ I feel like something's very wrong with it.
import gzip
with gzip.open('dictionary.txt.gz', 'rb') as f:
    file_content = f.read()
If anyone can see why this isn't working, that would be very much appreciated.
I just need to open a compressed file and recreate the full text.
(An image in the original post showed the contents of the compressed file and the .txt file.)
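For reference, a minimal sketch of reading the compressed text back (assuming dictionary.txt is UTF-8 encoded):

import gzip

# Text mode ('rt') decompresses and decodes in one step.
with gzip.open('dictionary.txt.gz', 'rt', encoding='utf-8') as f:
    file_content = f.read()

print(file_content)  # the full text, punctuation and capitalization intact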
I want to compress files and compute the checksum of the compressed file using python. My first naive attempt was to use 2 functions:
def compress_file(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    f_out = gzip.open(output_filename, 'wb')
    f_out.writelines(f_in)
    f_out.close()
    f_in.close()

def md5sum(filename):
    with open(filename) as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    return md5
However, it leads to the compressed file being written and then re-read. With many files (> 10,000), each several MB when compressed, on an NFS-mounted drive, it is slow.
How can I compress the file in a buffer and then compute the checksum from this buffer before writing the output file?
The files are not that big, so I can afford to store everything in memory. However, a nice incremental version would be good too.
The last requirement is that it should work with multiprocessing (in order to compress several files in parallel).
I have tried to use zlib.compress, but the returned string is missing the header of a gzip file.
Edit: following @abarnert's suggestion, I used Python 3's gzip.compress:
def compress_md5(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    # Read in buffer
    buff = f_in.read()
    f_in.close()
    # Compress this buffer
    c_buff = gzip.compress(buff)
    # Compute MD5
    md5 = hashlib.md5(c_buff).hexdigest()
    # Write compressed buffer
    f_out = open(output_filename, 'wb')
    f_out.write(c_buff)
    f_out.close()
    return md5
This produces a correct gzip file, but the output is different on each run (the md5 is different):
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'0d0eb6a5f3fe2c1f3201bc3360201f71'
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'8e4954ab5914a1dd0d8d0deb114640e5'
The gzip program doesn't have this problem:
$ gzip -c 4327_010.pdf | md5sum
8965184bc4dace5325c41cc75c5837f1 -
$ gzip -c 4327_010.pdf | md5sum
8965184bc4dace5325c41cc75c5837f1 -
I guess it's because the gzip module uses the current time by default when creating a file (the gzip program uses the modification time of the input file, I guess). There is no way to change that with gzip.compress.
I was thinking of creating a gzip.GzipFile in read/write mode, controlling the mtime, but there is no such mode for gzip.GzipFile.
Inspired by @zwol's suggestion, I wrote the following function, which correctly sets the filename and the OS (Unix) in the header:
def compress_md5(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    # Read data in buffer
    buff = f_in.read()
    # Create output buffer
    c_buff = cStringIO.StringIO()
    # Create gzip file
    input_file_stat = os.stat(input_filename)
    mtime = input_file_stat[8]
    gzip_obj = gzip.GzipFile(input_filename, mode="wb", fileobj=c_buff, mtime=mtime)
    # Compress data in memory
    gzip_obj.write(buff)
    # Close files
    f_in.close()
    gzip_obj.close()
    # Retrieve compressed data
    c_data = c_buff.getvalue()
    # Change OS value in the header
    c_data = c_data[0:9] + '\003' + c_data[10:]
    # Really write compressed data
    f_out = open(output_filename, "wb")
    f_out.write(c_data)
    f_out.close()
    # Compute MD5
    md5 = hashlib.md5(c_data).hexdigest()
    return md5
The output is the same across runs. Moreover, the output of file is the same as gzip's:
$ gzip -9 -c 4327_010.pdf > ref_max/4327_010.pdf.gz
$ file ref_max/4327_010.pdf.gz
ref_max/4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May 5 14:28:16 2015, max compression
$ file 4327_010.pdf.gz
4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May 5 14:28:16 2015, max compression
However, md5 is different:
$ md5sum 4327_010.pdf.gz ref_max/4327_010.pdf.gz
39dc3e5a52c71a25c53fcbc02e2702d5 4327_010.pdf.gz
213a599a382cd887f3c4f963e1d3dec4 ref_max/4327_010.pdf.gz
gzip -l is also different:
$ gzip -l ref_max/4327_010.pdf.gz 4327_010.pdf.gz
compressed uncompressed ratio uncompressed_name
7286404 7600522 4.1% ref_max/4327_010.pdf
7297310 7600522 4.0% 4327_010.pdf
I guess it's because the gzip program and the Python gzip module (which is based on the C library zlib) use slightly different algorithms.
Wrap a gzip.GzipFile object around an io.BytesIO object. (In Python 2, use cStringIO.StringIO instead.) After you close the GzipFile, you can retrieve the compressed data from the BytesIO object (using getvalue), hash it, and write it out to a real file.
Incidentally, you really shouldn't be using MD5 at all anymore.
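A minimal sketch of that recipe (compress_and_hash is an illustrative name, not from the original answer):

import gzip
import hashlib
import io

def compress_and_hash(input_filename, output_filename):
    # Compress into an in-memory buffer instead of a real file.
    buf = io.BytesIO()
    with open(input_filename, 'rb') as f_in:
        with gzip.GzipFile(input_filename, mode='wb', fileobj=buf) as gz:
            gz.write(f_in.read())
    # The buffer now holds the complete gzip stream (header + trailer).
    compressed = buf.getvalue()
    digest = hashlib.md5(compressed).hexdigest()  # see the MD5 caveat above
    with open(output_filename, 'wb') as f_out:
        f_out.write(compressed)
    return digest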
I have tried to use zlib.compress, but the returned string is missing the header of a gzip file.
Of course. That's the whole difference between the zlib module and the gzip module; zlib just deals with zlib-deflate compression without gzip headers, gzip deals with zlib-deflate data with gzip headers.
So, just call gzip.compress instead, and the code you wrote but didn't show us should just work.
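For example, a sketch (not the asker's actual code). Note that on Python 3.8 and later, gzip.compress also accepts an mtime argument, which helps with the reproducibility issue discussed above:

import gzip
import hashlib

with open(input_filename, 'rb') as f:
    data = f.read()

# gzip.compress produces a complete gzip stream, header and all.
# A fixed mtime (Python 3.8+) makes the compressed bytes reproducible.
compressed = gzip.compress(data, mtime=0)
md5 = hashlib.md5(compressed).hexdigest()

with open(output_filename, 'wb') as f:
    f.write(compressed)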
As a side note:
with open(filename) as f:
    md5 = hashlib.md5(f.read()).hexdigest()
You almost certainly want to open the file in 'rb' mode here. You don't want to convert '\r\n' into '\n' (if on Windows), or decode the binary data as sys.getdefaultencoding() text (if on Python 3), so open it in binary mode.
Another side note:
Don't use line-based APIs on binary files. Instead of this:
f_out.writelines(f_in)
… do this:
f_out.write(f_in.read())
Or, if the files are too large to read into memory all at once:
from functools import partial

for buf in iter(partial(f_in.read, 8192), b''):
    f_out.write(buf)
And one last point:
With many files (> 10 000), each several MB when compressed, in a NFS mounted drive, it is slow.
Does your system not have a tmp directory mounted on a faster drive?
In most cases, you don't need a real file. Either there's a string-based API (zlib.compress, gzip.compress, json.dumps, etc.), or the file-based API only requires a file-like object, like a BytesIO.
But when you do need a real temporary file, with a real file descriptor and everything, you almost always want to create it in the temporary directory.* In Python, you do this with the tempfile module.
For example:
def compress_and_md5(filename):
    with tempfile.NamedTemporaryFile() as f_out:
        with open(filename, 'rb') as f_in:
            with gzip.open(f_out, 'wb') as g_out:
                g_out.write(f_in.read())
        f_out.seek(0)
        md5 = hashlib.md5(f_out.read()).hexdigest()
        return md5
If you need an actual filename, rather than a file object, you can use f_out.name.
* The one exception is when you only want the temporary file to eventually rename it to a permanent location. In that case, of course, you usually want the temporary file to be in the same directory as the permanent location. But you can do that with tempfile just as easily. Just remember to pass delete=False.
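A sketch of that pattern (write_atomically is an illustrative name):

import os
import tempfile

def write_atomically(data, final_path):
    # Write to a temporary file in the destination directory, then rename it
    # into place; delete=False keeps the file around after the handle closes.
    dest_dir = os.path.dirname(final_path) or '.'
    with tempfile.NamedTemporaryFile(dir=dest_dir, delete=False) as tmp:
        tmp.write(data)
        tmp_name = tmp.name
    os.replace(tmp_name, final_path)  # atomic when on the same filesystem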
I am attempting to unzip files of various sizes (some are 4GB or above) using Python; however, I have noticed on several occasions, especially when the files are extremely large, that the file fails to unzip. When I open the resulting file it is empty. Below is the code I am using - is there anything wrong with my approach?
inF = gzip.open(localFile, 'rb')
localFile = localFile[:-3]
outF = open(localFile, 'wb')
outF.write( inF.read() )
inF.close()
outF.close()
In this case it looks like you don't need Python to do any processing on the file you read in, so you might be better off just using subprocess.Popen:
from subprocess import Popen
Popen('gunzip -c %s > %s' % (infilename, outfilename), shell=True).wait()
Note that shell=True is needed here because of the output redirection, but other than that it should be good.
Another solution for large .zip files (works on Ubuntu 16.04.4).
First install 7z:
sudo apt-get install p7zip-full
Then, in your Python code, call 7z with:
import subprocess
subprocess.call(['7z', 'x', src_file, '-o'+target_dir])
This code loops over blocks of input data, writing each block to an output file. This way we don't read the entire input into memory at once, conserving memory and avoiding mysterious crashes.
import gzip, os

localFile = 'cat.gz'
outFile = os.path.splitext(localFile)[0]
print('Unzipping {} to {}'.format(localFile, outFile))

with gzip.open(localFile, 'rb') as inF:
    with open(outFile, 'wb') as outF:
        while True:
            block = inF.read(1024)
            if not block:
                break
            outF.write(block)