Unzipping large files using Python

I am attempting to unzip files of various sizes (some 4 GB or larger) using Python, but I have noticed that on several occasions, especially when the files are extremely large, a file fails to unzip: when I open the resulting file, it is empty. Below is the code I am using. Is there anything wrong with my approach?
import gzip

inF = gzip.open(localFile, 'rb')
localFile = localFile[:-3]  # strip the ".gz" extension
outF = open(localFile, 'wb')
outF.write(inF.read())
inF.close()
outF.close()

In this case it looks like you don't need Python to do any processing on the file you read in, so you might be better off just using subprocess.Popen:
from subprocess import Popen

# gunzip -c writes the decompressed data to stdout; the shell
# handles the redirection into the output file
Popen('gunzip -c %s > %s' % (infilename, outfilename), shell=True).wait()
Note that shell=True is needed here for the output redirection; a plain gunzip invocation without -c would instead decompress the file in place.

Another solution for large .zip files (works on Ubuntu 16.04.4).
First install 7z:
sudo apt-get install p7zip-full
Then in your Python code, call 7-Zip with:
import subprocess
subprocess.call(['7z', 'x', src_file, '-o'+target_dir])

This code loops over blocks of input data, writing each to an output file. That way we never read the entire input into memory at once, conserving memory and avoiding mysterious crashes.
import gzip, os

localFile = 'cat.gz'
outFile = os.path.splitext(localFile)[0]
print('Unzipping {} to {}'.format(localFile, outFile))

with gzip.open(localFile, 'rb') as inF, open(outFile, 'wb') as outF:
    # read and write in fixed-size blocks until EOF
    while True:
        block = inF.read(64 * 1024)
        if not block:
            break
        outF.write(block)
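Incidentally, the standard library can run this copy loop for you; a minimal sketch using shutil.copyfileobj with the same filenames as above:
import gzip, shutil

with gzip.open('cat.gz', 'rb') as inF, open('cat', 'wb') as outF:
    # copyfileobj reads and writes in chunks, keeping memory use bounded
    shutil.copyfileobj(inF, outF)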

Related

Python zlib-inflate alternative

I have a compressed file which I can uncompress at the Ubuntu command prompt using zlib-flate, as below:
zlib-flate -uncompress < inputfile > outfile
Here inputfile is the compressed file and outfile is the uncompressed version.
The compressed file contains binary data.
I have not found a way to do the same thing in Python.
Please advise.
If the entire file fits in memory, zlib can do exactly this in a very straightforward manner:
import zlib

with open("input_file", "rb") as input_file:
    input_data = input_file.read()

decompressed_data = zlib.decompress(input_data)

with open("output_file", "wb") as output_file:
    output_file.write(decompressed_data)
If the file is too large to fit in memory, you may want to use zlib.decompressobj() instead, which can do streaming but isn't quite as straightforward.
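For illustration, a minimal streaming sketch along those lines; the filenames and chunk size are placeholder assumptions:
import zlib

d = zlib.decompressobj()
with open("input_file", "rb") as src, open("output_file", "wb") as dst:
    # feed the decompressor one chunk at a time
    while True:
        chunk = src.read(64 * 1024)
        if not chunk:
            break
        dst.write(d.decompress(chunk))
    # flush whatever remains buffered in the decompressor
    dst.write(d.flush())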

How to *properly* compress and decompress a text file using bz2 and Python

So I've had this system that scrapes and compresses files for a while now using bz2 compression. The way it does so is using the following block of code I found on SO a few months back:
Let's assume for the purposes of this post the filename is always file.XXXX where XXXX is the relevant extension. We start with .txt
### How to compress a text file
import bz2

filepath_compressed = "file.tar.bz2"
with open("file.txt", 'rb') as data:
    tarbz2contents = bz2.compress(data.read(), 9)
with bz2.BZ2File(filepath_compressed, 'wb') as f_comp:
    f_comp.write(tarbz2contents)
Now, to decompress it, I've always gotten it to work using decompression software I have called Keka, which decompresses the .tar.bz2 file to .tar; then I run it through Keka again to get an "extensionless" file, to which I then add a .txt on my Mac, and then it works.
Now, to decompress programmatically, I've tried a few things. I've tried the stuff from this post and the code from this post. I've tried using BZ2Decompressor and BZ2File and everything. I just seem to be missing something, and I'm not sure what it is.
Here is what I have so far, and I'd like to know what is wrong with this code:
import bz2, tarfile, shutil

# Decompress to tar
with bz2.BZ2File("file.tar.bz2") as fr, open("file.tar", "wb") as fw:
    shutil.copyfileobj(fr, fw)

# Decompress from tar to txt
with tarfile.open("file.tar", "r:") as tar:
    tar.extractall("file_out.txt")
This code crashes with a "tarfile.ReadError: truncated header" error. I think the first context manager outputs a binary text file, and I tried decoding that, but that failed too. What am I missing here? I feel like a noob.
If you would like a minimal runnable piece of code to replicate this, add the following to make a dummy file:
lines = ["Line 1", "Line 2", "Line 3"]
with open("file.txt", "w") as f:
    for line in lines:
        f.write(line + "\n")
The thing that you're making is not a .tar.bz2 file, but rather a .bz2.bz2 file. You are compressing twice with bzip2 (the second time with no effect), and there is no tar file generation anywhere to be seen.
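A hedged sketch of the two obvious fixes, reusing the file.txt dummy file from above. Either compress once to a plain .bz2, or let tarfile build a genuine .tar.bz2 in one step:
import bz2, shutil, tarfile

# Option 1: a plain .bz2 (single compression, no tar layer at all)
with open("file.txt", "rb") as src, bz2.open("file.txt.bz2", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Option 2: a real .tar.bz2; tarfile handles both the tar and bzip2 layers
with tarfile.open("file.tar.bz2", "w:bz2") as tar:
    tar.add("file.txt")

# Reading the real .tar.bz2 back is then symmetric
with tarfile.open("file.tar.bz2", "r:bz2") as tar:
    tar.extractall(".")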

Using GZIP Module with Python

I'm trying to use the Python GZIP module to simply uncompress several .gz files in a directory. Note that I do not want to read the files, only uncompress them. After searching this site for a while, I have this code segment, but it does not work:
import gzip
import glob
import os
import shutil

for file in glob.glob(PATH_TO_FILE + "/*.gz"):
    #print file
    if os.path.isdir(file) == False:
        shutil.copy(file, FILE_DIR)
        # uncompress the file
        inF = gzip.open(file, 'rb')
        s = inF.read()
        inF.close()
The .gz files are in the correct location, and I can print the full path plus filename with the print command, but the gzip module isn't getting executed properly. What am I missing?
If you get no error, the gzip module probably is being executed properly, and the file is already being decompressed.
The precise definition of "decompressed" varies with context:
I do not want to read the files, only uncompress them
The gzip module doesn't work like a desktop archiving program such as 7-Zip: you can't "uncompress" a file without "reading" it. Note that "reading" (in programming) usually just means "storing (temporarily) in the computer's RAM", not "opening the file in the GUI".
What you probably mean by "uncompress" (as in a desktop archiving program) is more precisely described (in programming) as "read an in-memory stream/buffer from a compressed file, and write it to a new file (and possibly delete the compressed file afterwards)":
inF = gzip.open(file, 'rb')
s = inF.read()
inF.close()
With these lines, you're just reading the stream. If you expect a new "uncompressed" file to be created, you just need to write the buffer to a new file:
with open(out_filename, 'wb') as out_file:
    out_file.write(s)
If you're dealing with very large files (larger than the amount of your RAM), you'll need to adopt a different approach. But that is the topic for another question.
You're decompressing the file into the s variable and then doing nothing with it. You should stop searching Stack Overflow and read at least the Python tutorial. Seriously.
Anyway, there are several things wrong with your code:
you need to STORE the unzipped data in s into some file.
there's no need to copy the actual *.gz files, because your code unpacks the original gzip file, not the copy.
you're using file, which is a built-in name, as a variable. This is not an error, just a very bad practice.
This should probably do what you wanted:
import gzip
import glob
import os
import os.path

for gzip_path in glob.glob(PATH_TO_FILE + "/*.gz"):
    if os.path.isdir(gzip_path) == False:
        inF = gzip.open(gzip_path, 'rb')
        # uncompress the gzip_path INTO THE 's' variable
        s = inF.read()
        inF.close()

        # get gzip filename (without directories)
        gzip_fname = os.path.basename(gzip_path)
        # get original filename (remove 3 characters from the end: ".gz")
        fname = gzip_fname[:-3]
        uncompressed_path = os.path.join(FILE_DIR, fname)

        # store uncompressed file data from 's' ('wb', since s is bytes)
        open(uncompressed_path, 'wb').write(s)
You should use with to open files and, of course, store the result of reading the compressed file. See the gzip documentation:
import gzip
import glob
import os
import os.path

for gzip_path in glob.glob("%s/*.gz" % PATH_TO_FILE):
    if not os.path.isdir(gzip_path):
        with gzip.open(gzip_path, 'rb') as in_file:
            s = in_file.read()

        # Now store the uncompressed data ('wb', since s is bytes)
        path_to_store = gzip_path[:-3]  # remove the '.gz' from the filename
        with open(path_to_store, 'wb') as f:
            f.write(s)
Depending on what exactly you want to do, you might want to have a look at tarfile and its 'r:gz' option for opening files.
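For instance, a minimal sketch (the archive name and output directory are placeholders) for extracting a gzip-compressed tarball:
import tarfile

# "r:gz" opens a gzip-compressed tar archive for reading
with tarfile.open("archive.tar.gz", "r:gz") as tar:
    tar.extractall("output_dir")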
I was able to resolve this issue by using the subprocess module:
import glob
import os
import shutil
import subprocess

for file in glob.glob(PATH_TO_FILE + "/*.gz"):
    if os.path.isdir(file) == False:
        shutil.copy(file, FILE_DIR)
        # uncompress the file
        subprocess.call(["gunzip", FILE_DIR + "/" + os.path.basename(file)])
Since my goal was simply to uncompress the archive, the above code accomplishes this. The archived files are located in a central location and are copied to a working area, uncompressed, and used in a test case. The gzip module was too complicated for what I was trying to accomplish.
Thanks for everyone's help. It is much appreciated!
I think there is a much simpler solution than the others presented, given that the OP only wanted to extract all the files in a directory:
import glob
from setuptools import archive_util

for fn in glob.glob('*.gz'):
    archive_util.unpack_archive(fn, '.')

Sorting a file in place with Python on a Unix system

I'm sorting a text file from Python using a custom Unix command that takes a filename as input (or reads from stdin) and writes to stdout. I'd like to sort myfile and keep the sorted version in its place. Is the best way to do this from Python to make a temporary file? My current solution is:
import os

inputfile = "myfile"
# inputfile: filename to be sorted
tmpfile = "%s.tmp_file" % (inputfile)
cmd = "mysort %s > %s" % (inputfile, tmpfile)
# run the sort command, writing its output to the temporary file
os.system(cmd)
# rename sorted file back to the original filename
os.rename(tmpfile, inputfile)
Is this the best solution? Thanks.
If you don't want to create temporary files, you can use subprocess, as in:
import sys
import subprocess

fname = sys.argv[1]
proc = subprocess.Popen(['sort', fname], stdout=subprocess.PIPE)
stdout, _ = proc.communicate()
# communicate() returns bytes, so write the file back in binary mode
with open(fname, 'wb') as f:
    f.write(stdout)
You either create a temporary file, or you'll have to read the whole file into memory and pipe it to your command.
The best solution is to use os.replace because it would work on Windows too.
This is not really what I regard as "in-place sorting", though. Usually, in-place sorting means that you actually exchange single elements in the list without making copies. You are making a copy, since the sorted list has to be built completely before you can overwrite the original. If your files get very large, this obviously won't work anymore. You'd probably need to choose between atomicity and in-place-ness at that point.
If your Python is too old to have os.replace, there are lots of resources in the bug that added os.replace.
For other uses of temporary files, you can consider using the tempfile module, but I don't think it would gain you much in this case.
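A minimal sketch combining those points, assuming the question's mysort command and filename; the temporary file is created in the same directory so the final os.replace can stay atomic:
import os
import subprocess
import tempfile

fname = "myfile"
fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(fname)))
with os.fdopen(fd, "wb") as out:
    # write the sorted output to the temporary file
    subprocess.check_call(["mysort", fname], stdout=out)
# atomically swap the sorted copy into place (works on Windows too)
os.replace(tmp, fname)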
You could try a truncate-write pattern:
# 'model' here is a placeholder for whatever object reads,
# transforms, and writes back your data
with open(filename, 'r') as f:
    model.read(f)
model.process()
with open(filename, 'w') as f:
    model.write(f)
Note that this is non-atomic: the second open truncates the file, so a crash before the write completes loses the original data.
This entry describes some pros/cons of updating files in Python:
http://blog.gocept.com/2013/07/15/reliable-file-updates-with-python/

Python: Closing and removing files

I'm trying to unzip a file, read one of the extracted files, and then delete the extracted files.
Files extracted (e.g. we got file1 and file2)
Read file1, and close it.
with open(file1, 'r') as f:
    data = f.readline()
f.close()
Do something with the "data".
Remove the files extracted.
os.remove(file1)
Everything went fine, except that I got these messages at the end. The files were also removed. How do I close the files properly?
/tmp/file1: No such file or directory
140347508795048:error:02001002:system library:fopen:No such file or directory:bss_file.c:398:fopen('/tmp/file1','r')
140347508795048:error:20074002:BIO routines:FILE_CTRL:system lib:bss_file.c:400:
UPDATE:
(My script looks similar to these)
#!/usr/bin/python
import subprocess, os
infile = "filename.enc"
outfile = "filename.dec"
opensslCmd = "openssl enc -a -d -aes-256-cbc -in %s -out %s" % (infile, outfile)
subprocess.Popen(opensslCmd, shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE, close_fds=True)
os.remove(infile)
There is no need to close a file handle when using with as a file context manager; the handle is closed automatically when the with block is exited.
See the Python tutorial.
The errors you see are not errors as Python would report them. They mean that something other than Python tried to open these files, although it's hard to tell exactly what from your little snippet. One likely culprit: in your update, subprocess.Popen returns immediately without waiting for the command to finish, so os.remove(infile) can delete the file before the openssl process has opened it.
If you're simply trying to retrieve some data from a zip file, there isn't really a reason to extract the files to disk. You can simply read the data directly from the zip file, extracting to memory only, with zipfile.ZipFile.open.
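A minimal sketch of that approach, assuming an archive named archive.zip containing a member file1 (both names are placeholders):
import zipfile

with zipfile.ZipFile("archive.zip") as zf:
    # open the member as a file-like object; nothing is written to disk
    with zf.open("file1") as f:
        data = f.readline()
# no extracted files to clean up afterwards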
