How do I automatically handle decompression when reading a file in Python? - python

I am writing some Python code that loops through a number of files and processes the first few hundred lines of each file. I would like to extend this code so that if any of the files in the list are compressed, it will automatically decompress while reading them, so that my code always receives the decompressed lines. Essentially my code currently looks like:
for f in files:
    handle = open(f)
    process_file_contents(handle)
Is there any function that can replace open in the above code so that if f is either plain text or gzip-compressed text (or bzip2, etc.), the function will always return a file handle to the decompressed contents of the file? (No seeking required, just sequential access.)

I had the same problem: I'd like my code to accept filenames and return a filehandle to be used in a with statement, with decompression handled automatically, and so on.
In my case, I'm willing to trust the filename extensions, and I only need to deal with gzip and maybe bzip2 files.
import gzip
import bz2

def open_by_suffix(filename):
    if filename.endswith('.gz'):
        return gzip.open(filename, 'rb')
    elif filename.endswith('.bz2'):
        return bz2.BZ2File(filename, 'r')
    else:
        return open(filename, 'r')
If we don't trust the filenames, we can compare the initial bytes of the file for magic strings (modified from https://stackoverflow.com/a/13044946/117714):
import gzip
import bz2

magic_dict = {
    "\x1f\x8b\x08": (gzip.open, 'rb'),
    "\x42\x5a\x68": (bz2.BZ2File, 'r'),
}

max_len = max(len(x) for x in magic_dict)

def open_by_magic(filename):
    with open(filename) as f:
        file_start = f.read(max_len)
    for magic, (fn, flag) in magic_dict.items():
        if file_start.startswith(magic):
            return fn(filename, flag)
    return open(filename, 'r')
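On Python 3, the magic numbers need to be bytes and the sniffing read has to happen in binary mode; a minimal sketch of that variant (the name open_by_magic_py3 is just illustrative):
import gzip
import bz2

# Python 3 variant: magic numbers are bytes, and the file is sniffed in binary mode.
magic_dict = {
    b"\x1f\x8b\x08": (gzip.open, 'rt'),   # gzip
    b"\x42\x5a\x68": (bz2.open, 'rt'),    # bzip2 ("BZh")
}
max_len = max(len(magic) for magic in magic_dict)

def open_by_magic_py3(filename):
    with open(filename, 'rb') as f:
        file_start = f.read(max_len)
    for magic, (opener, mode) in magic_dict.items():
        if file_start.startswith(magic):
            return opener(filename, mode)  # yields decompressed text lines
    return open(filename, 'r')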
Usage:
# cat-like usage: print every (decompressed) line
for filename in filenames:
    with open_by_suffix(filename) as f:
        for line in f:
            print line
Your use-case would look like:
for f in files:
    with open_by_suffix(f) as handle:
        process_file_contents(handle)

Related

Converting a .csv.gz to .csv in Python 2.7

I have read the documentation and a few additional posts on SO and other various places, but I can't quite figure out this concept:
When you call csvFilename = gzip.open(filename, 'rb') then reader = csv.reader(open(csvFilename)), is that reader not a valid csv file?
I am trying to solve the problem outlined below, and am getting a "coercing to Unicode: need string or buffer, GzipFile found" error on lines 41 and 7 (highlighted below), leading me to believe that gzip.open and csv.reader do not work as I had previously thought.
Problem I am trying to solve
I am trying to take a results.csv.gz and convert it to a results.csv so that I can turn the results.csv into a python dictionary and then combine it with another python dictionary.
File 1:
alertFile = payload.get('results_file')
alertDataCSV = rh.dataToDict(alertFile) # LINE 41
alertDataTotal = rh.mergeTwoDicts(splunkParams, alertDataCSV)
Calls File 2:
import gzip
import csv

def dataToDict(filename):
    csvFilename = gzip.open(filename, 'rb')
    reader = csv.reader(open(csvFilename))  # LINE 7
    alertData = {}
    for row in reader:
        alertData[row[0]] = row[1:]
    return alertData

def mergeTwoDicts(dictA, dictB):
    dictC = dictA.copy()
    dictC.update(dictB)
    return dictC
*edit: also forgive my non-PEP style of naming in Python
gzip.open returns a file-like object (the same kind of thing plain open returns), not the name of a decompressed file. Simply pass that object directly to csv.reader and it will work: csv.reader will receive the decompressed lines. csv does expect text though, so on Python 3 you need to open the gzip file in text mode (on Python 2 'rb' is fine; gzip doesn't deal with encodings, but then, neither does the csv module). Simply change:
csvFilename = gzip.open(filename, 'rb')
reader = csv.reader(open(csvFilename))
to:
# Python 2
csvFile = gzip.open(filename, 'rb')
reader = csv.reader(csvFile) # No reopening involved
# Python 3
csvFile = gzip.open(filename, 'rt', newline='') # Open in text mode, not binary, no line ending translation
reader = csv.reader(csvFile) # No reopening involved
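Put together, the dataToDict helper from the question could look roughly like this on Python 3 (a sketch assuming the CSV is UTF-8; names mirror the question):
import csv
import gzip

def dataToDict(filename):
    # gzip.open already returns a file-like object; there is no need to open() it again.
    with gzip.open(filename, 'rt', newline='') as csvFile:  # text mode for csv on Python 3
        alertData = {}
        for row in csv.reader(csvFile):
            alertData[row[0]] = row[1:]
    return alertData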
The following worked for me for python==3.7.9:
import gzip

my_filename = 'my_compressed_file.csv.gz'

with gzip.open(my_filename, 'rt') as gz_file:
    data = gz_file.read()  # read decompressed data

with open(my_filename[:-3], 'wt') as out_file:
    out_file.write(data)  # write decompressed data
my_filename[:-3] strips the .gz suffix so that the output file keeps the original name rather than getting an arbitrary one.
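For large files it may be preferable not to hold the whole decompressed payload in memory; a streaming sketch using shutil.copyfileobj (same filename assumption as above):
import gzip
import shutil

my_filename = 'my_compressed_file.csv.gz'  # hypothetical input path

# Stream the decompressed bytes to the output file in chunks instead of reading it all at once.
with gzip.open(my_filename, 'rb') as gz_file, open(my_filename[:-3], 'wb') as out_file:
    shutil.copyfileobj(gz_file, out_file)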

Extract gzip file without BOM in Python 3.6

I have multiple gz files in subfolders that I want to unzip into one folder. It works fine, but there's a BOM signature at the beginning of each file that I would like to have removed. I have checked other questions like Removing BOM from gzip'ed CSV in Python or Convert UTF-8 with BOM to UTF-8 with no BOM in Python, but it doesn't seem to work. I use Python 3.6 in PyCharm on Windows.
Here's first my code without attempt:
import gzip
import pickle
import glob
def save_object(obj, filename):
    with open(filename, 'wb') as output:  # Overwrites any existing file.
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

output_path = 'path_out'
i = 1
for filename in glob.iglob('path_in/**/*.gz', recursive=True):
    print(filename)
    with gzip.open(filename, 'rb') as f:
        file_content = f.read()
        new_file = output_path + "z" + str(i) + ".txt"
        save_object(file_content, new_file)
        f.close()
        i += 1
Now, with the logic defined in Removing BOM from gzip'ed CSV in Python (at least what I understand of it) if I replace file_content = f.read() by file_content = csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines()), I get:
TypeError: can't pickle _csv.reader objects
I checked for this error (e.g. "Can't pickle <type '_csv.reader'>" error when using multiprocessing on Windows) but I found no solution I could apply.
A minor adaptation of the very first question you link to trivially works.
tripleee$ cat bomgz.py
import gzip
from subprocess import run

with open('bom.txt', 'w') as handle:
    handle.write('\ufeffmoo!\n')

run(['gzip', 'bom.txt'])

with gzip.open('bom.txt.gz', 'rb') as f:
    file_content = f.read().decode('utf-8-sig')

with open('nobom.txt', 'w') as output:
    output.write(file_content)

tripleee$ python3 bomgz.py
tripleee$ gzip -dc bom.txt.gz | xxd
00000000: efbb bf6d 6f6f 210a ...moo!.
tripleee$ xxd nobom.txt
00000000: 6d6f 6f21 0a moo!.
The pickle parts didn't seem relevant here but might have been obscuring the goal of getting a block of decoded str out of an encoded blob of bytes.
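An alternative sketch (not part of the answer above) is to let gzip.open do the decoding by opening in text mode with the utf-8-sig codec, which strips a leading BOM while reading:
import gzip

# 'rt' opens the compressed stream in text mode; 'utf-8-sig' consumes a leading BOM if present.
with gzip.open('bom.txt.gz', 'rt', encoding='utf-8-sig') as f:
    file_content = f.read()

with open('nobom.txt', 'w', encoding='utf-8') as output:
    output.write(file_content)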

Sending multiple .CSV files to .ZIP without storing to disk in Python

I'm working on a reporting application for my Django powered website. I want to run several reports and have each report generate a .csv file in memory that can be downloaded in batch as a .zip. I would like to do this without storing any files to disk. So far, to generate a single .csv file, I am following the common operation:
mem_file = StringIO.StringIO()
writer = csv.writer(mem_file)
writer.writerow(["My content", my_value])
mem_file.seek(0)
response = HttpResponse(mem_file, content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename=my_file.csv'
This works fine, but only for a single, unzipped .csv. If I had, for example, a list of .csv files created with a StringIO stream:
firstFile = StringIO.StringIO()
# write some data to the file
secondFile = StringIO.StringIO()
# write some data to the file
thirdFile = StringIO.StringIO()
# write some data to the file
myFiles = [firstFile, secondFile, thirdFile]
How could I return a compressed file that contains all objects in myFiles and can be properly unzipped to reveal three .csv files?
zipfile is a standard library module that does exactly what you're looking for. For your use-case, the meat and potatoes is a method called "writestr" that takes the name of a file and the data you'd like it to contain inside the zip.
In the code below, I've used a sequential naming scheme for the files when they're unzipped, but this can be switched to whatever you'd like.
import zipfile
import StringIO

zipped_file = StringIO.StringIO()
with zipfile.ZipFile(zipped_file, 'w') as zip:
    for i, file in enumerate(files):
        file.seek(0)
        zip.writestr("{}.csv".format(i), file.read())
zipped_file.seek(0)
If you want to future-proof your code (hint hint Python 3 hint hint), you might want to switch over to using io.BytesIO instead of StringIO, since Python 3 is all about the bytes. Another bonus is that explicit seeks are not necessary with io.BytesIO before reads (I haven't tested this behavior with Django's HttpResponse, so I've left that final seek in there just in case).
import io
import zipfile

zipped_file = io.BytesIO()
with zipfile.ZipFile(zipped_file, 'w') as f:
    for i, file in enumerate(files):
        f.writestr("{}.csv".format(i), file.getvalue())
zipped_file.seek(0)
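Returning the in-memory zip from a Django view would then look roughly like this (a sketch; the view name and the build_report_buffers helper are placeholders for your own report code):
import io
import zipfile

from django.http import HttpResponse

def download_reports(request):  # hypothetical view
    files = build_report_buffers()  # hypothetical helper returning a list of BytesIO objects
    zipped_file = io.BytesIO()
    with zipfile.ZipFile(zipped_file, 'w') as zf:
        for i, f in enumerate(files):
            zf.writestr("{}.csv".format(i), f.getvalue())
    response = HttpResponse(zipped_file.getvalue(), content_type='application/zip')
    response['Content-Disposition'] = 'attachment; filename=reports.zip'
    return response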
The stdlib comes with the module zipfile, and the main class, ZipFile, accepts a file or file-like object:
import StringIO
from zipfile import ZipFile

temp_file = StringIO.StringIO()
zipped = ZipFile(temp_file, 'w')

# create temp csv_files = [(name1, data1), (name2, data2), ... ]
for name, data in csv_files:
    data.seek(0)
    zipped.writestr(name, data.read())
zipped.close()

temp_file.seek(0)
# etc. etc.
I'm not a user of StringIO so I may have the seek and read out of place, but hopefully you get the idea.
import zipfile
from StringIO import StringIO  # use io.BytesIO() for Python 3

def zipFiles(files):
    outfile = StringIO()
    with zipfile.ZipFile(outfile, 'w') as zf:
        for n, f in enumerate(files):
            zf.writestr("{}.csv".format(n), f.getvalue())
    return outfile.getvalue()

zipped_file = zipFiles(myFiles)
response = HttpResponse(zipped_file, content_type='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename=my_file.zip'
StringIO has a getvalue method which returns the entire contents. You can compress the archive by passing zipfile.ZIP_DEFLATED: zipfile.ZipFile(outfile, 'w', zipfile.ZIP_DEFLATED). The default compression is ZIP_STORED, which creates the zip file without compressing.
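For example, a minimal sketch of an explicitly compressed in-memory archive (Python 3 names; zipfile also accepts a compresslevel argument from Python 3.7 onward):
import io
import zipfile

buf = io.BytesIO()
# ZIP_DEFLATED compresses each entry; the default ZIP_STORED only stores them uncompressed.
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("report.csv", "My content,42\n")
zip_bytes = buf.getvalue()  # ready to hand to HttpResponse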

gzip a file quicker using Python?

I am attempting to gzip files faster using Python, as some of my files are as small as 30 MB and as large as 4 GB.
Is there a more efficient way of creating a gzip file than the following? Is there a way to optimize the following so that if a file is small enough to be placed in memory it can simply just read the whole chunk of the file to be read rather than do it on a per line basis?
with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        f_out.writelines(f_in)
Copy the file in bigger chunks using the shutil.copyfileobj() function. In this example, I'm using 16 MiB blocks, which is pretty reasonable.
import gzip
import shutil

MEG = 2**20
with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out, length=16*MEG)
You may find that calling out to gzip is faster for large files, especially if you plan to zip multiple files in parallel.
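For instance, a rough sketch of shelling out to the system gzip, one process per file (assumes a gzip binary is on PATH and that files_to_compress is your own list of paths; gzip replaces each file with file.gz):
import subprocess

files_to_compress = ['big1.dat', 'big2.dat']  # hypothetical paths

# Launch one gzip process per file so they compress in parallel, then wait for all of them.
procs = [subprocess.Popen(['gzip', path]) for path in files_to_compress]
for p in procs:
    p.wait()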
Instead of reading it line by line, you can read it at once.
Example:
import gzip

with open(j, 'rb') as f_in:
    content = f_in.read()

f = gzip.open(j + '.gz', 'wb')
f.write(content)
f.close()
Below are two almost identical methods for reading gzip files:
A.) Load everything into memory --> can be a bad choice for very big files (several GB), because you can run out of memory
B.) Don't load everything into memory; read line by line --> good for big files
adapted from
https://codebright.wordpress.com/2011/03/25/139/
and
https://www.reddit.com/r/Python/comments/2olhrf/fast_gzip_in_python/
http://pastebin.com/dcEJRs1i
import sys
import subprocess

if sys.version.startswith("3"):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO
A.)
def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    fh = io_method(ph.communicate()[0])
    for line in fh:
        yield line
B.)
def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    for line in ph.stdout:
        yield line
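On Python 3, ph.stdout yields bytes; if you want str lines you could wrap the pipe in a TextIOWrapper (a sketch assuming UTF-8 text):
import io
import subprocess

def yield_text_line_gz_file(fn):
    # Same idea as B.), but decodes the gzcat output to str lines (assumes UTF-8 content).
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    for line in io.TextIOWrapper(ph.stdout, encoding="utf-8"):
        yield line
    ph.wait()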

Python zipfile, bizarre limit to number of files: "folder is invalid"

The computer is toying with me, I know it!
I am creating a zip folder in Python. The individual files are generated in memory and then the whole thing is zipped and saved to a file. I am allowed to add 9 files to the zip. I am allowed to add 11 files to the zip. But 10, no, not 10 files. The zip file IS saved to my computer, but I'm not allowed to open it; Windows says that the compressed zipped folder is invalid.
I use the code below, which I got from another stackoverflow question. It appends 10 files and saves the zipped folder. When I click on the folder, I cannot extract it. BUT, remove one of the appends() and it's fine. Or, add another append and it works!
What am I missing here? How can I make this work every time?
imz = InMemoryZip()
imz.append("1a.txt", "a").append("2a.txt", "a").append("3a.txt", "a").append("4a.txt", "a").append("5a.txt", "a").append("6a.txt", "a").append("7a.txt", "a").append("8a.txt", "a").append("9a.txt", "a").append("10a.txt", "a")
imz.writetofile("C:/path/test.zip")
import zipfile
import StringIO

class InMemoryZip(object):
    def __init__(self):
        # Create the in-memory file-like object
        self.in_memory_zip = StringIO.StringIO()

    def append(self, filename_in_zip, file_contents):
        '''Appends a file with name filename_in_zip and contents of
        file_contents to the in-memory zip.'''
        # Get a handle to the in-memory zip in append mode
        zf = zipfile.ZipFile(self.in_memory_zip, "a", zipfile.ZIP_DEFLATED, False)
        # Write the file to the in-memory zip
        zf.writestr(filename_in_zip, file_contents)
        # Mark the files as having been created on Windows so that
        # Unix permissions are not inferred as 0000
        for zfile in zf.filelist:
            zfile.create_system = 0
        return self

    def read(self):
        '''Returns a string with the contents of the in-memory zip.'''
        self.in_memory_zip.seek(0)
        return self.in_memory_zip.read()

    def writetofile(self, filename):
        '''Writes the in-memory zip to a file.'''
        f = file(filename, "w")
        f.write(self.read())
        f.close()
You should use 'wb' mode when creating the file you are saving to the file system; this ensures the file is written in binary.
Otherwise, any time a newline (\n) byte happens to occur in the zip data, Python will replace it with the Windows line ending (\r\n), corrupting the archive. The reason exactly 10 files is a problem is that 10 happens to be the ASCII code for \n, so the entry count stored in the zip's end-of-central-directory record contains that byte.
So your write function should look like this:
def writetofile(self, filename):
    '''Writes the in-memory zip to a file.'''
    f = file(filename, 'wb')
    f.write(self.read())
    f.close()
This should fix your problem and work for the files in your example. Although, in your case you might find it easier to write the zip file directly to the file system like this code which includes some of the comments from above:
import StringIO
import zipfile

class ZipCreator:
    buffer = None

    def __init__(self, fileName=None):
        if fileName:
            self.zipFile = zipfile.ZipFile(fileName, 'w', zipfile.ZIP_DEFLATED, False)
            return
        self.buffer = StringIO.StringIO()
        self.zipFile = zipfile.ZipFile(self.buffer, 'w', zipfile.ZIP_DEFLATED, False)

    def addToZipFromFileSystem(self, filePath, filenameInZip):
        self.zipFile.write(filePath, filenameInZip)

    def addToZipFromMemory(self, filenameInZip, fileContents):
        self.zipFile.writestr(filenameInZip, fileContents)
        for zipFile in self.zipFile.filelist:
            zipFile.create_system = 0

    def write(self, fileName):
        if not self.buffer:  # If the buffer was not initialized the file is written by the ZipFile
            self.zipFile.close()
            return
        f = file(fileName, 'wb')
        f.write(self.buffer.getvalue())
        f.close()

# Use File Handle
zipCreator = ZipCreator('C:/path/test.zip')
# Use Memory Buffer
# zipCreator = ZipCreator()
for i in range(1, 10):
    zipCreator.addToZipFromMemory('test/%sa.txt' % i, 'a')
zipCreator.write('C:/path/test.zip')
Ideally, you would probably use separate classes for an in-memory zip and a zip that is tied to the file system from the beginning. I have also seen some issues with the in-memory zip when folders are added, which are difficult to recreate and which I am still trying to track down.
