Extracting bz2 file with single file in memory - python

I have a csv file compressed into a bz2 file that I'm trying to load from a website, decompress, and write to a local csv file with the following:
# Get zip file from website
archive = StringIO()
url_data = urllib2.urlopen(url)
archive.write(url_data.read())
# Extract the training data
data = bz2.decompress(archive.read())
# Write to csv
output_file = open('dataset_' + mode + '.csv', 'w')
output_file.write(data)
On the decompress call, I get IOError: invalid data stream. As a note, the csv file contained in the archive has quite a few characters that could be causing some issues. Particularly, if I try putting the file contents in unicode, I get an error about not being able to decode 0xfd. I only have the single file within the archive, but I'm wondering if something could also be going on due to not extracting a specific file.
Any ideas?

I suspect you are getting this error because the stream you are feeding the decompress() function is not a valid bz2 stream.
You must also "rewind" your StringIO buffer after writing to it; see the notes in the comments below. The following code (the same as yours except for the imports and the seek() fix) works if the URL points to a valid bz2 file.
from StringIO import StringIO
import urllib2
import bz2
# Get zip file from website
url = "http://www.7-zip.org/a/7z920.tar.bz2" # just an example bz2 file
archive = StringIO()
# in case the request fails (e.g. 404, 500), this will raise
# a `urllib2.HTTPError`
url_data = urllib2.urlopen(url)
archive.write(url_data.read())
# will print how much compressed data you have buffered.
print "Length of file:", archive.tell()
# important!... make sure to reset the file descriptor read position
# to the start of the file.
archive.seek(0)
# Extract the training data
data = bz2.decompress(archive.read())
# Write to csv
output_file = open('output_file', 'w')
output_file.write(data)
re: encoding issues
Generally, character encoding errors will generate UnicodeError (or one of its cousins), not IOError. IOError suggests something is wrong with the input itself, like truncation, or some error that would prevent the decompressor from doing its work completely.
You have omitted the imports from your question, and one of the subtle differences between StringIO and cStringIO (according to the docs) is that cStringIO cannot work with unicode strings that cannot be converted to ASCII. That no longer seems to hold (in my tests, at least), but it may be at play.
Unlike the StringIO module, this module (cStringIO) is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.
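If you are on Python 3, the same flow is simpler; here is a minimal sketch of my own (not from the original answer), using urllib.request so everything stays in bytes and no rewind is needed:

import bz2
import urllib.request

url = "http://www.7-zip.org/a/7z920.tar.bz2"  # the same example bz2 file
# read() returns bytes, which can be handed to bz2.decompress() directly
compressed = urllib.request.urlopen(url).read()
data = bz2.decompress(compressed)
# 'wb' because the decompressed payload is bytes, not text
with open('output_file', 'wb') as output_file:
    output_file.write(data)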


gzip.read32 method not available in Python 3 [duplicate]

I have a .gz file and I need to get the name of files inside it using python.
This question is the same as this one
The only difference is that my file is .gz, not .tar.gz, so the tarfile library did not help me here.
I am using requests library to request a URL. The response is a compressed file.
Here is the code I am using to download the file
response = requests.get(line.rstrip(), stream=True)
if response.status_code == 200:
    with open(str(base_output_dir)+"/"+str(current_dir)+"/"+str(count)+".gz", 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
del response
This code downloads the file with the name 1.gz, for example. Now if I open the file with an archive manager, it will contain something like my_latest_data.json.
I need to extract the file and the output be my_latest_data.json.
Here is the code I am using to extract the file
inF = gzip.open(f, 'rb')
outfilename = f.split(".")[0]
outF = open(outfilename, 'wb')
outF.write(inF.read())
inF.close()
outF.close()
The outfilename variable is a string I provide in the script, but I need the real file name (my_latest_data.json).
You can't, because Gzip is not an archive format.
That's a bit of a crap explanation on its own, so let me break this down a bit more than I did in the comment...
It's just compression
Being "just a compression system" means that Gzip operates on input bytes (usually from a file) and outputs compressed bytes. You cannot know whether the bytes inside represent multiple files or just a single file; it is just a stream of bytes that has been compressed. That is why you can accept gzipped data over a network, for example. It's bytes_in -> bytes_out.
What's a manifest?
A manifest is a header within an archive that acts as a table of contents for the archive. Note that now I am using the term "archive" and not "compressed stream of bytes". An archive implies that it is a collection of files or segments that are referred to by a manifest -- a compressed stream of bytes is just a stream of bytes.
What's inside a Gzip, anyway?
A somewhat simplified description of a .gz file's contents is:
A header with a magic number to indicate it's a gzip, a version, and a timestamp (10 bytes)
Optional headers; usually including the original filename (if the compression target was a file)
The body -- some compressed payload
An 8-byte trailer at the end: a CRC-32 checksum and the uncompressed size (4 bytes each)
That's it. No manifest.
Archive formats, on the other hand, will have a manifest inside. That's where the tar library would come in. Tar is just a way to shove a bunch of bits together into a single file, and places a manifest at the front that lets you know the names of the original files and what sizes they were before being concatenated into the archive. Hence, .tar.gz being so common.
There are utilities that allow you to decompress parts of a gzipped file at a time, or decompress it only in memory, to then let you examine a manifest or whatever else may be inside. But the details of any manifest are specific to the archive format contained inside.
Note that this is different from a zip archive. Zip is an archive format, and as such contains a manifest. Gzip is a compression library, like bzip2 and friends.
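To see the difference concretely, here is a short sketch (the archive name example.zip is hypothetical): zipfile can list an archive's members because zip has a manifest, while gzip offers no equivalent call at all.

import zipfile

# A zip archive carries a central directory (its manifest), so the member
# names can be listed without extracting anything.
with zipfile.ZipFile('example.zip') as zf:
    print(zf.namelist())  # e.g. ['my_latest_data.json']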
As noted in the other answer, your question can only make sense if I take out the plural: "I have a .gz file and I need to get the name of file inside it using python."
A gzip header may or may not have a file name in it. The gzip utility will normally ignore the name in the header, and decompress to a file with the same name as the .gz file, but with the .gz stripped. E.g. your 1.gz would decompress to a file named 1, even if the header has the file name my_latest_data.json in it. The -N option of gzip will use the file name in the header (as well as the time stamp in the header), if there is one. So gzip -dN 1.gz would create the file my_latest_data.json, instead of 1.
You can find the file name in the header in Python by processing the header manually. You can find the details in the gzip specification.
Verify that the first three bytes are 1f 8b 08.
Save the fourth byte. Call it flags. If flags & 8 is zero, then give up -- there is no file name in the header.
Skip the next six bytes.
If flags & 2 is not zero, skip two bytes.
If flags & 4 is not zero, then read the next two bytes. Considering them to be in little endian order, make an integer out of those two bytes, calling it xlen. Then skip xlen bytes.
We already know that flags & 8 is not zero, so you are now at the file name. Read bytes until you get to zero byte. Those bytes up to, but not including the zero byte are the file name.
Note: This answer is obsolete as of Python 3.
Using the tips from Mark Adler's reply and a bit of inspection of the gzip module, I've set up this function that extracts the internal filename from gzip files. I noticed that GzipFile objects have a private method called _read_gzip_header() that almost gets the filename, so I based mine on that:
import gzip

def get_gzip_filename(filepath):
    f = gzip.open(filepath)
    f._read_gzip_header()
    f.fileobj.seek(0)
    f.fileobj.read(3)
    flag = ord(f.fileobj.read(1))
    mtime = gzip.read32(f.fileobj)
    f.fileobj.read(2)
    if flag & gzip.FEXTRA:
        # Read & discard the extra field, if present
        xlen = ord(f.fileobj.read(1))
        xlen = xlen + 256 * ord(f.fileobj.read(1))
        f.fileobj.read(xlen)
    filename = ''
    if flag & gzip.FNAME:
        while True:
            s = f.fileobj.read(1)
            if not s or s == '\000':
                break
            else:
                filename += s
    return filename or None
The Python 3 gzip library discards this information, but you could adapt the code from around the link to do something else with it.
As noted in other answers on this page, this information is optional anyway. But it's not impossible to retrieve if you want to check whether it's there.
import struct

def gzinfo(filename):
    # Copy+paste from gzip.py line 16
    FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16
    with open(filename, 'rb') as fp:
        # Basically copy+paste from GzipFile module line 429f
        magic = fp.read(2)
        if magic == b'':
            return False
        if magic != b'\037\213':
            raise ValueError('Not a gzipped file (%r)' % magic)
        method, flag, _last_mtime = struct.unpack("<BBIxx", fp.read(8))
        if method != 8:
            raise ValueError('Unknown compression method')
        if flag & FEXTRA:
            # Read & discard the extra field, if present
            extra_len, = struct.unpack("<H", fp.read(2))
            fp.read(extra_len)
        if flag & FNAME:
            fname = []
            while True:
                s = fp.read(1)
                if not s or s == b'\000':
                    break
                fname.append(s.decode('latin-1'))
            return ''.join(fname)

def main():
    from sys import argv
    for filename in argv[1:]:
        print(filename, gzinfo(filename))

if __name__ == '__main__':
    main()
This replaces the exceptions in the original code with a vague ValueError exception (you might want to fix that if you intend to use this more broadly, and turn this into a proper module you can import) and uses the generic read() function instead of the specific _read_exact() method which goes through some trouble to ensure that it got exactly the number of bytes it requested (this too could be lifted over if you wanted to).
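If you do want that robustness, a minimal equivalent of the private helper might look like this (a sketch of my own; the name read_exact is hypothetical, not part of the gzip module):

def read_exact(fp, n):
    # Keep reading until exactly n bytes have arrived, or fail loudly
    # instead of silently parsing a truncated header.
    data = fp.read(n)
    while len(data) < n:
        chunk = fp.read(n - len(data))
        if not chunk:
            raise EOFError('file ended before %d bytes could be read' % n)
        data += chunk
    return data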

Convert bytes to a file object in python

I have a small application that reads local files using:
open(diefile_path, 'r') as csv_file
open(diefile_path, 'r') as file
and also uses linecache module
I need to expand this to files sent from a remote server.
The content received from the server is of type bytes.
I couldn't find much information about handling the io.BytesIO type, and I was wondering if there is a way to convert the bytes chunk to a file-like object.
My goal is to use the API specified above (open, linecache).
I was able to convert the bytes into a string using data.decode("utf-8"),
but I can't use the methods above (open and linecache).
A small example to illustrate:
data = b'First line\nSecond line\nThird line\n'
with open(data) as file:
    line = file.readline()
    print(line)
output:
First line
Second line
Third line
can it be done?
open is used to open actual files, returning a file-like object. Here, you already have the data in memory, not in a file, so you can instantiate the file-like object directly.
import io

data = b'First line\nSecond line\nThird line\n'
file = io.StringIO(data.decode())
for line in file:
    print(line.strip())
However, if what you are getting is really just a newline-separated string, you can simply split it into a list directly.
lines = data.decode().strip().split('\n')
The main difference is that the StringIO version is slightly lazier; it has a smaller memory footprint compared to the list, as it splits strings off as requested by the iterator.
The StringIO answer above needs to decode the bytes with a specific encoding, which may cause a wrong conversion.
From the Python documentation, using BytesIO:
from io import BytesIO
f = BytesIO(b"some initial binary data: \x00\x01")
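If you need a text file-like object over the raw bytes rather than a str, you can also wrap the BytesIO in a TextIOWrapper so the decoding is explicit; a sketch (assuming UTF-8, as in the question):

import io

data = b'First line\nSecond line\nThird line\n'
# TextIOWrapper decodes lazily as you iterate, instead of decoding
# the whole buffer up front.
file = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8')
for line in file:
    print(line, end='')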

Error when using gzip on a file containing line breaks

I'm attempting to use python's gzip library to streamline some python scripts that create csv output files. I've tried a number of different methods of creating the gzip file, but no matter which method I've tried, I'm running into the same issue.
My python script runs successfully, but when I try to decompress the gzip file in Finder (using MacOS 10.15.6), I'm prompted with the following error:
Unable to expand "file.csv.gz" into "Documents". (Error 79 - Inappropriate file type or format.)
After some debugging, I've narrowed down the cause of the error to the file content containing line break (\n) characters.
This simple example code triggers the above error on gzip expansion:
import gzip
content = b'Id,Food\n1,Spam\n2,Eggs\n'
f = gzip.open('file.csv.gz', 'wb')
f.write(content)
f.close()
When I remove all \n characters from the content variable, everything works fine:
import gzip
content = b'Id,Food,1,Spam,2,Eggs'
f = gzip.open('file.csv.gz', 'wb')
f.write(content)
f.close()
Does gzip want me to use a different line break mechanism? I'm sure I'm missing some sort of foundational knowledge about gzip or binaries, so any info that helps get me back on track would be much appreciated.
It has nothing to do with Python's gzip. It is, arguably, a bug in macOS: the Archive Utility sometimes misdetects the resulting uncompressed data as an mtree, and then finds that the data violates the mtree format.
The solution is to not double-click to decompress. Use gzip to decompress.
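As a quick sanity check (my own addition, not from the original answer), the file written above reads back correctly through Python's own gzip module, which points the finger at Archive Utility rather than at the file:

import gzip

# If this assertion passes, the .gz file is valid and the failure is in
# macOS's Archive Utility, not in how the file was written.
with gzip.open('file.csv.gz', 'rb') as f:
    assert f.read() == b'Id,Food\n1,Spam\n2,Eggs\n'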

Reading memory mapped bzip2 compressed file

So I'm playing with the Wikipedia dump file. It's an XML file that has been bzipped. I can write all the files to directories, but then when I want to do analysis, I have to reread all the files on the disk. This gives me random access, but it's slow. I have the ram to put the entire bzipped file into ram.
I can load the dump file just fine and read all the lines, but I cannot seek in it, as it's gigantic. From what I can tell, the bz2 library has to read and decompress everything up to an offset before it can bring me there (the offset is in decompressed bytes).
Anyway, I'm trying to mmap the dump file (~9.5 GB) and feed it to bz2. I obviously want to test this on a small bz2 file first.
I want to map the mmap file to a BZ2File so I can seek through it (to get to a specific, uncompressed byte offset), but from what I can tell, this is impossible without decompressing the entire mmap file (which would be well over 30 gigabytes).
Do I have any options?
Here's some code I wrote to test.
import bz2
import mmap

lines = '''This is my first line
This is the second
And the third
'''

with open("bz2TestFile", "wb") as f:
    f.write(bz2.compress(lines))

with open("bz2TestFile", "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

    print "Part of MMAPPED"
    # This does not work until I hit a minimum length
    # due to (I believe) the checksums in the bz2 algorithm
    for x in range(len(mapped)+2):
        line = mapped[0:x]
        try:
            print x
            print bz2.decompress(line)
        except:
            pass

    # I can decompress the entire mmapped file
    print ":entire mmap file:"
    print bz2.decompress(mapped)

# I can create a BZ2File object from the file path.
# Is there a way to map the mmap object to this function?
print ":BZ2 File readline:"
bzF = bz2.BZ2File("bz2TestFile")
# Seek to a specific offset
bzF.seek(22)
# Read the data
print bzF.readline()
This all makes me wonder though, what is special about the bz2 file object that allows it to read a line after seeking? Does it have to read every line before it to get the checksums from the algorithm to work out correctly?
I found an answer! James Taylor wrote a couple of scripts for seeking in BZ2 files, and his scripts are part of the bx-python package.
https://bitbucket.org/james_taylor/bx-python/overview
These work pretty well, although they do not allow seeking to arbitrary byte offsets in the BZ2 file; instead, they read out blocks of BZ2 data and allow seeking based on those blocks.
In particular, see bx-python / wiki / IO / SeekingInBzip2Files
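On Python 3, there is also a direct route worth trying (a sketch of my own, not from the original answer): bz2.BZ2File accepts any file-like object, and mmap objects provide read(), seek(), and tell(), so the mapped region can be handed to BZ2File as-is. The original caveat still applies: seek() on the BZ2File decompresses from the start of the stream up to the target offset; the mmap only keeps the compressed bytes off the disk.

import bz2
import mmap

with open("bz2TestFile", "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    # BZ2File treats the mmap as an ordinary file object
    with bz2.BZ2File(mapped) as bzf:
        bzf.seek(22)  # offset in decompressed bytes
        print(bzf.readline())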

Write gzipped-already data into a file

I have a database where some of the data is binary (blob datatype in MySQL); it is actually webpages scraped and gzipped. Now I want to extract them and write each record into a gzip file, which I'd assume to be doable - after all, they are gzipped data, right?
The question is, however, how would I do that? By searching I could find a million examples on how to write a gzip file from original data, not from already-gzipped data. Writing the gzipped string directly into a file doesn't result in a gzip file, not to mention I got a load of "ordinal not in range" exceptions.
Could you guys help? Thanks in advance. I'm a newbie to Python...
Edit: Here is the method I used:
def store_cache(self, content, news_id):
    if not content:
        return
    # some of the records may contain normal (not gzip-compressed) data,
    # hence this try block
    try:
        content = self.gunzip(content)
    except:
        return
    import gzip
    with gzip.open('static/cache/%s' % (self.base36encode(news_id), ), 'wb') as f:
        f.write(content)
        # redundant: the with block already closes f
        f.close()
This causes an exception:
<type 'exceptions.UnicodeEncodeError'> at /migrate
'ascii' codec can't encode character u'\u1edb' in position 186: ordinal not in range(128)
And this is the innermost traceback:
E:\Python27\lib\gzip.py in write
self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
You said it yourself: extract them and then write them into a gzip file. There is nothing special about writing "from gzipped data": you un-gzip the data to get the original data, and then write the original data as if it were original data (because it is). The documentation shows you how to do these things.
However, gzip is just a compression format, not an archive format. It is not built to handle multiple files, so you must use something else to create a single file from the multiple inputs. Typically this is done by making a tar archive which is then gzipped. You can do this in Python using the tarfile module. Since your data will come from gzip-decompression streams, you will want to use the TarFile.addfile(tarinfo, fileobj) method to add them to the archive. You should be able to use the gzip.GzipFile instance as the fileobj to add this way.
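A hedged Python 3 sketch of that approach (the pages variable and the output name are my own placeholders, not from the question); since TarInfo needs each member's size up front, this sketch decompresses each blob fully rather than streaming the GzipFile directly:

import gzip
import io
import tarfile

# pages: iterable of (news_id, blob) pairs fetched from the database
with tarfile.open('pages.tar.gz', 'w:gz') as tar:
    for news_id, blob in pages:
        data = gzip.decompress(blob)        # back to the original bytes
        info = tarfile.TarInfo(name='%s.html' % news_id)
        info.size = len(data)               # addfile() reads exactly this many bytes
        tar.addfile(info, io.BytesIO(data))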
