gzip.read32 method not available in Python 3 [duplicate] - python

I have a .gz file and I need to get the name of files inside it using python.
This question is the same as this one
The only difference is that my file is .gz, not .tar.gz, so the tarfile library did not help me here.
I am using the requests library to request a URL. The response is a compressed file.
Here is the code I am using to download the file
response = requests.get(line.rstrip(), stream=True)
if response.status_code == 200:
    with open(str(base_output_dir)+"/"+str(current_dir)+"/"+str(count)+".gz", 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
del response
This code downloads the file with a name like 1.gz, for example. Now if I open the file with an archive manager, it contains something like my_latest_data.json
I need to extract the file and the output be my_latest_data.json.
Here is the code I am using to extract the file
inF = gzip.open(f, 'rb')
outfilename = f.split(".")[0]
outF = open(outfilename, 'wb')
outF.write(inF.read())
inF.close()
outF.close()
The outfilename variable is a string I provide in the script, but I need the real file name (my_latest_data.json)

You can't, because Gzip is not an archive format.
That's a bit of a crap explanation on its own, so let me break this down a bit more than I did in the comment...
It's just compression
Being "just a compression system" means that Gzip operates on input bytes (usually from a file) and outputs compressed bytes. You cannot know whether or not the bytes inside represent multiple files or just a single file -- it is just a stream of bytes that has been compressed. That is why you can accept gzipped data over a network, for example. It's bytes_in -> bytes_out.
What's a manifest?
A manifest is a header within an archive that acts as a table of contents for the archive. Note that now I am using the term "archive" and not "compressed stream of bytes". An archive implies that it is a collection of files or segments that are referred to by a manifest -- a compressed stream of bytes is just a stream of bytes.
What's inside a Gzip, anyway?
A somewhat simplified description of a .gz file's contents is:
A header with a magic number to indicate it's a gzip, the compression method, flags, and a timestamp (10 bytes)
Optional headers; usually including the original filename (if the compression target was a file)
The body -- some compressed payload
A CRC-32 checksum and the uncompressed size at the end (8 bytes total)
That's it. No manifest.
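You can see that layout for yourself by peeking at a stream produced in memory (Python 3; the payload is arbitrary):

import gzip

blob = gzip.compress(b"some payload")
print(blob[:10].hex())   # 1f 8b 08 <flags> <4-byte mtime> <xfl> <os>
# The flags byte is 0 here: compressing from raw bytes stores no original file name.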
Archive formats, on the other hand, will have a manifest inside. That's where the tar library would come in. Tar is just a way to shove a bunch of files together into a single stream, with a header block in front of each member that records the original file name and size before they were concatenated into the archive. Hence, .tar.gz being so common.
There are utilities that allow you to decompress parts of a gzipped file at a time, or decompress it only in memory, so that you can then examine whatever manifest may be inside. But the details of any manifest are specific to the archive format contained inside.
Note that this is different from a zip archive. Zip is an archive format, and as such contains a manifest. Gzip is a compression library, like bzip2 and friends.
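By contrast, the standard library will hand you an archive's manifest directly. A small sketch (the archive names here are placeholders, not files from the question):

import tarfile
import zipfile

print(zipfile.ZipFile("example.zip").namelist())    # zip stores a central directory of names
print(tarfile.open("example.tar.gz").getnames())    # the tar headers inside the gzip carry the names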

As noted in the other answer, your question can only make sense if I take out the plural: "I have a .gz file and I need to get the name of the file inside it using python."
A gzip header may or may not have a file name in it. The gzip utility will normally ignore the name in the header, and decompress to a file with the same name as the .gz file, but with the .gz stripped. E.g. your 1.gz would decompress to a file named 1, even if the header has the file name my_latest_data.json in it. The -N option of gzip will use the file name in the header (as well as the time stamp in the header), if there is one. So gzip -dN 1.gz would create the file my_latest_data.json, instead of 1.
You can find the file name in the header in Python by processing the header manually. You can find the details in the gzip specification.
Verify that the first three bytes are 1f 8b 08.
Save the fourth byte. Call it flags. If flags & 8 is zero, then give up -- there is no file name in the header.
Skip the next six bytes.
If flags & 2 is not zero, a two-byte header CRC is also present, but it comes after the file name, so you do not need to skip anything for it here.
If flags & 4 is not zero, then read the next two bytes. Considering them to be in little endian order, make an integer out of those two bytes, calling it xlen. Then skip xlen bytes.
We already know that flags & 8 is not zero, so you are now at the file name. Read bytes until you get to a zero byte. The bytes up to, but not including, the zero byte are the file name.
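Put together, a minimal sketch of those steps might look like this (Python 3; the function name is mine, and it returns None when no name is stored):

import struct

def stored_gzip_name(path):
    with open(path, 'rb') as f:
        header = f.read(10)
        if len(header) < 10 or header[:3] != b'\x1f\x8b\x08':
            return None                        # not a gzip/deflate stream
        flags = header[3]
        if not flags & 8:                      # FNAME bit
            return None                        # no file name in the header
        if flags & 4:                          # FEXTRA: skip the extra field
            xlen, = struct.unpack('<H', f.read(2))
            f.read(xlen)
        name = bytearray()
        while True:
            c = f.read(1)
            if not c or c == b'\x00':          # a zero byte terminates the name
                break
            name.extend(c)
        return name.decode('latin-1')

print(stored_gzip_name('1.gz'))                # e.g. my_latest_data.json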

Note: This answer is obsolete as of Python 3.
Using the tips from the Mark Adler reply and a bit of inspection of the gzip module, I've set up this function that extracts the internal filename from gzip files. I noticed that GzipFile objects have a private method called _read_gzip_header() that almost gets the filename, so I based this on that.
import gzip

def get_gzip_filename(filepath):
    f = gzip.open(filepath)
    f._read_gzip_header()
    f.fileobj.seek(0)
    f.fileobj.read(3)
    flag = ord(f.fileobj.read(1))
    mtime = gzip.read32(f.fileobj)
    f.fileobj.read(2)
    if flag & gzip.FEXTRA:
        # Read & discard the extra field, if present
        xlen = ord(f.fileobj.read(1))
        xlen = xlen + 256*ord(f.fileobj.read(1))
        f.fileobj.read(xlen)
    filename = ''
    if flag & gzip.FNAME:
        while True:
            s = f.fileobj.read(1)
            if not s or s == '\000':
                break
            else:
                filename += s
    return filename or None

The Python 3 gzip library discards this information, but you could adapt the code from around the link to do something else with it.
As noted in other answers on this page, this information is optional anyway. But it's not hard to retrieve if you need to check whether it's there.
import struct

def gzinfo(filename):
    # Copy+paste from gzip.py line 16
    FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16
    with open(filename, 'rb') as fp:
        # Basically copy+paste from GzipFile module line 429f
        magic = fp.read(2)
        if magic == b'':
            return False
        if magic != b'\037\213':
            raise ValueError('Not a gzipped file (%r)' % magic)
        method, flag, _last_mtime = struct.unpack("<BBIxx", fp.read(8))
        if method != 8:
            raise ValueError('Unknown compression method')
        if flag & FEXTRA:
            # Read & discard the extra field, if present
            extra_len, = struct.unpack("<H", fp.read(2))
            fp.read(extra_len)
        if flag & FNAME:
            fname = []
            while True:
                s = fp.read(1)
                if not s or s == b'\000':
                    break
                fname.append(s.decode('latin-1'))
            return ''.join(fname)

def main():
    from sys import argv
    for filename in argv[1:]:
        print(filename, gzinfo(filename))

if __name__ == '__main__':
    main()
This replaces the exceptions in the original code with a vague ValueError exception (you might want to fix that if you intend to use this more broadly, and turn this into a proper module you can import). It also uses the generic read() function instead of the specific _read_exact() method, which goes to some trouble to ensure that it gets exactly the number of bytes it requested (this too could be lifted over if you wanted to).

Related

Read large file header (~9GB) inside tarfile without full extraction

I have ~1GB *.tbz files. Inside each of those files there is a single ~9GB file. I just need to read the header of this file, the first 1024 bytes.
I want to do this as fast as possible, as I have hundreds of these 1GB files to process. It takes about 1m30s to extract each one.
I tried using full extraction:
tar = tarfile.open(fn, mode='r|bz2')
for item in tar:
    tar.extract(item)
and tarfile.getmembers(), but with no speed improvement:
tar = tarfile.open(fn, mode='r|bz2')
for member in tar.getmembers():
    f = tar.extractfile(member)
    headerbytes = f.read(1024)
    headerdict = parseHeader(headerbytes)
The getmembers() method is what's taking all the time there.
Is there any way I can do this?
I think you should use the standard library bz2 interface. .tbz is the file extension for tar archives that have been compressed with bzip2 (what tar's -j option produces).
As #bbayles pointed out in the comments, you can open your file as a bz2.BZ2File and use seek and read:
read([size])
    Read at most size uncompressed bytes, returned as a string. If the size argument is negative or omitted, read until EOF is reached.

seek(offset[, whence])
    Move to new file position. Argument offset is a byte count.
f = bz2.BZ2File(path)
f.seek(512)
headerbytes = f.read(1024)
You can then parse that with your functions.
headerdict = parseHeader(headerbytes)
If you're sure that every tar archive will contain only a single file, you can simply skip the first 512 bytes of the decompressed tar stream (NOT 512 bytes of the raw .tbz file, of course), because the tar format uses a padded, fixed-size header block, after which your "real" content is stored.
A simple
f.seek(512)
instead of looping over getmembers() should do the trick.

Extracting bz2 file with single file in memory

I have a csv file compressed into a bz2 file that I'm trying to load from a website, decompress, and write to a local csv file by
# Get zip file from website
archive = StringIO()
url_data = urllib2.urlopen(url)
archive.write(url_data.read())
# Extract the training data
data = bz2.decompress(archive.read())
# Write to csv
output_file = open('dataset_' + mode + '.csv', 'w')
output_file.write(data)
On the decompress call, I get IOError: invalid data stream. As a note, the csv file contained in the archive has quite a few characters that could be causing some issues. Particularly, if I try putting the file contents in unicode, I get an error about not being able to decode 0xfd. I only have the single file within the archive, but I'm wondering if something could also be going on due to not extracting a specific file.
Any ideas?
I suspect you are getting this error because the stream you are feeding the decompress() function is not a valid bz2 stream.
You must also "rewind" your StringIO buffer after writing to it. See the notes below in comments. The following code (same as yours with the exception of imports, and the seek() fix) works if the URL points to a valid bz2 file.
from StringIO import StringIO
import urllib2
import bz2
# Get zip file from website
url = "http://www.7-zip.org/a/7z920.tar.bz2" # just an example bz2 file
archive = StringIO()
# in case the request fails (e.g. 404, 500), this will raise
# a `urllib2.HTTPError`
url_data = urllib2.urlopen(url)
archive.write(url_data.read())
# will print how much compressed data you have buffered.
print "Length of file:", archive.tell()
# important!... make sure to reset the file descriptor read position
# to the start of the file.
archive.seek(0)
# Extract the training data
data = bz2.decompress(archive.read())
# Write to csv
output_file = open('output_file', 'w')
output_file.write(data)
re: encoding issues
Generally, character encoding errors will generate UnicodeError (or one of its cousins), but not IOError. IOError suggests something is wrong with the input, like truncation, or some error that would prevent the decompressor from doing its work completely.
You have omitted the imports from your question, and one of the subtle differences between the StringIO and cStringIO (according to the docs ) is that cStringIO cannot work with unicode strings that cannot be converted to ascii. That no longer seems to hold (in my tests at least), but it may be at play.
Unlike the StringIO module, this module (cStringIO) is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.

Python failed to parse txt file but the file is confirmed to be 'txt' file

I have a piece of python code that reads from a txt file properly, but my colleague gave me another set of files that appear to be txt files as well. When I ran the same python code, each line was read incorrectly.
For the new files, if the line is 240,022414114120,-500,Bauer_HS5,0
It would be read as str:2[]4[]0 []0[]2[]2[]4..... All those little rectangles between each character and the leading question mark characters are all invalid characters.
And it will further get converted to something like this:
[['\xff\xfe2\x004\x000\x00', '\x000\x002\x002\x004\x001\x004\x001\x001\x004\x001\x002\x000\x00', '\x00-\x005\x000\x000\x00',......
However, if I manually create a normal text file and copy/paste the content from the input file, the parser is able to read each line correctly. So I am thinking the input files are of a different type than a normal text file, yet the files' suffix is indeed 'txt'.
The files come from a device that regularly sends files to our server. This parser works fine for another device that does the same thing. And the files from both devices are all of type 'txt'.
Each line is read with for line in self._infile.xreadlines():
I am very confused why it would behave this way.
My python code is following.
def __init__(self, infile=sys.stdin, outfile=sys.stdout):
    if isinstance(infile, basestring):
        infile = open(infile)
    if isinstance(outfile, basestring):
        outfile = open(outfile, "w")
    self._infile = infile
    self._outfile = outfile

def sort(self):
    lines = []
    last_second = None
    for line in self._infile.xreadlines():
        line = line.replace('\r\n', '')
        fields = line.split(',')
        if len(fields) < 2:
            continue
        second = fields[1]
        if last_second and second != last_second:
            lines = sorted(lines, self._sort_lines)
            self._outfile.write("".join([','.join(x) for x in lines]))
            #self._outfile.write("\r\n")
            lines = []
        last_second = second
        lines.append(fields)
    if lines:
        lines = sorted(lines, self._sort_lines)
        self._outfile.write("".join([','.join(x) for x in lines]))
        #self._outfile.write("\r\n")
    self._infile.close()
    self._outfile.close()
The start of the file you described as coming from your colleague is "\xff\xfe". These two characters make up a "byte order mark" that indicates that the file is encoded with the "UTF-16-LE" encoding (that is, 16-bit Unicode with the lower byte first). Your Python script is reading with an 8-bit encoding (probably whatever your system's default encoding is), so you're seeing lots of extra null characters (the high bytes of the 16-bit characters).
I can't speak to how the file got a different encoding. Windows text editors (like notepad.exe) are somewhat notorious for silently reencoding files in unhelpful ways if you're not careful with them, so it may be that your colleague previewed the file in an editor and then saved it before forwarding it on to you.
Anyway, the simplest fix is probably to reencode the file. There are various utilities to do this on various OSs, or you could write your own easily enough. Here's a quick and dirty function to reencode a file in Python (which will hopefully raise an exception if the encoding parameters are wrong, but perhaps not always):
def reencode_file(filename, from_encoding="UTF-16-LE", to_encoding="ascii"):
    with open(filename, "rb") as f:
        in_bytes = f.read()                  # read bytes
    text = in_bytes.decode(from_encoding)    # decode to unicode
    out_bytes = text.encode(to_encoding)     # reencode to new encoding
    with open(filename, "wb") as f:
        f.write(out_bytes)                   # write back to the file
If the file you get is going to always be encoded in UTF-16, you could change your regular script to decode it automatically. In Python 2.7, I'd suggest using the io module's open function for this (it is the same code that the regular open uses in Python 3). Note however that the file object returned won't support the xreadlines method which has been deprecated for a long time (just iterate over the file directly instead).
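As a rough sketch of that suggestion (the file name here is a placeholder), let io.open do the UTF-16 decoding and iterate over the file directly:

import io

with io.open("device_output.txt", "r", encoding="utf-16") as f:
    for line in f:                               # iterate directly; no xreadlines()
        fields = line.rstrip(u"\r\n").split(u",")
        print(fields)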

Reading memory mapped bzip2 compressed file

So I'm playing with the Wikipedia dump file. It's an XML file that has been bzipped. I can write all the files to directories, but then when I want to do analysis, I have to reread all the files on the disk. This gives me random access, but it's slow. I have enough RAM to put the entire bzipped file into memory.
I can load the dump file just fine and read all the lines, but I cannot seek in it as it's gigantic. From what it seems, the bz2 library has to read and capture the offset before it can bring me there (and decompress it all, as the offset is in decompressed bytes).
Anyway, I'm trying to mmap the dump file (~9.5 gigs) and load it into bzip. I obviously want to test this on a bzip file first.
I want to map the mmap file to a BZ2File so I can seek through it (to get to a specific, uncompressed byte offset), but from what it seems, this is impossible without decompressing the entire mmap file (this would be well over 30 gigabytes).
Do I have any options?
Here's some code I wrote to test.
import bz2
import mmap

lines = '''This is my first line
This is the second
And the third
'''

with open("bz2TestFile", "wb") as f:
    f.write(bz2.compress(lines))

with open("bz2TestFile", "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

    print "Part of MMAPPED"
    # This does not work until I hit a minimum length
    # due to (I believe) the checksums in the bz2 algorithm
    for x in range(len(mapped)+2):
        line = mapped[0:x]
        try:
            print x
            print bz2.decompress(line)
        except:
            pass

    # I can decompress the entire mmapped file
    print ":entire mmap file:"
    print bz2.decompress(mapped)

# I can create a bz2File object from the file path
# Is there a way to map the mmap object to this function?
print ":BZ2 File readline:"
bzF = bz2.BZ2File("bz2TestFile")

# Seek to specific offset
bzF.seek(22)

# Read the data
print bzF.readline()
This all makes me wonder though, what is special about the bz2 file object that allows it to read a line after seeking? Does it have to read every line before it to get the checksums from the algorithm to work out correctly?
I found an answer! James Taylor wrote a couple of scripts for seeking in BZ2 files, and his scripts are part of the bx-python package.
https://bitbucket.org/james_taylor/bx-python/overview
These work pretty well, although they do not allow seeking to arbitrary byte offsets in the BZ2 file; his scripts read out blocks of BZ2 data and allow seeking based on those blocks.
In particular, see bx-python / wiki / IO / SeekingInBzip2Files

Read binary file with header using bitarray's from file in Python

I wrote a program that uses bitarray 0.8.0 to write bits to a binary file. I would like to add a header to this binary file to describe what's inside the file.
My problem is that I think the method "fromfile" of bitarray necessarily starts reading the file from the beginning. I could make a workaround so that the reading program gets the header and then rewrite a temporary file containing only the binary portion (bitarray tofile), but it doesn't sound too efficient of an idea.
Is there any way to do this properly?
My file could look something like the following where clear text is the header and binary data is the bitarray information:
...{(0, 0): '0'}{(0, 0): '0'}{(0, 0): '0'} (followed by the raw binary bitarray data) ...
Edit:
I tried the following after reading the response:
bits = ""
b = bitarray()
with open(Filename, 'rb') as file:
#Get header
byte = file.read(1)
while byte != "":
# read header
byte = file.read(1)
b.fromfile(file)
print b.to01()
print "len(b.to01())", len(b.to01())
The length is 0 and the print of "to01()" is empty.
However, the print of the header is fine.
My problem is that I think the method "fromfile" of bitarray necessarily starts reading the file from the beginning.
This is likely false; it, like most other file read routines, probably starts at the current position within the file, and stops at EOF.
EDIT:
From the documentation:
fromfile(f, [n])
Read n bytes from the file object f and append them to the bitarray interpreted as machine values. When n is omitted, as many bytes are read until EOF is reached.
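So in practice you can consume the header yourself and then let fromfile() continue from wherever the file position ends up. A minimal sketch, assuming (purely for illustration) a fixed-size header and a placeholder file name:

from bitarray import bitarray

HEADER_SIZE = 64                     # hypothetical; use however your header is actually delimited

with open("mydata.bin", "rb") as f:
    header = f.read(HEADER_SIZE)     # consume the header bytes first
    b = bitarray()
    b.fromfile(f)                    # appends bits from the current position to EOF

print(len(b))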
