I'm trying to read many bz2 files contained within a tar file. A file has the following structure:
2013-01.tar
01\01\00\X.json.bz2\X.json
01\01\02\X.json.bz2\X.json
I'm able to get the filenames as follows:
import tarfile
tar = tarfile.open(filepath, 'r')
tar_members_names = tar.getnames()
# Side question: How would I only return files and no directories?
Which returns a list of the .bz2 files. Now I'm trying to extract them (temporarily) using:
inner_filename = tar_members_names[0]
t_extract = tar.extractfile(inner_filename)
However, the following code to extract the JSON file returns an error. How would I go about retrieving the JSON files line by line?
import bz2
txt = bz2.BZ2File(t_extract)
TypeError: coercing to Unicode: need string or buffer, ExFileObject found
txt = bz2.decompress(t_extract)
TypeError: must be convertible to a buffer, not ExFileObject
I've been unable to figure out how to get a buffer from the tar file instead of the current ExFileObject (how do I convert it to a buffer?); any suggestions are greatly appreciated.
BZ2File expects a file name as its first argument, but you are passing a file object (i.e. an object with the same API as what Python's open() returns).
To do what you want, you'll have to read all the bytes from t_extract yourself and call bz2.decompress(data) or use BZ2Decompressor to stream the data through it.
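A minimal runnable sketch of the read-all-bytes-then-decompress approach; the archive here is built in memory as a stand-in for the real 2013-01.tar, and the member name/contents are hypothetical:

```python
import bz2
import io
import tarfile

# Build a small in-memory tar with one .json.bz2 member (a stand-in for
# the real 2013-01.tar; member name and contents are hypothetical).
payload = b'{"id": 1}\n{"id": 2}\n'
compressed = bz2.compress(payload)

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tar:
    info = tarfile.TarInfo(name='01/01/00/X.json.bz2')
    info.size = len(compressed)
    tar.addfile(info, io.BytesIO(compressed))
buf.seek(0)

# Read it back: isfile() filters out directories (the side question),
# and decompressing the extracted file object's *bytes* avoids the TypeError.
with tarfile.open(fileobj=buf, mode='r') as tar:
    members = [m for m in tar.getmembers() if m.isfile()]
    t_extract = tar.extractfile(members[0])
    lines = bz2.decompress(t_extract.read()).splitlines()

print(lines)  # [b'{"id": 1}', b'{"id": 2}']
```

For files too large to decompress in one go, bz2.BZ2Decompressor lets you feed the compressed bytes in chunks instead.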
Related
I have a zip file that contains files of the .nrrd type. The pynrrd library comes with a read function. How can I pull the .nrrd file from the zip and pass it to the nrrd.read() function?
I tried the following, but it gives this error at the nrrd.read() line:
TypeError was unhandled by user code, file() argument 1 must be
encoded string without NULL bytes, not str
import zipfile
import nrrd

in_dir = r'D:\Temp\Slikvideo\JPEG\SV_4_1_mask'
zip_file = 'Annotated.mitk'
zf = zipfile.ZipFile(in_dir + '\\' + zip_file)
f_name = 'datafile.nrrd'  # .nrrd file in zip
file_nrrd = zf.read(f_name)  # pull the file's bytes from the zip
img_nrrd, options = nrrd.read(file_nrrd)  # fails: read() expects a path, not bytes
I could write the file pulled from the .zip to disk, and then read it from disk with nrrd.read() but I am sure there is a better way.
I think that yours is a good way...
There is a similar question here:
Similar question
Plus an answer: I think the problem may be that when you use zipfile.ZipFile you don't set the mode argument.
Try using:
zipfile.ZipFile(path, "r")
The following works:
file_nrrd = zf.extract(f_name) # extract the file from the zip; returns the path written to disk
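The reason this works is that ZipFile.extract() writes the member to disk and returns the path it wrote, which is exactly what a path-based reader like nrrd.read() expects. A small sketch (the zip is built in memory as a stand-in for Annotated.mitk, and its contents are hypothetical):

```python
import io
import os
import tempfile
import zipfile

# Build a small zip in memory as a stand-in for Annotated.mitk;
# the member name and contents are hypothetical.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as out:
    out.writestr('datafile.nrrd', b'NRRD0004\n')
buf.seek(0)

out_dir = tempfile.mkdtemp()
with zipfile.ZipFile(buf, 'r') as zf:
    # extract() writes the member to disk and returns the written path
    path = zf.extract('datafile.nrrd', path=out_dir)

print(os.path.basename(path))  # datafile.nrrd
# img_nrrd, options = nrrd.read(path)  # pass the returned path to pynrrd
```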
I have ~1GB *.tbz files. Inside each of them there is a single ~9GB file, and I just need to read its header, the first 1024 bytes.
I want to do this as fast as possible, as I have hundreds of these 1GB files to process. Extraction takes about 1m30s.
I tried using full extraction:
import tarfile

tar = tarfile.open(fn, mode='r|bz2')
for item in tar:
    tar.extract(item)
and tarfile.getmembers(), but with no speed improvement:
tar = tarfile.open(fn, mode='r|bz2')
for member in tar.getmembers():
    f = tar.extractfile(member)
    headerbytes = f.read(1024)
    headerdict = parseHeader(headerbytes)
The getmembers() method is what's taking all the time there.
Is there any way I can do this faster?
I think you should use the standard library bz2 interface: .tbz is the file extension for tar files that have been compressed with bzip2 (tar's -j option).
As #bbayles pointed out in the comments, you can open your file as a bz2.BZ2File and use seek and read:
read([size])
Read at most size uncompressed bytes, returned as a
string. If the size argument is negative or omitted, read until EOF is
reached.
seek(offset[, whence])
Move to new file position. Argument offset is a
byte count.
import bz2

f = bz2.BZ2File(path)
f.seek(512)
headerbytes = f.read(1024)
You can then parse that with your functions.
headerdict = parseHeader(headerbytes)
If you're sure that every tar archive will contain only a single bz2 file, you can simply skip the first 512 bytes when first reading the tar file (NOT the bz2 file contained in it, of course), because the tar file format has a padded (fixed size) header, after which your "real" content is stored.
A simple
f.seek(512)
instead of looping over getmembers() should do the trick.
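Putting both answers together, a runnable sketch; the .tbz is built in memory here as a tiny stand-in for the real ~1GB files, so the sizes are illustrative only:

```python
import bz2
import io
import tarfile

# Tiny stand-in for a real .tbz: a tar with one member, bz2-compressed.
inner = bytes(range(256)) * 8  # 2048-byte stand-in for the ~9GB inner file
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode='w', format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name='big.bin')
    info.size = len(inner)
    tar.addfile(info, io.BytesIO(inner))
tbz_bytes = bz2.compress(tar_buf.getvalue())

# Open the compressed stream directly, skip the fixed 512-byte tar header,
# and read only the 1024 header bytes -- no full extraction needed.
f = bz2.BZ2File(io.BytesIO(tbz_bytes))
f.seek(512)
headerbytes = f.read(1024)

print(headerbytes == inner[:1024])  # True
```

Note this shortcut assumes a plain single-member archive; formats that emit extended headers (e.g. PAX records) would shift the data past offset 512.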
I am retrieving files from an S3 bucket using the following code, and it works fine.
file = io.BytesIO()
k.get_contents_to_file(file)
Now I want to add this in memory file to a zip file. The code below takes filename as argument but I have an in memory file.
zip_file.write(filename, zip_path)
I am using python 3.4 for my project.
Try using writestr:
Signature: writestr(zinfo_or_arcname, data, compress_type=None)
Docstring: Write a file into the archive. The contents is 'data',
which may be either a 'str' or a 'bytes' instance; if it is a 'str',
it is encoded as UTF-8 first. 'zinfo_or_arcname' is either a ZipInfo
instance or the name of the file in the archive.
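A short sketch of that approach; the in-memory bytes here stand in for what k.get_contents_to_file(file) would have downloaded, and the archive name is hypothetical:

```python
import io
import zipfile

# In-memory download, standing in for the S3 object's bytes
# (k.get_contents_to_file(file) would have filled this).
file = io.BytesIO()
file.write(b'contents fetched from S3')

zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, 'w', zipfile.ZIP_DEFLATED) as zip_file:
    # writestr takes an archive name and raw bytes -- no on-disk file needed
    zip_file.writestr('backup/object.bin', file.getvalue())

with zipfile.ZipFile(zip_buf) as check:
    print(check.read('backup/object.bin'))  # b'contents fetched from S3'
```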
I am working on Python input-output and was given a CSV file (possibly gzipped).
If it is gzipped, I have to decompress it and then read it.
I tried to read the first two bytes like this:
def func(filename):
    fi = open(filename, "rb")
    byte1 = fi.read(1)
    byte2 = fi.read(1)
Then I check byte1 and byte2 to see if they are equal to 0x1f and 0x8b; if so, I decompress the file and then print every line of it.
But when I run it, I got this error:
TypeError: 'NoneType' object is not iterable
I'm new to python, can anyone help?
From what you said in the comment ("that's all I have in the function"), I would assume the issue is that the function has no return value. The caller probably tries to iterate over the result of the function call, which is None, hence the NoneType error.
You need to use endswith() to check whether a file has the .gz extension, then use the gzip module to decompress it and read its contents:
import os
import gzip

directory = r"C:\Directory_name"
for file in os.listdir(directory):
    if file.endswith(".gz"):
        print(file)
        f = gzip.open(os.path.join(directory, file), 'rb')
        file_content = f.read()
        f.close()
so here "file_content" variable will hold the data of your csv gzipped file
After some frustration with unzip(1), I've been trying to create a script that will unzip and print out raw data from all of the files inside a zip archive coming from stdin. I currently have the following, which works:
import sys, zipfile, StringIO

stdin = StringIO.StringIO(sys.stdin.read())
zipselect = zipfile.ZipFile(stdin)
filelist = zipselect.namelist()
for filename in filelist:
    print filename, ':'
    print zipselect.read(filename)
When I try to add validation to check if it truly is a zip file, however, it doesn't like it.
...
zipcheck = zipfile.is_zipfile(zipselect)
if zipcheck is not None:
    print 'Input is not a zip file.'
    sys.exit(1)
...
results in
File "/home/chris/simple/zipcat/zipcat.py", line 13, in <module>
zipcheck = zipfile.is_zipfile(zipselect)
File "/usr/lib/python2.7/zipfile.py", line 149, in is_zipfile
result = _check_zipfile(fp=filename)
File "/usr/lib/python2.7/zipfile.py", line 135, in _check_zipfile
if _EndRecData(fp):
File "/usr/lib/python2.7/zipfile.py", line 203, in _EndRecData
fpin.seek(0, 2)
AttributeError: ZipFile instance has no attribute 'seek'
I assume it can't seek because it is not a file, as such?
Sorry if this is obvious, this is my first 'go' with Python.
You should pass stdin to is_zipfile, not zipselect. is_zipfile takes a path to a file or a file object, not a ZipFile.
See the zipfile.is_zipfile documentation
You are correct that a ZipFile can't seek because it isn't a file. It's an archive, so it can contain many files.
To do this entirely in memory will take some work. The AttributeError message means that the is_zipfile method is trying to use the seek method of the file handle you provide. But standard input is not seekable, and therefore your file object for it has no seek method.
If you really, really can't store the file on disk temporarily, then you could buffer the entire file in memory (you would need to enforce a size limit for security), and then implement some "duck" code that looks and acts like a seekable file object but really just uses the byte-string in memory.
It is possible that you could cheat and buffer only enough of the data for is_zipfile to do its work, but I seem to recall that the table-of-contents for ZIP is at the end of the file. I could be wrong about that though.
Your 2011 Python 2 fragment was: StringIO.StringIO(sys.stdin.read())
In 2018, a Python 3 programmer might phrase that as io.StringIO(...).
What you actually want is the following Python 3 fragment: io.BytesIO(...).
Certainly that works well for me when using the requests module to download binary ZIP files from webservers:
zf = zipfile.ZipFile(io.BytesIO(req.content))
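For example, assuming req.content holds the downloaded ZIP bytes, the whole pipeline in Python 3 looks like this (the archive is built in memory here to stand in for the download):

```python
import io
import zipfile

# Stand-in for req.content: raw ZIP bytes as downloaded from a webserver.
src = io.BytesIO()
with zipfile.ZipFile(src, 'w') as out:
    out.writestr('a.txt', 'hello')
content = src.getvalue()

# is_zipfile accepts a seekable file object, so wrap the bytes in BytesIO
# (this is what failed with stdin, which is not seekable).
print(zipfile.is_zipfile(io.BytesIO(content)))  # True

zf = zipfile.ZipFile(io.BytesIO(content))
for name in zf.namelist():
    print(name, ':', zf.read(name))  # a.txt : b'hello'
```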