Fast reading of gzip (text file) using io.BufferedReader in Python 3 - python

I'm trying to efficiently read in, and parse, a compressed text file using the gzip module. This link suggests wrapping the gzip file object with io.BufferedReader, like so:
import gzip, io

gz = gzip.open(in_path, 'rb')
f = io.BufferedReader(gz)
for line in f.readlines():
    pass  # do stuff
gz.close()
To do this in Python 3, I think gzip must be opened with mode='rb'. The result is that line is a bytes object, but I need line to be a text/ASCII string. Is there a more efficient way to read the file as text using BufferedReader, or will I have to decode line inside the for loop?

You can use io.TextIOWrapper to seamlessly wrap a binary stream to a text stream instead:
f = io.TextIOWrapper(gz)
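Put together with the buffered reader from the question, that looks like the following (a minimal sketch; the ascii encoding is an assumption about the data, adjust as needed):

import gzip, io

gz = gzip.open(in_path, 'rb')
f = io.TextIOWrapper(io.BufferedReader(gz), encoding='ascii')  # encoding is an assumption
for line in f:  # iterating lazily avoids loading everything the way readlines() does
    pass  # do stuff with the decoded text line
gz.close()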
Or, as @ShadowRanger pointed out, you can simply open the gzip file in text mode instead, so that the gzip module applies the io.TextIOWrapper for you:
for line in gzip.open(in_path, 'rt'):
    pass  # do stuff

Related

Convert bytes to a file object in python

I have a small application that reads local files using:
open(diefile_path, 'r') as csv_file
open(diefile_path, 'r') as file
and also uses the linecache module.
I need to expand this to files sent from a remote server.
The content received from the server is of type bytes.
I couldn't find much information about handling BytesIO objects, and I was wondering whether there is a way to convert a bytes chunk to a file-like object.
My goal is to use the APIs specified above (open, linecache).
I was able to convert the bytes into a string using data.decode("utf-8"),
but I can't use the methods above (open and linecache).
a small example to illustrate
data = b'First line\nSecond line\nThird line\n'
with open(data) as file:
    for line in file:
        print(line)
output:
First line
Second line
Third line
can it be done?
open is used to open actual files, returning a file-like object. Here, you already have the data in memory, not in a file, so you can instantiate the file-like object directly.
import io
data = b'First line\nSecond line\nThird line\n'
file = io.StringIO(data.decode())
for line in file:
    print(line.strip())
However, if what you are getting is really just a newline-separated string, you can simply split it into a list directly.
lines = data.decode().strip().split('\n')
The main difference is that the StringIO version is slightly lazier: it has a smaller memory footprint than the list, since it splits off strings only as the iterator requests them.
The answer above using StringIO needs to specify an encoding, which may cause a wrong conversion. If you want to keep the data as bytes instead, use BytesIO, from the Python documentation:
from io import BytesIO
f = BytesIO(b"some initial binary data: \x00\x01")

Convert file into BytesIO object using python

I have a file and want to convert it into BytesIO object so that it can be stored in database's varbinary column.
Please can anyone help me convert it using python.
Below is my code:
import io

f = open(filepath, "rb")
print(f.read())
myBytesIO = io.BytesIO(f)
myBytesIO.seek(0)
print(type(myBytesIO))
Opening a file with open and mode read-binary already gives you a Binary I/O object.
Documentation:
The easiest way to create a binary stream is with open() with 'b' in the mode string:
f = open("myfile.jpg", "rb")
So in normal circumstances, you'd be fine just passing the file handle wherever you need to supply it. If you really want/need to get a BytesIO instance, just pass the bytes you've read from the file when creating your BytesIO instance like so:
from io import BytesIO
with open(filepath, "rb") as fh:
    buf = BytesIO(fh.read())
This has the disadvantage of loading the entire file into memory, which might be avoidable if the code you're passing the instance to is smart enough to stream the file without keeping it in memory. Note that the example uses open as a context manager that will reliably close the file, even in case of errors.
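If you would rather avoid one large read() call, here is a sketch that copies the file in fixed-size chunks instead (the data still ends up in memory, since BytesIO is an in-memory buffer):

import shutil
from io import BytesIO

buf = BytesIO()
with open(filepath, "rb") as fh:
    shutil.copyfileobj(fh, buf)  # copies in chunks rather than one big read()
buf.seek(0)  # rewind so consumers start reading at the beginning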

Flush data written to numeric file handle?

How can I flush the content written to a file opened as a numeric file handle?
For illustration, one can do the following in Python:
f = open(fn, 'w')
f.write('Something')
f.flush()
On the contrary, I am missing a method when doing the following:
import os
fd = os.open(fn, os.O_WRONLY | os.O_CREAT)  # os.open requires flags
os.pwrite(fd, buffer, offset)
# How do I flush fd here?
Use os.fsync(fd). See docs for fsync.
Be careful if you do fsync on a file descriptor obtained from a python file object. In that case you need to flush the python file object first.
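A minimal sketch of both cases (fn and the written data are placeholders):

import os

# Raw file descriptor: sync the descriptor directly.
fd = os.open(fn, os.O_WRONLY | os.O_CREAT)
os.pwrite(fd, b'Something', 0)
os.fsync(fd)  # force the kernel to write the data to disk
os.close(fd)

# Python file object: flush the Python-level buffer first.
f = open(fn, 'w')
f.write('Something')
f.flush()             # push Python's internal buffer to the OS
os.fsync(f.fileno())  # then sync the OS buffers to disk
f.close()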

How to compress a processed text file in Python?

I have a text file which I constantly append data to. When processing is done I need to gzip the file. I tried several options like shutil.make_archive, tarfile, gzip but could not eventually do it. Is there no simple way to compress a file without actually writing to it?
Let's say I have mydata.txt file and I want it to be gzipped and saved as mydata.txt.gz.
I don't see the problem. You should be able to use e.g. the gzip module just fine, something like this:
import gzip

inf = open("mydata.txt", "rb")
outf = gzip.open("mydata.txt.gz", "wb")
outf.write(inf.read())
outf.close()
inf.close()
There's no problem with the file being overwritten, the name given to gzip.open() is completely independent of the name given to plain open().
If you want to compress a file without writing to it, you could run a shell command such as gzip using the Python libraries subprocess or popen or os.system.
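For example with subprocess (a sketch; note that gzip(1) replaces the original file unless you pass -k):

import subprocess

# Compresses mydata.txt in place, producing mydata.txt.gz.
subprocess.run(["gzip", "mydata.txt"], check=True)
# Or keep the original file alongside the archive:
subprocess.run(["gzip", "-k", "mydata.txt"], check=True)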

Python gzip refuses to read uncompressed file

I seem to remember that the Python gzip module previously allowed you to read non-gzipped files transparently. This was really useful, as it allowed to read an input file whether or not it was gzipped. You simply didn't have to worry about it.
Now, I get an IOError exception (in Python 2.7.5):
Traceback (most recent call last):
  File "tst.py", line 14, in <module>
    rec = fd.readline()
  File "/sw/lib/python2.7/gzip.py", line 455, in readline
    c = self.read(readsize)
  File "/sw/lib/python2.7/gzip.py", line 261, in read
    self._read(readsize)
  File "/sw/lib/python2.7/gzip.py", line 296, in _read
    self._read_gzip_header()
  File "/sw/lib/python2.7/gzip.py", line 190, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
If anyone has a neat trick, I'd like to hear about it. Yes, I know how to catch the exception, but I find it rather clunky to first read a line, then close the file and open it again.
The best solution for this would be to use something like https://github.com/ahupp/python-magic with libmagic. You simply cannot avoid at least reading a header to identify a file (unless you implicitly trust file extensions)
If you're feeling spartan the magic number for identifying gzip(1) files is the first two bytes being 0x1f 0x8b.
In [1]: f = open('foo.html.gz')
In [2]: print `f.read(2)`
'\x1f\x8b'
gzip.open is just a wrapper around GzipFile, you could have a function like this that just returns the correct type of object depending on what the source is without having to open the file twice:
#!/usr/bin/python
import gzip
def opener(filename):
    f = open(filename, 'rb')
    if f.read(2) == b'\x1f\x8b':  # gzip magic number
        f.seek(0)
        return gzip.GzipFile(fileobj=f)
    else:
        f.seek(0)
        return f
Maybe you're thinking of zless or zgrep, which will open compressed or uncompressed files without complaining.
Can you trust that the file name ends in .gz?
if file_name.endswith('.gz'):
    opener = gzip.open
else:
    opener = open

with opener(file_name, 'r') as f:
    ...
Read the first four bytes. If the first three are 0x1f, 0x8b, 0x08, and if the high three bits of the fourth byte are zero, then feed those four bytes and the rest of the stream to gzip decompression. Otherwise, write out the four bytes and continue reading transparently.
You should still keep the clunky solution as a backup, so that if the gzip read fails anyway, you can back up and read transparently. But it is quite unlikely that the first four bytes would mimic a gzip header so well and yet not be a gzip file.
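A Python 3 sketch of that four-byte check (open_maybe_gzip is a hypothetical name, extending the two-byte opener above):

import gzip

def open_maybe_gzip(filename):
    f = open(filename, 'rb')
    head = f.read(4)
    f.seek(0)
    # Magic bytes 0x1f 0x8b, deflate method 0x08, and the three reserved
    # high bits of the flag byte must be zero.
    if len(head) == 4 and head[:3] == b'\x1f\x8b\x08' and head[3] & 0xe0 == 0:
        return gzip.GzipFile(fileobj=f)
    return f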
You can iterate over files transparently using fileinput.input(files, openhook=fileinput.hook_compressed).
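For example (hook_compressed chooses the opener by file extension; on Python versions before 3.10, compressed files are opened in binary mode and yield bytes lines):

import fileinput

for line in fileinput.input(files=['plain.txt', 'data.txt.gz'],
                            openhook=fileinput.hook_compressed):
    print(line)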
