I am trying to unzip a gzipped file in Python using the gzip module. The constraint is that I receive 160 bytes of data at a time, and I need to decompress each chunk before requesting the next 160 bytes. Partial decompression is fine before requesting the next 160 bytes. The code I have is:
import gzip
import time
import StringIO
file = open('input_cp.gz', 'rb')
buf = file.read(160)
sio = StringIO.StringIO(buf)
f = gzip.GzipFile(fileobj=sio)
data = f.read()
print data
The error I am getting is IOError: CRC check failed. I am assuming this is because gzip expects the entire compressed content to be present in buf, whereas I am reading in only 160 bytes at a time. Is there a workaround for this?
Thanks
Create your own class with a read() method (and whatever else GzipFile needs from fileobj, like close and seek) and pass it to GzipFile. Something like:
class MyBuffer(object):
    def __init__(self, input_file):
        self.input_file = input_file

    def read(self, size=-1):
        if size < 0:
            size = 160
        return self.input_file.read(min(160, size))
Then use it like:
file = open('input_cp.gz', 'rb')
mybuf = MyBuffer(file)
f = gzip.GzipFile(fileobj=mybuf)
data = f.read()
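An alternative not shown above: if you only need the decompressed bytes rather than the GzipFile interface, zlib's decompressobj can consume arbitrarily small chunks without ever seeing the whole stream. A minimal sketch, assuming the 160-byte chunks come from the same file as in the question:
import zlib

# wbits=16+MAX_WBITS tells zlib to expect a gzip header;
# decompress() accepts partial input and returns whatever it can.
file = open('input_cp.gz', 'rb')
dec = zlib.decompressobj(16 + zlib.MAX_WBITS)
while True:
    buf = file.read(160)        # the 160-byte chunks from the question
    if not buf:
        break
    data = dec.decompress(buf)  # may be empty until enough input has arrived
    if data:
        print(data)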
I can't find anywhere how to open, read, and rewrite a TensorBoard event file (without depending on TensorFlow).
Using the following code yields a mysterious checksum error.
Do you know what is causing this and how I can rewrite the file?
(If you print event it looks normal.)
Here is the code:
import sys
import struct
import tensorboard.compat.proto.event_pb2 as event_pb2
import mmap
log_file = sys.argv[1]
def read(data, offset):
    header = struct.unpack_from('Q', data, offset)
    event_str = data[offset+12:offset+12+int(header[0])]
    return 12+int(header[0])+4+offset, event_str

with open(log_file, 'rb') as f:
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offset = 0
    while offset < len(data):
        offset, event_str = read(data, offset)
        event = event_pb2.Event()
        event.ParseFromString(event_str)
        with open('test.tfevents', 'ab') as w:
            w.write(event.SerializeToString())
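For context, event files use TFRecord framing, which the read() function above skips over: an 8-byte little-endian length, a 4-byte masked CRC32C of that length, the serialized event, and a 4-byte masked CRC32C of the payload. Writing only event.SerializeToString() drops that framing, which would explain the checksum error on the rewritten file. Below is a hedged sketch of the write side; it assumes the third-party crc32c package (pip install crc32c) for the CRC-32C computation:
import struct

import crc32c  # assumed third-party dependency providing CRC-32C

def masked_crc(data):
    # TFRecord masks the CRC32C value with a rotation plus a fixed constant.
    crc = crc32c.crc32c(data) & 0xFFFFFFFF
    return (((crc >> 15) | (crc << 17)) + 0xa282ead8) & 0xFFFFFFFF

def write_record(out_file, event_str):
    # length, masked CRC of the length, payload, masked CRC of the payload
    header = struct.pack('<Q', len(event_str))
    out_file.write(header)
    out_file.write(struct.pack('<I', masked_crc(header)))
    out_file.write(event_str)
    out_file.write(struct.pack('<I', masked_crc(event_str)))

Calling write_record(w, event.SerializeToString()) inside the loop, instead of the bare w.write(...), should produce a file with the framing that readers expect.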
I want to zip a stream and stream out the result. I'm doing this in AWS Lambda, which matters in terms of available disk space and other restrictions.
I'm going to use the zipped stream to write an AWS S3 object using upload_fileobj() or put(), if it matters.
I can create the archive as a file as long as the objects are small:
import zipfile
zf = zipfile.ZipFile("/tmp/byte.zip", "w")
zf.writestr(filename, my_stream.read())
zf.close()
For larger amounts of data I can create an in-memory object instead of a file:
from io import BytesIO
...
byte = BytesIO()
zf = zipfile.ZipFile(byte, "w")
....
but how can I pass the zipped stream to the output? If I call zf.close(), the stream will be closed; if I don't, the archive will be incomplete.
Instead of using Python's built-in zipfile, you can use stream-zip (full disclosure: written by me).
If you have an iterable of bytes, my_data_iter say, you can get an iterable of a zip file using its stream_zip function:
from datetime import datetime
from stream_zip import stream_zip, ZIP_64
def files():
    modified_at = datetime.now()
    perms = 0o600
    yield 'my-file-1.txt', modified_at, perms, ZIP_64, my_data_iter
my_zip_iter = stream_zip(files())
If you need a file-like object, say to pass to boto3's upload_fileobj, you can convert from the iterable with a transformation function:
def to_file_like_obj(iterable):
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj:
        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    return FileLikeObj()
my_file_like_obj = to_file_like_obj(my_zip_iter)
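A hedged usage sketch with boto3 (bucket and key names are placeholders), passing the resulting file-like object straight to upload_fileobj:
import boto3

# upload_fileobj reads from the file-like object in chunks, so the whole zip
# never has to exist in memory or on disk at once.
s3 = boto3.client('s3')
s3.upload_fileobj(my_file_like_obj, 'my-example-bucket', 'my-archive.zip')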
You might like to try the zipstream version of zipfile. For example, to compress stdin to stdout as a zip file holding the data as a file named TheLogFile using iterators:
#!/usr/bin/python3
import sys, zipstream
with zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED) as z:
    z.write_iter('TheLogFile', sys.stdin.buffer)
    for chunk in z:
        sys.stdout.buffer.write(chunk)
I have a use case where I upload hundreds of files to my S3 bucket using multipart upload. After each upload I need to make sure that the uploaded file is not corrupt (basically a data-integrity check). Currently, after uploading the file, I re-download it, compute the MD5 of the content, and compare it with the MD5 of the local file. So something like:
conn = S3Connection('access key', 'secretkey')
bucket = conn.get_bucket('bucket_name')
source_path = 'file_to_upload'
source_size = os.stat(source_path).st_size
mp = bucket.initiate_multipart_upload(os.path.basename(source_path))
chunk_size = 52428800
chunk_count = int(math.ceil(source_size / chunk_size))
for i in range(chunk_count + 1):
    offset = chunk_size * i
    bytes = min(chunk_size, source_size - offset)
    with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
        mp.upload_part_from_file(fp, part_num=i + 1, md5=k.compute_md5(fp, bytes))
mp.complete_upload()
obj_key = bucket.get_key('file_name')
print(obj_key.md5) #prints None
print(obj_key.base64md5) #prints None
content = bucket.get_key('file_name').get_contents_as_string()
# compute the md5 on content
This approach is wasteful as it doubles the bandwidth usage. I tried
bucket.get_key('file_name').md5
bucket.get_key('file_name').base64md5
but both return None.
Is there any other way to achieve md5 without downloading the whole thing?
Yes.
Use bucket.get_key('file_name').etag[1:-1]
This gets the key's MD5 without downloading its contents.
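A short sketch in the question's boto (v2) style; note that for a multipart upload, like the one in the question, the ETag is not the MD5 of the whole object but an MD5 of the part MD5s followed by a part count:
from boto.s3.connection import S3Connection

# Same placeholder credentials, bucket, and key names as in the question.
conn = S3Connection('access key', 'secretkey')
bucket = conn.get_bucket('bucket_name')
remote_md5 = bucket.get_key('file_name').etag[1:-1]
# Single-part upload: this equals hashlib.md5(data).hexdigest().
# Multipart upload: it looks like '<md5-of-part-md5s>-<part count>'.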
With boto3, I use head_object to retrieve the ETag.
import boto3
import botocore
def s3_md5sum(bucket_name, resource_name):
    try:
        md5sum = boto3.client('s3').head_object(
            Bucket=bucket_name,
            Key=resource_name
        )['ETag'][1:-1]
    except botocore.exceptions.ClientError:
        md5sum = None
    return md5sum
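A hedged usage sketch, reusing the placeholder bucket and file names from the question and comparing against a locally computed MD5 (valid only for single-part uploads, where the ETag is the plain MD5):
import hashlib

# s3_md5sum() is the helper defined above; names are placeholders.
with open('file_to_upload', 'rb') as f:
    local_md5 = hashlib.md5(f.read()).hexdigest()

if s3_md5sum('bucket_name', 'file_to_upload') == local_md5:
    print('upload verified')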
You can recover the MD5 without downloading the file from the e_tag attribute, like this:
boto3.resource('s3').Object(<BUCKET_NAME>, file_path).e_tag[1 :-1]
Then use this function to compare ordinary (single-part) S3 files:
import hashlib

def md5_checksum(file_path):
    m = hashlib.md5()
    with open(file_path, 'rb') as f:
        for data in iter(lambda: f.read(1024 * 1024), b''):
            m.update(data)
    return m.hexdigest()
Or this function for multi-part files:
def etag_checksum(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []
    with open(file_path, 'rb') as f:
        for data in iter(lambda: f.read(chunk_size), b''):
            md5s.append(hashlib.md5(data).digest())
    m = hashlib.md5(b"".join(md5s))  # join the binary digests before hashing
    return '{}-{}'.format(m.hexdigest(), len(md5s))
Finally use this function to choose between the two:
def md5_compare(file_path, s3_file_md5):
    if '-' in s3_file_md5 and s3_file_md5 == etag_checksum(file_path):
        return True
    if '-' not in s3_file_md5 and s3_file_md5 == md5_checksum(file_path):
        return True
    print("MD5 does not match for file " + file_path)
    return False
Credit to: https://zihao.me/post/calculating-etag-for-aws-s3-objects/
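A hedged usage sketch tying it together with boto3 (bucket and key are placeholders):
import boto3

# e_tag includes surrounding quotes, hence the [1:-1].
s3_md5 = boto3.resource('s3').Object('my-example-bucket', 'path/to/remote_file').e_tag[1:-1]
md5_compare('/path/to/local_file', s3_md5)

Note that etag_checksum() only reproduces a multipart ETag if chunk_size matches the part size used for the upload.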
Since 2016, the best way to do this without any additional object retrievals is by presenting the --content-md5 argument during a PutObject request. AWS will then verify that the provided MD5 matches their calculated MD5. This also works for multipart uploads and objects >5GB.
An example call from the knowledge center:
aws s3api put-object --bucket awsexamplebucket --key awsexampleobject.txt --body awsexampleobjectpath --content-md5 examplemd5value1234567== --metadata md5checksum=examplemd5value1234567==
https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/
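A hedged boto3 equivalent of the same idea (bucket, key, and file path are placeholders): S3 rejects the request with a BadDigest error if the uploaded bytes do not hash to the supplied MD5.
import base64
import hashlib

import boto3

# ContentMD5 must be the base64-encoded binary MD5 digest of the body.
with open('awsexampleobjectpath', 'rb') as f:
    body = f.read()
md5_b64 = base64.b64encode(hashlib.md5(body).digest()).decode('ascii')

boto3.client('s3').put_object(
    Bucket='awsexamplebucket',
    Key='awsexampleobject.txt',
    Body=body,
    ContentMD5=md5_b64,
)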
I've tried to read a gz file:
import gzip
import os

with open(os.path.join(storage_path, file), "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()
It works, but I need the filenames and sizes of the files included in my gz archives.
This code prints out the content of the file inside the archive.
How can I read the filename stored in the gz file?
The Python gzip module does not provide access to that information.
The source code skips over it without ever storing it:
if flag & FNAME:
    # Read and discard a null-terminated string containing the filename
    while True:
        s = self.fileobj.read(1)
        if not s or s=='\000':
            break
The filename component is optional, not guaranteed to be present (the commandline gzip -c decompression option would use the original filename sans .gz in that case, I think). The uncompressed filesize is not stored in the header; you can find it in the last four bytes instead.
To read the filename from the header yourself, you'd need to recreate the file header reading code, and retain the filename bytes instead. The following function returns that, plus the decompressed size:
import struct

from gzip import FEXTRA, FNAME

def read_gzip_info(gzipfile):
    gf = gzipfile.fileobj
    pos = gf.tell()

    # Read archive size
    gf.seek(-4, 2)
    size = struct.unpack('<I', gf.read())[0]

    gf.seek(0)
    magic = gf.read(2)
    if magic != '\037\213':
        raise IOError('Not a gzipped file')

    method, flag, mtime = struct.unpack("<BBIxx", gf.read(8))

    if not flag & FNAME:
        # Not stored in the header, use the filename sans .gz
        gf.seek(pos)
        fname = gzipfile.name
        if fname.endswith('.gz'):
            fname = fname[:-3]
        return fname, size

    if flag & FEXTRA:
        # Read & discard the extra field, if present
        gf.read(struct.unpack("<H", gf.read(2))[0])

    # Read a null-terminated string containing the filename
    fname = []
    while True:
        s = gf.read(1)
        if not s or s=='\000':
            break
        fname.append(s)

    gf.seek(pos)
    return ''.join(fname), size
Use the above function with an already-created gzip.GzipFile object:
filename, size = read_gzip_info(gzipfileobj)
GzipFile itself doesn't have this information, but:
The file name is (usually) the name of the archive minus the .gz
If the uncompressed file is smaller than 4G, then the last four bytes of the archive contain the uncompressed size:
In [14]: f = open('fuse-ext2-0.0.7.tar.gz')
In [15]: f.seek(-4, 2)
In [16]: import struct
In [17]: r = f.read()
In [18]: struct.unpack('<I', r)[0]
Out[18]: 7106560
In [19]: len(gzip.open('fuse-ext2-0.0.7.tar.gz').read())
Out[19]: 7106560
(Technically, the last four bytes are the size of the original (uncompressed) input data modulo 2**32: the ISIZE field in the member trailer, http://www.gzip.org/zlib/rfc-gzip.html)
I've solved it this way:
fl = search_files(storage_path)
for f in fl:
    with open(os.path.join(storage_path, f), "rb") as gzipfile:
        with gzip.GzipFile(fileobj=gzipfile) as datafile:
            data = datafile.read()
    print str(storage_path) + "/" + str(f[:-3]) + " : " + str(len(data)) + " bytes" #pcap file size
I don't know if it's correct.
Any suggestions?
The new code:
fl = search_files(storage_path)
for f in fl:
    with open(os.path.join(storage_path, f), "rb") as gzipfile:
        # try with modulo 2**32
        gzipfile.seek(-4, 2)
        r = gzipfile.read()
    print str(storage_path) + "/" + str(f[:-3]) + " : " + str(struct.unpack('<I', r)[0]) + " bytes" #pcap file size
Martijn's solution is really nice; I've packaged it for Python 3.6+: https://github.com/PierreSelim/gzinfo
You just need to pip install gzinfo.
In your code:
import gzinfo
info = gzinfo.read_gz_info('bar.txt.gz')
# info.name is 'foo.txt'
print(info.fname)
Here is the situation:
I get gzipped xml documents from Amazon S3
import boto
from boto.s3.connection import S3Connection
from boto.s3.key import Key
conn = S3Connection('access Id', 'secret access key')
b = conn.get_bucket('mydev.myorg')
k = Key(b)
k.key = 'documents/document.xml.gz'
I read them into a file as:
import gzip
f = open('/tmp/p', 'w')
k.get_file(f)
f.close()
r = gzip.open('/tmp/p', 'rb')
file_content = r.read()
r.close()
Question
How can I ungzip the streams directly and read the contents?
I do not want to create temp files, they don't look good.
Yes, you can use the zlib module to decompress byte streams:
import zlib
def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
The offset of 32 tells zlib to accept a gzip header and skip over it, rather than treat the input as a raw zlib stream.
The S3 key object is an iterator, so you can do:
for data in stream_gzip_decompress(k):
    # do something with the decompressed data
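A hedged boto3 variant of the same idea (reusing the question's bucket and key as placeholders); the streaming body exposes iter_chunks(), which yields the compressed bytes in pieces:
import boto3

# Nothing is written to disk: the compressed chunks are fed straight into the
# stream_gzip_decompress() generator defined above.
body = boto3.client('s3').get_object(
    Bucket='mydev.myorg',
    Key='documents/document.xml.gz',
)['Body']

for xml_bytes in stream_gzip_decompress(body.iter_chunks()):
    process(xml_bytes)  # process() is a hypothetical handler for the XML pieces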
I had to do the same thing and this is how I did it:
import gzip
import StringIO

f = StringIO.StringIO()
k.get_file(f)
f.seek(0) #This is crucial
gzf = gzip.GzipFile(fileobj=f)
file_content = gzf.read()
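A hedged Python 3 variant of the same approach (k is the boto key object from the question), using io.BytesIO since StringIO only holds text there:
import gzip
import io

# Everything stays in memory; no temp file is created.
buf = io.BytesIO()
k.get_file(buf)
buf.seek(0)  # rewind before handing the buffer to GzipFile
file_content = gzip.GzipFile(fileobj=buf).read()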
For Python 3.x and boto3:
I used BytesIO to read the compressed file into a buffer object, then used zipfile to open the decompressed stream as uncompressed data, and I was able to get the data line by line.
import io
import zipfile
import boto3
import sys
s3 = boto3.resource('s3', 'us-east-1')
def stream_zip_file():
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    buffer = io.BytesIO(obj.get()["Body"].read())
    print(buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()

if __name__ == '__main__':
    stream_zip_file()
You can use a PIPE and read the contents without downloading the file:
import subprocess

c = subprocess.Popen('zcat -c <gzip file name>', shell=True,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for row in c.stdout:
    print row
In addition, "/dev/fd/" + str(c.stdout.fileno()) will give you a FIFO file name (named pipe) that can be passed to another program.