Here is the situation:
I get gzipped xml documents from Amazon S3
import boto
from boto.s3.connection import S3Connection
from boto.s3.key import Key
conn = S3Connection('access Id', 'secret access key')
b = conn.get_bucket('mydev.myorg')
k = Key(b)
k.key = 'documents/document.xml.gz'
I read them into a local file as follows:
import gzip
f = open('/tmp/p', 'wb')  # binary mode, since the payload is gzip
k.get_file(f)
f.close()
r = gzip.open('/tmp/p', 'rb')
file_content = r.read()
r.close()
Question
How can I ungzip the streams directly and read the contents?
I do not want to create temp files; they don't look good.
Yes, you can use the zlib module to decompress byte streams:
import zlib
def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the gzip header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
The offset of 32 signals to zlib that a gzip header is expected and should be skipped.
The S3 key object is an iterator, so you can do:
for data in stream_gzip_decompress(k):
    # do something with the decompressed data
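With boto3, a similar approach works as well; the sketch below (the bucket and key names are taken from the question) feeds the streaming body's chunks through the same generator:

import boto3

s3 = boto3.resource('s3')
obj = s3.Object('mydev.myorg', 'documents/document.xml.gz')
body = obj.get()['Body']  # a botocore StreamingBody
for data in stream_gzip_decompress(body.iter_chunks()):
    # do something with the decompressed data
    pass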
I had to do the same thing and this is how I did it:
import gzip
import StringIO

f = StringIO.StringIO()
k.get_file(f)
f.seek(0)  # This is crucial
gzf = gzip.GzipFile(fileobj=f)
file_content = gzf.read()
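On Python 3, assuming the key object can still write into a file-like buffer, the same idea works with io.BytesIO:

import io
import gzip

f = io.BytesIO()
k.get_file(f)
f.seek(0)  # rewind before GzipFile reads from the buffer
with gzip.GzipFile(fileobj=f) as gzf:
    file_content = gzf.read()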
For Python 3.x and boto3:
I used BytesIO to read the compressed file into a buffer object, then used zipfile to open the decompressed stream as uncompressed data, and I was able to read it line by line.
import io
import zipfile
import boto3
import sys
s3 = boto3.resource('s3', 'us-east-1')
def stream_zip_file():
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    buffer = io.BytesIO(obj.get()["Body"].read())
    print(buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()

if __name__ == '__main__':
    stream_zip_file()
You can try PIPE and read the contents without downloading the file:

import subprocess

c = subprocess.Popen('zcat -c <gzip file name>', shell=True,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for row in c.stdout:
    print row

In addition, "/dev/fd/" + str(c.stdout.fileno()) gives you the name of a FIFO (named pipe) that can be passed to another program.
Related
I have to write a Python script that performs the following tasks:
1- Download the Movielens datasets from the url ‘http://files.grouplens.org/datasets/movielens/ml-25m.zip’
2- Download the Movielens checksum from the url ‘http://files.grouplens.org/datasets/movielens/ml-25m.zip.md5’
3- Check whether the checksum of the archive corresponds to the downloaded one
4- In case of a positive check, print the names of the files contained in the downloaded archive
This is what I wrote up to now:
from zipfile import ZipFile
from urllib import request
import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

url_datasets = 'http://files.grouplens.org/datasets/movielens/ml-25m.zip'
datasets = 'datasets.zip'
url_checksum = 'http://files.grouplens.org/datasets/movielens/ml-25m.zip.md5'

request.urlretrieve(url_datasets, datasets)
request.urlretrieve(url_checksum, checksum)
checksum = 'datasets.zip.md5'

with ZipFile(datasets, 'r') as zipObj:
    listOfiles = zipObj.namelist()
    for elem in listOfiles:
        print(elem)
So what I'm missing is a way to compare the checksum I computed with the one I downloaded, and maybe I can create a function "printFiles" that checks the checksum and, if it matches, prints the list of files.
Is there something else I can improve?
Your code isn't actually making the requests successfully (checksum is used before it is defined), and it never compares the two hashes. Here is a version that uses requests instead:
from zipfile import ZipFile
import hashlib
import requests

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, 'rb') as f:
        hash_md5.update(f.read())
    return hash_md5.hexdigest()

url_datasets = 'http://files.grouplens.org/datasets/movielens/ml-25m.zip'
datasets = 'datasets.zip'
url_checksum = 'http://files.grouplens.org/datasets/movielens/ml-25m.zip.md5'
checksum = 'datasets.zip.md5'

ds = requests.get(url_datasets, allow_redirects=True)
cs = requests.get(url_checksum, allow_redirects=True)
with open(datasets, 'wb') as f:
    f.write(ds.content)

ds_md5 = md5(datasets)
cs_md5 = cs.content.decode('utf-8').split()[0]
print(ds_md5)
print(cs_md5)

if ds_md5 == cs_md5:
    print("MATCH")
    with ZipFile(datasets, 'r') as zipObj:
        listOfiles = zipObj.namelist()
        for elem in listOfiles:
            print(elem)
else:
    print("Checksum fail")
I want to zip a stream and stream out the result. I'm doing it in AWS Lambda, which matters because of the limited disk space and other restrictions.
I'm going to use the zipped stream to write an AWS S3 object using upload_fileobj() or put(), if it matters.
I can create the archive as a file as long as I only have small objects:
import zipfile
zf = zipfile.ZipFile("/tmp/byte.zip", "w")
zf.writestr(filename, my_stream.read())
zf.close()
For larger amounts of data I can create an in-memory object instead of a file:
from io import BytesIO
...
byte = BytesIO()
zf = zipfile.ZipFile(byte, "w")
....
but how can I pass the zipped stream to the output? If I call zf.close(), the stream will be closed; if I don't, the archive will be incomplete.
Instead of using Python's built-in zipfile, you can use stream-zip (full disclosure: written by me).
If you have an iterable of bytes, my_data_iter say, you can get an iterable of a zip file using its stream_zip function:
from datetime import datetime
from stream_zip import stream_zip, ZIP_64
def files():
    modified_at = datetime.now()
    perms = 0o600
    yield 'my-file-1.txt', modified_at, perms, ZIP_64, my_data_iter
my_zip_iter = stream_zip(files())
If you need a file-like object, say to pass to boto3's upload_fileobj, you can convert from the iterable with a transformation function:
def to_file_like_obj(iterable):
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset
        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj:
        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    return FileLikeObj()
my_file_like_obj = to_file_like_obj(my_zip_iter)
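For instance, a rough sketch of handing that file-like object to boto3 (the bucket and key names here are just placeholders):

import boto3

s3 = boto3.client('s3')
s3.upload_fileobj(my_file_like_obj, 'my-bucket', 'my-archive.zip')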
You might like to try the zipstream version of zipfile. For example, to compress stdin to stdout as a zip file holding the data as a file named TheLogFile using iterators:
#!/usr/bin/python3
import sys, zipstream
with zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED) as z:
    z.write_iter('TheLogFile', sys.stdin.buffer)
    for chunk in z:
        sys.stdout.buffer.write(chunk)
I am currently trying to write a dataframe to a temp file and then upload that temp file to an S3 bucket. When I run my code, nothing happens. Any help would be greatly appreciated. The following is my code:
import csv
import pandas as pd
import boto3
import tempfile
import os
s3 = boto3.client('s3', aws_access_key_id = access_key, aws_secret_access_key = secret_key, region_name = region)
temp = tempfile.TemporaryFile()
largedf.to_csv(temp, sep = '|')
s3.put_object(temp, Bucket = '[BUCKET NAME]', Key = 'test.txt')
temp.close()
The file handle you pass to s3.put_object is at the final position; when you .read from it, it returns an empty string.
>>> df = pd.DataFrame(np.random.randint(10,50, (5,5)))
>>> temp = tempfile.TemporaryFile(mode='w+')
>>> df.to_csv(temp)
>>> temp.read()
''
A quick fix is to .seek back to the beginning...
>>> temp.seek(0)
0
>>> print(temp.read())
,0,1,2,3,4
0,11,42,40,45,11
1,36,18,45,24,25
2,28,20,12,33,44
3,45,39,14,16,20
4,40,16,22,30,37
Note that writing to disk is unnecessary; you could keep everything in memory using a buffer, something like:
from io import StringIO # on python 2, use from cStringIO import StringIO
buffer = StringIO()
# Saving df to memory as a temporary file
df.to_csv(buffer)
buffer.seek(0)
s3.put_object(Bucket='[BUCKET NAME]', Key='test.txt', Body=buffer.getvalue())
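If the frame is large, a hedged alternative is to encode the text and stream the upload with upload_fileobj (same placeholder bucket and key):

import io

csv_bytes = buffer.getvalue().encode('utf-8')
s3.upload_fileobj(io.BytesIO(csv_bytes), '[BUCKET NAME]', 'test.txt')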
How do I write an in-memory zipfile to a file?
# Create in memory zip and add files
zf = zipfile.ZipFile(StringIO.StringIO(), mode='w',compression=zipfile.ZIP_DEFLATED)
zf.writestr('file1.txt', "hi")
zf.writestr('file2.txt', "hi")
# Need to write it out
f = file("C:/path/my_zip.zip", "w")
f.write(zf) # what to do here? Also tried f.write(zf.read())
f.close()
zf.close()
StringIO.getvalue returns the content of the StringIO:
>>> import StringIO
>>> f = StringIO.StringIO()
>>> f.write('asdf')
>>> f.getvalue()
'asdf'
Alternatively, you can change the position in the file using seek:
>>> f.read()
''
>>> f.seek(0)
>>> f.read()
'asdf'
Try the following:

import zipfile
import StringIO

mf = StringIO.StringIO()
with zipfile.ZipFile(mf, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('file1.txt', "hi")
    zf.writestr('file2.txt', "hi")

with open("C:/path/my_zip.zip", "wb") as f:  # use `wb` mode
    f.write(mf.getvalue())
Modifying falsetru's answer for Python 3:
1) use io.BytesIO instead of StringIO.StringIO
StringIO in python3
2) use b"abc" instead of "abc", or
python 3.5: TypeError: a bytes-like object is required, not 'str' when writing to a file
3) encode to a binary string with str.encode(s, "utf-8")
Best way to convert string to bytes in Python 3?
import zipfile
import io

mf = io.BytesIO()
with zipfile.ZipFile(mf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('file1.txt', b"hi")
    zf.writestr('file2.txt', str.encode("hi"))
    zf.writestr('file3.txt', str.encode("hi", 'utf-8'))

with open("a.txt.zip", "wb") as f:  # use `wb` mode
    f.write(mf.getvalue())
This should also work for gzip: How do I gzip compress a string in Python?
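For reference, a minimal in-memory gzip sketch along the same lines (the output file name is just an example):

import gzip
import io

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b"hi")
with open("a.txt.gz", "wb") as f:
    f.write(buf.getvalue())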
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED

with ZipFile(read_file, 'r') as zipread:
    with ZipFile(file_write_buffer, 'w', ZIP_DEFLATED) as zipwrite:
        for item in zipread.infolist():
            # Copy all ZipInfo attributes for each file, since the defaults are not preserved
            dest = ZipInfo(item.filename)
            dest.CRC = item.CRC
            dest.date_time = item.date_time
            dest.create_system = item.create_system
            dest.compress_type = item.compress_type
            dest.external_attr = item.external_attr
            dest.compress_size = item.compress_size
            dest.file_size = item.file_size
            dest.header_offset = item.header_offset
            zipwrite.writestr(dest, zipread.read(item.filename))
If the resulting zip file reads as corrupted, or you notice missing symlinks or files with wrong timestamps, it may be because these file properties are not being copied over.
The above code snippet is how I solved the problem.
I am trying to unzip a gzipped file in Python using the gzip module. The pre-condition is that I get 160 bytes of data at a time, and I need to unzip it before I request the next 160 bytes. Partial unzipping is OK before requesting the next 160 bytes. The code I have is:
import gzip
import time
import StringIO
file = open('input_cp.gz', 'rb')
buf = file.read(160)
sio = StringIO.StringIO(buf)
f = gzip.GzipFile(fileobj=sio)
data = f.read()
print data
The error I am getting is IOError: CRC check failed. I am assuming this is because it expects the entire gzipped content to be present in buf, whereas I am reading in only 160 bytes at a time. Is there a workaround for this?
Thanks
Create your own class with a read() method (and whatever else GzipFile needs from fileobj, like close and seek) and pass it to GzipFile. Something like:
class MyBuffer(object):
    def __init__(self, input_file):
        self.input_file = input_file

    def read(self, size=-1):
        if size < 0:
            size = 160
        return self.input_file.read(min(160, size))
Then use it like:
file = open('input_cp.gz', 'rb')
mybuf = MyBuffer(file)
f = gzip.GzipFile(fileobj=mybuf)
data = f.read()
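If GzipFile still balks at the truncated stream, an alternative sketch (not from the original answer) uses zlib.decompressobj directly, which happily consumes partial input; 32 + MAX_WBITS tells zlib to expect and skip the gzip header:

import zlib

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
f = open('input_cp.gz', 'rb')
while True:
    buf = f.read(160)
    if not buf:
        break
    data = dec.decompress(buf)
    if data:
        print(data)  # process each partially decompressed piece as it arrives
f.close()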