I have a large local file. I want to upload a gzipped version of that file to S3 using the boto library. The file is too large to gzip efficiently on disk prior to uploading, so it should be gzipped in a streaming fashion during the upload.
The boto library provides the method set_contents_from_file(), which expects a file-like object it will read from.
The gzip library provides the class GzipFile, which can be handed a target object via the parameter fileobj; it will write to this object while compressing.
I'd like to combine these two, but the one API wants to read by itself and the other wants to write by itself; neither offers the passive counterpart (being written to or being read from).
Does anybody have an idea how to combine these in a working fashion?
EDIT: I accepted one answer (see below) because it hinted at where to go, but if you have the same problem, you might find my own answer (also below) more helpful, because it implements a solution using multipart uploads.
I implemented the solution hinted at in the comments of the accepted answer by garnaat:
import cStringIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = cStringIO.StringIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with file(fileName) as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # min size for multipart upload is 5242880
                uploadPart()
It seems to work without problems. And after all, streaming is in most cases just chunking of the data. In this case, the chunks are about 10 MB each, but who cares? As long as we aren't talking about several-GB chunks, I'm fine with this.
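For reference, a minimal way to call this helper might look as follows (bucket and file names are placeholders, and this assumes boto credentials are already configured):

import boto

conn = boto.connect_s3()  # picks up credentials from the environment or ~/.boto
bucket = conn.get_bucket('my-bucket')  # placeholder bucket name
sendFileGz(bucket, 'backups/huge-file.csv', '/data/huge-file.csv')  # stored as backups/huge-file.csv.gz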
Update for Python 3:
from io import BytesIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with open(fileName, "rb") as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # min size for multipart upload is 5242880
                uploadPart()
You can also compress bytes with gzip and upload them as follows:
import gzip
import boto3

cred = boto3.Session().get_credentials()
s3client = boto3.client('s3',
    aws_access_key_id=cred.access_key,
    aws_secret_access_key=cred.secret_key,
    aws_session_token=cred.token
)
bucketname = 'my-bucket-name'
key = 'filename.gz'
s_in = b"Lots of content here"
gzip_object = gzip.compress(s_in)
s3client.put_object(Bucket=bucketname, Body=gzip_object, Key=key)
You can replace s_in with any bytes: the contents of an io.BytesIO buffer, a pickle dump, a file read in binary mode, etc.
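For example, a small local file (the name here is just a placeholder) can be gzipped in memory and uploaded the same way, as long as it fits in RAM:

with open("report.json", "rb") as f:  # hypothetical small file that fits in memory
    s3client.put_object(Bucket=bucketname, Body=gzip.compress(f.read()), Key="report.json.gz")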
If you want to upload compressed JSON, then here is a nice example: Upload compressed Json to S3
There really isn't a way to do this because S3 doesn't support true streaming input (i.e. chunked transfer encoding). You must know the Content-Length prior to upload and the only way to know that is to have performed the gzip operation first.
Related
I have a Python generator that will yield a large and unknown amount of byte data. I'd like to stream the output to GCS, without buffering to a file on disk first.
While I'm sure this is possible (e.g., I could create a subprocess of gsutil cp - <...> and just write my bytes into its stdin), I'm not sure what the recommended/supported way is, and the documentation only gives the example of uploading a local file.
How should I do this right?
The BlobWriter class makes this a bit easier:
from google.cloud import storage
from google.cloud.storage.fileio import BlobWriter

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')
writer = BlobWriter(blob)
for d in your_generator:
    writer.write(d)
writer.close()
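As a side note, if I remember correctly, newer versions of google-cloud-storage also hand you the same writer via blob.open("wb"), so an equivalent sketch would be:

with blob.open("wb") as writer:  # newer library versions return a BlobWriter here
    for d in your_generator:
        writer.write(d)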
I am currently working on a script for a Raspberry Pi that uses a SIM module to send data to an FTP server. The problem is that some of the data is quite large; I formatted it into CSV files, but they are still a bit too large to send over GPRS. Compressing them into .gz files reduces the size by a factor of five, which is great, but the only way I have to send data is line by line. I was wondering whether there is a way to send the contents of a gzip file without sending the uncompressed data. Here is my code so far:
import glob
import gzip

list_of_files = glob.glob('/home/pi/src/git/RPI/DATA/*.gz')
print(list_of_files)

for file_data in list_of_files:
    zipp = gzip.GzipFile(file_data, 'rb')
    file_content = zipp.read()
    #array = np.fromstring(file_content, dtype='f4')
    print(len(file_content))
    #AT commands to send the file_content to FTP server
Here the length returned is the length of the uncompressed data, but I want to get at the compressed contents of the gzip file without decompressing them. Is that doable?
Thanks for your help.
zipp = gzip.GzipFile(file_data,'rb')
specifically requests unzipping. If you just want to read the bare raw binary gzip data, use a regular open:
zipp = open(file_data,'rb')
You don't need to read the file into memory to fetch its size, though. The os.stat function lets you get information about a file's metadata without opening it.
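For instance, a minimal sketch (reusing the file names from the question) that reads the size from metadata and sends the compressed bytes in chunks:

import os

for file_data in list_of_files:
    compressed_size = os.stat(file_data).st_size  # compressed size on disk, no read needed
    print(compressed_size)
    with open(file_data, 'rb') as raw:
        while True:
            block = raw.read(1024)
            if not block:
                break
            # send `block` (raw gzip bytes) via your AT commands / FTP upload here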
Given a large gzip object in S3, what is a memory efficient (e.g. streaming) method in python3/boto3 to decompress the data and store the results back into another S3 object?
There is a similar question asked previously. However, all of the answers use a methodology in which the contents of the gzip file are first read into memory (e.g. BytesIO). These solutions are not viable for objects that are too big to fit in main memory.
For large S3 objects the contents need to be read, decompressed on the fly, and then written to a different S3 object in some chunked fashion.
Thank you in advance for your consideration and response.
You can use streaming methods with boto / s3 but you have to define your own file-like objects AFAIK.
Luckily there's smart_open which handles that for you; it also supports GCS, Azure, HDFS, SFTP and others.
Here's an example using a large sample of sales data:
import boto3
from smart_open import open

session = boto3.Session()  # you need to set auth credentials here if you don't have them set in your environment
chunk_size = 1024 * 1024  # 1 MB

f_in = open("s3://mybucket/2m_sales_records.csv.gz", transport_params=dict(session=session), encoding="utf-8")
f_out = open("s3://mybucket/2m_sales_records.csv", "w", transport_params=dict(session=session))

byte_count = 0
while True:
    data = f_in.read(chunk_size)
    if not data:
        break
    f_out.write(data)
    byte_count += len(data)
    print(f"wrote {byte_count} bytes so far")

f_in.close()
f_out.close()
The sample file has 2 million lines and it's 75 MB compressed and 238 MB uncompressed.
I uploaded the compressed file to mybucket and ran the code which downloaded the file, extracted the contents in memory and uploaded the uncompressed data back to S3.
On my computer the process took around 78 seconds (highly dependent on Internet connection speed) and never used more than 95 MB of memory; I think you can lower the memory requirements if need be by overriding the part size for S3 multipart uploads in smart_open.
DEFAULT_MIN_PART_SIZE = 50 * 1024**2
"""Default minimum part size for S3 multipart uploads"""
MIN_MIN_PART_SIZE = 5 * 1024 ** 2
"""The absolute minimum permitted by Amazon."""
I'm relatively new to the Python programming language and I ran into a problem with the zstandard module.
I'm currently working with the replay files of Halite.
Since they are compressed with zstandard, I have to use this module. And if I read a file, everything is fine! I can decompress the ".hlt" files.
But I've done some transformations of the JSON data that I want to save to disk for later use. I find it very useful to store the data compressed again, so I used the compressor. The compression works fine, too. However, if I open the file I just created again, I get an error message reading: "zstd.ZstdError: decompression error: Unknown frame descriptor".
Have a look at my code below:
def getFileData(self, filename):
    with open(filename, "rb") as file:
        data = file.read()
    return data

def saveDataToFile(self, filename, data):
    with open(filename, "bw") as file:
        file.write(data)

def transformCompressedToJson(self, data, beautify=0):
    zd = ZstdDecompressor()
    decompressed = zd.decompress(data, len(data))
    return json.loads(decompressed)

def transformJsonToCompressed(self, jsonData, beautify=0):
    zc = ZstdCompressor()
    if beautify > 0:
        jsonData = json.dumps(jsonData, sort_keys=True, indent=beautify)
    objectCompressor = zc.compressobj()
    compressed = objectCompressor.compress(jsonData.encode())
    return objectCompressor.flush()
And i am using it here:
rp = ReplayParser()
gameDict = rp.parse('replays/replay-20180215-152416+0100--4209273584-160-160-278627.hlt')
compressed = rp.transformJsonToCompressed(json.dumps(gameDict, sort_keys=False, indent=0))
rp.saveDataToFile("test.cmp", compressed)
t = rp.getFileData('test.cmp')
j = rp.transformCompressedToJson(t)  # <- here is where the error occurs
print(j)
The rp.parse(..) function just transforms the data, i.e. it builds a dictionary. rp.parse(..) also calls transformCompressedToJson, so that function works fine for the .hlt file.
Hopefully, you guys can help me with this.
Greetings,
Noixes
In transformJsonToCompressed(), you are throwing away the result of the .compress() method (which is likely going to be the bulk of the output data), and instead returning only the result of .flush() (which will just be the last little bit of data remaining in buffers). The normal way to use a compression library like this would be to write each chunk of compressed data directly to the output file as it is generated. Your code isn't structured to allow that (the function knows nothing about the file the data will be written to), so instead you could concatenate the two chunks of compressed data and return that.
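In other words, under that reading of the code, a minimal fix would be to return both pieces (or simply use ZstdCompressor().compress() for one-shot data):

def transformJsonToCompressed(self, jsonData, beautify=0):
    zc = ZstdCompressor()
    if beautify > 0:
        jsonData = json.dumps(jsonData, sort_keys=True, indent=beautify)
    objectCompressor = zc.compressobj()
    compressed = objectCompressor.compress(jsonData.encode())
    compressed += objectCompressor.flush()  # keep the final buffered chunk, not just this
    return compressed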
I have an app which manages a set of files, but those files are actually stored in Rackspace's CloudFiles, because most of the files will be ~100GB. I'm using CloudFiles' TempURL feature to allow downloading individual files, but sometimes the user will want to download a set of files. Downloading all those files and generating a local zip file is impossible, since the server only has 40GB of disk space.
From the user view, I want to implement it the way GMail does when you get an email with several pictures: It gives you a link to download a Zip file with all the images in it, and the download is immediate.
How to accomplish this with Python/Django? I have found ZipStream and looks promising because of the iterator output, but it still only accepts filepaths as arguments, and the writestr method would need to fetch all the file data at once (~100GB).
Since Python 3.5 it is possible to create a zip archive of huge files/folders as a stream of chunks, by writing to an unseekable stream. So there is no need to use ZipStream now.
See my answer here.
And live example here: https://repl.it/#IvanErgunov/zipfilegenerator
If you don't have a filepath, but have chunks of bytes, you can drop open(path, 'rb') as entry from the example and replace iter(lambda: entry.read(16384), b'') with your own iterable of bytes. Then prepare the ZipInfo manually:
zinfo = ZipInfo(filename='any-name-of-your-non-existent-file', date_time=time.localtime(time.time())[:6])
zinfo.compress_type = zipfile.ZIP_STORED
# permissions:
if zinfo.filename[-1] == '/':
    # directory
    zinfo.external_attr = 0o40775 << 16  # drwxrwxr-x
    zinfo.external_attr |= 0x10  # MS-DOS directory flag
else:
    # file
    zinfo.external_attr = 0o600 << 16  # ?rw-------
You should also remember that the zipfile module writes chunks of its own size. So, if you feed it a 512-byte piece, the stream will only receive data when, and in whatever size, the zipfile module decides to emit it. That depends on the compression algorithm, but I don't think it is a problem, because the zipfile module emits small chunks of <= 16384 bytes.
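To make the idea concrete, here is a minimal sketch of that unseekable-stream approach, under the assumptions that the incoming data is an iterable of byte chunks and that zf.open(..., mode='w') is available (Python 3.6+); the names UnseekableStream and zip_stream are placeholders:

import time
import zipfile
from io import RawIOBase

class UnseekableStream(RawIOBase):
    # Write-only in-memory buffer; collects whatever the zipfile module emits.
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('Stream was closed!')
        self._buffer += b
        return len(b)

    def get(self):
        chunk, self._buffer = self._buffer, b''
        return chunk

def zip_stream(chunks_of_bytes, arcname='payload.bin'):
    # Yields pieces of the zip archive as the input chunks are consumed.
    stream = UnseekableStream()
    with zipfile.ZipFile(stream, mode='w') as zf:
        zinfo = zipfile.ZipInfo(filename=arcname, date_time=time.localtime(time.time())[:6])
        zinfo.compress_type = zipfile.ZIP_STORED
        with zf.open(zinfo, mode='w', force_zip64=True) as entry:
            for chunk in chunks_of_bytes:
                entry.write(chunk)
                yield stream.get()
    yield stream.get()  # data descriptor and central directory, written on close

In Django, the resulting generator can be passed straight to a StreamingHttpResponse, so the archive is streamed to the client as it is built.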
You can use https://pypi.python.org/pypi/tubing. Here's an example using S3; you could pretty easily create a Rackspace CloudFiles Source. Create a custom Writer (instead of sinks.Objects) to stream the data somewhere else, and custom Transformers to transform the stream.
from tubing.ext import s3
from tubing import pipes, sinks
output = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | sinks.Objects()
print(len(output))
Check this out - it's part of the Python Standard Library:
http://docs.python.org/3/library/zipfile.html#zipfile-objects
You can give it an open file or file-like object.