Read gzip file from s3 bucket - python

I'm trying to read a gzip file from an S3 bucket. Here is my attempt:
import gzip

import boto3

s3client = boto3.client(
    's3',
    region_name='us-east-1'
)
bucketname = 'wind-obj'
file_to_read = '20190101_0000.gz'
fileobj = s3client.get_object(
    Bucket=bucketname,
    Key=file_to_read
)
filedata = fileobj['Body'].read()
To open the gzip file I then call:
gzip.open(filedata,'rb')
but this raises:
ValueError: embedded null byte
So I tried decoding the data first:
contents = filedata.decode('utf-8')
which raises a different error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Decoding with ISO-8859-1 succeeds, but opening the result with gzip.open then fails with the same error as before.
Is there another way to pull the data from S3, for example via a URL?

gzip.open expects a filename or an already opened file object, but you are passing it the downloaded data directly. Try using gzip.decompress instead:
filedata = fileobj['Body'].read()
uncompressed = gzip.decompress(filedata)
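As a minimal sketch of the suggestion above, the snippet below decompresses gzip bytes that are already in memory, once in a single call and once through a file-like wrapper for line-by-line reading. The helper names gunzip_bytes and gunzip_lines and the sample payload are made up for illustration; the sample stands in for the bytes returned by fileobj['Body'].read(), and UTF-8 text content is assumed.

```python
import gzip
import io


def gunzip_bytes(data: bytes) -> bytes:
    """Decompress a complete gzip payload held in memory."""
    return gzip.decompress(data)


def gunzip_lines(data: bytes, encoding: str = "utf-8"):
    """Yield decoded text lines by wrapping the bytes in a file-like object."""
    # gzip.open accepts an existing file object as well as a filename.
    with gzip.open(io.BytesIO(data), "rt", encoding=encoding) as fh:
        for line in fh:
            yield line.rstrip("\n")


# Hypothetical stand-in for fileobj['Body'].read(); the content is invented.
sample = gzip.compress("time,speed\n0000,4.2\n".encode("utf-8"))

raw = gunzip_bytes(sample)
lines = list(gunzip_lines(sample))
```

Passing the raw bytes straight to gzip.open fails because a bytes argument is treated as a filename, which is where the "embedded null byte" error comes from.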

Related

Why do I get a decode error when using json.load in Python?

I'm trying to open a JSON file but get a decode error, and I can't find a solution. How can I decode this data?
The code gives the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 3765: invalid start byte
import json
url = 'users.json'
with open(url) as json_data:
    data = json.load(json_data)
That means the data you're trying to decode isn't encoded in UTF-8.
EDIT:
You can decode it yourself before parsing it with json, using something like this:
with open(url, 'rb') as f:
    data = f.read()
data_str = data.decode("utf-8", errors='ignore')
json.loads(data_str)  # loads() parses a str; load() expects a file object
https://www.tutorialspoint.com/python/string_decode.htm
Be careful: you WILL lose some data in this process, because errors='ignore' silently drops every byte that is not valid UTF-8. A safer way is to decode with the same encoding that was used to write your JSON file, or to store raw data bytes in something like base64.
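A minimal sketch of the safer route: open the file with an explicit encoding and let json.load read it, trying a short list of candidates in order. The helper name load_json and the candidate list are assumptions; cp1252 is only a guess at what wrote the file (byte 0xf6 is "ö" in that encoding), so substitute the real encoding if you know it.

```python
import json
import os
import tempfile


def load_json(path, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (data, encoding_used)."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as fh:
                return json.load(fh), enc
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings could decode " + path)


# Demo file written as cp1252, so it contains the byte 0xf6 (invented content).
tmp = tempfile.NamedTemporaryFile(suffix=".json", delete=False)
tmp.write('{"name": "Bj\u00f6rn"}'.encode("cp1252"))
tmp.close()

data, used = load_json(tmp.name)
os.remove(tmp.name)
```

Unlike errors='ignore', this keeps every character, provided one of the candidates is the encoding the file was actually written in.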

How to return file contents from controller?

I'm trying to return the contents of an image file via a Python Connexion application generated from an OpenAPI v2 spec file using swagger-codegen and the python-flask language setting. In my controller module, I simply do the following:
def file_contents_get(file_id):
    file = app.datastore.get_instance().get_file(file_id)
    with open(file.path, "rb") as f:
        return f.read()
However, this results in the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What is the proper way to return a file's contents? Note that I don't want the file as an attachment but rather inline.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte - even though I opened the file in mode 'rb'

I'm writing an HTTP server, though that detail shouldn't matter here.
When I try to decode image data (after 'data = file.read()'), I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I opened the file in 'rb' mode.
Other people usually hit this error because they open the file in 'r' mode. So what is causing it here?
def get_content_file(file_path):
    """
    Gets a full path to a file and returns the content of it.
    file_path must be a valid path.
    :param file_path: str (path)
    :return: str (data)
    """
    print(file_path)
    file = open(file_path, 'rb')
    data = file.read()
    file.close()
    return data.decode()
I suggest confirming the encoding of the file at 'file_path'. Open the file in Notepad++ and check the lower right corner, which shows the file's encoding and whether it carries a byte order mark (BOM). If the encoding is not the one your code expects, or a BOM is present, use 'Save as' to write the file in the required format.
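The same check can be sketched in Python by looking at the first bytes of the data. The helper name sniff_bom is made up, and the codec names it returns are just the standard ones that consume the matching BOM. Note that a None result does not prove the data is text: JPEG image data also begins with 0xff (bytes FF D8) and has no text encoding at all, so it should stay as bytes rather than be decoded.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


utf16_text = "hi".encode("utf-16")   # encoder prepends a BOM
jpeg_header = b"\xff\xd8\xff\xe0"    # first bytes of a typical JPEG file

text_codec = sniff_bom(utf16_text)
image_codec = sniff_bom(jpeg_header)
```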

How to decompress a GZIP file pulled from SFTP in Python3 the same way Mac OS's gunzip does it?

I've been stuck for hours on something that should have taken only a few minutes.
I have the following code which pulls a gzipped CSV file from a datastore:
from ftplib import FTP_TLS
import gzip
import csv
ftps = FTP_TLS('waws-prod.net')
ftps.login(user='foo', passwd='bar')
resp = ftps.retrbinary('RETR data/WFSIV0606201701.700.csv.gz', gzip.open('WFSIV0606201701.700.csv.gz', 'wb').write)
The file appears in the working directory, and the macOS decompression tool unpacks it to the original CSV perfectly.
However, if I decompress this file with the gzip library, I can't get a string that decodes as UTF-8:
f=gzip.GzipFile('WFSIV0606201701.700.csv.gz', 'rb')
s = f.read()
I get what appear to be UTF-8 byte strings, but the UTF-8 decoder can't parse them:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
However, if I download the file directly from the SFTP server using FileZilla and run the gzip.GzipFile code above, it reads perfectly. Something must be wrong with my downloader/reader, but I have no idea what.
resp = ftps.retrbinary('RETR data/WFSIV0606201701.700.csv.gz', gzip.open('WFSIV0606201701.700.csv.gz', 'wb').write)
This line downloads a compressed file, and then compresses it again when writing it to disk.
Replace gzip.open(...).write with open(...).write to write the compressed file directly.
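The effect can be reproduced without a server. In the sketch below, fake_retrbinary is a hypothetical stand-in for FTP_TLS.retrbinary that simply feeds already-compressed bytes to a callback in chunks; the payload and file names are invented.

```python
import gzip
import os
import tempfile

# What the server already stores: a gzip-compressed CSV (invented content).
payload = gzip.compress(b"id,value\n1,foo\n")


def fake_retrbinary(callback, data=payload, blocksize=8):
    """Hypothetical stand-in for FTP.retrbinary: pass data to callback in chunks."""
    for i in range(0, len(data), blocksize):
        callback(data[i:i + blocksize])


with tempfile.TemporaryDirectory() as tmp:
    wrong = os.path.join(tmp, "wrong.csv.gz")
    right = os.path.join(tmp, "right.csv.gz")

    with gzip.open(wrong, "wb") as fh:   # compresses the compressed data again
        fake_retrbinary(fh.write)
    with open(right, "wb") as fh:        # writes the downloaded bytes unchanged
        fake_retrbinary(fh.write)

    with gzip.open(wrong, "rb") as fh:
        wrong_once = fh.read()           # still gzip data after one pass
    with gzip.open(right, "rb") as fh:
        right_once = fh.read()           # the original CSV bytes
```

After one decompression the double-compressed file still begins with the gzip magic bytes 1f 8b, which matches the "byte 0x8b in position 1" in the error message.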

Parse text file from content-type=application/zip and base64 encoding in AWS SES

On amazon SES, I have a rule to save incoming emails to S3 buckets. Amazon saves these in MIME format.
These emails have a .txt attachment, which shows up in the MIME file with content-type=text/plain, Content-Disposition=attachment ... .txt, and Content-Transfer-Encoding=quoted-printable or base64.
I am able to parse it fine using Python.
I have a problem decoding the content of the .txt attachment when it is compressed (i.e., content-type: application/zip); it behaves as if the encoding weren't base64.
My code:
import base64
s = unicode(base64.b64decode(attachment_content), "utf-8")
throws the error:
Traceback (most recent call last):
File "<input>", line 796, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcf in position 10: invalid continuation byte
Below are the first few lines of the "base64" string in attachment_content, which by the way has length 53683 plus "==" at the end; I thought the length of a base64 string should be a multiple of 4 (??).
So maybe the decoding fails because the compression changes attachment_content, and I need some other operation before or after decoding it? I really have no idea.
UEsDBBQAAAAIAM9Ah0otgkpwx5oAADMTAgAJAAAAX2NoYXQudHh0tL3bjiRJkiX23sD+g0U3iOxu
REWGu8c1l2Ag8lKd0V2ZWajM3kLuC6Hubu5uFeZm3nYJL6+n4T4Ry8EOdwCSMyQXBRBLgMQ+7CP5
QPBj5gdYn0CRI6JqFxWv7hlyszursiJV1G6qonI5cmQyeT6dPp9cnCaT6Yvp5Yvz6xfJe7cp8P/k
1SbL8xfJu0OSvUvr2q3TOnFVWjxrknWZFeuk2VRlu978s19MRvNMrHneOv51SOZlGUtMLYnfp0nd
...
I have also tried "latin-1", but got gibberish.
The problem was that, after base64 decoding, I was dealing with a zip file (bytes like "PK \x03 \x04 \X3C \Xa \x0c ..."), and I needed to unzip it before converting it to UTF-8 unicode.
This code worked for me:
import email
# Parse results from email
received_email = email.message_from_string(email_text)
for part in received_email.walk():
    c_type = part.get_content_type()
    c_enco = part.get('Content-Transfer-Encoding')
    attachment_content = part.get_payload()
    if c_enco == 'base64':
        import base64
        decoded_file = base64.b64decode(attachment_content)
        print("File decoded from base64")
        if c_type == "application/zip":
            from cStringIO import StringIO
            import zipfile
            zfp = zipfile.ZipFile(StringIO(decoded_file), "r")
            unzipped_list = zfp.open(zfp.namelist()[0]).readlines()
            decoded_file = "".join(unzipped_list)
            print('And un-zipped')
        result = unicode(decoded_file, "utf-8")
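The code above is Python 2 (cStringIO, unicode). A rough Python 3 sketch of the same idea follows, assuming the attachments hold UTF-8 text. The function name extract_text_attachments and the demo message are made up for illustration; get_payload(decode=True) is used so the email package undoes the base64 or quoted-printable transfer encoding itself.

```python
import email
import io
import zipfile
from email.message import EmailMessage


def extract_text_attachments(raw_email: bytes, encoding: str = "utf-8"):
    """Return {filename: text} for zipped or plain text attachments."""
    msg = email.message_from_bytes(raw_email)
    texts = {}
    for part in msg.walk():
        if part.is_multipart():
            continue
        # decode=True reverses the Content-Transfer-Encoding and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        if part.get_content_type() == "application/zip":
            # Unzip in memory first, then decode each member to text.
            with zipfile.ZipFile(io.BytesIO(payload)) as zf:
                for name in zf.namelist():
                    texts[name] = zf.read(name).decode(encoding)
        elif part.get_content_disposition() == "attachment":
            texts[part.get_filename()] = payload.decode(encoding)
    return texts


# Build a small demo email with a zipped text attachment (invented content).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes.txt", "hello from the zip\n")

demo = EmailMessage()
demo["Subject"] = "demo"
demo.set_content("see attachment")
demo.add_attachment(buf.getvalue(), maintype="application",
                    subtype="zip", filename="notes.zip")

result = extract_text_attachments(demo.as_bytes())
```

The order of operations is the point: transfer-decode, then unzip, then decode the bytes to text. Decoding the zip bytes directly as UTF-8 is what produced the original UnicodeDecodeError.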
