Decoding issue when downloading from an Amazon S3 bucket - python

I'm trying to receive a CSV file from an Amazon S3 bucket like this:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='BUCKET_NAME', Key='FILE_NAME')
data = obj['Body'].read().decode('utf-8').splitlines()
But this is what I get:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 10: invalid start byte
I thought I got the encoding of the file wrong, so I ran file FILE_NAME.csv and got UTF-8. I have also tried latin-1 and some other encodings, but all of them give me gibberish. Any help is appreciated, and thank you all!

This happens when the bytes contain non-ASCII characters that were not encoded as UTF-8, so the UTF-8 codec cannot decode them. Byte 0x94 is not a valid UTF-8 start byte, but in windows-1252 it is the right curly quotation mark, so windows-1252 is a likely candidate for this file.
The solution below works fine for me. It tries UTF-8 first, since that is the recommended encoding, and falls back to windows-1252 only if decoding fails.
def decode_message(message):
    # message must be bytes, e.g. the result of obj['Body'].read()
    try:
        return message.decode('utf-8')
    except UnicodeDecodeError:
        return message.decode('windows-1252')
The resulting string can then be encoded back to UTF-8 if needed.
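Before settling on a fallback codec, it can help to see how the offending byte decodes under each candidate. The following is a minimal diagnostic sketch, not a definitive method: the helper name `try_encodings` and the sample bytes are made up for illustration, with 0x94 placed at index 10 to mirror the error in the question.

```python
def try_encodings(raw, encodings=("utf-8", "windows-1252", "latin-1")):
    """Map each codec name to the decoded text, or to a note on where it failed."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first byte the codec rejected
            results[name] = "<fails at byte {}>".format(exc.start)
    return results


# Hypothetical sample data: byte 0x94 sits at index 10, as in the traceback
sample = b"id,name\n1,\x94x"
report = try_encodings(sample)
for codec, outcome in report.items():
    print(codec, repr(outcome))
```

In windows-1252, 0x94 is the right curly quotation mark (U+201D), whereas latin-1 maps it to an invisible control character, which may be why latin-1 output looks like gibberish even though it never raises an error.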

Related

Decoding base64 encoded string

I'm playing a game that I believe uses a base64-encoded save, and I'm trying to figure out how to get a json or something equivalent of the save in python.
First, I'm decoding it from base64 (using base64.b64decode), which gives me bytes that look like
b'Salted__\x98&v\\\x99\xa8\xbb\x10\xaa"\xbbf\xea\x99\xbf}\x95\x96\xd0\xba~\xef#\x00\x16v\x01\xa5\x11=\xfa;\xc0\xec\xcb\x95h\xfcH\...
when printed. I'm having trouble decoding the bytes, however. Doing decoded_bytes.decode() gives an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 8: invalid start byte
I've tried passing a few different encodings to decode, but none of them seem to work. It looks like the first 8 bytes are a salt, but removing those 8 and trying to decode doesn't work either. Any help would be greatly appreciated.
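The `Salted__` prefix is the marker that OpenSSL's `enc` command writes in front of password-encrypted data, followed by an 8-byte salt. Assuming the save uses that layout (the question does not confirm it), everything after the salt is ciphertext, which would explain why no text codec can decode it. The sketch below only splits the header; the helper name and the sample blob are hypothetical.

```python
import base64

OPENSSL_MAGIC = b"Salted__"  # 8-byte marker used by OpenSSL's salted format


def split_salted_blob(encoded):
    """Return (salt, ciphertext) from a base64 string, or None if the marker is absent."""
    blob = base64.b64decode(encoded)
    if len(blob) < 16 or not blob.startswith(OPENSSL_MAGIC):
        return None
    salt = blob[8:16]       # the 8 bytes right after the marker
    ciphertext = blob[16:]  # still encrypted; not text in any encoding
    return salt, ciphertext


# Hypothetical stand-in for the real save file (assumption, not the game's data)
fake_save = base64.b64encode(OPENSSL_MAGIC + bytes(range(8)) + b"\x98&v\x99").decode("ascii")
salt, ciphertext = split_salted_blob(fake_save)
print(salt.hex(), len(ciphertext))
```

Turning the ciphertext into JSON would still require the password and cipher the game uses, neither of which is known here.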

Decoding a zip file downloaded from s3

I'm trying to decode a zip file so I can create an MD5 hash of it.
When I run this
contents = response['Body'].read()
decoded = contents.decode('utf-8')
I get this error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 10: invalid start byte
What worked for me is IBM437
contents.decode('IBM437')
The whole hashing flow works with AWS's ContentMD5 check on upload.
I'm not changing how the file is uploaded; I'm only decoding it with a non-UTF-8 codec so that I can create an MD5 hash that works.
What are the caveats of doing this with IBM437, since UTF-8 seems like the "standard" encoding?
As someone suggested in the comments, we don't have to convert the binary data to a string at all.
So this is the revised approach:
import base64
import hashlib
contents = response['Body'].read()
md = hashlib.md5(contents).digest()
contents_md5 = base64.b64encode(md).decode('utf-8')
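The revised approach can be wrapped in a small helper that hashes the raw bytes and never decodes them as text. This is a minimal sketch; the function name `content_md5` and the zip-like sample bytes are assumptions for illustration, standing in for `response['Body'].read()`.

```python
import base64
import hashlib


def content_md5(data):
    """Return the base64-encoded MD5 digest of raw bytes."""
    digest = hashlib.md5(data).digest()            # 16 raw digest bytes
    return base64.b64encode(digest).decode("ascii")  # base64 output is plain ASCII


# Hypothetical zip-like bytes with 0x90 at index 10, mirroring the error above
zip_like = b"PK\x03\x04\x14\x00\x00\x00\x08\x00\x90"
print(content_md5(zip_like))
```

Only the final base64 text is decoded, and that is always ASCII, so the choice of codec for the archive itself never comes up.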

How to open a binary file stored in Google App Engine?

I have generated a binary file with word2vec, stored the resulting .bin file in my GCS bucket, and run the following code in my App Engine app handler:
gcs_file = gcs.open(filename, 'r')
content = gcs_file.read().encode("utf-8")
""" call word2vec with content so it doesn't need to read a file itself, as we don't have a filesystem in GAE """
Fails with this error:
content = gcs_file.read().encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 15: ordinal not in range(128)
A similar decode error happens if I try gcs_file.read(), or gcs_file.read().decode("utf-8").encode("utf-8").
Any ideas on how to read a binary file from GCS?
Thanks
If the file is binary, it has no character encoding to decode. UTF-8 is one way of encoding Unicode text as bytes; it only applies to string data, so it is worth reading up on what UTF-8 and ASCII represent and how they are used.
Data that was never text encoded with a specific encoding will not magically decode, which is why you are getting that error. In Python 2, calling .encode("utf-8") on a byte string first decodes it implicitly with the ASCII codec, and byte 0xf6 at position 15 is not a valid ASCII value. The usual fix is to keep the result of gcs_file.read() as bytes and not call .encode() or .decode() on it.
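To illustrate the point in Python 3 terms, the sketch below shows that binary data fails to decode as ASCII yet can be handed on unchanged as an in-memory file object. The payload is hypothetical, built so that byte 0xf6 lands at index 15 as in the error above; whether the consuming library accepts a file-like object is an assumption.

```python
import io

# Hypothetical binary payload: 0xf6 at index 15, mirroring the traceback
payload = b"\x00" * 15 + b"\xf6" + b"\x01\x02"

# Treating binary data as text fails
try:
    payload.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Keeping it as bytes works: wrap it in a file-like object and read it back
stream = io.BytesIO(payload)
roundtrip = stream.read()

print(decode_error)
print(roundtrip == payload)
```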

Trouble decoding utf-16 string

I'm using python3.3. I've been trying to decode a certain string that looks like this:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed:\xf9w\xdaH\xd2?\xcf\xbc....
and so on. However, whenever I try to decode this string using str.decode('utf-16'), I get an error saying:
'utf16' codec can't decode bytes in position 54-55: illegal UTF-16 surrogate
I'm not exactly sure how to decode this string.
Gzipped data begins with \x1f\x8b\x08, so my guess is that your data is gzipped. Try gunzipping the data before decoding.
import io
import gzip
# This raises an error (EOFError on current Python 3) because `buf` is truncated here.
# It may work if you supply the complete buf.
buf = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed:\xf9w\xdaH\xd2?\xcf\xbc'
with gzip.GzipFile(fileobj=io.BytesIO(buf)) as f:
    content = f.read()
print(content.decode('utf-16'))
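Since the buffer in the question is truncated, here is a self-contained sketch of the same idea on complete data: check for the gzip magic bytes, decompress, then decode. The helper name `maybe_gunzip` and the sample text are made up for illustration, and UTF-16 is assumed only because the question mentions it.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(data):
    """Decompress data if it starts with the gzip magic bytes, else return it unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


# Build a complete gzip buffer from known text so the round trip can be checked
original = "héllo wörld"
compressed = gzip.compress(original.encode("utf-16"))

text = maybe_gunzip(compressed).decode("utf-16")
print(text)
```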

Python zipfile sending via suds error: "'ascii' codec can't decode byte 0x8c in position 10: ordinal not in range(128)"

I get the following error when I try to send zip file content via a suds method:
'ascii' codec can't decode byte 0x8c in position 10: ordinal not in range(128)
Here is my code:
try:
    project_archive = open(os.path.join(settings.MEDIA_ROOT, 'zip/project.zip'), "rb")
    data = project_archive.read()
    client = Client(settings.UPLOAD_PROJECT_WS_URL)
    client.service.uploadProject(data)
except Exception as e:
    return HttpResponse(e)
else:
    return HttpResponse("Project was exported")
suds doesn't support SOAP file attachments (at least it didn't the last time I checked, which was a while ago).
There is a workaround here:
https://fedorahosted.org/suds/attachment/ticket/350/soap_attachments.2.py
or use a different library
Assuming that the argument type in the WSDL is xsd:base64Binary, you need to Base64-encode the data before sending it:
client.service.uploadProject(base64.b64encode(data))
In my case the server was written in JAX-WS, the function argument type was Byte[], and Base64 worked for me.
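The Base64 step can be tried on its own, without suds or a SOAP server. The sketch below is an illustration under assumptions: the helper name `to_base64_text` and the sample archive bytes are hypothetical, with 0x8c placed at index 10 to mirror the error above.

```python
import base64


def to_base64_text(data):
    """Encode raw bytes as an ASCII-only Base64 string."""
    return base64.b64encode(data).decode("ascii")


# Hypothetical zip-like bytes containing 0x8c at index 10
archive_bytes = b"PK\x03\x04\x14\x00\x00\x00\x08\x00\x8c"
encoded = to_base64_text(archive_bytes)
print(encoded)
```

Because the encoded value contains only ASCII characters, an implicit ASCII decode somewhere in the client no longer has anything to reject, and the receiver can recover the original bytes with a Base64 decode.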
The problem seems to be simply that you are trying to read a Unicode-encoded file using the ASCII codec. Refer to http://docs.python.org/2/howto/unicode.html for the official documentation on Unicode. You can also look at "Unicode (UTF-8) reading and writing to files in Python" for a similar discussion.
In short for your problem the following code should work:
import codecs
project_archive = codecs.open(os.path.join(settings.MEDIA_ROOT, 'zip/project.zip'),
"rb", "utf-8")
data = project_archive.read()
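The above solution assumes that the encoding used is utf-8. If some other codec (for example ISO-8859-1) is being used, substitute that for utf-8. Note that this only makes sense for text files: a zip archive is binary data, so reading it through a text codec such as utf-8 will typically raise a UnicodeDecodeError rather than fix the problem.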
