I'm trying to decode a zip file so I can create an MD5 hash of it.
When I run this:
contents = response['Body'].read()
decoded = contents.decode('utf-8')
I get this error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 10: invalid start byte
What worked for me is IBM437
contents.decode('IBM437')
The whole hashing flow passes AWS's ContentMD5 check on upload.
I'm not changing how the file is uploaded; I'm only decoding it with a non-UTF-8 codec to create an MD5 hash that works.
What are the caveats of doing this with IBM437, since UTF-8 seems like the "standard" encoding?
As someone suggested in the comments, we don't have to convert the binary data to a string at all.
So this would be the revised approach:
contents = response['Body'].read()
md = hashlib.md5(contents).digest()
contents_md5 = base64.b64encode(md).decode('utf-8')
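To make the caveat concrete, here is a minimal sketch of why IBM437 happened to work and why hashing the raw bytes is safer. The `content_md5` helper and the `sample` bytes are hypothetical stand-ins for the S3 response body, not part of the original code.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return the base64-encoded MD5 digest of raw bytes."""
    digest = hashlib.md5(data).digest()
    return base64.b64encode(digest).decode('ascii')


# Hypothetical stand-in for response['Body'].read(); byte 0x90 sits at position 10.
sample = b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x90\x01'

# Hashing the bytes directly needs no text codec at all.
direct = content_md5(sample)

# UTF-8 rejects the data because 0x90 cannot start a UTF-8 sequence.
try:
    sample.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# cp437 (IBM437) assigns a character to every byte value, so decoding never
# fails and encoding back with the same codec restores the original bytes.
round_trip = content_md5(sample.decode('cp437').encode('cp437'))

# Re-encoding the decoded text with any other codec changes the bytes,
# and therefore the hash.
mismatched = content_md5(sample.decode('cp437').encode('utf-8'))
```

The hash only matches when the text is encoded back with exactly the codec used to decode it, so the detour through a string adds a failure mode without adding anything useful.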
Related
I'm trying to receive a CSV file from an Amazon S3 bucket like this:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='BUCKET_NAME', Key='FILE_NAME')
data = obj['Body'].read().decode('utf-8').splitlines()
But this is what I get:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 10: invalid start byte
I thought I got the encoding of the file wrong, so I ran file FILE_NAME.csv and got UTF-8. I have also tried latin-1 and some other encodings, but all of them give me gibberish. Any help is appreciated, and thank you all!
This happens when the data contains non-ASCII characters that are not valid UTF-8, which can occur when the file was written with a different encoding. In that case, try decoding with windows-1252.
The solution below works fine for me. It first tries UTF-8, since that is the recommended encoding, and falls back to windows-1252 if decoding raises an error.
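message = b'some message with special characters'  # bytes, e.g. obj['Body'].read()

def decode_message(message):
    try:
        return message.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to windows-1252
        return message.decode('windows-1252')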
Now you can even encode it back with UTF-8.
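A slightly more general sketch of the same fallback idea follows. The helper name `decode_with_fallback` and the sample byte strings are made up for illustration. Note that Python's windows-1252 codec leaves a few byte values (such as 0x81) undefined, so the fallback itself can fail, and a successful decode does not prove the guessed encoding is the right one.

```python
def decode_with_fallback(raw: bytes, encodings=('utf-8', 'windows-1252')) -> str:
    """Try each encoding in order and return the first successful decode."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark undecodable bytes with U+FFFD.
    return raw.decode('utf-8', errors='replace')


# Valid UTF-8 is decoded on the first attempt.
utf8_text = decode_with_fallback(b'caf\xc3\xa9')

# 0x94 is not valid UTF-8, but windows-1252 maps it to a right curly quote.
quoted_text = decode_with_fallback(b'He said \x94hi\x94')

# 0x81 is undefined in windows-1252 as well, so the last resort is used.
unknown_text = decode_with_fallback(b'\x81')
```

If the fallback result still looks like gibberish, the file is probably in yet another encoding or is not text at all.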
I have generated a binary file with the word2vec, stored the resulting .bin file to my GCS bucket, and ran the following code in my App Engine app handler:
gcs_file = gcs.open(filename, 'r')
content = gcs_file.read().encode("utf-8")
""" call word2vec with content so it doesn't need to read a file itself, as we don't have a filesystem in GAE """
Fails with this error:
content = gcs_file.read().encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 15: ordinal not in range(128)
A similar decode error happens if I try gcs_file.read(), or gcs_file.read().decode("utf-8").encode("utf-8").
Any ideas on how to read a binary file from GCS?
Thanks
If it is binary, then it does not have a character encoding, which is what UTF-8 is. UTF-8 is just one possible binary encoding of the Unicode character set (string data). You need to go back and read up on what UTF-8 and ASCII represent and how they are used.
If the data was not text encoded with a specific encoding, it is not going to magically decode, which is why you are getting that error. "can't decode byte 0xf6 in position 15" means that 0xf6 is not a valid ASCII value.
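As a rough illustration of that point in Python 3, the sketch below writes some binary data to a temporary local file (a stand-in for the GCS object, since no GCS API is used here), reads it back in binary mode, and shows that the bytes are usable as-is while an ASCII decode fails on 0xf6. The `payload` value is invented for the example.

```python
import os
import tempfile

# Hypothetical binary payload with 0xf6 at position 15.
payload = bytes(range(15)) + b'\xf6' + b'rest'

# Write the payload to a temporary file that stands in for the stored .bin file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Binary mode returns bytes; no encode() or decode() step is involved.
    with open(path, 'rb') as fh:
        content = fh.read()
finally:
    os.remove(path)

# Treating the same bytes as ASCII text fails at the first byte above 0x7f.
try:
    content.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```

The bytes in `content` can be handed directly to whatever consumes the binary format.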
I'm using python3.3. I've been trying to decode a certain string that looks like this:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed:\xf9w\xdaH\xd2?\xcf\xbc....
It keeps going on. However, whenever I try to decode this string using str.decode('utf-16'), I get an error saying:
'utf16' codec can't decode bytes in position 54-55: illegal UTF-16 surrogate
I'm not exactly sure how to decode this string.
Gzipped data begins with \x1f\x8b\x08, so my guess is that your data is gzipped. Try gunzipping the data before decoding.
import io
import gzip
# this raises IOError because `buf` is incomplete. It may work if you supply the complete buf
buf = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed:\xf9w\xdaH\xd2?\xcf\xbc'
with gzip.GzipFile(fileobj=io.BytesIO(buf)) as f:
    content = f.read()
print(content.decode('utf-16'))
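Since the buffer in the question is truncated, here is a self-contained sketch of the same idea using a complete gzip stream built in memory. The `maybe_gunzip` helper and the sample text are hypothetical, and UTF-16 is assumed as the inner text encoding only because the question uses it.

```python
import gzip


def maybe_gunzip(buf: bytes) -> bytes:
    """Decompress buf if it starts with the gzip magic bytes, else return it unchanged."""
    if buf[:2] == b'\x1f\x8b':
        return gzip.decompress(buf)
    return buf


# Build a complete gzip stream so the example can actually run.
original = 'h\u00e9llo w\u00f6rld'
compressed = gzip.compress(original.encode('utf-16'))

# Decompress first, then decode the resulting bytes as text.
text = maybe_gunzip(compressed).decode('utf-16')

# Data without the magic bytes passes through untouched.
plain = maybe_gunzip(b'not gzipped')
```

The order matters: the text encoding applies to the decompressed bytes, not to the gzip container.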
I am trying to read a PDF file with the following content:
%PDF-1.4\n%âãÏÓ
If I read it with open, it works, but if I try codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I get the following exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Do you know a way to get a unicode string from the content of this file?
PS: I am using python 2.7
PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.
For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:
f = codecs.open(filename, encoding="ISO8859-1", mode="rb")
But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.
The issue of trying to interpret arbitrary binary data as text aside, 0xe2 is â in Latin-1, not UTF-8. You're using the wrong codec.
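A small sketch of what these answers describe, written for Python 3 rather than the asker's Python 2.7: the `header` bytes below are reconstructed from the PDF header shown in the question and are only an illustration.

```python
# The first bytes of the PDF from the question; 0xe2 sits at position 10.
header = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3'

# UTF-8 fails: 0xe2 announces a multi-byte sequence, but 0xe3 is not a
# continuation byte.
try:
    header.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# ISO8859-1 (Latin-1) maps every byte to the code point with the same value,
# so decoding always succeeds and encoding back is lossless.
as_text = header.decode('iso8859-1')
round_trip = as_text.encode('iso8859-1')
```

The Latin-1 string is only a relabelling of the bytes; it carries no more meaning than the bytes themselves, which is why reading in binary mode is the simpler choice.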
I get the following error when I try to send zip file content via a suds method:
'ascii' codec can't decode byte 0x8c in position 10: ordinal not in range(128)
Here is my code:
try:
    project_archive = open(os.path.join(settings.MEDIA_ROOT, 'zip/project.zip'), "rb")
    data = project_archive.read()
    client = Client(settings.UPLOAD_PROJECT_WS_URL)
    client.service.uploadProject(data)
except Exception as e:
    return HttpResponse(e)
else:
    return HttpResponse("Project was exported")
suds doesn't support SOAP file attachments (not the last time I checked, and it has been a while).
Workaround here:
https://fedorahosted.org/suds/attachment/ticket/350/soap_attachments.2.py
Or use a different library.
Assuming that in the WSDL the argument type is xsd:base64Binary, you need to:
client.service.uploadProject(base64.b64encode(data))
In my case the server was written in JAX-WS, the function argument type was Byte[], and Base64 worked for me.
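The reason Base64 helps can be sketched without suds at all: it turns arbitrary bytes into plain ASCII, which survives any text-based transport, and the receiver can restore the exact bytes. The `data` value below is a made-up stand-in for the zip file content.

```python
import base64

# Hypothetical zip-like payload with 0x8c at position 10.
data = b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x8c\x01'

# b64encode returns ASCII-only bytes; decode them to get a plain string.
encoded = base64.b64encode(data).decode('ascii')

# The receiving side reverses the step to get the original bytes back.
restored = base64.b64decode(encoded)
```

Whether the service expects Base64 depends on its WSDL, so this only applies when the argument type is xsd:base64Binary or an equivalent.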
The problem seems to be simply that you are trying to read a Unicode-encoded file using the ASCII codec. Refer to http://docs.python.org/2/howto/unicode.html for the official documentation on Unicode. You can also look at Unicode (UTF-8) reading and writing to files in Python for a similar discussion.
In short, for your problem the following code should work:
import codecs
project_archive = codecs.open(os.path.join(settings.MEDIA_ROOT, 'zip/project.zip'),
"rb", "utf-8")
data = project_archive.read()
In the above solution it's assumed that the encoding used is UTF-8. If some other codec (for example ISO-8859-1) is being used, substitute that for utf-8.