How to open a binary file stored in Google App Engine? - python

I have generated a binary file with the word2vec, stored the resulting .bin file to my GCS bucket, and ran the following code in my App Engine app handler:
gcs_file = gcs.open(filename, 'r')
content = gcs_file.read().encode("utf-8")
""" call word2vec with content so it doesn't need to read a file itself, as we don't have a filesystem in GAE """
Fails with this error:
content = gcs_file.read().encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 15: ordinal not in range(128)
A similar decode error happens if I try gcs_file.read(), or gcs_file.read().decode("utf-8").encode("utf-8").
Any ideas on how to read a binary file from GCS?
Thanks

If it is binary then it will not have a character encoding, which is what UTF-8 is. UTF-8 is just one possible binary encoding of the Unicode character set (string data). You need to go back and read up on what UTF-8 and ASCII represent and how they are used.
If it is not text data that was encoded with a specific encoding, it is not going to magically just decode, which is why you are getting that error: can't decode byte 0xf6 in position 15 means 0xf6 is not a valid ASCII byte.
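A minimal sketch of the fix, assuming the GoogleAppEngineCloudStorageClient library from the question (the import name is an assumption): read the file as raw bytes and skip the encode/decode step entirely:
import cloudstorage as gcs  # assumed import for the GAE cloudstorage client

gcs_file = gcs.open(filename, 'r')
content = gcs_file.read()   # raw bytes -- hand these to word2vec directly
gcs_file.close()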

Related

Decoding a zip file downloaded from s3

I'm trying to decode a zip file so I can create an MD5 hash of it.
When I run this
contents = response['Body'].read()
decoded = contents.decode('utf-8')
I get this error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 10: invalid start byte
What worked for me is IBM437:
contents.decode('IBM437')
The whole hashing flow works with AWS's ContentMD5 check on upload.
I'm not changing how the file is uploaded; I'm just decoding it with a non-UTF-8 codec to create an MD5 hash that works.
What are the caveats of doing this with IBM437, since UTF-8 seems like the "standard" encoding?
As someone suggested in the comments, we don't have to convert the binary data to a string at all.
So this is the revised approach:
import base64
import hashlib

contents = response['Body'].read()                    # raw bytes from S3
md = hashlib.md5(contents).digest()                   # binary MD5 digest
contents_md5 = base64.b64encode(md).decode('utf-8')   # base64 text for Content-MD5
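For reference, a fuller sketch of the same flow (bucket and key names are hypothetical); boto3's put_object accepts a ContentMD5 argument, so S3 can verify the upload server-side:
import base64
import hashlib
import boto3

s3 = boto3.client('s3')

# Read the object as raw bytes; no decoding is needed for hashing.
contents = s3.get_object(Bucket='my-bucket', Key='archive.zip')['Body'].read()

# MD5 over the raw bytes, base64-encoded as the Content-MD5 header expects.
md = hashlib.md5(contents).digest()
contents_md5 = base64.b64encode(md).decode('utf-8')

# Re-upload with server-side integrity verification.
s3.put_object(Bucket='my-bucket', Key='archive.zip', Body=contents,
              ContentMD5=contents_md5)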

Decoding issue when downloading from an Amazon S3 bucket

I'm trying to receive a CSV file from an Amazon S3 bucket like this:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='BUCKET_NAME', Key='FILE_NAME')
data = obj['Body'].read().decode('utf-8').splitlines()
But this is what I get:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 10: invalid start byte
I thought I got the encoding of the file wrong, so I ran file FILE_NAME.csv and got UTF-8. I have also tried latin-1 and some other encodings, but all of them give me gibberish. Any help is appreciated, and thank you all!
This happens when the data contains non-ASCII bytes that are not valid UTF-8, which can happen when the file was produced with a different encoding. In that case you need to try another encoding such as windows-1252.
The solution below works fine for me. By default it tries to decode with UTF-8, as that is the recommended encoding; if that raises an error, it falls back to windows-1252.
def safe_decode(message):
    # message holds the raw bytes read from S3
    try:
        return message.decode('utf-8')
    except UnicodeDecodeError:
        return message.decode('windows-1252')
Now you can even encode it back with UTF-8.
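Applied to the boto3 snippet from the question, a hedged sketch might look like this (byte 0x94 is a curly closing quote in windows-1252, which fits the error above):
import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='BUCKET_NAME', Key='FILE_NAME')
raw = obj['Body'].read()

try:
    text = raw.decode('utf-8')
except UnicodeDecodeError:
    # Fall back to windows-1252 for bytes such as 0x94.
    text = raw.decode('windows-1252')

data = text.splitlines()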

python cleaning high or non-ascii out of a file [duplicate]

I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguments. Can I pass extra arguments to open() or use something in the codecs module to open these differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.
Byte 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to the U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between the Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have "smart quotes". If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
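For instance (an illustrative Python 3 session, not from the original answer), byte 0x93 decodes to a curly quote in Windows-1252 but to an invisible C1 control character in Latin-1:
>>> b'\x93'.decode('cp1252')
'“'
>>> b'\x93'.decode('latin-1')
'\x93'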
You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it is encoded in latin1, the "\x81" that it is complaining about is a C1 control character that doesn't even have a name (in Unicode). Consider latin1 extremely unlikely to be valid.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding in the first line, e.g. <?xml version="1.0" encoding="UTF-8"?> (the default is UTF-8), and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
Note: please edit your question with clarifying information; don't answer in the comments.
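If you'd rather guess programmatically than via a browser, a sketch using the third-party chardet package (an assumption; it is not mentioned above) does much the same thing:
import chardet

with open(filename, 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'windows-1252', 'confidence': 0.84}
text = raw.decode(guess['encoding'])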

RIGHT-TO-LEFT char \u200f causing problems in python

I am reading a web page using urllib which is in the UTF-8 charset and includes the RIGHT-TO-LEFT MARK character
http://www.charbase.com/200f-unicode-right-to-left-mark
but when I try to write all of that into a UTF-8 based text file
with codecs.open("results.html", "w", "utf-8") as outFile:
    outFile.write(htmlResults)
    outFile.close()
I get the
"UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 264: ordinal not in range(128)"
error message.
How do I fix that?
If htmlResults is of type str then you need to figure out what encoding it is in if you're going to decode it to Unicode (only Unicode can be encoded). E.g. if htmlResults is encoded as iso-8859-1 (i.e. latin-1), then
tmp = htmlResults.decode('iso-8859-1')
would create a Unicode string in tmp which you could then write to a file:
with codecs.open("results.html", "w", "utf-8") as outFile:
    tmp = htmlResults.decode('iso-8859-1')
    outFile.write(tmp)
If htmlResults is already encoded as UTF-8, then you do not need to do any decoding or encoding:
with open('results.html', 'w') as fp:
    fp.write(htmlResults)
(the with statement will close the file for you).
This is not related to how browsers interpret the file, however; that is determined by the Content-Type header the web server serves the file with, along with any associated meta tags. E.g. if the file is HTML5, you ought to have this near the top of the head tag:
<meta charset="UTF-8">
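Putting the pieces together, a sketch for the urllib case (Python 2, as the error implies; the UTF-8 fallback is an assumption) reads the charset from the Content-Type header before decoding:
import codecs
import urllib2

resp = urllib2.urlopen('http://www.charbase.com/200f-unicode-right-to-left-mark')
charset = resp.headers.getparam('charset') or 'utf-8'  # fall back to UTF-8
htmlResults = resp.read().decode(charset)

with codecs.open("results.html", "w", "utf-8") as outFile:
    outFile.write(htmlResults)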

unicode error with codecs when reading a pdf file in python

I am trying to read a PDF file with the following content:
%PDF-1.4\n%âãÏÓ
If I read it with open, it works, but if I try codecs.open(filename, encoding="utf8", mode="rb") to get a Unicode string, I get the following exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Do you know a way to get a unicode string from the content of this file?
PS: I am using python 2.7
PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.
For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:
f = codecs.open(filename, encoding="ISO8859-1", mode="rb")
But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.
The issue of trying to interpret arbitrary binary data as text aside, 0xe2 is â in Latin-1, not UTF-8. You're using the wrong codec.
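If you just need the bytes (e.g. to feed them to a PDF parser), a minimal sketch with plain open is enough; the header itself is ASCII, so it is safe to inspect:
with open(filename, 'rb') as f:
    data = f.read()

print(repr(data[:14]))   # '%PDF-1.4\n%\xe2\xe3\xcf\xd3' -- raw bytes, no decoding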
