It appears that the Python 3.7 interpreter won't accept a module whose triple-quoted strings contain special characters like the degree symbol. (I don't want to use an escape sequence for the degree symbol, because the comments are for the benefit of someone reading the code, which would then become less intelligible.) Is there any way around this?
This problem can be reproduced if the Python file is incorrectly encoded with an 8-bit encoding instead of UTF-8. The byte 0xB0 maps to the degree symbol in many 8-bit encodings, including Latin-1 and CP1252.
The error is reproduced if the Python file is encoded as Latin-1:
iconv --from-code=utf-8 --to-code=latin1 special_char.py > latin_1_char.py
python3.7 latin_1_char.py
File "latin_1_char.py", line 4
"""
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xb0 in position 145: invalid start byte
or as cp1252
iconv --from-code=utf-8 --to-code=cp1252 char.py > cp1252_char.py
python3.7 cp1252_char.py
File "cp1252_char.py", line 4
"""
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xb0 in position 145: invalid start byte
but not if the file is encoded as utf-8
iconv --from-code=latin1 --to-code=utf-8 latin_1_char.py > utf8_char.py
python3.7 utf8_char.py
Hello!
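On the question side, the fix is either to save the module as UTF-8, or, if it must stay Latin-1, to declare that encoding with a PEP 263 coding cookie. A minimal sketch using compile() to show the difference (the module bytes here are illustrative, not from the original question):

```python
# Python 3 assumes source files are UTF-8 unless they declare otherwise.
bad = b'"""Temperature in \xb0C"""\n'  # Latin-1 degree sign, no declaration
good = b'# -*- coding: latin-1 -*-\n"""Temperature in \xb0C"""\n'

try:
    compile(bad, "<module>", "exec")
except SyntaxError as exc:
    print(exc)  # (unicode error) 'utf-8' codec can't decode byte 0xb0 ...

code = compile(good, "<module>", "exec")  # accepted thanks to the cookie
```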
I've received a .xlsx file in which foreign-language characters appear to have been originally encoded as UTF-8 and then decoded as UTF-16.
I don't have access to the original file.
é and ó, for example, should be read as é and ó respectively; those characters are encoded in UTF-8 as 0xC3 0xA9 and 0xC3 0xB3. Instead, each two-byte UTF-8 sequence has at some point been decoded as two separate characters.
I've tried encoding them to bytes and decoding them with UTF-8, but that doesn't translate correctly.
This:
s = "ó".encode("utf-16")
uni = s.decode("utf-8")
print(uni)
returns this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I've tried the above encode/decode sequence with a variety of different parameters, including UTF-16-BE and UTF-16-LE, every errors handler that decode accepts, and just slicing off the BOM; none of them worked.
Is there a way to fix this? Some library that can make this a less painful process than doing a literal replacement by reading every string and replacing ó with ó?
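For what it's worth, this mojibake pattern is more often UTF-8 bytes that were decoded with an 8-bit codepage such as cp1252 than with UTF-16; in that case the wrong round trip can be reversed directly (a sketch, assuming cp1252 was the wrong decoder). The third-party ftfy library automates exactly this kind of repair.

```python
# "ó" is what "ó" (UTF-8 bytes 0xC3 0xB3) looks like after being
# decoded with cp1252. Undo the wrong decode, then redo it correctly:
broken = "s\u00c3\u00b3"  # "só", which should read "só"
fixed = broken.encode("cp1252").decode("utf-8")
print(fixed)  # só
```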
I have the following sh file:
#!/bin/bash
python <py_filename>.py
The file filename.py begins with
#!/usr/bin/env python
# coding: utf-8
and contains a list of strings with special characters like ã and ç.
When I do
sh <sh_filename>.sh
The following error is returned:
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe1 in position 9: invalid continuation byte
and I can see that, for instance, the word 'Ações' is represented as 'A��es'.
I understand that is something related to file encoding but I don't know how to solve it.
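This looks like a source file that was saved in Latin-1 (or cp1252) while the coding declaration promises UTF-8: 0xE1 is á in Latin-1 but invalid UTF-8 in that position. A sketch of the mismatch, assuming Latin-1; re-saving the file as UTF-8 (for example with iconv, as shown earlier) makes the declaration true:

```python
# Bytes of "Ações" in Latin-1; 0xE7 starts what UTF-8 would treat as
# a multi-byte sequence, but the next byte is not a valid continuation.
raw = "A\u00e7\u00f5es".encode("latin-1")  # b'A\xe7\xf5es'

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # invalid continuation byte

print(raw.decode("latin-1"))  # Ações
```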
I have tried using open("oxeb.txt").read() in Python 2 and it works, but it doesn't work in Python 3.
I know that the default encoding in Python 2 is ASCII and the default encoding in Python 3 is UTF-8, so I tried the same open("oxeb.txt").read() in Python 3 and it STILL doesn't work.
How can I read a file containing this character, independent of my Python version?
Note: this is the error I get: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xeb in position 4: invalid continuation byte
You can open the file in binary mode. You then have raw bytes rather than printable text, so you will need to decode them yourself:
text = open("oxeb.txt","rb").read()
text = text.decode('iso-8859-1')
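Equivalently, you can skip binary mode and let open() decode for you by passing encoding= (assuming the file really is Latin-1 / ISO-8859-1, where 0xEB is ë; the sample file is created here just so the sketch is self-contained):

```python
# Create a sample file containing the raw byte 0xEB ("ë" in Latin-1)
with open("oxeb.txt", "wb") as f:
    f.write(b"no\xebl")

# Works the same way in Python 3 regardless of the platform default
with open("oxeb.txt", encoding="iso-8859-1") as f:
    text = f.read()
print(text)  # noël
```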
I have an input file in Windows-1252 encoding that contains the '®' character. I need to write this character to a UTF-8 file. Also assume I must use Python 2.7. Seems easy enough, but I keep getting UnicodeDecodeErrors.
I originally had just opened the original file using codecs.open() with UTF-8 encoding, which worked fine for all of the ASCII characters until it encountered the ® symbol, whereupon it choked with the error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2867043:
invalid start byte
I knew that I would have to properly decode it as cp1252 to fix this problem, so I opened it in the proper encoding and then encoded the data as UTF-8 prior to writing. But that produced a new error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 22:
ordinal not in range(128)
Here is a minimum working example:
with codecs.open('in.txt', mode='rb', encoding='cp1252') as inf:
    with codecs.open('out.txt', mode='wb', encoding='utf-8') as of:
        for line in inf:
            of.write(line.encode('utf-8'))
Here is the contents of in.txt:
Sample file
Here is my sample file® yay.
I thought perhaps I could just open it in 'rb' mode with no encoding specified and specifically handle the decoding and encoding of each line like so:
of.write(line.decode('cp1252').encode('utf-8'))
But that also didn't work, giving the same error as when I just opened it as UTF-8.
How do I read data from a Windows-1252 file, properly decode it then encode it as UTF-8 and write it to a UTF-8 file? The above method has always worked for me in the past until I encountered the ® character.
The 0xC2 in the second error does not come from your file at all: codecs.open() already decodes each line to text, so line.encode('utf-8') turns ® into the bytes 0xC2 0xAE, and the codecs writer then tries to decode those bytes (with the default ASCII codec) before re-encoding them, which fails on 0xC2.
However, you should just use
of.write(line)
since encoding properly is the whole reason you're using codecs in the first place.
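Putting that together, the loop from the question needs only that one-line change (in.txt and out.txt as in the question; the sample input is recreated here so the sketch is self-contained):

```python
import codecs

# Recreate the sample Windows-1252 input; 0xAE is the "®" byte there
with open("in.txt", "wb") as f:
    f.write(b"Here is my sample file\xae yay.\n")

with codecs.open("in.txt", mode="rb", encoding="cp1252") as inf:
    with codecs.open("out.txt", mode="wb", encoding="utf-8") as of:
        for line in inf:
            of.write(line)  # line is already text; codecs encodes on write

print(open("out.txt", "rb").read())  # b'Here is my sample file\xc2\xae yay.\n'
```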
I have read a lot now on the topic of UTF-8 encoding in Python 3 but it still doesn't work, and I can't find my mistake.
My code looks like this
def main():
    with open("test.txt", "rU", encoding='utf-8') as test_file:
        text = test_file.read()
        print(str(len(text)))

if __name__ == "__main__":
    main()
My test.txt file looks like this
ö
And I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
Your file is not UTF-8 encoded. The byte F6 is not valid UTF-8; it is the encoding for ö in both Latin-1 and CP-1252:
>>> b'\xf6'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
>>> b'\xf6'.decode('latin1')
'ö'
You'll need to save that file as UTF-8 instead, with whatever tool you used to create that file.
If open('text').read() works, then you were able to decode the file using the default system encoding. See the open() function documentation:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.
That is not to say that you were reading the file using the correct encoding; that just means that the default encoding didn't break (encountered bytes for which it doesn't have a character mapping). It could still be mapping those bytes to the wrong characters.
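You can check what that platform default actually is, which is also why passing encoding= explicitly is the portable choice (a quick check, not part of the original answer):

```python
import locale

# Whatever open() will use when no encoding= argument is given;
# e.g. 'UTF-8' on most Linux systems, 'cp1252' on many Windows setups.
print(locale.getpreferredencoding(False))
```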
I urge you to read up on Unicode and Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder