Preface:
It's a cold, rainy day in mid 2016, and a developer is still having encoding issues with Python because he isn't using Python 3. Will the great S.O. community help him? I don't know; we will have to wait and see.
Scope:
I have a UTF-8 encoded file that contains words with accentuation, such as CURRÍCULO and NÓS. For some reason I can't grasp, I can't manage to read them properly using Python 2.7.
Code Snippet:
import codecs

keywords = []
f_reader = codecs.open('PATH_TO_FILE/Data/Input/kw.txt', 'r', encoding='utf-8')
for line in f_reader:
    keywords.append(line.strip().upper())
    print line
The output I get is:
TRABALHE CONOSCO
ENVIE SEU CURRICULO
ENVIE SEU CURRÍCULO
UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 14: ordinal not in range(128)
Encoding, Encoding, Encoding:
I have used Notepad++ to convert the file to both regular UTF-8 and the variant without the Byte Order Mark, and it shows me the characters just fine, without any issue. I'm using Windows, by the way, which will create files as ANSI by default.
Question:
What should I do to be able to read this file properly, including the í and ó and other accented characters?
Just to make it clearer, I want to keep the accentuation on the strings I use in memory.
Update:
Here's the list of keywords, in memory, read from the file using the code shown above.
The problem seems not to be in the reading, but in the printing. You said:
I'm using Windows, by the way, which will create files as ANSI by default.
I think that includes printing to stdout. Try changing the sys.stdout codec:
sys.stdout = codecs.getwriter("utf-8")(sys.stdout)
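A minimal sketch (Python 2) combining that suggestion with the reading loop from the question; the path is the same placeholder used above:

import sys
import codecs

# Wrap stdout so that printing Unicode strings encodes them as UTF-8
# instead of falling back to the ASCII codec.
sys.stdout = codecs.getwriter("utf-8")(sys.stdout)

keywords = []
f_reader = codecs.open('PATH_TO_FILE/Data/Input/kw.txt', 'r', encoding='utf-8')
for line in f_reader:
    keywords.append(line.strip().upper())
    print line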
Related
I have a very simple piece of code that converts a CSV. Also, do note I reference Notepad++ a few times, but my standard IDE is VS Code.
import codecs

with codecs.open(filePath, "r", encoding="UTF-8") as sourcefile:
    lines = sourcefile.read()

with codecs.open(filePath, 'w', encoding='cp1252') as targetfile:
    targetfile.write(lines)
Now the job I'm doing requires a specific file to be encoded as windows-1252, and from what I understand cp1252 = windows-1252. This conversion works fine when I do it using the UI features in Notepad++, but when I try using Python codecs to encode this file it fails:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 561488: character maps to <undefined>
When I saw this failure I was confused, so I double-checked the output from when I manually convert the file using Notepad++, and the converted file is encoded in windows-1252... so what gives? Why is a UI feature in Notepad++ able to do the job when codecs seems not to be able to? Does Notepad++ just ignore errors?
Looks like your input text has the character "�" (the actual placeholder "replacement character", not some other undefined character), which cannot be mapped to cp1252, because cp1252 has no equivalent for it.
Depending on what you need, you can:
Filter it out (or replace it, or otherwise handle it) in Python before writing out lines to the output file.
Pass errors=... to the second codecs.open, choosing one of the other error-handling modes; the default is 'strict', but you can also use 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' or 'namereplace' (a short sketch follows this list).
Check the input file and see why it's got the "�" character; is it corrupted?
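For example, a minimal sketch of the second option, assuming filePath is the same variable as in the question; errors="replace" substitutes '?' for anything cp1252 cannot represent instead of raising:

import codecs

with codecs.open(filePath, "r", encoding="utf-8") as sourcefile:
    lines = sourcefile.read()

# Any character without a cp1252 equivalent (such as '\ufffd') is
# replaced with '?' instead of raising UnicodeEncodeError.
with codecs.open(filePath, "w", encoding="cp1252", errors="replace") as targetfile:
    targetfile.write(lines)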
Probably Python is simply more explicit in its error handling. If Notepad++ managed to represent every character correctly in CP-1252 then there is a bug in the Python codec where it should not fail where it currently does; but I'm guessing Notepad++ is silently replacing some characters with some other characters, and falsely claiming success.
Maybe try converting the result back to UTF-8 and compare the files byte by byte if the data is not easy to inspect manually.
Unicode U+FFFD is a reserved character which serves as a placeholder for a character which cannot be represented; often, it's an indication that the data was imperfectly input or converted at an earlier point in time.
(And yes, Windows-1252 is another name for Windows code page 1252.)
Why Notepad++ "succeeds"
Notepad++ does not offer to convert your file to cp1252, but to reinterpret it using this encoding. What led to your confusion is that they are actually using the wrong term for this. This is the Encoding menu in the program:
When "Encode with cp1252" is selected, Notepad++ decodes the file using cp1252 and shows you the result. If you save the character '\ufffd' to a file using utf8:
with open('f.txt', 'w', encoding='utf8') as f:
    f.write('\ufffd')
and use "Encode with cp1252" you'd see three characters:
That means that Notepad++ does not read the character in utf8 and then writes it in cp1252, because then you'd see exactly one character. You could achieve similar results to Notepad++ by reading the file using cp1252:
with open('f.txt', 'r', encoding='cp1252') as f:
    print(f.read())  # Prints ï¿½
Notepad++ lets you actually convert to only five encodings, as you can see in the screenshot above.
What should you do
This character does not exist in the cp1252 encoding, which means you can't convert this file without losing information. Common solutions are to skip such characters or replace them with other similar characters that exist in your encoding (see encoding error handlers).
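As a small illustrative sketch (the sample string is made up), the two common error handlers look like this:

# Skip or replace characters that cp1252 cannot represent.
text = "caf\xe9 \ufffd menu"   # illustrative sample containing U+FFFD

skipped = text.encode("cp1252", errors="ignore")    # drops the unmappable character
replaced = text.encode("cp1252", errors="replace")  # substitutes '?' for it

print(skipped)   # b'caf\xe9  menu'
print(replaced)  # b'caf\xe9 ? menu'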
You are dealing with the "utf-8-sig" encoding -- please specify this one as the encoding argument instead of "utf-8".
There is information on it in the docs (search the page for "utf-8-sig").
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. [...]
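For example, a minimal sketch of reading a BOM-prefixed UTF-8 file with this encoding (the filename here is illustrative):

import codecs

# "utf-8-sig" consumes a leading BOM if present; otherwise it behaves like plain "utf-8".
with codecs.open('input.txt', 'r', encoding='utf-8-sig') as f:
    lines = [line.strip() for line in f]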
I have a file.txt with the input
Straße
Straße 1
Straße 2
I want to read this text from the file and print it. I tried this, but it won't work.
import random

lmao1 = open('file.txt').read().splitlines()
lmao = random.choice(lmao1)
print str(lmao).decode('utf8')
But I get the error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 5: invalid continuation byte
Got it. utf-8 is not the correct encoding for your file. Try this; if it doesn't work, try other common encodings until you find the right one:
print str(lmao).decode('latin-1')
If on Windows, the file is likely encoded in cp1252.
Whatever the encoding, use io.open and specify the encoding. This code will work in both Python 2 and 3.
io.open will return Unicode strings. It is good practice to immediately convert to/from Unicode at the I/O boundaries of your program. In this case that means reading the file as Unicode in the first place and leaving print to determine the appropriate encoding for the terminal.
Also recommended is to switch to Python 3 where Unicode handling is greatly improved.
from __future__ import print_function
import io
import random

with io.open('file.txt', encoding='cp1252') as f:
    lines = f.read().splitlines()
line = random.choice(lines)
print(line)
You're on the right track regarding decode; the problem is just that there is no way to guess the encoding of a file with 100% certainty. Try a different encoding (e.g. latin-1).
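If guessing by hand gets tedious, a sketch like the following, assuming the third-party chardet package is installed, can at least give a hint (treat the result as a guess, not ground truth):

import chardet

with open('file.txt', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)
print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}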
It's working fine on the Python prompt and while running from a Python script as well.
>>> import random
>>> lmao1 = open('file.txt').read().splitlines()
>>> lmao = random.choice(lmao1)
>>> print str(lmao).decode('utf8')
Straße 2
The above worked on Python 2.7. May I know your Python version?
I'm currently working on a Python script that takes a list of log files (from a search engine) and produces a file with all the queries within these, for later analysis.
Another feature of the script is that it removes the most common words, which I've also implemented, but I've faced a problem I can't seem to overcome. The removal of words does work as intended, as long as the queries do not contain special characters. As the search logs are in Danish, the characters æ, ø and å appear regularly.
After searching on the topic, I'm now aware that I need to encode these into UTF-8, which I'm doing when obtaining the query:
tmp = t_query.encode("UTF-8").lower().split()
t_query is the query and I split it up to later compare each word with my list of forbidden words. If I do not use the encoding I'll get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 1: ordinal not in range(128)
Edit: I also tried using the decode instead, but get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa7' in position 3: ordinal not in range(128)
I loop through the words like this:
for i in tmp:
    if i in words_to_filter:
        tmp.remove(i)
As said, this works perfectly for words not including special characters. I've tried to print the i along with the current forbidden word and will get e.g.:
færdelsloven - færdelsloven
Where the first word is the ith element in tmp. The last word is the one from the forbidden words. Obviously something has gone wrong, but I just can't manage to find a solution. I've tried many suggestions found on Google and in here, but nothing has worked so far.
Edit 2: If it makes a difference, I've tried loading the log files both with and without the use of codecs:
with codecs.open(file_name, "r", "utf-8") as f_src:
    jlogs = map(json.loads, f_src.readlines())
I'm running Python 2.7.2 from a Windows environment, if it matters. The script should be able to run on other platforms (namely Linux and Mac OS).
I would really appreciate if one of you are able to help me out.
Best regards
Casper
If you are reading files, you want to decode them.
tmp = t_query.decode("UTF-8").lower().split()
Given a utf-8 file with one json object per line, you could read all objects:
with open(filename) as file:
    jlogs = [json.loads(line) for line in file]
Except for embedded newline treatment, the above code should produce the same result as yours:
with codecs.open(file_name, "r", "utf-8") as f_src:
    jlogs = map(json.loads, f_src.readlines())
At this point all strings in jlogs are Unicode; you don't need to do anything to handle "special" characters. Just make sure you are not mixing bytes and Unicode text in your code.
to get Unicode text from bytes: some_bytes.decode(character_encoding)
to get bytes from Unicode text: some_text.encode(character_encoding)
Don't encode bytes or decode Unicode text; a short sketch of the boundary rule follows.
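A minimal Python 2 sketch of that boundary rule (the sample word is illustrative):

raw = '\xc3\xa6bleskiver'    # bytes, e.g. as read from a UTF-8 encoded file
text = raw.decode('utf-8')   # bytes -> Unicode text: u'\xe6bleskiver'
back = text.encode('utf-8')  # Unicode text -> bytes again
assert back == raw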
If encoding is right and you just want to ignore unexpected characters you could use errors='ignore' or errors='replace' parameter passed to codecs.open function.
with codecs.open(file_name, encoding='utf-8', mode='r', errors='ignore') as f:
    jlogs = map(json.loads, f.readlines())
Details in docs:
http://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data
I've finally solved it. As Lattyware said, Python 3.x seems to do much better. After changing the version and encoding the Python file to Unicode, it works as intended.
Here's my code:
import sys, os

print("█████")  # <-- Those are solid blocks.
f = open('file.txt')
for line in f:
    print(line)
In file.txt is this:
hay hay, guys
████████████
But the output is this:
██████
hay hay, guys <----- ***Looks like it outputted this correctly!***
Traceback (most recent call last):
File "echofile.py", line 6, in <module>
print(line)
File "C:\python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-2: character maps to <undefined> <------ ***But not from the file!***
Anybody have any suggestions as to why it is doing this? I wrote the code in IDLE, tried editing the file.txt in both Programmer's Notepad and IDLE. The file is ASCII / ANSI.
I'm using Python 3, by the way. 3.3 alpha win-64 if it matters.
This is clearly an issue with character encodings.
In Python 3.x, all strings are Unicode. But when reading or writing a file, it will be necessary to translate the Unicode to some specific encoding.
By default, a Python source file is handled as UTF-8. I don't know exactly what characters you pasted into your source file for the blocks, but whatever it is, Python reads it as UTF-8 and it seems to work. Maybe your text editor converted to valid UTF-8 when you inserted those?
The backtrace suggests that Python is treating the input file as "Code Page 437" or the original IBM PC 8-bit character set. Is that correct?
This link shows how to set a specific decoder to handle a particular file encoding on input:
http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide/
EDIT: I found a better resource:
http://docs.python.org/release/3.0.1/howto/unicode.html
And based on that, here's some sample code:
with open('mytextfile.txt', encoding='utf-8') as f:
    for line in f:
        print(line, end='')
Originally I had the above set to "cp437" but in a comment you said "utf-8" was correct, so I made that change to this example. I'm specifying end='' here because the input lines from the file already have a newline on the end, so we don't need print() to supply another newline.
EDIT: I found a short discussion of default encodings here:
http://docs.python.org/release/3.0.1/whatsnew/3.0.html
The important bit: "There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default."
So, I had thought that Python defaulted to UTF-8, but not always, it seems. Actually, from your stack backtrace, I think on your system with your LANG environment setting you are getting "cp437" as your default.
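If you want to confirm which defaults are actually in effect on your system, a quick check like this (standard library only) will show them:

import sys
import locale

print(sys.stdout.encoding)            # the codec print() uses for the console, e.g. 'cp437'
print(locale.getpreferredencoding())  # the default open() uses when no encoding= is given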
So, I learned something too by answering your question!
P.S. I changed the code example above to specify utf-8 since that is what you needed.
Try making that string unicode:
print(u"█████")
^ Add this
For some reason, Python seems to be having issues with BOM when reading unicode strings from a UTF-8 file. Consider the following:
with open('test.py') as f:
    for line in f:
        print unicode(line, 'utf-8')
Seems straightforward, doesn't it?
That's what I thought until I ran it from command line and got:
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
A brief visit to Google revealed that the BOM has to be cleared manually:
import codecs

with open('test.py') as f:
    for line in f:
        print unicode(line.replace(codecs.BOM_UTF8, ''), 'utf-8')
This one runs fine. However I'm struggling to see any merit in this.
Is there a rationale behind the above-described behavior? In contrast, UTF-16 works seamlessly.
The 'utf-8-sig' encoding will consume the BOM signature on your behalf.
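For example, a minimal Python 2 sketch reading the same test.py as above:

import io

# 'utf-8-sig' strips a leading BOM if one is present and otherwise
# behaves exactly like 'utf-8', so no manual replace() is needed.
with io.open('test.py', encoding='utf-8-sig') as f:
    for line in f:
        print line.rstrip()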
You wrote:
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
When you specify the "utf-8" encoding in Python, it takes you at your word. UTF-8 files aren’t supposed to contain a BOM in them. They are neither required nor recommended. Endianness makes no sense with 8-bit code units.
BOMs screw things up, too, because you can no longer just do:
$ cat a b c > abc
if those UTF-8 files have extraneous (read: any) BOMs in them. See now why BOMs are so stupid/bad/harmful in UTF-8? They actually break things.
A BOM is metadata, not data, and the UTF-8 encoding spec makes no allowance for them the way the UTF-16 and UTF-32 specs do. So Python took you at your word and followed the spec. Hard to blame it for that.
If you are trying to use the BOM as a filetype magic number to specify the contents of the file, you really should not be doing that. You are really supposed to use a higher-level protocol for these metadata purposes, just as you would with a MIME type.
This is just another lame Windows bug, the workaround for which is to use the alternate encoding "utf-8-sig" to pass off to Python.