UnicodeDecodeError On Unicode File Read - python

I have a problem: when I execute a script that reads in data from a file containing Unicode code points, everything works fine from my shell. But when the same script is executed via another application, it raises the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
0: ordinal not in range(128)
I am executing the exact same code using the exact same data file. A sample datafile that replicates the problem is like this:
¥ Α © §
I called this sample.txt
A very simple python script to simply read in and print the file contents:
with open("sample.txt") as f:
    for line in f:
        print(line)
print("Done")
This executes fine from the command line; executing via Apache/CGI fails with the above error.

A hint to the problem came from the documentation of the open function:
In text mode, if encoding is not specified the encoding used is
platform dependent: locale.getpreferredencoding(False) is called to
get the current locale encoding.
"Platform dependent" suggested environment variables. So I inspected the environment variables set for my shell and found LANG set to en_US.UTF-8. Dumping the environment variables set by Apache showed that LANG was missing.
So, apparently, when the locale cannot be determined, Python falls back to ASCII as the default file encoding. As a result, the error was raised as soon as a byte fell outside the ASCII range.
To fix this, I set the LANG environment variable in my CGI script. If the variable is somehow missing from a user shell, it can be set via the normal methods, or simply by:
export LANG=en_US.UTF-8
Or whatever preferred encoding is desired.
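Alternatively, the script itself can be made independent of the locale by passing the encoding explicitly to open(). A minimal sketch, assuming sample.txt is UTF-8 encoded (note that printing the text is still subject to the stdout encoding):

import locale

# what Python would pick if no encoding is passed to open()
print(locale.getpreferredencoding(False))

# explicit encoding: the read no longer depends on LANG being set
with open("sample.txt", encoding="utf-8") as f:
    for line in f:
        print(line)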
Note, the issue is probably far more noticeable if the locale is missing from a user shell, as text editors like vi will not display the characters correctly without it. It was much more subtle here because it only appeared when the script was called from Apache (or some other application).

Related

UnicodeEncodeError in python3 when redirection is used

What I want to do: extract text information from a pdf file and redirect that to a txt file.
What I did:
pip install pdfminer
pdf2txt.py file.pdf > output.txt
What I got:
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 0: illegal multibyte sequence
My observation:
\u2022 is the bullet point character, •.
pdf2txt.py works well without redirection: the bullet point character is written to stdout without any error.
My question:
Why does redirection cause a Python error? As far as I know, redirection is an O.S. job, and it simply copies things after the program is finished.
How can I fix this error? I cannot do any modification to pdf2txt.py as it's not my code.
Redirection causes an error because the default encoding used by Python does not support one of the characters you're trying to output. In your case you're trying to output the bullet character • using the GBK codec. This probably means you're using a Chinese version of Windows.
A version of Python 3.6 or later will work fine outputting to the terminal window on Windows, because character encoding is bypassed completely using Unicode. It's only when redirecting the output to a file that the Unicode must be encoded to a byte stream.
You can set the environment variable PYTHONIOENCODING to change the encoding used for stdio. If you use UTF-8 it will be guaranteed to work with any Unicode character.
set PYTHONIOENCODING=utf-8
pdf2txt.py file.pdf > output.txt
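If you would rather not rely on the environment variable, another option is a tiny wrapper script of your own that writes the output file with an explicit encoding, so pdf2txt.py and the shell redirection are not involved at all. A sketch, assuming the pdfminer.six fork is installed (it provides extract_text in pdfminer.high_level); the file names are just examples:

from pdfminer.high_level import extract_text

text = extract_text("file.pdf")  # returns a str with the extracted text
with open("output.txt", "w", encoding="utf-8") as out:
    out.write(text)  # UTF-8 can encode any character, including '\u2022'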
You have decoded the raw bytes into Unicode text, but that text has to be encoded again when it is written out, and I recommend UTF-8 for text files. Making the encoding parameters explicit is probably what you want:
def gbk_to_utf8(source, target):
    with open(source, "r", encoding="gbk") as src:
        with open(target, "w", encoding="utf-8") as dst:
            for line in src.readlines():
                dst.write(line)
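A usage example for the helper above; the file names are only illustrative:

gbk_to_utf8("output.txt", "output_utf8.txt")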

If my default Python encoding and my system encoding are UTF-8, how am I getting this error: UnicodeEncodeError: 'ascii' codec can't encode character '\u0447'

I get the error below when I execute a Python script from Jenkins:
File "/export/app-33-1/jenkins/w/ee4a092a/install/src/linux-amd64-gcc_4_4-release/bin/eat2/eat.py", line 553, in _runtest
print('ERROR:' + msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u0447' in position 315:
ordinal not in range(128)
Where exactly does it take the ASCII encoding from? I have changed the default encoding of Python, of the Jenkins master and slave processes, and of the systems themselves.
I even added # coding: utf-8 at the start of the script, but that didn't help.
It's not only about printing the string to the console: my code also tries to access some files, and the file paths contain Russian characters, so everything fails.
When I run the same script manually from the Linux console, everything works.
Any idea what the solution could be here?
Contrary to widespread belief, the default encoding for the built-in open() function as well as for the sys.std* streams (print() uses sys.stdout) is not always UTF-8 in Python 3. It might be on one machine but not on another, because it's platform-dependent.
From the docs for sys.stdin/stdout/stderr:
These streams are regular text files like those returned by the open() function. Their parameters are chosen as follows:
The character encoding is platform-dependent. Non-Windows platforms use the locale encoding [...]
And later on:
Under all platforms, you can override the character encoding by setting the PYTHONIOENCODING environment variable before starting Python [...]
Note that there are some exceptions for Windows.
For files opened with open, you can easily get control by explicitly setting the encoding= parameter.
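For instance, a short sketch of both approaches; the path here is only a placeholder:

import sys

# check which encoding print() actually uses on the Jenkins agent
print(sys.stdout.encoding)

# for files you open yourself, state the encoding explicitly instead of relying on the locale
with open("/path/to/file-with-russian-name.txt", encoding="utf-8") as f:
    contents = f.read()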

UnicodeEncodeError: 'ascii' codec can't encode character in print function

My company uses a database and I am writing a script that interacts with it. There is already a script for running a query against the database; given a query, that script returns the results from the database.
I am working in a Unix environment, and I use that script inside my own script to get some data from the database, redirecting the result of the query to a file. When I try to read this file, I get an error saying:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 9741: ordinal not in range(128)
I know that Python is not able to read the file because of its encoding: the encoding of the file is not ASCII, which is why the error occurs. I checked the encoding of the file and tried reading the file with that same encoding.
The code that I am using is:
os.system("Query.pl \"select title from bug where (ste='KGF-A' AND ( status = 'Not_Approved')) \">patchlet.txt")
encoding_dict3={}
encoding_dict3=chardet.detect(open("patchlet.txt", "rb").read())
print(encoding_dict3)
# Open the patchlet.txt file for storing the last part of titles for latest ACF in a list
with codecs.open("patchlet.txt",encoding='{}'.format(encoding_dict3['encoding'])) as csvFile
readCSV = csv.reader(csvFile,delimiter=":")
for row in readCSV:
if len(row)!=0:
if len(row) > 1:
j=len(row)-1
patchlets_in_latest.append(row[j])
elif len(row) ==1:
patchlets_in_latest.append(row[0])
patchlets_in_latest_list=[]
# calling the strip_list_noempty function for removing newline and whitespace characters
patchlets_in_latest_list=strip_list_noempty(patchlets_in_latest)
# coverting list of titles in set to remove any duplicate entry if present
patchlets_in_latest_set= set(patchlets_in_latest_list)
# Finding duplicate entries in list
duplicates_in_latest=[k for k,v in Counter(patchlets_in_latest_list).items() if v>1]
# Printing imp info for logs
print("list of titles of patchlets in latest list are : ")
for i in patchlets_in_latest_list:
**print(str(i))**
print("No of patchlets in latest list are : {}".format(str(len(patchlets_in_latest_list))))
Where Query.pl is the Perl script written to fetch the result of the query from the database. The encoding that I am getting for patchlet.txt (the file used for storing the result from HSD) is:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
Even when I provide this same encoding for reading the file, I still get the error.
Please help me resolve this error.
EDIT:
I am using python3.6
EDIT2:
I get the error while outputting the result, and there is one line in the file which contains an unknown character. The line looks like:
Some failure because of which vtrace cannot be used along with some trace.
I am using gvim, and in gvim the "vtrace" looks like "~Vvtrace". I then checked the database manually for this character; it is "–", which is neither a hyphen nor an underscore on my keyboard. These kinds of characters are creating the problem.
Also, I am working in a Linux environment.
EDIT 3:
I have added more code that can help in tracing the error. Also I have highlighted a "print" statement (print(str(i))) where I am getting the error.
Problem
Based on the information in the question, the program is processing non-ASCII input data, but is unable to output non-ASCII data.
Specifically, this code:
for i in patchlets_in_latest_list:
    print(str(i))
Results in this exception:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013'
This behaviour was common in Python2, where calling str on a unicode object would cause Python to try to encode the object as ASCII, resulting in a UnicodeEncodeError if the object contained non-ASCII characters.
In Python3, calling str on a str instance doesn't trigger any encoding. However, calling the print function on a str will encode the str to sys.stdout.encoding. sys.stdout.encoding defaults to the value returned by locale.getpreferredencoding, which is generally determined by your Linux user's LANG environment variable.
Solution
If we assume that your program is not overriding normal encoding behaviour, the problem should be fixed by ensuring that the code is being executed by a Python3 interpreter in a UTF-8 locale.
- Be 100% certain that the code is being executed by a Python3 interpreter: print sys.version_info from within the program (see the diagnostic sketch after this list).
- Try setting the PYTHONIOENCODING environment variable when running your script: PYTHONIOENCODING=UTF-8 python3 myscript.py
- Check your locale using the locale command in the terminal (or echo $LANG). If it doesn't end in UTF-8, consider changing it. Consult your system administrators if you are on a corporate machine.
- If your code runs in a cron job, bear in mind that cron jobs often run with the 'C' or 'POSIX' locale, which could mean ASCII encoding, unless a locale is explicitly set. Likewise, if the script is run under a different user, check their locale settings.
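For reference, a small diagnostic sketch (purely illustrative) that can be run in the same environment as the failing script, showing which interpreter, locale, and stream encoding are actually in effect:

import locale
import os
import sys

# diagnostic only: print the interpreter version and the encodings in effect
print("python interpreter :", sys.version_info)
print("stdout encoding    :", sys.stdout.encoding)
print("preferred encoding :", locale.getpreferredencoding(False))
print("LANG               :", os.environ.get("LANG"))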
Workaround
If changing the environment is not feasible, you can work around the problem in Python by encoding to ASCII with an error handler, then decoding back to str.
There are four useful error handlers in your particular situation; their effects are demonstrated with this code:
>>> s = 'Hello \u2013 World'
>>> s
'Hello – World'
>>> handlers = ['ignore', 'replace', 'xmlcharrefreplace', 'namereplace']
>>> print(str(s))
Hello – World
>>> for h in handlers:
...     print(f'Handler: {h}:', s.encode('ascii', errors=h).decode('ascii'))
...
Handler: ignore: Hello World
Handler: replace: Hello ? World
Handler: xmlcharrefreplace: Hello – World
Handler: namereplace: Hello \N{EN DASH} World
The ignore and replace handlers lose information - you can't tell which character has been replaced with a space or question mark.
The xmlcharrefreplace and namereplace handlers do not lose information, but the replacement sequences may make the text less readable to humans.
It's up to you to decide which tradeoff is acceptable for the consumers of your program's output.
If you decided to use the replace handler, you would change your code like this:
for i in patchlets_in_latest_list:
    replaced = i.encode('ascii', errors='replace').decode('ascii')
    print(replaced)
wherever you are printing data that might contain non-ASCII characters.
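If this pattern appears in more than one place, it might be worth wrapping it in a small helper; the name safe_print is only an illustration:

def safe_print(text, errors='replace'):
    # encode with the chosen ASCII error handler, then decode back to str before printing
    print(text.encode('ascii', errors=errors).decode('ascii'))

for i in patchlets_in_latest_list:
    safe_print(i)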

Python 3: UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 277: ordinal not in range(128)

When I try to make a request with the Python requests library as below, I get the following exception:
def get_request(url):
    return requests.get(url).json()
Exception
palo:dataextractor minisha$ python escoskill.py
Traceback (most recent call last):
File "escoskill.py", line 62, in <module>
print(response.json())
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 277: ordinal not in range(128)
However, the same piece of code works for some requests but not for all. For the URL below, it doesn't work:
https://ec.europa.eu/esco/api/resource/concept?uri=http://data.europa.eu/esco/isco/C2&language=en
Url that works
https://ec.europa.eu/esco/api/resource/taxonomy?uri=http://data.europa.eu/esco/concept-scheme/isco&language=en
The exception you're getting, UnicodeEncodeError, means we have some character that we cannot encode into bytes. In this case, we're trying to encode \xe4, or ä, which ASCII¹ does not have; hence the error.
In this line of code:
print(response.json())
The only thing that's going to be doing encoding is the print(). print(), to emit text into something, needs to encode it to bytes. Now, what it does by default depends on what sys.stdout is. Typically, stdout is your screen unless you've redirected output to a file. On Unix-like OSs (Linux, OS X), the encoding Python will use will be whatever LANG is set to; typically, this should be something like en_US.utf8 (the first part, en_US, might differ if you're in a different country; the utf8 bit is what is important here). If LANG isn't set (this is unusual, but can happen in some contexts, such as Docker containers) then it defaults to C, for which Python will use ASCII as an encoding.
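You can confirm this from inside the program; a small, purely illustrative check:

import os
import sys

print(sys.stdout.encoding)       # the encoding print() will use for output
print(os.environ.get("LANG"))    # None here means Python falls back to the C locale, i.e. ASCII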
(Edit) From the additional information in the comments (you're on OS X, you're using IntelliJ, and LANG is unset: print(repr(os.environ.get('LANG'))) printed None), this is a tough one to give advice on. LANG being unset means Python will assume it can only output ASCII, and it will error out, as you've seen, on anything else. In order of preference, I would:
Try to figure out why LANG is unset. This might be some configuration of the mini-terminal in the IDE, if that is what you have and are using. This may prove hard to find if you're unfamiliar with character encodings, and I might be off-base here, as I'm unfamiliar with IntelliJ.
Since you seem to be running your program from a command line, you can see if setting LANG helps. Where currently you are doing,
python escoskill.py
You can set LANG for a single run with:
LANG=en_US.utf8 python escoskill.py
If that works, you can make it last for however long that session is by doing,
export LANG=en_US.utf8
# LANG will have that value for all future commands run from this terminal.
python escoskill.py
You can override what Python autodetects the encoding to be, or you can override its behavior when it hits a character it can't encode. For example,
PYTHONIOENCODING=ascii:replace python -c 'print("\xe4")'
tells Python to use the output encoding of ASCII (which is what it was doing before anyway), but the :replace bit will make characters that it can't encode in ASCII, such as ä, be emitted as ?s instead of erroring out. This might make some things harder to read, of course.
¹ASCII is a character encoding. A character encoding tells one how to translate bytes into characters. There's not just one, because… humans.
²or perhaps your OS, but LANG being unset on OS X just sounds very implausible

UnicodeEncodeError - works in Spyder but not when executed from terminal

I'm using BeautifulSoup to parse some HTML, with Spyder as my editor (both brilliant tools, by the way!). The code runs fine in Spyder, but when I try to execute the .py file from the terminal, I get an error:
file = open('index.html','r')
soup = BeautifulSoup(file)
html = soup.prettify()
file1 = open('index.html', 'wb')
file1.write(html)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 5632: ordinal not in range(128)
I'm running openSUSE on a Linux server, with Spyder installed using zypper.
Does anyone have any suggestions what the problem might be?
Many thanks.
That is because, before outputting the result (i.e. writing it to the file), you must encode it first:
file1.write(html.encode('utf-8'))
See, every file has an attribute file.encoding. To quote the docs:
file.encoding
The encoding that this file uses. When Unicode strings
are written to a file, they will be converted to byte strings using
this encoding. In addition, when the file is connected to a terminal,
the attribute gives the encoding that the terminal is likely to use
(that information might be incorrect if the user has misconfigured the
terminal). The attribute is read-only and may not be present on all
file-like objects. It may also be None, in which case the file uses
the system default encoding for converting Unicode strings.
See the last sentence? soup.prettify returns a Unicode object, and given this error, I'm pretty sure you're using Python 2.7, because its sys.getdefaultencoding() is ascii.
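Alternatively, you can let the file object do the encoding by opening it with an explicit encoding; a sketch for Python 2 using io.open (the output file name is just an example, chosen to avoid overwriting the input):

import io

html = soup.prettify()  # a unicode object in Python 2

with io.open('index_pretty.html', 'w', encoding='utf-8') as out:
    out.write(html)  # the file object encodes the unicode text as UTF-8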
Hope this helps!
