UnicodeEncodeError - works in Spyder but not when executed from terminal - python

I'm using BeautifulSoup to parse some HTML, with Spyder as my editor (both brilliant tools, by the way!). The code runs fine in Spyder, but when I try to execute the .py file from the terminal, I get an error:
file = open('index.html','r')
soup = BeautifulSoup(file)
html = soup.prettify()
file1 = open('index.html', 'wb')
file1.write(html)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 5632: ordinal not in range(128)
I'm running openSUSE on a Linux server, with Spyder installed using zypper.
Does anyone have any suggestions what the problem might be?
Many thanks.

That is because you must encode the result before outputting it (i.e. before writing it to the file):
file1.write(html.encode('utf-8'))
See every file has an attribute file.encoding. To quote the docs:
file.encoding
The encoding that this file uses. When Unicode strings
are written to a file, they will be converted to byte strings using
this encoding. In addition, when the file is connected to a terminal,
the attribute gives the encoding that the terminal is likely to use
(that information might be incorrect if the user has misconfigured the
terminal). The attribute is read-only and may not be present on all
file-like objects. It may also be None, in which case the file uses
the system default encoding for converting Unicode strings.
See the last sentence? soup.prettify returns a Unicode object and given this error, I'm pretty sure you're using Python 2.7 because its sys.getdefaultencoding() is ascii.
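As a minimal sketch of an alternative to calling .encode() by hand, you can let the file object do the encoding: io.open accepts an encoding argument on Python 2.7 as well as on Python 3. The html string and the temporary output path below are stand-ins for illustration, not your real data.

```python
import io
import os
import tempfile


def write_utf8(path, text):
    """Write a Unicode string to path, encoding it as UTF-8."""
    # io.open encodes on write, so no implicit 'ascii' conversion happens.
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(text)


# Stand-in for the Unicode object returned by soup.prettify().
html = u"<p>\u00a9 example</p>"

# Hypothetical output location; a temporary directory keeps the sketch harmless.
path = os.path.join(tempfile.mkdtemp(), "index_pretty.html")
write_utf8(path, html)

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, "rb") as raw:
    data = raw.read()
print(repr(data))
```

The copyright sign ends up as the two UTF-8 bytes 0xC2 0xA9 instead of triggering the 'ascii' codec.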
Hope this helps!

Related

UnicodeEncodeError in python3 when redirection is used

What I want to do: extract text information from a pdf file and redirect that to a txt file.
What I did:
pip install pdfminer
pdf2txt.py file.pdf > output.txt
What I got:
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 0: illegal multibyte sequence
My observation:
\u2022 is bullet point, •.
pdf2txt.py works well without redirection: the bullet point character is written to stdout without any error.
My question:
Why does redirection cause a Python error? As far as I know, redirection is an OS job, and it simply copies things after the program is finished.
How can I fix this error? I cannot do any modification to pdf2txt.py as it's not my code.
Redirection causes an error because the default encoding used by Python does not support one of the characters you're trying to output. In your case you're trying to output the bullet character • using the GBK codec. This probably means you're using a Chinese version of Windows.
A version of Python 3.6 or later will work fine outputting to the terminal window on Windows, because character encoding is bypassed completely using Unicode. It's only when redirecting the output to a file that the Unicode must be encoded to a byte stream.
You can set the environment variable PYTHONIOENCODING to change the encoding used for stdio. If you use UTF-8 it will be guaranteed to work with any Unicode character.
set PYTHONIOENCODING=utf-8
pdf2txt.py file.pdf > output.txt
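The effect of PYTHONIOENCODING can be checked without pdf2txt.py at all. The sketch below is an illustration under assumptions: it starts a child Python whose stdout is a pipe (much like a redirect) and prints the bullet character once with utf-8 and once with ascii. The helper name run_child is made up for this example.

```python
import os
import subprocess
import sys


def run_child(io_encoding):
    """Run a child Python that prints U+2022 with stdout connected to a pipe."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding  # same variable as in the answer above
    return subprocess.run(
        [sys.executable, "-c", "print('\\u2022')"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )


ok = run_child("utf-8")   # the bullet can be encoded
bad = run_child("ascii")  # the bullet cannot be encoded, so the child fails

print(ok.returncode, ok.stdout)
print(bad.returncode)
```

With utf-8 the child exits normally and the pipe receives the encoded bullet; with ascii the child stops with a UnicodeEncodeError, which is the same kind of failure the redirect triggers with GBK.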
You seem to have obtained Unicode characters from the raw bytes, but you still need to encode them. I recommend using UTF-8 encoding for txt files.
Making the encoding parameter more explicit is probably what you want.
def gbk_to_utf8(source, target):
    # Decode the source file as GBK and re-encode it as UTF-8, line by line.
    with open(source, "r", encoding="gbk") as src:
        with open(target, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(line)

Python 3.6.1 - Printing string as human readable text, special characters

I'm building a little django 1.1 app (though I believe this issue to be specific to Python) where I've come to use commands to control the flow of getting and categorizing data. I also wish to print a sort of summary using a third command. I am using macOS 10.12.3
My problem comes from getting text data in and printing it to the console or a document using
> or >>
in the console.
I'm running these scripts using an alias of Python 3.6.1
I'm using the Tweepy api, but that should hopefully not be relevant.
These snippets should illustrate the problem I'm hoping to solve:
print(type(data))
print(type(data.text))
try:
    print(data.text)
except UnicodeEncodeError:
    print("no printing today :(")
print(type(data.text.encode('UTF-8')))
print(data.text.encode('UTF-8'))
this outputs:
<class 'tweepy.models.Status'>
<class 'str'>
no printing today :(
<class 'bytes'>
b'kontroll p\xc3\xa5 ... v\xc3\xa5pen.'
The ugly things there should both be the character 'å'.
This is the error that would be thrown:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 223: ordinal not in range(128)
It says 'ascii' codec, but doing (in my Python 3.6.1 script):
print(sys.getdefaultencoding())
outputs:
utf-8
Running
print(sys.getdefaultencoding())
again in Python 2.7.10 outputs:
ascii
So the thrown error matches what 2.7.10 outputs. I am not discounting the possibility that I could be wrong about what a default encoder does.
I have also tried
export LOCALE="no_NB.UTF-8"
in an attempt to see if this could be caused by my system (unless I'm misunderstanding what this does). I did not write this to any file, thinking it would persist through the current session.
Is the wrong encoder being used somehow? Could it be my terminal encoding? How can I write my special characters to the terminal and file? Are strings really this hard to get right?
Any help is greatly appreciated!!
Setting
export LC_ALL=no_NO.UTF-8
export LANG=no_NO.UTF-8
in my .bash_profile now allows me to see the characters I want in my terminal and it is also successfully echoed to a file.
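To see which encodings Python has actually picked up after changing the locale variables, a small diagnostic along these lines may help. It is only a sketch: the function name encoding_report is made up for illustration, and the values it prints depend on your environment.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python uses in different places."""
    return {
        # Used for implicit str/bytes conversions; always 'utf-8' on Python 3.
        "default": sys.getdefaultencoding(),
        # Used by print(); derived from the terminal or locale, may be None.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Used by open() in text mode when no encoding argument is given.
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in sorted(encoding_report().items()):
    print(name, "=", value)
```

If "stdout" or "preferred" reports an ASCII-only encoding while "default" says utf-8, the locale is the likely culprit rather than the default encoding.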

“UnicodeEncodeError: 'ascii' codec can't encode character” in Python3

I'm fetching JSON with Requests from an API (using Python 3.5) and when I try to print (or use) the JSON, either via response.text, json.loads(...) or response.json(), I get a UnicodeEncodeError.
print(response.text)
UnicodeEncodeError: 'ascii' codec can't encode character '\xc5' in position 676: ordinal not in range(128)
The JSON contains an array of dictionaries with country names and some of them contain special characters, e.g.: (just one dictionary in the binary array for example)
b'[{\n "name" : "\xc3\x85land Islands"\n}]'
I have no idea why there is an encoding problem, or why "ascii" is used when Requests detects a UTF-8 encoding (even setting it manually to UTF-8 doesn't change anything).
Edit2: The problem was Microsoft Visual Studio Code 1.4. It wasn't able to print the characters.
If your code is running within VS, then it sounds like Python can't work out the encoding of the built-in console, so it defaults to ASCII. If you try to print any non-ASCII character, Python throws an error rather than printing text that won't display.
You can force Python's encoding by using the PYTHONIOENCODING environment variable. Set it within the run configuration for the script.
Depending on Visual Studio's console, you may get away with:
PYTHONIOENCODING=utf-8
or you may have to use a typical 8-bit charset like:
PYTHONIOENCODING=windows-1252
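To make the difference between these settings concrete, here is a small sketch of what each codec does with the 'Å' from the question. It only illustrates the encoding step that print performs; the helper try_encode is a made-up name for this example.

```python
text = "\u00c5land Islands"  # 'Åland Islands', as in the JSON from the question


def try_encode(s, codec):
    """Return the encoded bytes, or the exception if the codec cannot encode s."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError as exc:
        return exc


results = {
    codec: try_encode(text, codec)
    for codec in ("utf-8", "windows-1252", "ascii")
}

for codec, outcome in results.items():
    print(codec, "->", repr(outcome))
```

UTF-8 produces the two bytes \xc3\x85 seen in the raw response, windows-1252 produces the single byte \xc5, and ascii raises the error from the question.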

UnicodeDecodeError On Unicode File Read

I have a problem where, when I execute a script which involved reading in data from a file that contains unicode code points, everything works fine. But when it is executed via another application, it is raising the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
I am executing the exact same code using the exact same data file. A sample datafile that replicates the problem is like this:
¥ Α © §
I called this sample.txt
A very simple python script to simply read in and print the file contents:
with open("sample.txt") as f:
    for line in f:
        print(line)
print("Done")
This executes fine from the command line; executing via Apache/CGI fails with the above error.
A hint to the problem came from the documentation of the open function:
In text mode, if encoding is not specified the encoding used is
platform dependent: locale.getpreferredencoding(False) is called to
get the current locale encoding.
Platform dependent suggested environment variables. So, I inspected what environment variables were set for my shell, and found LANG set to en_US.UTF-8. Dumping the environment variables set by Apache found that LANG was missing.
So, apparently when locale cannot be determined, Python uses ASCII as the default file encoding. As a result, the error was encountered when the ordinal was out of range for ASCII.
To fix this, I set this environment variable in my CGI script. If the environment variable is somehow missing from a user shell, it can be set via normal methods, or just by:
export LANG=en_US.UTF-8
Or whatever preferred encoding is desired.
Note, the issue is probably far more noticeable if the locale is missing from a user shell, as text editors like vi will not display characters without it. It was significantly more subtle when it only showed up when the script was called from Apache (or some other application).
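An alternative that does not depend on the environment at all is to pass the encoding to open explicitly. The following is a sketch assuming Python 3 and a UTF-8 data file; it recreates sample.txt in a temporary directory so that it is self-contained, and the helper name read_lines is made up for illustration.

```python
import os
import tempfile

# Recreate the sample file from the question: the characters ¥ Α © §
sample = "\u00a5 \u0391 \u00a9 \u00a7\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)


def read_lines(path):
    """Read a UTF-8 text file regardless of LANG or the current locale."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


lines = read_lines(path)

# Forcing ASCII reproduces the failure seen under Apache/CGI.
ascii_error = None
try:
    with open(path, encoding="ascii") as f:
        f.read()
except UnicodeDecodeError as exc:
    ascii_error = exc

print(len(lines), type(ascii_error).__name__)
```

This only addresses reading the file. print still encodes with whatever encoding stdout uses, so under CGI the LANG fix (or an equivalent setting for stdout) may still be needed.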

Converting from ascii to utf-8 with Python

I have an XMPP bot written in Python. One of its plugins is able to execute OS commands and send the output to the user. As far as I know, the output should be unicode-like to send it over the XMPP protocol. So I tried to handle it this way:
output = os.popen(cmd).read()
if not isinstance(output, unicode):
    output = unicode(output,'utf-8','ignore')
bot.send(xmpp.Message(mess.getFrom(),output))
But when Russian symbols appear in output they aren't converted well.
sys.getdefaultencoding()
says that default command prompt encoding is 'ascii', but when I try to do
output.decode('ascii')
in python console I get
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 1: ordinal not in range(128)
OS: Win XP, Python 2.5.4
PS: Sorry for my English :(
sys.getdefaultencoding() returns python's default encoding - which is ASCII unless you have changed it. ASCII doesn't support Russian characters.
You need to work out what encoding the actual text is, either manually, or using the locale module.
Typically something like:
import locale
encoding = locale.getpreferredencoding(do_setlocale=True)
ASCII has no defined character values above 127 (0x7F). Perhaps you mean the Cyrillic code page? It's 866.
See http://en.wikipedia.org/wiki/Code_page
edit: since this answer was marked correct, presumably 866 worked, but as other answers have pointed out, 866 is not the only Russian-language code page. If you use a code page different from the one that was used when the Russian symbols were encoded, you will get the wrong result.
You say """sys.getdefaultencoding() says that default command prompt encoding is 'ascii'"""
sys.getdefaultencoding says NOTHING about the "command prompt" encoding.
On Windows, sys.stdout.encoding should do the job. On my machine, it contains cp850 when Python is run in a Command Prompt window, and cp1252 in IDLE. Yours should contain cp866 and cp1251 respectively.
Update: You say that you still need cp866 in IDLE. Note this:
IDLE 2.6.4
>>> import os
>>> os.popen('chcp').read()
'Active code page: 850\n'
>>>
So when your app starts up, check if you are on Windows and if so, parse the result of os.popen('chcp').read(). The text before the : is probably locale-dependent. codepage = result.split()[-1] may be good enough "parsing". On Unix, which doesn't have a Windows/MS-DOS split personality, sys.stdout.encoding should be OK.
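The "parsing" suggested above could look roughly like this. It is a sketch, not tested on every Windows locale: the function name parse_chcp is made up, the sample string is the output shown above, and stripping a trailing period is an assumption about how some localized versions format the line.

```python
import codecs


def parse_chcp(output):
    """Turn chcp output such as 'Active code page: 850' into a codec name."""
    # The text before the colon is locale-dependent, so only use the last token.
    number = output.split()[-1].rstrip(".")  # trailing period: assumed for some locales
    return "cp" + number


codec_name = parse_chcp("Active code page: 850\n")
codecs.lookup(codec_name)  # raises LookupError if Python has no such codec
print(codec_name)
```

On Windows you would feed it the result of os.popen('chcp').read() and then decode the command output with the returned codec name.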
In Python 'cp855', 'cp866', 'cp1251', 'iso8859_5', 'koi8_r' are differing Russian code pages. You'll need to use the right one to decode the output of popen. In the Windows console, the 'chcp' command lists the code page used by console commands. That won't necessarily be the same code page as Windows applications. On US Windows, 'cp437' is used for the console and 'cp1252' is used for applications like Notepad.
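As a small illustration of why the right code page matters, the sketch below encodes a Russian word the way a cp866 console would emit it and then decodes the bytes with two different code pages. It is written for Python 3 (the question uses Python 2.5, where the same idea applies via output.decode('cp866')), and the sample word is made up.

```python
# "Привет" as bytes, the way a cp866 console would emit it.
raw = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp866")

right = raw.decode("cp866")                     # matches how the bytes were produced
wrong = raw.decode("cp1251", errors="replace")  # valid call, but the wrong code page

print(right)
print(right == wrong)
```

Decoding with the wrong code page does not necessarily raise an error; it can silently give different characters, which is why the result looks "not converted well".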
