Well, since yesterday I've been having trouble with this. I need to save some text into a ".txt" file; the problem is that there are HTML entities in the text I'm trying to save.
So I imported HTMLParser in my code:
import HTMLParser
h = HTMLParser.HTMLParser()
print h.unescape(text)  # right?
The thing is that this works when you try to print the result, but I'm trying to return this to a function of mine which actually saves the text to the file. So, when I'm trying to save the file, the system says:
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 0: ordinal not in range(128)
I've been reading about this but I cannot reach any conclusion. I tried BeautifulSoup, I tried functions from famous Pythonistas, and none worked. Can you help me with this? I need to save the text in the file as Unicode, and by Unicode I understand it will save characters like á, right?
"Save Unicode character to a file" is a different question from "Decoding HTML Entities to Unicode". Your code (h.unescape(text)) already decodes the HTML entities correctly.
The exception is due to print unicode_text e.g.:
print u"\N{EURO SIGN}"
should produce a similar error.
If you're saving to a file by redirecting the output of the python script e.g.:
$ python -m your_module >output.txt #XXX raises an error for non-ascii data
then define PYTHONIOENCODING=utf-8 envvar (to save using utf-8 encoding):
$ PYTHONIOENCODING=utf-8 python -m your_module >output.txt
If you want to save to a file directly in your Python code, use the io module:
import io

with io.open(filename, 'w', encoding='utf-8') as file:
    file.write(h.unescape(text))
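For the record, a quick round-trip check confirms that io.open handles the encoding for you, with no explicit encode/decode calls in your own code. This sketch runs on both Python 2 and 3; the sample text and temp path are made up for illustration:

```python
import io
import os
import tempfile

# Hypothetical decoded text containing non-ASCII characters,
# including the \xab character from the traceback above
text = u"caf\xe9 \xab quoted \xbb"
path = os.path.join(tempfile.mkdtemp(), "out.txt")

with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)            # accepts unicode directly; encoded to UTF-8 on write

with io.open(path, "r", encoding="utf-8") as f:
    assert f.read() == text  # decoded back to the same unicode text
```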
Related
What I want to do: extract text information from a pdf file and redirect that to a txt file.
What I did:
pip install pdfminer
pdf2txt.py file.pdf > output.txt
What I got:
UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 0: illegal multibyte sequence
My observation:
\u2022 is bullet point, •.
pdf2txt.py works well without redirection: the bullet point character is written to stdout without any error.
My question:
Why does redirection cause a Python error? As far as I know, redirection is an O.S. job, and it simply copies things after the program is finished.
How can I fix this error? I cannot do any modification to pdf2txt.py as it's not my code.
Redirection causes an error because the default encoding used by Python does not support one of the characters you're trying to output. In your case you're trying to output the bullet character • using the GBK codec. This probably means you're using a Chinese version of Windows.
A version of Python 3.6 or later will work fine outputting to the terminal window on Windows, because character encoding is bypassed completely using Unicode. It's only when redirecting the output to a file that the Unicode must be encoded to a byte stream.
You can set the environment variable PYTHONIOENCODING to change the encoding used for stdio. If you use UTF-8 it will be guaranteed to work with any Unicode character.
set PYTHONIOENCODING=utf-8
pdf2txt.py file.pdf > output.txt
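The effect of the variable can be checked from Python itself. This sketch (assuming Python 3, with sys.executable as the child interpreter) captures a child's output through a pipe, which, like a file redirection, is not a terminal:

```python
import os
import subprocess
import sys

# Run a child interpreter with PYTHONIOENCODING set and its stdout piped;
# a pipe, like a redirection to a file, is not a terminal, so the child
# must pick an encoding for its output stream.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
out = subprocess.check_output(
    [sys.executable, "-c", "print('\\u2022 bullet')"],
    env=env,
)

# The bullet point arrives as its UTF-8 byte sequence, not an error
assert out.strip() == u"\u2022 bullet".encode("utf-8")
```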
You seem to have somehow obtained unicode characters from the raw bytes, but you need to encode them. I recommend using the UTF-8 encoding for .txt files.
Making the encoding parameter more explicit is probably what you want.
def gbk_to_utf8(source, target):
    # Decode GBK on read, encode UTF-8 on write, line by line
    with open(source, "r", encoding="gbk") as src:
        with open(target, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(line)
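As a sketch of how such a converter behaves (Python 3; the file names and the sample Chinese text are arbitrary, chosen because it is representable in GBK):

```python
import os
import tempfile

def gbk_to_utf8(source, target):
    # Decode GBK on read, encode UTF-8 on write, line by line
    with open(source, "r", encoding="gbk") as src:
        with open(target, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(line)

d = tempfile.mkdtemp()
src_path = os.path.join(d, "in.txt")
dst_path = os.path.join(d, "out.txt")

# Create a GBK-encoded input file ("ni hao" is representable in GBK)
with open(src_path, "w", encoding="gbk") as f:
    f.write(u"\u4f60\u597d\n")

gbk_to_utf8(src_path, dst_path)

# The converted file decodes cleanly as UTF-8 with the same content
with open(dst_path, "r", encoding="utf-8") as f:
    assert f.read() == u"\u4f60\u597d\n"
```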
So, I would like to make a program that does 2 things:
Reads A Word
Reads the translation in Greek
Then I make a new string that looks like this: "word,translation" and I write it into a file.
So the test.txt file should contain "Hello,Γεια", and in case I read again, the next line should go under this one.
word=raw_input("Word:\n") #The Word
translation=raw_input("Translation:\n").decode("utf-8") #The Translation in UTF-8
format=word+","+translation+"\n"
file=open("dict.txt","w")
file.write(format.encode("utf-8"))
file.close()
The Error I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 0: invalid start byte
EDIT: This is Python 2
Although Python 2 supports unicode, its input is not automatically decoded into unicode for you. raw_input returns a byte string, and if something other than ASCII is piped in, you get the encoded bytes. The trick is to figure out what that encoding is, and that depends on whatever is pumping data into the program. If it's a terminal, then sys.stdin.encoding should tell you what encoding to use. If it's piped in from, say, a file, then sys.stdin.encoding is None and you just kind of have to know what it is.
A solution to your problem follows. Note that even though your method of writing the file (encode then write) works, the codecs module provides a file object that does it for you.
import sys
import codecs

# just randomly picking an encoding.... a command line param may be
# useful if you want to get input from files
_stdin_encoding = sys.stdin.encoding or 'utf-8'

def unicode_input(prompt):
    return raw_input(prompt).decode(_stdin_encoding)

word = unicode_input("Word:\n")  # The Word
translation = unicode_input("Translation:\n")
format = word + "," + translation + "\n"
with codecs.open("dict.txt", "w", encoding="utf-8") as myfile:
    myfile.write(format)
I'm using BeautifulSoup to Parse some html, with Spyder as my editor (both brilliant tools by the way!). The code runs fine in Spyder, but when I try to execute the .py file from terminal, I get an error:
file = open('index.html','r')
soup = BeautifulSoup(file)
html = soup.prettify()
file1 = open('index.html', 'wb')
file1.write(html)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 5632: ordinal not in range(128)
I'm running OPENSUSE on a linux server, with Spyder installed using zypper.
Does anyone have any suggestions what the problem might be?
Many thanks.
That is because before outputting the result (i.e. writing it to the file) you must encode it first:
file1.write(html.encode('utf-8'))
See every file has an attribute file.encoding. To quote the docs:
file.encoding
The encoding that this file uses. When Unicode strings
are written to a file, they will be converted to byte strings using
this encoding. In addition, when the file is connected to a terminal,
the attribute gives the encoding that the terminal is likely to use
(that information might be incorrect if the user has misconfigured the
terminal). The attribute is read-only and may not be present on all
file-like objects. It may also be None, in which case the file uses
the system default encoding for converting Unicode strings.
See the last sentence? soup.prettify returns a Unicode object and given this error, I'm pretty sure you're using Python 2.7 because its sys.getdefaultencoding() is ascii.
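To see why the implicit ascii encode blows up, here is a minimal sketch (shown with Python 3 str/bytes semantics for clarity; u'\xa9' is the copyright sign from the traceback):

```python
text = u"\xa9 example"   # the copyright sign from the traceback

# Encoding with ascii is exactly what fails under the hood when a
# unicode string is written to a byte-oriented file
try:
    text.encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# Encoding explicitly with a capable codec avoids the error
assert text.encode("utf-8") == b"\xc2\xa9 example"
```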
Hope this helps!
Apologies if this is a duplicate or something really obvious, but please bear with me as I'm new to Python. I'm trying to use cElementTree (Python 2.7.5) to parse an XML file within Applescript. The XML file contains some fields with non-ASCII text encoded as entities, such as <foo>caf&#233;</foo>.
Running the following basic code in Terminal outputs pairs of tags and tag contents as expected:
import xml.etree.cElementTree as etree
parser = etree.XMLParser(encoding="utf-8")
tree = etree.parse("myfile.xml", parser=parser)
root = tree.getroot()
for child in root:
    print child.tag, child.text
But when I run that same code from within Applescript using do shell script, I get the dreaded UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 10: ordinal not in range(128).
I found that if I change my print line to
print [child.tag, child.text]
then I do get a string containing XML tag/value pairs wrapped in [''], but any non-ASCII characters then get passed onto Applescript as the literal Unicode character string (so I end up with u'caf\\xe9').
I tried a couple of things, including a.) reading the .xml file into a string and using .fromstring instead of .parse, b.) trying to convert the .xml file to str before importing it into cElementTree, c.) just sticking .encode wherever I could to see if I could avoid the ASCII codec, but no solution yet. I'm stuck using Applescript as a container, unfortunately. Thanks in advance for advice!
You need to encode at least child.text into something that Applescript can handle. If you want the character entity references back, this will do it:
print child.tag.encode('ascii', 'xmlcharrefreplace'), child.text.encode('ascii', 'xmlcharrefreplace')
Or if it can handle something like utf-8:
print child.tag.encode('utf-8'), child.text.encode('utf-8')
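For illustration, here is what each option produces for an accented character (a sketch; 233 is the code point of é):

```python
text = u"caf\xe9"

# xmlcharrefreplace turns non-ASCII characters into numeric
# character references, recovering the entity form
assert text.encode("ascii", "xmlcharrefreplace") == b"caf&#233;"

# UTF-8 keeps the character as a two-byte sequence instead
assert text.encode("utf-8") == b"caf\xc3\xa9"
```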
Not AppleScript's fault - it's Python being "helpful" by guessing for you what output encoding to use. (Unfortunately, it guesses differently depending whether or not a terminal is attached.)
Simplest solution (Python 2.6+) is to set the PYTHONIOENCODING environment variable before invoking python:
do shell script "export PYTHONIOENCODING=UTF-8; /usr/bin/python '/path/to/script.py'"
or:
do shell script "export PYTHONIOENCODING=UTF-8; /usr/bin/python << EOF
# -*- coding: utf-8 -*-
# your Python code goes here...
print u'A Møøse once bit my sister ...'
EOF"
Probably I completely don't understand it, so can you take a look at the code examples and tell me what I should do to be sure it will work?
I tried it in Eclipse with PyDev. I use Python 2.6.6 (because of some library that doesn't support Python 2.7).
First, without using the codecs module:
# -*- coding: utf-8 -*-
file1 = open("samoloty1.txt", "w")
file2 = open("samoloty2.txt", "w")
file3 = open("samoloty3.txt", "w")
file4 = open("samoloty4.txt", "w")
file5 = open("samoloty5.txt", "w")
file6 = open("samoloty6.txt", "w")
# I know that this is weird, but it shows that whatever I do, it doesn't ruin anything...
print u"ą✈✈"
file1.write(u"ą✈✈")
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈".decode("utf-8")
file3.write("ą✈✈".decode("utf-8"))
print "ą✈✈".encode("utf-8")
file4.write("ą✈✈".encode("utf-8"))
print u"ą✈✈".decode("utf-8")
file5.write(u"ą✈✈".decode("utf-8"))
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
file1.close()
file2.close()
file3.close()
file4.close()
file5.close()
file6.close()
file1 = open("samoloty1.txt", "r")
file2 = open("samoloty2.txt", "r")
file3 = open("samoloty3.txt", "r")
file4 = open("samoloty4.txt", "r")
file5 = open("samoloty5.txt", "r")
file6 = open("samoloty6.txt", "r")
print file1.read()
print file2.read()
print file3.read()
print file4.read()
print file5.read()
print file6.read()
Each of those prints works correctly and I don't get any funny characters.
Also I tried this: I deleted all the files made in the previous test and changed only those lines:
file1 = open("samoloty1.txt", "w")
to those:
file1 = codecs.open("samoloty1.txt", "w", encoding='utf-8')
and again everything works...
Can anyone make some examples what works, and what not?
Should this be a separate question?
I am downloading web pages, through this:
content = urllib.urlopen(some_url).read()
ucontent = unicode(content, encoding) # i get encoding from headers
Is this correct and sufficient? What should I do next with it to store it in a UTF-8 file? (I ask because whatever I did before just works...)
** UPDATE **
Probably everything works OK because PyDev (or just Eclipse) has its terminal encoded in UTF-8. So for tests I used cmd from Windows 7 and I got some errors. Now everything was crashing as expected. :D Here I am showing what I changed to get it working again (and all of those changes are reasonable to me, and they agree with what I learned from the answers and from the Python documentation).
print u"ą✈✈".encode("utf-8") # added encode
file1.write(u"ą✈✈".encode("utf-8")) # added encode
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈" # removed .decode("utf-8")
file3.write("ą✈✈") # removed .decode("utf-8"))
print "ą✈✈" # removed .encode("utf-8")
file4.write("ą✈✈") # removed .encode("utf-8"))
print u"ą✈✈".encode("utf-8") # changed from .decode("utf-8")
file5.write(u"ą✈✈".encode("utf-8")) # changed from .decode("utf-8")
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
And like someone said, when I use codecs, I don't need to use encode() every time before writing to the file. :)
The question is, which answer should be marked as correct?
You are just lucky that the encoding of your console is utf-8 by default.
If you pass a unicode object to the write method of a file object (sys.stdout), the object is implicitly encoded using the file's encoding attribute.
Those who work in Windows are not so lucky: How to workaround Python "WindowsError messages are not properly encoded" problem?
All those write exercises in the code snippet actually boil down to two situations:
when you write a byte string to the file
when you try to write a unicode string to the file
Let's call the byte string s and the unicode string u.
Then fileN.write(s) makes sense, and fileN.write(u) doesn't. I don't know about your setup (maybe you have made some changes to your site's Python), but the following breaks here, as expected:
# -*- coding: utf-8 -*-
ff = open("ff.txt", "w")
ff.write(u"ą✈✈")
ff.close()
with:
Traceback (most recent call last):
File "ex.py", line 5, in <module>
ff.write(u"ą✈✈")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
This means that a unicode string should be converted to a byte string before writing it to a file. Your file6 example shows how to do it:
u"ą✈✈".encode("utf-8")
The magic comment -*- coding: utf-8 -*- is what enables you to write unicode string literals in a WYSIWYG way (u"ą✈✈"); it doesn't help you determine your encoding in any other situation.
Thus, do not give the .write() method any unicode string in Python 2.6. The good practice is to work with unicode strings in your code but convert from/to a concrete encoding at the input/output borders.
The codecs example is good, as well as urllib.
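To make the decode-at-input/encode-at-output rule concrete, a minimal sketch (shown with Python 3 bytes/str for clarity; the sample bytes are the UTF-8 encoding of 'ą'):

```python
raw = b"\xc4\x85 text"      # bytes as they might arrive from a file or socket

u = raw.decode("utf-8")     # decode once, at the input border
u = u.upper()               # work only with unicode text inside the program

out = u.encode("utf-8")     # encode once, at the output border
assert out == b"\xc4\x84 TEXT"   # U+0105 upper-cased to U+0104
```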
What you are doing is correct. See this Python unicode howto for more info.
The general principles are:
When binary data comes into your application (e.g., open(), urllib.urlopen()), use the decode() method to get a unicode string.
If the byte string is invalid for the supplied encoding, you may get UnicodeDecodeError. In this case do one of the following:
Use the second argument to decode() to either replace or ignore bad characters;
Try harder to find out what the real encoding is;
Fix the input if it really is mangled.
For files, you can use the codecs.open wrappers to do this transparently for you.
Network data you must generally decode by hand, but sometimes the payload declares its own encoding (e.g., html, XML), and sometimes it doesn't match the header!
For database data, usually the database driver will have some method of doing encoding/decoding transparently for you and always give you unicode strings. Otherwise you will need to encode/decode by hand.
Use unicode strings in your application.
Right before the binary data leaves your application, use encode() on the string to encode to your desired encoding.
If your target encoding cannot represent some of your unicode characters, you may get UnicodeEncodeError. In this case do one of the following:
Use the second argument to encode() to ignore or replace characters that can't be represented in the target encoding;
Don't generate these characters in your application.
Find an alternate way of representing them. E.g., in XML, you can use a numeric character entity.
For files, you may use the codecs.open wrapper to do encoding for you transparently.
For database connections, the driver will often have an option to accept unicode strings and encode for you.
For network connections, you must generally encode by hand. Sometimes the payload will be generated by a library that will encode properly for you (e.g., writing XML).
Because you are correctly using the magic "coding comment," everything works as expected.
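The error-handler arguments mentioned in the principles above can be sketched like this (the sample bytes reuse the invalid 0x82 byte from one of the tracebacks earlier on this page):

```python
bad = b"caf\x82e"   # 0x82 is not a valid UTF-8 start byte

# Decode-side handlers: substitute U+FFFD, or drop the bad byte entirely
assert bad.decode("utf-8", "replace") == u"caf\ufffde"
assert bad.decode("utf-8", "ignore") == u"cafe"

# Encode-side handler: substitute '?' for characters the target
# encoding cannot represent
assert u"caf\xe9".encode("ascii", "replace") == b"caf?"
```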