which encoding does the python lxml module use internally?

When I get a web page, I use chardet to detect the encoding and convert the content to UTF-8, like this:
import urllib2
import chardet
from lxml import html

content = urllib2.urlopen(url).read()
encoding = chardet.detect(content)['encoding']
if encoding != 'utf-8':
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)
but when I use:
text = doc.text_content()
print type(text)
The output is <type 'lxml.etree._ElementUnicodeResult'>.
Why? I thought it would be a UTF-8 string.

lxml.etree._ElementUnicodeResult is a class that inherits from unicode:
$ pydoc lxml.etree._ElementUnicodeResult
lxml.etree._ElementUnicodeResult = class _ElementUnicodeResult(__builtin__.unicode)
| Method resolution order:
| _ElementUnicodeResult
| __builtin__.unicode
| __builtin__.basestring
| __builtin__.object
In Python, it's fairly common to have classes that extend from base types to add some module-specific functionality. It should be safe to treat the object like a regular Unicode string.
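A quick check (a minimal sketch, reusing the doc object from the question) confirms it behaves like ordinary unicode:
text = doc.text_content()
print isinstance(text, unicode)  # True: it subclasses unicode
print text.upper()[:40]          # all the usual unicode methods work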

You might want to skip the re-encoding step, as lxml.html will automatically use the encoding specified in the source file, and as long as it ends up as valid unicode, there's perhaps no reason to be concerned with how it was initially encoded.
Unless your project is so small and informal that you can be sure you will never encounter 8-bit strings (i.e. it's always 7-bit ASCII, English with no special characters), it's wise to get your text into unicode as early as possible (like right after retrieval) and keep it that way until you need to serialize it for writing to a file or sending over a socket.
The reason you're seeing <type 'lxml.etree._ElementUnicodeResult'> is that lxml.html.fromstring() does the decode step for you automatically. Note that this means your code above will not work for a page encoded in UTF-16, for example: the 8-bit string you pass in will be UTF-8, but the HTML will still declare UTF-16:
<meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
and lxml will try to decode the string using UTF-16 rules, which I would expect to raise an exception in short order.
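A minimal sketch of the simpler approach, skipping the re-encoding entirely and letting lxml decode the raw bytes itself (assuming the same url as in the question):
import urllib2
from lxml import html

content = urllib2.urlopen(url).read()          # raw bytes, whatever the charset
doc = html.fromstring(content, base_url=url)   # lxml decodes using the declared encoding
text = doc.text_content()                      # already unicode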
If you want the output serialized as a UTF-8 encoded 8-bit string, all you need is this:
>>> text = doc.text_content().encode('utf-8')
>>> print type(text)
<type 'str'>

Related

Python - HTML to Unicode

I have a Python script where I get some HTML and parse it using Beautiful Soup. Sometimes the HTML contains non-unicode characters, and that causes errors in my script and in the file I am creating.
Here is how I am getting the HTML:
import urllib2
from lxml import etree

html = urllib2.urlopen(url).read().replace('\xc2\xa0', '')  # the replaced character is a non-breaking space (U+00A0)
xml = etree.HTML(html)
When I use this
html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace')
I get a UnicodeDecodeError.
How can I change this into unicode, so that my code won't break if there are non-unicode characters?
unicode characters -> bytes = ‘encode’
bytes -> unicode characters = ‘decode’
You have bytes and you want unicode characters, so the method for that is decode. As you have used encode, Python thinks you want to go from characters to bytes, so tries to convert the bytes to characters so they can be turned back to bytes! It uses the default encoding for this, which in your case is ASCII, so it fails for non-ASCII bytes.
However it is unclear why you want to do this. etree parses bytes as-is. If you want to remove character U+00A0 Non Breaking Space from your data you should do that with the extracted content you get after HTML parsing, rather than try to grapple with the HTML source version. HTML markup might include U+00A0 as raw bytes, incorrectly-unterminated entity references, numeric character references and so on. Let the HTML parser handle that for you, it's what it's good at.
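For example, a minimal sketch of removing U+00A0 after parsing rather than from the raw source (assuming urllib2 and lxml's etree as in the question):
import urllib2
from lxml import etree

xml = etree.HTML(urllib2.urlopen(url).read())  # parse the raw bytes as-is
text = xml.xpath('string()')                   # extracted content is unicode
clean = text.replace(u'\xa0', u'')             # now U+00A0 is a single, known character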
If you feed HTML to BeautifulSoup, it will decode it to Unicode.
If the charset declaration is wrong or missing, or parts of the document are encoded differently, this might fail; there is a special module that comes with BeautifulSoup, UnicodeDammit, which might help you with those documents.
If you mention BeautifulSoup, why don't you do it like this:
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen(url).read())
and work with the soup?
BTW, all HTML entities will be resolved to unicode characters.
The ascii character set is very limited and might lack many characters in your document. I'd use utf-8 instead whenever possible.
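Putting that together, a minimal sketch (get_text is the standard BS4 accessor; encoding to UTF-8 is only needed when you must write bytes out):
from bs4 import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen(url).read())
text = soup.get_text()        # unicode, with HTML entities already resolved
data = text.encode('utf-8')   # encode only at the output boundary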

Python unicode force convert to ascii (str)

When using POST in Django, a byte string is automatically converted into a unicode string.
For example:
s = '\xe2\x80\x99'
is a str-type string (which is UTF-8 encoded).
When this string is posted to Django and then read back from request.POST, it has become the unicode string:
u'\xe2\x80\x99'
This can cause decode/encode errors, because Python now treats it as unicode text when in fact it holds UTF-8 bytes.
My question is: how do I FORCE the unicode string back to a byte string? That is, just remove the leading 'u' from u'\xe2\x80\x99' to get '\xe2\x80\x99'. The traditional decode and encode methods don't seem to work in this situation.
When the request was received, the encoding of the posted data was mis-declared as (probably) iso-8859-1, or perhaps not declared at all, defaulting to that encoding. The web site should declare its encoding correctly with a header:
<head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
</head>
But if that isn't under your control, you can undo the encoding and decode it correctly:
>>> s = u'\xe2\x80\x99'
>>> s.encode('iso-8859-1')
'\xe2\x80\x99'
>>> s.encode('iso-8859-1').decode('utf8')
u'\u2019'
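Wrapped up as a helper, the round trip looks like this (a minimal sketch; the name fix_mojibake is mine, not a library function):
def fix_mojibake(s):
    """Undo a UTF-8 byte string that was mis-decoded as ISO-8859-1."""
    return s.encode('iso-8859-1').decode('utf8')

print repr(fix_mojibake(u'\xe2\x80\x99'))  # u'\u2019', i.e. a right single quote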

Python - unexpected behavior when I don't decode to utf-8

I have the following function:
import urllib.request

def seek():
    web = urllib.request.urlopen("http://wecloudforyou.com/")
    text = web.read().decode("utf8")
    return text

texto = seek()
print(texto)
When I decode to utf-8, I get the html code with indentation and carriage returns and all, just like it's seen on the actual website.
<!DOCTYPE html>
<html>
<head>
<title>We Cloud for You |
If I remove .decode('utf8'), I get the code, but the line breaks are gone, shown as literal \n instead:
<!DOCTYPE html>\n<html>\n <head>\n <title>We Cloud for You
So, why is this happening? As far as I know, when you decode, you are basically converting some encoded string into Unicode.
My sys.stdout.encoding is CP1252 (Windows 1252 encoding)
According to this thread: Why does Python print unicode characters when the default encoding is ASCII?
- Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- The terminal displays output according to its own encoding settings.
- The terminal's encoding is independent from the shell's.
So it seems like Python needs to read the text as Unicode before it can convert it to CP1252 and print it on the terminal. But I don't understand why, if the text is not decoded, the line breaks are replaced by \n.
sys.getdefaultencoding() returns utf8.
In Python 3, when you print a bytes value (raw bytes from the network, without decoding), you see the representation of that value as a Python bytes literal. That representation shows newlines as \n escapes.
By decoding, you have a unicode string value instead, and print() can handle that directly:
>>> print(b'Newline\nAnother line')
b'Newline\nAnother line'
>>> print(b'Newline\nAnother line'.decode('utf8'))
Newline
Another line
This is perfectly normal behaviour.

Python - How to get accented characters correct? (BeautifulSoup)

I've written some Python code with BeautifulSoup to get HTML, but I can't work out how to get accented characters to come out correctly.
The charset of the HTML is this:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
I have this Python code:
some_text = soup_ad.find("span", { "class" : "h1_span" }).contents[0]
some_text.decode('iso-8859-1','ignore')
And I get this:
Calções
What am I doing wrong here? Any clues?
The question here is where you "get this" output.
If that's what you see in your terminal, it may well be that your terminal expects a different encoding!
You can try this when using print:
import sys

outenc = sys.stdout.encoding or sys.getfilesystemencoding()
print some_text.decode("iso-8859-1").encode(outenc)
As bernie points out, BS uses Unicode internally.
For BS3:
Beautiful Soup Gives You Unicode, Dammit
By the time your document is parsed, it has been transformed into Unicode. Beautiful Soup stores only Unicode strings in its data structures.
For BS4, the docs explain a bit more clearly when this happens:
You can pass in a string or an open filehandle… First, the document is converted to Unicode, and HTML entities are converted to Unicode characters…
In other words, it decodes the data immediately. So, if you're getting mojibake, you have to fix it before it gets into BS, not after.
The input to the BeautifulSoup constructor can take 8-bit byte strings or files, and try to figure out the encoding. See Encodings for details. You can check whether it guessed right by printing out soup.original_encoding. If it didn't guess ISO-8859-1 or a synonym, your only option is to make it explicit: decode the string before passing it in, open the file in Unicode mode with an encoding, etc.
The results that come out of any BS object, and anything you pass as an argument to any method, will always be UTF-8 (if they're byte strings). So, calling decode('iso-8859-1') on something you got out of BS is guaranteed to break stuff if it's not already broken.
And you don't want to do this anyway. As you said in a comment, "I'm outputting to an SQLite3 database." Well, sqlite3 always uses UTF-8. (You can change this with a pragma at runtime, or change the default at compile time, but that basically breaks the Python interface, so… don't.) And the Python interface only allows UTF-8 in Py2 str (and of course in Py2 unicode/Py3 str, there is no encoding.) So, if you try to encode the BS data into Latin-1 to store in the database, you're creating problems. Just store the Unicode as-is, or encode it to UTF-8 if you must (Py2 only).
If you don't want to figure all of this out, just use Unicode everywhere after the initial call to BeautifulSoup and you'll never go wrong.
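For instance, a minimal sketch of making the encoding explicit and then verifying it (from_encoding and original_encoding are part of the BS4 API):
import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data, from_encoding='iso-8859-1')  # override detection
print soup.original_encoding                            # confirm what BS actually used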

Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode.
html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)
But I get a UnicodeDecodeError:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
handler.get(*groups)
File "/Users/greg/clounce/main.py", line 55, in get
html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?
>>> u'aあä'.encode('ascii', 'ignore')
'a'
Decode the string you get back, using either the charset in the appropriate meta tag of the response or the charset in the Content-Type header, then encode.
The method encode(encoding, errors) accepts custom handlers for errors. Besides ignore, the other predefined handlers include:
>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'aあä'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'
See https://docs.python.org/3/library/stdtypes.html#str.encode
As an extension to Ignacio Vazquez-Abrams' answer
>>> u'aあä'.encode('ascii', 'ignore')
'a'
It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'
You may also want to translate other characters (such as punctuation) to their nearest equivalents; for instance, the RIGHT SINGLE QUOTATION MARK character does not get converted to an ASCII APOSTROPHE when encoding:
>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"
There are more efficient ways to accomplish this, though; see Where is Python's "best ASCII for this Unicode" database? for more details.
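A minimal sketch of pre-translating a few punctuation characters before stripping accents (the mapping here is illustrative, not exhaustive):
import unicodedata

PUNCT_MAP = {
    0x2018: u"'", 0x2019: u"'",  # single quotation marks
    0x201c: u'"', 0x201d: u'"',  # double quotation marks
    0x2013: u'-', 0x2014: u'-',  # en and em dashes
}

def to_ascii(text):
    # unicode.translate takes a mapping of ordinals to replacement characters
    text = text.translate(PUNCT_MAP)
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

print to_ascii(u'caf\xe9 \u2019s')  # cafe 's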
2018 Update:
As of February 2018, gzip compression has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and the Stack Exchange network sites).
If you do a simple decode like in the original answer with a gzipped response, you'll get an error like or similar to this:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte
In order to decode a gzipped response you need to add the following modules (in Python 3):
import gzip
import io
Note: In Python 2 you'd use StringIO instead of io
Then you can parse the content out like this:
from urllib.request import urlopen

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read())  # use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8")  # replace utf-8 with the source encoding of your requested resource
This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer through the GzipFile class. After that, the decompressed bytes can be decoded into normally readable text.
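It can also help to check the response headers before decompressing; a minimal sketch in Python 3, using the same URL as above:
import gzip
import io
from urllib.request import urlopen

response = urlopen("https://example.com/gzipped-ressource")
raw = response.read()
if response.headers.get('Content-Encoding') == 'gzip':
    raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
content = raw.decode('utf-8')  # substitute the resource's real charset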
Original Answer from 2010:
Can we get the actual value used for link?
In addition, we usually encounter this problem here when we try to .encode() an already encoded byte string. So you might try to decode it first, as in:
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xa0'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
While:
html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
Succeeds without error. Do note that "windows-1252" is something I used as an example; I got it from chardet, which reported 0.5 confidence that it's right (well, with a 1-character string, what do you expect?). You should change that to the actual encoding of the byte string returned from .urlopen().read(), whatever applies to the content you retrieved.
Another problem I see there is that the .encode() string method returns the modified string; it does not modify the source in place. So it's kind of useless to call self.response.out.write(html), as html is not the encoded string returned by html.encode (if that is what you were originally aiming for).
As Ignacio suggested, check the source web page for the actual encoding of the returned string from read(). It's either in one of the meta tags or in the Content-Type header of the response. Use that as the parameter for .decode().
Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).
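In Python 2, pulling the charset out of the Content-Type header is a one-liner; a minimal sketch (the utf-8 fallback is just a guess for when the header is silent):
import urllib

response = urllib.urlopen(link)
charset = response.headers.getparam('charset') or 'utf-8'  # fallback is a guess
html = response.read().decode(charset, 'replace')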
Use unidecode - it converts weird characters to ASCII instantly, and even converts Chinese to phonetic ASCII.
$ pip install unidecode
then:
>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'
I use this helper function throughout all of my projects. If it can't convert the unicode, it ignores it. This ties into a Django library, but with a little research you could bypass it.
from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')
I no longer get any unicode errors after using this.
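If you would rather not depend on Django for this, roughly the same behaviour is a few lines of plain Python (a minimal sketch, not a drop-in replacement for smart_str):
def convert_unicode_to_string(x):
    # encode unicode with errors ignored; pass everything else through as str
    if isinstance(x, unicode):
        return x.encode('ascii', 'ignore')
    return str(x)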
For broken consoles like cmd.exe and HTML output you can always use:
my_unicode_string.encode('ascii','xmlcharrefreplace')
This will preserve all the non-ascii chars while making them printable in pure ASCII and in HTML.
WARNING: If you use this in production code to avoid errors then most likely there is something wrong in your code. The only valid use case for this is printing to a non-unicode console or easy conversion to HTML entities in an HTML context.
And finally, if you are on Windows and use cmd.exe, you can type chcp 65001 to enable UTF-8 output (it works with the Lucida Console font). You might need to add myUnicodeString.encode('utf8').
You wrote """I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere."""
The HTML is NOT expected to contain any kind of "attempt at unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually supplied up front ... look for "charset".
You appear to be assuming that the charset is UTF-8 ... on what grounds? The "\xA0" byte shown in your error message indicates that you may have a single-byte charset, e.g. cp1252.
If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.
Why have you tagged your question with "regex"?
Update after you replaced your whole question with a non-question:
html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.
html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)
If you have a string variable line, you can use the .encode([encoding], [errors='strict']) string method to convert between encodings.
line = 'my big string'
line = line.encode('ascii', 'ignore')
For more information about handling ASCII and unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html
I think the answer is here, but only in bits and pieces, which makes it difficult to quickly fix a problem such as:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
Let's take an example. Suppose I have a file with some data in the following form (containing ASCII and non-ASCII chars):
1/10/17, 21:36 - Land : Welcome ��
and we want to ignore the non-ASCII characters and preserve only the ASCII ones.
This code will do that:
import unicodedata

fp = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii', 'ignore')
    if len(rline) != 0:
        print rline
and type(rline) will give you
>type(rline)
<type 'str'>
bytestring = '\xa0'  # a byte string, despite what the original variable name suggested
decoded_str = bytestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')
Works for me
You can use the following piece of code as an example to avoid Unicode to ASCII errors:
from anyascii import anyascii
content = "Base Rent for – CC# 2100 Acct# 8410: $41,667.00 – PO – Lines - for Feb to Dec to receive monthly"
content = anyascii(content)
print(content)
It looks like you are using Python 2.x.
Python 2.x defaults to ASCII, and it doesn't know about Unicode. Hence the exception.
Just paste the line below after the shebang:
# -*- coding: utf-8 -*-
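For reference, that declaration tells the interpreter how the source file itself is encoded, so non-ASCII literals in your .py file parse correctly; a minimal sketch:
# -*- coding: utf-8 -*-
s = 'café'             # a byte literal; the declaration tells Python 2 how to read it
u = s.decode('utf-8')  # decode to unicode explicitly when you need text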
