Python - How to get accented characters correct? (BeautifulSoup)

I've written some Python code with BeautifulSoup to fetch HTML, but I can't figure out how to get accented characters to come out correctly.
The charset of the HTML is this:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
I have this Python code:
some_text = soup_ad.find("span", { "class" : "h1_span" }).contents[0]
some_text.decode('iso-8859-1','ignore')
And I get this:
Calções
What am I doing wrong here? Any clues?
Best Regards,

The question here is where exactly you "get this".
If that's the output you see in your terminal, it may well be that your terminal expects a different encoding!
You can try this when using print:
import sys
outenc = sys.stdout.encoding or sys.getfilesystemencoding()
print some_text.decode("iso-8859-1").encode(outenc)

As bernie points out, BS uses Unicode internally.
For BS3:
Beautiful Soup Gives You Unicode, Dammit
By the time your document is parsed, it has been transformed into Unicode. Beautiful Soup stores only Unicode strings in its data structures.
For BS4, the docs explain a bit more clearly when this happens:
You can pass in a string or an open filehandle… First, the document is converted to Unicode, and HTML entities are converted to Unicode characters…
In other words, it decodes the data immediately. So, if you're getting mojibake, you have to fix it before it gets into BS, not after.
The BeautifulSoup constructor can take 8-bit byte strings or files, and it will try to figure out the encoding. See Encodings for details. You can check whether it guessed right by printing out soup.original_encoding. If it didn't guess ISO-8859-1 or a synonym, your only option is to make the encoding explicit: decode the string before passing it in, open the file in Unicode mode with an encoding, etc.
The results that come out of any BS object, and anything you pass as an argument to any method, will always be UTF-8 (if they're byte strings). So, calling decode('iso-8859-1') on something you got out of BS is guaranteed to break stuff if it's not already broken.
And you don't want to do this anyway. As you said in a comment, "I'm outputting to an SQLite3 database." Well, sqlite3 always uses UTF-8. (You can change this with a pragma at runtime, or change the default at compile time, but that basically breaks the Python interface, so… don't.) And the Python interface only allows UTF-8 in Py2 str (and of course in Py2 unicode/Py3 str, there is no encoding.) So, if you try to encode the BS data into Latin-1 to store in the database, you're creating problems. Just store the Unicode as-is, or encode it to UTF-8 if you must (Py2 only).
If you don't want to figure all of this out, just use Unicode everywhere after the initial call to BeautifulSoup and you'll never go wrong.
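A minimal sketch of that workflow, assuming bs4 and a locally saved page standing in for the fetched HTML (the file and table names are only illustrative):
import sqlite3
from bs4 import BeautifulSoup

html_bytes = open("page.html", "rb").read()      # raw bytes, stand-in for the real fetch
soup = BeautifulSoup(html_bytes, "html.parser")

print(soup.original_encoding)                    # did it guess iso-8859-1 (or a synonym)?
# If it guessed wrong, decode explicitly before parsing:
# soup = BeautifulSoup(html_bytes.decode("iso-8859-1"), "html.parser")

some_text = soup.find("span", {"class": "h1_span"}).contents[0]   # already unicode

conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS spans (title TEXT)")
conn.execute("INSERT INTO spans VALUES (?)", (some_text,))        # store the unicode as-is
conn.commit()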

Related

TypeError: 'str' does not support the buffer interface in html2text

I'm using python3 to do some web scraping. I want to save a webpage and convert it to text using the following code:
import urllib.request
import html2text
url='http://www.google.com'
page = urllib.request.urlopen(url)
html_content = page.read()
rendered_content = html2text.html2text(html_content)
But when I run the code, it reports a type error:
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/html2text-2016.4.2-py3.4.egg/html2text/__init__.py", line 127, in feed
data = data.replace("</' + 'script>", "</ignore>")
TypeError: 'str' does not support the buffer interface
Could anyone tell me how to deal with this error? Thank you in advance!
I took the time to investigate this, and it turns out to be easily resolved.
Why You Got This Error
The problem is one of bad input: when you called page.read(), a byte string was returned, rather than a regular (unicode) string.
Byte strings are Python's way of dealing with data whose character encoding is unknown: the raw bytes have not yet been decoded into Unicode, which is what Python 3's regular str type holds.
Because Python doesn't know which encoding to use, it represents such data as raw bytes - this is how all data is represented internally anyway - and lets the programmer decide which encoding to apply.
String operations that mix str and bytes - such as the replace() call html2text makes with str arguments - fail, because Python 3 refuses to mix the two types implicitly.
Solution
html_content = page.read().decode('iso-8859-1')
Padraic Cunningham's solution in the comments is correct in its essence: you first have to tell Python which character encoding to use to map these bytes to the correct characters.
Unfortunately, this particular page is not UTF-8-encoded, so asking Python to decode it as UTF-8 throws an error.
The correct encoding to use is actually contained in the response headers themselves, under the Content-Type header - a standard header that well-behaved HTTP servers provide.
Simply calling page.info().get_content_charset() returns the value of this header, which in this case is iso-8859-1. From there, you can decode it correctly using iso-8859-1, so that regular tools can operate on it normally.
A More Generic Solution
charset_encoding = page.info().get_content_charset()
html_content = page.read().decode(charset_encoding)
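Putting it all together, a sketch of the whole flow in Python 3 (the 'latin-1' fallback for servers that declare no charset is my own assumption):
import urllib.request
import html2text

url = 'http://www.google.com'
with urllib.request.urlopen(url) as page:
    raw_bytes = page.read()
    charset_encoding = page.info().get_content_charset() or 'latin-1'  # fallback is an assumption

html_content = raw_bytes.decode(charset_encoding)
rendered_content = html2text.html2text(html_content)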
The stream returned by urlopen is a byte stream, as indicated by the b prefix in front of the quoted string in its repr. If you exclude it, as in the appended code, it seems to work as input for html2text.
import urllib.request
import html2text
url='http://www.google.com'
with urllib.request.urlopen(url) as page:
    html_content = page.read()
    charset_encoding = page.info().get_content_charset()
rendered_content = html2text.html2text(str(html_content)[1:], charset_encoding)
Revised using suggestions about encoding. Yes, it's a hack, but it runs. Not using str() means the original TypeError problem remains.

Problems with unicode, beautifulsoup, cld2, and python [duplicate]

This question is about Unicode in Python 2.
As I understand it, I should always decode everything I read from outside (files, the network). decode converts outside bytes to internal Python strings using the charset given in its parameter. So decode("utf8") means that the outside bytes are a unicode string and they will be decoded to Python strings.
Also, I should always encode everything I write to the outside. I specify the encoding as a parameter of the encode function and it converts the text to that encoding before writing.
These statements are right, ain't they?
But sometimes when I parse HTML documents I get decode errors. As I understand it, the document is in another encoding (for example cp1252) and the error happens when I try to decode it using the utf8 codec. So the question is: how do I write a bulletproof application?
I found that chardet is a good library for guessing the encoding, and that this is the only way to write bulletproof applications. Right?
... decode("utf8") means that outside bytes are unicode string and they will be decoded to python strings.
...
These statements are right, ain't they?
No, outside bytes are binary data, they are not a unicode string. So <str>.decode("utf8") will produce a Python unicode object by interpreting the bytes in <str> as UTF-8; it may raise an error if the bytes cannot be decoded as UTF-8.
Determining the encoding of any given document is not necessarily a simple task. You either need to have some external source of information that tells you the encoding, or you need to know something about what is in the document. For example, if you know that it is an HTML document with its encoding specified internally, then you can parse the document using an algorithm like the one outlined in the HTML Standard to find the encoding and then use that encoding to parse the document (it's a two-pass operation). However, just because an HTML document specifies an encoding it does not mean that it can be decoded with that encoding. You may still get errors if the data is corrupt or if document was not encoded properly in the first place.
There are libraries such as chardet (I see you mentioned it already) that will try to guess the encoding of a document for you (it's only a guess, not necessarily correct). But they can have their own issues such as performance, and they may not recognize the encoding of your document.
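For completeness, a rough sketch of the chardet route (the file name is a placeholder, and the 'replace' fallback is my own choice):
import chardet

raw = open("page.html", "rb").read()
guess = chardet.detect(raw)               # e.g. {'encoding': 'windows-1252', 'confidence': 0.87, ...}
encoding = guess["encoding"] or "utf-8"   # chardet may fail to guess; utf-8 is an assumed fallback
text = raw.decode(encoding, "replace")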
Try wrapping your functions in try:except: calls.
Try decoding as utf-8.
If that raises an exception, catch it and try the next encoding.
And so on, down your list of candidate encodings...
Make it a function that returns a str when (and if) it finds an encoding that doesn't raise, and returns None or an empty str when it exhausts its list of encodings and the last exception is raised.
Like the others said, the encoding should be recorded somewhere, so check that first.
Not efficient, and frankly, given my skill level, this may be way off, but to my newbie mind it may alleviate some of the problems of dealing with unknown or undocumented encodings.
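A minimal sketch of that fallback loop (the candidate list here is only an illustrative guess):
def decode_with_fallbacks(raw_bytes, encodings=("utf-8", "cp1252", "iso-8859-1")):
    for enc in encodings:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    return None   # every candidate failed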
Convert to unicode from cp437. Since cp437 maps every byte value, this way you get your bytes to unicode and back without errors.

Python - HTML to Unicode

I have a Python script where I am getting some HTML and parsing it using Beautiful Soup. Sometimes there are non-unicode characters in the HTML, and they cause errors in my script and in the file I am creating.
Here is how I am getting the HTML
html = urllib2.urlopen(url).read().replace('\xa0',"")
xml = etree.HTML(html)
When I use this
html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace')
I get an error UnicodeDecodeError
How can I change this into unicode, so that if there are non-unicode characters my code won't break?
When I use this
html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace')
I get an error UnicodeDecodeError. How could I change this into unicode.
unicode characters -> bytes = ‘encode’
bytes -> unicode characters = ‘decode’
You have bytes and you want unicode characters, so the method for that is decode. As you have used encode, Python thinks you want to go from characters to bytes, so tries to convert the bytes to characters so they can be turned back to bytes! It uses the default encoding for this, which in your case is ASCII, so it fails for non-ASCII bytes.
However it is unclear why you want to do this. etree parses bytes as-is. If you want to remove character U+00A0 Non Breaking Space from your data you should do that with the extracted content you get after HTML parsing, rather than try to grapple with the HTML source version. HTML markup might include U+00A0 as raw bytes, incorrectly-unterminated entity references, numeric character references and so on. Let the HTML parser handle that for you, it's what it's good at.
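A rough illustration of that advice (let the parser handle the markup, then clean the extracted text; the URL is a placeholder and lxml is assumed, as in the question):
import urllib2
from lxml import etree

url = "http://example.com/"                  # placeholder
html = urllib2.urlopen(url).read()           # raw bytes, passed to the parser untouched
tree = etree.HTML(html)
title = tree.findtext(".//title") or u""
title = title.replace(u"\xa0", u" ")         # strip U+00A0 from the parsed text, not the source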
If you feed HTML to BeautifulSoup, it will decode it to Unicode.
If the charset declaration is wrong or missing, or parts of the document are encoded differently, this might fail; there is a special module which comes with BeautifulSoup, dammit, which might help you with these documents.
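A small sketch of that route, assuming bs4 (in BS3 the class lives in the BeautifulSoup module instead):
import urllib2
from bs4 import UnicodeDammit

url = "http://example.com/"             # placeholder for the question's url
raw = urllib2.urlopen(url).read()
dammit = UnicodeDammit(raw)
print(dammit.original_encoding)         # the encoding it detected
html_text = dammit.unicode_markup       # the whole document as unicode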
If you mention BeautifulSoup, why don't you do it like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen(url).read())
and work with the soup?
BTW, all HTML entities will be resolved to unicode characters.
The ascii character set is very limited and might lack many characters in your document. I'd use utf-8 instead whenever possible.

dealing with multiple charset in python 3

I'm using python 3.3.0 in Windows 8.
requrl = urllib.request.Request(url)
response = urllib.request.urlopen(requrl)
source = response.read()
source = source.decode('utf-8')
It works fine if the website uses the utf-8 charset, but what if it uses iso-8859-1 or any other charset? I mean, I may have different website URLs with different charsets.
So, how do I deal with multiple charsets?
Now let me tell you about my efforts to resolve this issue:
b1 = b'charset=iso-8859-1'
b1 = b1.decode('iso-8859-1')
if b1 in source:
    source = source.decode('iso-8859-1')
It gave me an error like TypeError: Type str doesn't support the buffer API
So I'm assuming that it's treating b1 as a string, and this is not the correct way! :(
Please don't just tell me to manually change the charset in the source code, or ask whether I've read the Python docs!
I have already tried to dig into the Python 3 docs, but still have no luck - or maybe I'm not picking the correct modules/content to read!
In Python 3, a str is actually a sequence of unicode characters (equivalent to u'mystring' syntax in Python 2). What you get back from response.read() is a byte string (a sequence of bytes).
The reason your b1 in source fails is you are trying to find a unicode character sequence inside a byte string. This makes no sense, so it fails. If you take out the line b1.decode('iso-8859-1'), it should work because you are now comparing two byte sequences.
Now back to your real underlying issue. To support multiple charsets, you need to determine the character set so you can decode it to a Unicode string. This is tricky to do. Normally you can examine the Content-Type header of the response. (See the rules below.) However, so many websites declare the wrong encoding in the header that we have had to develop other complicated encoding-sniffing rules for HTML. Please read that link so you realize what a difficult problem this is!
I recommend you either:
Use the requests library instead of urllib, because it automatically takes care of most unicode conversions properly. (It's also much easier to use.) If conversion to unicode at this layer fails:
Try to pass the bytes directly to an underlying library you are using (e.g. lxml or html5lib) and let them deal with determining the encoding. They often implement the right charset-sniffing algorithms for the document type.
If neither of these work, you can get more aggressive and use libraries like chardet to detect the encoding, but in my experience people who serve their web pages this incorrectly are so incompetent that they produce mixed-encoding documents, so you will end up with garbage characters no matter what you do!
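A minimal sketch of the first recommendation, with the second shown as a comment (requests and lxml are assumed to be installed; the URL is a placeholder):
import requests

url = "http://example.com/"    # placeholder for the question's url
r = requests.get(url)
print(r.encoding)              # charset taken from the Content-Type header
print(r.apparent_encoding)     # chardet-based guess from the body
text = r.text                  # bytes decoded using r.encoding

# Option 2: hand the raw bytes to the HTML parser and let it sniff the encoding:
# import lxml.html
# doc = lxml.html.fromstring(r.content)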
Here are the rules for interpreting the charset declared in a content-type header.
With no explicit charset declared:
text/* (e.g., text/html) is in ASCII.
application/* (e.g. application/json, application/xhtml+xml) is utf-8.
With an explicit charset declared:
if type is text/html and charset is iso-8859-1, it's actually win-1252 (==CP1252)
otherwise use the charset declared.
(Note that the html5 spec willfully violates the w3c specs by looking for UTF8 and UTF16 byte markers in preference to the Content-Type header. Please read that encoding detection algorithm link and see why we can't have nice things...)
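Those header rules could be codified roughly like this (my own sketch, not a spec-complete Content-Type parser):
def charset_from_content_type(content_type):
    mime, _, params = content_type.partition(";")
    mime = mime.strip().lower()
    charset = None
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip('"').lower()
    if charset is None:
        return "ascii" if mime.startswith("text/") else "utf-8"
    if mime == "text/html" and charset == "iso-8859-1":
        return "cp1252"
    return charset

print(charset_from_content_type("text/html; charset=ISO-8859-1"))   # cp1252
print(charset_from_content_type("application/json"))                # utf-8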
The big problem here is that in many cases you can't be sure about the encoding of a webpage, even if it declares a charset. I've seen enough pages declaring one charset but actually being in another, or having a different charset in their Content-Type header than in their meta tag or XML declaration.
In such cases chardet can be helpful.
You're checking whether a str is contained within a bytes object:
>>> 'df' in b'df'
Traceback (most recent call last):
File "<pyshell#107>", line 1, in <module>
'df' in b'df'
TypeError: Type str doesn't support the buffer API
So, yes, it considers b1 a str, because you've decoded the bytes object into a str object with a certain encoding. Instead, you should check against the original value of b1. It's not clear why you call .decode on it in the first place.
Have a look at the HTML standard, Parsing HTML documents, Determine character set (HTML5 is sufficient for our purposes).
There is an algorithm to follow. For your purposes it boils down to the following:
Check for identifying sequences for UTF-16 or UTF-8 (see provided link)
Use the character set supplied by HTTP (via the Content-Type header)
Apply the algorithm described a little later in Prescan a byte-stream to determine its encoding. This is basically searching for "charset=" in the document and extracting the value.
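A very rough sketch of that last step (the real prescan algorithm in the HTML standard handles more cases, such as attributes inside meta tags):
import re

def prescan_charset(raw_bytes, limit=1024):
    match = re.search(br'charset\s*=\s*["\']?([\w-]+)', raw_bytes[:limit], re.I)
    return match.group(1).decode("ascii") if match else None

print(prescan_charset(b'<meta charset="iso-8859-1">'))   # iso-8859-1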

Parsing a utf-8 encoded web page with some gb2312 body text with Python

I'm trying to parse a web page with Python's Beautiful Soup parser, and I'm running into an issue.
The header of the HTML we get from them declares a utf-8 character set, so Beautiful Soup decodes the whole document as utf-8, and indeed the HTML tags are encoded in UTF-8, so we get back a nicely structured HTML page.
The trouble is, this stupid website injects gb2312-encoded body text into the page, which then gets parsed as utf-8 by Beautiful Soup. Is there a way to convert the text from this "gb2312 pretending to be utf-8" state to a proper expression of the character set in utf-8?
The simplest way might be to parse the page twice, once as UTF-8, and once as GB2312. Then extract the relevant section from the GB2312 parse.
I don't know much about GB2312, but looking it up it appears to at least agree with ASCII on the basic letters, numbers, etc. So you should still be able to parse the HTML structure using GB2312, which would hopefully give you enough information to extract the part you need.
This may be the only way to do it, actually. In general, GB2312-encoded text won't be valid UTF-8, so trying to decode it as UTF-8 should lead to errors. The BeautifulSoup documentation says:
In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object.
This makes it sound like BeautifulSoup just ignores decoding errors and replaces the erroneous characters with U+FFFD. If this is the case (i.e., if your document has contains_replacement_characters == True), then there is no way to get the original data back from the document once it has been decoded as UTF-8. You will have to do something like what I suggested above, decoding the entire document twice with different codecs.
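A minimal sketch of the parse-twice idea, assuming bs4 (the file read is a stand-in for the fetched bytes, and the CSS class used to locate the injected section is made up for illustration):
from bs4 import BeautifulSoup

page_bytes = open("page.html", "rb").read()    # stand-in for the raw response body

soup_utf8 = BeautifulSoup(page_bytes, "html.parser", from_encoding="utf-8")
soup_gb   = BeautifulSoup(page_bytes, "html.parser", from_encoding="gb2312")

# Use soup_utf8 for the overall structure, soup_gb for the injected section:
body_text = soup_gb.find("div", {"class": "article-body"}).get_text()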
