BeautifulSoup for Mandarin - python

I'm trying to scrape a site in Mandarin using BeautifulSoup. Unfortunately, when I do, BeautifulSoup finds the html, head, and body tags, but everything in between the opening and closing body tags is gibberish. I've tried using multiple parsers, and as far as I can tell only html5lib is able to find all of the page, because it returns by far the longest result. So I think I'm using the right parser, but the encoding is wrong.
The website lists 'gb2312' as its encoding, but using that encoding, it is still gibberish. I also tried chardet to determine the encoding, which returned 'windows-1252', but it also didn't seem correct. Indeed, I have gone through many of the standard Chinese character encodings (found here), but none of them return anything coherent, although some have one or two Chinese characters. I also created an output file for every possible Python encoding, but it looks like none of them are correct.
Other than going through the different encodings, I'm not sure what else to try. Any help would be greatly appreciated, thanks!

Never mind! I guess it was an encoding issue, but mainly that the requests library is far better than urllib! Sorry about that.
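For anyone who hits the same wall, here's a minimal sketch of that fix (URL hypothetical); requests decodes the body using the charset the server declares, and apparent_encoding falls back to statistical detection when that declaration is wrong or missing:
import requests
from bs4 import BeautifulSoup
url = "http://example.com/mandarin-page"  # hypothetical URL
r = requests.get(url)
r.encoding = r.apparent_encoding  # override the (wrong) declared charset with the detected one
soup = BeautifulSoup(r.text, "html5lib")
print(soup.body.get_text())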

Related

HTML Decoding in Python

I am writing a Python script for mass-replacement of links (actually image and script sources) in HTML files; I am using lxml. There is one problem: the HTML files are quizzes and they have data packaged like this (there is also some Cyrillic here):
<input class="question_data" value="{"text":"<p>[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.</p>","fields":[{"id":"1","type":"fill","element":{"sirina":"103","maxDuzina":"12","odgovor":["Информатика"]}}]}" name="question:1:data" id="id3a1"/>
When I try to print out this data in python using:
print "OLD_DATA:", data
It just prints out the error "UnicodeEncodeError: character maps to <undefined>". There are more of these elements. My goal is to change the links of images in the value part of input, but I can't change the links if I don't know how to print this data (or how it should be written to the file). How does Python handle (interpret) this? Please help. Thanks!!! :)
You're running into the same problem I've hit many times in the past. That error almost always means that the console environment you're using can't display the characters it's trying to print. It might be worth trying to log to a file instead, then opening the log in an editor that can display the characters.
If you really want to be able to see it on your console, it might be worth writing a function to screen the strings you're printing for unprintable characters, along the lines of the sketch below.
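A rough sketch of such a function, assuming Python 2 as in your print statement (the name safe_print is mine):
import sys
def safe_print(label, text):
    # Encode to the console's charset, replacing anything unprintable with '?'
    encoding = sys.stdout.encoding or "ascii"
    print label, text.encode(encoding, "replace")
    # usage: safe_print("OLD_DATA:", data)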
I also found a couple other StackOverflow posts that might be helpful in your efforts:
How do I get Cyrillic in the output, Python?
What is right way to use cyrillic in python lxml library
I would also recommend this article and python manual entry:
https://docs.python.org/2/howto/unicode.html
http://www.joelonsoftware.com/articles/Unicode.html

Python detect broken encoding

After crawling many websites, in some of them I receive broken-encoding data. I can't do anything with them; I just need to detect them. For example, text like:
·ç¼wÃdª«¦Ê³f
or
ãà³n³¾å¢
How can I recognize text like that? It could be in any language, so searching for non-English text is not an option. The only option I can think of is the guess-language module.
There's NLTK, which has a guess_encoding function that takes a byte string and tries all of the available encodings; would this serve your purpose?
Take a look at https://github.com/LuminosoInsight/python-ftfy
If I understand correctly, it will attempt to 'repair' incorrectly encoded/decoded text.
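For example (a minimal sketch; fix_text is ftfy's top-level entry point, and one crude detection heuristic is that if ftfy changes the text, the original was probably mojibake):
import ftfy
text = u"schÃ¶n"  # classic mojibake: UTF-8 bytes decoded as Latin-1
fixed = ftfy.fix_text(text)
if fixed != text:
    print("probably broken encoding:", text, "->", fixed)  # schÃ¶n -> schön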

Determining charset from html meta tags w/python

I have a script that needs to determine the charset before the document is read by lxml.HTML() for parsing. I will assume ISO-8859-1 (that's the normal assumed charset for this, right?) if it can't be found, and search the HTML for the meta tag with the charset attribute. However, I'm not sure of the best way to do that. I could try to create an etree with lxml, but I don't want to read the whole file since I may run into encoding problems. However, if I don't read the whole file I can't build an etree, since some tags will not have been closed.
Should I just find the meta tag with some fancy string subscripting and break out of the loop once it's found, or after a certain number of lines have been read? Maybe use a low-level HTML parser, e.g. html.parser? Using Python 3, btw. Thanks.
You should first try to extract the encoding from the HTTP headers. If it is not present there, you should parse the document with lxml. This might be tricky, since lxml throws parse errors if the charset does not match. A workaround is to decode and re-encode the data, ignoring the unknown characters:
html_data = html_data.decode("UTF-8", "ignore")  # drop byte sequences that aren't valid UTF-8
html_data = html_data.encode("UTF-8", "ignore")  # re-encode the cleaned text back to bytes
After this, you can parse the data by invoking lxml.HTML() with UTF-8 encoding.
This way, you'll be able to find the correct encoding declared in the HTML meta tags.
After finding the encoding, you'll have to re-parse the HTML document with the proper encoding.
Unfortunately, sometimes you won't find a character encoding even in the HTML itself. I'd suggest using the chardet module to find the proper encoding only after these steps fail.
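Putting those steps together, a rough sketch (the helper name detect_charset and the exact fallback order are my own):
import urllib.request
import chardet
import lxml.html

def detect_charset(url):
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()
        charset = resp.headers.get_content_charset()  # 1. HTTP Content-Type header
    if charset:
        return charset, raw
    # 2. meta tags, read via the lossy decode described above
    tree = lxml.html.fromstring(raw.decode("utf-8", "ignore"))
    metas = tree.xpath("//meta/@charset")  # HTML5-style <meta charset="...">
    if metas:
        return metas[0].strip(), raw
    for content in tree.xpath("//meta/@content"):  # HTML4-style http-equiv
        if "charset=" in content.lower():
            return content.lower().split("charset=")[-1].strip(), raw
    # 3. fall back to statistical detection
    return chardet.detect(raw)["encoding"], raw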
Determining the character encoding of an HTML file correctly is actually quite a complex matter, but the HTML5 spec defines exactly how a processor should do it. You can find the algorithm here: http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

Need help parsing html in python3, not well formed enough for xml.etree.ElementTree

I keep getting mismatched tag errors all over the place. I'm not sure why exactly; it's the text on the craigslist homepage, which looks fine to me, but I haven't skimmed it thoroughly enough. Is there perhaps something more forgiving I could use, or is this my best bet for HTML parsing with the standard library?
The mismatched tag errors are likely caused by mismatched tags. Browsers are famous for accepting sloppy HTML, and have made it easy for web page coders to write badly formed HTML, so there's a lot of it. There's no reason to believe that craigslist should be immune to bad web page designers.
You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know it).
One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on that.
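For example, something like this (a sketch; it assumes the tidy command-line tool is installed, and the exact flags vary by version):
import subprocess
# -q suppresses non-error output; -asxml converts the page to well-formed XHTML
result = subprocess.run(["tidy", "-q", "-asxml", "page.html"],
                        capture_output=True, text=True)
cleaned_html = result.stdout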
The best library for parsing unpredictable HTML is BeautifulSoup. Here's a quote from the project page:
You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser.
However, it isn't well supported for Python 3; there's more information about this at the end of the link.
Parsing HTML is not an easy problem, and using a library is definitely the solution here. The two common libraries for parsing HTML that isn't well formed are BeautifulSoup and lxml.
lxml supports Python 3, and its HTML parser handles unpredictable HTML well. It's fast, too, since it uses C libraries underneath. I highly recommend it.
BeautifulSoup 3.1 supports Python 3, but it is also deemed "a failed experiment" and you are told not to use it, so in practice BeautifulSoup doesn't support Python 3 yet, leaving lxml as the only alternative.
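A minimal sketch of the lxml route (lxml.html is tolerant of mismatched and unclosed tags):
import urllib.request
import lxml.html
html = urllib.request.urlopen("http://www.craigslist.org/").read()
tree = lxml.html.fromstring(html)  # forgiving parser: no mismatched-tag errors
for a in tree.xpath("//a"):
    print(a.get("href"))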

Python SAX parser says XML file is not well-formed

I stripped some tags that I thought were unnecessary from an XML file. Now when I try to parse it, my SAX parser throws an error and says my file is not well-formed. However, I know every start tag has an end tag. The file's opening tag has a link to an XML schema. Could this be causing the trouble? If so, then how do I fix it?
Edit: I think I've found the problem. My character data contains "&lt" and "&gt" sequences, presumably from HTML tags. After being parsed, these are converted to "<" and ">" characters, which seems to bother the SAX parser. Is there any way to prevent this from happening?
I would suggest putting those tags back in and making sure it still works. Then, if you want to take them out, do it one at a time until it breaks.
However, I question the wisdom of taking them out. If it's your XML file, you should understand it better. If it's a third-party XML file, you really shouldn't be fiddling with it (until you understand it better :-).
Does the sax parser not give you details about where it thinks it's not well-formed?
Have you tried loading the file into an XML editor and checking it there? Do other XML parsers accept it?
The schema shouldn't change whether the XML is well-formed; it may well change whether it's valid. See the Wikipedia entry for XML well-formedness for a little bit more, or the XML specs for a lot more detail :)
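For what it's worth, the SAX parser will tell you where it choked if you catch the exception; a minimal sketch (filename hypothetical):
import xml.sax
handler = xml.sax.ContentHandler()  # no-op handler; we only care about well-formedness
try:
    xml.sax.parse("data.xml", handler)
except xml.sax.SAXParseException as e:
    print("Not well-formed at line %d, column %d: %s"
          % (e.getLineNumber(), e.getColumnNumber(), e.getMessage()))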
EDIT: To represent "&" in text, you should escape it as &amp;
So:
&lt
should be
&amp;lt
(assuming you really want ampersand, l, t).
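If you're doing the escaping from Python, xml.sax.saxutils will handle it for you:
from xml.sax.saxutils import escape
print(escape("&lt"))        # -> &amp;lt
print(escape("a < b & c"))  # -> a &lt; b &amp; c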
I would second the recommendation to try parsing it with another XML parser. That should give an indication as to whether it's the document that's wrong, or the parser.
Also, the actual error message might be useful. One fairly common problem, for example, is that the XML declaration (if one is used; it's optional) must be the very first thing in the file -- not even whitespace is allowed before it.
You could load it into Firefox, if you don't have an XML editor. Firefox shows you the error.
