Complex HTML parsing with Python

I am already aware of tag-based HTML parsing in Python using BeautifulSoup, htmllib, etc.
However, I want a powerful engine which can do complex tasks like reading HTML tables, lists, etc. and presenting them as simple-to-use objects within code. Does Python have such powerful libraries?

BeautifulSoup is a nice library and provides a good way to parse HTML, with some handy ways to extract the data very easily.
What you are trying to do can easily be done using some simple regular expressions. You can write regular expressions to search for a particular pattern of data and extract the data you need.

You might consider lxml which has a powerful HTML processor. There is another complementary module that relies on lxml called pyquery that might be just what you're looking for.
PyQuery has jQuery-like syntax, so if you're used to jQuery you'll be able to jump right in.
Here is a simple example to get the first <ul> item from aol.com:
>>> from pyquery import PyQuery as pq
>>> import urllib
>>> data = urllib.urlopen('http://aol.com').read()
>>> d = pq(data)
>>> first_ul = d('ul:first')
>>> first_ul
[<ul#dhL2>]
>>> print first_ul
<ul id="dhL2"><li class="dhL1"><a accesskey="" href="https://new.aol.com/productsweb/?promocode=827693&ncid=txtlnkuswebr00000074" name="om_dirbtn1" class="_o4-0" id="om_dirbtn1">Get Free Mail</a></li>
</ul>

The standard HTML parsers are already pretty good at giving you simple objects (e.g. iterables). Creating anything more complex than a 2D list from a table would likely be dependent on the data that was in the page.
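For instance, a minimal sketch of flattening a simple table into a 2D list with BeautifulSoup 4 (the markup here is made up for illustration; any table of plain-text cells works the same way):

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page.
html = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
rows = [[cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        for row in soup.find("table").find_all("tr")]
# rows == [['a', 'b'], ['c', 'd']]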
With that said...
Here's a link to a blog post by someone who wrote a script to convert HTML tables to Python lists. The actual file is located here.
I've never heard of a standard python library that does these sorts of operations, so your best bet might be Googling each case as you need it. Chances are someone has done what you are trying to do.
Disclaimer: You should always read and understand any code you find online before pasting it into your own applications! Citing who/where it's from is good too!

Related

How to perform a Google search and take the text result?

Wondering how to use Python 3 with Google to create a dictionary of some words (so say I enter a word, I want Python to take the definition that Google is able to give, then store or display it).
I haven't done much coding, but I know how to manage the words after. I'm just a bit confused using urllib and stuff. I have only been able to find help for this on other versions of Python, which I have not been able to replicate on Python 3.3.
EDIT: Yes, I want to use Google because I like the way it defines words and phrases, and I plan to use the define protocol you mentioned, icedtrees.
Edit: it appears that Google Search grabs its definitions using AJAX calls or something. The below solution will not work.
If you are having trouble using urllib (urllib2 in Python 2), I suggest the nice Python Requests package, which is a lot easier to use.
If you are absolutely committed to getting the Google definition and no other definition, I would suggest making an HTTP request to a page using the Google Search "define" protocol.
For example:
https://www.google.com.au/search?q=define:test
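A minimal sketch of fetching that page with the Requests package suggested above (the User-Agent value here is made up; per the edit above, Google may block or change this page at any time):

import requests

# Fetch the Google "define" results page.
resp = requests.get(
    'https://www.google.com.au/search',
    params={'q': 'define:test'},
    headers={'User-Agent': 'Mozilla/5.0'},  # hypothetical header value
)
html = resp.text  # this string is what the regex below searches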
You would then save the HTML result, and then parse it for the definitions that you require. Some examples of Python HTML parsers are the HTMLParser module, and also BeautifulSoup. However, this parsing operation seems pretty simple, so a basic regex should be more than enough. All definitions are stored as follows:
<div style="display:inline" data-dobid="dfn"> <!-- the order of the style and data-dobid attributes can change -->
<span>definition goes here</span>
</div>
An example of a regex to grab the definitions of "test" from the HTML page:
import re
definitions = re.findall(r'data-dobid="dfn".*?>.*?<span>(.*?)</span>.*?</div>', html, re.DOTALL)
>>> len(definitions)
18
>>> definitions[0]
'a\n procedure intended to establish the quality, performance, or \nreliability of something, especially before it is taken into widespread \nuse.'
# Looks like you might need to remove the newlines
>>> definitions[5]
'the result of a medical examination or analytical procedure.'
As a sidenote, there also exists a Google Dictionary API, which can give you definition results in JSON format in response to a request.

Need help parsing html in python3, not well formed enough for xml.etree.ElementTree

I keep getting mismatched tag errors all over the place. I'm not sure why exactly; it's the text on the craigslist homepage, which looks fine to me, but I haven't skimmed it thoroughly enough. Is there perhaps something more forgiving I could use, or is this my best bet for HTML parsing with the standard library?
The mismatched tag errors are likely caused by just that: mismatched tags. Browsers are famous for accepting sloppy HTML, and have made it easy for web page coders to write badly formed HTML, so there's a lot of it. There's no reason to believe that craigslist should be immune to bad web page designers.
You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know it).
One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on that.
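For example, a sketch of piping a page through the external tidy tool before parsing (this assumes HTML Tidy is installed on the system, and that raw_html holds the fetched page as a string):

import subprocess

# -q suppresses informational output; -asxml converts the input to well-formed XHTML
cleaned = subprocess.run(['tidy', '-q', '-asxml'],
                         input=raw_html, capture_output=True, text=True).stdout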
The best library for parsing unpredictable HTML is BeautifulSoup. Here's a quote from the project page:
You didn't write that awful page.
You're just trying to get some data
out of it. Right now, you don't really
care what HTML is supposed to look
like.
Neither does this parser.
However, it isn't well supported for Python 3; there's more information about this at the end of the link.
Parsing HTML is not an easy problem, and using a library is definitely the solution here. The two common libraries for parsing HTML that isn't well formed are BeautifulSoup and lxml.
lxml supports Python 3, and its HTML parser handles unpredictable HTML well. It's fast, too, since it uses C libraries under the hood. I highly recommend it.
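As a quick illustration (the sloppy markup here is invented for the example):

from lxml import html

broken = "<ul><li>one<li>two</ul><p>unclosed paragraph"
doc = html.fromstring(broken)  # parses despite the missing closing tags
print([li.text for li in doc.findall('.//li')])  # ['one', 'two']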
BeautifulSoup 3.1 supports Python 3, but it is also deemed "a failed experiment" and you are told not to use it, so in practice BeautifulSoup doesn't support Python 3 yet, leaving lxml as the only alternative.

Python HTML parsing

I am currently trying to make a program that, given a word, will look up its definition and return it. Although I have gotten this to work, I had to resort to using RegEx to search for the text between the tags where the definitions are stored. What is a more efficient way to do this using Python 3.x?
lxml works for Python 3. It has an ElementTree-compatible API, but uses C libraries behind the scenes, so it's fast, and it supports XPath, which is a nice way of querying documents (sometimes).
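For instance, a small sketch of the XPath side (the markup is invented for illustration):

from lxml import html

page = html.fromstring('<dl><dt>test</dt><dd>a procedure intended to establish quality</dd></dl>')
# XPath addresses nodes declaratively instead of walking the tree by hand
definitions = page.xpath('//dd/text()')
# definitions == ['a procedure intended to establish quality']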
Try BeautifulSoup, a good HTML parser for Python. (It works with Python 3.x too, although unless you are deep into a Python 3.0 project, consider using 2.7.)
Yours is a pretty simple requirement when it comes to HTML parsing. The Python standard library includes the ElementTree module, which should be helpful for the task you are planning to undertake. Look at the example snippet given on that page.
Also, never make the mistake of parsing HTML/XML using regex. You may not realize how quickly it gets insanely complicated, and it is a bad idea in almost any situation.

batch script or python program to edit string in xml tags

I am looking to write a program that searches for the tags in an xml document and changes the string between the tags from localhost to manager. The tag might appear in the xml document multiple times, and the document does have a definite path. Would python or vbscript make the most sense for this problem? And can anyone provide a template so I can get started? That would be great. Thanks.
If it's a simple thing, like changing a few strings here and there, you might be able to do everything with a python regexp, check here:
http://docs.python.org/library/re.html
For everything more complex, I would suggest using something like Beautiful Soup:
http://www.crummy.com/software/BeautifulSoup/documentation.html
It's a bit outdated, but contains everything you would ever need...
I suggest that you go straight to lxml library for python and don't look back. The regex manipulation of xml can have terrible consequences, and BeautifulSoup, although quite popular, is officially abandoned.
lxml is quite powerful, fast, and efficient. For your task, it is sufficient to write:

from lxml import etree

doc = etree.fromstring(content)
elements = doc.findall('tags_to_modify')
for el in elements:
    el.text = your_replacement_function(el.text)
print(etree.tostring(doc))
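Applied to the question's actual task, a sketch might look like this (the tag name 'hostname' and the file name are assumptions; use whatever tag actually wraps the string in your document):

from lxml import etree

tree = etree.parse('config.xml')
for el in tree.iter('hostname'):      # every occurrence of the tag
    if el.text == 'localhost':
        el.text = 'manager'
tree.write('config.xml')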
You can find a lot of help in lxml's documentation:
http://lxml.de/

XML Processing in Python

I am about to build a piece of a project that will need to construct and post an XML document to a web service and I'd like to do it in Python, as a means to expand my skills in it.
Unfortunately, whilst I know the XML model fairly well in .NET, I'm uncertain what the pros and cons are of the XML models in Python.
Anyone have experience doing XML processing in Python? Where would you suggest I start? The XML files I'll be building will be fairly simple.
Personally, I've played with several of the built-in options on an XML-heavy project and have settled on pulldom as the best choice for less complex documents.
Especially for small simple stuff, I like the event-driven theory of parsing rather than setting up a whole slew of callbacks for a relatively simple structure. Here is a good quick discussion of how to use the API.
What I like: you can handle the parsing in a for loop rather than using callbacks. You also delay full parsing (the "pull" part) and only get additional detail when you call expandNode(). This satisfies my general requirement for "responsible" efficiency without sacrificing ease of use and simplicity.
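A minimal pulldom sketch of that pattern (the file name and the 'item' tag are hypothetical):

from xml.dom import pulldom

doc = pulldom.parse('feed.xml')
for event, node in doc:               # parsing driven by a plain for loop
    if event == pulldom.START_ELEMENT and node.tagName == 'item':
        doc.expandNode(node)          # pull in the full subtree only when needed
        print(node.toxml())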
ElementTree has a nice pythonic API. I think it's even shipped as part of Python 2.5.
It's pure Python and, as I say, pretty nice, but if you wind up needing more performance, lxml exposes the same API and uses libxml2 under the hood. You can theoretically just swap it in when you discover you need it.
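A tiny sketch of building a document with ElementTree (the tag names are invented for illustration):

import xml.etree.ElementTree as ET

root = ET.Element('request')
ET.SubElement(root, 'user').text = 'alice'
ET.SubElement(root, 'action').text = 'create'
payload = ET.tostring(root, encoding='utf-8')   # bytes, ready to POST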
There are 3 major ways of dealing with XML, in general: dom, sax, and xpath. The dom model is good if you can afford to load your entire xml file into memory at once, and you don't mind dealing with data structures, and you are looking at much/most of the model. The sax model is great if you only care about a few tags, and/or you are dealing with big files and can process them sequentially. The xpath model is a little bit of each -- you can pick and choose paths to the data elements you need, but it requires more libraries to use.
If you want straightforward and packaged with Python, minidom is your answer, but it's pretty lame, and the documentation is "here's docs on dom, go figure it out". It's really annoying.
Personally, I like cElementTree, which is a faster (c-based) implementation of ElementTree, which is a dom-like model.
I've used sax systems, and in many ways they're more "pythonic" in their feel, but I usually end up creating state-based systems to handle them, and that way lies madness (and bugs).
I say go with minidom if you like research, or ElementTree if you want good code that works well.
I've used ElementTree for several projects and recommend it.
It's pythonic, comes 'in the box' with Python 2.5, including the C version cElementTree (xml.etree.cElementTree), which is 20 times faster than the pure Python version, and it's very easy to use.
lxml has some performance advantages, but they are uneven, so you should check the benchmarks for your use case first.
As I understand it, ElementTree code can easily be ported to lxml.
It depends a bit on how complicated the document needs to be.
I've used minidom a lot for writing XML, but that's usually been just reading documents, making some simple transformations, and writing them back out. That worked well enough until I needed the ability to order element attributes (to satisfy an ancient application that doesn't parse XML properly). At that point I gave up and wrote the XML myself.
If you're only working on simple documents, then doing it yourself can be quicker and simpler than learning a framework. If you can conceivably write the XML by hand, then you can probably code it by hand as well (just remember to properly escape special characters, and use str.encode(codec, errors="xmlcharrefreplace")). Apart from these snafus, XML is regular enough that you don't need a special library to write it. If the document is too complicated to write by hand, then you should probably look into one of the frameworks already mentioned. At no point should you need to write a general XML writer.
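For the escaping part, the standard library already has a helper; a tiny sketch:

from xml.sax.saxutils import escape

value = 'spam & <eggs>'
xml = '<item>%s</item>' % escape(value)   # '<item>spam &amp; &lt;eggs&gt;</item>'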
You can also try untangle to parse simple XML documents.
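For instance, untangle maps elements to attribute access (a sketch with made-up markup):

import untangle

obj = untangle.parse('<root><child name="x"/></root>')
obj.root.child['name']   # 'x'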
Since you mentioned that you'll be building "fairly simple" XML, the minidom module (part of the Python Standard Library) will likely suit your needs. If you have any experience with the DOM representation of XML, you should find the API quite straightforward.
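A short minidom sketch of building a simple document (element names are made up):

from xml.dom.minidom import Document

doc = Document()
root = doc.createElement('request')
doc.appendChild(root)
item = doc.createElement('item')
item.appendChild(doc.createTextNode('hello'))
root.appendChild(item)
print(doc.toprettyxml(indent='  '))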
I wrote a SOAP server that receives XML requests and creates XML responses. (Unfortunately, it's not my project, so it's closed source, but that's another problem.)
It turned out for me that creating (SOAP) XML documents is fairly simple if you have a data structure that "fits" the schema.
I keep the envelope since the response envelope is (almost) the same as the request envelope. Then, since my data structure is a (possibly nested) dictionary, I create a string that turns this dictionary into <key>value</key> items.
This is a task that recursion makes simple, and I end up with the right structure. This is all done in python code and is currently fast enough for production use.
You can also (relatively) easily build lists as well, although depending upon your client, you may hit problems unless you give length hints.
For me, this was much simpler, since a dictionary is a much easier way of working than some custom class. For the record, generating XML is much easier than parsing it!
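A sketch of that dict-to-XML idea (not the author's closed-source code; escaping and attributes are ignored for brevity):

def dict_to_xml(d):
    parts = []
    for key, value in d.items():
        if isinstance(value, dict):
            value = dict_to_xml(value)   # recursion handles the nesting
        parts.append('<%s>%s</%s>' % (key, value, key))
    return ''.join(parts)

print(dict_to_xml({'user': {'name': 'alice', 'id': 7}}))
# <user><name>alice</name><id>7</id></user>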
For serious work with XML in Python, use lxml.
Python comes with the built-in ElementTree library, but lxml extends it in terms of speed and functionality (schema validation, SAX parsing, XPath, various sorts of iterators, and many other features).
You have to install it, but in many places it is already assumed to be part of the standard equipment (e.g. Google App Engine does not allow C-based Python packages, but makes an exception for lxml, pyyaml, and a few others).
Building XML documents with E-factory (from lxml)
Your question is about building an XML document.
With lxml there are many methods, and it took me a while to find one which seems easy to use and also easy to read.
Sample code from lxml doc on using E-factory (slightly simplified):
The E-factory provides a simple and compact syntax for generating XML and HTML:
>>> from lxml.builder import E
>>> html = page = (
... E.html( # create an Element called "html"
... E.head(
... E.title("This is a sample document")
... ),
... E.body(
... E.h1("Hello!"),
... E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
... E.p("This is another paragraph, with a", "\n ",
... E.a("link", href="http://www.python.org"), "."),
... E.p("Here are some reserved characters: <spam&egg>."),
... )
... )
... )
>>> print(etree.tostring(page, pretty_print=True))
<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1>Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p>This is another paragraph, with a
      <a href="http://www.python.org">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
  </body>
</html>
What I appreciate about the E-factory:
The code reads almost like the resulting XML document
Readability counts.
It allows creation of any XML content
It supports things like:
use of namespaces
starting and ending text nodes within one element
functions formatting attribute content (see the CLASS function in the full lxml sample)
It allows very readable constructs with lists,
e.g.:
from lxml import etree
from lxml.builder import E
lst = ["alfa", "beta", "gama"]
xml = E.root(*[E.record(itm) for itm in lst])
print(etree.tostring(xml, pretty_print=True).decode())
resulting in:
<root>
  <record>alfa</record>
  <record>beta</record>
  <record>gama</record>
</root>
Conclusions
I highly recommend reading lxml tutorial - it is very well written and will give you many more reasons to use this powerful library.
The only disadvantage of lxml is that it must be compiled. See this SO answer for tips on how to install lxml from a wheel package in a fraction of a second.
I strongly recommend the SAX (Simple API for XML) implementation in the Python libraries. It is fairly easy to set up, processes large XML documents via an event-driven API, as discussed by previous posters here, and has a low memory footprint, unlike validating DOM-style XML parsers.
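A minimal SAX sketch (the file name and the 'item' tag are hypothetical):

import xml.sax

class ItemCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0
    def startElement(self, name, attrs):
        if name == 'item':           # the parser calls this for every start tag
            self.count += 1

handler = ItemCounter()
xml.sax.parse('big.xml', handler)    # streams the file; low memory footprint
print(handler.count)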
If you're going to be building SOAP messages, check out soaplib. It uses ElementTree under the hood, but it provides a much cleaner interface for serializing and deserializing messages.
I assume that the .NET way of processing XML builds on some version of MSXML, in which case using, for example, minidom would make you feel somewhat at home. However, if it is simple processing you are doing, any library will probably do.
I also prefer working with ElementTree when dealing with XML in Python because it is a very neat library.
