I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it. Preferably in PHP, python or JavaScript.
This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.
BeautifulSoup is the first thing that comes to mind.
Another possibility is copying/pasting it in TextMate and then running regular expressions.
What do you suggest?
This is the script that I ended up writing, but as I said, I'm looking for a more general solution.
from BeautifulSoup import BeautifulSoup
import urllib2

url = 'http://www.datamystic.com/timezone/time_zones.html'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tables = soup.findAll("table")
table = tables[1]
rows = table.findAll("tr")
for row in rows:
    tds = row.findAll('td')
    if len(tds) == 4:
        countrycode = tds[1].string
        timezone = tds[2].string
        if countrycode is not None and timezone is not None:
            print "'%s' => '%s'," % (countrycode.strip(), timezone.strip())
Comments and suggestions for improvement to my python code welcome, too ;)
For your general problem: try lxml.html from the lxml package (think of it as the stdlib's xml.etree on steroids: the same XML API, but with HTML support, XPath, XSLT, and so on).
A quick example for your specific case:
from lxml import html

tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.findall('//table')[1]
data = [
    [td.text_content().strip() for td in row.findall('td')]
    for row in table.findall('tr')
]
This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)
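If you want to skip those advertisement rows, here is a hedged tweak (assuming, as the question's own script does, that the real data rows have exactly four cells):

from lxml import html

tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.xpath('//table')[1]

# Keep only rows that have exactly four cells; the ad rows don't.
data = [
    [td.text_content().strip() for td in row.findall('td')]
    for row in table.findall('tr')
    if len(row.findall('td')) == 4
]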
BUT, more specifically for your particular use case: there are better ways to get at timezone database information than scraping that particular web page (aside: note that the page actually says you are not allowed to copy its contents). There are also existing libraries that already use this information; see for example python-dateutil.
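For example, a minimal sketch with pytz (my suggestion, not something the answer names; pytz ships the Olson/tz database together with an ISO country-code mapping):

import pytz

# Country code -> list of zone names, straight from the tz database,
# no web scraping required.
for code in ('US', 'DE', 'AU'):
    print(code, pytz.country_timezones[code])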
Avoid regular expressions for parsing HTML; they're simply not appropriate for it. You want a DOM parser like BeautifulSoup for sure...
A few other alternatives:
SimpleHTMLDom (PHP)
Hpricot & Nokogiri (Ruby)
Web::Scraper (Perl/CPAN)
All of these are reasonably tolerant of poorly formed HTML.
I suggest loading the document with an XML parser like DOMDocument::loadHTMLFile, which is bundled with PHP, and then using XPath to pull out the data you need.
This is not the fastest way, but the most readable (in my opinion) in the end. You can use regexes, which will probably be a little faster, but that would be bad style (hard to debug, hard to read).
EDIT: Actually this is hard, because the page you mentioned is not valid HTML (see validator.w3.org). In particular, tags without a matching opening/closing tag get heavily in the way.
It looks, though, like xmlstarlet (http://xmlstar.sourceforge.net/, a great tool) is able to repair the problem (run xmlstarlet fo -R). xmlstarlet can also run XPath and XSLT scripts, which can help you extract your data with a simple shell script.
While we were building SerpAPI we tested many platforms and parsers.
Here is the benchmark result for Python.
For more, here is a full article on Medium:
https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd
The efficiency of a regex is superior to that of a DOM parser.
Look at this comparison:
http://www.rockto.com/launcher/28852/mochien.com/Blog/Read/A300111001736/Regex-VS-DOM-untuk-Rockto-Team
You can find many more searching the web.
Venturing into the world of Python. I've done the Codecademy course and trawled through Stack Overflow and YouTube, but I'm hitting an issue I can't solve.
I'm attempting to do a simple print of a table located on Wikipedia. Failing miserably at writing my own code, I decided to use a tutorial example and build off it. However, this isn't working and I haven't the foggiest idea why.
This is the code, with the appropriate link included. My end result is an empty list "[ ]". I'm using PyCharm 2017.2, beautifulsoup4 4.6.0, requests 2.18.4 and Python 3.6.2. Any advice appreciated. For reference, the tutorial website is here.
import requests
from bs4 import BeautifulSoup
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"
req = requests.get(WIKI_URL)
soup = BeautifulSoup(req.content, 'lxml')
table_classes = {"class": ["sortable", "plainrowheaders"]}
wikitables = soup.findAll("table", table_classes)
print(wikitables)
You can accomplish that using regular expressions.
You get the site content with requests.get(WIKI_URL).content.
Look at the source code of the site to see how Wikipedia presents tables in HTML.
Find a regular expression that captures a whole table (it might be something like <table>(?P<table>.*?)</table>). What this does is grab everything between the <table> and </table> tokens. The Python re documentation is good here; take a look at re.findall().
Now you are left with the table data. You can use regular expressions again to get the data for each row, then regex each row to get the columns. re.findall() is the key again.
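A rough sketch of that approach (hedged: regexes on real-world Wikipedia markup are brittle, which is why the parser-based approaches above exist):

import re
import requests

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"
html = requests.get(WIKI_URL).text

# Everything between <table ...> and </table>, non-greedy, spanning newlines.
tables = re.findall(r'<table[^>]*>(.*?)</table>', html, re.DOTALL)

# Then split each table into rows (and each row into cells) the same way.
for table in tables[:1]:
    rows = re.findall(r'<tr[^>]*>(.*?)</tr>', table, re.DOTALL)
    print(len(rows), 'rows found')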
I know there is an easy way to copy all the source of a URL, but that's not my task. I need to save just the text (exactly as a web browser user would copy it) to a *.txt file.
Is it unavoidable to parse the HTML source for this, or is there a better way?
I think it is impossible if you don't parse at all. I guess you could use HTMLParser (http://docs.python.org/2/library/htmlparser.html) and just keep the data between the tags, but you will most likely get many more elements than you want.
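A minimal sketch of that HTMLParser idea (Python 3 module names; it keeps all character data and does nothing about visibility):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collect the character data between tags and nothing else.
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed('<html><body><p>Hello <b>world</b></p></body></html>')
print(' '.join(parser.chunks))   # -> Hello world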
To get exactly what [Ctrl-C] would give you is very difficult without parsing, because of things like style="display: hidden;", which would hide the text; handling that would again require full parsing of the HTML, JavaScript and CSS of both the document and its resource files.
Parsing is required. I don't know if there's a library method for this. A simple regex:
import re
text = re.sub(r"<[^>]+>", " ", html)
This requires many improvements, but it's a starting point.
With python, the BeautifulSoup module is great for parsing HTML, and well worth a look. To get the text from a webpage, it's just a case of:
#!/usr/bin/env python
#
import urllib2
from bs4 import BeautifulSoup
url = 'http://python.org'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
# you can refine this even further if needed... ie. soup.body.div.get_text()
text = soup.body.get_text()
print text
In BeautifulSoup version 3 I could take any chunk of HTML and get a string representation in this way:
from BeautifulSoup import BeautifulSoup
soup3 = BeautifulSoup('<div><b>soup 3</b></div>')
print unicode(soup3)
'<div><b>soup 3</b></div>'
However with BeautifulSoup4 the same operation creates additional tags:
from bs4 import BeautifulSoup
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
print unicode(soup4)
'<html><body><div><b>soup 4</b></div></body></html>'
^^^^^^^^^^^^ ^^^^^^^^^^^^^^
I don't need the outer <html><body>..</body></html> tags that BS4 is adding. I have looked through the BS4 docs and also searched inside the class, but could not find any setting for suppressing the extra tags in the output. How do I do it? Downgrading to v3 is not an option, since the SGML parser used in BS3 is nowhere near as good as the lxml or html5lib parsers that are available with BS4.
If you want your code to work on everyone's machine, no matter which parser(s) they have installed, etc. (the same lxml version built on libxml2 2.9 vs. 2.8 acts very differently, the stdlib html.parser had some radical changes between 2.7.2 and 2.7.3, …), you pretty much need to handle all of the legitimate results.
If you know you have a fragment, something like this will give you exactly that fragment:
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
if soup4.body:
    return soup4.body.next
elif soup4.html:
    return soup4.html.next
else:
    return soup4
Of course if you know your fragment is a single div, it's even easier—but it's not as easy to think of a use case where you'd know that:
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
return soup4.div
If you want to know why this happens:
BeautifulSoup is intended for parsing HTML documents. An HTML fragment is not a valid document. It's pretty close to a document, but that's not good enough to guarantee that you'll get back exactly what you give it.
As Differences between parsers says:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will give different results.
So, while this exact difference isn't documented, it's just a special case of something that is.
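A quick illustration of that difference (hedged: the exact output depends on which parsers and versions you have installed):

from bs4 import BeautifulSoup

fragment = '<div><b>soup 4</b></div>'

# html.parser typically leaves the fragment alone...
print(str(BeautifulSoup(fragment, 'html.parser')))
# -> <div><b>soup 4</b></div>

# ...while lxml treats it as a document and adds the missing wrapper tags.
print(str(BeautifulSoup(fragment, 'lxml')))
# -> <html><body><div><b>soup 4</b></div></body></html>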
As was noted in the old BeautifulStoneSoup documentation:
The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML doesn't have a fixed tag set, so those heuristics don't apply. So BeautifulSoup doesn't do XML very well.
Use the BeautifulStoneSoup class to parse XML documents. It's a general class with no special knowledge of any XML dialect and very simple rules about tag nesting...
And in the BeautifulSoup4 docs:
There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.
Perhaps that will yield what you want.
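If you try that route, a minimal sketch (hedged: it needs lxml installed, and bs4 prepends an XML declaration to the output):

from bs4 import BeautifulSoup

# Parsing the fragment as XML avoids the html/body wrapping entirely.
soup = BeautifulSoup('<div><b>soup 4</b></div>', 'xml')
print(str(soup))
# -> <?xml version="1.0" encoding="utf-8"?> followed by <div><b>soup 4</b></div>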
I need to extract pure text from a random web page at runtime, on the server side. I use Google App Engine and the Readability Python port.
There are a number of such ports:
early version by gfxmonk, based on BeautifulSoup
version by minvolai, based on gfxmonk's except that it uses lxml and not BeautifulSoup, making it (according to minvolai, see the project page) faster, albeit introducing a dependency on lxml.
version by Yuri Baburov aka buriy. Same as minvolai's, depends on lxml. Also depends on chardet to detect encoding.
I use Yuri's version, as it is most recent, and seems to be in active development.
I managed to make it run on Google App Engine using Python 2.7.
Now the "problem" is that it returns HTML, whereas I need pure text.
The advice in this Stack Overflow article about link extraction is to use BeautifulSoup. I will, if there is no other choice, but BeautifulSoup would be yet another dependency, as I use the lxml-based version.
My questions:
Is there a way to get pure text from the Python Readability version that I use, without forking the code?
Is there a way to easily retrieve pure text from the HTML result of Python Readability, e.g. by using lxml, BeautifulSoup, regex, or something else?
If the answer to the above is no, or yes but not easily, what is the way to modify Python Readability? Is such a modification even desirable enough (to enough people) to make such an extension official?
You can use html2text. It is a nifty tool.
Here is a link on how to use it with the Python Readability tool; together they are called read2text.
http://brettterpstra.com/scripting-readability-markdownify-for-clipping-web-pages/
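A minimal sketch of the html2text call itself (assuming the html2text package is installed; the input here is just an illustrative snippet):

import html2text

html_snippet = '<h1>Title</h1><p>Some <b>bold</b> text and a <a href="http://example.com">link</a>.</p>'

# html2text converts HTML into Markdown-ish plain text.
print(html2text.html2text(html_snippet))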
Hope this helps :)
Not to let this linger, here is my current solution.
I did not find a way to use the Readability ports.
I decided to use Beautiful Soup, version 4.
BS has one simple function to extract text.
code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.get_text()
First, you extract the HTML contents with readability,
from readability import Document

html_snippet = Document(html).summary()
Then, use a library to remove HTML tags. There are caveats:
1) You probably need spaces: "<p>some text<br>other text" shouldn't become "some textother text", and you might need lists converted into " - ".
2) "&#39;" should be displayed as "'", and "&gt;" should be displayed as ">" -- this is called HTML entity replacement (see below).
I usually use a library called bleach to clean out unnecessary tags and attributes:
import bleach

cleaned_text = bleach.clean(html_snippet, tags=[])
or
cleaned_text = bleach.clean(html_snippet, tags=['i', 'b'])
You need some kind of html2text library if you want to remove all tags and get better text formatting, or you can implement a custom formatting procedure yourself.
But I think you get the general idea now.
For a simple text formatting with bleach:
For example, if you want paragraphs as "\n", and list items as "\n - ", then:
norm_html = bleach.clean(html_snippet, tags=['p', 'br', 'li'])
replaced_html = norm_html.replace('<p>', '\n').replace('</p>', '\n')
replaced_html = replaced_html.replace('<br>', '\n').replace('<li>', '\n - ')
cleaned_text = bleach.clean(replaced_html, tags=[])
For a regexp that only strips HTML tags and does entity replacement ("&gt;" should become ">" and so on), you can take a look at https://stackoverflow.com/a/7778368/217895
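Something along those lines (a rough sketch only; Python 3's html.unescape handles the entity replacement):

import re
from html import unescape  # Python 3

html_snippet = '<p>1 &gt; 0 &amp; that&#39;s fine</p>'

# Strip the tags, then turn entities like &gt; and &#39; back into characters.
text = unescape(re.sub(r'<[^>]+>', ' ', html_snippet)).strip()
print(text)   # -> 1 > 0 & that's fine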
We are converting DOCX to HTML through some external converter tool.
The generated HTML for tables contains something like this:
<td><div><span><b>Patienten</b></span></div></td>
The <div> and <span> tags inside TD are completely superfluous here.
The expected result is
<td><b>Patienten</b></td>
Is there some chance to remove them in a sane way using BeautifulSoup?
Well, the <div> and <span> tags have a structural meaning that cannot automatically be dismissed as "superfluous".
Your problem looks very similar to AST (Abstract Syntax Tree) optimization done in compilers. You could try to define some rules and build a SoupOptimizer to take a tree (your document) and produce an optimized output tree. Rules could be:
span(content) -> content, if span.attributes is empty
div(content) -> content, if div.attributes is empty
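A minimal bs4 sketch of those two rules (using unwrap(); this is my own illustration, not an actual SoupOptimizer class):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<td><div><span><b>Patienten</b></span></div></td>', 'html.parser')

# Replace attribute-less <div> and <span> wrappers with their own contents.
for tag in soup.find_all(['div', 'span']):
    if not tag.attrs:
        tag.unwrap()

print(str(soup))   # -> <td><b>Patienten</b></td>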
Note that tree transformations on XML dialects can be done with XSLT. Just be ready to have your brain turned inside out before you see the light!
The way we do it is to use lxml and determine the parents and children of every element. If there is no difference in text content between the parents and the children, we have a set of rules that we follow to retain certain children while tossing the parents, and then force the appropriate block elements. In your case, b is a child of span, div and td; we know that the td tag is the relevant structuring element, so we get rid of the others. Again, this requires testing the text content of each of the nested elements.
You could use the strip_tags function from Jesse Dhillon's answer to this question.
You could rearrange the parse tree like this:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<td><div><span><b>Patienten</b></span></div></td>")
td = soup.td
b = soup.td.div.span.b
td.insert(0,b)
td.div.extract()
print soup
I like the approach suggested by @Daren Thomas, but be aware that removing those "useless" tags could drastically affect the rendered appearance of the document thanks to JavaScript (less likely) or CSS (much more likely, possibly even probable) that relies on the resulting HTML to follow certain structural patterns, even if they are wasteful.
This makes the life of the tool writer much easier. Assume that some given construct in the DOCX has two possible variations. One of these requires a lot of boilerplate so you can attach a few special attributes (say a text-align or some such). The other doesn't. It's way easier to just always generate the boilerplate and write your CSS or what-have-you with that fact in mind.
If Beautiful Soup alone isn't sufficient, you can resort to regular expressions.
import re
ch = 'sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week'
# <td><b>Patienten</b></td>
RE = '(<td>)<div><span>(<b>.*?</b>)</span></div>(</td>)'
pat = re.compile(RE)
print ch
print pat.sub('\\1\\2\\3',ch)
result
sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week
sunny day<td><b>Patienten</b></td>rainy week
Easy, isn't it?
A preliminary inspection can be done to determine whether the replacement really needs to be done.
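For that preliminary check, something like the following (a trivial sketch reusing pat and ch from the snippet above):

# Only run the substitution when the pattern actually occurs in the text.
if pat.search(ch):
    ch = pat.sub('\\1\\2\\3', ch)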