Removing unneccessary inner tags - python

We are converting DOCX to HTML through some external converter tool.
The generated HTML for tables contains something like this:
<td><div><span><b>Patienten</b></span></div></td>
The <div> and <span> tags inside TD are completely superfluous here.
The expected result is
<td><b>Patienten</b></td>
Is there some chance to remove them in a sane way using BeautifulSoup?

Well, the <div> and <span> tags have a structural meaning, that cannot be automatically guessed as "superfluous".
Your problem looks very similar to AST (Abstract Syntax Tree) optimization done in compilers. You could try to define some rules and build a SoupOptimizer to take a tree (your document) and produce an optimized output tree. Rules could be:
span(content) -> content, if span.attributes is empty
div(content) -> content, if div.attributes is empty
Note, that tree transformations on XML dialects can be done with XSLT. Just be ready to have your brain turned inside out before you see the light!

The way we do it is to use lxml and determine the parents and children of every element. If there is no text content difference in the parents and children then we have a set of rules that we follow to retain certain children while tossing the parents. And then forcing the appropriate block elements In your case b is a child of span, div and td we know that the td tag is the structuring element that is relevant so we get rid of the others. Again this requires testing the text content of each of the nested elements.

You could use the strip_tags function of Jesse Dhillon's answer of this question

You could rearrange the parse tree like this:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<td><div><span><b>Patienten</b></span></div></td>")
td = soup.td
b = soup.td.div.span.b
td.insert(0,b)
td.div.extract()
print soup

I like the approach suggested by #Daren Thomas, but be aware that removing those "useless" tags could drastically affect the rendered appearance of the document thanks to JavaScript (less likely) or CSS (much more likely, possibly even probable) that relies on the resulting HTML to follow certain structural patterns, even if they are wasteful.
This makes the life of the tool writer much easier. Assume that some given construct in the DOCX has two possible variations. One of these requires a lot of boilerplate so you can attach a few special attributes (say a text-align or some such). The other doesn't. It's way easier to just always generate the boilerplate and write your CSS or what-have-you with that fact in mind.

If Beautiful Soup alone isn't sufficient, you can resort to regular expression.
import re
ch = 'sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week'
# <td><b>Patienten</b></td>
RE = '(<td>)<div><span>(<b>.*?</b>)</span></div>(</td>)'
pat = re.compile(RE)
print ch
print pat.sub('\\1\\2\\3',ch)
result
sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week
sunny day<td><b>Patienten</b></td>rainy week
Easy, easyn't it ?
A preliminary inspection can be done to determine if the replacement must really be done or not.

Related

Regular Expressions or BeautifulSoup - Varying Cases

I have 3 strings I am looking to retrieve that are characterized by the presence of two words: section and front. I'm terrible with regex.
contentFrame wsj-sectionfront economy_sf
contentFrame wsj-sectionfront business_sf
section-front markets
How can I match both of these words using one regular expression? This will be used to match the contents of a html page parsed by BeautifulSoup.
UPDATE:
I want to extract the main body of a webpage (https://www.wsj.com/news/business) that has the div tag: Main Content Housing. For some reason, BeautifulSoup isn't recognizing the highlighted class attribute using:
wsj_soup.find('div', attrs = {'class':'contentFrame wsj-sectionfront business_sf')
# Returns []
I'm trying to stay in BeautifulSoup as much as possible, but if regex is the way to go I will use that. From there I will more than likely search using the contents attribute to search for relevant keywords, but if anyone has a better idea of how to approach it please share.
One way to handle this would be to use two separate lookaheads which check for each of these words:
^(?=.*section)(?=.*front).*$
Demo

Extract text between tags with XPath including markup

I have the following piece of XML:
...<span class="st">In Tim <em>Power</em>: Politieman...</span>...
I want to extract the part between the <span> tags.
For this I use XPath:
/span[#class="st"]
This however will extract everything including the <span>.
and.
/span[#class="st"]/text()
will return a list of two text elements. One containing "In Tim". The other ":Politieman". The <em>..</em> is not included and is handled like a separator.
Is there a pure XPath solution which returns:
In Tim <em>Power</em>: Politieman...
EDIT
Thanks to #helderdarocha and #TextGeek. Seems non trivial to extract plain text with XPath only including the <em>.
The /span[#class="st"]/node() solution creates a list containing the individual lines, from which it is trivial in Python to create a String.
To get any child node you can use:
/span[#class="st"]/node()
This will return:
Two child text nodes
The full <em> node (element and contents).
If you actually want all the text() nodes, including the ones inside em, then get all the text() descendants:
/span[#class="st"]//text()
or
/span[#class="st"]/descendant::text()
This will return three text nodes, the text inside <em>, but not the <em> elements.
Sounds like you want the equivalent of the Javascript DOM innerHTML() function, but for XML. I don't think there's a way to do that in pure XPath.
XPath doesn't really operate on markup strings like "<em>" and "</em>" at all -- it works with a tree of Node objects (there might possibly be an XPath implementation that tries to work directly off markup, but I doubt it). Most XPath implementations wouldn't even have the 4 characters "<em>" anywhere (except maybe kept around for printing error messages or something), and of course the DOM could have been built from scratch rather than from XML or other input in the first place. Likewise, XPath doesn't really figure on handing back marked-up strings, but lists of nodes.
In XSLT or XQuery you can do this easily, but not in XPath by itself, unless I'm missing something.
-s

What is an ElementTree object exactly, and how can I get data from it?

I'm trying to teach myself how to parse XML. I've read the lxml tutorials, but they're hard to understand. So far, I can do:
>>> from lxml import etree
>>> xml=etree.parse('ham.xml')
>>> xml
<lxml.etree._ElementTree object at 0x118de60>
But how can I get data from this object? It can't be indexed like xml[0], and it can't be iterated over.
More specifically, I'm using this xml file and I'm trying to extract, say, everything between the <l> tags that's surrounded by <sp> tags that contain, say, the Barnardo attribute.
It is a ElementTree Element object.
You can also look at the lxml API documentation, which has an lxml.etree._Element page. That page tells you about every single attribute and method on that class you could ever want to know about.
I'd start with reading the lxml.etree tutorial, however.
If the element cannot be indexed, however, it is an empty tag, and there are no child nodes to retrieve.
To find all lines by Bernardo, an XPath expression is needed, with a namespace map. It doesn't matter what prefix you use, as long as it is a non-empty string lxml will map it to the correct namespace URL:
nsmap = {'s': 'http://www.tei-c.org/ns/1.0'}
for line in tree.xpath('.//s:sp[#who="Barnardo"]/s:l/text()', namespaces=nsmap):
print line.strip()
This extracts all text in <l> elements that are contained in <sp who="Barnardo"> tags. Note the s: prefixes on the tag names, the nsmap dictionary tells lxml what namespace to use. I printed these without the surrounding extra whitespace.
For your sample document, that gives:
>>> for line in tree.xpath('.//s:sp[#who="Barnardo"]/s:l/text()', namespaces=nsmap):
... print line.strip()
...
Who's there?
Long live the king!
He.
'Tis now struck twelve; get thee to bed, Francisco.
Have you had quiet guard?
Well, good night.
If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.
Say,
What, is Horatio there?
Welcome, Horatio: welcome, good Marcellus.
I have seen nothing.
Sit down awhile;
And let us once again assail your ears,
That are so fortified against our story
What we have two nights seen.
Last night of all,
When yond same star that's westward from the pole
Had made his course to illume that part of heaven
Where now it burns, Marcellus and myself,
The bell then beating one,
In the same figure, like the king that's dead.
Looks 'a not like the king? mark it, Horatio.
It would be spoke to.
See, it stalks away!
How now, Horatio! you tremble and look pale:
Is not this something more than fantasy?
What think you on't?
I think it be no other but e'en so:
Well may it sort that this portentous figure
Comes armed through our watch; so like the king
That was and is the question of these wars.
'Tis here!
It was about to speak, when the cock crew.
One way to parse the XML is using XPath. You can call the xpath() member function for an ElementTree, in your case xml.
As an example, to print the XML for all the <l> elements (lines of the play).
subtrees = xml.xpath('//l', namespaces={'prefix': 'http://www.tei-c.org/ns/1.0'})
for l in subtrees:
print(etree.tostring(l))
The lxml docs detail the xpath functionality.
As pointed out below this doesn't work unless a namespace is specified. Unfortunately the empty namespace is not supported by lxml, but you can change the root node to use a namespace named prefix, which is also the name used above.
<TEI xmlns:prefix="http://www.tei-c.org/ns/1.0" xml:id="sha-ham">

Parse XML in order using Python

I'm trying to parse an XML document. The document has HTML like formatting embedded, for example
<p>This is a paragraph
<em>with some <b>extra</b> formatting</em>
scattered throughout.
</p>
So far I've used
import xml.etree.cElementTree as xmlTree
to handle the XML document, but I am not sure if this provides the functionality I look for. How would I go about handling the text nodes here?
Also, is there a way to find the closing tags in a document?
Thanks!
If your XML document fits in memory, you should use Beautiful Soup which will give you a much cleaner access to the document. You'll be able to select a node and automatically interact with its children; every node will have a .next command, which will iterate through the text up to the next tag.
So:
>>> b = BeautifulSoup.BeautifulStoneSoup("<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>")
>>> b.find('p')
<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>
>>> b.find('p').next
u'This is a paragraph '
>>> b.find('p').next.next
<em>with some <b>extra</b> formatting</em>
That, or something like it, should solve your problem.
If it doesn't fit in memory, you'll need to subclass a SAX parser, which is a bit more work. To do that, you use from xml.parsers import expat and write handlers for opening and closing of tags. It's a bit more involved.

Fastest, easiest, and best way to parse an HTML table?

I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it. Preferably in PHP, python or JavaScript.
This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.
BeautifulSoup is the first thing that comes to mind.
Another possibility is copying/pasting it in TextMate and then running regular expressions.
What do you suggest?
This is the script that I ended up writing, but as I said, I'm looking for a more general solution.
from BeautifulSoup import BeautifulSoup
import urllib2
url = 'http://www.datamystic.com/timezone/time_zones.html';
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tables = soup.findAll("table")
table = tables[1]
rows = table.findAll("tr")
for row in rows:
tds = row.findAll('td')
if(len(tds)==4):
countrycode = tds[1].string
timezone = tds[2].string
if(type(countrycode) is not type(None) and type(timezone) is not type(None)):
print "\'%s\' => \'%s\'," % (countrycode.strip(), timezone.strip())
Comments and suggestions for improvement to my python code welcome, too ;)
For your general problem: try lxml.html from the lxml package (think of it as the stdlibs xml.etree on steroids: the same xml api, but with html support, xpath, xslt etc...)
A quick example for your specific case:
from lxml import html
tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.findall('//table')[1]
data = [
[td.text_content().strip() for td in row.findall('td')]
for row in table.findall('tr')
]
This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)
BUT: More specifically for your particular use case: there are better way to get at timezone database information than scraping that particular webpage (aside: note that the web page actually mentions that you are not allowed to copy its contents). There are even existing libraries that already use this information, see for example python-dateutil.
Avoid regular expressions for parsing HTML, they're simply not appropriate for it, you want a DOM parser like BeautifulSoup for sure...
A few other alternatives
SimpleHTMLDom PHP
Hpricot & Nokogiri Ruby
Web::Scraper Perl/CPAN
All of these are reasonably tolerant of poorly formed HTML.
I suggest loading the document with an XML parser like DOMDocument::loadHTMLFile that is bundled with PHP and then use XPath to grep the data you need.
This is not the fastest way, but the most readable (in my opinion) in the end. You can use Regex, which will probably be a little faster, but would be bad style (hard to debug, hard to read).
EDIT: Actually this is hard because the page you mentioned is not valid HTML (see validator.w3.org). Especially tags with no opening/closing tag are heavily in the way.
It looks though like xmlstarlet ( http://xmlstar.sourceforge.net/ (great tool)) is able to repair the problem (run xmlstarlet fo -R ). xmlstarlet can also do xpath and xslt script which can help you in extracting your data with a simple shell script.
While we were building SerpAPI we tested many platform/parser.
Here is the benchmark result for Python.
For more, here is a full article on Medium:
https://medium.com/#vikoky/fastest-html-parser-available-now-f677a68b81dd
The efficiency of a regex is superior to a DOM parser.
Look at this comparison:
http://www.rockto.com/launcher/28852/mochien.com/Blog/Read/A300111001736/Regex-VS-DOM-untuk-Rockto-Team
You can find many more searching the web.

Categories

Resources