str1 = abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>
We need the contents inside the h1 tag and h2 tag.
What is the best way to do that? Thanks for the help!
The best way, if it needs to scale at all, would be with something like BeautifulSoup.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>')
>>> soup.h1
<h1>The content we need</h1>
>>> soup.h1.text
u'The content we need'
>>> soup.h2
<h2>The content we need2</h2>
>>> soup.h2.text
u'The content we need2'
It could be done with a regular expression too, but this is probably closer to what you want. A larger example of what you are trying to parse would help; without knowing quite what you want to extract, it's hard to help properly.
First bit of advice: DON'T USE REGULAR EXPRESSIONS FOR HTML/XML PARSING!
Now that we've cleared that up, I'd suggest you look at Beautiful Soup. There are other SGML/XML/HTML parsers available for Python. However, this one is the favorite for dealing with the sloppy "tag soup" that most of us find out in the real world. It doesn't require that the inputs be standards-conformant nor well-formed. If your browser can manage to render it, then Beautiful Soup can probably manage to parse it for you.
(Still tempted to use regular expressions for this task? Thinking "it can't be that bad, I just want to extract what's in the <h1>...</h1> and <h2>...</h2> containers" and "I'll never need to handle any other corner cases"? That way lies madness. The code you write based on that line of reasoning will be fragile. It will work just well enough to pass your tests, and then it will get worse and worse every time you need to fix "just one more thing." Seriously, import a real parser and use it.)
I need to extract pure text from a random web page at runtime, on the server side. I use Google App Engine, and the Readability python port.
There are a number of those.
early version by gfxmonk, based on BeautifulSoup
version by minvolai, based on gfxmonk's except that it uses lxml and not BeautifulSoup, making it (according to minvolai, see the project page) faster, albeit introducing a dependency on lxml.
version by Yuri Baburov aka buriy. Same as minvolai's, depends on lxml. Also depends on chardet to detect encoding.
I use Yuri's version, as it is most recent, and seems to be in active development.
I managed to make it run on Google App Engine using Python 2.7.
Now the "problem" is that it returns HTML, whereas I need pure text.
The advice in this Stack Overflow article about link extraction is to use BeautifulSoup. I will, if there is no other choice. BeautifulSoup would be yet another dependency, as I use an lxml-based version.
My questions:
Is there a way to get pure text from the Python Readability version that I use, without forking the code?
Is there a way to easily retrieve pure text from the HTML result of Python Readability, e.g. by using lxml, or BeautifulSoup, or RegEx, or something else?
If the answer to the above is no, or yes but not easily, what is the way to modify Python Readability? Is such a modification even desirable enough (to enough people) to make such an extension official?
You can use html2text. It is a nifty tool.
Here is a link on how to use it with the Python readability tool - together they are called read2text.
http://brettterpstra.com/scripting-readability-markdownify-for-clipping-web-pages/
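If it helps, a minimal usage sketch (html_snippet stands for the HTML string readability returned; the ignore_* switches are optional knobs, not required):
import html2text

h = html2text.HTML2Text()
h.ignore_links = True   # drop link markup, keep just the text
h.ignore_images = True  # likewise for images
text = h.handle(html_snippet)  # html_snippet: HTML produced by readability
print(text)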
Hope this helps :)
Not to let this linger, here is my current solution.
I did not find a way to use the Readability ports.
I decided to use Beautiful Soup, version 4.
BS has one simple function to extract text:
code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.get_text()
First, you extract the HTML contents with readability:
from readability import Document  # the readability-lxml port
html_snippet = Document(html).summary()
Then, use a library to remove HTML tags. There are caveats:
1) you probably need spaces: "<p>some text<br>other text" shouldn't become "some textother text", and you might need the lists converted into " - ".
2) "&#39;" should be displayed as "'", and "&gt;" should be displayed as ">" -- this is called HTML entity replacement (see below).
I usually use a library called bleach to clean out unnecessary tags and attributes:
cleaned_text = bleach.clean(html_snippet, tags=[])
or
cleaned_text = bleach.clean(html_snippet, tags=['i', 'b'])
You need to use some kind of html2text library if you want to remove all tags and get better text formatting, or you can implement a custom formatting procedure yourself.
But I think you get the rough idea now.
For a simple text formatting with bleach:
For example, if you want paragraphs as "\n", and list items as "\n - ", then:
norm_html = bleach.clean(html_snippet, tags=['p', 'br', 'li'])
replaced_html = norm_html.replace('<p>', '\n').replace('</p>', '\n')
replaced_html = replaced_html.replace('<br>', '\n').replace('<li>', '\n - ')
cleaned_text = bleach.clean(replaced_html, tags=[])
For a regexp that only strips HTML tags and does entity replacement ("&gt;" should become ">" and so on), you can take a look at https://stackoverflow.com/a/7778368/217895
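A minimal sketch of that idea (a crude tag-stripper plus entity unescaping; only safe on snippets that are already sanitized):
import re

try:
    from html import unescape  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape

def strip_tags_and_unescape(snippet):
    text = re.sub(r'<[^>]+>', '', snippet)  # drop anything tag-shaped
    return unescape(text)

print(strip_tags_and_unescape('<b>R&amp;D &gt; marketing</b>'))  # R&D > marketing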
I'm trying to write a program that will take an HTML file and make it more email friendly. Right now all the conversion is done manually because none of the online converters do exactly what we need.
This sounded like a great opportunity to push the limits of my programming knowledge and actually code something useful so I offered to try to write a program in my spare time to help make the process more automated.
I don't know much about HTML or CSS so I'm mostly relying on my brother (who does know HTML and CSS) to describe what changes this program needs to make, so please bear with me if I ask a stupid question. This is totally new territory for me.
Most of the changes are pretty basic -- if you see tag/attribute X then convert it to tag/attribute Y. But I've run into trouble when dealing with an HTML tag containing a style attribute. For example:
<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
Whenever possible I want to convert the style attributes into HTML attributes (or convert the style attribute to something more email friendly). So after the conversion it should look like this:
<img src="http://example.com/file.jpg" width="150" height="50" align="right"/>
Now I realize that not all CSS style attributes have an HTML equivalent, so right now I only want to focus on the ones that do. I whipped up a Python script that would do this conversion:
from bs4 import BeautifulSoup
import re

class Styler(object):
    img_attributes = {'float': 'align'}

    def __init__(self, soup):
        self.soup = soup

    def format_factory(self):
        self.handle_image()

    def handle_image(self):
        tag = self.soup.find_all("img", style=re.compile('.'))
        print tag
        for i in xrange(len(tag)):
            old_attributes = tag[i]['style']
            tokens = [s for s in re.split(r'[:;]+|px', str(old_attributes)) if s]
            del tag[i]['style']
            print tokens
            for j in xrange(0, len(tokens), 2):
                if tokens[j] in Styler.img_attributes:
                    tokens[j] = Styler.img_attributes[tokens[j]]
                tag[i][tokens[j]] = tokens[j+1]

if __name__ == '__main__':
    html = """
    <body>hello</body>
    <img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
    <blockquote>my blockquote text</blockquote>
    <div style="padding-left:25px; padding-right:25px;">text here</div>
    <body>goodbye</body>
    """
    soup = BeautifulSoup(html)
    s = Styler(soup)
    s.format_factory()
Now this script will handle my particular example just fine, but it's not very robust and I realize that when put up against real world examples it will easily break. My question is, how can I make this more robust? As far as I can tell Beautiful Soup doesn't have a way to change or extract individual pieces of a style attribute. I guess that's what I'm looking to do.
For this type of thing, I'd recommend an HTML parser (like BeautifulSoup or lxml) in conjunction with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than trying to come up with regular expressions to match any possible CSS you might find in the wild.
For example:
>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
u'150px'
>>> s.height
u'50px'
>>> s.keys()
[u'width', u'height', u'float']
>>> s.cssText
u'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
u'height: 50px;\nfloat: right'
So, using this you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be a little careful of the newline characters that pop up in the cssText attribute, though. I think cssutils is more designed for formatting things as standalone CSS files, but it's flexible enough to mostly work for what you're doing here.
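For instance, a rough sketch of wiring the two together (the px-stripping and the float-to-align mapping are my assumptions, mirroring the example above):
import cssutils
from bs4 import BeautifulSoup

html = '<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />'
soup = BeautifulSoup(html, 'html.parser')

for img in soup.find_all('img', style=True):
    style = cssutils.parseStyle(img['style'])
    width = style.getPropertyValue('width')
    height = style.getPropertyValue('height')
    css_float = style.getPropertyValue('float')
    if width:
        img['width'] = width.replace('px', '')   # "150px" -> "150"
    if height:
        img['height'] = height.replace('px', '')
    if css_float:
        img['align'] = css_float                 # float:right -> align="right"
    del img['style']

print(soup)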
Instead of reinventing the wheel, use the StoneageHTML package: http://pypi.python.org/pypi/StoneageHTML
We are converting DOCX to HTML through some external converter tool.
The generated HTML for tables contains something like this:
<td><div><span><b>Patienten</b></span></div></td>
The <div> and <span> tags inside TD are completely superfluous here.
The expected result is
<td><b>Patienten</b></td>
Is there some chance to remove them in a sane way using BeautifulSoup?
Well, the <div> and <span> tags have a structural meaning that cannot automatically be dismissed as "superfluous".
Your problem looks very similar to AST (Abstract Syntax Tree) optimization done in compilers. You could try to define some rules and build a SoupOptimizer to take a tree (your document) and produce an optimized output tree. Rules could be:
span(content) -> content, if span.attributes is empty
div(content) -> content, if div.attributes is empty
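A minimal sketch of those two rules using bs4's unwrap() (replaceWithChildren in the old BeautifulSoup 3 API):
from bs4 import BeautifulSoup

def unwrap_bare_wrappers(soup):
    # replace every attribute-less <div> or <span> with its own contents
    for tag in soup.find_all(['div', 'span']):
        if not tag.attrs:
            tag.unwrap()
    return soup

soup = BeautifulSoup('<td><div><span><b>Patienten</b></span></div></td>', 'html.parser')
print(unwrap_bare_wrappers(soup))  # <td><b>Patienten</b></td>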
Note that tree transformations on XML dialects can be done with XSLT. Just be ready to have your brain turned inside out before you see the light!
The way we do it is to use lxml and determine the parents and children of every element. If there is no difference in text content between a parent and its children, we follow a set of rules to retain certain children while tossing the parents, and then force the appropriate block elements. In your case, b is a child of span, div and td; we know that the td tag is the structurally relevant element, so we get rid of the others. Again, this requires testing the text content of each of the nested elements.
You could use the strip_tags function from Jesse Dhillon's answer to this question.
You could rearrange the parse tree like this:
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("<td><div><span><b>Patienten</b></span></div></td>")
td = soup.td
b = soup.td.div.span.b
td.insert(0, b)   # move the <b> tag up to be a direct child of <td>
td.div.extract()  # then discard the emptied <div><span> wrapper
print soup
I like the approach suggested by @Daren Thomas, but be aware that removing those "useless" tags could drastically affect the rendered appearance of the document, thanks to JavaScript (less likely) or CSS (much more likely, possibly even probable) that relies on the resulting HTML following certain structural patterns, even if they are wasteful.
This makes the life of the tool writer much easier. Assume that some given construct in the DOCX has two possible variations. One of these requires a lot of boilerplate so you can attach a few special attributes (say a text-align or some such). The other doesn't. It's way easier to just always generate the boilerplate and write your CSS or what-have-you with that fact in mind.
If Beautiful Soup alone isn't sufficient, you can resort to regular expressions.
import re
ch = 'sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week'
# <td><b>Patienten</b></td>
RE = '(<td>)<div><span>(<b>.*?</b>)</span></div>(</td>)'
pat = re.compile(RE)
print ch
print pat.sub('\\1\\2\\3',ch)
result
sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week
sunny day<td><b>Patienten</b></td>rainy week
Easy, isn't it?
A preliminary inspection can be done to determine whether the replacement really needs to be done.
I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it. Preferably in PHP, python or JavaScript.
This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.
BeautifulSoup is the first thing that comes to mind.
Another possibility is copying/pasting it in TextMate and then running regular expressions.
What do you suggest?
This is the script that I ended up writing, but as I said, I'm looking for a more general solution.
from BeautifulSoup import BeautifulSoup
import urllib2
url = 'http://www.datamystic.com/timezone/time_zones.html';
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tables = soup.findAll("table")
table = tables[1]
rows = table.findAll("tr")
for row in rows:
    tds = row.findAll('td')
    if len(tds) == 4:
        countrycode = tds[1].string
        timezone = tds[2].string
        if countrycode is not None and timezone is not None:
            print "'%s' => '%s'," % (countrycode.strip(), timezone.strip())
Comments and suggestions for improvement to my python code welcome, too ;)
For your general problem: try lxml.html from the lxml package (think of it as the stdlib's xml.etree on steroids: the same xml api, but with html support, xpath, xslt etc...)
A quick example for your specific case:
from lxml import html
tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.findall('//table')[1]
data = [
    [td.text_content().strip() for td in row.findall('td')]
    for row in table.findall('tr')
]
This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)
BUT, more specifically for your particular use case: there are better ways to get at timezone database information than scraping that particular webpage (aside: note that the web page actually mentions that you are not allowed to copy its contents). There are even existing libraries that already use this information, see for example python-dateutil.
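For example, a minimal sketch of looking a zone up through dateutil instead of scraping:
from datetime import datetime
from dateutil import tz

berlin = tz.gettz('Europe/Berlin')       # look up a zone by its IANA name
print(datetime.now(berlin).utcoffset())  # current UTC offset for that zone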
Avoid regular expressions for parsing HTML; they're simply not appropriate for it. You want a DOM parser like BeautifulSoup for sure...
A few other alternatives
SimpleHTMLDom PHP
Hpricot & Nokogiri Ruby
Web::Scraper Perl/CPAN
All of these are reasonably tolerant of poorly formed HTML.
I suggest loading the document with an XML parser like DOMDocument::loadHTMLFile that is bundled with PHP and then use XPath to grep the data you need.
This is not the fastest way, but the most readable (in my opinion) in the end. You can use Regex, which will probably be a little faster, but would be bad style (hard to debug, hard to read).
EDIT: Actually this is hard, because the page you mentioned is not valid HTML (see validator.w3.org). Especially tags with no opening/closing counterpart get heavily in the way.
It looks, though, like xmlstarlet (http://xmlstar.sourceforge.net/ - a great tool) is able to repair the problem (run xmlstarlet fo -R). xmlstarlet can also run XPath and XSLT scripts, which can help you extract your data with a simple shell script.
While we were building SerpAPI we tested many platforms/parsers.
The benchmark results for Python are in this full article on Medium:
https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd
The efficiency of a regex is superior to that of a DOM parser.
Look at this comparison:
http://www.rockto.com/launcher/28852/mochien.com/Blog/Read/A300111001736/Regex-VS-DOM-untuk-Rockto-Team
You can find many more by searching the web.
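If you want to sanity-check that claim yourself, here is a rough micro-benchmark sketch (the snippet and iteration count are arbitrary choices of mine; absolute numbers vary by machine and parser):
import re
import timeit
from bs4 import BeautifulSoup

html = '<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>' * 100

regex_time = timeit.timeit(lambda: re.findall(r'<h1>(.*?)</h1>', html), number=200)
dom_time = timeit.timeit(
    lambda: [h1.text for h1 in BeautifulSoup(html, 'html.parser').find_all('h1')],
    number=200)
print('regex: %.3fs, DOM: %.3fs' % (regex_time, dom_time))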
<div>random contents without < or > , but has ( ) <div>
Just need to fix the closing div tag
so it looks like <div>random contents</div>
I need to do it in Python by regex.
The input is exactly like the first line; there will not be any other < or > in the random contents.
replace
(<div>[^<]*<)(div>)
with
$1/$2
Note: This is bad practice, don't do it unless it's absolutely necessary!
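In Python's re.sub the backreferences are spelled \1 and \2 rather than $1 and $2; a minimal sketch:
import re

s = '<div>random contents<div>'
fixed = re.sub(r'(<div>[^<]*<)(div>)', r'\1/\2', s)
print(fixed)  # <div>random contents</div>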
I wouldn't recommend a regex - use something like tidy (which is a Python wrapper around HTML Tidy).
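A sketch of that route, assuming the pytidylib wrapper:
from tidylib import tidy_document

fixed, errors = tidy_document('<div>random contents<div>',
                              options={'show-body-only': 1})
print(fixed)  # Tidy closes the dangling <div> itself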
Avoid using regular expressions for dealing with HTML.
This is how it would be parsed in a DOM tree as it currently is:
>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup('<div>random contents<div>')
<div>random contents<div></div></div>
Or are you wanting to turn the second <div> into </div> (which a browser certainly would not do)?