I'm trying to write a program that will take an HTML file and make it more email-friendly. Right now all the conversion is done manually, because none of the online converters do exactly what we need.
This sounded like a great opportunity to push the limits of my programming knowledge and actually code something useful, so I offered to try to write a program in my spare time to help make the process more automated.
I don't know much about HTML or CSS so I'm mostly relying on my brother (who does know HTML and CSS) to describe what changes this program needs to make, so please bear with me if I ask a stupid question. This is totally new territory for me.
Most of the changes are pretty basic -- if you see tag/attribute X then convert it to tag/attribute Y. But I've run into trouble when dealing with an HTML tag containing a style attribute. For example:
<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
Whenever possible I want to convert the style attributes into HTML attributes (or convert the style attribute to something more email-friendly). So after the conversion it should look like this:
<img src="http://example.com/file.jpg" width="150" height="50" align="right"/>
Now I realize that not all CSS style attributes have an HTML equivalent, so right now I only want to focus on the ones that do. I whipped up a Python script that would do this conversion:
from bs4 import BeautifulSoup
import re

class Styler(object):

    img_attributes = {'float': 'align'}

    def __init__(self, soup):
        self.soup = soup

    def format_factory(self):
        self.handle_image()

    def handle_image(self):
        tag = self.soup.find_all("img", style=re.compile('.'))
        print tag
        for i in xrange(len(tag)):
            old_attributes = tag[i]['style']
            tokens = [s for s in re.split(r'[:;]+|px', str(old_attributes)) if s]
            del tag[i]['style']
            print tokens
            for j in xrange(0, len(tokens), 2):
                if tokens[j] in Styler.img_attributes:
                    tokens[j] = Styler.img_attributes[tokens[j]]
                tag[i][tokens[j]] = tokens[j + 1]

if __name__ == '__main__':
    html = """
    <body>hello</body>
    <img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
    <blockquote>my blockquote text</blockquote>
    <div style="padding-left:25px; padding-right:25px;">text here</div>
    <body>goodbye</body>
    """
    soup = BeautifulSoup(html)
    s = Styler(soup)
    s.format_factory()
Now this script will handle my particular example just fine, but it's not very robust, and I realize that when put up against real-world examples it will easily break. My question is, how can I make this more robust? As far as I can tell, Beautiful Soup doesn't have a way to change or extract individual pieces of a style attribute. I guess that's what I'm looking to do.
For this type of thing, I'd recommend an HTML parser (like BeautifulSoup or lxml) in conjunction with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than trying to come up with regular expressions to match any possible CSS you might find in the wild.
For example:
>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
u'150px'
>>> s.height
u'50px'
>>> s.keys()
[u'width', u'height', u'float']
>>> s.cssText
u'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
u'height: 50px;\nfloat: right'
So, using this you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be a little careful of the newline characters that pop up in the cssText attribute, though. I think cssutils is more designed for formatting things as standalone CSS files, but it's flexible enough to mostly work for what you're doing here.
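Putting the two together, here's a minimal sketch of the img conversion from the question (assuming bs4 and cssutils are installed; the float-to-align mapping is the one the question already uses):

from bs4 import BeautifulSoup
import cssutils

html = '<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />'
soup = BeautifulSoup(html, 'html.parser')

for img in soup.find_all('img', style=True):
    style = cssutils.parseStyle(img['style'])
    width = style.getPropertyValue('width')    # '150px', or '' if absent
    height = style.getPropertyValue('height')
    float_ = style.getPropertyValue('float')
    if width:
        img['width'] = width.replace('px', '')
    if height:
        img['height'] = height.replace('px', '')
    if float_:
        img['align'] = float_  # float:right -> align="right"
    del img['style']

print(soup)  # img now carries width="150", height="50", align="right"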
Instead of reinventing the wheel, use the StoneageHTML package: http://pypi.python.org/pypi/StoneageHTML
I know there is an easy way to copy the full source of a URL, but that's not my task. I need to save just the text (as a web browser user would copy it) to a *.txt file.
Is it unavoidable to parse the HTML source for this, or is there a better way?
I think it is impossible if you don't parse at all. You could use HTMLParser (http://docs.python.org/2/library/htmlparser.html) and just keep the data inside tags, but you will most likely get many other elements than you want.
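For illustration, a rough sketch of that approach (the choice to skip only script and style blocks is my own assumption):

from HTMLParser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []
        self.skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

p = TextExtractor()
p.feed('<p>keep this</p><script>drop_this();</script>')
print(''.join(p.parts))  # keep this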
To get exactly the same as [Ctrl-C] would give, it is very difficult to avoid parsing, because of things like style="display: hidden;" which hide text; that in turn means fully parsing the HTML, JavaScript and CSS of both the document and its resource files.
Parsing is required. I don't know of a library method for exactly this. A simple regex:

import re
text = re.sub(r"<[^>]+>", " ", html)

This requires many improvements, but it's a starting point.
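One such improvement, sketched here: drop <script> and <style> blocks entirely first, then collapse the whitespace left behind.

import re

def crude_text(html):
    # drop script/style blocks, since their contents are not visible text
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    # strip the remaining tags
    html = re.sub(r"<[^>]+>", " ", html)
    # collapse runs of whitespace left behind by the removed tags
    return re.sub(r"\s+", " ", html).strip()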
With python, the BeautifulSoup module is great for parsing HTML, and well worth a look. To get the text from a webpage, it's just a case of:
#!/usr/bin/env python
#
import urllib2
from bs4 import BeautifulSoup
url = 'http://python.org'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
# you can refine this even further if needed... ie. soup.body.div.get_text()
text = soup.body.get_text()
print text
I have something like this:
<img style="background:url(/theRealImage.jpg) no-repate 0 0; height:90px; width:92px;") src="notTheRealImage.jpg"/>
I am using BeautifulSoup to parse the HTML. Is there a way to pull out the "url" in the "background" CSS attribute?
You've got a couple of options: quick and dirty, or the Right Way. The quick and dirty way (which will break easily if the markup is changed) looks like this:
>>> from BeautifulSoup import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('<html><body><img style="background:url(/theRealImage.jpg) no-repate 0 0; height:90px; width:92px;") src="notTheRealImage.jpg"/></body></html>')
>>> style = soup.find('img')['style']
>>> urls = re.findall(r'url\((.*?)\)', style)
>>> urls
[u'/theRealImage.jpg']
Obviously, you'll have to play with that to get it to work with multiple img tags.
The Right Way, since I'd feel awful suggesting someone use a regex on a CSS string :), uses a CSS parser. cssutils, a library I just found via Google and available on PyPI, looks like it might do the job.
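For completeness, the cssutils flavour of the same extraction might look roughly like this (a sketch; I corrected the "no-repate" typo from the question's markup, and getPropertyValue is part of cssutils' CSSStyleDeclaration):

import re
import cssutils

style = cssutils.parseStyle(
    'background: url(/theRealImage.jpg) no-repeat 0 0; height: 90px; width: 92px;')
background = style.getPropertyValue('background')
print(re.findall(r'url\((.*?)\)', background))  # ['/theRealImage.jpg']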
I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it. Preferably in PHP, python or JavaScript.
This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.
BeautifulSoup is the first thing that comes to mind.
Another possibility is copying/pasting it in TextMate and then running regular expressions.
What do you suggest?
This is the script that I ended up writing, but as I said, I'm looking for a more general solution.
from BeautifulSoup import BeautifulSoup
import urllib2

url = 'http://www.datamystic.com/timezone/time_zones.html'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
tables = soup.findAll("table")
table = tables[1]
rows = table.findAll("tr")
for row in rows:
    tds = row.findAll('td')
    if len(tds) == 4:
        countrycode = tds[1].string
        timezone = tds[2].string
        if countrycode is not None and timezone is not None:
            print "'%s' => '%s'," % (countrycode.strip(), timezone.strip())
Comments and suggestions for improvement to my python code welcome, too ;)
For your general problem: try lxml.html from the lxml package (think of it as the stdlib's xml.etree on steroids: the same XML API, but with HTML support, XPath, XSLT, etc.).
A quick example for your specific case:
from lxml import html

tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.findall('//table')[1]
data = [
    [td.text_content().strip() for td in row.findall('td')]
    for row in table.findall('tr')
]
This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)
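To drop those advertisement rows, one option (a sketch, reusing the four-cell check from the script in the question) is to keep only the rows with exactly four cells:

data = [row for row in data if len(row) == 4]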
BUT, more specifically for your particular use case: there are better ways to get at time zone database information than scraping that particular web page (as an aside, note that the page actually says you are not allowed to copy its contents). There are even existing libraries that already use this information; see for example python-dateutil.
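For example, a quick sketch with python-dateutil (the zone name is a standard Olson identifier, not something taken from that page):

from datetime import datetime
from dateutil import tz

print(datetime.now(tz.gettz('Europe/Amsterdam')))  # current time in that zone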
Avoid regular expressions for parsing HTML; they're simply not appropriate for it. You want a DOM parser like BeautifulSoup for sure...
A few other alternatives:
SimpleHTMLDom (PHP)
Hpricot & Nokogiri (Ruby)
Web::Scraper (Perl/CPAN)
All of these are reasonably tolerant of poorly formed HTML.
I suggest loading the document with an XML parser like DOMDocument::loadHTMLFile, which is bundled with PHP, and then using XPath to pull out the data you need.
This is not the fastest way, but the most readable (in my opinion) in the end. You can use a regex instead, which will probably be a little faster, but would be bad style (hard to debug, hard to read).
EDIT: Actually this is hard, because the page you mentioned is not valid HTML (see validator.w3.org). In particular, tags that are never opened or never closed get heavily in the way.
It looks, though, like xmlstarlet (http://xmlstar.sourceforge.net/, a great tool) is able to repair the problem (run xmlstarlet fo -R). xmlstarlet can also run XPath queries and XSLT scripts, which can help you extract your data with a simple shell script.
While we were building SerpAPI we tested many platforms/parsers.
The benchmark results for Python are written up in a full article on Medium:
https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd
The efficiency of a regex is superior to that of a DOM parser.
Look at this comparison:
http://www.rockto.com/launcher/28852/mochien.com/Blog/Read/A300111001736/Regex-VS-DOM-untuk-Rockto-Team
You can find many more by searching the web.
str1 = 'abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>'
We need the contents inside the h1 tag and h2 tag.
What is the best way to do that? Thanks for the help!
The best way if it needs to scale at all would be with something like BeautifulSoup.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>')
>>> soup.h1
<h1>The content we need</h1>
>>> soup.h1.text
u'The content we need'
>>> soup.h2
<h2>The content we need2</h2>
>>> soup.h2.text
u'The content we need2'
It could be done with a regular expression too, but this is probably closer to what you want. A larger example of what you are trying to do would help; without knowing quite what you want to parse, it's hard to help properly.
First bit of advice: DON'T USE REGULAR EXPRESSIONS FOR HTML/XML PARSING!
Now that we've cleared that up, I'd suggest you look at Beautiful Soup. There are other SGML/XML/HTML parsers available for Python, but this one is the favorite for dealing with the sloppy "tag soup" most of us find out in the real world. It doesn't require that the input be standards-conformant or well-formed. If your browser can manage to render it, then Beautiful Soup can probably manage to parse it for you.
(Still tempted to use regular expressions for this task? Thinking "it can't be that bad, I just want to extract what's in the <h1>...</h1> and <h2>...</h2> containers" and "I'll never need to handle any other corner cases"? That way lies madness. The code you write based on that line of reasoning will be fragile. It will work just well enough to pass your tests, then get worse every time you have to fix "just one more thing." Seriously, import a real parser and use it.)
I'm trying to extract text from arbitrary HTML pages. Some of the pages (which I have no control over) have malformed HTML or scripts which make this difficult. Also, I'm on a shared hosting environment, so I can install any Python lib, but I can't install just anything I want on the server.
pyparsing and html2text.py also did not seem to work for malformed html pages.
Example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html
My current implementation is approximately the following:
# Try using BeautifulSoup 3.0.7a
from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup(s)
# strip out comments and scripts before collecting the text
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
for script in soup.findAll('script'):
    script.extract()
body = soup.body(text=True)
text = ''.join(body)
# if BeautifulSoup can't handle it, alter the html by finding the first
# instance of "<body" and replacing everything before it with
# "<html><head></head>", then try BeautifulSoup again with the new html
If BeautifulSoup still does not work, I resort to a heuristic: look at the first and last characters of each line (to see whether it looks like a line of code: # < ;), take a sample of the line, and check whether its tokens are English words or numbers. If too few of the tokens are words or numbers, I guess that the line is code.
I could use machine learning to inspect each line, but that seems a little expensive and I would probably have to train it (since I don't know that much about unsupervised learning), and of course write it as well.
Any advice, tools, or strategies would be most welcome. I also realize that the latter part of this is rather messy: if a line is determined to contain code, I currently throw away the entire line, even if it holds some small amount of actual English text.
Try not to laugh, but:
from subprocess import Popen, PIPE

class TextFormatter:

    def __init__(self, lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        return Popen([self.lynx,
                      '-assume-charset=UTF-8',
                      '-display-charset=UTF-8',
                      '-dump',
                      '-stdin'],
                     stdin=PIPE,
                     stdout=PIPE).communicate(input=unicode_html_source.encode('utf-8'))[0].decode('utf-8')
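Usage, for what it's worth (a sketch; it assumes lynx really is installed at that default path):

print(TextFormatter().html2text(u'<h1>Hello</h1><p>world</p>'))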
I hope you've got lynx!
Well, it depends how good the solution has to be. I had a similar problem, importing hundreds of old html pages into a new website. I basically did
from BeautifulSoup import BeautifulSoup
from html2text import html2text

# remove all that crap around the body and let BS fix the tags
newhtml = "<html><body>%s</body></html>" % (
    u''.join(unicode(tag) for tag in BeautifulSoup(oldhtml).body.contents))
# use html2text to turn it into text
text = html2text(newhtml)
and it worked out, but of course the documents could be so bad that even BS can't salvage much.
BeautifulSoup does badly with malformed HTML. What about some regex-fu?
>>> import re
>>>
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>>
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'
You can then assemble a list of the tags from which you want to extract information, as sketched below.
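Sketching that idea (the tag list here is just an assumption; adjust it to whatever you need):

import re

tags = ['h1', 'h2', 'p']
html = '<h1>Title</h1>junk<p>Body text</p>'
for tag in tags:
    # (?s) lets .*? span newlines, like re.DOTALL in the example above
    for content in re.findall(r'(?s)<%s[^>]*>(.*?)</%s>' % (tag, tag), html):
        print(content)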