I'm doing some HTML cleaning with BeautifulSoup, and I'm new to both Python and BeautifulSoup. Based on an answer I found elsewhere on Stack Overflow, I've got tags being removed correctly as follows:
[s.extract() for s in soup('script')]
But how do I remove inline styles? For instance, the following:
<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">
Should become:
<p>Text</p>
<img href="somewhere.com">
How do I delete the inline class, id, name and style attributes of all elements?
Answers to other similar questions I could find all mentioned using a CSS parser to handle this, rather than BeautifulSoup, but as the task is simply to remove rather than manipulate the attributes, and is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.
You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes like so:
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]
Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():
[tag.decompose() for tag in soup("script")]
Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.
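Putting both ideas from this answer together, here is a minimal sketch. It assumes BeautifulSoup 4 and Python's bundled "html.parser", neither of which the answer specifies, and it uses dict.pop on tag.attrs so that a missing attribute is simply skipped.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="author" id="author_id" name="author_name" '
    'style="color:red;">Text</p>'
    '<script>alert("hi")</script>'
    '<img class="some_image" href="somewhere.com">'
)

# "html.parser" ships with Python; the choice of parser is an assumption.
soup = BeautifulSoup(html, "html.parser")

# Remove <script> elements together with their contents.
for script in soup.find_all("script"):
    script.decompose()

# Drop the unwanted attributes from every remaining tag.
# pop() with a default does nothing when the attribute is absent.
for tag in soup.find_all(True):
    for attribute in ("class", "id", "name", "style"):
        tag.attrs.pop(attribute, None)

cleaned = str(soup)
print(cleaned)
```

The exact serialization of the img tag depends on the parser and formatter, so compare parsed results rather than raw strings if you test this.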
I wouldn't do this in BeautifulSoup - you'll spend a lot of time trying, testing, and working around edge cases.
Bleach does exactly this for you. http://pypi.python.org/pypi/bleach
If you were to do this in BeautifulSoup, I'd suggest you go with the "whitelist" approach, like Bleach does. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.
Here's my solution for Python3 and BeautifulSoup4:
def remove_attrs(soup, whitelist=()):
    for tag in soup.find_all(True):
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup
It supports a whitelist of attributes which should be kept. :) If no whitelist is supplied all the attributes get removed.
What about lxml's Cleaner?
from lxml.html.clean import Cleaner
content_without_styles = Cleaner(style=True).clean_html(content)
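Based on jmk's function, I use this function to remove attributes based on a whitelist.
It works in Python 2 with BeautifulSoup 3:
def clean(tag, whitelist=()):
    tag.attrs = []  # clear the attributes of the root tag itself
    for e in tag.findAll(True):
        # BeautifulSoup 3 stores attrs as a list of (name, value) tuples;
        # iterate over a copy because items are deleted inside the loop
        for attribute in list(e.attrs):
            if attribute[0] not in whitelist:
                del e[attribute[0]]
        # e.attrs = []  # delete all attributes
    return tag
# example: keep only title and href
clean(soup, ["title", "href"])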
Not perfect, but short (here whitelist is a list of tag names, and only the text of those tags is kept):
' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)])
I achieved this using a regular expression.
import re
def removeStyle(html):
    # a raw string avoids invalid-escape warnings in the pattern
    style = re.compile(r' style=.*?".*?"')
    html = re.sub(style, '', html)
    return html
html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
removeStyle(html)
Output: <p class="author" id="author_id" name="author_name">Text</p>
You can use this to strip any inline attribute by replacing "style" in the regex with the attribute's name.
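As a rough sketch of that generalization, the attribute name can be passed in as a parameter. The helper name strip_attribute is made up for this example, and the pattern only handles quoted values, so it will also match look-alike text outside of tags and is no substitute for a real parser.

```python
import re

def strip_attribute(html, name):
    """Remove name="..." or name='...' wherever it appears in an HTML string.

    Hypothetical helper: it only handles quoted attribute values and does
    not check that the match is actually inside a tag.
    """
    pattern = re.compile(
        r'\s+' + re.escape(name) + r'\s*=\s*("[^"]*"|\'[^\']*\')',
        re.IGNORECASE,
    )
    return pattern.sub('', html)

html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
for attr in ("class", "id", "name", "style"):
    html = strip_attribute(html, attr)
print(html)  # <p>Text</p>
```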
Newbie with Beautiful Soup would appreciate any pointers.
I'm working with a page which has a lot of:
<p data-v-04dd08f2> .. </p>
elements. Inside each p is a string value, which I need, and an embedded span.
The question might be very simple: I am trying to use find_all to get a list of all those elements, which I would subsequently parse to pull out the tokens I need.
Can someone put me out of my misery and tell me how the find_all call should be structured to get these?
I've tried:
find_all('p', {'data': 'v-04dd08f2'})  # nope
find_all('p', {'attributes': 'v-04dd08f2'})  # nope
and lots of other combinations all to no avail.
Thanks!
If you are willing to use CSS selectors instead (which I personally prefer to BeautifulSoup's find_* methods), and the paragraph tags are in fact exactly as you indicated, with "data-v-04dd08f2" as an attribute of the tag, then the following should do the trick:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p data-v-04dd08f2> .. </p>', 'html.parser')
p_tags = soup.select('p[data-v-04dd08f2]')
print(p_tags)
#[<p data-v-04dd08f2=""> .. </p>]
bs4 uses SoupSieve to implement CSS selectors; the SoupSieve docs cover selecting by attribute. Note that, based on your attempts, I suspect you might actually be looking for p tags that have a data attribute equal to 'v-04dd08f2'. If that's the case, the selector should be soup.select('p[data=v-04dd08f2]').
This will return all elements that have an attribute whose name starts with "data-v-":
match_pattern = 'data-v-'
m = soup.findAll(lambda tag: any(attr.startswith(match_pattern) for attr in tag.attrs.keys()))
element.attrs is a key-value structure, {attribute_name: attribute_value}
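For completeness, the same lookup can be done with find_all itself. This is only a sketch, assuming BeautifulSoup 4 with "html.parser" and a made-up sample document: passing True as the attribute value matches any tag that has the attribute, whatever its value.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up sample markup for illustration.
html = (
    '<div><p data-v-04dd08f2>first <span>x</span></p>'
    '<p>other</p>'
    '<p data-v-04dd08f2>second</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# True means "the attribute exists". The attrs dict is needed because
# the attribute name contains hyphens and so cannot be a keyword argument.
p_tags = soup.find_all("p", attrs={"data-v-04dd08f2": True})

# Keep only the strings sitting directly inside each <p>, skipping the <span>.
texts = [
    "".join(c for c in p.children if isinstance(c, NavigableString)).strip()
    for p in p_tags
]
print(texts)
```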
Suppose I want to parse an HTML document using BeautifulSoup and use CSS selectors to find specific tags. I would "soupify" it by doing
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
If I wanted to find a tag whose "id" attribute has a value of "abc" I can do
soup.select('#abc')
If I wanted to find all "a" child tags under our current tag, we could do
soup.select('#abc a')
But now suppose I want to find all "a" tags whose 'href' attribute has a value that ends in "xyz". I would want to use a regex for that; I was hoping for something along the lines of
soup.select('#abc a[href] = re.compile(r"xyz$")')
I cannot find anything that says BeautifulSoup's .select() method supports regex.
The soup.select() function only supports CSS syntax; regular expressions are not part of that.
You can use such syntax to match attributes ending with text:
soup.select('#abc a[href$="xyz"]')
See the CSS attribute selectors documentation over on MSDN.
You can always use the results of a CSS selector to continue the search:
import re
for element in soup.select('#abc'):
    child_elements = element.find_all(href=re.compile(r'^http://example\.com/\d+\.html'))
Note that, as the element.select() documentation states:
This is a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly: it’s a lot faster, and it supports more CSS selectors. But this lets you combine simple CSS selectors with the Beautiful Soup API.
Emphasis mine.
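Here is a small self-contained sketch of both approaches from this answer, assuming BeautifulSoup 4 with "html.parser" and a made-up document: a pure CSS suffix match, and a CSS selector used to narrow the scope before a regular expression does the finer matching.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
html = """
<div id="abc">
  <a href="http://example.com/1.html">one</a>
  <a href="http://example.com/page.xyz">two</a>
  <a href="http://example.com/22.html">three</a>
</div>
<a href="http://example.com/outside.xyz">four</a>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS only: href ends with "xyz", restricted to descendants of #abc.
ends_with_xyz = [a.get_text() for a in soup.select('#abc a[href$="xyz"]')]

# CSS to narrow the scope, then a regular expression for the finer match.
pattern = re.compile(r'^http://example\.com/\d+\.html$')
numbered = [
    a.get_text()
    for container in soup.select('#abc')
    for a in container.find_all('a', href=pattern)
]
print(ends_with_xyz, numbered)
```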
Can someone please explain how findAll works in BeautifulSoup?
My question is about this line: A = soup.findAll('strong',{'class':'name fn'}). It looks like it finds elements matching certain criteria.
But the original markup of the web page looks like ......<STRONG class="name fn">iPod nano 16GB</STRONG>......
How does ('strong',{'class':'name fn'}) pick it up? Thanks.
Original Python code:
from bs4 import BeautifulSoup
import urllib2
import re
url="http://m.harveynorman.com.au/ipods-audio-music/ipods/ipods"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
A = soup.findAll('strong',{'class':'name fn'})
for B in A:
    print B.renderContents()
From the BeautifulSoup docs:
Beautiful Soup provides many methods that traverse (go through) the parse tree, gathering Tags and NavigableStrings that match criteria you specify.
From the section on the basic find method, findAll(name, attrs, recursive, text, limit, **kwargs):
The findAll method traverses the tree, starting at the given point, and finds all the Tag and NavigableString objects that match the criteria you give. The signature for the findAll method is this:
findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
The name argument can be used to pass in:
a tag name (e.g. 'b' to find every <b> tag)
a regular expression
a list or dictionary
the value True
a callable object
The keyword arguments impose restrictions on the attributes of a tag.
It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is also a Python reserved word.
You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }) (like the code you gave), but that's a lot of code for such a common operation. Instead, you can pass a string for attrs rather than a dictionary, and it will be matched against the CSS class.
The doc has further examples to help you understand.
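To make the matching concrete, here is a small sketch using BeautifulSoup 4 (where find_all is the current spelling of findAll and class_ is the keyword form) with "html.parser". The markup is adapted from the question and the second strong tag is invented. The parser lower-cases tag names, which is why 'strong' finds <STRONG>.

```python
from bs4 import BeautifulSoup

html = (
    '<div><STRONG class="name fn">iPod nano 16GB</STRONG>'
    '<strong class="price">$199</strong></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Tag name plus an attribute dictionary: the string 'name fn' is compared
# against the full value of the class attribute.
by_dict = soup.find_all('strong', {'class': 'name fn'})

# Keyword form: a single class name also matches tags that carry
# several classes.
by_keyword = soup.find_all('strong', class_='fn')

names = [tag.get_text() for tag in by_dict]
print(names)
```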
I'm trying to write a program that will take an HTML file and make it more email friendly. Right now all the conversion is done manually because none of the online converters do exactly what we need.
This sounded like a great opportunity to push the limits of my programming knowledge and actually code something useful so I offered to try to write a program in my spare time to help make the process more automated.
I don't know much about HTML or CSS so I'm mostly relying on my brother (who does know HTML and CSS) to describe what changes this program needs to make, so please bear with me if I ask a stupid question. This is totally new territory for me.
Most of the changes are pretty basic -- if you see tag/attribute X then convert it to tag/attribute Y. But I've run into trouble when dealing with an HTML tag containing a style attribute. For example:
<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
Whenever possible I want to convert the style attributes into HTML attributes (or convert the style attribute to something more email friendly). So after the conversion it should look like this:
<img src="http://example.com/file.jpg" width="150" height="50" align="right"/>
Now I realize that not all CSS style attributes have an HTML equivalent, so right now I only want to focus on the ones that do. I whipped up a Python script that would do this conversion:
from bs4 import BeautifulSoup
import re
class Styler(object):
    img_attributes = {'float' : 'align'}
    def __init__(self, soup):
        self.soup = soup
    def format_factory(self):
        self.handle_image()
    def handle_image(self):
        tag = self.soup.find_all("img", style = re.compile('.'))
        print tag
        for i in xrange(len(tag)):
            old_attributes = tag[i]['style']
            tokens = [s for s in re.split(r'[:;]+|px', str(old_attributes)) if s]
            del tag[i]['style']
            print tokens
            for j in xrange(0, len(tokens), 2):
                if tokens[j] in Styler.img_attributes:
                    tokens[j] = Styler.img_attributes[tokens[j]]
                tag[i][tokens[j]] = tokens[j+1]
if __name__ == '__main__':
    html = """
    <body>hello</body>
    <img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
    <blockquote>my blockquote text</blockquote>
    <div style="padding-left:25px; padding-right:25px;">text here</div>
    <body>goodbye</body>
    """
    soup = BeautifulSoup(html)
    s = Styler(soup)
    s.format_factory()
Now this script will handle my particular example just fine, but it's not very robust and I realize that when put up against real world examples it will easily break. My question is, how can I make this more robust? As far as I can tell Beautiful Soup doesn't have a way to change or extract individual pieces of a style attribute. I guess that's what I'm looking to do.
For this type of thing, I'd recommend an HTML parser (like BeautifulSoup or lxml) in conjunction with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than trying to come up with regular expressions to match any possible CSS you might find in the wild.
For example:
>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
u'150px'
>>> s.height
u'50px'
>>> s.keys()
[u'width', u'height', u'float']
>>> s.cssText
u'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
u'height: 50px;\nfloat: right'
So, using this you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be a little careful of the newline characters that pop up in the cssText attribute, though. I think cssutils is more designed for formatting things as standalone CSS files, but it's flexible enough to mostly work for what you're doing here.
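To show how the pieces might fit together, here is a rough sketch in Python 3 with BeautifulSoup 4. To keep it dependency-light it does not use cssutils; a naive split on ";" and ":" stands in for the CSS parser, so it will break on values such as url(...) that contain those characters. The names CSS_TO_HTML, parse_style and inline_style_to_attrs are invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical mapping from CSS property to HTML attribute; extend as needed.
CSS_TO_HTML = {"width": "width", "height": "height", "float": "align"}

def parse_style(style):
    """Naively split 'a:b; c:d' into a dict (stand-in for a real CSS parser)."""
    declarations = {}
    for chunk in style.split(";"):
        if ":" in chunk:
            prop, value = chunk.split(":", 1)
            declarations[prop.strip().lower()] = value.strip()
    return declarations

def inline_style_to_attrs(soup):
    """Move known CSS properties of <img> tags into HTML attributes."""
    for tag in soup.find_all("img", style=True):
        leftover = []
        for prop, value in parse_style(tag["style"]).items():
            if prop in CSS_TO_HTML:
                # Pixel lengths become bare numbers, e.g. "150px" -> "150".
                if value.endswith("px"):
                    value = value[:-2]
                tag[CSS_TO_HTML[prop]] = value
            else:
                leftover.append(prop + ":" + value)
        if leftover:
            # Keep properties that have no HTML equivalent in the mapping.
            tag["style"] = ";".join(leftover)
        else:
            tag.attrs.pop("style", None)
    return soup

html = '<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />'
soup = inline_style_to_attrs(BeautifulSoup(html, "html.parser"))
img = soup.img
print(img.attrs)
```

Swapping parse_style for cssutils.parseStyle would give more robust parsing; the rest of the loop would stay roughly the same.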
Instead of reinventing the wheel, use the StoneageHTML package: http://pypi.python.org/pypi/StoneageHTML
I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following:
<html>
  <body>
    <div id="container">
      <div id="contents">
        <table>
          <tbody>
            <tr>
              <td class="author">####I want whatever is located here ###</td>
            </tr>
          </tbody>
        </table>
      </div>
    </div>
  </body>
</html>
I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do?
At the moment, my code looks like what is below:
import re
import urllib2,sys
import lxml
from lxml import etree
from lxml.html.soupparser import fromstring
from lxml.etree import tostring
from lxml.cssselect import CSSSelector
from BeautifulSoup import BeautifulSoup, NavigableString
address='http://www.example.com/'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
html=soup.prettify()
html=html.replace('&nbsp;', ' ')
html=html.replace('&iacute;','í')
root=fromstring(html)
I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in my source file.
EDIT: I suppose that I didn't make this quite clear, but I have multiple such tags in the page that I want to scrape.
It's not clear to me from your question why you need to worry about the div tags -- what about doing just:
soup = BeautifulSoup(html)
thetd = soup.find('td', attrs={'class': 'author'})
print thetd.string
On the HTML you give, running this emits exactly:
####I want whatever is located here ###
which appears to be what you want. Maybe you can specify better exactly what it is you need and this super-simple snippet doesn't do -- multiple td tags all of class author of which you need to consider (all? just some? which ones?), possibly missing any such tag (what do you want to do in that case), and the like. It's hard to infer what exactly are your specs, just from this simple example and overabundant code;-).
Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author:
thetds = soup.findAll('td', attrs={'class': 'author'})
for thetd in thetds:
    print thetd.string
...i.e., not much harder at all!-)
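A Python 3 / BeautifulSoup 4 sketch of the same idea follows, also covering two of the cases raised above: cells with nested markup and a tag that is missing altogether. The author names and the "editor" class are invented for illustration, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Made-up sample markup following the skeleton in the question.
html = """
<div id="container"><div id="contents"><table><tbody>
<tr><td class="author">Alice <b>Smith</b></td></tr>
<tr><td class="author">Bob Jones</td></tr>
<tr><td class="title">Not an author</td></tr>
</tbody></table></div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text() also works when the cell contains nested tags,
# where .string would be None.
authors = [
    td.get_text(" ", strip=True)
    for td in soup.find_all("td", class_="author")
]

# find() returns None when nothing matches, so guard before using the result.
missing = soup.find("td", class_="editor")
editor = missing.get_text(strip=True) if missing is not None else None
print(authors, editor)
```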
Or you could use pyquery, since BeautifulSoup 3 is no longer actively maintained; see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
First, install pyquery with
easy_install pyquery
Then your script could be as simple as
from pyquery import PyQuery
d = PyQuery('http://mywebpage/')
# iterating a PyQuery object yields lxml elements, whose text is an attribute
allauthors = [ td.text for td in d('td.author') ]
pyquery uses the CSS selector syntax familiar from jQuery, which I find more intuitive than BeautifulSoup's. It uses lxml underneath and is much faster than BeautifulSoup. But BeautifulSoup is pure Python, and thus works on Google's App Engine as well.
The lxml library is now the standard for parsing HTML in Python. The interface can seem awkward at first, but it is very serviceable for what it does.
You should let the library handle the XML specifics, such as those escaped &entities;:
import lxml.html
html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
<td class="author">####I want whatever is located here, eh? &iacute; ###</td>
</tr></tbody></table></div></div></body></html>"""
root = lxml.html.fromstring(html)
tds = root.cssselect("div#contents td.author")
print tds # gives [<Element td at 84ee2cc>]
print tds[0].text # what you want, including the 'í'
BeautifulSoup is certainly the canonical HTML parser/processor. But if you have just this kind of snippet you need to match, instead of building a whole hierarchical object representing the HTML, pyparsing makes it easy to define leading and trailing HTML tags as part of creating a larger search expression:
from pyparsing import makeHTMLTags, withAttribute, SkipTo
author_td, end_td = makeHTMLTags("td")
# only interested in <td>'s where class="author"
author_td.setParseAction(withAttribute(("class","author")))
search = author_td + SkipTo(end_td)("body") + end_td
for match in search.searchString(html):
    print match.body
Pyparsing's makeHTMLTags function does a lot more than just emit "<tag>" and "</tag>" expressions. It also handles:
caseless matching of tags
"<tag/>" syntax
zero or more attributes in the opening tag
attributes defined in arbitrary order
attribute names with namespaces
attribute values in single, double, or no quotes
intervening whitespace between tag and symbols, or attribute name, '=', and value
attributes are accessible after parsing as named results
These are the common pitfalls when considering using a regex for HTML scraping.