How does `findAll` work in BeautifulSoup? - python

Can someone please explain how findAll works in BeautifulSoup?
My question is about this line: A = soup.findAll('strong',{'class':'name fn'}). It looks like it finds elements matching certain criteria,
but the original source of the webpage looks like ......<STRONG class="name fn">iPod nano 16GB</STRONG>......
How does ('strong',{'class':'name fn'}) pick it up? Thanks.
Original Python code:
from bs4 import BeautifulSoup
import urllib2
import re
url="http://m.harveynorman.com.au/ipods-audio-music/ipods/ipods"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
A = soup.findAll('strong',{'class':'name fn'})
for B in A:
    print B.renderContents()

From the Beautiful Soup docs:
Beautiful Soup provides many methods that traverse (go through) the parse tree, gathering Tags and NavigableStrings that match criteria you specify.
From the section "The basic find method: findAll(name, attrs, recursive, text, limit, **kwargs)":
The findAll method traverses the tree, starting at the given point, and finds all the Tag and NavigableString objects that match the criteria you give. The signature for the findAll method is this:
findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
The name argument can be used to pass in:
a tag name (e.g. 'b')
a regular expression
a list or dictionary
the value True
a callable object
The keyword arguments impose restrictions on the attributes of a tag.
It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is also a Python reserved word.
You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }) (like the code you gave), but that's a lot of code for such a common operation. Instead, you can pass a string for attrs rather than a dictionary; the string is then used to restrict the CSS class.
The doc has further examples to help you understand.
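To connect this back to the question, here is a minimal sketch of why ('strong', {'class': 'name fn'}) picks up <STRONG class="name fn">. It assumes bs4 with the built-in html.parser; the markup and the second "price" element are made up for illustration. HTML tag names are case-insensitive, so the parser stores STRONG as strong, and a class string containing a space is compared against the exact value of the class attribute.

```python
from bs4 import BeautifulSoup

# Made-up snippet modelled on the markup quoted in the question.
html = (
    '<div>'
    '<STRONG class="name fn">iPod nano 16GB</STRONG>'
    '<strong class="price">$199</strong>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")

# The parser lowercases tag names, so 'strong' matches <STRONG>.
# 'name fn' is matched against the exact string value of the class attribute.
matches = soup.find_all("strong", {"class": "name fn"})
names = [tag.get_text() for tag in matches]

# The keyword form is an equivalent spelling in bs4.
same = soup.find_all("strong", class_="name fn")

print(names)
```

Searching for a single class, for example class_="fn", would also match this element, because bs4 treats class as a multi-valued attribute.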

Related

Regular Expressions or BeautifulSoup - Varying Cases

I have 3 strings I am looking to retrieve that are characterized by the presence of two words: section and front. I'm terrible with regex.
contentFrame wsj-sectionfront economy_sf
contentFrame wsj-sectionfront business_sf
section-front markets
How can I match both of these words using one regular expression? This will be used to match the contents of a html page parsed by BeautifulSoup.
UPDATE:
I want to extract the main body of a webpage (https://www.wsj.com/news/business) that has the div tag: Main Content Housing. For some reason, BeautifulSoup isn't recognizing the highlighted class attribute using:
wsj_soup.find('div', attrs = {'class':'contentFrame wsj-sectionfront business_sf'})
# Returns []
I'm trying to stay in BeautifulSoup as much as possible, but if regex is the way to go I will use that. From there I will more than likely search using the contents attribute to search for relevant keywords, but if anyone has a better idea of how to approach it please share.
One way to handle this would be to use two separate lookaheads which check for each of these words:
^(?=.*section)(?=.*front).*$
Demo
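As a quick check of the lookahead pattern above, the following sketch runs it over the three strings from the question using only the standard re module. The fourth string is a made-up negative case added for illustration.

```python
import re

# Each lookahead asserts that its word appears somewhere in the string,
# so the order of "section" and "front" does not matter.
pattern = re.compile(r"^(?=.*section)(?=.*front).*$")

candidates = [
    "contentFrame wsj-sectionfront economy_sf",
    "contentFrame wsj-sectionfront business_sf",
    "section-front markets",
    "contentFrame frontpage",  # made-up negative: has "front" but not "section"
]

matched = [s for s in candidates if pattern.match(s)]
print(matched)
```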

BeautifulSoup won't parse Article element

I'm working on parsing this web page.
I've got table = soup.find("div",{"class","accordions"}) to get just the fixtures list (and nothing else) however now I'm trying to loop through each match one at a time. It looks like each match starts with an article element tag <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">
However for some reason when I try to use matches = table.findAll("article",{"role","article"})
and then print the length of matches, I get 0.
I've also tried to say matches = table.findAll("article",{"about","/fixture/arsenal"}) but get the same issue.
Is BeautifulSoup unable to parse tags, or am I just using it wrong?
Try this:
matches = table.findAll('article', attrs={'role': 'article'})
The reason is that {"role","article"} is a set, not a dictionary, so findAll does not read it as an attribute name and value. Refer to the bs4 docs.
You need to pass the attributes as a dictionary. There are three ways in which you can get the data you want.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.arsenal.com/fixtures')
soup = BeautifulSoup(r.text, 'lxml')
matches = soup.find_all('article', {'role': 'article'})
print(len(matches))
# 16
Or, this is also the same:
matches = soup.find_all('article', role='article')
But both of these methods return some extra article tags that don't contain Arsenal fixtures. So, if you want to find them using /fixture/arsenal, you can use CSS selectors. (Passing a plain string to find_all won't work, as it has to equal the whole attribute value and you need a partial match.)
matches = soup.select('article[about^="/fixture/arsenal"]')
print(len(matches))
# 13
Also, have a look at the keyword arguments. It'll help you get what you want.
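The approaches above can be tried offline with a small sketch. The markup below is invented to resemble the page described in the question (only the first about value comes from it), and it assumes bs4 with html.parser. It also shows a compiled regular expression passed to find_all as another way to get a prefix match.

```python
import re
from bs4 import BeautifulSoup

# Invented markup standing in for the fixtures page.
html = """
<div class="accordions">
  <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">A</article>
  <article role="article" about="/news/some-story">B</article>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute filter given as a dictionary (note the colon, not a comma).
by_role = soup.find_all("article", {"role": "article"})

# Prefix match with a CSS attribute selector; the value is quoted.
by_prefix_css = soup.select('article[about^="/fixture/arsenal"]')

# Prefix match with a regular expression passed as a keyword argument.
by_prefix_re = soup.find_all("article", about=re.compile(r"^/fixture/arsenal"))

print(len(by_role), len(by_prefix_css), len(by_prefix_re))
```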

Python beautiful soup: is it soup.findAll, or soup.find_all? [duplicate]

I would like to parse an HTML file with Python, and the module I am using is BeautifulSoup.
It is said that the function find_all is the same as findAll. I've tried both of them, but I believe they are different:
import urllib, urllib2, cookielib
from BeautifulSoup import *
site = "http://share.dmhy.org/topics/list?keyword=TARI+TARI+team_id%3A407"
rqstr = urllib2.Request(site)
rq = urllib2.urlopen(rqstr)
fchData = rq.read()
soup = BeautifulSoup(fchData)
t = soup.findAll('tr')
Can anyone tell me the difference?
In BeautifulSoup version 4, the methods are exactly the same; the mixed-case versions (findAll, findAllNext, nextSibling, etc.) have all been renamed to conform to the Python style guide, but the old names are still available to make porting easier. See Method Names for a full list.
In new code, you should use the lowercase versions, so find_all, etc.
In your example however, you are using BeautifulSoup version 3 (discontinued since March 2012, don't use it if you can help it), where only findAll() is available. Unknown attribute names (such as .find_all, which only is available in BeautifulSoup 4) are treated as if you are searching for a tag by that name. There is no <find_all> tag in your document, so None is returned for that.
From the source code of BeautifulSoup:
http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1260
def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    # ...
    # ...
findAll = find_all # BS3
findChildren = find_all # BS2
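A short sketch of both points, assuming BeautifulSoup 4 with html.parser and a made-up two-row table: the two method names return the same result, and an attribute name the object does not define is treated as a tag lookup. Some recent bs4 releases may emit a deprecation warning for the old name.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

new_style = soup.find_all("tr")
old_style = soup.findAll("tr")  # legacy alias kept so BS3 code keeps working

# An unknown attribute is treated as a search for a tag with that name,
# so this looks for a <nosuchelement> tag and finds nothing.
missing = soup.nosuchelement

print(len(new_style), new_style == old_style, missing)
```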

Does BeautifulSoup .select() method support use of regex?

Suppose I want to parse a html using BeautifulSoup and I wanted to use css selectors to find specific tags. I would "soupify" it by doing
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
If I wanted to find a tag whose "id" attribute has a value of "abc" I can do
soup.select('#abc')
If I wanted to find all "a" child tags under our current tag, we could do
soup.select('#abc a')
But now, suppose I want to find all "a" tags whose 'href' attributes have values that end in "xyz". I would want to use regex for that; I was hoping for something along the lines of
soup.select('#abc a[href] = re.compile(r"xyz$")')
I can not seem to find anything that says BeautifulSoup's .select() method will support regex.
The soup.select() function only supports CSS syntax; regular expressions are not part of that.
You can use such syntax to match attributes ending with text:
soup.select('#abc a[href$="xyz"]')
See the CSS attribute selectors documentation over on MSDN.
You can always use the results of a CSS selector to continue the search:
for element in soup.select('#abc'):
    child_elements = element.find_all(href=re.compile(r'^http://example\.com/\d+\.html'))
Note that, as the element.select() documentation states:
This is a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly: it’s a lot faster, and it supports more CSS selectors. But this lets you combine simple CSS selectors with the Beautiful Soup API.
Emphasis mine.
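Both techniques from this answer can be seen side by side in a small sketch. It assumes bs4 with html.parser; the links and their text are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: three links inside #abc and one outside it.
html = """
<div id="abc">
  <a href="http://example.com/page-xyz">one</a>
  <a href="http://example.com/123.html">two</a>
  <a href="http://example.com/other">three</a>
</div>
<a href="http://example.com/outside-xyz">outside</a>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS only: href values that end with "xyz", restricted to descendants of #abc.
ends_with_xyz = soup.select('#abc a[href$="xyz"]')

# CSS to narrow the scope, then a regular expression via find_all.
numbered = []
for element in soup.select("#abc"):
    numbered.extend(
        element.find_all("a", href=re.compile(r"^http://example\.com/\d+\.html$"))
    )

print([a.get_text() for a in ends_with_xyz])
print([a.get_text() for a in numbered])
```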

Remove all inline styles using BeautifulSoup

I'm doing some HTML cleaning with BeautifulSoup. Noob to both Python & BeautifulSoup. I've got tags being removed correctly as follows, based on an answer I found elsewhere on Stackoverflow:
[s.extract() for s in soup('script')]
But how to remove inline styles? For instance the following:
<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">
Should become:
<p>Text</p>
<img href="somewhere.com">
How to delete the inline class, id, name & style attributes of all elements?
Answers to other similar questions I could find all mentioned using a CSS parser to handle this, rather than BeautifulSoup, but as the task is simply to remove rather than manipulate the attributes, and is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.
You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes like so:
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]
Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():
[tag.decompose() for tag in soup("script")]
Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.
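The difference between extract() and decompose() can be sketched as follows, assuming bs4 with html.parser and a made-up snippet: extract() detaches the tag and hands it back, while decompose() removes and destroys it.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = "<div><script>var a = 1;</script><p>Text</p><style>p {}</style></div>"
soup = BeautifulSoup(html, "html.parser")

# extract() removes the tag from the tree and returns it, still usable.
removed = soup.script.extract()

# decompose() removes the tag and its contents without returning them.
soup.style.decompose()

remaining = str(soup)
print(removed.name, remaining)
```

A plain for loop over soup("script") calling decompose() does the same job as the list comprehension without building a throwaway list.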
I wouldn't do this in BeautifulSoup - you'll spend a lot of time trying, testing, and working around edge cases.
Bleach does exactly this for you. http://pypi.python.org/pypi/bleach
If you were to do this in BeautifulSoup, I'd suggest you go with the "whitelist" approach, like Bleach does. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.
Here's my solution for Python3 and BeautifulSoup4:
def remove_attrs(soup, whitelist=tuple()):
    for tag in soup.findAll(True):
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup
It supports a whitelist of attributes which should be kept. :) If no whitelist is supplied all the attributes get removed.
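For comparison, here is a variation on the same whitelist idea that rebuilds each tag's attribute dictionary instead of deleting keys one at a time. It is a sketch under assumptions, not the answer's code: the name strip_attributes and its keep parameter are hypothetical, and it assumes bs4 with html.parser and the markup from the question.

```python
from bs4 import BeautifulSoup


def strip_attributes(soup, keep=()):
    """Drop every attribute whose name is not listed in keep (hypothetical helper)."""
    for tag in soup.find_all(True):
        # Rebuild the dict rather than mutating it while iterating over it.
        tag.attrs = {name: value for name, value in tag.attrs.items() if name in keep}
    return soup


html = (
    '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
    '<img class="some_image" href="somewhere.com">'
)

soup = BeautifulSoup(html, "html.parser")
cleaned = str(strip_attributes(soup, keep=("href",)))
print(cleaned)
```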
What about lxml's Cleaner?
from lxml.html.clean import Cleaner
content_without_styles = Cleaner(style=True).clean_html(content)
Based on jmk's function, I use this function to remove attributes based on a whitelist.
It works in Python 2 with BeautifulSoup 3:
def clean(tag, whitelist=[]):
    tag.attrs = None
    for e in tag.findAll(True):
        for attribute in e.attrs:
            if attribute[0] not in whitelist:
                del e[attribute[0]]
        #e.attrs = None  # delete all attributes
    return tag
# example to keep only title and href
clean(soup,["title","href"])
Not perfect but short:
' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)]);
I achieved this using re and regex.
import re
def removeStyle(html):
    # Raw string avoids invalid escape sequences; matches ' style="..."'
    style = re.compile(r' style=.*?".*?"')
    html = re.sub(style, '', html)
    return html
html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
removeStyle(html)
Output: <p class="author" id="author_id" name="author_name">Text</p>
You can use this to strip any inline attribute by replacing "style" in the regex with the attribute's name.
