Beautifulsoup how does findAll work - python

I've noticed some weird behavior in the findAll method:
>>> htmls="<html><body><p class=\"pagination-container\">slytherin</p><p class=\"pagination-container and something\">gryffindor</p></body></html>"
>>> soup=BeautifulSoup(htmls, "html.parser")
>>> for i in soup.findAll("p",{"class":"pagination-container"}):
...     print(i.text)
...
slytherin
gryffindor
>>> for i in soup.findAll("p", {"class":"pag"}):
...     print(i.text)
...
>>> for i in soup.findAll("p",{"class":"pagination-container"}):
...     print(i.text)
...
slytherin
gryffindor
>>> for i in soup.findAll("p",{"class":"pagination"}):
...     print(i.text)
...
>>> len(soup.findAll("p",{"class":"pagination-container"}))
2
>>> len(soup.findAll("p",{"class":"pagination-containe"}))
0
>>> len(soup.findAll("p",{"class":"pagination-contai"}))
0
>>> len(soup.findAll("p",{"class":"pagination-container and something"}))
1
>>> len(soup.findAll("p",{"class":"pagination-conta"}))
0
So, when we search for pagination-container it returns both the first and the second p tag. That made me think it tests for partial equality, something like if passed_string in class_attribute_value:. So I shortened the string passed to findAll, and it never managed to find anything!
How is that possible?

First of all, class is a special multi-valued, space-delimited attribute and gets special handling.
When you write soup.findAll("p", {"class":"pag"}), BeautifulSoup searches for elements that have the class pag. It splits each element's class value on whitespace and checks whether pag is among the resulting items. If you had an element with class="test pag" or class="pag", it would be matched.
Note that in case of soup.findAll("p", {"class": "pagination-container and something"}), BeautifulSoup would match an element having the exact class attribute value. There is no splitting involved in this case - it just sees that there is an element where the complete class value equals the desired string.
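The two rules above can be sketched in plain Python. This is only a simplified model of the behavior described here, not BeautifulSoup's actual implementation; class_matches is a hypothetical helper name introduced for illustration.

```python
def class_matches(attr_value, wanted):
    """Simplified model: would `wanted` match this class attribute value?"""
    # Split the attribute on whitespace: "a  b c " -> ["a", "b", "c"]
    classes = attr_value.split()
    # Rule 1: the search string equals one of the individual classes.
    if wanted in classes:
        return True
    # Rule 2: the search string equals the complete attribute value.
    return wanted == " ".join(classes)


value = "pagination-container and something"
print(class_matches(value, "pagination-container"))  # True (one of the classes)
print(class_matches(value, "pag"))                   # False (no class named "pag")
print(class_matches(value, value))                   # True (complete value)
```

Under this model a shortened string such as "pagination-containe" can never match, which is consistent with the zero-length results in the question.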
To have a partial match on one of the classes, you can provide a regular expression or a function as a class filter value:
import re
soup.find_all("p", {"class": re.compile(r"pag")}) # contains pag
soup.find_all("p", {"class": re.compile(r"^pag")}) # starts with pag
soup.find_all("p", {"class": lambda class_: class_ and "pag" in class_}) # contains pag
soup.find_all("p", {"class": lambda class_: class_ and class_.startswith("pag")}) # starts with pag
There is much more to say, but you should also know that BeautifulSoup has CSS selector support (limited, but it covers most of the common use cases). You can write things like:
soup.select("p.pagination-container") # one of the classes is "pagination-container"
soup.select("p[class='pagination-container']") # match the COMPLETE class attribute value
soup.select("p[class^=pag]") # COMPLETE class attribute value starts with pag
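To see how these three selectors differ, here is a small sketch run against the HTML from the question. It assumes bs4 is installed together with its CSS selector backend; texts is a hypothetical helper added only for this example.

```python
from bs4 import BeautifulSoup

htmls = (
    '<html><body>'
    '<p class="pagination-container">slytherin</p>'
    '<p class="pagination-container and something">gryffindor</p>'
    '</body></html>'
)
soup = BeautifulSoup(htmls, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [p.get_text() for p in soup.select(selector)]


by_class = texts("p.pagination-container")        # both paragraphs
exact = texts("p[class='pagination-container']")  # only the first paragraph
prefix = texts("p[class^=pag]")                   # both: each value starts with "pag"
print(by_class, exact, prefix)
```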
Handling class attribute values in BeautifulSoup is a common source of confusion and questions. Please see these related topics to gain more understanding:
BeautifulSoup returns empty list when searching by compound class names
Finding multiple attributes within the span tag in Python

Related

bs4 findAll not finding class tags

I'm trying to parse a table using bs4. When I use find_all with a specific class, nothing is returned. However, when I do not specify the class, it returns something. For example, this returns the table and all of the td elements:
from bs4 import BeautifulSoup as soup
page_soup = soup(html, 'html.parser')
stat_table = page_soup.find_all('table')
stat_table = stat_table[0]
with open('stats.txt', 'w', encoding='utf-8') as q:
    for row in stat_table.find_all('tr'):
        for cell in row.find_all('td'):
            q.write(cell.text.strip().ljust(18))
If I try to use this:
page_soup = soup(html, 'html.parser')
stat_table = page_soup.find_all('table')
stat_table = stat_table[0]
with open('stats.txt', 'w', encoding='utf-8') as q:
    for row in stat_table.find_all('tr'):
        for cell in row.find_all('td', {'class': 'blah'}):
            q.write(cell.text.strip().ljust(18))
This code should return the specific td element with the specified class, but nothing is returned. Any help would be greatly appreciated.
The class attribute isn't a normal string, but a multi-valued attribute.[1]
For example:
>>> text = "<div><span class='a b c '>spam</span></div>"
>>> soup = BeautifulSoup(text, 'html.parser')
>>> soup.span['class']
['a', 'b', 'c']
To search for a multi-valued attribute, you should pass multiple values:
>>> soup.find('span', class_=('a', 'b', 'c'))
<span class="a b c">spam</span>
Notice that the order of the values doesn't matter and duplicates are ignored. Be aware, though, that BeautifulSoup treats a list, tuple, or set of values as alternatives: an element matches if it has any one of the listed classes, not necessarily all of them:
>>> soup.find('span', class_={'a', 'b', 'c'})
<span class="a b c">spam</span>
>>> soup.find('span', class_=('c', 'b', 'a', 'a'))
<span class="a b c">spam</span>
You can also search on a multi-valued attribute with a string, which will find any elements whose attribute includes that string as one of its values:
>>> soup.find('span', class_='c')
<span class="a b c">spam</span>
But if you pass a string with whitespace… as far as I can tell, what it does isn't actually documented, but what happens in practice is that it will match exactly the (arbitrary) way the string is handed to BeautifulSoup by the parser.
As you can see above, even though the HTML had 'a b c ' in it, BeautifulSoup has turned it into 'a b c'—stripping whitespace off the ends, and turning any internal runs of whitespace into single spaces. So, that's what you have to search for:
>>> soup.find('span', class_='a b c ')
>>> soup.find('span', class_='a b c')
<span class="a b c">spam</span>
But, again, you're better off searching with a sequence or set of separate values than trying to guess how to put them together into a string that happens to work.
So, what you want to do is probably:
for cell in row.find_all('td', {'class': ('column-even', 'table-column-even', 'ft_enrolled')}):
Or, maybe you don't want to think in DOM terms but in CSS-selector terms:
>>> soup.select('span.a.b.c')
[<span class="a b c">spam</span>]
Notice that CSS also doesn't care about the order of the classes, or about duplicates:
>>> soup.select('span.c.a.b.c')
[<span class="a b c">spam</span>]
Also, this allows you to search for a subset of the classes, rather than just one or all of them:
>>> soup.select('span.c.b')
[<span class="a b c">spam</span>]
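One caveat when choosing between the two approaches: as far as I know, a list of class values passed to find_all is treated as a set of alternatives, while a chained CSS selector requires every listed class. A minimal sketch, where the second span is added purely for illustration:

```python
from bs4 import BeautifulSoup

html = "<div><span class='a b c'>spam</span><span class='a'>eggs</span></div>"
soup = BeautifulSoup(html, "html.parser")

# A list of classes means "any of these": both spans carry the class "a".
any_of = [s.get_text() for s in soup.find_all("span", class_=["a", "b"])]

# A chained selector means "all of these": only the first span has both.
all_of = [s.get_text() for s in soup.select("span.a.b")]

print(any_of)  # ['spam', 'eggs']
print(all_of)  # ['spam']
```

So if the cell you are after must carry several classes at once, select is probably the safer choice.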
1. This is a change from Beautiful Soup 3. Which shouldn't even need to be mentioned, as BS3 has been dead for nearly a decade, and doesn't run on Python 3.x or, in some cases, even 2.7. But people keep copying and pasting old BS3 code into blog posts and Stack Overflow answers, so other people keep getting surprised that the code they found online doesn't actually work. If that's what happened here, you need to learn to spot BS3 code so you can ignore it and look elsewhere.

Matching a group with OR condition in pattern

I am trying to extract the text between <th> tags from an HTML table. The following code explains the problem
import re
searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):
    print(i.group(1))
The output produced by the code is
data 1
None
If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to
None
data 2
What is the correct way to catch the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.
Why not use an HTML parser instead? BeautifulSoup, for example:
>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
[u'data 1', u'data 2']
Also note that str is a bad choice for a variable name - you are shadowing a built-in str.
You can reduce the regex to a single capturing group. The \b after th stops it from matching <thead>:
re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
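A quick way to check that pattern against the concern raised in the question. The sample string with a surrounding <thead> is made up for this sketch:

```python
import re

# \b requires a word boundary after "th", so "<thead" is not accepted.
TH_CELL = re.compile(r"(?s)<th\b[^>]*>(.*?)</th>")

sample = '<thead><tr><th class="c1">data 1</th><th>data 2</th></tr></thead>'
cells = TH_CELL.findall(sample)
print(cells)  # ['data 1', 'data 2']
```

Because there is only one capturing group, findall returns the cell text directly, with or without attributes on the tag.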

get a string between a tag (TEST in <div><p>p1</p>TEST<p>p2</p></div>)

Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>', 'html.parser')
print(soup.div())
Result:
[<p>p1</p>, <p>p2</p>]
How come the string TEST isn't in the result set? How can I get it?
soup.div() is a shortcut for soup.div.find_all() which would find you all tags inside the div tag - as you can see, it does the job. TEST is a text between the p tags, or, in other words, the tail of the first p tag.
You can get the TEST string by getting the first p tag and using .next_sibling:
>>> soup.div.p.next_sibling
u'TEST'
Or, by getting the second element of the div's .contents:
>>> soup.div.contents[1]
u'TEST'
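If you would rather not depend on the position of the text, one option is to keep only the div's direct children that are strings rather than tags. A small sketch, assuming bs4 is installed:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<div><p>p1</p>TEST<p>p2</p></div>", "html.parser")

# .children yields both tags and bare strings; keep only the bare strings.
direct_text = [
    str(child)
    for child in soup.div.children
    if isinstance(child, NavigableString)
]
print(direct_text)  # ['TEST']
```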
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>', 'html.parser')
# .text joins every string inside the div, including the text of both p tags
print(soup.div.text)
p1TESTp2

What do group() and contents[] mean?

I'm learning about the re and BeautifulSoup modules, and I have a question about a few lines of the following code. I don't understand what group() does, or what the number inside the brackets of contents[] means.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'http://www.ebay.es/itm/LOTE-5-BOTES-CERVEZAARGUS-SET-5-BEER-CANSLOT-5-CANETTES-BIRES-LATTINE-BIRRA-/321162173293' # input('URL: ')
code = urlopen(url).read()
soup = BeautifulSoup(code, 'html.parser')
tag = soup.find('span', id='v4-27').contents[0]
price_string = re.search(r'(\d+,\d+)', tag).group(1)
precio_final = float(price_string.replace(',', '.'))
print(precio_final)
.contents returns a list of items from a tag. For example:
>>> from bs4 import BeautifulSoup as BS
>>> soup = BS('<span class="foo"> bar baz <a>link</a></span>', 'html.parser')
>>> print(soup.find('span').contents)
[' bar baz ', <a>link</a>]
[0] accesses the first element of the list that .contents returns. In the example above, that is the string ' bar baz '.
.group(1) returns the text matched by the first capturing group (the first pair of parentheses) in the regular expression; .group(0) is the entire match. Your regular expression has a single group wrapped around the whole pattern, so .group(1) returns the complete n1,n2 text, for example 12,50.
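A short sketch of how the group numbers line up. The price string here is invented for illustration:

```python
import re

text = "EUR 12,50"

# Two capturing groups: one for each side of the comma.
match = re.search(r"(\d+),(\d+)", text)
whole = match.group(0)   # entire match: '12,50'
first = match.group(1)   # first pair of parentheses: '12'
second = match.group(2)  # second pair of parentheses: '50'

# The pattern from the question has one group around everything,
# so group(1) is the same text as group(0).
price_string = re.search(r"(\d+,\d+)", text).group(1)
price = float(price_string.replace(",", "."))
print(whole, first, second, price)  # 12,50 12 50 12.5
```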

Using regex to extract all the html attrs

I want to use the re module to extract all the HTML nodes from a string, including all their attrs. However, I want each attr to be a group, which means I can use matchobj.group() to get them. The number of attrs in a node is flexible. This is where I am confused: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I can only get two groups with [('a'), ('style="bbb"')].
I know there are some good HTML parsers. But actually I am not going to extract the values of the attrs. I need to modify the raw string.
Please don't use regex. Use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, 'html.parser')
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb
Or if you want a dictionary:
>>> print(mytag.attrs)
{'href': 'aaa', 'style': 'bbb'}
Description
To capture an arbitrary number of attributes you need a two-step process: first pull out each entire element, then iterate over the elements and collect the matched attributes of each one.
regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>
regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
Python Example
See working example: http://repl.it/J0t/4
Code
import re
string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""
# Step 1: match each complete opening tag, respecting quoted attribute values.
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print("-------")
    print("matchElementObj.group(0) :", matchElementObj.group(0))
    # Step 2: search only inside the matched tag, not the whole input string.
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        print("matchAttributesObj.group(0) :", matchAttributesObj.group(0))
Output
-------
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) : href="i.like.kittens.com"
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) : class=Fonzie
