Matching a group with OR condition in pattern - python

I am trying to extract the text between <th> tags from an HTML table. The following code shows the problem:
import re
searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):
    print(i.group(1))
The output produced by the code is
data 1
None
If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to
None
data 2
What is the correct way to capture the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.

Why not use an HTML parser instead? With BeautifulSoup, for example:
>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
[u'data 1', u'data 2']
Also note that str is a bad choice for a variable name - you are shadowing a built-in str.

You can reduce the regex to a single capturing group, as below. The \b after th is a word boundary, so the pattern does not match <thead>.
re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
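As a quick check, here is a minimal sketch that runs the single-group pattern and also shows how the original two-group alternation can be used by picking whichever group matched. The sample string with a surrounding <thead> tag and the variable names are assumptions made for illustration.

```python
import re

# Hypothetical sample: includes a <thead> tag that must not be matched.
searchstr = '<thead><th class="c1">data 1</th><th>data 2</th></thead>'

# Single group: \b stops "<th" from matching "<thead"; [^>]* skips any attributes.
single = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
cells = single.findall(searchstr)

# The two-group alternation from the question also works if you take
# whichever group took part in the match. Testing "is not None" (rather
# than using "or") keeps an empty cell as '' instead of falling through.
double = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
cells_alt = [
    m.group(1) if m.group(1) is not None else m.group(2)
    for m in double.finditer(searchstr)
]

print(cells)
print(cells_alt)
```

Both lists contain the two cell texts; the single-group version is shorter and does not depend on which branch matched.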

Related

BeautifulSoup cannot extract item using find_all()

I am trying to locate a piece of text in HTML like the following using BeautifulSoup. Here is my HTML:
<p><em>code of Drink<br></em>
Budweiser: 4BDB1CD96<br>
price: 10$</p>
and my code:
soup = BeautifulSoup(html,'lxml')
result = re.escape('4BDB1CD96')
tag = soup.find(['li','div','p','em'],string=re.compile(result))
I cannot extract the tag, but when I change the find() call to:
tag = soup.find(string=re.compile(result))
then I can get the result:
Budweiser: 4BDB1CD96
So I want to know why this happens, and how to get the result as a tag.
The problem here is that the text you are searching for sits inside a tag (p here) that also contains nested tags. Such a tag has more than one child, so its .string is None and the string= filter has nothing to match against.
So, the easiest approach is to pass a lambda to .find() that checks the tag name and whether the tag's .text property contains your pattern. Here, you do not even need a regex:
>>> tag = soup.find(lambda t: t.name in ['li','div','p','em'] and '4BDB1CD96' in t.text)
>>> tag
<p><em>code of Drink<br/></em>
Budweiser: 4BDB1CD96<br/>
price: 10$</p>
>>> tag.string
>>> tag.text
'code of Drink\nBudweiser: 4BDB1CD96\nprice: 10$'
Of course, you may use a regex for more complex searches:
r = re.compile('4BDB1CD96') # or whatever the pattern is
tag = soup.find(lambda t: t.name in ['li','div','p','em'] and r.search(t.text))
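For comparison, here is a small self-contained sketch of the lambda filter next to a second option: find the matching text node first and then step up to its parent. It assumes the html.parser backend instead of lxml so that it runs without extra dependencies; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = """<p><em>code of Drink<br></em>
Budweiser: 4BDB1CD96<br>
price: 10$</p>"""

# Assumption: html.parser is used here instead of lxml.
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(re.escape("4BDB1CD96"))

# Option 1: filter tags by name and by their full text content.
by_lambda = soup.find(
    lambda t: t.name in ["li", "div", "p", "em"] and pattern.search(t.get_text())
)

# Option 2: find the matching text node, then step up to its parent tag.
text_node = soup.find(string=pattern)
by_parent = text_node.parent if text_node is not None else None

print(by_lambda.name)
print(by_parent.name)
```

Option 2 returns the tag that directly contains the text, which may be more specific than a name-based filter when tags are deeply nested.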

How to get value between two different tags using beautiful soup?

I need to extract the data between a closing tag and the next tag in the snippet below:
<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
What I need is: W, 65, 3
But the problem is that these values can be empty too, like this:
<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>
I want to get these values if they are present, and an empty string otherwise.
I tried using nextSibling and find_next('br'), but they returned
<br><b>Second Type :</b><br><b>Third Type :</b></br></br>
and
<br><b>Third Type :</b></br>
when the values (W, 65, 3) are not present between the </b> and <br> tags.
All I need is for it to return an empty string if nothing is present between those tags.
I would go through the <b> tags one by one and look at what their next_sibling contains.
I would just check whether next_sibling.string is None and append to the list accordingly:
>>> html = """<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> b = soup.find_all("b")
>>> data = []
>>> for tag in b:
...     if tag.next_sibling.string is None:
...         data.append(" ")
...     else:
...         data.append(tag.next_sibling.string)
...
>>> data
[' ', u'65', u'3'] # the first value was removed from the HTML above
Hope this helps!
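A slightly more defensive variant of the same idea, as a sketch: it returns an empty string (as the question asks) and also copes with a label that has no sibling at all, which happens when the last value is missing. The function name values_after_labels is hypothetical, and html.parser is assumed as the backend.

```python
from bs4 import BeautifulSoup, NavigableString


def values_after_labels(html):
    """Return the text that directly follows each <b> label, or '' if none."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for label in soup.find_all("b"):
        sibling = label.next_sibling
        # A text node right after </b> is the value; a tag (such as <br>)
        # or no sibling at all means the value is missing.
        if isinstance(sibling, NavigableString):
            values.append(str(sibling).strip())
        else:
            values.append("")
    return values


full = "<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"
empty = "<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>"
print(values_after_labels(full))
print(values_after_labels(empty))
```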
I would find each td element and then use a regex to pull out the data you need, instead of passing re.compile to find_all.
Like this:
import re
from bs4 import BeautifulSoup
example = """<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
<td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"""
soup = BeautifulSoup(example, "html.parser")
for o in soup.find_all('td'):
    # Group 1 is the value; group 2 is whichever tag ends it.
    match = re.findall(r'</b>\s*(.*?)\s*(<br|</br|</td)', str(o))
    print("%s,%s,%s" % (match[0][0], match[1][0], match[2][0]))
This pattern finds all text between a </b> tag and the next <br>, </br>, or </td> tag. Depending on the parser, </br> tags may be added when the soup object is converted to a string; the </td> alternative covers the last value when they are not.
This example outputs:
W,65,3
,69,6
This is just an example; you can adapt it to return an empty string when one of the regex matches is empty.
In [5]: [child for child in soup.td.children if isinstance(child, str)]
Out[5]: ['W', '65', '3']
The text nodes and tags are all children of td; you can access them through .contents (a list) or .children (a generator):
In [4]: soup.td.contents
Out[4]:
[<b>First Type :</b>,
'W',
<br/>,
<b>Second Type :</b>,
'65',
<br/>,
<b>Third Type :</b>,
'3']
Then you can pick out the text by testing whether each child is an instance of str.
I think this works:
from bs4 import BeautifulSoup
html = '''<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>'''
soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td')
string = str(td)
list_tags = string.split('</b>')
list_needed = []
for i in range(1, len(list_tags)):
    if list_tags[i][0] == '<':
        list_needed.append('')
    else:
        # Keep everything up to the next tag, not just the first character.
        list_needed.append(list_tags[i].split('<')[0])
print(list_needed)
# ['W', '65', '3']
Because the values you want always come right after a closing </b> tag, they are easy to pick out this way, with no need for re.

Beautifulsoup how does findAll work

I've noticed some weird behavior of findAll's method:
>>> htmls="<html><body><p class=\"pagination-container\">slytherin</p><p class=\"pagination-container and something\">gryffindor</p></body></html>"
>>> soup=BeautifulSoup(htmls, "html.parser")
>>> for i in soup.findAll("p",{"class":"pagination-container"}):
...     print(i.text)
...
slytherin
gryffindor
>>> for i in soup.findAll("p", {"class":"pag"}):
...     print(i.text)
...
>>> for i in soup.findAll("p",{"class":"pagination-container"}):
...     print(i.text)
...
slytherin
gryffindor
>>> for i in soup.findAll("p",{"class":"pagination"}):
...     print(i.text)
...
>>> len(soup.findAll("p",{"class":"pagination-container"}))
2
>>> len(soup.findAll("p",{"class":"pagination-containe"}))
0
>>> len(soup.findAll("p",{"class":"pagination-contai"}))
0
>>> len(soup.findAll("p",{"class":"pagination-container and something"}))
1
>>> len(soup.findAll("p",{"class":"pagination-conta"}))
0
So, when we search for pagination-container it returns both the first and the second p tag. It made me think that it looks for a partial equality: something like if passed_string in class_attribute_value:. So I shortened the string in findAll method and it never managed to find anything!
How is that possible?
First of all, class is a special multi-valued, space-delimited attribute and gets special handling.
When you write soup.findAll("p", {"class":"pag"}), BeautifulSoup searches for elements that have the class pag. It splits each element's class value on whitespace and checks whether pag is among the resulting items. An element with class="test pag" or class="pag" would match.
Note that in the case of soup.findAll("p", {"class": "pagination-container and something"}), BeautifulSoup matches an element whose class attribute value is exactly that string. No splitting is involved here: it simply sees that there is an element whose complete class value equals the desired string.
To have a partial match on one of the classes, you can provide a regular expression or a function as a class filter value:
import re
soup.find_all("p", {"class": re.compile(r"pag")}) # contains pag
soup.find_all("p", {"class": re.compile(r"^pag")}) # starts with pag
soup.find_all("p", {"class": lambda class_: class_ and "pag" in class_}) # contains pag
soup.find_all("p", {"class": lambda class_: class_ and class_.startswith("pag")}) # starts with pag
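The matching rules above can be checked with a short sketch that reuses the HTML from the question. The helper texts() is hypothetical, and the class_ keyword is used in place of the attrs dictionary; both forms apply the same filter.

```python
import re
from bs4 import BeautifulSoup

htmls = (
    '<html><body><p class="pagination-container">slytherin</p>'
    '<p class="pagination-container and something">gryffindor</p></body></html>'
)
soup = BeautifulSoup(htmls, "html.parser")


def texts(class_filter):
    """Hypothetical helper: text of every <p> whose class passes the filter."""
    return [p.get_text() for p in soup.find_all("p", class_=class_filter)]


# A plain string must equal one whole class token, so both <p> tags match.
whole_token = texts("pagination-container")
# A fragment of a token is not a class by itself, so nothing matches.
partial_token = texts("pag")
# A regex is searched within each token, so a partial match succeeds.
regex_contains = texts(re.compile(r"pag"))
# A function is called with each token (or None when there is no class).
token_prefix = texts(lambda c: c is not None and c.startswith("some"))

print(whole_token, partial_token, regex_contains, token_prefix)
```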
There is much more to say, but you should also know that BeautifulSoup has CSS selector support (a limited one but covers most of the common use cases). You can write things like:
soup.select("p.pagination-container") # one of the classes is "pagination-container"
soup.select("p[class='pagination-container']") # match the COMPLETE class attribute value
soup.select("p[class^=pag]") # COMPLETE class attribute value starts with pag
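Run against the HTML from the question, the three selectors behave as sketched below. This assumes a BeautifulSoup version whose select() supports attribute selectors (recent releases do); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

htmls = (
    '<html><body><p class="pagination-container">slytherin</p>'
    '<p class="pagination-container and something">gryffindor</p></body></html>'
)
soup = BeautifulSoup(htmls, "html.parser")

# Class selector: matches when one of the classes is "pagination-container".
has_class = [p.get_text() for p in soup.select("p.pagination-container")]
# Attribute equality: the complete class attribute value must be identical.
exact_attr = [p.get_text() for p in soup.select("p[class='pagination-container']")]
# Attribute prefix: the complete class attribute value starts with "pag".
attr_prefix = [p.get_text() for p in soup.select("p[class^=pag]")]

print(has_class, exact_attr, attr_prefix)
```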
Handling class attribute values in BeautifulSoup is a common source of confusion and questions; see these related topics for more detail:
BeautifulSoup returns empty list when searching by compound class names
Finding multiple attributes within the span tag in Python

Searching for text in Beautifulsoup

I have the below URL and would like to extract prices. For that I load the page into beautifulsoup:
soup = bs(content, 'lxml')
for e in soup.find_all(class_="totalPrice"):
Now I get a text that looks like this (this is one single Element of type bs4.element.Tag):
<td class="totalPrice" colspan="3">
<div data-component="track" data-hash="OLNYSRfCbdWGffSRe" data-stage="1" data-track="view"></div>
Total: £145
</td>
How can I create another find expression that will extract the 145? Is there a way to search for "Total" and then get the text right next to it?
URL with original content that I extract
Use a regex!
>>> import re
>>> search_text = 'blah Total: result'
>>> result = re.findall(r'Total: (.*)', search_text)
>>> result
['result']
If you want to be more general and capture anything that looks like currency, try this:
>>> result = re.findall(r': (£\d*)', search_text)
This will get you the currency symbol £ plus any digits that follow it.
You can get the text from the tag:
text = e.get_text(strip=True)
Now you have a normal string, Total: £145, so you can split it:
text.split(' ') # ['Total:', '£145']
slice it:
text[8:] # '145'
or use a regular expression, etc.
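Putting the two answers together, a minimal sketch might look like this. The surrounding table markup, the variable names, and the choice of html.parser are assumptions; the real page may need a different pattern if the price format varies.

```python
import re
from bs4 import BeautifulSoup

# Assumption: the cell from the question, wrapped in a minimal table.
content = """<table><tr><td class="totalPrice" colspan="3">
<div data-component="track" data-hash="OLNYSRfCbdWGffSRe" data-stage="1" data-track="view"></div>
Total: £145
</td></tr></table>"""

soup = BeautifulSoup(content, "html.parser")
prices = []
for e in soup.find_all(class_="totalPrice"):
    # strip=True removes the newlines around the text inside the cell.
    text = e.get_text(strip=True)
    # Capture the digits (with optional decimals) after "Total: £".
    m = re.search(r"Total:\s*£(\d+(?:\.\d+)?)", text)
    if m:
        prices.append(m.group(1))

print(prices)
```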

Using regex to extract all the html attrs

I want to use the re module to extract all the HTML nodes from a string, including all their attrs. However, I want each attr to be its own group, so that I can use matchobj.group() to get them. The number of attrs in a node is flexible, and this is where I am confused: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I only get two groups: 'a' and the last attr, style='bbb'.
I know there are some good HTML parsers. But actually I am not going to extract the values of the attrs. I need to modify the raw string.
Please don't use regex. Use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, "html.parser")
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb
Or if you want a dictionary:
>>> print(mytag.attrs)
{'href': 'aaa', 'style': 'bbb'}
Description
To capture an arbitrary number of attributes, you need a two-step process: first pull out each entire element, then iterate through the elements and collect the attributes matched in each one.
regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>
regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
Python Example
See working example: http://repl.it/J0t/4
Code
import re
string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""
# Step 1: match each whole opening tag, skipping over quoted attribute values.
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print("-------")
    print("matchElementObj.group(0) : ", matchElementObj.group(0))
    # Step 2: pull the attributes out of that one tag only.
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        print("matchAttributesObj.group(0) : ", matchAttributesObj.group(0))
Output
-------
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) : href="i.like.kittens.com"
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) : class=Fonzie
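If you would rather collect the results than print them, the same two-step approach can be wrapped in a function, sketched below. The name extract_attributes is hypothetical, and the attribute pattern is a small variation on the one above: it adds two capturing groups so that each attribute comes back as a (name, raw value) pair.

```python
import re

# Same element pattern as above: one whole opening tag, quoted values skipped.
ELEMENT_RE = re.compile(
    r'''<\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>'''
)
# Variation on the attribute pattern: group 1 is the name, group 2 the raw value.
ATTRIBUTE_RE = re.compile(
    r'''\s(\w+)=('[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)'''
)


def extract_attributes(markup):
    """Return one (tag_text, [(name, raw_value), ...]) pair per opening tag."""
    results = []
    for element in ELEMENT_RE.finditer(markup):
        tag_text = element.group(0)
        # Search only inside this tag, so attributes stay with their element.
        results.append((tag_text, ATTRIBUTE_RE.findall(tag_text)))
    return results


sample = """<a href='aaa' style='bbb'>"""
found = extract_attributes(sample)
print(found)
```

The raw values keep their quotes, which may be convenient when the goal is to modify the original string rather than interpret the attributes.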
