BeautifulSoup cannot extract item using find_all() - python

I am trying to get the tag containing a piece of text from the HTML below using BeautifulSoup. Here is my HTML:
<p><em>code of Drink<br></em>
Budweiser: 4BDB1CD96<br>
price: 10$</p>
with this code:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
result = re.escape('4BDB1CD96')
tag = soup.find(['li', 'div', 'p', 'em'], string=re.compile(result))
I cannot extract the tag. But when I change the find() call to:
tag = soup.find(string=re.compile(result))
then I can get the result:
Budweiser: 4BDB1CD96
So I want to know why, and how to get the result as a tag.

The problem here is that your tags have nested tags, and the text you are searching for sits alongside such a nested tag (the <em> inside <p> here), so the string= filter never matches the <p> tag itself.
So, the easiest approach is to use a lambda inside .find() to check the tag names and whether their .text property contains your pattern. Here, you do not even need a regex:
>>> tag = soup.find(lambda t: t.name in ['li','div','p','em'] and '4BDB1CD96' in t.text)
>>> tag
<p><em>code of Drink<br/></em>
Budweiser: 4BDB1CD96<br/>
price: 10$</p>
>>> tag.string
>>> tag.text
'code of Drink\nBudweiser: 4BDB1CD96\nprice: 10$'
Of course, you may use a regex for more complex searches:
r = re.compile('4BDB1CD96') # or whatever the pattern is
tag = soup.find(lambda t: t.name in ['li','div','p','em'] and r.search(t.text))
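The reason the string= filter fails can be shown directly: for a tag that contains other tags, .string is None, so the compiled pattern has nothing to match against. A minimal sketch (using the stock html.parser so no lxml install is assumed):

```python
from bs4 import BeautifulSoup

html = '<p><em>code of Drink<br></em>Budweiser: 4BDB1CD96<br>price: 10$</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')
# .string is only defined when a tag has exactly one string child
print(p.string)               # None
# .text joins all descendant strings, so the code is found there
print('4BDB1CD96' in p.text)  # True
```

This is why the lambda tests t.text rather than relying on string=.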


Get real text with beautifulSoup after unwrap()

I need your help: I have a <p> tag with many other tags inside it, like in the example below:
<p>I <strong>AM</strong> a <i>text</i>.</p>
I would like to get only "I AM a text." so I unwrap() the strong and i tags
with the code below:
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
Next, if I print soup.p all is right, but if I don't know the name of the tag where my string is, problems start!
The code below should make this clearer:
from bs4 import BeautifulSoup
html = '''
<html>
<header></header>
<body>
<p>I <strong>AM</strong> a <i>text</i>.</p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
print(soup.p)
# output :
# <p>I AM a text.</p>
for s in soup.stripped_strings:
    print(s)
# output
'''
I
AM
a
text
.
'''
Why does BeautifulSoup keep all my strings separate when I already concatenated them with unwrap()?
If you .unwrap() a tag, you remove the tag and put its content into the parent tag. But the text nodes are not merged; as a result, you obtain a list of NavigableStrings (a subclass of str):
>>> [(c,type(c)) for c in soup.p.children]
[('I ', <class 'bs4.element.NavigableString'>), ('AM', <class 'bs4.element.NavigableString'>), (' a ', <class 'bs4.element.NavigableString'>), ('text', <class 'bs4.element.NavigableString'>), ('.', <class 'bs4.element.NavigableString'>)]
Each of these elements is thus a separate text element. So although you removed the tag itself and injected its text, these strings are not concatenated. This is logical, since the elements to the left and right could still be tags: by unwrapping <strong> you have not unwrapped <i> at the same time.
You can however use .text, to obtain the full text:
>>> soup.p.get_text()
'I AM a text.'
Or you can decide to join the elements together:
>>> ''.join(soup.p.strings)
'I AM a text.'
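If you want the adjacent strings physically merged in the tree (not just joined on output), newer BeautifulSoup versions also provide .smooth() (added around 4.8, so check your version), which consolidates consecutive NavigableStrings in place - a sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>I <strong>AM</strong> a <i>text</i>.</p>', 'html.parser')
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
soup.smooth()  # merge adjacent NavigableStrings into single text nodes
print(list(soup.stripped_strings))  # ['I AM a text.']
```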
A kinda hacky (yet easy) way I found to solve this is to convert the "clean" soup to a string and then parse it again.
So, in your code
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
string_soup = str(soup)
new_soup = BeautifulSoup(string_soup, 'lxml')
for s in new_soup.stripped_strings:
    print(s)
should give the desired output

How to get value between two different tags using beautiful soup?

I need to extract the data present between an ending </b> tag and a <br> tag in the code snippet below:
<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
What I need is : W, 65, 3
But the problem is that these values can also be empty, like:
<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>
I want to get these values if present, else an empty string.
I tried making use of nextSibling and find_next('br'), but when the values (W, 65, 3) are not present between the </b> and <br> tags it returned
<br><b>Second Type :</b><br><b>Third Type :</b></br></br>
and
<br><b>Third Type :</b></br>
All I need is that it should return an empty string if nothing is present between those tags.
I would go <b> tag by <b> tag, looking at what type of info their next_sibling contains.
I would just check whether their next_sibling.string is None, and append to the list accordingly :)
>>> html = """<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> b = soup.find_all("b")
>>> data = []
>>> for tag in b:
...     if tag.next_sibling.string is None:
...         data.append(" ")
...     else:
...         data.append(tag.next_sibling.string)
>>> data
[' ', u'65', u'3'] # the first value is missing in this sample HTML
Hope this helps!
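A variant of this sibling-based approach that returns empty strings (as the question asks) and explicitly checks whether the next sibling is a text node rather than the <br> tag - a sketch using the NavigableString class that bs4 exposes:

```python
from bs4 import BeautifulSoup, NavigableString

html = '<td><b>First Type :</b>W<br><b>Second Type :</b><br><b>Third Type :</b>3</td>'
soup = BeautifulSoup(html, 'html.parser')
values = []
for b in soup.find_all('b'):
    sib = b.next_sibling
    # a value is present only when the sibling is a text node, not a tag
    values.append(sib.strip() if isinstance(sib, NavigableString) else '')
print(values)  # ['W', '', '3']
```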
I would search for the td objects and then use a regex pattern to filter out the data you need, instead of using re.compile in the find_all method.
Like this:
import re
from bs4 import BeautifulSoup
example = """<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
<td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"""
soup = BeautifulSoup(example, "html.parser")
for o in soup.find_all('td'):
    match = re.findall(r'</b>\s*(.*?)\s*(<br|</br)', str(o))
    print("%s,%s,%s" % (match[0][0], match[1][0], match[2][0]))
This pattern finds all text between the </b> tag and <br> or </br> tags. The </br> tags are added when converting the soup object to string.
This example outputs:
W,65,3
,69,6
This is just an example; you can alter it to return an empty string when one of the regex matches is empty.
In [5]: [child for child in soup.td.children if isinstance(child, str)]
Out[5]: ['W', '65', '3']
The text and tags are the td's children; you can access them using .contents (a list) or .children (a generator):
In [4]: soup.td.contents
Out[4]:
[<b>First Type :</b>,
'W',
<br/>,
<b>Second Type :</b>,
'65',
<br/>,
<b>Third Type :</b>,
'3']
then you can get the text by testing whether each child is an instance of str:
I think this works:
from bs4 import BeautifulSoup
html = '''<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>'''
soup = BeautifulSoup(html, 'lxml')
td = soup.find('td')
string = str(td)
list_tags = string.split('</b>')
list_needed = []
for i in range(1, len(list_tags)):
    # take everything up to the next tag; this is '' when the value is missing
    list_needed.append(list_tags[i].split('<')[0])
print(list_needed)
#['W', '65', '3']
Because the values you want always come right after the closing </b> tags, it is easy to catch them this way, with no need for re.

regex substitute html href and u tags in list (python)

In Python I have a list whose string items look like this:
My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>
The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>
...
What I want to do is substitute (in every list item) the href markup, leaving only the link text, so my list would look like:
My website is WEBSITE1
The link is LINK1
...
I was thinking about matching and replacing this regex:
<a href="(.*?)" target='_blank'><u>(.*?)</u></a>
with:
(.*?)
but it doesn't work; it seems too complex. Is there an easy way to get a list with cleaned items as output?
You can also process the string with an HTML parser, e.g. BeautifulSoup and its replace_with() - finding all a elements in a string and replacing them with their link texts:
>>> from bs4 import BeautifulSoup
>>> l = [
... """My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>""",
... """The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>"""
... ]
>>> for item in l:
... soup = BeautifulSoup(item, "html.parser")
... for a in soup("a"):
... a.replace_with(a.text)
... print(str(soup))
...
My website is WEBSITE1
The link is LINK1
Or, as pointed out by user3100115 in the comments, just getting the text of the "soup" object also works on your sample data:
>>> for item in l:
... print(BeautifulSoup(item, "html.parser").get_text())
...
My website is WEBSITE1
The link is LINK1
This regex seems to work:
<a\s+href\s*=\s*"([^"]+).*
Python Code
p = re.compile(r'<a\s+href\s*=\s*"([^"]+).*')
test_str = ["My website is <a href=\"WEBSITE1\" target='_blank'><u>WEBSITE1</u></a>", "The link is <a href=\"LINK1\" target='_blank'><u>LINK1</u></a>"]
for x in test_str:
    print(re.sub(p, r"\1", x))
If I had to use a regex I would use something like
<a href.*?><u>(.*?)<\/u><\/a>
and then replace in a list comprehension
pattern = re.compile(r'<a href.*?><u>(.*?)<\/u><\/a>')
print([re.sub(pattern, r"\1", string) for string in my_list])
But consider using beautifulsoup or another html parser, as pointed out in other answers, which will provide you with a more generic solution
Regex Explanation
<a href.*?> match an a href tag, non greedy, up to the first closing bracket
<u> match the u tag
(.*?) match the string you want to keep
<\/u><\/a> match the closing tags
Retrieve the parenthesized capture group in your re.sub:
>>> s = """
My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>
The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>
"""
>>> re.sub("<a href=\"(.*?)\" target='_blank'><u>(.*?)</u></a>", r'\1', s)
'\nMy website is WEBSITE1 \nThe link is LINK1 \n'
Make sure the replacement string is a proper raw (r-prefixed) string, else it will simply replace with the literal text \1.
As your input is a list (let's assume its name is s):
>>> for i in range(0, len(s)):
...     s[i] = re.sub("<a href=\"(.*?)\" target='_blank'><u>(.*?)</u></a>", r'\1', s[i])
>>> s
['My website is WEBSITE1', 'The link is LINK1']
If done regularly or on a large list, you can compile the regex before the loop.
Please clarify: your title says to remove HTML href tags, but in your example you also remove the u tags.
The answer can be simplified if we are guaranteed to have no HTML tags other than a and u (or if we want to remove all tags). In that case, we can search for anything between < and >, or for anything between <a or </a and >. My answer assumes this, so it will be invalid if the assumption does not hold.
import re
S = (
    '''My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>''',
    '''The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>''',
)
RE1 = re.compile(r"<\/?[^>]*>")
RE2 = re.compile(r"<\/?[aA][^>]*>")
for s in S:
    s1 = RE1.sub("", s)  # remove all tags
    s2 = RE2.sub("", s)  # remove only <a> and </a> tags
    print(s)
    print(s1)
    print(s2)
    print("")
When run, it produces
My website is <a href="WEBSITE1" target='_blank'><u>WEBSITE1</u></a>
My website is WEBSITE1
My website is <u>WEBSITE1</u>
The link is <a href="LINK1" target='_blank'><u>LINK1</u></a>
The link is LINK1
The link is <u>LINK1</u>
I did not include a third choice: only remove the a href tags.

Matching a group with OR condition in pattern

I am trying to extract the text between <th> tags from an HTML table. The following code demonstrates the problem:
searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile('<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr): print(i.group(1))
The output produced by the code is
data 1
None
If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to
None
data 2
What is the correct way to catch the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.
Why not use an HTML parser instead - BeautifulSoup, for example:
>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
[u'data 1', u'data 2']
Also note that str is a bad choice for a variable name - you are shadowing a built-in str.
You may reduce the regex to one with a single capturing group, like below:
re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
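A quick check of that reduced pattern (the \b word boundary after th is what keeps <thead> from matching, and [^>]* skips any attributes):

```python
import re

searchstr = '<th class="c1">data 1</th><th>data 2</th><thead></thead>'
p = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
print(p.findall(searchstr))  # ['data 1', 'data 2']
```

With a single group, findall returns the captured text directly, so no OR between alternatives is needed.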

Using regex to extract all the html attrs

I want to use the re module to extract all the HTML nodes from a string, including all their attrs. However, I want each attr to be a separate group, so I can use matchobj.group() to get them. The number of attrs in a node is flexible, which is where I am confused: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I only get two groups, [('a'), ('style="bbb"')].
I know there are some good HTML parsers, but I am not actually going to extract the values of the attrs - I need to modify the raw string.
Please don't use regex. Use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, 'html.parser')
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb
Or if you want a dictionary:
>>> print(mytag.attrs)
{'style': 'bbb', 'href': 'aaa'}
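Since the stated goal is to modify the raw string, note that you can also edit the attributes on the parsed tag and serialize the soup back to text; a small sketch:

```python
from bs4 import BeautifulSoup

html = "<a href='aaa' style='bbb'>link</a>"
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('a')
tag['href'] = 'ccc'  # change an attribute in place
del tag['style']     # or drop one entirely
print(str(soup))     # <a href="ccc">link</a>
```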
Description
To capture an arbitrary number of attributes, it needs to be a two-step process: first you pull the entire element, then you iterate through the elements and get an array of matched attributes.
regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>
regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
Python Example
See working example: http://repl.it/J0t/4
Code
import re
string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print("-------")
    print("matchElementObj.group(0) : ", matchElementObj.group(0))
    # search for attributes inside the matched element, not the whole string
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        print("matchAttributesObj.group(0) : ", matchAttributesObj.group(0))
Output
-------
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) : href="i.like.kittens.com"
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) : class=Fonzie
