I need your help: I have a <p> tag with many other tags inside, like in the example below:
<p>I <strong>AM</strong> a <i>text</i>.</p>
I would like to get only "I AM a text.", so I unwrap() the strong and i tags
with the code below:
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
Next, if I print soup.p, everything is right, but if I don't know the name of the tag that contains my string, problems start!
The code below should make this clearer:
from bs4 import BeautifulSoup
html = '''
<html>
<header></header>
<body>
<p>I <strong>AM</strong> a <i>text</i>.</p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
print(soup.p)
# output:
# <p>I AM a text.</p>
for s in soup.stripped_strings:
    print(s)
# output
'''
I
AM
a
text
.
'''
Why does BeautifulSoup separate all my strings when I concatenated them with unwrap() before?
If you .unwrap() a tag, you remove the tag and put its contents in the parent tag. But the text is not merged; as a result, you obtain a list of NavigableStrings (a subclass of str):
>>> [(c,type(c)) for c in soup.p.children]
[('I ', <class 'bs4.element.NavigableString'>), ('AM', <class 'bs4.element.NavigableString'>), (' a ', <class 'bs4.element.NavigableString'>), ('text', <class 'bs4.element.NavigableString'>), ('.', <class 'bs4.element.NavigableString'>)]
Each of these elements is thus a separate text element. So although you removed the tag itself and injected the text, these strings are not concatenated. This seems logical, since the elements on the left and the right could still be tags: by unwrapping <strong> you have not unwrapped <i> at the same time.
You can, however, use .get_text() to obtain the full text:
>>> soup.p.get_text()
'I AM a text.'
Or you can decide to join the elements together:
>>> ''.join(soup.p.strings)
'I AM a text.'
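Since BeautifulSoup 4.8 there is also Tag.smooth(), which merges adjacent NavigableStrings in place, so iterating over stripped_strings yields a single string again. A small sketch (assuming a reasonably recent bs4; html.parser is used here just to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

html = '<p>I <strong>AM</strong> a <i>text</i>.</p>'
soup = BeautifulSoup(html, 'html.parser')

# unwrap the inline tags as in the question
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()

# merge the adjacent NavigableStrings in place
soup.smooth()

print(list(soup.stripped_strings))  # ['I AM a text.']
```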
A kinda hacky (yet easy) way I found to solve this is to convert the "clean" soup into a string and then parse it again.
So, in your code,
for elem in soup.find_all(['strong', 'i']):
    elem.unwrap()
string_soup = str(soup)
new_soup = BeautifulSoup(string_soup, 'lxml')
for s in new_soup.stripped_strings:
    print(s)
should give the desired output.
Related
I need to extract the data present between a closing </b> tag and a <br> tag in the snippet below:
<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
What I need is: W, 65, 3
But the problem is that these values can be empty too, like:
<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>
I want to get these values if present, else an empty string.
I tried making use of nextSibling and find_next('br'), but it returned
<br><b>Second Type :</b><br><b>Third Type :</b></br></br>
and
<br><b>Third Type :</b></br>
in case the values (W, 65, 3) are not present between the </b> and <br> tags.
All I need is for it to return an empty string if nothing is present between those tags.
I would go <b> tag by <b> tag, looking at what type of content their next_sibling contains.
I would just check whether their next_sibling.string is None, and append to the list accordingly :)
>>> html = """<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> b = soup.find_all("b")
>>> data = []
>>> for tag in b:
...     if tag.next_sibling.string is None:
...         data.append('')
...     else:
...         data.append(tag.next_sibling.string)
...
>>> data
['', '65', '3'] # the first value is missing in this snippet
Hope this helps!
I would search for the td object, then use a regex pattern to filter the data you need, instead of using re.compile in the find_all method.
Like this:
import re
from bs4 import BeautifulSoup
example = """<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
<td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"""
soup = BeautifulSoup(example, "html.parser")
for o in soup.find_all('td'):
    match = re.findall(r'</b>\s*(.*?)\s*(<br|</br)', str(o))
    print("%s,%s,%s" % (match[0][0], match[1][0], match[2][0]))
This pattern finds all text between the </b> tag and <br> or </br> tags. The </br> tags are added when converting the soup object to string.
This example outputs:
W,65,3
,69,6
Just an example, you can alter to return an empty string if one of the regex matches is empty.
Those text nodes and tags are the td's children; you can access them with .contents (a list) or .children (a generator):
In [4]: soup.td.contents
Out[4]:
[<b>First Type :</b>,
 'W',
 <br/>,
 <b>Second Type :</b>,
 '65',
 <br/>,
 <b>Third Type :</b>,
 '3']
Then you can get the text by testing whether each child is an instance of str:
In [5]: [child for child in soup.td.children if isinstance(child, str)]
Out[5]: ['W', '65', '3']
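Building on the children idea, the labels and values can also be paired up into a dict. A sketch (my own variation, not from the answers above; it assumes each <b> label is followed either by a text node or directly by <br> when the value is missing):

```python
from bs4 import BeautifulSoup, NavigableString

html = '<td><b>First Type :</b>W<br><b>Second Type :</b><br><b>Third Type :</b>3</td>'
soup = BeautifulSoup(html, 'html.parser')

result = {}
for b in soup.td.find_all('b'):
    sib = b.next_sibling
    # the value is the text node right after </b>, or '' when the next node is <br>
    value = str(sib) if isinstance(sib, NavigableString) else ''
    result[b.get_text().rstrip(' :')] = value

print(result)  # {'First Type': 'W', 'Second Type': '', 'Third Type': '3'}
```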
I think this works:
from bs4 import BeautifulSoup
html = '''<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>'''
soup = BeautifulSoup(html, 'lxml')
td = soup.find('td')
string = str(td)
list_tags = string.split('</b>')
list_needed = []
for i in range(1, len(list_tags)):
    # everything up to the next '<' is the value; it is '' when the value is missing
    list_needed.append(list_tags[i].split('<')[0])
print(list_needed)
# ['W', '65', '3']
Because the values you want always come right after a closing </b> tag, it's easy to catch them this way, no need for re.
>>> from lxml import html
>>> html.tostring(html.fromstring('<div>1</div><div>2</div>'))
'<div><div>1</div><div>2</div></div>' # I don't want the outer <div>
>>> html.tostring(html.fromstring('I am pure text'))
'<p>I am pure text</p>' # I don't need the extra <p>
How to avoid the outer <div> and <p> in lxml?
By default, lxml will create a parent div when the string contains multiple elements.
You can work with individual fragments instead:
from lxml import html
test_cases = ['<div>1</div><div>2</div>', 'I am pure text']
for test_case in test_cases:
    fragments = html.fragments_fromstring(test_case)
    print(fragments)
    output = ''
    for fragment in fragments:
        if isinstance(fragment, str):
            output += fragment
        else:
            output += html.tostring(fragment).decode('UTF-8')
    print(output)
output:
[<Element div at 0x3403ea8>, <Element div at 0x3489368>]
<div>1</div><div>2</div>
['I am pure text']
I am pure text
I'm making requests in Python with requests.
I then use bs4 to select the div I want. Now I want to count the length of the text in that div, but the string I get out of it includes all the tags too, for example:
<div><a class="some_class">Text here!</a></div>
I want to count only the Text here!, without the div and a tags.
Does anyone have an idea how I could do that?
Do you mean:
tag.text
or
tag.string
tag means the tag that you found using soup.find(). Check the documentation for more details.
Here is a simple demo that helps you understand what I mean:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><div><a class="some_class">Text here!</a></div></body></html>', "html.parser")
>>> tag = soup.find('div')
>>> tag
<div><a class="some_class">Text here!</a></div>
>>> tag.string
'Text here!'
>>> tag.text
'Text here!'
>>>
About counting the length of the text, do you mean using len() here?
>>> tag.text
'Text here!'
>>> len(tag.text)
10
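One caveat worth knowing (a small sketch of my own, not part of the answer above): .string only returns a value when the tag has a single string child; with several children it returns None, while .get_text()/.text always concatenates everything:

```python
from bs4 import BeautifulSoup

# <div> has two children here: the <a> tag and the trailing text node
soup = BeautifulSoup('<div><a>Text</a> here!</div>', 'html.parser')
div = soup.find('div')

print(div.string)     # None, because <div> has more than one child
print(div.text)       # 'Text here!'
print(len(div.text))  # 10
```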
Using python+beautifulsoup, let's say I have a <class 'bs4.element.Tag'> object, a:
<div class="class1"><em>text1</em> text2</div>
I can use the following command to extract text1 text2 and put it in b:
b = a.text
I can use the following command to extract text1 and put it in c:
c = a.findAll("em")[0].text
But how can I extract just text2?
I edited your HTML snippet slightly to have more than just one word inside and outside the <em> tag, so that getText(), which extracts all the text from your <div> container, leads to the following output:
'text1 foo bar text2 foobar baz'
As you can see, this is just a string where the <em> tags have been removed. As far as I understood, you want to remove the contents of the <em> tag from the contents of your <div> container.
My solution is not very nice, but this can be done by using .replace() to replace the contents of the <em> tag with an empty string ''. Since this could leave leading or trailing spaces, you can call .lstrip() to get rid of those:
#!/usr/bin/env python3
# coding: utf-8
from bs4 import BeautifulSoup
html = '<div class="class1"><em>text1 foo bar</em> text2 foobar baz</div>'
soup = BeautifulSoup(html, 'html.parser')
result = soup.getText().replace(soup.em.getText(), '').lstrip()
print(result)
Output of print statement:
'text2 foobar baz'
You can remove all children of the div parent and then get the contents of the parent, like this:
>>> a = BeautifulSoup(out_div, 'html.parser')
>>> for child in a.div.findChildren():
... child.replace_with('')
...
<em>text1</em>
>>> a.get_text()
u' text2'
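A non-destructive alternative (my own sketch using standard bs4 calls, not from the answers above) is to collect only the direct text children of the <div>, which leaves the tree untouched:

```python
from bs4 import BeautifulSoup, NavigableString

html = '<div class="class1"><em>text1</em> text2</div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

# keep only the text nodes that sit directly under the <div>,
# skipping anything wrapped in a child tag such as <em>
text2 = ''.join(
    child for child in div.children if isinstance(child, NavigableString)
).strip()

print(text2)  # 'text2'
```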
I want to use the re module to extract all the HTML nodes from a string, including all their attrs. However, I want each attr to be a group, which means I can use matchobj.group() to get them. The number of attrs in a node is flexible, and this is where I am confused: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I can only get two groups with [('a'), ('style="bbb")].
I know there are some good HTML parsers, but actually I am not going to extract the values of the attrs. I need to modify the raw string.
Please don't use regex. Use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, 'html.parser')
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb
Or if you want a dictionary:
>>> print(mytag.attrs)
{'style': 'bbb', 'href': 'aaa'}
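Since the goal is to modify the raw string, note that you can also change attributes through the same mapping interface and serialize the soup back to a string. A sketch (the attribute names and values are just the ones from the question):

```python
from bs4 import BeautifulSoup

html = """<a href='aaa' style='bbb'>link</a>"""
soup = BeautifulSoup(html, 'html.parser')

mytag = soup.find('a')
mytag['href'] = 'ccc'  # rewrite an existing attribute
del mytag['style']     # drop one entirely

print(str(soup))  # <a href="ccc">link</a>
```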
Description
To capture an arbitrary number of attributes, it needs to be a two-step process: first you pull the entire element, then you iterate through the elements and get an array of matched attributes.
regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>
regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
Python Example
See working example: http://repl.it/J0t/4
Code
import re
string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M | re.I | re.S):
    print("-------")
    print("matchElementObj.group(0) : ", matchElementObj.group(0))
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M | re.I | re.S):
        print("matchAttributesObj.group(0) : ", matchAttributesObj.group(0))
Output
-------
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) : href="i.like.kittens.com"
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) : class=Fonzie