Only extracting text from this element, not its children - python

I want to extract only the text from the top-most element of my soup; however soup.text gives the text of all the child elements as well:
I have
import BeautifulSoup
soup=BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
print soup.text
The output to this is yesno. I want simply 'yes'.
What's the best way of achieving this?
Edit: I also want yes to be output when parsing '<html><b>no</b>yes</html>'.

What about .find(text=True)?
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').find(text=True)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').find(text=True)
u'no'
EDIT:
I think that I've understood what you want now. Try this:
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').html.find(text=True, recursive=False)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').html.find(text=True, recursive=False)
u'yes'

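You could use contents. Note that contents[0] is simply the first child, so this only returns the text when it comes before the <b> tag: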
>>> print soup.html.contents[0]
yes
or to get all the texts under html, use findAll(text=True, recursive=False)
>>> soup = BeautifulSoup.BeautifulSOAP('<html>x<b>no</b>yes</html>')
>>> soup.html.findAll(text=True, recursive=False)
[u'x', u'yes']
The above, joined to form a single string:
>>> ''.join(soup.html.findAll(text=True, recursive=False))
u'xyes'

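This works for me in bs4:
import bs4
node = bs4.BeautifulSoup('<html><div>A<span>B</span>C</div></html>', 'html.parser').find('div')
# type(t) == NavigableString keeps plain text only; subclasses such as Comment are excluded
print("".join(t for t in node.contents if type(t) == bs4.element.NavigableString))
Output:
AC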

You might want to look into lxml's soupparser module, which has support for XPath:
>>> from lxml.html.soupparser import fromstring
>>> s1 = '<html>yes<b>no</b></html>'
>>> s2 = '<html><b>no</b>yes</html>'
>>> soup1 = fromstring(s1)
>>> soup2 = fromstring(s2)
>>> soup1.xpath("text()")
['yes']
>>> soup2.xpath("text()")
['yes']
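If installing lxml is not an option, the same idea can be sketched with the standard library's xml.etree.ElementTree, assuming the markup is well-formed enough to parse as XML. ElementTree stores an element's leading text in .text and any text that follows a child in that child's .tail. The helper name direct_text is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def direct_text(elem):
    """Return only the text that belongs directly to elem (hypothetical helper)."""
    # Text before the first child lives in elem.text (None if there is none).
    parts = [elem.text or ""]
    # Text that follows a child's closing tag is stored on that child as .tail.
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts)

print(direct_text(ET.fromstring("<html>yes<b>no</b></html>")))    # yes
print(direct_text(ET.fromstring("<html><b>no</b>yes</html>")))    # yes
print(direct_text(ET.fromstring("<div>A<span>B</span>C</div>")))  # AC
```

This will raise a ParseError on malformed HTML, so BeautifulSoup or lxml remains the safer choice for real pages.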

Related

Python count number or letters on scraped page

I'm making requests in Python with requests.
I then use bs4 to select the wanted div. I now want to count the length of the text in that div, but the string I get out of it includes all the tags too, for example:
<div><a class="some_class">Text here!</a></div>
I want to only count the Text here!, without all the div and a tags.
Anyone have any idea how I could do that?
Do you mean:
tag.text
or
tag.string
Here tag is the tag that you found using soup.find(). Check the documentation for more details.
Here is a simple demo that helps you understand what I mean:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><div><a class="some_class">Text here!</a></div></body></html>', "html.parser")
>>> tag = soup.find('div')
>>> tag
<div><a class="some_class">Text here!</a></div>
>>> tag.string
'Text here!'
>>> tag.text
'Text here!'
As for counting the length of the text, do you mean using len() here?
>>> tag.text
'Text here!'
>>> len(tag.text)
10
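For comparison, here is a rough sketch of the same count using only the standard library's html.parser, in case bs4 is not available. The class name TextCollector is invented for this example; it just gathers every text chunk the parser reports and ignores the tags.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects text nodes and ignores all tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between tags.
        self.chunks.append(data)

collector = TextCollector()
collector.feed('<div><a class="some_class">Text here!</a></div>')
collector.close()

text = "".join(collector.chunks)
print(text)       # Text here!
print(len(text))  # 10
```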

get a string between a tag (TEST in <div><p>p1</p>TEST<p>p2</p></div>)

Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>')
print soup.div()
Result:
[<p>p1</p>, <p>p2</p>]
How come the string TEST isn't in the result set? How can I get it?
soup.div() is a shortcut for soup.div.find_all(), which finds all the tags inside the div tag and, by default, skips text nodes, so it is doing its job. TEST is a text node between the p tags or, in other words, the tail of the first p tag.
You can get the TEST string by getting the first p tag and using .next_sibling:
>>> soup.div.p.next_sibling
u'TEST'
Or, by getting the second element of the div's .contents:
>>> soup.div.contents[1]
u'TEST'
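The "tail" wording above is how ElementTree and lxml model this kind of text. As a small illustration (using the standard library's xml.etree.ElementTree rather than bs4, and assuming the snippet parses as XML), the text after the first p tag is exposed directly as its .tail:

```python
import xml.etree.ElementTree as ET

div = ET.fromstring("<div><p>p1</p>TEST<p>p2</p></div>")
first_p = div.find("p")  # first <p> child of the div

print(first_p.text)  # p1   (text inside the tag)
print(first_p.tail)  # TEST (text after the closing </p>, before the next tag)
```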
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div><p>p1</p>TEST<p>p2</p></div>', 'html.parser')
print(soup.div.text)  # .text concatenates the text of all descendants, not just TEST
p1TESTp2

Stripping space between html tags

I have a string that contains some html tags as follows:
"<p> This is a test </p>"
I want to strip all the extra spaces between the tags. I have tried the following:
In [1]: import re
In [2]: val = "<p> This is a test </p>"
In [3]: re.sub("\s{2,}", "", val)
Out[3]: '<p>This is atest</p>'
In [4]: re.sub("\s\s+", "", val)
Out[4]: '<p>This is atest</p>'
In [5]: re.sub("\s+", "", val)
Out[5]: '<p>Thisisatest</p>'
but I am not able to get the desired result, i.e. <p>This is a test</p>
How can I achieve this?
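Try using an HTML parser like BeautifulSoup:
from bs4 import BeautifulSoup as BS
s = "<p> This is a test </p>"
soup = BS(s, "html.parser")  # html.parser does not wrap the fragment in <html><body>
p = soup.find('p')
p.string = ' '.join(p.text.split())  # split() with no arguments collapses runs of whitespace
print(soup)
Returns:
<p>This is a test</p>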
Try
re.sub(r'\s+<', '<', val)
re.sub(r'>\s+', '>', val)
However, this is too simplistic for general real-world use, where angle brackets are not necessarily always part of a tag. (Think <code> blocks, <script> blocks, etc.) You should be using a proper HTML parser for anything like that.
From the question, I see that you are using a very specific HTML string to parse. Although a regular expression is quick and dirty, it's not recommended; use an XML parser instead. Note: XML is stricter than HTML. So if you feel you might not have well-formed XML, use BeautifulSoup as @Haidro suggests.
For your case, you'd do something like this:
>>> import xml.etree.ElementTree as ET
>>> p = ET.fromstring("<p> This is a test </p>")
>>> p.text.strip()
'This is a test'
>>> p.text = p.text.strip() # If you want to perform more operation on the string, do it here.
>>> ET.tostring(p)
'<p>This is a test</p>'
This may help:
import re
val = "<p> This is a test </p>"
re_strip_p = re.compile("<p>|</p>")
val = '<p>%s</p>' % re_strip_p.sub('', val).strip()
You can try this:
re.sub(r'\s+(</)|(<[^/][^>]*>)\s+', r'\1\2', val)  # needs Python 3.5+, where an unmatched group substitutes as ''
s = '<p> This is a test </p>'
s = re.sub(r'(\s)(\s*)', r'\g<1>', s)
>>> s
'<p> This is a test </p>'
s = re.sub(r'>\s*', '>', s)
s = re.sub(r'\s*<', '<', s)
>>> s
'<p>This is a test</p>'
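Putting the regex steps above together, a minimal sketch might look like the following. The function name normalize_tag_whitespace and the sample input with extra spaces are made up for illustration, and the same caveat applies: this treats every < and > as a tag boundary, so it is not safe for <pre>, <script>, or similar blocks.

```python
import re

def normalize_tag_whitespace(html):
    """Collapse whitespace runs and trim spaces next to tags (hypothetical helper)."""
    collapsed = re.sub(r"\s+", " ", html)        # any run of whitespace becomes one space
    collapsed = re.sub(r">\s+", ">", collapsed)  # drop whitespace right after a '>'
    return re.sub(r"\s+<", "<", collapsed)       # drop whitespace right before a '<'

print(normalize_tag_whitespace("<p>   This  is   a   test   </p>"))
# <p>This is a test</p>
```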

How to eliminate a specific part of an HTML file in Python

I am working on a html file which has item 1, item 2, and item 3. I want to delete all the text that comes after the LAST item 2. There may be more than one item 2 in the file. I am using this but it does not work:
text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> a=re.search ('(?<=<B>)Item 2.',text)
>>> b= a.group(0)
>>> newText= text.partition(b)[0]
>>> newText
'<A href="#106">'
It deletes the text after the first item 2, not the second one.
I'd use BeautifulSoup to parse the HTML and modify it. You might want to use the decompose() or extract() method.
BeautifulSoup is nice because it's pretty good at parsing malformed HTML.
For your specific example:
>>> import bs4
>>> text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> soup = bs4.BeautifulSoup(text)
>>> soup.b.next_sibling.extract()
u' this is an example this is an example'
>>> soup
<html><body>Item 2. <b>Item 2. Properties</b></body></html>
If you really wanna use regular expressions, a non-greedy regex would work for your example:
>>> import re
>>> text = """<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example"""
>>> m = re.match(r".*?Item 2\.", text)
>>> m.group(0)
'<A href="#106">Item 2.'
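Note that the non-greedy pattern stops at the first "Item 2.". If the goal is to keep everything up to the last occurrence, as the question describes, one possible sketch is to use a greedy match or str.rfind instead. The variable names kept and kept_alt are only for illustration, and this still treats the HTML as plain text.

```python
import re

text = '<A href="#106">Item 2. <B>Item 2. Properties</B> this is an example this is an example'

# A greedy ".*" consumes as much as possible and then backtracks,
# so the match ends at the LAST "Item 2." in the string.
m = re.match(r".*Item 2\.", text, flags=re.DOTALL)
kept = m.group(0) if m else text
print(kept)  # <A href="#106">Item 2. <B>Item 2.

# The same result without a regex: find the last occurrence and slice.
marker = "Item 2."
idx = text.rfind(marker)
kept_alt = text[: idx + len(marker)] if idx != -1 else text
print(kept_alt == kept)  # True
```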
