I have some HTML like this:
<p><span class="map-sub-title">abc</span>123</p>
I used BeautifulSoup, and here's my code:
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
p = soup1.text
I get the result 'abc123', but I want '123', not 'abc123'.
You can use the function decompose() to remove the span tag and then get the text you want.
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "lxml")
for span in soup.find_all("span", {'class': 'map-sub-title'}):
    span.decompose()
print(soup.text)
You can also use extract() to remove the unwanted tag before you get the text, like below:
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
soup1.p.span.extract()
print(soup1.text)
Although every response on this thread seems acceptable, I'll point out another method for this case:
soup.find("span", {'class':'map-sub-title'}).next_sibling
You can use next_sibling to navigate between elements that share the same parent, in this case the p tag.
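A runnable sketch of that idea, using the HTML from the question (html.parser is used here so no extra parser needs to be installed):

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# The NavigableString '123' sits immediately after the span under the same <p>
text = soup.find("span", {"class": "map-sub-title"}).next_sibling
print(text)  # 123
```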
One of the many ways is to use .contents on the parent tag (in this case, <p>).
If you know the position of the string, you can directly use this:
>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup('<p><span class="map-sub-title">abc</span>123</p>', 'lxml')
>>> # check the contents
... soup.find('p').contents
[<span class="map-sub-title">abc</span>, '123']
>>> soup.find('p').contents[1]
'123'
If you want a generalized solution where you don't know the position, you can check whether each piece of content is a NavigableString, like this:
>>> final_text = [x for x in soup.find('p').contents if isinstance(x, NavigableString)]
>>> final_text
['123']
With the second method, you'll be able to get all the text that is directly a child of the <p> tag. For completeness, here's one more example:
>>> html = '''
... <p>
... I want
... <span class="map-sub-title">abc</span>
... foo
... <span class="map-sub-title">abc2</span>
... text
... <span class="map-sub-title">abc3</span>
... only
... </p>
... '''
>>> soup = BeautifulSoup(html, 'lxml')
>>> ' '.join([x.strip() for x in soup.find('p').contents if isinstance(x, NavigableString)])
'I want foo text only'
If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
>>> from bs4 import BeautifulSoup
>>> html = '<p><span class="map-sub-title">abc</span>123</p>'
>>> soup1 = BeautifulSoup(html,"lxml")
>>> soup1.p.strings
<generator object _all_strings at 0x00000008768C50>
>>> list(soup1.p.strings)
['abc', '123']
>>> list(soup1.p.strings)[1]
'123'
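If the markup contains extra whitespace, the closely related .stripped_strings generator trims it for you. A small sketch (the whitespace-padded HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''<p>
  <span class="map-sub-title">abc</span>
  123
</p>'''
soup = BeautifulSoup(html, "html.parser")

# .stripped_strings skips whitespace-only strings and strips the rest
print(list(soup.p.stripped_strings))  # ['abc', '123']
```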
I'm trying to get text (example text) inside tags using beautiful soup
The html structure looks like this:
...
<div>
<div>Description</div>
<span>
<div><span>example text</span></div>
</span>
</div>
...
What I tried:
r = requests.get(url)
soup = bs(r.content, 'html.parser')
desc = soup.find('div.div.span.div.span')
print(str(desc))
You cannot stack multiple tag names inside a single .find() call like this. You need to call .find() repeatedly to get the desired result. Check out the docs for more information. The code below will give you the desired output:
soup.find('div').find('span').get_text()
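A self-contained version of that chain, run against the HTML from the question (strip=True is added here so the surrounding newlines inside the outer span are dropped):

```python
from bs4 import BeautifulSoup

html = '''<div>
<div>Description</div>
<span>
<div><span>example text</span></div>
</span>
</div>'''
soup = BeautifulSoup(html, "html.parser")

# First find() the outer <div>, then find() the first <span> inside it
print(soup.find('div').find('span').get_text(strip=True))  # example text
```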
Your selector is wrong.
>>> from bs4 import BeautifulSoup
>>> data = '''\
... <div>
... <div>Description</div>
... <span>
... <div><span>example text</span></div>
... </span>
... </div>'''
>>> soup = BeautifulSoup(data, 'html.parser')
>>> desc = soup.select_one('div span div span')
>>> desc.text
'example text'
>>>
r = requests.get(url)
soup = bs(r.content, 'html.parser')
desc = soup.find('div').find('span')
print(desc.getText())
Check this out -
soup = BeautifulSoup('''<div>
<div>Description</div>
<span>
<div><span>example text</span></div>
</span>
</div>''',"html.parser")
text = soup.span.get_text()
print(text)
Here is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
soup = BeautifulSoup(page.text, 'html.parser')
name_list = soup.find(class_='BodyText')
name_list_item = name_list.find_all('a')
for i in name_list_item:
    names = name_list.contents[0]
    print(names)
Then I ran it, but nothing showed up in the terminal except blank lines.
Please help!! :<
The problem is in the for loop: you have to extract the content from i, not from name_list.
Your working code should look like this:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
soup = BeautifulSoup(page.text, 'html.parser')
name_list = soup.find(class_='BodyText')
name_list_item = name_list.find_all('a')
for i in name_list_item:
    names = i.contents[0]
    print(names)
I suggest using the approach below to get the links.
(The problem with your approach is that every iteration of the loop prints name_list.contents[0], which is a newline NavigableString rather than a tag, so the 32 iterations print 32 LF (ASCII value 10) characters.)
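That blank-line behavior is easy to reproduce locally; the HTML below is a made-up stand-in for the page (only the BodyText class name comes from the question):

```python
from bs4 import BeautifulSoup

html = '''<div class="BodyText">
<a href="/a1">Artist One</a>
<a href="/a2">Artist Two</a>
</div>'''
soup = BeautifulSoup(html, "html.parser")
name_list = soup.find(class_="BodyText")

# contents[0] is the newline NavigableString before the first <a>,
# which is why printing it in a loop only shows blank lines
print(repr(name_list.contents[0]))  # '\n'

# Iterating over the <a> tags themselves yields the names
for a in name_list.find_all("a"):
    print(a.contents[0])
```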
Useful links »
How to find tags with only certain attributes - BeautifulSoup
How to find children of nodes using Beautiful Soup
Python: BeautifulSoup extract text from anchor tag
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
>>>
>>> soup = BeautifulSoup(page.text, 'html.parser')
>>> name_list = soup.findAll("tr", {"valign": "top"})
>>>
>>> for name in name_list:
... print(name.find("a")["href"])
...
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=4910
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3450
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=1986
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3451
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=20099
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3452
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34309
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=27191
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=5846
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3941
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3941
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3453
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=35173
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11133
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3455
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3454
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=961
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11597
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11597
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11631
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3427
>>>
print(bsObj.find(id="mv-content-text").findAll("p")[0])
I use Python 3.6 to practice scraping. The code is from the book Web Scraping with Python. Why can't I use find().findAll()?
Your find(...) returned None because no tag with id="mv-content-text" was found in bsObj.
You can only call findAll on a bs4 Tag or BeautifulSoup object. You can explore what's going on here using a combination of type and hasattr to poke at the returned values inside a REPL:
>>> from bs4 import BeautifulSoup
>>> doc = ['<html><head><title>Page title</title></head>',
... '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
... '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
... '</html>']
...
>>> soup = BeautifulSoup(''.join(doc), "lxml")
>>> tag = soup.find(id="firstpara")
>>> tag
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
>>> type(tag)
bs4.element.Tag
>>> hasattr(tag, 'findAll')
True
Attempting the same, but with a tag that doesn't exist within the HTML soup:
>>> other = soup.find(id="non-existant")
>>> other
>>> type(other)
NoneType
>>> hasattr(other, 'findAll')
False
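A defensive pattern that avoids the AttributeError is to check for None before chaining (the ids below are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div id="mw-content-text"><p>first paragraph</p></div>'
soup = BeautifulSoup(html, "html.parser")

# find() returns None on no match, so check before chaining calls
tag = soup.find(id="mv-content-text")  # this id is not in the HTML above
if tag is None:
    print("tag not found")
else:
    print(tag.findAll("p")[0])
```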
I am trying to append some content into the body of an html page with the Beautiful Soup python library.
>>> from bs4 import BeautifulSoup
>>> doc = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
>>> body = BeautifulSoup("<ol><li>1</li><li>2</li></ol>", "html.parser")
>>> print doc.html.body.ol
None
>>> doc.html.body.append(body)
>>> print doc.html.body.ol
None
After appending, I still am seeing the ol tag as empty.
>>> body.ol
<ol><li>1</li><li>2</li></ol>
>>> doc.html.body
<body><ol><li>1</li><li>2</li></ol></body>
>>>
However, you can see that the content appears to be there if I print the entire body tag. I feel like I do not quite understand the append operation.
Edit:
I do not know why, but it appears that I can append tags but not the root. For instance, doc.html.body.append(body.ol) works as I would expect it to. I can also do for tag in body.children: doc.html.body.append(tag).
My question is why does the root not append?
You should append body.ol, not body. In other words, append a Tag instance, not a BeautifulSoup instance:
>>> from bs4 import BeautifulSoup
>>>
>>> doc = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
>>> body = BeautifulSoup("<ol><li>1</li><li>2</li></ol>", "html.parser")
>>>
>>> doc.html.body.append(body.ol)
>>>
>>> print(doc.html.body.ol)
<ol><li>1</li><li>2</li></ol>
Or, if you don't know which tag is going to be the parent, use body.find().
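A sketch of that body.find() variant: with no arguments, find() returns the first Tag regardless of its name, so you don't need to know what the root element is.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
fragment = BeautifulSoup("<ol><li>1</li><li>2</li></ol>", "html.parser")

# fragment.find() grabs the root <ol> Tag without naming it explicitly
doc.html.body.append(fragment.find())
print(doc.html.body.ol)  # <ol><li>1</li><li>2</li></ol>
```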
You can also switch the parser to html5lib (requires html5lib to be installed):
>>> from bs4 import BeautifulSoup
>>>
>>> doc = BeautifulSoup("<html><head></head><body></body></html>", "html5lib")
>>> body = BeautifulSoup("<ol><li>1</li><li>2</li></ol>", "html5lib")
>>>
>>> doc.html.body.append(body)
>>> print(doc.html.body.ol)
<ol><li>1</li><li>2</li></ol>
<h2 class="hello-word"><a href="http://www.google.com">Google</a></h2>
How do I grab the value of the a tag (Google)?
print soup.select("h2 > a")
returns the entire a tag, and I just want the text. Also, there could be multiple h2s on the page. How do I filter for the one with the class hello-word?
You can use .hello-word on h2 in the CSS selector to select only h2 tags with class hello-word, and then select their child a. Also, soup.select() returns a list of all possible matches, so you can easily iterate over it and call each element's .text to get the text. Example -
for i in soup.select("h2.hello-word > a"):
print(i.text)
Example/Demo (I added a few of my own elements, one with a slightly different class, to show how the selector works) -
>>> from bs4 import BeautifulSoup
>>> s = """<h2 class="hello-word"><a href="http://www.google.com">Google</a></h2>
... <h2 class="hello-word"><a href="http://www.google.com">Google12</a></h2>
... <h2 class="hello-word2"><a href="http://www.google.com">Google13</a></h2>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> for i in soup.select("h2.hello-word > a"):
... print(i.text)
...
Google
Google12
Try this:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<h2 class="hello-word"><a href="http://www.google.com">Google</a></h2>', 'html.parser')
>>> soup.text
'Google'
You can also use the lxml.html library instead:
>>> import lxml.html
>>> from lxml.cssselect import CSSSelector
>>> txt = '<h2 class="hello-word"><a href="http://www.google.com">Google</a></h2>'
>>> tree = lxml.html.fromstring(txt)
>>> sel = CSSSelector('h2 > a')
>>> element = sel(tree)[0]
>>> element.text
'Google'