I am trying to append some content into the body of an html page with the Beautiful Soup python library.
>>> from bs4 import BeautifulSoup
>>> doc = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
>>> body = BeautifulSoup("<ol><li>1</li><li>2</li></ol>", "html.parser")
>>> print doc.html.body.ol
None
>>> doc.html.body.append(body)
>>> print doc.html.body.ol
None
After appending, looking up the ol tag still returns None.
>>> body.ol
<ol><li>1</li><li>2</li></ol>
>>> doc.html.body
<body><ol><li>1</li><li>2</li></ol></body>
>>>
However, you can see that the content appears to be there if I print the entire body tag. I feel like I do not quite understand the append operation.
Edit:
I do not know why, but it appears that I can append tags but not the root. For instance, doc.html.body.append(body.ol) works as I would expect it to. I can also do for tag in body.children: doc.html.body.append(tag).
My question is why does the root not append?
You should append the body.ol and not body. In other words, append a Tag instance and not a BeautifulSoup instance:
>>> from bs4 import BeautifulSoup
>>>
>>> doc = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
>>> body = BeautifulSoup("<ol><li>1</li><li>2</li></ol>", "html.parser")
>>>
>>> doc.html.body.append(body.ol)
>>>
>>> print(doc.html.body.ol)
<ol><li>1</li><li>2</li></ol>
Or, if you don't know in advance which tag is the top-level element of the fragment, use body.find() to get the first one.
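If the fragment has several top-level nodes, one option is to move them across one at a time. This is only a minimal sketch and not part of the answer above; the variable name fragment and the extra <p> element are made up for illustration.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
# Hypothetical fragment with two top-level tags.
fragment = BeautifulSoup("<ol><li>1</li><li>2</li></ol><p>tail</p>", "html.parser")

# Copy the child list first: append() moves each node out of `fragment`,
# which would otherwise change the list while we iterate over it.
for node in list(fragment.contents):
    doc.body.append(node)

print(doc.body.ol)  # <ol><li>1</li><li>2</li></ol>
print(doc.body.p)   # <p>tail</p>
```

Because every appended object is a Tag (or a string) rather than a BeautifulSoup object, this should not depend on how a given bs4 version treats nested BeautifulSoup objects.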
You can also switch the parser to html5lib (requires html5lib to be installed):
>>> from bs4 import BeautifulSoup
>>>
>>> doc = BeautifulSoup("<html><head></head><body></body></html>", "html5lib")
>>> body = BeautifulSoup("<ol><li>1</li><li>2</li></ol>", "html5lib")
>>>
>>> doc.html.body.append(body)
>>> print(doc.html.body.ol)
<ol><li>1</li><li>2</li></ol>
Here is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
soup = BeautifulSoup(page.text, 'html.parser')
name_list = soup.find(class_='BodyText')
name_list_item = name_list.find_all('a')
for i in name_list_item:
    names = name_list.contents[0]
    print(names)
Then I ran it, but nothing showed up in the terminal except blank lines.
Please help!! :<
The problem is in the for loop: you have to extract the content from i, not from name_list.
Your working code should look like this:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
soup = BeautifulSoup(page.text, 'html.parser')
name_list = soup.find(class_='BodyText')
name_list_item = name_list.find_all('a')
for i in name_list_item:
    names = i.contents[0]
    print(names)
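To see the difference without hitting the network, here is a small offline sketch. The HTML snippet and the artist names in it are invented stand-ins for the real page, so treat it as an illustration only.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure of the real page.
html = """
<div class="BodyText">
<a href="/artist/1">First Artist</a>
<a href="/artist/2">Second Artist</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
name_list = soup.find(class_="BodyText")

# The container's first child is just the newline after the opening tag,
# which is why the original loop printed blank lines.
first_child = name_list.contents[0]
print(repr(first_child))  # '\n'

# Reading the text from each <a> gives the names instead.
names = [a.get_text(strip=True) for a in name_list.find_all("a")]
print(names)  # ['First Artist', 'Second Artist']
```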
I suggest using the approach below to get the links.
(The problem with your approach is that it also picks up data you don't want; print the values to check.) There are 32 items of type <class 'bs4.element.NavigableString'>, which hold no child contents, so 32 LF (ASCII value 10) characters are printed.
Useful links:
How to find tags with only certain attributes - BeautifulSoup
How to find children of nodes using Beautiful Soup
Python: BeautifulSoup extract text from anchor tag
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
>>>
>>> soup = BeautifulSoup(page.text, 'html.parser')
>>> name_list = soup.findAll("tr", {"valign": "top"})
>>>
>>> for name in name_list:
...     print(name.find("a")["href"])
...
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=4910
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3450
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=1986
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3451
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=20099
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3452
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34309
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=27191
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=5846
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3941
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3941
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3453
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=35173
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11133
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3455
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3454
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=961
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11597
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11597
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11631
/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3427
>>>
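One caveat with name.find("a")["href"] is that it raises an error for a row that has no link. A hedged variant, run here against a small invented table rather than the real page, skips such rows:

```python
from bs4 import BeautifulSoup

# Invented table: one matching row with a link, one without, one non-matching row.
html = """
<table>
<tr valign="top"><td><a href="/artist?id=1">First</a></td></tr>
<tr valign="top"><td>no link here</td></tr>
<tr valign="middle"><td><a href="/artist?id=2">Second</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for row in soup.find_all("tr", {"valign": "top"}):
    # href=True only matches <a> tags that actually carry an href attribute.
    link = row.find("a", href=True)
    if link is not None:
        hrefs.append(link["href"])

print(hrefs)  # ['/artist?id=1']
```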
Thank you.
I have some html code like this
<p><span class="map-sub-title">abc</span>123</p>
I used BeautifulSoup, and here's my code:
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
p = soup1.text
I get the result 'abc123'
But I want to get the result '123' not 'abc123'
You can use the function decompose() to remove the span tag and then get the text you want.
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "lxml")
for span in soup.find_all("span", {'class':'map-sub-title'}):
    span.decompose()
print(soup.text)
You can also use extract() to remove the unwanted tag before you read the text, as shown below.
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
soup1.p.span.extract()
print(soup1.text)
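The practical difference between the two is that extract() hands the removed tag back, while decompose() destroys it. A minimal sketch, using html.parser here only so that it runs without lxml installed:

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it,
# so the removed content is still available afterwards.
removed = soup.p.span.extract()

print(removed.get_text())  # abc
print(soup.p.get_text())   # 123
```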
Although every response on this thread seems acceptable I shall point out another method for this case:
soup.find("span", {'class':'map-sub-title'}).next_sibling
You can use next_sibling to navigate between elements that are on the same parent, in this case the p tag.
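A slightly more defensive version of the same idea might look like this. It is a sketch that assumes the wanted text directly follows the span; the checks guard against a missing span or a sibling that is a tag rather than a string.

```python
from bs4 import BeautifulSoup, NavigableString

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", {"class": "map-sub-title"})
# next_sibling is None when the span is the last child of its parent.
sibling = span.next_sibling if span is not None else None
text = str(sibling).strip() if isinstance(sibling, NavigableString) else ""

print(text)  # 123
```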
One of the many ways is to use .contents on the parent tag (in this case, <p>).
If you know the position of the string, you can directly use this:
>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup('<p><span class="map-sub-title">abc</span>123</p>', 'lxml')
>>> # check the contents
... soup.find('p').contents
[<span class="map-sub-title">abc</span>, '123']
>>> soup.find('p').contents[1]
'123'
If you want a generalized solution where you don't know the position, you can check whether each item is a NavigableString, like this:
>>> final_text = [x for x in soup.find('p').contents if isinstance(x, NavigableString)]
>>> final_text
['123']
With the second method, you'll be able to get all the text that is directly a child of the <p> tag. For completeness's sake, here's one more example:
>>> html = '''
... <p>
... I want
... <span class="map-sub-title">abc</span>
... foo
... <span class="map-sub-title">abc2</span>
... text
... <span class="map-sub-title">abc3</span>
... only
... </p>
... '''
>>> soup = BeautifulSoup(html, 'lxml')
>>> ' '.join([x.strip() for x in soup.find('p').contents if isinstance(x, NavigableString)])
'I want foo text only'
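An alternative that avoids the isinstance check is to ask find_all() for strings only and switch off recursion. This is a sketch of that option, with a shortened, made-up version of the markup above:

```python
from bs4 import BeautifulSoup

html = "<p>I want <span>abc</span>foo <span>abc2</span>text only</p>"
soup = BeautifulSoup(html, "html.parser")

# string=True matches text nodes; recursive=False restricts the search
# to direct children of <p>, so text inside the spans is skipped.
direct = soup.p.find_all(string=True, recursive=False)
joined = " ".join(s.strip() for s in direct)

print(joined)  # I want foo text only
```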
If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
>>> from bs4 import BeautifulSoup
>>> html = '<p><span class="map-sub-title">abc</span>123</p>'
>>> soup1 = BeautifulSoup(html,"lxml")
>>> soup1.p.strings
<generator object _all_strings at 0x00000008768C50>
>>> list(soup1.strings)
['abc', '123']
>>> list(soup1.strings)[1]
'123'
I've been on this for the last two days non-stop...
I'm trying to get a specific div by its ID using BeautifulSoup as so:
import requests
from bs4 import BeautifulSoup
r = requests.get('www.example.com', cookies=cookies_dict)
soup = BeautifulSoup(r.content, 'html.parser')
div_text = soup.get('div', {'id': 'this_div_id'}).text
print div_text
All I get is a dictionary:
{'id': 'this_div_id'}
Now, I checked to make sure that 'this_div_id' actually is inside of r.content:
>>> 'this_div_id' in r.content
True
I'd be glad to receive any help and suggestions.
Err... Maybe you should check BeautifulSoup documentation again ?-)
Help on method get in module bs4.element:
get(self, key, default=None) unbound bs4.BeautifulSoup method
Returns the value of the 'key' attribute for the tag, or
the value given for 'default' if it doesn't have that
attribute.
I think you want the find() method instead:
>>> html = """<html><body><div><div><div id='this_div_id'>haha</div></div></div>"""
>>> from bs4 import BeautifulSoup
>>> s = BeautifulSoup(html, 'html.parser')
>>> s.find("div")
<div><div><div id="this_div_id">haha</div></div></div>
>>> s.find("div", id="this_div_id")
<div id="this_div_id">haha</div>
>>>
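To make the distinction concrete: find() searches the tree for an element, while get() reads an attribute of a tag you already have. A short sketch, where the class attribute and the data-missing name are invented for the example:

```python
from bs4 import BeautifulSoup

html = "<div><div id='this_div_id' class='box'>haha</div></div>"
soup = BeautifulSoup(html, "html.parser")

# find() locates the element itself; it returns None when nothing matches.
div = soup.find("div", id="this_div_id")
div_text = div.get_text() if div is not None else None

# get() looks up an attribute on that tag, with an optional default.
div_id = div.get("id")
missing = div.get("data-missing", "fallback")

print(div_text)  # haha
print(div_id)    # this_div_id
print(missing)   # fallback
```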
<h2 class="hello-word"><a>Google</a></h2>
How do I grab the value of the a tag (Google)?
print soup.select("h2 > a")
returns the entire a tag and I just want the value. Also, there could be multiple H2s on the page. How do I filter for the one with the class hello-word?
You can use .hello-word on h2 in the CSS selector to select only h2 tags with the class hello-word, and then select its child a. Also, soup.select() returns a list of all matches, so you can iterate over it and read each element's .text to get the text. Example:
for i in soup.select("h2.hello-word > a"):
    print(i.text)
Example/demo (I added a few elements of my own, one with a slightly different class, to show how the selector works):
>>> from bs4 import BeautifulSoup
>>> s = """<h2 class="hello-word"><a>Google</a></h2>
... <h2 class="hello-word"><a>Google12</a></h2>
... <h2 class="hello-word2"><a>Google13</a></h2>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> for i in soup.select("h2.hello-word > a"):
...     print(i.text)
...
Google
Google12
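If only the first match is needed, select_one() returns a single tag (or None) instead of a list. A small sketch with invented markup:

```python
from bs4 import BeautifulSoup

# Made-up markup: one heading with the wanted class, one without.
html = """<h2 class="hello-word"><a>Google</a></h2>
<h2 class="other"><a>Elsewhere</a></h2>"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match or None, so guard before reading text.
link = soup.select_one("h2.hello-word > a")
text = link.get_text() if link is not None else None

# select() returns every match as a list.
all_texts = [a.get_text() for a in soup.select("h2.hello-word > a")]

print(text)       # Google
print(all_texts)  # ['Google']
```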
Try this:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<h2 class="hello-word">Google</h2>', 'html.parser')
>>> soup.text
'Google'
You can also use the lxml.html library instead:
>>> import lxml.html
>>> from lxml.cssselect import CSSSelector
>>> txt = '<h2 class="hello-word"><a>Google</a></h2>'
>>> tree = lxml.html.fromstring(txt)
>>> sel = CSSSelector('h2 > a')
>>> element = sel(tree)[0]
>>> element.text
'Google'
I want to access a tag called "name", such as:
<contact><name>Yesügey</name><lastName>Yeşil</lastName><phone>+90 333 9695395</phone></contact>
Since "name" is a property of a BeautifulSoup tag object, I cannot access the child tag name:
>>> c1
<contact><name>Yesügey</name><lastname>Yeşil</lastname><phone>+90 333 9695395</phone></contact>
>>> c1.name
'contact'
>>> c1.lastname
<lastname>Yeşil</lastname>
You can try it like this:
>>> soup=BeautifulSoup.BeautifulSoup(content).findAll('name')
>>> for field in soup:
...     print field
...
<name>Yesügey</name>
Or
print soup.find('name').string
Here's what I got:
from bs4 import BeautifulSoup as BS
soup = '<contact><name>Yesügey</name><lastName>Yeşil</lastName><phone>+90 333 9695395</phone></contact>'
soup = BS(soup)
print soup.find('name').string
# Prints Yesügey
So instead of calling the name tag, I simply find it and get what's inside it :).
You can use the .find() method:
Examples:
c2.find('name')
<name>Yesügey</name>
c2.find('name').contents
Yesügey
Here are two different strategies for accessing the XML element called name; only the first one works:
>>> xmlstring = '<contact><name>Yesügey</name><lastName>Yeşil</lastName><phone>+90 333 9695395</phone></contact>'
>>> from BeautifulSoup import BeautifulSoup as Soup
>>> f = Soup(xmlstring)
>>> f.find('name')
<name>Yesügey</name>
>>> f.contact.name
u'contact'
>>>
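The same behaviour can be reproduced with bs4. This sketch uses html.parser, which lowercases tag names (so lastName becomes lastname), and is meant only to illustrate the attribute clash:

```python
from bs4 import BeautifulSoup

xml = "<contact><name>Yesügey</name><lastName>Yeşil</lastName></contact>"
soup = BeautifulSoup(xml, "html.parser")
contact = soup.contact

# .name is the tag's own name, not a child lookup.
print(contact.name)                 # contact

# find() looks for a child tag that is literally called "name".
print(contact.find("name").string)  # Yesügey

# Other children are still reachable as attributes.
print(contact.lastname.string)      # Yeşil
```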
Late answer, but I had the same issue when trying to find a <textarea name=COMMENTS>
My solution:
node = soup.find("textarea", attrs={"name": "COMMENTS"})
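The attrs dictionary is needed because the first parameter of find() is itself called name, so a name=... keyword would be read as the tag name rather than the HTML attribute. A small sketch with an invented form:

```python
from bs4 import BeautifulSoup

# Made-up form with two elements that carry a name attribute.
html = '<form><textarea name="COMMENTS">hi</textarea><input name="q"></form>'
soup = BeautifulSoup(html, "html.parser")

# Pass the HTML attribute through attrs to avoid the clash with find(name=...).
node = soup.find("textarea", attrs={"name": "COMMENTS"})
print(node.string)  # hi

# The same works without restricting the tag type.
field = soup.find(attrs={"name": "q"})
print(field.name)   # input
```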