How do I get the value of a soup.select? - python

<h2 class="hello-word"><a href="http://google.com">Google</a></h2>
How do I grab the value of the a tag (Google)?
print soup.select("h2 > a")
returns the entire a tag, and I just want its text. Also, there could be multiple h2s on the page. How do I filter for the one with the class hello-word?

You can use .hello-word on h2 in the CSS selector to match only h2 tags with the class hello-word, and then select their child a. Also, soup.select() returns a list of all matches, so you can easily iterate over it and call each element's .text to get the text. Example -
for i in soup.select("h2.hello-word > a"):
    print(i.text)
Example/Demo (I added a few of my own elements, one with a slightly different class, to show how the selector works) -
>>> from bs4 import BeautifulSoup
>>> s = """<h2 class="hello-word"><a href="http://google.com">Google</a></h2>
... <h2 class="hello-word"><a href="http://google.com">Google12</a></h2>
... <h2 class="hello-word2"><a href="http://google.com">Google13</a></h2>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> for i in soup.select("h2.hello-word > a"):
... print(i.text)
...
Google
Google12

Try this:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<h2 class="hello-word">Google</h2>', 'html.parser')
>>> soup.text
'Google'
You can also use the lxml.html library instead:
>>> import lxml.html
>>> from lxml.cssselect import CSSSelector
>>> txt = '<h2 class="hello-word"><a href="http://google.com">Google</a></h2>'
>>> tree = lxml.html.fromstring(txt)
>>> sel = CSSSelector('h2 > a')
>>> element = sel(tree)[0]
>>> element.text
'Google'

Related

Beautifulsoup get content without next tag

I have some html code like this
<p><span class="map-sub-title">abc</span>123</p>
I used BeautifulSoup, and here's my code:
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
p = soup1.text
I get the result 'abc123'
But I want to get the result '123' not 'abc123'
You can use the decompose() function to remove the span tag and then get the text you want:
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "lxml")
for span in soup.find_all("span", {'class': 'map-sub-title'}):
    span.decompose()
print(soup.text)
You can also use extract() to remove the unwanted tag before getting the text, like below:
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
soup1.p.span.extract()
print(soup1.text)
Although every response on this thread seems acceptable, I shall point out another method for this case:
soup.find("span", {'class':'map-sub-title'}).next_sibling
You can use next_sibling to navigate between elements that share the same parent, in this case the p tag.
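A minimal sketch of that approach, using the same markup as in the question:

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# The text node "123" sits right after the span inside the same <p>,
# so it is the span's next sibling.
text = soup.find("span", {"class": "map-sub-title"}).next_sibling
print(text)  # 123
```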
One of the many ways, is to use contents over the parent tag (in this case it's <p>).
If you know the position of the string, you can directly use this:
>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup('<p><span class="map-sub-title">abc</span>123</p>', 'lxml')
>>> # check the contents
... soup.find('p').contents
[<span class="map-sub-title">abc</span>, '123']
>>> soup.find('p').contents[1]
'123'
If, you want a generalized solution, where you don't know the position, you can check if the type of content is NavigableString like this:
>>> final_text = [x for x in soup.find('p').contents if isinstance(x, NavigableString)]
>>> final_text
['123']
With the second method, you'll be able to get all the text that is directly a child of the <p> tag. For completeness's sake, here's one more example:
>>> html = '''
... <p>
... I want
... <span class="map-sub-title">abc</span>
... foo
... <span class="map-sub-title">abc2</span>
... text
... <span class="map-sub-title">abc3</span>
... only
... </p>
... '''
>>> soup = BeautifulSoup(html, 'lxml')
>>> ' '.join([x.strip() for x in soup.find('p').contents if isinstance(x, NavigableString)])
'I want foo text only'
If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
>>> from bs4 import BeautifulSoup
>>> html = '<p><span class="map-sub-title">abc</span>123</p>'
>>> soup1 = BeautifulSoup(html,"lxml")
>>> soup1.p.strings
<generator object _all_strings at 0x00000008768C50>
>>> list(soup1.strings)
['abc', '123']
>>> list(soup1.strings)[1]
'123'
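Relatedly, if the markup contains stray whitespace, the .stripped_strings generator yields each string with surrounding whitespace removed and skips whitespace-only nodes. A small sketch (the padded markup is my own):

```python
from bs4 import BeautifulSoup

html = '<p>  <span class="map-sub-title"> abc </span>  123  </p>'
soup = BeautifulSoup(html, "html.parser")

# .stripped_strings skips whitespace-only strings and strips the rest.
parts = list(soup.p.stripped_strings)
print(parts)  # ['abc', '123']
```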

python 'NoneType' object has no attribute 'findAll'

print(bsObj.find(id="mv-content-text").findAll("p")[0])
I use Python 3.6 to practice scraping. The code is from the book Web Scraping with Python. Why can't I use find().findAll()?
Your find(...) has returned None because no tag with id="mv-content-text" was found in bsObj.
You can only call findAll on a bs4 Tag object. You can explore what's going on using a combination of type and hasattr to poke at the returned values inside a REPL:
>>> from bs4 import BeautifulSoup
>>> doc = ['<html><head><title>Page title</title></head>',
... '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
... '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
... '</html>']
...
>>> soup = BeautifulSoup(''.join(doc), "lxml")
>>> tag = soup.find(id="firstpara")
>>> tag
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
>>> type(tag)
bs4.element.Tag
>>> hasattr(tag, 'findAll')
True
Attempting the same, but with a tag that doesn't exist within the HTML soup:
>>> other = soup.find(id="non-existent")
>>> other
>>> type(other)
NoneType
>>> hasattr(other, 'findAll')
False
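In practice, the fix is to check for None before chaining. A defensive sketch (the id mv-content-text is from the question; the sample HTML, which deliberately lacks that id, is my own):

```python
from bs4 import BeautifulSoup

html = '<div id="other"><p>hello</p></div>'
soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so guard before chaining.
tag = soup.find(id="mv-content-text")
if tag is not None:
    print(tag.find_all("p")[0])
else:
    print("No tag with id='mv-content-text' found")
```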

Accessing Appended item with Beautiful Soup

I am trying to append some content into the body of an html page with the Beautiful Soup python library.
>>> from bs4 import BeautifulSoup
>>> doc = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
>>> body = BeautifulSoup("<ol><li>1</li><li>2</li></ol>", "html.parser")
>>> print doc.html.body.ol
None
>>> doc.html.body.append(body)
>>> print doc.html.body.ol
None
After appending, I still am seeing the ol tag as empty.
>>> body.ol
<ol><li>1</li><li>2</li></ol>
>>> doc.html.body
<body><ol><li>1</li><li>2</li></ol></body>
>>>
However, you can see that the content appears to be there if I print the entire body tag. I feel like I do not quite understand the append operation.
Edit:
I do not know why, but it appears that I can append tags but not the root. For instance, doc.html.body.append(body.ol) works as I would expect it to. I can also do for tag in body.children: doc.html.body.append(tag).
My question is why does the root not append?
You should append the body.ol and not body. In other words, append a Tag instance and not a BeautifulSoup instance:
>>> from bs4 import BeautifulSoup
>>>
>>> doc = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
>>> body = BeautifulSoup("<ol><li>1</li><li>2</li></ol>", "html.parser")
>>>
>>> doc.html.body.append(body.ol)
>>>
>>> print(doc.html.body.ol)
<ol><li>1</li><li>2</li></ol>
Or, if you don't know which tag is going to be the parent, use body.find().
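A quick sketch of that variant: calling find() with no arguments returns the first tag in the parsed fragment, whatever its name, so nothing is hard-coded (the ul fragment is my own):

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
body = BeautifulSoup("<ul><li>1</li><li>2</li></ul>", "html.parser")

# body.find() returns the first Tag regardless of its name (<ul> here),
# and appending a Tag (not a BeautifulSoup object) works as expected.
doc.html.body.append(body.find())
print(doc.html.body.ul)  # <ul><li>1</li><li>2</li></ul>
```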
You can also switch the parser to html5lib (requires html5lib to be installed):
>>> from bs4 import BeautifulSoup
>>>
>>> doc = BeautifulSoup("<html><head></head><body></body></html>", "html5lib")
>>> body = BeautifulSoup("<ol><li>1</li><li>2</li></ol>", "html5lib")
>>>
>>> doc.html.body.append(body)
>>> print(doc.html.body.ol)
<ol><li>1</li><li>2</li></ol>

Get immediate parent tag with BeautifulSoup in Python

I've researched this question but haven't seen an actual solution. I'm using BeautifulSoup with Python, and what I'm looking to do is get all image tags from a page, loop through each, and check whether its immediate parent is an anchor tag.
Here's some pseudo code:
html = BeautifulSoup(responseHtml)
for image in html.findAll('img'):
    if image.parent.name == 'a':
        image.hasParent = image.parent.link
Any ideas on this?
You need to check parent's name:
for img in soup.find_all('img'):
    if img.parent.name == 'a':
        print "Parent is a link"
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <body>
... <a href="http://example.com"><img src="image.png"/></a>
... </body>
... """
>>> soup = BeautifulSoup(data)
>>> img = soup.img
>>>
>>> img.parent.name
'a'
You can also retrieve the img tags that have a direct a parent using a CSS selector:
soup.select('a > img')
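A short demo of the selector approach, with one linked and one unlinked image (the markup is my own mock-up):

```python
from bs4 import BeautifulSoup

data = """
<body>
<a href="http://example.com"><img src="linked.png"/></a>
<img src="plain.png"/>
</body>
"""
soup = BeautifulSoup(data, "html.parser")

# 'a > img' matches only <img> tags whose direct parent is an <a>.
linked = [img['src'] for img in soup.select('a > img')]
print(linked)  # ['linked.png']
```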

How to access a tag called "name" in BeautifulSoup

I want to access a tag called as "name" such as:
<contact><name>Yesügey</name><lastName>Yeşil</lastName><phone>+90 333 9695395</phone></contact>
Since "name" is a property of a BeautifulSoup tag object, I cannot access the child tag name:
>>> c1
<contact><name>Yesügey</name><lastname>Yeşil</lastname><phone>+90 333 9695395</phone></contact>
>>> c1.name
'contact'
>>> c1.lastname
<lastname>Yeşil</lastname>
You can try it like this:
>>> soup=BeautifulSoup.BeautifulSoup(content).findAll('name')
>>> for field in soup:
... print field
...
<name>Yesügey</name>
Or
print soup.find('name').string
Here's what I got:
from bs4 import BeautifulSoup as BS
soup = '<contact><name>Yesügey</name><lastName>Yeşil</lastName><phone>+90 333 9695395</phone></contact>'
soup = BS(soup)
print soup.find('name').string
# Prints Yesügey
So instead of calling the name tag, I simply find it and get what's inside it :).
You can use the .find() method:
Examples:
c2.find('name')
<name>Yesügey</name>
c2.find('name').contents
[u'Yesügey']
Two different strategies for accessing the XML element name are described below:
>>> xmlstring = '<contact><name>Yesügey</name><lastName>Yeşil</lastName><phone>+90 333 9695395</phone></contact>'
>>> from BeautifulSoup import BeautifulSoup as Soup
>>> f = Soup(xmlstring)
>>> f.find('name')
<name>Yesügey</name>
>>> f.contact.name
u'contact'
>>>
Late answer, but I had the same issue when trying to find a <textarea name=COMMENTS>.
My solution:
node = soup.find("textarea", attrs={"name": "COMMENTS"})
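A runnable sketch of that workaround (the surrounding form markup is my own mock-up). Passing the reserved word "name" through the attrs dict avoids the clash with the Tag.name property:

```python
from bs4 import BeautifulSoup

html = '<form><textarea name="COMMENTS">Some feedback</textarea></form>'
soup = BeautifulSoup(html, "html.parser")

# attrs={"name": ...} matches on the HTML attribute, not Tag.name.
node = soup.find("textarea", attrs={"name": "COMMENTS"})
print(node.string)  # Some feedback
```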
