BeautifulSoup - How to extract text after specified string - python

I have HTML like:
<tr>
<td>Title:</td>
<td>Title value</td>
</tr>
I need to specify the <td> by its text and then grab the text of the following <td>. Something like: grab the text of the first <td> that follows the <td> containing the text Title:. The result should be: Title value
I have some basic understanding of Python and BeautifulSoup, but I have no idea how I can do this when there is no class to specify.
I have tried this:
row = soup.find_all('td', string='Title:')
text = str(row.nextSibling)
print(text)
and I receive error: AttributeError: 'ResultSet' object has no attribute 'nextSibling'

First of all, soup.find_all() returns a ResultSet containing all elements with the tag td and the string Title:.
For each element in that result set, you need to get the nextSibling separately. Also, you should keep looping until you find a sibling whose tag is td, since there can be other elements in between (such as a NavigableString).
Example -
>>> from bs4 import BeautifulSoup
>>> s="""<tr>
... <td>Title:</td>
... <td>Title value</td>
... </tr>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> row = soup.find_all('td', string='Title:')
>>> for r in row:
...     nextSib = r.nextSibling
...     while nextSib is not None and nextSib.name != 'td':
...         nextSib = nextSib.nextSibling
...     print(nextSib.text)
...
Title value
Alternatively, you can use a library that supports XPath, such as lxml or xml.etree.ElementTree; with XPath this is easy to do.
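For instance, here is a minimal XPath sketch with lxml for this exact snippet (the variable names are illustrative, and it assumes the fragment is well-formed enough to be parsed as XML):
from lxml import etree

snippet = """<tr>
<td>Title:</td>
<td>Title value</td>
</tr>"""

tree = etree.fromstring(snippet)
# select the first <td> that follows the <td> whose text is exactly "Title:"
values = tree.xpath('//td[text()="Title:"]/following-sibling::td[1]/text()')
print(values)  # ['Title value']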

What you're intending to do is relatively easy with lxml using XPath. You can try something like this:
from lxml import etree
tree = etree.parse(<your file>)
path_list = tree.xpath('//<xpath to td>')
for i in range(len(path_list)):
    if path_list[i].text == '<What you want>' and i != len(path_list) - 1:
        your_text = path_list[i + 1].text

Related

How to check if BeautifulSoup tag is a certain tag?

If I find a certain tag using BeautifulSoup:
styling = paragraphs.find_all('w:rpr')
I look at the next tag. I only want to use that tag if it is a <w:t> tag. How do I check what type of tag the next tag is?
I tried element.find_next_sibling().startswith('<w:t') for the element but it says NoneType object is not callable. I also tried element.find_next_sibling().find_all('<w:t>') but it doesn't return anything.
for element in styling:
    next = element.find_next_sibling()
    if ...:  # next is a <w:t> tag
        ...
I am using BeautifulSoup and would like to stick with it, without adding ElementTree or another parser, if this is possible with bs4.
Using item.name you can see a tag's name.
The problem is that between tags there are NavigableString elements, which are also treated as siblings and whose name is None.
You have to skip these elements, or you can iterate over all siblings with a for loop, find the first <w:t>, and exit the loop with break.
from bs4 import BeautifulSoup as BS

text = '''<div>
<w:rpr></w:rpr>
<w:t>A</w:t>
</div>'''

soup = BS(text, 'html.parser')

all_wrpr = soup.find_all('w:rpr')

for wrpr in all_wrpr:

    next_tag = wrpr.next_sibling
    print('name:', next_tag.name)  # None

    next_tag = wrpr.next_sibling.next_sibling
    #next_tag = next_tag.next_sibling
    print('name:', next_tag.name)  # w:t
    print('text:', next_tag.text)  # A

    #name: None
    #name: w:t
    #text: A

    print('---')

    all_siblings = wrpr.next_siblings
    for item in all_siblings:
        if item.name == 'w:t':
            print('name:', item.name)  # w:t
            print('text:', item.text)  # A
            break  # exit after first <w:t>

    #name: w:t
    #text: A
EDIT: If you test the code with HTML formatted a little differently,
text = '''<div>
<w:rpr></w:rpr><w:t>A</w:t>
</div>'''
then there will be no NavigableString between the tags, so the first method will fail, but the second method will still work.
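A possibly more robust alternative (just a sketch, not part of the original answer): find_next_sibling() with a tag name skips the in-between NavigableString elements automatically, so it works with either formatting.
from bs4 import BeautifulSoup

text = '''<div>
<w:rpr></w:rpr><w:t>A</w:t>
</div>'''

soup = BeautifulSoup(text, 'html.parser')
for wrpr in soup.find_all('w:rpr'):
    # the name filter makes BeautifulSoup return the next sibling *tag* called w:t,
    # skipping any NavigableString that may sit between the two tags
    next_wt = wrpr.find_next_sibling('w:t')
    if next_wt is not None:
        print('text:', next_wt.text)  # A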

Is there any strict findAll function in BeautifulSoup?

I am using Python 2.7 and BeautifulSoup.
Apologies if I am unable to explain what exactly I want
There is this HTML page in which data is embedded in a specific structure.
I want to pull the data, ignoring the first block.
But the problem is when I do-
self.tab = soup.findAll("div","listing-row")
It also gives me the first block, which is actually the unwanted HTML block:
("div","listing-row wide-featured-listing")
I am not using
soup.find("div","listing-row")
since I want all elements with the class "listing-row" on the entire page, not just the first.
How can I ignore the class named "listing-row wide-featured-listing"?
Help/Guidance in any form is appreciated. Thanks a lot !
Or, you may use a CSS selector to match the class attribute exactly against listing-row:
soup.select("div[class=listing-row]")
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <div class="listing-row">result1</div>
... <div class="listing-row wide-featured-listing">result2</div>
... <div class="listing-row">result3</div>
... </div>
... """
>>>
>>> soup = BeautifulSoup(data, "html.parser")
>>> print [row.text for row in soup.select("div[class=listing-row]")]
[u'result1', u'result3']
You could just filter out that element:
self.tab = [el for el in soup.find_all('div', class_='listing-row')
            if 'wide-featured-listing' not in el['class']]
You could use a custom function:
self.tab = soup.find_all(lambda e: e.name == 'div' and
                         'listing-row' in e.get('class', []) and
                         'wide-featured-listing' not in e.get('class', []))

How do I get the value of a soup.select?

<h2 class="hello-word">Google</h2>
How do I grab the value of the a tag (Google)?
print soup.select("h2 > a")
returns the entire a tag and I just want the value. Also, there could be multiple H2s on the page. How do I filter for the one with the class hello-word?
You can use .hello-word on h2 in the CSS selector to select only h2 tags with the class hello-word, and then select its child a. Also, soup.select() returns a list of all matches, so you can easily iterate over it and call each element's .text to get the text. Example -
for i in soup.select("h2.hello-word > a"):
    print(i.text)
Example/Demo (I added a few elements of my own, one with a slightly different class, to show how the selector works) -
>>> from bs4 import BeautifulSoup
>>> s = """<h2 class="hello-word">Google</h2>
... <h2 class="hello-word">Google12</h2>
... <h2 class="hello-word2">Google13</h2>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> for i in soup.select("h2.hello-word > a"):
... print(i.text)
...
Google
Google12
Try this:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<h2 class="hello-word">Google</h2>', 'html.parser')
>>> soup.text
'Google'
You can also use the lxml.html library instead:
>>> import lxml.html
>>> from lxml.cssselect import CSSSelector
>>> txt = '<h2 class="hello-word"><a>Google</a></h2>'
>>> tree = lxml.html.fromstring(txt)
>>> sel = CSSSelector('h2 > a')
>>> element = sel(tree)[0]
>>> element.text
'Google'

Get an attribute value based on the name attribute with BeautifulSoup

I want to print an attribute value based on its name, take for example
<META NAME="City" content="Austin">
I want to do something like this
soup = BeautifulSoup(f)  # f is some HTML containing the above meta tag
for meta_tag in soup("meta"):
    if meta_tag["name"] == "City":
        print(meta_tag["content"])
The above code gives a KeyError: 'name'. I believe this is because name is used by BeautifulSoup, so it can't be used as a keyword argument.
It's pretty simple, use the following:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<META NAME="City" content="Austin">')
>>> soup.find("meta", {"name":"City"})
<meta name="City" content="Austin" />
>>> soup.find("meta", {"name":"City"})['content']
'Austin'
theharshest answered the question, but here is another way to do the same thing.
Also, in your example you have NAME in caps, while in your code you have name in lowercase.
s = '<div class="question" id="get attrs" name="python" x="something">Hello World</div>'
soup = BeautifulSoup(s)
attributes_dictionary = soup.find('div').attrs
print attributes_dictionary
# prints: {'id': 'get attrs', 'x': 'something', 'class': ['question'], 'name': 'python'}
print attributes_dictionary['class'][0]
# prints: question
print soup.find('div').get_text()
# prints: Hello World
6 years late to the party, but I've been searching for how to extract an HTML element's attribute value, so for:
<span property="addressLocality">Ayr</span>
I want "addressLocality". I kept being directed back here, but the answers didn't really solve my problem.
How I managed to do it eventually:
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs('<span property="addressLocality">Ayr</span>', 'html.parser')
>>> my_attributes = soup.find().attrs
>>> my_attributes
{u'property': u'addressLocality'}
As it's a dict, you can then also use keys() and values():
>>> my_attributes.keys()
[u'property']
>>> my_attributes.values()
[u'addressLocality']
Hopefully it helps someone else!
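A more direct variant of the same idea (just a sketch): a Tag can be indexed like a dict, so you can read a single attribute's value without going through attrs.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span property="addressLocality">Ayr</span>', 'html.parser')
span = soup.find('span')
print(span['property'])            # addressLocality
print(span.get('missing', 'n/a'))  # .get() avoids a KeyError when the attribute is absent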
theharshest's answer is the best solution, but FYI the problem you were encountering has to do with the fact that a Tag object in Beautiful Soup acts like a Python dictionary. If you access tag['name'] on a tag that doesn't have a 'name' attribute, you'll get a KeyError.
The following works:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<META NAME="City" content="Austin">', 'html.parser')
metas = soup.find_all("meta")
for meta in metas:
    print meta.attrs['content'], meta.attrs['name']
One can also try this solution, to find a value written in a span of a table:
htmlContent
<table>
<tr>
<th>
ID
</th>
<th>
Name
</th>
</tr>
<tr>
<td>
<span name="spanId" class="spanclass">ID123</span>
</td>
<td>
<span>Bonny</span>
</td>
</tr>
</table>
Python code
soup = BeautifulSoup(htmlContent, "lxml")
soup.prettify()
tables = soup.find_all("table")
for table in tables:
    storeValueRows = table.find_all("tr")
    thValue = storeValueRows[0].find_all("th")[0].string
    if thValue.strip() == "ID":  # with this condition I am verifying that this is the HTML block I wanted
        value = storeValueRows[1].find_all("span")[0].string
        value = value.strip()
        # storeValueRows[1] represents the <tr> tag at index 1 of the table,
        # find_all("span")[0] gives the first <span> tag, and .string gives its value
        # value.strip() removes whitespace from the start and end of the string

        # find using an attribute:
        value = storeValueRows[1].find("span", {"name": "spanId"})['class']
        print value
        # this will print ['spanclass']
If tdd is the tag parsed from '<td class="abc"> 75</td>', then in BeautifulSoup:
if tdd.has_attr('class'):
    print(tdd.attrs['class'][0])
Result: abc

How to find tags with only certain attributes - BeautifulSoup

How would I, using BeautifulSoup, search for tags containing ONLY the attributes I search for?
For example, I want to find all <td valign="top"> tags.
The following code:
raw_card_data = soup.fetch('td', {'valign':re.compile('top')})
gets all of the data I want, but also grabs any <td> tag that has the attribute valign:top
I also tried:
raw_card_data = soup.findAll(re.compile('<td valign="top">'))
and this returns nothing (probably because of bad regex)
I was wondering if there was a way in BeautifulSoup to say "Find <td> tags whose only attribute is valign:top"
UPDATE
For example, if an HTML document contained the following <td> tags:
<td valign="top">.....</td>
<td width="580" valign="top">.......</td>
<td>.....</td>
I would want only the first <td> tag (<td valign="top">) to be returned.
As explained in the BeautifulSoup documentation, you may use this:
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})
EDIT: To return tags that have only the valign="top" attribute, you can check the length of the tag's attrs property:
from BeautifulSoup import BeautifulSoup
html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})
for result in results:
    if len(result.attrs) == 1:
        print result
That returns:
<td valign="top">.....</td>
You can use lambda functions in findAll, as explained in the documentation. So in your case, to search for td tags whose only attribute is valign = "top", use the following:
td_tag_list = soup.findAll(
    lambda tag: tag.name == "td" and
                len(tag.attrs) == 1 and
                tag["valign"] == "top")
If you want to search only by attribute name, with any value:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html.text, 'lxml')
results = soup.findAll("td", {"valign" : re.compile(r".*")})
As per Steve Lorimer, it's better to pass True instead of a regex:
results = soup.findAll("td", {"valign" : True})
The easiest way to do this is with the new CSS style select method:
soup = BeautifulSoup(html)
results = soup.select('td[valign="top"]')
Just pass it as an argument of findAll:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("""
... <html>
... <head><title>My Title!</title></head>
... <body><table>
... <tr><td>First!</td>
... <td valign="top">Second!</td></tr>
... </table></body></html>
... """)
>>>
>>> soup.findAll('td')
[<td>First!</td>, <td valign="top">Second!</td>]
>>>
>>> soup.findAll('td', valign='top')
[<td valign="top">Second!</td>]
Adding a combination of Chris Redford's and Amr's answer, you can also search for an attribute name with any value with the select command:
from bs4 import BeautifulSoup as Soup
html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = Soup(html, 'lxml')
results = soup.select('td[valign]')
If you are looking to pull all tags where a particular attribute is present at all, you can use the same code as the accepted answer, but instead of specifying a value for the tag, just put True.
soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : True})
This will return all td tags that have valign attributes. This is useful if your project involves pulling info from a tag like div that is used all over, but can handle very specific attributes that you might be looking for.
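A quick sketch of how the True filter behaves against the three example <td> tags from the question (bs4 assumed):
from bs4 import BeautifulSoup

html = '<td valign="top">.....</td><td width="580" valign="top">.......</td><td>.....</td>'
soup = BeautifulSoup(html, 'html.parser')
matches = soup.find_all('td', {'valign': True})
print(len(matches))  # 2 -- only the two <td> tags that carry a valign attribute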
Find using an attribute, in any tag:
<th class="team" data-sort="team">Team</th>
soup.find_all(attrs={"class": "team"})
<th data-sort="team">Team</th>
soup.find_all(attrs={"data-sort": "team"})
