Can anyone help me traverse an html tree with beautiful soup?
I'm trying to parse through html output and after gather each value then insert into a table named Tld with python/django
<div class="rc" data-hveid="53">
<h3 class="r">
Billing: Portal Home
</h3>
And only parse the value of href attribute of <a>, so only this part:
https://billing.anapp.com/
of:
Billing: Portal Home
I currently have:
for url in urls:
mb.open(url)
beautifulSoupObj = BeautifulSoup(mb.response().read())
beautifulSoupObj.find_all('h3',attrs={'class': 'r'})
The problem is find_all above, isn't make it far enough to the <a> element.
Any help is much appreciated.
Thank you.
from bs4 import BeautifulSoup
html = """
<div class="rc" data-hveid="53">
<h3 class="r">
Billing: Portal Home
</h3>
"""
bs = BeautifulSoup(html)
elms = bs.select("h3.r a")
for i in elms:
print(i.attrs["href"])
prints:
https://billing.anapp.com/
h3.r a is a css selector
you can use css selector (i prefer them), xpath, or find in elements. the selector h3.r a will look for all h3 with class r and get from inside them the a elements. it could be a more complicated example like #an_id table tr.the_tr_class td.the_td_class it will find an id given td's inside that belong to the tr with the given class and are inside a table of course.
this will also give you the same result. find_all returns a list of bs4.element.Tag, find_all has a recursive field not sure if you can do it in one line, i personaly prefer css selector because its easy and clean.
for elm in bs.find_all('h3',attrs={'class': 'r'}):
for a_elm in elm.find_all("a"):
print(a_elm.attrs["href"])
I think it's worth mentioning what would happen in case there were similarly named classes that contain spaces.
Taking a piece of code that #Foo Bar User provided and changing it a little
from bs4 import BeautifulSoup
html = """
<div class="rc" data-hveid="53">
<h3 class="r s">
Billing: Portal Home
</h3>
<h3 class='r s sth s'>
Don't grab this
</h3>
"""
bs = BeautifulSoup(html)
when we try to get just the link where class equals 'r s' by css selectors:
elms = bs.select("h3.r.s a")
for i in elms:
print(i.attrs["href"])
it prints
https://billing.anapp.com/
https://link_you_dont_want.com/
however using
for elm in bs.find_all('h3',attrs={'class': 'r s'}):
for a_elm in elm.find_all("a"):
print(a_elm.attrs["href"])
gives the desired result
https://billing.anapp.com/
That's just something I've encountered during my own work. If there is a way to overcome this using css selectors, please let me know!
Related
Currently trying to scrape the link and title from the following piece of HTML and cannot seem to find any way of doing it despite reading the scrapy docs for a while.
<h3 class="data">
</h3>
Whats the best way of doing this? Also I should note that there are many of these <h3> elements on the page with the same class but different <a> tags that I want to scrape.
Thanks in advance!
To get all the url within the h3 tags, you can use e.g
from scrapy import Selector
sel = Selector(text='''<h3 class="data">
</h3>''')
print(sel.css('h3.data > a::attr(href)').extract()) # you can use this
Output:
['example.com']
I am scraping a webpage using BeautifulSoup, and there is a piece of information I want that is contained in a <div> and sometimes only has a value for class and sometimes only has a value for style like below:
<div class="text-one">
Text I want
</div>
<div style="display-style">
Text I want
</div>
Using Selenium, I would grab be able to grab the text I want, regardless of how it's formatted on the page, by doing this:
driver.find_element_by_xpath(
".//div[contains(#class, 'text-one') or contains(#style, 'display-style')]"
).text
Right now I have a work around where I have an if statement to determine which selector to use to find the desired text (e.g. I do a string search of the raw HTML like:
if "<div style" in str(rawhtml):
want = soup.find("div", {"style": "display-style"}).text
else:
want = soup.find("div", {"class": "text-one"}).text
Is there an equivalent to the Selenium call I have above in BeautifulSoup? Or is determining the correct selector using an if-statement the only option?
You can use css OR syntax to specify to match either of those patterns.The "," is the OR operator. The [] indicates attribute selector and . class selector.
data = [i.text for i in soup.select("div.text-one, div[style='display-style']")]
I believe there's no support for xpath in beautifulsoup, only for css selectors. If you are heavily invested in xpaths, the similar library lxml could be used instead:
from lxml import html
dom = html.fromstring('<html><div class="text-one">test i want1</div><div style="display-style">text i want2</div></html>','html.parser')
selection = dom.xpath(".//div[contains(#class, 'text-one') or contains(#style, 'display-style')]")
[n.text for n in selection]
Response: ['test i want1', 'text i want2']
I am trying to extract specific nested div class and the corresponding h3 attribute (salary value).
So, I have tried the search by class method
soup.find_all('div',{'class':"vac_display_field"}
which returns an empty list.
Snippet code:
<div class="vac_display_field">
<h3>
Salary
</h3>
<div class="vac_display_field_value">
£27,951 - £30,859
</div>
</div>
Example here
First make sure you've instantiated your BeautifulSoup object correctly. Should look something like this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.civilservicejobs.service.gov.uk/csr/index.cgi?SID=cGFnZWNsYXNzPUpvYnMmb3duZXJ0eXBlPWZhaXImY3NvdXJjZT1jc3FzZWFyY2gmcGFnZWFjdGlvbj12aWV3dmFjYnlqb2JsaXN0JnNlYXJjaF9zbGljZV9jdXJyZW50PTEmdXNlcnNlYXJjaGNvbnRleHQ9MjczMzIwMTcmam9ibGlzdF92aWV3X3ZhYz0xNTEyMDAwJm93bmVyPTUwNzAwMDAmcmVxc2lnPTE0NzcxNTIyODItYjAyZmM4ZTgwNzQ2ZTA2NmY5OWM0OTBjMTZhMWNlNjhkZDMwZDU4NA=='
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser') # the 'html.parser' part is optional.
Your code used to scrape the div tags looks correct (it's missing a closing parentheses, however). If, for some reason it still hasn't worked, try calling your find_all() method in this way:
soup.find_all('div', class_='vac_display_field')
If you look at the page's code, upon inspecting you'll find that the div tag you need is the second from the top:
Thus, your code can reflect that, using simple index notation:
Salary_info = soup.find_all(class_='vac_display_field')[1]
Then output the text:
for info in Salary_info:
print info.get_text()
HTH.
I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div> that has no class.
The first text extract I want is within an <a> tag which is within an <h4> tag. This I managed to extract it.
The second text extract immediately follows the closing h4 tag </h4> and is followed by a <br> tag.
The third text extract immediately follows the <br> tag after the second text extract and is also followed by a <br> tag.
Here the html extract I work with:
<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start but I need to relove all the whitespaces and unecessary characters. I tried using .strip(),.strings and .stripped_strings but it doesn't work. Examples:
for a_tag in classActorboxLink.strings
for a_tag in classActorboxLink.stripped_strings
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'
Locate the h4 element and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
for text in h4.find_next_siblings(text=True):
print(text.strip())
If you don't need each of the 3 elements you are looking for in different variables you could just use the get_text() function on the <div> to get them all in one string. If there are other div tags but they all have classes you can find all the <div> with class=false. If you can't isolate the <div> that you are interested in then this solution won't work for you.
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
for name in soup.find_all("div", class=false)
print name.get_text().strip()
BTW this is python 3 & bs4
I am attempting to use BeautifulSoup to parse through a DOM tree and extract the names of authors. Below is a snippet of HTML to show the structure of the code I'm going to scrape.
<html>
<body>
<div class="list-authors">
<span class="descriptor">Authors:</span>
Dacheng Lin,
Ronald A. Remillard,
Jeroen Homan
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
A.G. Kosovichev
</div>
<!--There are many other div tags with this structure-->
</body>
</html>
My point of confusion is that when I do soup.find, it finds the first occurrence of the div tag that I'm searching for. After that, I search for all 'a' link tags. At this stage, how do I extract the authors names from each of the link tags and print them out? Is there a way to do it using BeautifulSoup or do I need to use Regex? How do I continue iterating over every other other div tag and extract the authors names?
import re
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
try:
authordiv = soup.find('div', attrs={'class': 'list-authors'})
links=tds.findAll('a')
for link in links:
print ''.join(link[0].contents)
#Iterate through entire page and print authors
except IOError:
print 'IO error'
just use findAll for the divs link you do for the links
for authordiv in soup.findAll('div', attrs={'class': 'list-authors'}):
Since link is already taken from an iterable, you don't need to subindex link -- you can just do link.contents[0].
print link.contents[0] with your new example with two separate <div class="list-authors"> yields:
Dacheng Lin
Ronald A. Remillard
Jeroen Homan
A.G. Kosovichev
So I'm not sure I understand the comment about searching other divs. If they are different classes, you will either need to do a separate soup.find and soup.findAll, or just modify your first soup.find.