Parse href attribute value from element with Beautifulsoup and Mechanize - python

Can anyone help me traverse an HTML tree with BeautifulSoup?
I'm trying to parse HTML output, gather each value, and then insert it into a table named Tld with Python/Django.
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
</div>
And only parse the value of the href attribute of the <a>, so only this part:
https://billing.anapp.com/
of:
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
I currently have:
for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3', attrs={'class': 'r'})
The problem is that find_all above doesn't make it far enough down to reach the <a> element.
Any help is much appreciated.
Thank you.

from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])
prints:
https://billing.anapp.com/
h3.r a is a CSS selector.
You can use CSS selectors (I prefer them), XPath, or find calls on elements. The selector h3.r a looks for all h3 elements with class r and gets the a elements inside them. A more complicated example would be #an_id table tr.the_tr_class td.the_td_class: it finds, under the element with the given id, the td's with the given class that belong to a tr with the given class, inside a table of course.
This will also give you the same result. find_all returns a list of bs4.element.Tag. find_all also has a recursive argument; I'm not sure if you can do it in one line. I personally prefer CSS selectors because they're easy and clean.
for elm in bs.find_all('h3', attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
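As for the "one line" question: the two nested loops can be collapsed into a single list comprehension. A minimal sketch over the same snippet (the <a> element is reconstructed from the URL the question quotes):

```python
from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# one list comprehension instead of two nested find_all loops
hrefs = [a["href"]
         for h3 in soup.find_all("h3", class_="r")
         for a in h3.find_all("a")]
print(hrefs)
```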

I think it's worth mentioning what happens when there are similarly named classes that contain spaces.
Taking the piece of code that @Foo Bar User provided and changing it a little:
from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r s">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
<h3 class="r s sth s">
<a href="https://link_you_dont_want.com/">Don't grab this</a>
</h3>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
When we try to get just the link whose class equals 'r s' via CSS selectors:
elms = bs.select("h3.r.s a")
for i in elms:
    print(i.attrs["href"])
it prints
https://billing.anapp.com/
https://link_you_dont_want.com/
however, using
for elm in bs.find_all('h3', attrs={'class': 'r s'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
gives the desired result
https://billing.anapp.com/
That's just something I've encountered during my own work. If there is a way to overcome this using css selectors, please let me know!
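One possible answer to the closing question: Soup Sieve (the selector engine bundled with bs4) supports attribute selectors, and [class="r s"] compares the literal attribute string rather than the set of classes. A sketch, with the hrefs taken from the outputs shown above:

```python
from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r s"><a href="https://billing.anapp.com/">Billing: Portal Home</a></h3>
<h3 class="r s sth s"><a href="https://link_you_dont_want.com/">Don't grab this</a></h3>
</div>
"""
bs = BeautifulSoup(html, "html.parser")

# [class="r s"] matches the full attribute value, so 'r s sth s' is excluded
hrefs = [a["href"] for a in bs.select('h3[class="r s"] a')]
print(hrefs)
```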

Related

Scrapy, get a href from inside a H3 tag?

Currently trying to scrape the link and title from the following piece of HTML, and I cannot seem to find any way of doing it despite reading the Scrapy docs for a while.
<h3 class="data">
<a href="example.com">Example Title</a>
</h3>
Whats the best way of doing this? Also I should note that there are many of these <h3> elements on the page with the same class but different <a> tags that I want to scrape.
Thanks in advance!
To get all the urls within the h3 tags, you can use e.g.
from scrapy import Selector

sel = Selector(text='''<h3 class="data">
<a href="example.com">Example Title</a>
</h3>''')
print(sel.css('h3.data > a::attr(href)').extract())  # you can use this
Output:
['example.com']
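The question also asked for the title, not just the link. For anyone doing the same extraction with BeautifulSoup instead of Scrapy, a sketch grabbing both (the anchor text "Example Title" is a placeholder, since the original snippet doesn't show it):

```python
from bs4 import BeautifulSoup

html = '<h3 class="data"><a href="example.com">Example Title</a></h3>'
soup = BeautifulSoup(html, "html.parser")

# collect (href, link text) pairs from every h3.data on the page
results = [(a["href"], a.get_text(strip=True))
           for a in soup.select("h3.data > a")]
print(results)
```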

Can you write a css-selector in BeautifulSoup that uses either the class or style to identify the desired info in a div?

I am scraping a webpage using BeautifulSoup, and there is a piece of information I want that is contained in a <div> and sometimes only has a value for class and sometimes only has a value for style like below:
<div class="text-one">
Text I want
</div>
<div style="display-style">
Text I want
</div>
Using Selenium, I would be able to grab the text I want, regardless of how it's formatted on the page, by doing this:
driver.find_element_by_xpath(
    ".//div[contains(@class, 'text-one') or contains(@style, 'display-style')]"
).text
Right now I have a workaround where I use an if statement to determine which selector to use to find the desired text, e.g. I do a string search of the raw HTML like:
if "<div style" in str(rawhtml):
    want = soup.find("div", {"style": "display-style"}).text
else:
    want = soup.find("div", {"class": "text-one"}).text
Is there an equivalent to the Selenium call I have above in BeautifulSoup? Or is determining the correct selector using an if-statement the only option?
You can use the CSS OR syntax to match either of those patterns. The "," is the OR operator, [] is an attribute selector, and . is a class selector.
data = [i.text for i in soup.select("div.text-one, div[style='display-style']")]
I believe there's no support for XPath in BeautifulSoup, only CSS selectors. If you are heavily invested in XPath, the similar library lxml could be used instead:
from lxml import html

dom = html.fromstring('<html><div class="text-one">test i want1</div>'
                      '<div style="display-style">text i want2</div></html>')
selection = dom.xpath(".//div[contains(@class, 'text-one') or contains(@style, 'display-style')]")
print([n.text for n in selection])
Output: ['test i want1', 'text i want2']
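BeautifulSoup can also express the either/or match directly, without a CSS string, by passing a callable filter to find_all. A sketch over the question's two divs (the third div is added only to show a non-match):

```python
from bs4 import BeautifulSoup

html = ('<div class="text-one">Text I want</div>'
        '<div style="display-style">Text I want</div>'
        '<div class="other">Not this one</div>')
soup = BeautifulSoup(html, "html.parser")

# match a div if EITHER the class contains 'text-one' OR the style
# contains 'display-style'; class is a multi-valued attribute (a list)
wanted = soup.find_all(
    lambda tag: tag.name == "div"
    and ("text-one" in tag.get("class", [])
         or "display-style" in tag.get("style", ""))
)
texts = [d.get_text() for d in wanted]
print(texts)
```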

HTML parsing , nested div issue using BeautifulSoup

I am trying to extract a specific nested div class and the corresponding h3 value (the salary).
So, I have tried the search by class method
soup.find_all('div',{'class':"vac_display_field"}
which returns an empty list.
Snippet code:
<div class="vac_display_field">
<h3>
Salary
</h3>
<div class="vac_display_field_value">
£27,951 - £30,859
</div>
</div>
Example here
First make sure you've instantiated your BeautifulSoup object correctly. Should look something like this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.civilservicejobs.service.gov.uk/csr/index.cgi?SID=cGFnZWNsYXNzPUpvYnMmb3duZXJ0eXBlPWZhaXImY3NvdXJjZT1jc3FzZWFyY2gmcGFnZWFjdGlvbj12aWV3dmFjYnlqb2JsaXN0JnNlYXJjaF9zbGljZV9jdXJyZW50PTEmdXNlcnNlYXJjaGNvbnRleHQ9MjczMzIwMTcmam9ibGlzdF92aWV3X3ZhYz0xNTEyMDAwJm93bmVyPTUwNzAwMDAmcmVxc2lnPTE0NzcxNTIyODItYjAyZmM4ZTgwNzQ2ZTA2NmY5OWM0OTBjMTZhMWNlNjhkZDMwZDU4NA=='
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser') # the 'html.parser' part is optional.
Your code used to scrape the div tags looks correct (it's missing a closing parenthesis, however). If for some reason it still hasn't worked, try calling your find_all() method this way:
soup.find_all('div', class_='vac_display_field')
If you inspect the page's code, you'll find that the div tag you need is the second from the top.
Thus, your code can reflect that, using simple index notation:
Salary_info = soup.find_all(class_='vac_display_field')[1]
Then output the text:
for info in Salary_info:
    print(info.get_text())
HTH.
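Once the find_all call works, pairing each h3 label with its value div is a matter of walking the matched blocks. A sketch over the question's own snippet:

```python
from bs4 import BeautifulSoup

html = """
<div class="vac_display_field">
<h3>Salary</h3>
<div class="vac_display_field_value">£27,951 - £30,859</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# walk each field block and pair the h3 label with its value div
fields = {}
for field in soup.find_all("div", class_="vac_display_field"):
    label = field.h3.get_text(strip=True)
    value = field.find("div", class_="vac_display_field_value").get_text(strip=True)
    fields[label] = value
print(fields)
```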

BeautifulSoup - how to extract text without opening tag and before <br> tag?

I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div> that has no class.
The first text extract I want is within an <a> tag, which is within an <h4> tag. This one I managed to extract.
The second text extract immediately follows the closing h4 tag </h4> and is followed by a <br> tag.
The third text extract immediately follows the <br> tag after the second text extract and is also followed by a <br> tag.
Here the html extract I work with:
<div>
<h4 class="actorboxLink">
<a href="#">Decheterie de Bagnols</a>
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup

data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
    print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings but it doesn't work. Examples:
for a_tag in classActorboxLink.strings
for a_tag in classActorboxLink.stripped_strings
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'
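The AttributeError comes from calling .strings/.stripped_strings/.strip on the ResultSet (the list returned by find_all) instead of on each individual tag or string. A sketch of the per-tag version that also filters out the whitespace-only strings (the href is a placeholder, since the question doesn't show it):

```python
from bs4 import BeautifulSoup

html = """
<div>
<h4 class="actorboxLink">
<a href="#">Decheterie de Bagnols</a>
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# iterate tag by tag; strip() each string and drop the empty ones
for tag in soup.find_all("h4", class_="actorboxLink"):
    texts = [s.strip() for s in tag.find_all_next(string=True) if s.strip()]
    print(texts)
```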
Locate the h4 elements and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
If you don't need each of the 3 elements in a different variable, you could just use the get_text() function on the <div> to get them all in one string. If there are other div tags but they all have classes, you can find all the <div> with class_=False. If you can't isolate the <div> you are interested in, then this solution won't work for you.
import urllib.request
from bs4 import BeautifulSoup

data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
for name in soup.find_all("div", class_=False):
    print(name.get_text().strip())
BTW this is Python 3 & bs4.

Parsing out data using BeautifulSoup in Python

I am attempting to use BeautifulSoup to parse through a DOM tree and extract the names of authors. Below is a snippet of HTML to show the structure of the code I'm going to scrape.
<html>
<body>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">Dacheng Lin</a>,
<a href="#">Ronald A. Remillard</a>,
<a href="#">Jeroen Homan</a>
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">A.G. Kosovichev</a>
</div>
<!--There are many other div tags with this structure-->
</body>
</html>
My point of confusion is that soup.find finds the first occurrence of the div tag I'm searching for. After that, I search for all 'a' link tags. At this stage, how do I extract the author names from each of the link tags and print them out? Is there a way to do it using BeautifulSoup, or do I need to use regex? And how do I continue iterating over every other div tag to extract the author names?
import re
import urllib2, sys
from BeautifulSoup import BeautifulSoup, NavigableString

html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
try:
    authordiv = soup.find('div', attrs={'class': 'list-authors'})
    links = tds.findAll('a')
    for link in links:
        print ''.join(link[0].contents)
    # Iterate through entire page and print authors
except IOError:
    print 'IO error'
Just use findAll for the divs like you do for the links:
for authordiv in soup.findAll('div', attrs={'class': 'list-authors'}):
Since link is already taken from an iterable, you don't need to subindex link -- you can just do link.contents[0].
print link.contents[0] with your new example with two separate <div class="list-authors"> yields:
Dacheng Lin
Ronald A. Remillard
Jeroen Homan
A.G. Kosovichev
So I'm not sure I understand the comment about searching other divs. If they are different classes, you will either need to do a separate soup.find and soup.findAll, or just modify your first soup.find.
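Putting the two fixes together, here's a sketch in the current bs4 package (the old BeautifulSoup import is Python 2 era; the hrefs below are placeholders, since the originals aren't shown in the question):

```python
from bs4 import BeautifulSoup

html = """
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">Dacheng Lin</a>,
<a href="#">Ronald A. Remillard</a>,
<a href="#">Jeroen Homan</a>
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">A.G. Kosovichev</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# loop over every author div, then over the links inside each one
authors = [a.get_text()
           for div in soup.find_all("div", class_="list-authors")
           for a in div.find_all("a")]
print(authors)
```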
