I am attempting to use BeautifulSoup to parse through a DOM tree and extract the names of authors. Below is a snippet of HTML to show the structure of the code I'm going to scrape.
<html>
<body>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">Dacheng Lin</a>,
<a href="#">Ronald A. Remillard</a>,
<a href="#">Jeroen Homan</a>
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">A.G. Kosovichev</a>
</div>
<!--There are many other div tags with this structure-->
</body>
</html>
My point of confusion is that when I do soup.find, it finds the first occurrence of the div tag that I'm searching for. After that, I search for all 'a' link tags. At this stage, how do I extract the author names from each of the link tags and print them out? Is there a way to do it using BeautifulSoup, or do I need to use regex? And how do I continue iterating over every other div tag to extract the author names?
import re
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
try:
    authordiv = soup.find('div', attrs={'class': 'list-authors'})
    links = tds.findAll('a')
    for link in links:
        print ''.join(link[0].contents)
    #Iterate through entire page and print authors
except IOError:
    print 'IO error'
Just use findAll for the divs like you do for the links:
for authordiv in soup.findAll('div', attrs={'class': 'list-authors'}):
Since link is already taken from an iterable, you don't need to subindex it -- you can just do link.contents[0].
Running print link.contents[0] against your new example with two separate <div class="list-authors"> blocks yields:
Dacheng Lin
Ronald A. Remillard
Jeroen Homan
A.G. Kosovichev
So I'm not sure I understand the comment about searching other divs. If they have different classes, you will either need a separate soup.find and soup.findAll, or just modify your first soup.find.
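Putting the answer together on the question's markup, a minimal runnable sketch (this uses the bs4 import path rather than the old BeautifulSoup module, and placeholder hrefs, since the real ones aren't shown):

```python
from bs4 import BeautifulSoup

# Markup mirroring the question's structure; the hrefs are placeholders.
html = """
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">Dacheng Lin</a>,
<a href="#">Ronald A. Remillard</a>,
<a href="#">Jeroen Homan</a>
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">A.G. Kosovichev</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
authors = []
# find_all visits every matching div, not just the first one
for authordiv in soup.find_all("div", attrs={"class": "list-authors"}):
    for link in authordiv.find_all("a"):
        authors.append(link.contents[0])
print(authors)
```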
I am trying to build a function in a python webscraper that moves to the next page in a list of results. I am having trouble locating the element in beautiful soup as the link is found at the end of many other tags, and doesn't have any attributes such as class or ID.
Here is a snippet of the html:
<a href="http://www.url?&=page=2">
Next
</a>
I have been reading the bs4 documentation trying to understand how I can extract the URL, but I am coming up stumped. I am thinking that it could be done by either:
1. finding the last .a['href'] in the parent element, as it is always the last one.
2. finding the href based on the fact that it always has the text 'Next'.
I don't know how to write something that would solve either 1. or 2.
Am I along the right lines? Does anyone have any suggestions to achieve my goal? Thanks
To find the <a> tag that contains the text Next, you can do:
from bs4 import BeautifulSoup
txt = '''
<a href="http://www.url?&=page=2">
Next
</a>'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select_one('a:contains("Next")')['href'])
Prints:
http://www.url?&=page=2
Or:
print(soup.find('a', text=lambda t: t.strip() == 'Next')['href'])
To get the last <a> tag inside some element, you can index the ResultSet with [-1]:
from bs4 import BeautifulSoup
txt = '''
<div id="block">
<a href="#">Some other link</a>
<a href="http://www.url?&=page=2">Next</a>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select('div#block > a')[-1]['href'])
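Both approaches can be checked against the same kind of snippet; a small sketch (the first link's href is a made-up placeholder):

```python
from bs4 import BeautifulSoup

txt = '''
<div id="block">
<a href="#other">Some other link</a>
<a href="http://www.url?&=page=2">Next</a>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')

# approach 2: match on the link text
by_text = soup.find('a', string=lambda t: t and t.strip() == 'Next')['href']
# approach 1: take the last <a> inside the container
by_position = soup.select('div#block > a')[-1]['href']
print(by_text, by_position)
```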
Trying to get all <a> tags in a website, I have this piece of code:
results=[]
all_links = soup.find_all('article')
for link in all_links:
    print link.find('div', class_="cb-category cb-byline-element")
This way I get scraped data displayed in the following manner (with ',', separating <a> tags):
<div class="cb-category cb-byline-element"><i class="fa fa-folder-o"></i> <a href="http://ridethetempo.com/category/canadian/">Canadian</a>, <a href="http://ridethetempo.com/category/music/garage-rock/">Garage</a>, <a href="http://ridethetempo.com/category/listen-2/">Listen</a>, <a href="http://ridethetempo.com/category/music/">Music</a>, <a href="http://ridethetempo.com/category/music/psychedelic/">Psychedelic</a>, <a href="http://ridethetempo.com/category/under-2000/">Under 2000</a></div>
however, if I do the following:
results.append(link.find('div', class_="cb-category cb-byline-element"))
for link in results:
    link.find('a', href=True)['href']
I get only the first <a> for each block of <div>, like so:
http://ridethetempo.com/category/canadian/
How do I recursively retrieve all <a> tags, so I end up with this result?
http://ridethetempo.com/category/canadian/
http://ridethetempo.com/category/music/garage-rock/
http://ridethetempo.com/category/listen-2/
http://ridethetempo.com/category/music/
http://ridethetempo.com/category/music/psychedelic/
http://ridethetempo.com/category/under-2000/
for link in soup.find_all('a'):
    print(link.get('href'))
Will print all the 'a' tagged elements
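To get exactly the output the question asks for, a sketch of the loop over the question's div (the hrefs below are the ones listed in the question):

```python
from bs4 import BeautifulSoup

html = '''
<div class="cb-category cb-byline-element"><i class="fa fa-folder-o"></i>
<a href="http://ridethetempo.com/category/canadian/">Canadian</a>,
<a href="http://ridethetempo.com/category/music/garage-rock/">Garage</a>,
<a href="http://ridethetempo.com/category/listen-2/">Listen</a>,
<a href="http://ridethetempo.com/category/music/">Music</a>,
<a href="http://ridethetempo.com/category/music/psychedelic/">Psychedelic</a>,
<a href="http://ridethetempo.com/category/under-2000/">Under 2000</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_="cb-category cb-byline-element")
# find_all collects every <a> in the div, where find stops at the first
hrefs = [a['href'] for a in div.find_all('a')]
print(hrefs)
```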
I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div> that has no class.
The first text extract I want is within an <a> tag, which is within an <h4> tag. This one I managed to extract.
The second text extract immediately follows the closing h4 tag </h4> and is followed by a <br> tag.
The third text extract immediately follows the <br> tag after the second text extract and is also followed by a <br> tag.
Here the html extract I work with:
<div>
<h4 class="actorboxLink">
<a href="#">Decheterie de Bagnols</a>
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
    print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings but it doesn't work. Examples:
for a_tag in classActorboxLink.strings:
for a_tag in classActorboxLink.stripped_strings:
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'
Locate the h4 element and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
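Run against the question's snippet, this is a minimal sketch (the href is a placeholder, and whitespace-only strings are skipped):

```python
from bs4 import BeautifulSoup

html = '''
<div>
<h4 class="actorboxLink">
<a href="#">Decheterie de Bagnols</a>
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
lines = []
for h4 in soup.find_all('h4', class_='actorboxLink'):
    lines.append(h4.get_text(strip=True))
    # the address lines are bare text siblings of the <h4>
    for text in h4.find_next_siblings(string=True):
        if text.strip():
            lines.append(text.strip())
print(lines)
```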
If you don't need each of the 3 text extracts in separate variables, you could just call get_text() on the <div> to get them all in one string. If there are other div tags but they all have classes, you can find all the <div> elements with class_=False. If you can't isolate the <div> you are interested in, then this solution won't work for you.
import urllib.request
from bs4 import BeautifulSoup
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
for name in soup.find_all("div", class_=False):
    print(name.get_text().strip())
BTW this is python 3 & bs4
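For illustration, the get_text() approach on the earlier snippet (anchored with a placeholder href); get_text(" ", strip=True) joins the stripped pieces with spaces:

```python
from bs4 import BeautifulSoup

html = '''
<div>
<h4 class="actorboxLink"><a href="#">Decheterie de Bagnols</a></h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
<div class="classy">not wanted</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# class_=False keeps only divs that have no class attribute
texts = [div.get_text(" ", strip=True)
         for div in soup.find_all('div', class_=False)]
print(texts)
```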
Can anyone help me traverse an HTML tree with Beautiful Soup?
I'm trying to parse through HTML output and, after gathering each value, insert it into a table named Tld with Python/Django.
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
</div>
And only parse the value of href attribute of <a>, so only this part:
https://billing.anapp.com/
of:
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
I currently have:
for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3', attrs={'class': 'r'})
The problem is that find_all above doesn't make it far enough down to the <a> element.
Any help is much appreciated.
Thank you.
from bs4 import BeautifulSoup
html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])
prints:
https://billing.anapp.com/
h3.r a is a CSS selector.
You can use CSS selectors (I prefer them), XPath, or find on elements. The selector h3.r a will look for all h3 tags with class r and get the a elements inside them. A more complicated example like #an_id table tr.the_tr_class td.the_td_class will find, under the element with the given id, the td's with the given class that sit inside tr's with the given class, inside a table of course.
This will also give you the same result. find_all returns a list of bs4.element.Tag and takes a recursive argument; I'm not sure if you can do it in one line. I personally prefer CSS selectors because they're easy and clean.
for elm in bs.find_all('h3', attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
I think it's worth mentioning what happens when there are similarly named classes that contain spaces.
Taking the code that #Foo Bar User provided and changing it a little:
from bs4 import BeautifulSoup
html = """
<div class="rc" data-hveid="53">
<h3 class="r s">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
<h3 class='r s sth s'>
<a href="https://link_you_dont_want.com/">Don't grab this</a>
</h3>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
when we try to get just the link where class equals 'r s' by css selectors:
elms = bs.select("h3.r.s a")
for i in elms:
    print(i.attrs["href"])
it prints
https://billing.anapp.com/
https://link_you_dont_want.com/
however using
for elm in bs.find_all('h3', attrs={'class': 'r s'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
gives the desired result
https://billing.anapp.com/
That's just something I've encountered during my own work. If there is a way to overcome this using css selectors, please let me know!
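One CSS-side option, assuming a soupsieve-backed bs4 (4.7+): an exact attribute match like h3[class="r s"] compares the whole serialized class string rather than individual class names, so it only matches the first h3. A sketch (the markup and hrefs mirror the example above):

```python
from bs4 import BeautifulSoup

html = """
<h3 class="r s"><a href="https://billing.anapp.com/">Billing: Portal Home</a></h3>
<h3 class="r s sth s"><a href="https://link_you_dont_want.com/">Don't grab this</a></h3>
"""
bs = BeautifulSoup(html, "html.parser")
# [class="r s"] requires the class attribute to equal "r s" exactly
hrefs = [a["href"] for a in bs.select('h3[class="r s"] a')]
print(hrefs)
```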
Hi all. I have a quick question about BeautifulSoup with Python. I have several bits of HTML that look like this (the only differences are the links and product names), and I'm trying to get the link from the "href" attribute.
<div id="productListing1" xmlns:dew="urn:Microsoft.Search.Response.Document">
<span id="rank" style="display:none;">94.36</span>
<div class="productPhoto">
<img src="/assets/images/ocpimages/87684/00131cl.gif" height="82" width="82" />
</div>
<div class="productName">
<a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&Product=8768400131">CAPRI SUN - JUICE DRINK - COOLERS VARIETY PACK 6 OZ</a>
</div>
<div class="size">40 CT</div>
</div>
I currently have this Python code:
productLinks = soup.findAll('a', attrs={'class' : 'on'})
for link in productLinks:
    print link['href']
This works (for every link on the page I get something like /Products/ProductInfoDisplay.aspx?SiteId=1&Product=8768400131); however, I've been trying to figure out if there's a way to get the link in the "href" attribute without searching explicitly for 'class="on"'. I guess my first question should be whether or not this is the best way to find this information (class="on" seems too generic and likely to break in the future although my CSS and HTML skills aren't that good). I've tried numerous combinations of find, findAll, findAllnext, etc. methods but I can't quite make it work. This is mostly what I had (I rearranged and changed it numerous times):
productLinks = soup.find('div', attrs={'class' : 'productName'}).find('a', href=True)
If this isn't a good way to do this, how can I get to the <a> tag from the <div class="productName"> tag? Let me know if you need more information.
Thank you.
Well, once you have the <div> element, you can get the <a> subelement by calling find():
productDivs = soup.findAll('div', attrs={'class' : 'productName'})
for div in productDivs:
    print div.find('a')['href']
However, since the <a> is directly inside the <div>, you can get it through the div's a attribute:
productDivs = soup.findAll('div', attrs={'class' : 'productName'})
for div in productDivs:
    print div.a['href']
Now, if you want to put all the <a> elements in a list, your code above will not work because find() just returns one element matched by its criteria. You would get the list of divs and get the subelements from them, for example, using list comprehensions:
productLinks = [div.a for div in
                soup.findAll('div', attrs={'class' : 'productName'})]
for link in productLinks:
    print link['href']
I am giving this solution using BeautifulSoup 4:
for data in soup.find_all('div', class_='productName'):
    for a in data.find_all('a'):
        print(a.get('href'))  # for getting the link
        print(a.text)         # for getting the text inside the link
You can avoid those for loops by specifying the index.
data = soup.find_all('div', class_='productName')
a_class = data[0].find_all('a')
url_ = a_class[0].get('href')
print(url_)
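For completeness, the same lookup written as a CSS selector, a sketch using the snippet from the question:

```python
from bs4 import BeautifulSoup

html = '''
<div class="productName">
<a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&Product=8768400131">CAPRI SUN - JUICE DRINK - COOLERS VARIETY PACK 6 OZ</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# select any <a> nested under a div with class productName
links = [a['href'] for a in soup.select('div.productName a')]
print(links)
```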