I'm scraping an HTML document that contains two 'hooks' of the same class, like below:
<div class="multiRow">
<!--ModuleId 372329FileName #swMultiRowsContainer-->
<some more content>
</div>
<div class="multiRow">
<!--ModuleId 372330FileName #multiRowsContainer-->
<some more content>
</div>
When I do:
mr = ct[0].find_all('div', {'class': 'multiRow'})
I only get the contents of the first one.
Is there a way to access the contents of the second?
Thanks!
Edit, following Adam Smith's comment.
Refer to my comment above, code below:
from bs4 import BeautifulSoup as soup
a = "<div class=\"multiRow\"><!--ModuleId 372329FileName #swMultiRowsContainer-->Bye</div> <div class=\"multiRow\"><!--ModuleId 372330FileName #multiRowsContainer-->Hi</div>"
print(soup(a, "html.parser").find_all("div", {"class": "multiRow"})[1])
returns:
<div class="multiRow"><!--ModuleId 372330FileName #multiRowsContainer-->Hi</div>
A coding example for Adam Smith's comment, which I think makes it clear:
ct = soup.findAll("div", {"class": "multiRow"})
ct = ct[1]
print(ct)
Because you are asking for the first element only. Check your code:
ct[0].find_all
The ct[0] restricts the search to the first element, not the whole document. Fix that.
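To tie the answers together: `find_all` already returns every matching element, so the second div is simply index 1. A minimal sketch, assuming bs4 and the snippet from the question:

```python
from bs4 import BeautifulSoup

html = """
<div class="multiRow"><!--ModuleId 372329FileName #swMultiRowsContainer-->Bye</div>
<div class="multiRow"><!--ModuleId 372330FileName #multiRowsContainer-->Hi</div>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('div', {'class': 'multiRow'})

print(len(rows))  # 2 -- find_all returns every match
print(rows[1])    # the second div, comment and all
```

If you only ever see one match on the real page, check that `ct[0]` is actually an ancestor of both divs.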
Related
I have some HTML I am parsing in Python using the BeautifulSoup package. Here's the HTML:
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
I am capturing the results using this code chunk:
names = soup3.find_all('div', {'class': "n"})
contact = soup3.find_all('div', {'class': "x"})
other = soup3.find_all('div', {'class': "x c"})
Right now, both classes 'x' and 'x c' are being captured in the 'contact' variable. How can I prevent this from happening?
Try:
soup.select('div[class="x"]')
Output:
[<div class="x">Address</div>, <div class="x">Phone</div>]
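The reason this works is that `[class="x"]` compares the whole class attribute string, while `div.x` matches any element whose class list contains `x`. A sketch contrasting the two, assuming the HTML from the question:

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""

soup = BeautifulSoup(html, 'html.parser')

exact = soup.select('div[class="x"]')  # attribute string must equal "x"
loose = soup.select('div.x')           # any div whose classes include "x"

print(len(exact))  # 2 (Address, Phone)
print(len(loose))  # 3 (Address, Phone, Other)
```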
from bs4 import BeautifulSoup
html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, 'html.parser')
contact = soup.find_all("div", class_="x")[1]
print(contact)
Output:
<div class="x">Phone</div>
What about using sets?
others = set(soup.find_all('div', {'class': "x c"}))
contacts = set(soup.find_all('div', {'class': "x"})) - others
others will be {<div class="x c">Other</div>}
and
contacts will be {<div class="x">Phone</div>, <div class="x">Address</div>}
Note that this only works for this specific combination of classes. It may not generalize, depending on the combinations of classes in your HTML.
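An alternative that avoids the set arithmetic: bs4 parses a multi-valued class attribute into a list, so you can compare that list directly. A sketch under the same HTML:

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# keep only divs whose parsed class list is exactly ['x']
contacts = [d for d in soup.find_all('div') if d.get('class') == ['x']]

print(contacts)
```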
See BeautifulSoup webscraping find_all( ): finding exact match for more details on how .find_all() works.
<div class="stuff">
<div class="this">K/D</div>
<div class="that">8.66</div>
</div>
(If it's not clear: the two divs inside the top div are its children.)
I'm currently trying to parse out 8.66, and I have made many attempts using lxml and BeautifulSoup. I tried running a loop to search for that value, but nothing seems to work!
If you can help please do I am absolutely lost on how to do this. Thank you in advance!!
You can specify the class value:
from bs4 import BeautifulSoup as soup
d = """
<div class="stuff">
<div class="this">K/D</div>
<div class="that">8.66</div>
</div>
"""
s = soup(d, 'html.parser')
print(s.find('div', {'class':'that'}).text)
Output:
8.66
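If the class names turn out to be unstable, another sketch is to anchor on the visible K/D label and take its following sibling div:

```python
from bs4 import BeautifulSoup

html = """
<div class="stuff">
<div class="this">K/D</div>
<div class="that">8.66</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find the label, then the next <div> at the same level
label = soup.find('div', string='K/D')
value = label.find_next_sibling('div').text
print(value)  # 8.66
```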
Can anyone help me traverse an HTML tree with BeautifulSoup?
I'm trying to parse through HTML output, gather each value, and then insert it into a table named Tld with Python/Django.
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
</div>
And parse only the value of the href attribute of the <a>, so only this part:
https://billing.anapp.com/
of:
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
I currently have:
for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3', attrs={'class': 'r'})
The problem is that find_all above doesn't make it far enough down to the <a> element.
Any help is much appreciated.
Thank you.
from bs4 import BeautifulSoup
html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])
prints:
https://billing.anapp.com/
h3.r a is a CSS selector.
You can use CSS selectors (I prefer them), XPath, or find on elements. The selector h3.r a looks for all h3 elements with class r and gets the a elements inside them. A more complicated example like #an_id table tr.the_tr_class td.the_td_class finds the element with the given id, then the td's with the given class that belong to a tr with the given class, inside a table of course.
The code below gives the same result. find_all returns a list of bs4.element.Tag; find_all also has a recursive parameter. I'm not sure you can do it in one line; I personally prefer CSS selectors because they are easy and clean.
for elm in bs.find_all('h3', attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
I think it's worth mentioning what happens when there are similarly named classes that contain spaces.
Taking a piece of code that @Foo Bar User provided and changing it a little:
from bs4 import BeautifulSoup
html = """
<div class="rc" data-hveid="53">
<h3 class="r s">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
<h3 class='r s sth s'>
<a href="https://link_you_dont_want.com/">Don't grab this</a>
</h3>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
When we try to get just the link whose class equals 'r s' by CSS selectors:
elms = bs.select("h3.r.s a")
for i in elms:
    print(i.attrs["href"])
it prints
https://billing.anapp.com/
https://link_you_dont_want.com/
however using
for elm in bs.find_all('h3', attrs={'class': 'r s'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
gives the desired result
https://billing.anapp.com/
That's just something I've encountered during my own work. If there is a way to overcome this using css selectors, please let me know!
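One possible way with CSS selectors, hedged as a sketch: `select` also understands exact attribute matching (`[class="r s"]`), which compares the raw attribute string, similar to what `find_all` with a space-containing class string does:

```python
from bs4 import BeautifulSoup

html = """
<h3 class="r s"><a href="https://billing.anapp.com/">Billing: Portal Home</a></h3>
<h3 class="r s sth s"><a href="https://link_you_dont_want.com/">Don't grab this</a></h3>
"""

soup = BeautifulSoup(html, 'html.parser')

# [class="r s"] requires the attribute string to be exactly "r s",
# so the second h3 is skipped
links = [a['href'] for a in soup.select('h3[class="r s"] a')]
print(links)
```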
Hi, I cannot for the life of me figure out how to find links that begin with certain text.
findAll('a') works fine, but it returns way too much. I just want to make a list of all links that begin with
http://www.nhl.com/ice/boxscore.htm?id=
Can anyone help me?
Thank you very much
First set up a test document and open up the parser with BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> doc = '<html><body><div><a href="something">yep</a></div><div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div><a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'
>>> soup = BeautifulSoup(doc, 'html.parser')
>>> print(soup.prettify())
<html>
 <body>
  <div>
   <a href="something">
    yep
   </a>
  </div>
  <div>
   <a href="http://www.nhl.com/ice/boxscore.htm?id=3">
    somelink
   </a>
  </div>
  <a href="http://www.nhl.com/ice/boxscore.htm?id=7">
   another
  </a>
 </body>
</html>
Next, we can search for all <a> tags with an href attribute starting with http://www.nhl.com/ice/boxscore.htm?id=. You can use a regular expression for it:
>>> import re
>>> soup.findAll('a', href=re.compile(r'^http://www\.nhl\.com/ice/boxscore\.htm\?id='))
[<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>, <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>]
You might not need BeautifulSoup at all, since your search is so specific:
>>> import re
>>> links = re.findall(r'http://www\.nhl\.com/ice/boxscore\.htm\?id=[^"]+', doc)
You can find all links and then filter that list to get only the links you need. This is still a very fast solution, even though you filter afterwards.
listOfAllLinks = soup.findAll('a')
listOfLinksINeed = []
for link in listOfAllLinks:
    if "www.nhl.com" in link.get('href', ''):
        listOfLinksINeed.append(link['href'])
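A non-regex variant, as a sketch: `find_all` also accepts a function as the value of an attribute filter, so the prefix test can be a plain `str.startswith`:

```python
from bs4 import BeautifulSoup

html = """
<div><a href="something">yep</a></div>
<div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div>
<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>
"""

soup = BeautifulSoup(html, 'html.parser')

prefix = 'http://www.nhl.com/ice/boxscore.htm?id='
# h is None for <a> tags without an href, hence the `h and` guard
links = soup.find_all('a', href=lambda h: h and h.startswith(prefix))

for link in links:
    print(link['href'])
```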
I am attempting to use BeautifulSoup to parse through a DOM tree and extract the names of authors. Below is a snippet of HTML to show the structure of the code I'm going to scrape.
<html>
<body>
<div class="list-authors">
<span class="descriptor">Authors:</span>
Dacheng Lin,
Ronald A. Remillard,
Jeroen Homan
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
A.G. Kosovichev
</div>
<!--There are many other div tags with this structure-->
</body>
</html>
My point of confusion is that when I do soup.find, it finds the first occurrence of the div tag that I'm searching for. After that, I search for all 'a' link tags. At this stage, how do I extract the authors' names from each of the link tags and print them out? Is there a way to do it using BeautifulSoup, or do I need to use regex? How do I continue iterating over every other div tag and extract the authors' names?
import re
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
try:
    authordiv = soup.find('div', attrs={'class': 'list-authors'})
    links = tds.findAll('a')
    for link in links:
        print ''.join(link[0].contents)
    #Iterate through entire page and print authors
except IOError:
    print 'IO error'
Just use findAll for the divs, like you do for the links:
for authordiv in soup.findAll('div', attrs={'class': 'list-authors'}):
Since link is already taken from an iterable, you don't need to subindex it; you can just do link.contents[0].
Running print link.contents[0] on your new example, with two separate <div class="list-authors"> blocks, yields:
Dacheng Lin
Ronald A. Remillard
Jeroen Homan
A.G. Kosovichev
So I'm not sure I understand the comment about searching other divs. If they are different classes, you will either need to do a separate soup.find and soup.findAll, or just modify your first soup.find.
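Putting the corrections from both answers together, as a sketch with bs4. Note the href values here are made up for illustration; the question's plain-text snippet elides the <a> tags the asker mentions:

```python
from bs4 import BeautifulSoup

# assumed structure: each author name is wrapped in an <a>, as the answers imply;
# the hrefs are hypothetical
html = """
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="/a/lin_d_1">Dacheng Lin</a>,
<a href="/a/remillard_r_1">Ronald A. Remillard</a>,
<a href="/a/homan_j_1">Jeroen Homan</a>
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="/a/kosovichev_a_1">A.G. Kosovichev</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

authors = []
# iterate every author block, then every link inside it
for div in soup.find_all('div', attrs={'class': 'list-authors'}):
    for link in div.find_all('a'):
        authors.append(link.contents[0])

print(authors)
```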