Python: how to parse HTML with BS4 - python

<div class="stuff">
<div class="this">K/D</div>
<div class="that">8.66</div>
(If it's not clear, the two divs below the top div are its children.)
I'm currently trying to parse out the 8.66, and I have made many attempts using lxml and BeautifulSoup. I tried running a loop to search for that value, but nothing seems to work!
If you can help, please do; I am absolutely lost on how to do this. Thank you in advance!

You can specify the class value:
from bs4 import BeautifulSoup as soup
d = """
<div class="stuff">
<div class="this">K/D</div>
<div class="that">8.66</div>
"""
s = soup(d, 'html.parser')
print(s.find('div', {'class':'that'}).text)
Output:
8.66
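If the class name isn't reliable, another option (a small sketch, assuming the structure shown in the question) is to locate the "K/D" label and read its sibling div:
from bs4 import BeautifulSoup

html = """
<div class="stuff">
<div class="this">K/D</div>
<div class="that">8.66</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Find the div holding the label text, then step to the next sibling div that holds the value.
label = soup.find('div', string='K/D')
print(label.find_next_sibling('div').text)
Output:
8.66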

Related

How to get data from nested HTML using BeautifulSoup in Django

I am trying to learn web scraping and I'm stuck at a point where the data I want is wrapped in a div tag like so:
<div class="maincounter-number">
<span style="color:#aaa">803 </span>
</div>
There are several values like that and I need all of them (e.g. 803). So I guess I need to do soup.find_all(...), but I don't know what to put inside. Can anyone help?
I am working in Python (Django).
This should do what you are looking to do:
from bs4 import BeautifulSoup
html_doc = '<div class="maincounter-number"><span style="color:#aaa">803 </span></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('span', {'style': 'color:#aaa'})[0].get_text())
If you just want to query the text in the div and search by class:
print(soup.find_all('div', {'class': 'maincounter-number'})[0].get_text())
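Since there are several such blocks on the page, a short sketch (with made-up example values) that collects all of them into a list:
from bs4 import BeautifulSoup

html_doc = """
<div class="maincounter-number"><span style="color:#aaa">803 </span></div>
<div class="maincounter-number"><span style="color:#aaa">42 </span></div>
<div class="maincounter-number"><span style="color:#aaa">17 </span></div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
# get_text(strip=True) drops the trailing whitespace inside each span.
counters = [div.get_text(strip=True) for div in soup.find_all('div', class_='maincounter-number')]
print(counters)
Output:
['803', '42', '17']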

Beautifulsoup find_all() captures too much text

I have some HTML I am parsing in Python using the BeautifulSoup package. Here's the HTML:
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
I am capturing the results using this code chunk:
names = soup3.find_all('div', {'class': "n"})
contact = soup3.find_all('div', {'class': "x"})
other = soup3.find_all('div', {'class': "x c"})
Right now, both classes 'x' and 'x c' are being captured in the 'contact' variable. How can I prevent this from happening?
Try:
soup.select('div[class="x"]')
Output:
[<div class="x">Address</div>, <div class="x">Phone</div>]
from bs4 import BeautifulSoup
html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, 'html.parser')
contact = soup.find_all("div", class_="x")[1]
print(contact)
Output:
<div class="x">Phone</div>
What about using sets?
others = set(soup.find_all('div', {'class': "x c"}))
contacts = set(soup.find_all('div', {'class': "x"})) - others
others will be {<div class="x c">Other</div>}
and
contacts will be {<div class="x">Phone</div>, <div class="x">Address</div>}
Note that this only works in this specific case of classes; it may not work in general, depending on the combinations of classes you have in the HTML.
See BeautifulSoup webscraping find_all( ): finding exact match for more details on how .find_all() works.
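Yet another way to keep only exact class matches, sketched here for the same HTML, is to filter on the element's full class list:
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# find_all(class_='x') matches any div whose class list *contains* 'x',
# so keep only the divs whose class list is exactly ['x'].
contact = [div for div in soup.find_all('div', class_='x') if div.get('class') == ['x']]
print(contact)
Output:
[<div class="x">Address</div>, <div class="x">Phone</div>]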

Isolate SRC attribute from soup return in python

I am using Python 3 with BeautifulSoup to get a certain div from a webpage. My end goal is to get the img's src URL from within this div so I can pass it to pytesseract to get the text off the image.
The img doesn't have any classes or unique identifiers, so I am not sure how to use BeautifulSoup to get just this image every time. There are several other images, and their order changes from day to day. So instead, I just grabbed the entire div that surrounds the image. The div information doesn't change and is unique, so my code looks like this:
weather_today = soup.find("div", {"id": "weather_today_content"})
Thus my script currently returns the following:
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
Now I just need to figure out how to pull just the src into a string so I can then pass it to pytesseract to download the image and use OCR to pull further information.
I am unfamiliar with regex, but I have been told it is the best method. Any assistance would be greatly appreciated. Thank you.
Find the img element inside the div you found, then read its src attribute:
from bs4 import BeautifulSoup
html ="""
<html><body>
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
weather_today = soup.find("div", {"id": "weather_today_content"})
print(weather_today.find('img')['src'])
Outputs:
/database/img/weather_today.jpg?ver=2018-08-01
You can use a CSS selector, which is built into BeautifulSoup (methods select() and select_one()):
data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('div#weather_today_content img')['src'])
Prints:
/database/img/weather_today.jpg?ver=2018-08-01
The selector div#weather_today_content img means: select the <div> with id=weather_today_content, and within this <div> select an <img>.
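Since the src in the question is relative, here is a small sketch of turning it into an absolute URL before handing it to pytesseract; the base URL below is a made-up placeholder for whatever page was scraped:
from urllib.parse import urljoin

# Hypothetical page URL -- substitute the URL you actually requested.
base_url = 'http://example.com/weather/today.html'
src = '/database/img/weather_today.jpg?ver=2018-08-01'

# urljoin resolves the root-relative src against the page's URL.
print(urljoin(base_url, src))
Output:
http://example.com/database/img/weather_today.jpg?ver=2018-08-01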

Beautiful Soup: Access the second <div> with the same class

I'm scraping an HTML document that contains two 'hooks' of the same class, like below:
<div class="multiRow">
<!--ModuleId 372329FileName #swMultiRowsContainer-->
<some more content>
</div>
<div class="multiRow">
<!--ModuleId 372330FileName #multiRowsContainer-->
<some more content>
</div>
When I do:
mr = ct[0].find_all('div', {'class': 'multiRow'})
I only get the contents from the first one.
Is there a way to get access to the contents within the second one?
Thanks!
Edited to incorporate Adam Smith's comment.
Referring to my comment above, code below:
from bs4 import BeautifulSoup as soup
a = "<div class=\"multiRow\"><!--ModuleId 372329FileName #swMultiRowsContainer-->Bye</div> <div class=\"multiRow\"><!--ModuleId 372330FileName #multiRowsContainer-->Hi</div>"
print(soup(a, "html.parser").find_all("div", {"class": "multiRow"})[1])
returns:
<div class="multiRow"><!--ModuleId 372330FileName #multiRowsContainer-->Hi</div>
A coding example for Adam Smith's comment; I think it is very clear:
ct = soup.find_all("div", {"class": "multiRow"})
ct = ct[1]
print(ct)
Because you are asking for the first element only. Check your code:
ct[0].find_all
The ct[0] will grab only the first matching element, not all of them. Fix that.
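If the goal is every multiRow block rather than a specific index, a brief sketch that loops over all matches:
from bs4 import BeautifulSoup

html = """
<div class="multiRow"><!--ModuleId 372329FileName #swMultiRowsContainer-->Bye</div>
<div class="multiRow"><!--ModuleId 372330FileName #multiRowsContainer-->Hi</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# find_all returns every matching div; enumerate to see each one with its index.
for i, row in enumerate(soup.find_all('div', class_='multiRow')):
    print(i, row.get_text(strip=True))
Output:
0 Bye
1 Hi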

Find specific link w/ beautifulsoup

Hi, I cannot figure out for the life of me how to find links which begin with certain text.
find_all('a') works fine, but it returns way too much. I just want to make a list of all the links that begin with
http://www.nhl.com/ice/boxscore.htm?id=
Can anyone help me?
Thank you very much
First set up a test document and open up the parser with BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> doc = '<html><body><div><a href="something">yep</a></div><div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div><a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'
>>> soup = BeautifulSoup(doc, 'html.parser')
>>> print(soup.prettify())
<html>
 <body>
  <div>
   <a href="something">
    yep
   </a>
  </div>
  <div>
   <a href="http://www.nhl.com/ice/boxscore.htm?id=3">
    somelink
   </a>
  </div>
  <a href="http://www.nhl.com/ice/boxscore.htm?id=7">
   another
  </a>
 </body>
</html>
Next, we can search for all <a> tags with an href attribute starting with http://www.nhl.com/ice/boxscore.htm?id=. You can use a regular expression for it:
>>> import re
>>> soup.find_all('a', href=re.compile(r'^http://www\.nhl\.com/ice/boxscore\.htm\?id='))
[<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>, <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>]
You might not need BeautifulSoup, since your search is so specific:
>>> import re
>>> links = re.findall(r'http://www\.nhl\.com/ice/boxscore\.htm\?id=[^"]+', doc)
You can find all the links and then filter that list to get only the ones you need. This is still a fast solution, even though the filtering happens afterwards.
listOfAllLinks = soup.find_all('a')
listOfLinksINeed = []
for link in listOfAllLinks:
    # Check the href attribute; testing membership on the tag itself won't match the URL text.
    if "www.nhl.com" in link.get('href', ''):
        listOfLinksINeed.append(link['href'])
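With bs4 you can also express the "starts with" condition directly as a CSS attribute selector; a sketch using the same test document as above:
from bs4 import BeautifulSoup

doc = """
<html><body>
<div><a href="something">yep</a></div>
<div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div>
<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>
</body></html>
"""

soup = BeautifulSoup(doc, 'html.parser')
# a[href^="..."] matches <a> tags whose href starts with the given prefix.
links = [a['href'] for a in soup.select('a[href^="http://www.nhl.com/ice/boxscore.htm?id="]')]
print(links)
Output:
['http://www.nhl.com/ice/boxscore.htm?id=3', 'http://www.nhl.com/ice/boxscore.htm?id=7']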
