How to get data from nested HTML using BeautifulSoup in Django - python

I am trying to learn web scraping and I'm stuck at a point where the data I want is wrapped in a div tag like so:
<div class="maincounter-number">
<span style="color:#aaa">803 </span>
</div>
There are several values like that and I need all of them (e.g. 803). So I guess I need to use soup.find_all(...), but I don't know what to pass to it. Can anyone help?
I am working in Python (Django).

This should do what you are looking to do:
from bs4 import BeautifulSoup
html_doc = '<div class="maincounter-number"><span style="color:#aaa">803 </span></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('span', {'style': 'color:#aaa'})[0].get_text())
If you just want to query the text in the div and search by class:
print(soup.find_all('div', {'class': 'maincounter-number'})[0].get_text())
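Since the question asks for every counter on the page rather than only the first, a minimal sketch of collecting all matches might look like this. Only the first div comes from the question; the other two values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Only the first block is from the question; the other two are hypothetical.
html_doc = """
<div class="maincounter-number"><span style="color:#aaa">803 </span></div>
<div class="maincounter-number"><span>42 </span></div>
<div class="maincounter-number"><span>7 </span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching div; get_text(strip=True) removes the
# whitespace around each value.
counters = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="maincounter-number")
]
print(counters)
```

Searching by the div's class tends to be less brittle than matching the inline style string, which can change without notice.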

Related

How do I specify which a tag I need when scraping in Python?

I am using BeautifulSoup...
When I run this code:
inside_branding_info = container.div.find("div", "item-branding")
print(inside_branding_info)
It returns:
<div class="item-branding">
<a class="item-rating" href="https://www.newegg.com/gigabyte-geforce-rtx-2060-super-gv-n206swf2oc-8gd/p/N82E16814932174?cm_sp=SearchSuccess-_-INFOCARD-_-graphics+cards-_-14-932-174-_-1&Description=graphics+cards&IsFeedbackTab=true#scrollFullInfo"><i class="rating rating-4"></i><span class="item-rating-num">(12)</span></a>
</div>
However, in the HTML inspection this is what I see:
(screenshot: Raw Site HTML)
Every time I run:
inside_branding_info.a.img["title"]
...Python picks the "a" tag with class "item-rating", not the other "a" tag nested inside the "item-branding" div.
How do I get into that "a" tag, then into the "img", to finally extract the "title" (title = "MSI")? I want the title/brand of each item on the website. I am new to Python; I have only used R and SQL before, so any help would be greatly appreciated.
You need a selector path.
According to the image you provided...
soup = BeautifulSoup(data, 'html.parser')
# select() returns a list, so use select_one() to get a single tag
img = soup.select_one('.item-brand > img')
print(img['title'])
The above should work for you.
Try the following
from bs4 import BeautifulSoup
html = """<div class="item-branding">
<a href="https://www.newegg.com/" class="item-brand">
<img src="https://www.newegg.com/" title="MSI" alt="MSI">
</a></div>"""
soup = BeautifulSoup(html, features="lxml")
element = soup.select('.item-brand > img:nth-of-type(1)')[0]['title']
print(element)
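One possible reason `.a` lands on the rating link is that a given listing has no brand link at all, so the first `a` in the div is the rating. The sketch below targets the brand link by class and returns None when it is missing. The helper name `brand_title` and the combined markup are assumptions for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup


def brand_title(container):
    """Return the brand image's title, or None when no brand link exists."""
    # Target the brand link by its class rather than using ".a",
    # which only returns the first <a> found in the container.
    img = container.select_one("a.item-brand > img")
    return img.get("title") if img is not None else None


# Hypothetical markup combining the two links described in the question.
with_brand = BeautifulSoup("""<div class="item-branding">
<a href="https://www.newegg.com/" class="item-brand">
<img src="https://www.newegg.com/" title="MSI" alt="MSI">
</a>
<a class="item-rating" href="https://www.newegg.com/"><span class="item-rating-num">(12)</span></a>
</div>""", "html.parser")

# A listing that has a rating link but no brand link.
without_brand = BeautifulSoup("""<div class="item-branding">
<a class="item-rating" href="https://www.newegg.com/"><span class="item-rating-num">(12)</span></a>
</div>""", "html.parser")

print(brand_title(with_brand))
print(brand_title(without_brand))
```

Returning None instead of indexing straight into the result avoids a TypeError on listings that lack a brand image.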

Python BeautifulSoup - Text Between Div

I am working on a webscraper project and can't get BeautifulSoup to give me the text between the Div. Below is my code. Any suggestions on how to get python to print just the "5x5" without the "Div to /Div" and without the whitespace?
source = requests.get('https://www.stor-it.com/self-storage/meridian-id-83646').text
soup = BeautifulSoup(source, 'lxml')
unit = soup.find('div', class_="unit-size")
print (unit)
This script returns the following:
<div class="unit-size">
5x5 </div>
You can use .text to retrieve the text, then .strip() to remove the whitespace.
Try unit.text.strip()
Change your print statement from print(unit) to print(unit.text.strip())
Use a CSS class selector:
from bs4 import BeautifulSoup
source= '''
<div class="unit-size">
5x5 </div>
'''
soup = BeautifulSoup(source, 'lxml')
unit = soup.select('.unit-size')
print(unit[0].text.strip())
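To make the difference between the suggestions concrete, here is a small sketch comparing the raw text with the stripped text. It uses the built-in html.parser instead of lxml only to avoid an extra dependency; that choice is an assumption, not part of the original answers.

```python
from bs4 import BeautifulSoup

source = '<div class="unit-size">\n5x5 </div>'
soup = BeautifulSoup(source, "html.parser")
unit = soup.find("div", class_="unit-size")

# .text keeps the newline and trailing space that sit inside the div.
raw = unit.text

# get_text(strip=True) strips each text fragment before joining them,
# which for a single fragment matches unit.text.strip().
clean = unit.get_text(strip=True)

print(repr(raw))
print(repr(clean))
```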

Scrape data from HTML pages with sequenced span IDs using Python

I am working with certain HTML pages from which I need to scrape data. The issue is that span ids are numbered.
For example -
ContentPlaceHolder_0, ContentPlaceHolder_1, ContentPlaceHolder_2 ..... ContentPlaceHolder_n
I need to get data from all of these span tags at each page. What would be the best approach to get this data using Beautiful Soup?
You can try the CSS selectors built into BeautifulSoup. This selects every span whose id begins with ContentPlaceHolder:
soup.select('span[id^=ContentPlaceHolder]')
Example:
from bs4 import BeautifulSoup
html = """<span id='ContentPlaceHolder_0'>0</span>
<span id='ContentPlaceHolder_1'>1</span>
<span id='ContentPlaceHolder_2'>2</span>
<span id='ContentPlaceHolder_3'>3</span>
<span id='xxx'>xxx</span>"""
soup = BeautifulSoup(html, 'lxml')
for s in soup.select('span[id^=ContentPlaceHolder]'):
    print(s.text)
Prints:
0
1
2
3
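If the numeric suffix matters as well, for example to keep the values keyed by position, a regular expression passed to find_all is one alternative. This is a sketch that assumes the ids follow the exact pattern ContentPlaceHolder_<number>; the dictionary name `by_index` is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

html = """<span id='ContentPlaceHolder_0'>0</span>
<span id='ContentPlaceHolder_1'>1</span>
<span id='ContentPlaceHolder_2'>2</span>
<span id='ContentPlaceHolder_3'>3</span>
<span id='xxx'>xxx</span>"""

# Assumed id format: the prefix, an underscore, then digits only.
pattern = re.compile(r"^ContentPlaceHolder_(\d+)$")

soup = BeautifulSoup(html, "html.parser")

by_index = {}
for span in soup.find_all("span", id=pattern):
    # Reuse the same pattern to pull the number out of the id.
    index = int(pattern.match(span["id"]).group(1))
    by_index[index] = span.get_text(strip=True)

print(by_index)
```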

Isolate SRC attribute from soup return in python

I am using Python3 with BeautifulSoup to get a certain div from a webpage. My end goal is to get the img src's url from within this div so I can pass it to pytesseract to get the text off the image.
The img doesn't have any classes or unique identifiers so I am not sure how to use BeautifulSoup to get just this image every time. There are several other images and their order changes from day to day. So instead, I just got the entire div that surrounds the image. The div information doesn't change and is unique, so my code looks like this:
weather_today = soup.find("div", {"id": "weather_today_content"})
My script currently returns the following:
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
Now I just need to figure out how to pull just the src into a string so I can then pass it to pytesseract to download and use ocr to pull further information.
I am unfamiliar with regex but have been told this is the best method. Any assistance would be greatly appreciated. Thank you.
Find the 'img' element, in the 'div' element you found, then read the attribute 'src' from it.
from bs4 import BeautifulSoup
html ="""
<html><body>
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')
weather_today = soup.find("div", {"id": "weather_today_content"})
print (weather_today.find('img')['src'])
Outputs:
/database/img/weather_today.jpg?ver=2018-08-01
You can use the CSS selector support built into BeautifulSoup (the select() and select_one() methods):
data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('div#weather_today_content img')['src'])
Prints:
/database/img/weather_today.jpg?ver=2018-08-01
The selector div#weather_today_content img means: select the <div> with id=weather_today_content, and within this <div> select an <img>.
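The src in this page is a relative path, so it probably needs to be joined with the page's address before the image can be downloaded. A minimal sketch using the standard library's urljoin follows; the BASE_URL value is a placeholder, since the real site address is not given in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address; replace with the URL the HTML was fetched from.
BASE_URL = "https://example.com/weather/today.html"

data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""

soup = BeautifulSoup(data, "html.parser")
img = soup.select_one("div#weather_today_content img")

# urljoin resolves the root-relative src against the page URL.
image_url = urljoin(BASE_URL, img["src"])
print(image_url)
```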

I'm trying to collect the text with BeautifulSoup using python

I want to know how to collect the desired data with Beautiful Soup. Here is the HTML; the text I am trying to collect is "RoSharon1977":
<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>
You have to find the div by its id, then drill down through the ul to the li and read its text. The snippet as posted shows the name directly inside the li; if the real page wraps it in an a element, add one more .find('a') step:
from bs4 import BeautifulSoup
html = '''<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div', attrs={'id': 'twitter-view'}).find('ul').find('li').get_text(strip=True))
Or, depending on how the whole webpage looks, you could simply do:
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('li').get_text(strip=True))
And if there are multiple li elements:
soup = BeautifulSoup(html, 'html.parser')
for li in soup.find_all('li'):
    print(li.get_text(strip=True))
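The same lookup can also be sketched with a CSS selector, which keeps the drill-down in one expression. This assumes the name sits directly inside the li, as in the snippet posted in the question; the None check covers pages where the element is missing.

```python
from bs4 import BeautifulSoup

html = '''<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>'''

soup = BeautifulSoup(html, "html.parser")

# "#twitter-view li" matches any li nested inside the element with that id.
item = soup.select_one("#twitter-view li")
username = item.get_text(strip=True) if item is not None else None
print(username)
```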
