BeautifulSoup couldn't find everything - python

I'm trying to scrape some data from a web page. The data I want is laid out like this:
<div id="pagetitle">
some_text
"some_text2"
some_text3
</div>
I'm trying to get some_text3 with this code:
soup = soup(page, "html5lib")
author = soup.find('div', {'id' : 'pagetitle'}).a.string
print(author)
When I do this I only get some_text. I also tried:
author = soup.find_all('a', {'id' : 'pagetitle'})
but I get an empty list. I also tried:
author = soup.find(id='pagetitle').prettify()
and I get the whole markup, but I don't know how to get only some_text3.
I also tried different parsers, but none of them worked.
Sorry if this is hard to understand; this is only my second question here. I'd welcome any recommendations.

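You can use a CSS selector with :nth-last-child(). For example:
from bs4 import BeautifulSoup
# The <a> tags are assumed here: the selector below targets links, but the
# markup as posted does not show them.
html_doc = """
<div id="pagetitle">
<a>some_text</a>
"some_text2"
<a>some_text3</a>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
# Select the <a> that is the last element child of #pagetitle
txt = soup.select_one("#pagetitle > a:nth-last-child(1)").text
print(txt)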
Prints:
some_text3
Or use [-1] to get the last element:
txt = soup.select("#pagetitle a")[-1].text
print(txt)
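The same result can be reached with find() and find_all() instead of CSS selectors, which also shows why the attempts in the question behaved as they did. This is a minimal sketch, assuming the first and last items are wrapped in <a> tags (the markup in the question does not show them, so the tags are an assumption and the names no_match, first_link, and last_link are illustrative).

```python
from bs4 import BeautifulSoup

# Assumption: the first and last items are <a> elements. The question's
# markup does not show the tags, so this HTML is illustrative only.
html_doc = """
<div id="pagetitle">
<a>some_text</a>
"some_text2"
<a>some_text3</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
container = soup.find("div", id="pagetitle")

# The id belongs to the <div>, not to the links, so this matches nothing.
no_match = soup.find_all("a", {"id": "pagetitle"})

# .a is shorthand for find("a"), so it only ever returns the first link.
first_link = container.a.get_text(strip=True)

# find_all() returns every link in document order; [-1] picks the last one.
last_link = container.find_all("a")[-1].get_text(strip=True)

print(len(no_match), first_link, last_link)
```

Searching inside the container first keeps the lookup from picking up links elsewhere on the page.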

Related

How do I specify which a tag I need when scraping in Python?

I am using BeautifulSoup...
When I run this code:
inside_branding_info = container.div.find("div", "item-branding")
print(inside_branding_info)
It returns:
<div class="item-branding">
<a class="item-rating" href="https://www.newegg.com/gigabyte-geforce-rtx-2060-super-gv-n206swf2oc-8gd/p/N82E16814932174?cm_sp=SearchSuccess-_-INFOCARD-_-graphics+cards-_-14-932-174-_-1&Description=graphics+cards&IsFeedbackTab=true#scrollFullInfo"><i class="rating rating-4"></i><span class="item-rating-num">(12)</span></a>
</div>
However, in the HTML inspection this is what I see:
[Screenshot: Raw Site HTML]
Every time I run:
inside_branding_info.a.img["title"]
...Python thinks I want the "a" tag with class "item-rating", not the "a" tag with the href that is nested inside the "item-branding" div.
How do I get inside the "a href" tag, then into the "img", to finally extract the title (title="MSI")? I want the title/brand of the item on the website. I am new to Python; I have only used R and SQL before, so any help would be greatly appreciated.
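You need a selector path.
According to the image you provided:
soup = BeautifulSoup(data, "html.parser")
# select_one() returns a single element; select() would return a list,
# and indexing a list with ['title'] raises a TypeError
img = soup.select_one('.item-brand > img')
print(img['title'])
The above should work for you.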
Try the following:
from bs4 import BeautifulSoup
html = """<div class="item-branding">
<a href="https://www.newegg.com/" class="item-brand">
<img src="https://www.newegg.com/" title="MSI" alt="MSI">
</a></div>"""
soup = BeautifulSoup(html, features="lxml")
element = soup.select('.item-brand > img:nth-of-type(1)')[0]['title']
print(element)
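When the element might be absent from the downloaded HTML, it helps to guard the lookup: select_one() returns None when nothing matches, and subscripting None raises a TypeError. This is a minimal sketch, assuming markup shaped like the snippets above (the rating link is trimmed from the question's output, and the names brand and missing are illustrative).

```python
from bs4 import BeautifulSoup

# Markup modelled on the snippets in this thread; the rating link is a
# trimmed version of the one in the question.
html = """<div class="item-branding">
<a href="https://www.newegg.com/" class="item-brand">
<img src="https://www.newegg.com/" title="MSI" alt="MSI">
</a>
<a class="item-rating" href="https://www.newegg.com/"><span class="item-rating-num">(12)</span></a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Target the brand link by its class instead of relying on .a, which
# returns whichever <a> comes first in the div.
img = soup.select_one("a.item-brand > img")
brand = img["title"] if img is not None else None

# The rating link has no <img>, so select_one() returns None here.
missing = soup.select_one("a.item-rating > img")

print(brand, missing)
```

If brand comes back as None on the real page, the brand link is probably not present in the HTML that was fetched, even though it is visible in the browser inspector.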

How to get data from nested HTML using BeautifulSoup in Django

I am trying to learn web scraping and I'm stuck at a point where the data I want is wrapped by a div tag as so:
<div class="maincounter-number">
<span style="color:#aaa">803 </span>
</div>
There are several values like that and I need all of them (e.g. 803). So I guess I need to use soup.find_all(...), but I don't know what to put inside. Can anyone help?
I am working in Python (Django).
This should do what you are looking to do:
from bs4 import BeautifulSoup
html_doc = '<div class="maincounter-number"><span style="color:#aaa">803 </span></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('span', {'style': 'color:#aaa'})[0].get_text())
If you just want to query the text in the div and search by class:
print(soup.find_all('div', {'class': 'maincounter-number'})[0].get_text())
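Since the question asks for all of the values rather than the first one, the [0] index can be replaced with a loop over everything find_all() returns. This is a minimal sketch, assuming the page has several divs with the same class; only 803 comes from the question, and the other two values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Only the first value (803) comes from the question; the other two
# divs are hypothetical and stand in for the rest of the page.
html_doc = """
<div class="maincounter-number"><span style="color:#aaa">803 </span></div>
<div class="maincounter-number"><span>1,024 </span></div>
<div class="maincounter-number"><span>57 </span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns every matching div; get_text(strip=True) drops the
# surrounding whitespace from each value.
counters = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="maincounter-number")
]
print(counters)
```

Matching on the div's class rather than the span's inline style keeps the lookup working when the colour differs between counters.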

Python BeautifulSoup - Text Between Div

I am working on a webscraper project and can't get BeautifulSoup to give me the text between the Div. Below is my code. Any suggestions on how to get python to print just the "5x5" without the "Div to /Div" and without the whitespace?
source = requests.get('https://www.stor-it.com/self-storage/meridian-id-83646').text
soup = BeautifulSoup(source, 'lxml')
unit = soup.find('div', class_="unit-size")
print (unit)
This script returns the following:
<div class="unit-size">
5x5 </div>
You can use .text to retrieve the text, then strip() to remove the whitespace.
Try unit.text.strip()
Change your print statement from print(unit) to print(unit.text), and add .strip() to drop the surrounding whitespace.
You can also use a CSS class selector:
from bs4 import BeautifulSoup
source= '''
<div class="unit-size">
5x5 </div>
'''
soup = BeautifulSoup(source, 'lxml')
unit = soup.select('.unit-size')
print(unit[0].text.strip())
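The two stripping approaches give the same result for this markup. This is a minimal sketch using the snippet from the question with the built-in html.parser (a swapped-in parser, so lxml does not need to be installed); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

source = '''
<div class="unit-size">
5x5 </div>
'''

soup = BeautifulSoup(source, "html.parser")
unit = soup.find("div", class_="unit-size")

raw = unit.text                            # still has the newline and trailing space
stripped = unit.text.strip()               # plain Python str.strip()
also_stripped = unit.get_text(strip=True)  # stripping done by Beautiful Soup

print(repr(raw), repr(stripped), repr(also_stripped))
```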

BeautifulSoup and remove entire tag

I'm working with BeautifulSoup. When I encounter an <a href> tag, I want the tag itself to be removed, but that is not what happens.
For example, if I have:
<a href="/psf-landing/">
This is a test message
</a>
Currently, I get:
<a>
This is a test message
</a>
So, how can I get just:
This is a test message
Here is my code:
soup = BeautifulSoup(content_driver, "html.parser")
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
for titles in soup.findAll('a'):
    del titles['href']
tree = soup.prettify()
Try the .extract() method; in your code you're only deleting an attribute:
for titles in soup.findAll('a'):
    # .get() avoids a KeyError on <a> tags that have no href
    if titles.get('href') is not None:
        titles.extract()
Here you can see detailed examples: Dzone NLP examples
What you need is:
text = soup.get_text(strip=True)
Here is a sample:
from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
print (text)
You are looking for the unwrap() method. Have a look at the following snippet:
html = '''
<a href="/psf-landing/">
This is a test message
</a>'''
soup = BeautifulSoup(html, 'html.parser')
for el in soup.find_all('a', href=True):
    el.unwrap()
print(soup)
# This is a test message
Using href=True will match only the tags that have href as an attribute.
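The three approaches mentioned in this thread leave different things behind, which is easy to see side by side. This is a minimal sketch, assuming the link sits inside a <p> element (the wrapper is added here only to make the result visible and is not in the question's markup).

```python
from bs4 import BeautifulSoup

# The <p> wrapper is illustrative; the link itself is from the question.
html = '<p><a href="/psf-landing/">This is a test message</a></p>'

# del tag['href']: only the attribute goes, the <a> tag stays.
soup = BeautifulSoup(html, "html.parser")
del soup.a["href"]
attribute_deleted = str(soup)

# extract(): the <a> tag is removed together with its text.
soup = BeautifulSoup(html, "html.parser")
soup.a.extract()
extracted = str(soup)

# unwrap(): the <a> tag is removed but its text stays in the parent.
soup = BeautifulSoup(html, "html.parser")
soup.a.unwrap()
unwrapped = str(soup)

print(attribute_deleted)
print(extracted)
print(unwrapped)
```

Only unwrap() matches the output asked for in the question, since it keeps the message and drops the surrounding tag.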

I'm trying to collect the text with BeautifulSoup using python

I want to know how to collect the desired data with BeautifulSoup. Here is the HTML; I'm trying to collect the text "RoSharon1977":
<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>
You have to find the div by its id, then get the next ul element, and keep drilling down until you reach the a element, then get its text:
from bs4 import BeautifulSoup
# The <a> tag is assumed here: the code below looks for a link, but the
# markup as posted does not show one.
html = '''<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
<a>RoSharon1977</a>
</li></ul>
</div></div>'''
soup = BeautifulSoup(html, "html.parser")
print(soup.find('div', attrs={'id': 'twitter-view'}).find_next('ul').find_next('li').find_next('a').text)
Or, depending on how the whole web page looks, you could simply do:
soup = BeautifulSoup(html, "html.parser")
print(soup.find('a').text)
And if there are multiple a elements:
soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all('a'):
    print(a.text)
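A CSS selector can do the same drill-down in one step. This is a minimal sketch that targets the li element rather than a link, so it runs on the markup exactly as shown in the question; the names item and handle are illustrative.

```python
from bs4 import BeautifulSoup

html = '''<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>'''

soup = BeautifulSoup(html, "html.parser")

# "#twitter-view li" matches any <li> nested inside the element with
# id="twitter-view"; select_one() returns None when nothing matches.
item = soup.select_one("#twitter-view li")
handle = item.get_text(strip=True) if item is not None else None

print(handle)
```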
