Beautifulsoup find_all() captures too much text - python

I have some HTML I am parsing in Python using the BeautifulSoup package. Here's the HTML:
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
I am capturing the results using this code chunk:
names = soup3.find_all('div', {'class': "n"})
contact = soup3.find_all('div', {'class': "x"})
other = soup3.find_all('div', {'class': "x c"})
Right now, both classes 'x' and 'x c' are being captured in the 'contact' variable. How can I prevent this from happening?

Try:
soup.select('div[class="x"]')
Output:
[<div class="x">Address</div>, <div class="x">Phone</div>]

from bs4 import BeautifulSoup
html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, 'html.parser')
contact = soup.findAll("div", class_="x")[1]
print(contact)
Output:
<div class="x">Phone</div>

What about using sets?
others = set(soup.find_all('div', {'class': "x c"}))
contacts = set(soup.find_all('div', {'class': "x"})) - others
others will be {<div class="x c">Other</div>}
and
contacts will be {<div class="x">Phone</div>, <div class="x">Address</div>}
Note that this only works for this specific combination of classes. It may not work in general, depending on the combinations of classes you have in the HTML.
See BeautifulSoup webscraping find_all( ): finding exact match for more details on how .find_all() works.
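For background: class is a multi-valued attribute, so find_all('div', {'class': 'x'}) matches any div that has x among its classes, which is why the 'x c' div is picked up too. A minimal sketch of one alternative, assuming you want only the divs whose class list is exactly ['x'], is to pass a function to find_all. The helper name has_only_class_x is made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_only_class_x(tag):
    # tag.get("class") is a list of class names, e.g. ["x"] or ["x", "c"]
    return tag.name == "div" and tag.get("class") == ["x"]


contact = soup.find_all(has_only_class_x)
contact_text = [div.get_text() for div in contact]
print(contact_text)  # ['Address', 'Phone']
```

Unlike the set difference, this does not depend on which other class combinations appear in the HTML.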

Related

How can I print all links within a found elements?

I'm new to BeautifulSoup. I found all the cards, about 12, but when I try to loop through each card and print the link href, I keep getting this error:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
cards = soup.find_all('div', attrs={'class': 'up-card-section'})
# print(cards)
print(len(cards))
for link in cards.find_all('a'):
    print(link.get('href'))
cards = soup.find_all('div', attrs={'class': 'up-card-section'})
This returns a collection (a ResultSet) of all the divs found, so you need to loop over it before searching each div for its child a elements.
That said, you can use findChildren to find the a elements (in bs4 it is an older alias of find_all).
Example demo with a minimal piece of HTML:
from bs4 import BeautifulSoup
html = """
<div class='up-card-section'>
<div class='foo'>
<a href='example.com'>FooBar</a>
</div>
</div>
<div class='up-card-section'>
<div class='foo'>
<a href='example2.com'>FooBar</a>
</div>
</div>
"""
res = []
soup = BeautifulSoup(html, 'html.parser')
for card in soup.findAll('div', attrs={'class': 'up-card-section'}):
    for link in card.findChildren('a', recursive=True):
        print(link.get('href'))
Output:
example.com
example2.com
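As a hedged alternative to the nested loops, a single CSS selector can collect the links in one call. This is only a sketch on the same minimal HTML; the selector assumes every link you want sits somewhere inside a div with class up-card-section and has an href attribute:

```python
from bs4 import BeautifulSoup

html = """
<div class='up-card-section'>
  <div class='foo'>
    <a href='example.com'>FooBar</a>
  </div>
</div>
<div class='up-card-section'>
  <div class='foo'>
    <a href='example2.com'>FooBar</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.up-card-section a[href]" = any <a> with an href inside a matching div
hrefs = [a["href"] for a in soup.select("div.up-card-section a[href]")]
print(hrefs)  # ['example.com', 'example2.com']
```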

How to scrape for <span title>?

I have been trying to scrape indeed.com and ran into a problem. When scraping the position titles, some results give me 'new', because there is a span labeled 'new' before the position name. I have researched and tried different things but still haven't gotten anywhere, so I'm asking for help. The position names live within the span title tags, but when I search for 'span', in some cases I get 'new' first because find() grabs the first span it sees. I have tried to exclude it several ways but haven't had any luck.
Indeed Source Code:
<div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class = "label">new</span>
</div>
<span title="Freight Stocker"> Freight Stocker </span>
</h2>
</div>
Code I Tried:
import requests
from bs4 import BeautifulSoup
def extract(page):
    headers = {''}
    url = f'https://www.indeed.com/jobs?l=Bakersfield%2C%20CA&start={page}&vjk=42cee666fbd2fae9'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup
def transform(soup):
    divs = soup.find_all('div', class_ = 'heading4 color-text-primary singleLineTitle tapItem-gutter')
    for item in divs:
        res = item.find('span').text
        print(res)
    return
c=extract(0)
transform(c)
Results:
new
Hourly Warehouse Ope
Immediate FT/PT Open
Service Cashier/Rece
new
Cannabis Sales Repreresentative
new
new
new
new
new
You can use a CSS selector .resultContent span[title], which will select all <span> that have a title attribute within the class resultContent.
To use a CSS selector, use the select() method instead of .find():
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tag in soup.select(".resultContent span[title]"):
    print(tag.text)
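The same idea can be tried offline against the snippet from the question. This is only a sketch: the snippet does not include a .resultContent wrapper, so the selector below assumes the h2.jobTitle element shown in the question instead:

```python
from bs4 import BeautifulSoup

html = """
<div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
  <h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
    <div class="new topLeft holisticNewBlue desktop">
      <span class="label">new</span>
    </div>
    <span title="Freight Stocker"> Freight Stocker </span>
  </h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# span[title] skips the "new" label, because that span has no title attribute
titles = [span["title"] for span in soup.select("h2.jobTitle span[title]")]
print(titles)  # ['Freight Stocker']
```

Reading the title attribute rather than .text also avoids the surrounding whitespace inside the span.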

BeautifulSoup - how to call on a nested element

I just need a little help finding an element in my python script with Beautiful Soup.
Below is the html:
<div class="lg-7 lg-offset-1 md-24 sm-24 cols">
<div class="row pr__prices">
<div class="lg-24 md-12 cols">
<input id="analytics_prodPrice_47187" type="hidden" value="2.91">
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
What I am trying to do is get the product price. Looking at the html above, it appears to be in this section (the price is £3.49):
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
My issue is that even though I use Beautiful Soup to try and get the price like so:
pound = soup.find('span',attrs={'class':'pound'})
pence = soup.find('span',attrs={'class':'pence'})
prices.append(pound.text + pence.text)
I get this exception stating:
prices.append(pound.text + pence.text)
AttributeError: 'NoneType' object has no attribute 'text'
So to me it looks like it's returning a None or null. Does anybody have an idea on how I can get to the element?
EDIT
Looking at the answers below, I tried to replicate them but instead of using a static HTML, I call on the website url. What I noticed is that even though the code works for a static html, it doesn't work when I call on the url that contains the page that contains that html.
CODE:
from bs4 import BeautifulSoup
import pandas as pd
import requests
data = requests.get('https://www.screwfix.com/p/no-nonsense-sanitary-silicone-white-310ml/47187').text
soup = BeautifulSoup(data, 'html.parser')
currency = soup.select_one('span.pound')
currency_str = next(currency.strings).strip()
pound_str = currency.nextSibling
pence = soup.select_one('span.pence')
pence_str = next(pence.strings).strip()
print(f"{currency_str}{pound_str}{pence_str}") # £3.49
Error:
currency_str = next(currency.strings).strip()
AttributeError: 'NoneType' object has no attribute 'strings'
I have used your data as the html variable. One approach is to get the text within that div, using strip to remove unnecessary whitespace. The resulting main_div string still contains some letters, so pull out the digits with re to get your desired output:
from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html,"html.parser")
main_div=soup.find("div",attrs={"class":"pr__price"}).get_text(strip=True)
# raw string so that \d is passed to the regex engine unchanged
lst=re.findall(r"\d+", main_div)
print(".".join(lst[:2]))
Output:
3.49
Here's a different approach.
from bs4 import BeautifulSoup
data = '''\
<div class="lg-7 lg-offset-1 md-24 sm-24 cols">
<div class="row pr__prices">
<div class="lg-24 md-12 cols">
<input id="analytics_prodPrice_47187" type="hidden" value="2.91">
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
currency = soup.select_one('span.pound')
currency_str = next(currency.strings).strip()
pound_str = currency.nextSibling
pence = soup.select_one('span.pence')
pence_str = next(pence.strings).strip()
print(f"{currency_str}{pound_str}{pence_str}") # £3.49
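Regarding the EDIT in the question: select_one() and find() return None when nothing matches, so the AttributeError means the HTML that requests downloaded did not contain span.pound (why the live page differs is not something the snippets here can show). A minimal sketch of guarding against that, where extract_price is a hypothetical helper name and sample_html is a trimmed copy of the question's markup:

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Return a price string like '£3.49', or None if the markup is missing."""
    soup = BeautifulSoup(html, "html.parser")
    price_div = soup.select_one("div#product_price")
    if price_div is None:
        return None  # the page we received has no price block at all
    pound = price_div.select_one("span.pound")
    pence = price_div.select_one("span.pence")
    if pound is None or pence is None or pound.next_sibling is None:
        return None
    currency = pound.get_text(strip=True)    # currency symbol
    whole = str(pound.next_sibling).strip()  # text node right after the span
    fraction = next(pence.strings).strip()   # first string inside span.pence
    return f"{currency}{whole}{fraction}"


sample_html = (
    '<div id="product_price" class="pr__price">'
    '<span class="pound">£</span>3'
    '<span class="pence">.49<span class="incvat">INC VAT</span></span>'
    '</div>'
)
print(extract_price(sample_html))
print(extract_price("<html></html>"))
```

With a check like this, the script can report that the element was not found instead of crashing, which makes it easier to inspect what the server actually sent.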

How to extract the href attribute value from an a tag with beautiful soup?

This is the part of the html that I am extracting from the platform. It contains what I want to get: the value of the href attribute of the tag with the class "bookTitle".
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
After logging in using the mechanize library, I use this piece of code to try to extract it, but it returns the name of the book, as the code asks. I tried several ways to get only the href value, but none has worked so far.
from bs4 import BeautifulSoup as bs4
from requests import Session
from lxml import html
import Downloader as dw
import requests
def getGenders(browser : mc.Browser, url: str, name: str) -> None:
    res = browser.open(url)
    aux = res.read()
    html2 = bs4(aux, 'html.parser')
    with open(name, "w", encoding='utf-8') as file2:
        file2.write( str( html2 ) )
getGenders(br, "https://www.goodreads.com/shelf/show/art", "gendersBooks.html")
with open("gendersBooks.html", "r", encoding='utf8') as file:
    contents = file.read()
    bsObj = bs4(contents, "lxml")
aux = open("books.text", "w", encoding='utf8')
officials = bsObj.find_all('a', {'class' : 'booktitle'})
for text in officials:
    print(text.get_text())
    aux.write(text.get_text().format())
aux.close()
file.close()
Can you try this? (sorry if it doesn't work, I am not on a pc with python right now)
for text in officials:
    print(text['href'])
BeautifulSoup works just fine with the html code that you provided. If you want the text of a tag, use ".text". If you want the href, use ".get('href')", or, if you are sure the tag has an href value, "['href']".
Here is a simple example using your html code snippet:
from bs4 import BeautifulSoup
html_code = '''
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
'''
soup = BeautifulSoup(html_code, 'html.parser')
tag = soup.find('a', {'class':'bookTitle'})
# - Book Title -
title = tag.text
print(title)
# - Href Link -
href = tag.get('href')
print(href)
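To illustrate the difference between the two ways of reading an attribute, here is a minimal sketch on a made-up two-link snippet (the markup and variable names are assumptions for illustration only): .get('href') returns None when the attribute is missing, while ['href'] raises a KeyError.

```python
from bs4 import BeautifulSoup

snippet = '<a class="bookTitle" href="/book/show/1">With href</a><a class="bookTitle">No href</a>'
soup = BeautifulSoup(snippet, "html.parser")
with_href, without_href = soup.find_all("a", class_="bookTitle")

print(with_href.get("href"))     # /book/show/1
print(without_href.get("href"))  # None

try:
    without_href["href"]
except KeyError:
    # subscript access fails loudly when the attribute is absent
    print("this tag has no href attribute")
```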
I don't know why you downloaded the html, saved it to disk and then opened it again. If you just want to get some tag values, that is unnecessary: you can keep the html in a variable and pass that variable to BeautifulSoup.
I also see that you imported the requests library but used mechanize instead. As far as I know, requests is the easiest and most modern library for getting data from web pages in Python. You also imported Session from requests; a session is not necessary unless you want to make multiple requests and keep the connection to the server open for faster subsequent requests.
Also, if you open a file with the "with" statement, you are using a Python context manager, which handles closing the file, so you don't have to close it at the end.
So, to simplify your code without saving the downloaded html to disk, I would write it like this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.goodreads.com/shelf/show/art'
html_source = requests.get(url).content
soup = BeautifulSoup(html_source, 'html.parser')
# - To get the tag that we want (class matching is case-sensitive; the HTML uses 'bookTitle') -
tag = soup.find('a', {'class' : 'bookTitle'})
# - Extract Book Title -
title = tag.text
# - Extract href from Tag -
href = tag.get('href')
Now, if there are multiple "a" tags with the same class name, ('a', {'class' : 'bookTitle'}), then do it like this.
Get all the "a" tags first:
a_tags = soup.findAll('a', {'class' : 'bookTitle'})
Then scrape the info from each tag and append it to a books list:
books = []
for a in a_tags:
    try:
        title = a.text
        href = a.get('href')
        books.append({'title':title, 'href':href}) #<-- add each book dict to books list
        print(title)
        print(href)
    except Exception:
        # skip any tag that cannot be processed
        pass
To understand your code better, I advise you to read these related links:
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests:
https://requests.readthedocs.io/en/master/
Python Context Manager:
https://book.pythontips.com/en/latest/context_managers.html
https://effbot.org/zone/python-with-statement.htm

How to extract data from the below HTML code using beautifulsoup?

I want to extract data from the divs with class 'cinema' and 'timings' using BeautifulSoup in Python 3. How can I do it using soup.findAll?
<div data-order="0" class="cinema">
<div class="__name">SRS Shoppers Pride Mall<span class="__venue"> - Bijnor</span>
</div>
<div class="timings"><span class="__time _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22876','ET00015438','01:30 PM');">01:30 PM</span><span class="__time _center _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22877','ET00015438','04:00 PM');">04:00 PM</span><span class="__time _right _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22878','ET00015438','06:30 PM');">06:30 PM</span><span class="__time _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22879','ET00015438','09:00 PM');">09:00 PM</span>
</div>
</div>
This is my code:
for div in soup.findAll('div',{'class':'cinema'}):
    print div.text # It printed nothing ,the program just ended
You can specify both classes in findAll:
soup.findAll(True, {'class': ['cinema', 'timings']})
The "div" you are interested in is a child of another "div". To get that "div" you can use the .select method.
from bs4 import BeautifulSoup
html = "<your html>"  # placeholder: put your HTML string here
soup = BeautifulSoup(html, 'lxml')
for div in soup.select('div.cinema > div.timings'):
    print(div.get_text(strip=True))
Or iterate over the find_all() result and use the .find() method to return the "div" whose class is "timings":
for div in soup.find_all('div', class_='cinema'):
    timings = div.find('div', class_='timings')
    print(timings.get_text(strip=True))
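If you need the individual show times rather than one run-together string, a sketch along these lines may help. It uses a shortened copy of the question's HTML (the onclick attributes are dropped), and the shows list with its dictionary keys is just an illustrative structure, not something from the original post:

```python
from bs4 import BeautifulSoup

html = """
<div data-order="0" class="cinema">
<div class="__name">SRS Shoppers Pride Mall<span class="__venue"> - Bijnor</span>
</div>
<div class="timings"><span class="__time _available">01:30 PM</span><span class="__time _center _available">04:00 PM</span>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

shows = []
for cinema in soup.find_all("div", class_="cinema"):
    name_div = cinema.find("div", class_="__name")
    timings_div = cinema.find("div", class_="timings")
    if name_div is None or timings_div is None:
        continue  # skip cinema blocks that lack a name or timings
    # class_="__time" matches every span that has __time among its classes
    times = [span.get_text(strip=True)
             for span in timings_div.find_all("span", class_="__time")]
    shows.append({"cinema": name_div.get_text(" ", strip=True), "times": times})

print(shows)
```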
