I have found similar questions but none that directly address my issue. I have worked on this for about a week now with no luck.
I am trying to scrape data from this link: https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070
The issue is that the value I am looking for sits in a <span> with no class attribute, and the surrounding <div> classes are shared with other values on the page. I want my code to return $22,807, but everything I try returns either $25,195 or []. See the following HTML:
<div class="text-right col-3 col-sm-4 col-md-6">
<div class="label-block label-block-1 label-block-sm-2 text-muted" data-qa="vehicle-header-msrp"
data-test="vehicleHeaderMsrp">
<div class="label-block-title" data-qa="LabelBlock-title" data-test="labelBlockTitle"></div>
<div class="label-block-subtitle" data-qa="LabelBlock-subTitle" data-test="labelBlockSubTitle"></div>
<div data-qa="LabelBlock-text" class="label-block-text" data-test="labelBlockText">
<span class="pricing-block-amount-strikethrough">$25,195</span>
</div>
</div>
</div>
<div class="text-right col-3 col-sm-4 col-md-6">
<div class="label-block label-block-1 label-block-sm-2" data-qa="vehicle-header-average-market-price"
data-test="vehicleHeaderAverageMarketPrice">
<div class="label-block-title" data-qa="LabelBlock-title" data-test="labelBlockTitle"></div>
<div class="label-block-subtitle" data-qa="LabelBlock-subTitle" data-test="labelBlockSubTitle"></div>
<div data-qa="LabelBlock-text" class="label-block-text" data-test="labelBlockText">
<span class="">$22,807</span>
</div>
</div>
</div>
I can easily get the $25,195 returned with the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:70.0) Gecko/20190101 Firefox/70.0"
}
url = "https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070"
print(url)

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
test = soup.find('span', {'class': 'pricing-block-amount-strikethrough'})
print(test.get_text())
But no combination of calls that I try will return the $22,807 that I need.
What's interesting is that I can get the $25 value if I use
test = soup.find('div', {'class': 'label-block label-block-1 label-block-sm-2 text-muted'})
So I assumed that I could simply delete the "text-muted" part like:
test = soup.find('div', {'class': 'label-block label-block-1 label-block-sm-2'})
to get the $22 number but it just returns [ ].
Disclaimer: the dollar amount that I need changes frequently so if you help with this and end up getting a number slightly different than $22,807 it might still be correct. If you click on the link, the number I am looking for is the "Market Average" not the "MSRP."
Thank you!
If you open the page in a browser, it takes a moment for the second value you are looking for to appear: it is filled in by JavaScript. The requests module just fetches the initial content and doesn't wait for the page to load completely. This is where you add Selenium alongside bs4: let Selenium wait for the site to load, then get the page content.
You can download geckodriver from https://github.com/mozilla/geckodriver/releases.
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070"
driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)
time.sleep(7)  # give the JavaScript time to fill in the prices

soup = BeautifulSoup(driver.page_source, 'html.parser')
div = soup.find_all('div', {'class': 'label-block-text'})
for x in div:
    span = x.find('span')
    print(span.get_text())
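Once the rendered page source is in hand, you can also target the Market Average block directly through its data-qa attribute rather than filtering by class, since that is the one attribute that differs between the two blocks in the HTML above. A minimal sketch, reusing the soup object from the code above:
# Pick the "Market Average" block by its data-qa attribute, which
# distinguishes it from the MSRP block in the question's HTML.
block = soup.find('div', {'data-qa': 'vehicle-header-average-market-price'})
if block is not None:
    print(block.find('span').get_text())  # e.g. $22,807 (the value changes frequently)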
Hello, I want to pull the links from this page. With my own methods I get everything in that field, but I just need the links. How can I scrape the links? (Python-BeautifulSoup)
make_list = base_soup.findAll('div', {'a class': 'link--muted no--text--decoration result-item'})
one_make = make_list.findAll('href')
print(one_make)
The structure to extract the data is as follows:
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top"> == $0
<a class="link--muted no--text--decoration result-item" href="https://link structure"
Every single link I want to collect is here.(link structure)
I tried methods like the one above. Thank you very much in advance for your help.
Note: in newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
Iterate your ResultSet and extract the value of the href attribute:
make_list = soup.find_all('a', {'class': 'link--muted no--text--decoration result-item'})
for e in make_list:
    print(e.get('href'))
Example
from bs4 import BeautifulSoup
html='''
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top">
<a class="link--muted no--text--decoration result-item" href="https://link structure"></a>
</div>
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top">
<a class="link--muted no--text--decoration result-item" href="https://link structure"></a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
make_list = soup.find_all('a', {'class': 'link--muted no--text--decoration result-item'})
for e in make_list:
    print(e.get('href'))
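As a follow-up, the same extraction can be written with select() and a CSS selector, per the note above; the classes are joined with dots, and [href] restricts the match to anchors that actually carry an href:
# CSS-selector equivalent of the find_all() call above
for e in soup.select('a.link--muted.no--text--decoration.result-item[href]'):
    print(e.get('href'))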
This is an example of how you can achieve that:
from bs4 import BeautifulSoup
html = '''
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top"> == $0
<a class="link--muted no--text--decoration result-item" href="https://link structure"></a>
</div>
<div class="cBox-body cBox-body--eyeCatcher" data-testid="no-top"> == $0
<a class="link--muted no--text--decoration result-item" href="https://link example.2"></a>
</div>
'''
soup = BeautifulSoup(html, features="lxml")
anchors = soup.find_all('a')
for anchor in anchors:
    print(anchor['href'])
Alternatively, you can use a third-party service such as WebScrapingAPI to achieve your goal. I recommend this service because it is beginner friendly and offers CSS extraction and many advanced features such as IP rotation, JavaScript rendering, CAPTCHA solving, custom geolocation and more, which you can find out about by checking the docs. This is an example of how you can get links from a webpage using WebScrapingAPI:
from bs4 import BeautifulSoup
import requests
import json
API_KEY = '<YOUR-API-KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://docs.webscrapingapi.com/'
PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "extract_rules": '{"linksList": {"selector": "a[href]", "output": "html", "all": 1 }}',
}

response = requests.get(SCRAPER_URL, params=PARAMS)
parsed_result = json.loads(response.text)
linksList = parsed_result['linksList']

for link in linksList:
    soup = BeautifulSoup(link, features='lxml')
    print(soup.find('a').get('href'))
If you are interested you can check more information about this on our Extraction Rules Docs
I have been trying to scrape indeed.com and ran into a problem. When scraping for the titles of the positions, on some results I get 'new' because there is a span labeled 'new' before the position name. I have researched and tried different things but still haven't gotten anywhere, so I'm here for help. The position names live within the spans that have a title attribute, but when I scrape for 'span' I sometimes get 'new' first, because it grabs the first span it sees. I have tried to exclude it several ways but haven't had any luck.
Indeed Source Code:
<div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class = "label">new</span>
</div>
<span title="Freight Stocker"> Freight Stocker </span>
</h2>
</div>
Code I Tried:
import requests
from bs4 import BeautifulSoup
def extract(page):
    headers = {''}
    url = f'https://www.indeed.com/jobs?l=Bakersfield%2C%20CA&start={page}&vjk=42cee666fbd2fae9'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):
    divs = soup.find_all('div', class_='heading4 color-text-primary singleLineTitle tapItem-gutter')
    for item in divs:
        res = item.find('span').text
        print(res)
    return

c = extract(0)
transform(c)
Results:
new
Hourly Warehouse Ope
Immediate FT/PT Open
Service Cashier/Rece
new
Cannabis Sales Repreresentative
new
new
new
new
new
You can use a CSS selector .resultContent span[title], which will select all <span> that have a title attribute within the class resultContent.
To use a CSS selector, use the select() method instead of .find():
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tag in soup.select(".resultContent span[title]"):
    print(tag.text)
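For example, folded into the transform() function from the question (a sketch; it assumes the job cards sit under the resultContent class mentioned above):
# Sketch of transform() using the CSS selector from this answer;
# span[title] skips the "new" label span, which has no title attribute.
def transform(soup):
    for tag in soup.select('.resultContent span[title]'):
        print(tag.text)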
I just need a little help finding an element in my python script with Beautiful Soup.
Below is the html:
<div class="lg-7 lg-offset-1 md-24 sm-24 cols">
<div class="row pr__prices">
<div class="lg-24 md-12 cols">
<input id="analytics_prodPrice_47187" type="hidden" value="2.91">
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
What I am trying to do is get the product price, and looking at the html above, it looks like it is found within this section from the html above (price is £3.49):
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
My issue is that even though I use Beautiful Soup to try and get the price like so:
pound = soup.find('span',attrs={'class':'pound'})
pence = soup.find('span',attrs={'class':'pence'})
prices.append(pound.text + pence.text)
I get this exception stating:
prices.append(pound.text + pence.text)
AttributeError: 'NoneType' object has no attribute 'text'
So to me it looks like it's returning a None or null. Does anybody have an idea on how I can get to the element?
EDIT
Looking at the answers below, I tried to replicate them, but instead of using static HTML I request the website URL. What I noticed is that even though the code works on the static HTML, it doesn't work when I call the URL of the page that contains that HTML.
CODE:
from bs4 import BeautifulSoup
import pandas as pd
import requests
data = requests.get('https://www.screwfix.com/p/no-nonsense-sanitary-silicone-white-310ml/47187').text
soup = BeautifulSoup(data, 'html.parser')
currency = soup.select_one('span.pound')
currency_str = next(currency.strings).strip()
pound_str = currency.nextSibling
pence = soup.select_one('span.pence')
pence_str = next(pence.strings).strip()
print(f"{currency_str}{pound_str}{pence_str}") # £3.49
Error:
currency_str = next(currency.strings).strip()
AttributeError: 'NoneType' object has no attribute 'strings'
I have taken your data as html. The approach you can follow: get the text within that div, using strip to remove unnecessary whitespace. If you look at main_div, it still contains some letters, so remove them using re and you finally get your desired output.
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, "html.parser")
main_div = soup.find("div", attrs={"class": "pr__price"}).get_text(strip=True)
lst = re.findall(r"\d+", main_div)  # digit groups: ['3', '49', '11', '26']
print(".".join(lst[:2]))
Output:
3.49
Here's a different approach.
from bs4 import BeautifulSoup
data = '''\
<div class="lg-7 lg-offset-1 md-24 sm-24 cols">
<div class="row pr__prices">
<div class="lg-24 md-12 cols">
<input id="analytics_prodPrice_47187" type="hidden" value="2.91">
<div class="pr__pricepoint">
<div id="product_price" class="pr__price">
<span class="pound">
£</span>3<span class="pence">.49<span class="incvat">INC VAT</span></span>
<span class="price__extra">(<span id="unit_price">£11.26</span>/<span id="unit_price_measure">Ltr</span>)</span>
</div>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
currency = soup.select_one('span.pound')
currency_str = next(currency.strings).strip()
pound_str = currency.nextSibling
pence = soup.select_one('span.pence')
pence_str = next(pence.strings).strip()
print(f"{currency_str}{pound_str}{pence_str}") # £3.49
So I have this code, which successfully extracts each product name on the page.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
page_url = "https://www.tenniswarehouse-europe.com/catpage-WILSONRACS-EN.html"
uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
containers = page_soup.findAll("div", {"class":"product_wrapper cf rac"})
for container in containers:
    name = container.div.img["alt"]
    print(name)
And I'm trying to extract the prices from the HTML below. I tried the same approach as above but got an 'index out of range' error. I also tried the div where the price sits, and even the span, but to no avail.
<div class="product_wrapper cf rac">
<div class="image_wrap">
<a href="https://www.tenniswarehouse-europe.com/Wilson_Pro_Staff_RF_97_V130_Racket/descpageRCWILSON-97V13R-EN.html">
<img class="cell_rac_img" src="https://img.tenniswarehouse-europe.com/cache/56/97V13R-thumb.jpg" srcset="https://img.tenniswarehouse-europe.com/cache/112/97V13R-thumb.jpg 2x" alt="Wilson Pro Staff RF 97 V13.0 Racket" />
</a>
</div>
<div class="text_wrap">
<a class="name " href="https://www.tenniswarehouse-europe.com/Wilson_Pro_Staff_RF_97_V130_Racket/descpageRCWILSON-97V13R-EN.html">Wilson Pro Staff RF 97 V13.0 Racket</a>
<div class="pricing">
<span class="price"><span class="convert_price">264,89 €</span></span>
<span class="msrp">SRP <span class="convert_price">300,00 €</span></span>
</div>
<div class="pricebreaks">
<span class="pricebreak">Price for 2: <span class="convert_price">242,90 €</span> each</span>
</div>
<div>
<p>Wilson updates the cosmetic of Federer's RF97 but keeps the perfect spec profile and sublime feel that has come to define this iconic racket. Headsize: 626cm². String Pattern: 16x19. Standard Length</p>
<div class="cf">
<div class="feature_links cf">
<a class="review ga_event" href="/Reviews/97V13R/97V13Rreview.html" data-trackcategory="Product Info" data-trackaction="TWE Product Review" data-tracklabel="97V13R - Wilson Pro Staff RF 97 V13.0 Racket">TW Reviews</a>
<a class="feedback ga_event" href="/feedback.html?pcode=97V13R" data-trackcategory="Product Info" data-trackaction="TWE Customer Review" data-tracklabel="97V13R - productName">Customer Reviews</a>
<a class="video_popup ga_event" href="/productvideo.html?pcode=97V13R" data-trackcategory="Video" data-trackaction="Cat - Product Review" data-tracklabel="Wilson_Pro_Staff_RF_97_V130_Racket">Video</a>
</div>
</div>
</div>
</div>
</div>
</td>
<td class="cat_border_cell">
<div class="product_wrapper cf rac">
I guess this will work for you:
prices = page_soup.findAll("span", {"class":"convert_price"})
Then you'll have a ResultSet with all prices on the page; you can access single prices with prices[0] ... prices[len(prices)-1].
If you want to remove the html tags from the prices, do prices[0].text.
But where exactly is this HTML from? Because the prices aren't on the page of the link you souped in your code, so in that soup you shouldn't find any prices.
The above code works for the html code you provided there.
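If the prices do show up in your soup, here is a sketch that pairs each product name with its main price by scoping the search to each product wrapper (class names taken from the HTML in the question):
# Search inside each product wrapper rather than across the whole page,
# so names and prices stay paired.
for container in page_soup.find_all("div", {"class": "product_wrapper cf rac"}):
    name = container.div.img["alt"]
    price = container.select_one("span.price span.convert_price")
    if price is not None:  # a wrapper may lack a price in the fetched HTML
        print(name, price.text)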
!SOLUTION!:
A way to solve this issue is by using Selenium webdriver together with BeautifulSoup. I can't seem to find any other (easier) way.
First, install Selenium with pip install selenium
Download the driver for your browser here.
What we do is we click the "Set Selections" button which appears when opening the website, then we soup the page with the prices already loaded in. Enjoy my code below.
from bs4 import BeautifulSoup
from selenium import webdriver

# use the path of your driver.exe
driver = webdriver.Firefox(executable_path=r"C:\Program Files (x86)\geckodriver.exe")
# for Chrome it's: driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe")

# open your website link
driver.get("https://www.tenniswarehouse-europe.com/catpage-WILSONRACS-EN.html")

# button for submitting the location
button1 = driver.find_element_by_class_name("vat_entry_opt-submit")
button1.click()

# now that the button is clicked the prices are loaded in and we can soup this page
html = driver.page_source
page_soup = BeautifulSoup(html, "html.parser")

# extract all pricing blocks; use class "pricing" instead of "price" here,
# because the red sale prices sit in class "sale"
pricing = page_soup.find_all("div", {"class": "pricing"})

# loop over every block and collect each price into an array named 'price';
# pricing[0].span.text would give you a single one
price = []
for block in pricing:
    price.append(block.span.text)

# driver.close() closes your webdriver window
Would this help:
In [244]: soup = BeautifulSoup(requests.get('https://www.tenniswarehouse-europe.com/Wilson_Pro_Staff_RF_97_V130_Racket/descpageRCWILSON-97V13R-EN.html').content, 'html.parser')
In [245]: soup.find('h1', class_='name').text.strip()
Out[245]: 'Wilson Pro Staff RF 97 V13.0 Racket'
In [246]: soup.find(class_='convert_price').text.strip()
Out[246]: '242,90 €'
I have been trying to pull data from pantip.com, including the title, post story and all comments, using BeautifulSoup.
However, I could pull only the title and post story. I could not get the comments.
Here is the code for the title and post story:
import requests
import re
from bs4 import BeautifulSoup
# specify the url
url = 'https://pantip.com/topic/38372443'
# Split Topic number
topic_number = re.split('https://pantip.com/topic/', url)
topic_number = topic_number[1]
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# Capture title
elementTag_title = soup.find(id = 'topic-'+ topic_number)
title = str(elementTag_title.find_all(class_ = 'display-post-title')[0].string)
# Capture post story
resultSet_post = elementTag_title.find_all(class_ = 'display-post-story')[0]
post = resultSet_post.contents[1].text.strip()
I tried to find the comments by id:
elementTag_comment = soup.find(id = "comments-jsrender")
and I got the result below:
elementTag_comment =
<div id="comments-jsrender">
<div class="loadmore-bar loadmore-bar-paging"> <a href="javascript:void(0)">
<span class="icon-expand-left"><small>▼</small></span> <span class="focus-
txt"><span class="loading-txt">กำลังโหลดข้อมูล...</span></span> <span
class="icon-expand-right"><small>▼</small></span> </a> </div>
</div>
The question is: how can I get all the comments? Please suggest how I can fix this.
The reason you're having trouble locating the rest of these posts is that the site is populated by dynamic JavaScript. To get around this you can implement a solution with Selenium; see https://github.com/mozilla/geckodriver/releases for the correct driver, and add it to your system variables. Selenium will load the page and you will have full access to all the attributes you see in your screenshot; with just BeautifulSoup that data is not being parsed.
Once you do that you can use the following to return each of the posts data:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://pantip.com/topic/38372443'
driver = webdriver.Firefox()
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content, 'lxml')

# comment blocks carry ids like "comment-1", "comment-2", ...
for div in soup.find_all("div", id=lambda value: value and value.startswith("comment-")):
    if len(str(div.text).strip()) > 1:
        print(str(div.text).strip())

driver.quit()
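If the comments are still missing from page_source, an explicit wait is more reliable than a fixed sleep. A minimal sketch, assuming the comment ids follow the comment- pattern targeted above; place it between driver.get(url) and reading driver.page_source:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one comment div to appear;
# the "comment-" id prefix is taken from the loop above.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div[id^='comment-']"))
)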