So I'm new to parsing HTML with Python, and I want to get the price of this lumber from the following link:
https://www.lowes.com/pd/2-in-x-4-in-x-8-ft-Whitewood-Stud-Common-1-5-in-x-3-5-in-x-96-in-Actual/1000074211
This is what I have so far, but I'm getting an error that says "AttributeError: 'NoneType' object has no attribute 'text'" :
import requests
from bs4 import BeautifulSoup
HEADERS = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36"}
URL = "https://www.lowes.com/pd/2-in-x-4-in-x-8-ft-Whitewood-Stud-Common-1-5-in-x-3-5-in-x-96-in-Actual/1000074211"
r = requests.get(URL,headers=HEADERS)
c=r.content
soup = BeautifulSoup(c, "html.parser")
price_box = soup.find("div", class_="sq-bqyKva.ehfErk")
price=price_box.text.strip()
print(price)
Why am I getting this error and how can I fix it?
I can't see the site, but most likely you're getting the error because BeautifulSoup cannot find any element with the class "sq-bqyKva.ehfErk", so find() returns None.
Can you print out the soup and search for the class manually to check that it actually exists?
Also, judging by the class name, the div you are trying to find is probably generated by JavaScript. In that case it is not in the HTML that requests downloads, so BeautifulSoup won't be able to find it, and you might want to look into other tools such as Selenium.
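To make the failure mode concrete, here is a minimal sketch that runs against an inline HTML snippet rather than the live page. The markup, the class names and the text_or_none helper are all made up for illustration. It shows two things: find() returns None when nothing matches, and a dotted string passed to class_ is compared literally rather than read as a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a product page; not the real Lowe's HTML.
html = '<div class="price main">$5.98</div>'
soup = BeautifulSoup(html, "html.parser")

# class_ compares against class names literally, so "price.main" is not
# treated as a CSS selector and nothing matches.
missing = soup.find("div", class_="price.main")

# select_one() does understand CSS syntax: a div carrying both classes.
found = soup.select_one("div.price.main")


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


print(text_or_none(missing))
print(text_or_none(found))
```

Checking for None before touching .text turns the AttributeError into a value you can test for. It does not help if the element is simply absent from the downloaded HTML.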
The site is dynamic, relying on a script to populate the page with data from a JSON string, which can be found in a script tag and parsed via the json module to access the price:
import requests
from bs4 import BeautifulSoup as soup
import json
headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36"}
d = soup(requests.get('https://www.lowes.com/pd/2-in-x-4-in-x-8-ft-Whitewood-Stud-Common-1-5-in-x-3-5-in-x-96-in-Actual/1000074211', headers=headers).text, 'html.parser')
# The product data is embedded as JSON-LD inside a <script> tag
data = json.loads(d.select_one('script[data-react-helmet="true"][type="application/ld+json"]').contents[0])
price = data[2]['offers']['price']
print(price)
Output:
5.98
This is a difficult site. You will not be able to get the price using that class name, because the element is created by JavaScript after the page loads. Web scraping gets harder when a site builds its DOM elements with JavaScript. "View Source" shows the response that BeautifulSoup will receive. On closer inspection, the price on your page is inside a JSON value in a "script" tag:
import requests
from bs4 import BeautifulSoup
import json
HEADERS = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/91.0.4472.164 Safari/537.36"}
URL = "https://www.lowes.com/pd/2-in-x-4-in-x-8-ft-Whitewood-Stud-Common-1-5-in-x-3-5-in-x-96-in-Actual/1000074211"
r = requests.get(URL,headers=HEADERS)
soup = BeautifulSoup(r.content, "html.parser")
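# Select the JSON-LD script explicitly instead of relying on it being the first <script>
contents = soup.find("script", type="application/ld+json")
json_object = json.loads(contents.string)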
print(json_object[2]["offers"]["price"])
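Indexing the parsed JSON with [2] depends on the order of the entries, which the site may change. As a hedged sketch, the helper below (find_offer_price is a hypothetical name, and the sample payload is invented to mirror the shape described above) searches the JSON-LD for the first entry that has an offers price instead.

```python
import json


def find_offer_price(ld_json):
    """Return the first offers.price found in parsed JSON-LD, else None."""
    # JSON-LD may be a single object or a list of objects.
    entries = ld_json if isinstance(ld_json, list) else [ld_json]
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        offers = entry.get("offers")
        # "offers" can itself be one object or a list of them.
        if isinstance(offers, list):
            offers = offers[0] if offers else None
        if isinstance(offers, dict) and "price" in offers:
            return offers["price"]
    return None


# Invented payload shaped like the one in the answers above.
sample = json.loads("""
[
  {"@type": "BreadcrumbList"},
  {"@type": "Organization", "name": "Example"},
  {"@type": "Product", "offers": {"@type": "Offer", "price": 5.98}}
]
""")
print(find_offer_price(sample))
```

The function returns None rather than raising when no price is present, so the caller can decide how to handle a page whose structure has changed.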
I'm trying to scrape the sitemap of a site using BeautifulSoup, but I've run into a problem. My code fails with this error:
"TypeError: 'NoneType' object is not subscriptable"
Here is my code:
import requests
from bs4 import BeautifulSoup as bs
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
url = "https://www.celebheights.com/"
res= requests.get(url, headers=headers)
html = bs(res.text, 'html.parser')
lilink = html.findAll('li')
for li in lilink:
    alink = li.find('a')['href']
    print(alink)
How can I solve this problem?
You can use print() to inspect the variables on the line that causes the problem.
This page has some <li> elements without an <a>, and that is what causes the error.
You have to check the result of li.find('a') before indexing it, because sometimes it is None.
for li in lilink:
    alink = li.find('a')
    if alink:
        url = alink['href']
        print(url)
    else:
        print('<li> without <a>:', li)
Result:
https://www.celebheights.com/
https://www.celebheights.com/comments.html
https://www.celebheights.com/s/latest_1.html
https://www.celebheights.com/s/compare.php
https://www.celebheights.com/s/top50.html
https://www.youtube.com/user/robpaul
<li> without <a>: <li id="ilsook"></li>
https://www.celebheights.com/s/latest_1.html
https://www.celebheights.com/s/Sean-Kanan-52921.html
https://www.celebheights.com/s/Michael-Parks-52920.html
https://www.celebheights.com/s/Harlan-Drum-52919.html
https://www.celebheights.com/s/Patricia-Medina-52918.html
https://www.celebheights.com/s/Nan-Leslie-52917.html
https://www.celebheights.com/s/Don-Cornelius-52916.html
https://www.celebheights.com/s/Maria-Sten-52915.html
https://www.celebheights.com/s/Bruce-McGill-52914.html
https://www.celebheights.com/comments.html
https://www.celebheights.com/s/compare.php
https://www.celebheights.com/s/top50.html
https://www.celebheights.com/s/Justin-Bieber-47348.html
https://www.celebheights.com/s/Tom-Cruise-3.html
https://www.celebheights.com/s/Brad-Pitt-371.html
https://www.celebheights.com/s/Arnold-Schwarzenegger-177.html
https://www.celebheights.com/s/Sylvester-Stallone-347.html
https://www.celebheights.com/sneakers/
https://www.celebheights.com/a/23.html
https://www.celebheights.com/a/
https://www.celebheights.com/s/tagsA.html
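If the empty <li> elements are of no interest, a CSS selector can skip them up front. This is a minimal sketch on made-up markup, not the real page. Note that it behaves slightly differently from the loop above: it collects every <a> with an href inside any <li>, not just the first one per item.

```python
from bs4 import BeautifulSoup

# Made-up markup: one <li> has a link, one is empty, one has an <a> without href.
html = """
<ul>
  <li><a href="/a.html">A</a></li>
  <li id="empty"></li>
  <li><a name="anchor-only">B</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "li a[href]" matches only <a> tags that sit inside an <li> and carry an href,
# so neither a missing <a> nor a missing attribute can raise an error.
links = [a["href"] for a in soup.select("li a[href]")]
print(links)
```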
I'm unable to find the ratings (the number next to the stars) on the Rakuten website.
I tried to locate the element with BeautifulSoup, but it doesn't work.
import time
import requests
!pip install beautifulsoup4
import bs4
!pip install lxml
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
products =[]
for i in range(1,2): # Iterate from page 1 to the last page
    url = "https://www.rakuten.com.tw/shop/pandq/product/?l-id=tw_shop_inshop_cat&p={}".format(i)
    r = requests.get(url, headers = headers)
    soup = bs4.BeautifulSoup(r.text,"lxml")
    Soup = soup.find_all("div",class_='b-mod-item-vertical products-grid-section')
    for product in Soup:
        productcount = product.find_all("div",class_='b-content')
        print(productcount)
What happens?
The elements are not selected correctly, so you won't get the expected result.
How to fix?
As your screenshot shows several things (price, rating), I will focus on the rating.
First select all the items:
soup.select('.b-item')
Then iterate the result set and select the <a> that holds the rating:
item.select_one('.product-review')
Get rid of all the special characters:
item.select_one('.product-review').get_text(strip=True).strip('(|)')
Example
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
r = requests.get('https://www.rakuten.com.tw/shop/pandq/product/?l-id=tw_shop_inshop_cat&p=1',headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
for item in soup.select('.b-item'):
    rating = item.select_one('.product-review').get_text(strip=True).strip('(|)') if item.select_one('.product-review') else None
    print(rating)
Output
5
36
21
32
8
...
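The values printed above are still strings. If numbers are needed, for example to sort them or load them into a DataFrame, a small conversion step can follow. The sketch below uses only the standard library. parse_rating_number is a hypothetical helper name, and the input format "(36)" is assumed from the strip('(|)') call in the example.

```python
import re


def parse_rating_number(text):
    """Extract the first run of digits from text like '(36)'; None if absent."""
    if not text:
        return None
    # Drop thousands separators so "(1,234)" is read as 1234.
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None


print(parse_rating_number("(36)"))
print(parse_rating_number(None))
```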
I have this webpage: https://www.epant.gr/apofaseis-gnomodotiseis/item/1451-apofasi-730-2021.html
and I need to scrape the second-to-last row of the large table.
In other words, I need to get this (Ένδικα Μέσα -) from the table.
This is my progress so far:
from bs4 import BeautifulSoup as soup
import requests
import csv
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/item/1451-apofasi-730-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup1 = BeautifulSoup(page.content,"html.parser")
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
soup3 = soup2.find('td', text = "Ένδικα Μέσα")
print(soup3)
Thank you very much
Thank you very much, it works like a charm
You are close to a solution. Clean up your soups and get the parent of your result, which gives you the whole tr:
soup.find('td', text = "Ένδικα Μέσα").parent.get_text(strip=True)
or use find_next('td') to access the text of its neighbour:
soup.find('td', text = "Ένδικα Μέσα").find_next('td').text
Example
from bs4 import BeautifulSoup
import requests
import csv
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/item/1451-apofasi-730-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup = BeautifulSoup(page.content,"html.parser")
row = soup.find('td', text = "Ένδικα Μέσα").parent.get_text(strip=True)
print(row)
Output
Eνδικα Μέσα -
You can use a CSS selector for that field. An easy way to copy the selector for an element is to open your browser's inspector, right-click the HTML tag you want, and choose Copy > Copy selector.
With Beautiful Soup you can then pass it to soup.select(selector). The documentation describes this in more detail.
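As a rough illustration of the selector approach, the sketch below runs soup.select() on a made-up table. The id "details" and the row contents are assumptions for the example, not the real page's markup. It picks the second-to-last row by position instead of by its text.

```python
from bs4 import BeautifulSoup

# Made-up table standing in for the real page.
html = """
<table id="details">
  <tr><td>Title</td><td>Decision 730/2021</td></tr>
  <tr><td>Ένδικα Μέσα</td><td>-</td></tr>
  <tr><td>Attachments</td><td>file.pdf</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of matching rows, so negative indexing works.
rows = soup.select("table#details tr")
second_last = rows[-2]
cells = [td.get_text(strip=True) for td in second_last.find_all("td")]
print(cells)
```

Selecting by position breaks if the site adds or removes rows, whereas matching on the cell text, as in the other answer, breaks if the label changes. Which is safer depends on the page.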
I need your help to get the "Description" content of this URL using BeautifulSoup in Python.
I have tried the code below, but it only returns None!
import requests as rq
from bs4 import BeautifulSoup
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
page = rq.get(url, headers=hdr)
soup = BeautifulSoup(page.content, "html.parser")
description = soup.find('div', {'class': 'force-wrapping ng-star-inserted'})
I tried it and saw that the soup has no element with the class force-wrapping ng-star-inserted, because requests gives you the page source. That differs from what you see in dev tools; to view the source, press Ctrl+U. There you can see that the description is in a meta tag whose name is description. So what you need to do is find this tag and read its content attribute. For example:
res = soup.find('meta', {"name":"description"})
print(res['content'])
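A self-contained version of the same idea, run on invented markup that imitates a JavaScript-rendered page (an empty app root plus a meta description). The attrs dictionary is needed because name= as a keyword would be taken as the tag name. The None check is an addition in case a page has no such tag.

```python
from bs4 import BeautifulSoup

# Invented page source: the body is empty until JavaScript runs,
# but the description is already present in the <head>.
html = """
<html><head>
  <meta name="description" content="A short summary of the page.">
</head><body><app-root></app-root></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

meta = soup.find("meta", attrs={"name": "description"})
description = meta.get("content") if meta is not None else None
print(description)

# The class seen in dev tools does not exist in the raw source.
print(soup.find("div", class_="force-wrapping"))
```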
I want to scrape the information on AMD stock that Google provides.
I have been able to scrape the whole webpage, but as soon as I try to get a specific div or class I cannot find anything and the console prints [].
When scraping the whole page I cannot find those classes either. After searching, I found that this content is possibly generated by JavaScript and can somehow be accessed with Selenium. I tried to use Selenium WebDriver, but this got me nowhere.
This is what I have so far:
import requests
from bs4 import BeautifulSoup
import urllib3
from selenium import webdriver
requests.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"}
url = "https://www.google.com/search?q=amd+stock&oq=amd+stock&aqs=chrome..69i57j35i39j0l5j69i60.1017j0j7&sourceid=chrome&ie=UTF-8"
source_code = requests.get(url, requests.headers)
soup = BeautifulSoup(source_code.text, "html.parser")
amd = soup.find_all('div', attrs = {'class': 'aviV4d'})
print(amd)
When printing 'soup' I get the whole page, but when printing 'amd' I get [].
Your code is OK, but pass the headers with the headers= parameter in the requests.get() call:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"}
url = "https://www.google.com/search?q=amd+stock&oq=amd+stock&aqs=chrome..69i57j35i39j0l5j69i60.1017j0j7&sourceid=chrome&ie=UTF-8"
source_code = requests.get(url, headers=headers)
soup = BeautifulSoup(source_code.text, "html.parser")
amd = soup.find('div', attrs = {'class': 'aviV4d'})
print(amd.get_text(strip=True, separator='|').split('|')[:3])
Prints:
['Advanced Micro Devices', 'NASDAQ: AMD', '48,16']
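The reason the keyword matters: the second positional parameter of requests.get() is params, so a dictionary passed there becomes part of the query string and is never sent as a header. The sketch below shows this without sending anything over the network by preparing the requests only. The URL and the agent string are placeholders.

```python
import requests

url = "https://example.com/search"  # placeholder URL; no request is sent
ua = {"User-Agent": "my-agent"}

# This mirrors what requests.get(url, ua) builds: the dict is treated as
# `params` and ends up in the query string.
as_params = requests.Request("GET", url, params=ua).prepare()

# Passing the dict by keyword puts it in the request headers instead.
as_headers = requests.Request("GET", url, headers=ua).prepare()

print(as_params.url)
print(as_headers.url)
print(as_headers.headers["User-Agent"])
```

Assigning to requests.headers, as in the question, only sets an attribute on the module; the library does not read it.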
That's a dynamic page, so requesting the page source via requests is not going to give you the stock price. You will have to drive a real browser to do that. Try this instead:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--incognito")
chromedriver_path = './chromedriver'
# Written for Selenium 3; Selenium 4 replaces executable_path with a Service object
# and find_element_by_xpath with driver.find_element(By.XPATH, ...)
driver = webdriver.Chrome(executable_path=chromedriver_path, options=options)
driver.get("https://www.google.com/search?q=amd+stock&oq=amd+stock&aqs=chrome..69i57j35i39j0l5j69i60.1017j0j7&sourceid=chrome&ie=UTF-8")
time.sleep(2)  # crude wait for the page scripts to finish rendering
x = driver.find_element_by_xpath('//*[@id="knowledge-finance-wholepage__entity-summary"]/div/g-card-section/div/g-card-section/span[1]/span/span[1]')
print(x.text)
driver.quit()
Output:
48.16
I believe you need to add amd.response or amd.text
print(amd.response)
print(amd.text)