I have a problem scraping an e-commerce site using BeautifulSoup. I did some Googling but I still can't solve the problem.
Please refer to the pictures:
1. Chrome F12:
2. Result:
Here is the site that I tried to scrape: "https://shopee.com.my/search?keyword=h370m"
Problem:
When I open Inspect Element in Google Chrome (F12), I can see the HTML code for the product's name, price, etc. But when I run my Python program, I don't get the same code and tags in the result. After some Googling, I found out that this website uses AJAX queries to get the data.
Can anyone suggest the best method to get this product data by scraping an AJAX site? I would like to display the data in table form.
My code:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://shopee.com.my/search?keyword=h370m')
soup = BeautifulSoup(source.text, 'html.parser')
print(soup)
Welcome to StackOverflow! You can inspect where the ajax request is being sent to and replicate that.
In this case the request goes to this API URL. You can then use requests to perform a similar request. Notice, however, that this API endpoint requires a valid User-Agent header. You can use a package like fake-useragent or just hardcode a string for the agent.
import requests
# fake useragent
from fake_useragent import UserAgent
user_agent = UserAgent().chrome
# or hardcode
user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36'
url = 'https://shopee.com.my/api/v2/search_items/?by=relevancy&keyword=h370m&limit=50&newest=0&order=desc&page_type=search'
resp = requests.get(url, headers={
    'User-Agent': user_agent
})
data = resp.json()
products = data.get('items')
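Since the goal is to display the data in table form, here is a minimal sketch of tabulating the returned items. The nesting under `item_basic`, the field names `name` and `price`, and the price scale are assumptions about the JSON shape; inspect `resp.json()` yourself to confirm them. The hardcoded sample stands in for `data.get('items')`.

```python
# Minimal sketch: turning the API's item list into table rows.
# sample_items is a stand-in for data.get('items'); the 'item_basic'
# nesting, field names, and price scale are assumptions to verify.
sample_items = [
    {"item_basic": {"name": "H370M motherboard", "price": 35000000}},
    {"item_basic": {"name": "H370M Pro4", "price": 42000000}},
]

def to_rows(items):
    rows = []
    for item in items:
        info = item.get("item_basic", item)  # fall back if not nested
        # assumed scale: API reports price in 1/100000 of a currency unit
        rows.append((info.get("name"), info.get("price", 0) / 100000))
    return rows

for name, price in to_rows(sample_items):
    print(f"{name:<30} RM{price:.2f}")
```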
Welcome to StackOverflow! :)
As an alternative, you can check out Selenium.
See example usage from the documentation:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
When you use requests (or libraries like Scrapy), JavaScript is usually not executed. As @dmitrybelyakov mentioned, you can replay these API calls or imitate normal user interaction using Selenium.
Related
While scraping the following website (https://www.middletownk12.org/Page/4113), this code could not locate the table rows (to get the staff name, email & department) even though they are visible when I use the Chrome developer tools. The soup object does not contain the tr tags that hold the needed info.
import requests
from bs4 import BeautifulSoup
url = "https://www.middletownk12.org/Page/4113"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
print(response.text)
I used different libraries such as bs4, requests & selenium with no luck. I also tried CSS selectors & XPath with selenium, also with no luck; the tr elements could not be located.
That table of contact information is filled in by Javascript after the page has loaded. The content doesn't exist in the page's HTML and you won't see it using requests.
By using the developer tools available in the browser, we can examine the requests made after the page has loaded. There are a lot of them, but at least in my browser it's obvious the contact information is loaded near the end.
Looking at the request log, I see a request for a spreadsheet from docs.google.com:
If we examine that entry, we find that it's a request for:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0
And if we fetch the above link, we get a spreadsheet with the source data for that table.
Actually I used Selenium & then bs4 without any results. The code does not find the 'tr' elements...
Why are you using Selenium? The whole point to this answer is that you don't need to use Selenium if you can figure out the link to retrieve the data -- which we have.
All we need is requests to fetch the data and BeautifulSoup to parse it:
import requests
import bs4
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.findAll('a'):
    print(f"{link.text}: {link.get('href')}")
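Once the spreadsheet HTML is fetched, the tr rows can be broken into the staff fields the question asks for. A minimal sketch; the hardcoded sample stands in for `res.text`, and the column order (name, email, department) is an assumption to verify against the real sheet.

```python
import bs4

# sample_html is a stand-in for res.text from the spreadsheet request;
# the column order (name, email, department) is an assumption.
sample_html = """
<table>
  <tr><td>Jane Doe</td><td>jdoe@example.org</td><td>Science</td></tr>
  <tr><td>John Roe</td><td>jroe@example.org</td><td>Math</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")
staff = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 3:  # skip header/blank rows
        staff.append({"name": cells[0], "email": cells[1], "department": cells[2]})

for person in staff:
    print(person["name"], person["email"], person["department"])
```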
I am trying to learn web scraping in a Jupyter notebook with Python, but I'm getting the following error message:
AttributeError: 'NoneType' object has no attribute 'get_text'
What am I doing wrong?
# Connect to website and pull in data
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Funny-Data-Systems-Business-Analyst/dp/B07FNW9FGJ/ref=sr_1_3?dchild=1&keywords=data%2Banalyst%2Btshirt&qid=1626655184&sr=8-3&customId=B0752XJYNL&th=1'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.50"}
page = requests.get(URL, headers=headers)
soup1 = BeautifulSoup(page.content, "html.parser")
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
title = soup2.find(id='productTitle').get_text()
price = soup2.find(id='priceblock_ourprice').get_text()
print(title)
print(price)
There are some issues here. The error tells you that no element with that id could be found on the page. I went to the URL manually and searched for the price element by the id priceblock_ourprice and could not find it either; this can have several causes. Make sure to get the correct ids and classes by using inspect element.
The main issue is that the website you are scraping loads its data dynamically, so you have to render the JavaScript somehow before using bs4.
One way you can achieve this is by using Selenium and here you can see an example of solution for your problem:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'))
URL = 'https://www.amazon.com/Funny-Data-Systems-Business-Analyst/dp/B07FNW9FGJ/ref=sr_1_3?dchild=1&keywords=data%2Banalyst%2Btshirt&qid=1626655184&sr=8-3&customId=B0752XJYNL&th=1'
driver.get(URL)
# time.sleep(1)
soup = BeautifulSoup(driver.page_source, "html.parser")
title = soup.select_one("span[id='productTitle']").text
price = soup.select_one("span[class='a-offscreen']").text
print(title)
print(price)
driver.close()
Alternatively, you can use a third-party service such as WebScrapingAPI to achieve your goal. I recommend this service because it is beginner friendly and offers CSS extraction, along with more advanced features such as IP rotation, CAPTCHA solving, geolocation and sticky sessions. You can learn more about the service in our docs. We also offer special support for Amazon through our Amazon Search API, which is designed for problems like yours. This is an example of how your problem could be solved using our service:
import json
import requests
API_KEY = '<YOUR-API-KEY-HERE>'
SCRAPER_URL = 'https://ecom.webscrapingapi.com/v1'
PARAMS = {
    "api_key": API_KEY,
    "engine": "amazon",
    "type": "product",
    "product_id": "B09FQ35SW6"
}
response = requests.get(SCRAPER_URL, params=PARAMS)
parsed_result = json.loads(response.text)
title = parsed_result['product_results']['title']
price = parsed_result['product_results']['price']
print(title)
print(price)
That error is caused because soup2.find(id='priceblock_ourprice') returns None, and None has no get_text() method; BeautifulSoup couldn't find an element with id='priceblock_ourprice'.
When I go to the webpage you linked, I can't find an element with the ID priceblock_ourprice. However, this could easily be because I'm using a full browser and thus am requesting the webpage with a different User-Agent header than your script.
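Whatever the cause, the crash itself is avoidable: check what find() returns before calling get_text(). A small sketch with a hardcoded stand-in for page.content, where the price element is deliberately absent, as it was for me on the live page.

```python
from bs4 import BeautifulSoup

# Stand-in for page.content: the title exists, the price element does not.
html = '<span id="productTitle"> Funny T-Shirt </span>'
soup = BeautifulSoup(html, "html.parser")

# Guard against find() returning None instead of chaining .get_text()
title_tag = soup.find(id="productTitle")
title = title_tag.get_text(strip=True) if title_tag else None

price_tag = soup.find(id="priceblock_ourprice")
price = price_tag.get_text(strip=True) if price_tag else None

print(title)  # Funny T-Shirt
print(price)  # None, but no AttributeError
```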
I've created a script in Python to log in to a webpage using credentials and then parse a piece of information, SIGN OUT, from another link (the script is supposed to get redirected to that link) to make sure I did log in.
Website address
I've tried with:
import requests
from bs4 import BeautifulSoup
url = "https://member.angieslist.com/gateway/platform/v1/session/login"
link = "https://member.angieslist.com/"
payload = {"identifier":"username","token":"password"}
with requests.Session() as s:
    s.post(url, json=payload, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
        "Referer": "https://member.angieslist.com/member/login",
        "content-type": "application/json"
    })
    r = s.get(link, headers={"User-Agent": "Mozilla/5.0"}, allow_redirects=True)
    soup = BeautifulSoup(r.text, "lxml")
    login_stat = soup.select_one("button[class*='menu-item--account']").text
    print(login_stat)
When I run the above script, I get AttributeError: 'NoneType' object has no attribute 'text', which means I went wrong somewhere in the login process, as the information I wish to parse, SIGN OUT, is static content.
How can I parse this SIGN OUT information from that webpage?
This website requires JavaScript to work. Though you generate the login token correctly via the login API, when you go to the home page it makes multiple additional API calls and then updates the page.
So the issue has nothing to do with the login not working. You need to use something like Selenium for this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://member.angieslist.com/member/login")
driver.find_element_by_name("email").send_keys("none@getnada.com")
driver.find_element_by_name("password").send_keys("NUN#123456")
driver.find_element_by_id("login--login-button").click()
time.sleep(3)
soup = BeautifulSoup(driver.page_source, "lxml")
login_stat = soup.select("[id*='menu-item']")
for item in login_stat:
    print(item.text)
driver.quit()
I have mixed bs4 and Selenium here to make it easy for you, but you can use just Selenium as well if you want.
I tried to scrape the href link for the next page from an HTML tag using XPath with lxml. But the XPath returns an empty list, even though it seems to work fine when tested separately.
I've tried both a CSS selector and XPath; both return an empty list.
import sys
import time
import urllib.request
import random
from lxml import html
import lxml.html
import csv,os,json
import requests
from time import sleep
from lxml import etree
username = 'username'
password = 'password'
port = port
session_id = random.random()
super_proxy_url = ('http://%s-session-%s:%s@zproxy.lum-superproxy.io:%d' % (username, session_id, password, port))
proxy_handler = urllib.request.ProxyHandler({
    'http': super_proxy_url,
    'https': super_proxy_url,
})
opener = urllib.request.build_opener(proxy_handler)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')]
print('Performing request')
page = opener.open("https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588").read()
pageR = requests.get("https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588", headers={"User-Agent": "Mozilla/5.0"})
doc = html.fromstring(pageR.text)
root = lxml.html.fromstring(page)
links = root.cssselect('#pagnNextLink')
for link in links:
    print(link.attrib['href'])
linkRef = doc.xpath("//a[@id='pagnNextLink']/@href")
print(linkRef)
for post in linkRef:
    link = "https://www.amazon.com%s" % post
I've tried two approaches here and neither seems to work.
I'm using a proxy server for accessing the links and it seems to work, as the "doc" variable is populated with the HTML content. I've checked the links and I'm on the proper page to fetch this XPath/CSS selector.
Someone more experienced may give better advice on working with your set-up, so I will simply describe what I experienced:
When I used requests I sometimes got the link and sometimes not. When I didn't, the response indicated it was checking that I was not a bot and asked me to ensure my browser allows cookies.
With selenium I reliably got a result in my tests, though this may not be quick enough, or an option for you for other reasons.
from selenium import webdriver
d = webdriver.Chrome()
url = 'https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588'
d.get(url)
link = d.find_element_by_id('pagnNextLink').get_attribute('href')
print(link)
Selenium with proxy (Firefox):
Running Selenium Webdriver with a proxy in Python
Selenium with proxy (Chrome) - covered nicely here:
https://stackoverflow.com/a/11821751/6241235
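For reference, a configuration sketch (untested here) of the Chrome variant: Chrome accepts a --proxy-server switch, so unauthenticated proxies can be wired in through ChromeOptions. The host and port below are placeholders based on the question's set-up; username/password proxies like the one in your script need one of the approaches in the links above.

```python
from selenium import webdriver

# Sketch: route Chrome through an HTTP proxy via a command-line switch.
# Host and port are placeholders; this covers unauthenticated proxies only.
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://zproxy.lum-superproxy.io:22225')
driver = webdriver.Chrome(options=options)
```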
So I am trying to scrape the following webpage: https://www.scoreboard.com/uk/football/england/premier-league/, specifically the scheduled and finished results. I am looking for elements with class="stage-finished" or "stage-scheduled". However, when I scrape the page and print what page_soup contains, it doesn't include these elements.
I found another SO question with an answer saying that this is because the content is loaded via AJAX, and that I need to look at the XHR requests under the Network tab in Chrome dev tools to find the file that loads the necessary data. However, it doesn't seem to be there.
import bs4
import requests
from bs4 import BeautifulSoup as soup
import csv
import datetime
myurl = "https://www.scoreboard.com/uk/football/england/premier-league/"
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = requests.get(myurl, headers=headers)
page_soup = soup(page.content, "html.parser")
scheduled = page_soup.select(".stage-scheduled")
finished = page_soup.select(".stage-finished")
live = page_soup.select(".stage-live")
print(page_soup)
print(scheduled[0])
The above code throws an error of course as there is no content in the scheduled array.
My question is, how do I go about getting the data I'm looking for?
I copied the contents of the XHR files to a notepad and searched for stage-finished and other tags and found nothing. Am I missing something easy here?
The page is JavaScript rendered. You need Selenium. Here is some code to start on:
from selenium import webdriver
url = 'https://www.scoreboard.com/uk/football/england/premier-league/'
driver = webdriver.Chrome()
driver.get(url)
stages = driver.find_elements_by_class_name('stage-scheduled')
driver.close()
Or you could pass driver.page_source to the BeautifulSoup constructor, like this:
soup = BeautifulSoup(driver.page_source, 'html.parser')
Note:
You need to install a webdriver first. I installed chromedriver.
Good luck!