I was wondering how to web scrape the speed value from the Fast.com website with Python.
I made an attempt; here is what I've done so far:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://fast.com/', headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.12"})
soup = BeautifulSoup(response.text, 'lxml')
speed = soup.find('span', {'id' : 'speed-value'}).text
print(speed)
The output is always "0", and sometimes it gives me an error.
My goal is to get the speed number in MB/s, as shown on the website after the scan finishes.
What have I forgotten to do?
BeautifulSoup is better suited to static pages, in my personal experience. I would recommend Selenium for more dynamic usage: it gives you access to the page after JavaScript and the like have loaded, which makes scraping easier.
import time
from selenium import webdriver

driver_path = r"C:\chromedriver.exe"
driver = webdriver.Chrome(driver_path)

MBPS_CLASS = "speed-results-container"

driver.get("https://fast.com/")

while True:
    print(driver.find_elements_by_class_name(MBPS_CLASS)[0].text)
    # driver.find_element_by_id("speed-value").text  # This works with the ID as well
    time.sleep(1)  # avoid hammering the DOM on every iteration
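If you only want the final number rather than an endless loop, one option is to poll until the reading stops changing. A minimal sketch (the speed-value id is taken from the question's code, and treating a repeated non-zero reading as "finished" is a heuristic, not fast.com's official completion signal):

import time
from selenium import webdriver

driver = webdriver.Chrome(r"C:\chromedriver.exe")
driver.get("https://fast.com/")

last_reading = None
final = None
for _ in range(60):  # poll for up to ~3 minutes
    # id taken from the question's code; fast.com's markup may change
    reading = driver.find_element_by_id("speed-value").text
    if reading == last_reading and reading not in ("", "0"):
        final = reading
        break  # same non-zero value twice in a row: assume the test finished
    last_reading = reading
    time.sleep(3)

print("Final speed:", final)
driver.quit()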
I am having difficulties scraping the "product name" and "price" from this website: https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571
Looking to scrape "$4.30" and "Zespri New Zealand Kiwifruit - Green" from the webpage. I have tried various approaches (Beautiful Soup, requests_html, Selenium) without any success. Attached are the sample code approaches I have taken.
I am able to view the 'price' and 'product name' details in the "Developer Tools" tab of Chrome. It seems that the webpage uses JavaScript to dynamically load the product information, so the approaches mentioned above are not able to scrape the information properly.
Appreciate any assistance on this issue.
Requests_html Approach:
from requests_html import HTMLSession
import json
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
session = HTMLSession()
r = session.get(url)
r.html.render(timeout=20)
json_text=r.html.xpath("//script[@type='application/ld+json']/text()")[0][:-1]
json_data = json.loads(json_text)
print(json_data['name']['price'])
Beautiful Soup Approach:
import sys
import time
from bs4 import BeautifulSoup
import requests
import re
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69'}
page=requests.get(url, headers=headers)
time.sleep(2)
soup=BeautifulSoup(page.text,'html.parser')
linkitem=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)
linkprice=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 sc-13n2dsm-5 kxEbZl deQJPo'})
print(linkprice)
Selenium Approach:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
linkitem = soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)
That approach of yours with the embedded JSON needs some refinement; in other words, you're almost there. Also, this can be done with pure requests and bs4.
PS: I'm using different URLs, as the one you gave returns a 404.
Here's how:
import json

import requests
from bs4 import BeautifulSoup

urls = [
    "https://www.fairprice.com.sg/product/11798142",
    "https://www.fairprice.com.sg/product/vina-maipo-cabernet-sauvignon-merlot-750ml-11690254",
    "https://www.fairprice.com.sg/product/new-moon-new-zealand-abalone-425g-75342",
]

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
}

for url in urls:
    product_data = json.loads(
        BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
        .find("script", type="application/ld+json")
        .string[:-1]
    )
    print(product_data["name"])
    print(product_data["offers"]["price"])
This should output:
Nongshim Instant Cup Noodle - Spicy
1.35
Vina Maipo Red Wine - Cabernet Sauvignon Merlot
14.95
New Moon New Zealand Abalone
33.8
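The reason this works: FairPrice embeds the product data as schema.org structured data in a script type="application/ld+json" tag, which is present in the initial HTML response even though the visible page is rendered with JavaScript. The [:-1] slice apparently drops a trailing character that would otherwise trip up json.loads.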
Following this article, I created my first web scraper with Python. My intention is to scrape Google Shopping, looking for product prices. The script works, but I want to search for more than one product when I run the script.
So, I'm looping over a list of products like this:
from time import sleep
from random import randint

import requests
from bs4 import BeautifulSoup
# from dataProducts import products

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

stores = ["Submarino", "Casas Bahia", "Extra.com.br", "Americanas.com",
          "Pontofrio.com", "Shoptime", "Magazine Luiza", "Amazon.com.br - Retail", "Girafa"]

products = [
    {
        "name": "Console Playstation 5",
        "lowestPrice": 4000.0,
        "highestPrice": 4400.0
    },
    {
        "name": "Controle Xbox Robot",
        "lowestPrice": 320.0,
        "highestPrice": 375.0
    }
]

for product in products:
    params = {"q": product["name"], 'tbm': 'shop'}
    response = requests.get("https://www.google.com/search",
                            params=params,
                            headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')

    # Normal results
    for shopping_result in soup.select('.sh-dgr__content'):
        product = shopping_result.select_one('.Lq5OHe.eaGTj h4').text
        price = shopping_result.select_one('span.kHxwFf span.a8Pemb').text
        store = shopping_result.select_one('.IuHnof').text
        link = f"https://www.google.com{shopping_result.select_one('.Lq5OHe.eaGTj')['href']}"
        if store in stores:
            print(product)
            print(price)
            print(store)
            print(link)
            print()

    print()
    print('####################################################################################################################################################')
When I run the script, it doesn't bring back all the data, and sometimes it doesn't bring any data from the first search at all; it just shows the prints from the second iteration. I tried to put a sleep of 10 seconds after the soup line and after the last line of the loop, and nothing changed.
I don't understand why my script can't get all the results for the given products. Can anyone give me a little help?
To start off, I would recommend Selenium; requests will most times not bring back the data. Second, if you are trying to get stock alerts for PS5s or Xboxes, I would scrape a retailer's website rather than Google. You will need to install Chrome and ChromeDriver. Link: https://chromedriver.chromium.org/downloads Below is how to use Selenium!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent  # this import was missing

ua = UserAgent()
options = Options()
options.add_argument("user-agent=" + ua.random)  # Chrome's argument name is "user-agent"
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_experimental_option("excludeSwitches", ["enable-logging"])
browser = webdriver.Chrome("chromedriver location", options=options)
browser.get("https://google.com")
html = browser.page_source
To get it set up, you need to run:
pip install selenium
pip install fake_useragent
Then, using html, you can use BS4 to scrape the website, as sketched below.
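For example, a minimal sketch of the BS4 step (the CSS selectors are copied from the question's code; Google changes them frequently, so treat them as fragile):

from bs4 import BeautifulSoup

# Parse the JavaScript-rendered page source that Selenium returned
soup = BeautifulSoup(html, 'lxml')

# Selectors copied from the question's code; they may be stale
for result in soup.select('.sh-dgr__content'):
    name = result.select_one('.Lq5OHe.eaGTj h4').text
    price = result.select_one('span.kHxwFf span.a8Pemb').text
    print(name, price)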
I have a problem scraping an e-commerce site using BeautifulSoup. I did some Googling but I still can't solve the problem.
Here is the site that I tried to scrape: "https://shopee.com.my/search?keyword=h370m"
Problem:
When I open Inspect Element in Google Chrome (F12), I can see the markup for the product's name, price, etc. But when I run my Python program, I do not get the same code and tags in the result. After some googling, I found out that this website uses AJAX queries to get the data.
Can anyone help me find the best method to get these products' data by scraping an AJAX site? I would like to display the data in table form.
My code:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://shopee.com.my/search?keyword=h370m')
soup = BeautifulSoup(source.text, 'html.parser')
print(soup)
Welcome to StackOverflow! You can inspect where the AJAX request is being sent and replicate it.
In this case the request goes to this API URL. You can then use requests to perform a similar request. Notice, however, that this API endpoint requires a correct User-Agent header. You can use a package like fake-useragent or just hardcode a string for the agent.
import requests
# fake useragent
from fake_useragent import UserAgent
user_agent = UserAgent().chrome
# or hardcode
user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36'
url = 'https://shopee.com.my/api/v2/search_items/?by=relevancy&keyword=h370m&limit=50&newest=0&order=desc&page_type=search'
resp = requests.get(url, headers={
    'User-Agent': user_agent
})
data = resp.json()
products = data.get('items')
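From there you can pick out the fields you need. For example, a rough sketch (the "name" and "price" keys are assumptions about the shape of Shopee's JSON items and may differ):

# Sketch: print a name/price table from the JSON items.
# "name" and "price" are assumed field names in each item dict.
for item in products or []:
    print(item.get('name'), item.get('price'))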
Welcome to StackOverflow! :)
As an alternative, you can check Selenium
See example usage from documentation:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
When you use requests (or libraries like Scrapy), JavaScript is usually not loaded. As @dmitrybelyakov mentioned, you can replay these calls or imitate normal user interaction using Selenium.
So I am trying to scrape the following webpage, https://www.scoreboard.com/uk/football/england/premier-league/,
specifically the scheduled and finished results. Thus I am looking for the elements with class = "stage-finished" or "stage-scheduled". However, when I scrape the webpage and print out what page_soup contains, it doesn't contain these elements.
I found another SO question with an answer saying that this is because the content is loaded via AJAX, and that I need to look at the XHR entries under the Network tab in Chrome dev tools to find the file that loads the necessary data. However, it doesn't seem to be there.
import bs4
import requests
from bs4 import BeautifulSoup as soup
import csv
import datetime
myurl = "https://www.scoreboard.com/uk/football/england/premier-league/"
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = requests.get(myurl, headers=headers)
page_soup = soup(page.content, "html.parser")
scheduled = page_soup.select(".stage-scheduled")
finished = page_soup.select(".stage-finished")
live = page_soup.select(".stage-live")
print(page_soup)
print(scheduled[0])
The above code throws an error of course as there is no content in the scheduled array.
My question is, how do I go about getting the data I'm looking for?
I copied the contents of the XHR files to a notepad and searched for stage-finished and other tags and found nothing. Am I missing something easy here?
The page is JavaScript rendered. You need Selenium. Here is some code to start on:
from selenium import webdriver
url = 'https://www.scoreboard.com/uk/football/england/premier-league/'
driver = webdriver.Chrome()
driver.get(url)
stages = driver.find_elements_by_class_name('stage-scheduled')
driver.close()
Or you could pass driver.page_source into the BeautifulSoup constructor, like this:
soup = BeautifulSoup(driver.page_source, 'html.parser')
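Putting the two together, a rough sketch (assuming chromedriver is installed; the explicit wait matters because the scores are rendered by JavaScript after the initial page load):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://www.scoreboard.com/uk/football/england/premier-league/')

# Wait up to 10 seconds for the JavaScript-rendered results to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'stage-finished'))
)

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(len(soup.select('.stage-finished')))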
Note:
You need to install a webdriver first. I installed chromedriver.
Good luck!
I'm very much a noob at Python and scraping. I understand the basics but just cannot get past this problem.
I'm trying to scrape content from www.tweakers.net using Python with the requests and BeautifulSoup libraries. However, when I scrape, I keep getting the cookie statement instead of the actual site content. I hope there is someone who can help me with the code. I ran into similar issues on other websites, so I would really like to understand how I can tackle such an issue. This is what I have now:
import time

from bs4 import BeautifulSoup
import requests
from requests.cookies import cookiejar_from_dict

last_agreed_time = str(int(time.time() * 1000))

url = 'https://www.tweakers.net'  # requests needs the scheme in the URL

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'}
    # Cookie values must be strings, and the entries comma-separated
    session.cookies = cookiejar_from_dict({
        'wt3_sid': '%3B318816705845986',
        'wt_cdbeid': '68907f896d9f37509a2f4b0a9495f272',
        'wt_feid': '2f59b5d845403ada14b462a2c1d0b967',
        'wt_fweid': '473bb8c305b0b42f5202e14a',
    })
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup.prettify())
Do not mind the content of the header; I ripped it from somewhere else.
Two of the best libraries for scraping would be selenium or cookielib. Here is a link to selenium, http://selenium-python.readthedocs.io/api.html, and to cookielib, https://docs.python.org/2/library/cookielib.html.
## added selenium code
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.tweakers.net'

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.set_window_size(1120, 550)
driver.get(url)  # visit the domain first: Selenium only accepts cookies for the current domain

# Add the needed cookies; add_cookie() takes one dict per cookie,
# with explicit 'name' and 'value' keys
for name, value in {
    'wt3_sid': '%3B318816705845986',
    'wt_cdbeid': '68907f896d9f37509a2f4b0a9495f272',
    'wt_feid': '2f59b5d845403ada14b462a2c1d0b967',
    'wt_fweid': '473bb8c305b0b42f5202e14a',
}.items():
    driver.add_cookie({'name': name, 'value': value})

# This would be how to retrieve a cookie by name
print(driver.get_cookie('wt3_sid'))

driver.get(url)  # reload so the cookies take effect
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.prettify())