Web scraping Airbnb price data with Python

I have been trying to web scrape an Airbnb listing page to obtain the price, without much luck. I have successfully been able to bring in the other areas of interest (home description, home location, reviews, etc.). Below is what I've tried unsuccessfully. I suspect the issue is that the "price" on the page is a span class, whereas the others are div classes, but I'm speculating.
The URL I'm using is: https://www.airbnb.com/rooms/52361296?category_tag=Tag%3A8173&adults=4&children=0&infants=0&check_in=2022-12-11&check_out=2022-12-18&federated_search_id=6174a078-a823-4fad-827a-7ca652b5e786&source_impression_id=p3_1645454076_foOVSAshSYvdbpbS
This can be placed as the input in the below code.
Any assistance would be greatly appreciated.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from bs4 import BeautifulSoup
import requests
from IPython.display import IFrame
input_string = input("""Enter URLs for AirBnB sites that you want webscraped AND separate by a ',' : """)
airbnb_list = []
try:
    airbnb_list = input_string.split(",")
    x = 0
    y = len(airbnb_list)
    while y >= x:
        print(x + 1, '.) ', airbnb_list[x])
        x = x + 1
        if y == x:
            break
    #print(airbnb_list[len(airbnb_list)])
except:
    print("""Please separate list by a ','""")
a = pd.DataFrame([{"Title": '', "Stars": '', "Size": '', "Check In": '', "Check Out": '', "Rules": '',
                   "Location": '', "Home Type": '', "House Desc": ''}])
for x in range(len(airbnb_list)):
    url = airbnb_list[x]
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    stars = soup.find(class_='_c7v1se').get_text()
    desc = soup.find(class_='_12nksyy').get_text()
    size = soup.find(class_='_jro6t0').get_text()
    #checkIn = soup.find(class_='_1acx77b').get_text()
    checkIn = soup.find(class_='_12aeg4v').get_text()
    #checkOut = soup.find(class_='_14tl4ml5').get_text()
    checkOut = soup.find(class_='_12aeg4v').get_text()
    Rules = soup.find(class_='cihcm8w dir dir-ltr').get_text()
    #location = soup.find(class_='_9ns6hl').get_text()
    location = soup.find(class_='_152qbzi').get_text()
    HomeType = soup.find(class_='_b8stb0').get_text()
    title = soup.title.string

    print('Stars: ', stars)
    print('')
    #Home Type
    print('Home Type: ', HomeType)
    print('')
    #Space Description
    print('Description: ', desc)
    print('')
    print('Rental size: ', size)
    print('')
    #CheckIn
    print('Check In: ', checkIn)
    print('')
    #CheckOut
    print('Check Out: ', checkOut)
    print('')
    #House Rules
    print('House Rules: ', Rules)
    print('')
    #print(soup.find("button", {"id":"#Id name of the button"}))
    #Home Location
    print('Home location: ', location)
    #Dates available
    #print('Dates available: ', soup.find(class_='_1yhfti2').get_text())
    print('===================================================================================')

    df = pd.DataFrame([{"Title": title, "Stars": stars, "Size": size, "Check In": checkIn, "Check Out": checkOut, "Rules": Rules,
                        "Location": location, "Home Type": HomeType, "House Desc": desc}])
    a = a.append(df)

#Attempting to print the price tag on the website
print(soup.find_all('span', {'class': '_tyxjp1'}))
print(soup.find(class_='_tyxjp1').get_text())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-2d9689dbc836> in <module>
1 #print(soup.find_all('span', {'class': '_tyxjp1'}))
----> 2 print(soup.find(class_='_tyxjp1').get_text())
AttributeError: 'NoneType' object has no attribute 'get_text'

I see you are using the requests module to scrape airbnb.
That module is extremely versatile and works on websites that have static content.
However, it has one major drawback: it doesn't render content created by javascript.
This is a problem, as most of the websites these days create additional html elements using javascript once the user lands on the web page.
The airbnb price block is created exactly like that - using javascript.
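That is also why your traceback ends in AttributeError: 'NoneType' object has no attribute 'get_text': the HTML returned to requests simply has no element with class _tyxjp1, so find() returns None. A minimal guard, reusing your existing soup object, makes that failure explicit:
price_el = soup.find('span', class_='_tyxjp1')
if price_el is None:
    print('Price element not found in the static HTML - it is rendered by javascript')
else:
    print(price_el.get_text())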
There are many ways to scrape that kind of content.
My favourite way is to use selenium.
It's basically a library that allows you to launch a real browser and communicate with it using your programming language of choice.
Here's how you can easily use selenium.
First, set it up. Notice the headless option, which can be toggled on and off; toggle it off if you want to see how the browser loads the webpage.
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
# set this to False if you want to see how the chrome window loads airbnb - useful for debugging
options.headless = True
driver = webdriver.Chrome(options=options)
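A hedged side note: in newer Selenium 4 releases the headless attribute is deprecated; if that is the version you have installed, passing the Chrome flag directly is the usual equivalent:
options = Options()
options.add_argument('--headless=new')  # replacement for options.headless = True on recent Chrome/Selenium versions
driver = webdriver.Chrome(options=options)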
Then, navigate to the website
# navigate to airbnb
driver.get(url)
Next, wait until the price block loads. It might appear near-instantaneous to us, but depending on the speed of your internet connection it might take a few seconds.
# wait until the price block loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '._tyxjp1'))
price_element = WebDriverWait(driver, timeout).until(expectation)
And finally, print the price
# print the price
print(price_element.get_attribute('innerHTML'))
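If you also want the price as a number rather than a display string, a small post-processing sketch (assuming the element's innerHTML is plain text such as a currency symbol followed by the amount) is:
import re

price_text = price_element.get_attribute('innerHTML')
# strip the currency symbol and thousands separators, keep digits and the decimal point
price_value = float(re.sub(r'[^\d.]', '', price_text))
print(price_value)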
I added my code to your example so you could play around with it
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.common.by import By
input_string = input("""Enter URLs for AirBnB sites that you want webscraped AND separate by a ',' : """)
airbnb_list = []
try:
    airbnb_list = input_string.split(",")
    x = 0
    y = len(airbnb_list)
    while y >= x:
        print(x + 1, '.) ', airbnb_list[x])
        x = x + 1
        if y == x:
            break
    #print(airbnb_list[len(airbnb_list)])
except:
    print("""Please separate list by a ','""")

a = pd.DataFrame([{"Title": '', "Stars": '', "Size": '', "Check In": '', "Check Out": '', "Rules": '',
                   "Location": '', "Home Type": '', "House Desc": ''}])

# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
# set this to False if you want to see how the chrome window loads airbnb - useful for debugging
options.headless = True
driver = webdriver.Chrome(options=options)

for x in range(len(airbnb_list)):
    url = airbnb_list[x]
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    # navigate to airbnb
    driver.get(url)

    # wait until the price block loads
    timeout = 10
    expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '._tyxjp1'))
    price_element = WebDriverWait(driver, timeout).until(expectation)

    # print the price
    print(price_element.get_attribute('innerHTML'))
Keep in mind that your IP might eventually get banned for scraping AirBnb.
To work around that it is always a good idea to use proxy IPs and rotate them.
Follow this rotating proxies tutorial to avoid getting blocked.
Hope that helps!

Related

How To Scrape Content With Load More Pages Using Selenium Python

I need to scrape the titles of all blog post articles behind a Load More button, over my desired range for i in range(1,3):.
At present I'm only able to capture the titles from the first page, even though I'm able to navigate to the next page using Selenium.
Any help would be much appreciated.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time
# Selenium Routine
from requests_html import HTMLSession
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
# Removes SSL Issues With Chrome
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
options.add_argument('--ignore-certificate-errors-spki-list')
options.add_argument('log-level=3')
options.add_argument('--disable-notifications')
#options.add_argument('--headless') # Comment to view browser actions
# Get website url
urls = "https://jooble.org/blog/"
r = requests.get(urls)
driver = webdriver.Chrome(executable_path="C:\webdrivers\chromedriver.exe",options=options)
driver.get(urls)
productlist = []
for i in range(1, 3):
    # Get Page Information
    soup = BeautifulSoup(r.content, features='lxml')
    items = soup.find_all('div', class_='post')
    print(f'LOOP: start [{len(items)}]')

    for single_item in items:
        title = single_item.find('div', class_='front__news-title').text.strip()
        print('Title:', title)

        product = {
            'Title': title,
        }
        productlist.append(product)
        print()

    time.sleep(5)
    WebDriverWait(driver, 40).until(EC.element_to_be_clickable((By.XPATH, "//button[normalize-space()='Show more']"))).send_keys(Keys.ENTER)

driver.close()

# Save Results
df = pd.DataFrame(productlist)
df.to_csv('Results.csv', index=False)
You do not need the Selenium overhead in this case, because you can use requests directly to get the data via the site's API.
Check the network tab in your browser's dev tools while you click the button; you will see the URL that is requested to load more content. Iterate over it and set the parameter value &page={i}.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

data = []

for i in range(1, 3):
    url = f'https://jooble.org/blog/wp-admin/admin-ajax.php?id=&post_id=0&slug=home&canonical_url=https%3A%2F%2Fjooble.org%2Fblog%2F&posts_per_page=6&page={i}&offset=20&post_type=post&repeater=default&seo_start_page=1&preloaded=false&preloaded_amount=0&lang=en&order=DESC&orderby=date&action=alm_get_posts&query_type=standard'
    r = requests.get(url)

    if r.status_code != 200:
        print(f'Error occurred: {r.status_code} on url: {url}')
    else:
        soup = BeautifulSoup(str(r.json()['html']))

        for e in soup.select('.type-post'):
            data.append({
                'title': e.select_one('.front__news-title').get_text(strip=True),
                'description': e.select_one('.front__news-description').get_text(strip=True),
                'url': e.a.get('href')
            })

pd.DataFrame(data)
Output
  | title | description | url
0 | How To Become A Copywriter | If you have a flair for writing, you might consider leveraging your talents to earn some dough by working as a copywriter. The primary aim of a copywriter is to… | https://jooble.org/blog/how-to-become-a-copywriter/
1 | How to Find a Job in 48 Hours | A job search might sound scary for many people. However, it doesn't have to be challenging, long, or overwhelming. With Jooble, it is possible to find the best employment opportunities… | https://jooble.org/blog/how-to-find-a-job-in-48-hours/
2 | 17 Popular Jobs That Involve Working With Animals | If you are interested in caring for or helping animals, you can build a successful career in this field. The main thing is to find the right way. Working with… | https://jooble.org/blog/17-popular-jobs-that-involve-working-with-animals/
3 | How to Identify Phishing and Email Scam | What Phishing and Email Scam Are Cybercrime is prospering, and more and more internet users are afflicted daily. The best example of an online scam is the phishing approach -… | https://jooble.org/blog/how-to-identify-phishing-and-email-scam/
4 | What To Do After Getting Fired | For many people, thoughts of getting fired tend to be spine-chilling. No wonder, since it means your everyday life gets upside down in minutes. Who would like to go through… | https://jooble.org/blog/what-to-do-after-getting-fired/
5 | A mobile application for a job search in 69 countries has appeared | Jooble, a job search site in 69 countries, has launched the Jooble Job Search mobile app for iOS and Android. It will help the searcher view vacancies more conveniently and… | https://jooble.org/blog/a-mobile-application-for-a-job-search-in-69-countries-has-appeared/
...
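If you want the same Results.csv file your original script produced, the collected data list can be written with pandas in one line:
pd.DataFrame(data).to_csv('Results.csv', index=False)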

Adding an open close Google Chrome browser to Selenium linkedin_scraper code

I am trying to scrape some LinkedIn profiles of well known people. The code takes a bunch of LinkedIn profile URLS and then uses Selenium and scrape_linkedin to collect the information and save it into a folder as a .json file.
The problem I am running into is that LinkedIn naturally blocks the scraper from collecting some profiles. I am always able to get the first profile in the list of URLs. I put this down to the fact that it opens a new Google Chrome window and then goes to the LinkedIn page. (I could be wrong on this point however.)
What I would like to do is to add to the for loop a line which opens a new Google Chrome session and once the scraper has collected the data close the Google Chrome session such that on the next iteration in the loop it will open up a fresh new Google Chrome session.
From the package website here it states:
driver {selenium.webdriver}: driver type to use
default: selenium.webdriver.Chrome
Looking at the Selenium package website here I see:
driver = webdriver.Firefox()
...
driver.close()
So Selenium does have a close() option.
How can I add an open and close Google Chrome browser to the for loop?
I have tried alternative methods to try and collect the data such as changing the time.sleep() to 10 minutes, to changing the scroll_increment and scroll_pause but it still does not download the whole profile after the first one has been collected.
Code:
from datetime import datetime
from scrape_linkedin import ProfileScraper
import pandas as pd
import json
import os
import re
import time
my_profile_list = ['https://www.linkedin.com/in/williamhgates/', 'https://www.linkedin.com/in/christinelagarde/', 'https://www.linkedin.com/in/ursula-von-der-leyen/']
# To get LI_AT key
# Navigate to www.linkedin.com and log in
# Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
# Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
# Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
# Find and copy the li_at value
myLI_AT_Key = 'INSERT LI_AT Key'
with ProfileScraper(cookie=myLI_AT_Key, scroll_increment=50, scroll_pause=0.8) as scraper:
    for link in my_profile_list:
        print('Currently scraping: ', link, 'Time: ', datetime.now())
        profile = scraper.scrape(url=link)
        dataJSON = profile.to_dict()

        profileName = re.sub('https://www.linkedin.com/in/', '', link)
        profileName = profileName.replace("?originalSubdomain=es", "")
        profileName = profileName.replace("?originalSubdomain=pe", "")
        profileName = profileName.replace("?locale=en_US", "")
        profileName = profileName.replace("?locale=es_ES", "")
        profileName = profileName.replace("?originalSubdomain=uk", "")
        profileName = profileName.replace("/", "")

        with open(os.path.join(os.getcwd(), 'ScrapedLinkedInprofiles', profileName + '.json'), 'w') as json_file:
            json.dump(dataJSON, json_file)

        time.sleep(10)
print('The first observation scraped was:', my_profile_list[0:])
print('The last observation scraped was:', my_profile_list[-1:])
print('END')
Here is a way to open and close tabs/browser.
from datetime import datetime
from scrape_linkedin import ProfileScraper
import random #new import made
from selenium import webdriver #new import made
import pandas as pd
import json
import os
import re
import time
my_profile_list = ['https://www.linkedin.com/in/williamhgates/', 'https://www.linkedin.com/in/christinelagarde/',
'https://www.linkedin.com/in/ursula-von-der-leyen/']
myLI_AT_Key = 'INSERT LI_AT Key'
for link in my_profile_list:
    my_driver = webdriver.Chrome()  # if you don't have Chromedriver in the environment path then use the next line instead of this
    #my_driver = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe")

    # sending our driver as the driver to be used by scrape_linkedin
    # you can also create driver options and pass them as an argument
    ps = ProfileScraper(cookie=myLI_AT_Key, scroll_increment=random.randint(10, 50), scroll_pause=0.8 + random.uniform(0.8, 1), driver=my_driver)  # changed name and default driver; made scroll_pause and scroll_increment a little random

    print('Currently scraping: ', link, 'Time: ', datetime.now())
    profile = ps.scrape(url=link)  # changed name
    dataJSON = profile.to_dict()

    profileName = re.sub('https://www.linkedin.com/in/', '', link)
    profileName = profileName.replace("?originalSubdomain=es", "")
    profileName = profileName.replace("?originalSubdomain=pe", "")
    profileName = profileName.replace("?locale=en_US", "")
    profileName = profileName.replace("?locale=es_ES", "")
    profileName = profileName.replace("?originalSubdomain=uk", "")
    profileName = profileName.replace("/", "")

    with open(os.path.join(os.getcwd(), 'ScrapedLinkedInprofiles', profileName + '.json'), 'w') as json_file:
        json.dump(dataJSON, json_file)

    time.sleep(10 + random.randint(0, 5))  # added randomness to the sleep time

    # this will close your browser at the end of every iteration
    my_driver.quit()
print('The first observation scraped was:', my_profile_list[0:])
print('The last observation scraped was:', my_profile_list[-1:])
print('END')
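A further hardening step, sketched here: if scrape() raises an exception, for example on a profile LinkedIn blocks, the quit() at the end of the iteration never runs and the Chrome window stays open. Wrapping the loop body in try/finally guarantees the browser is closed either way:
for link in my_profile_list:
    my_driver = webdriver.Chrome()
    try:
        ps = ProfileScraper(cookie=myLI_AT_Key, driver=my_driver)
        profile = ps.scrape(url=link)
        # ... build profileName and dump the JSON exactly as in the loop above ...
    finally:
        # runs even if scrape() throws, so no Chrome window is left open
        my_driver.quit()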
This scraper uses Chrome as the browser by default, but it also gives you the freedom to choose which browser to use everywhere it applies, e.g. CompanyScraper, ProfileScraper, etc.
I have just changed the default arguments passed when initializing the ProfileScraper() class, so that it runs and closes your own driver rather than the default one, and added some randomness to the wait/sleep intervals, as you had requested. You can tweak the random noise I added to your comfort.
There is no need to use scrape_in_parallel() as I had suggested in my comments, but if you want to, you can define the number of browser instances (num_instances) you want to run along with your own dictionary of drivers, each with its own options (in another dictionary):
from scrape_linkedin import scrape_in_parallel, CompanyScraper
from selenium import webdriver
driver1 = webdriver.Chrome()
driver2 = webdriver.Chrome()
driver3 = webdriver.Chrome()
driver4 = webdriver.Chrome()
my_drivers = [driver1,driver2,driver3,driver4]
companies = ['facebook', 'google', 'amazon', 'microsoft', ...]
driver_dict = {}
for i in range(1, len(my_drivers) + 1):
    driver_dict[i] = my_drivers[i-1]

#Scrape all companies, output to 'companies.json' file, use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4,
    driver=driver_dict
)
It's open-source code, and since it's written solely in Python you can understand the source very easily. It's quite an interesting scraper; thank you for letting me know about it!
NOTE:
There are some concerning unresolved issues in this module, as noted in its GitHub Issues tab. If I were you, I would wait for a few more forks and updates if this doesn't work properly.

Why does BeautifulSoup give me the wrong text?

I've been trying to get the availability status of a product on IKEA's website. On IKEA's website, it says in Dutch: 'not available for delivery', 'only available in the shop', 'not in stock' and 'you've got 365 days of warranty'.
But my code gives me: 'not available for delivery', 'only available for order and pickup', 'checking inventory' and 'you've got 365 days of warranty'.
What am I doing wrong that causes the text not to be the same?
This is my code:
import requests
from bs4 import BeautifulSoup
# Get the url of the IKEA page and set up the bs4 stuff
url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
thepage = requests.get(url)
soup = BeautifulSoup(thepage.text, 'lxml')
# Locate the part where the availability stuff is
availabilitypanel = soup.find('div', {'class' : 'range-revamp-product-availability'})
# Get the text of the things inside of that panel
availabilitysectiontext = [part.getText() for part in availabilitypanel]
print(availabilitysectiontext)
With the help of Rajesh, I created a script that does exactly what I want. It checks a specific shop (the one located in Heerlen) for any out-of-stock item and sends you an email as soon as that item is back in stock.
The script used for this is:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import smtplib, ssl
# Fill in the url of the product
url = 'https://www.ikea.com/nl/nl/p/vittsjo-stellingkast-zwartbruin-glas-20213312/'
op = webdriver.ChromeOptions()
op.add_argument('headless')
driver = webdriver.Chrome(options=op, executable_path='/Users/Jem/Downloads/chromedriver')
# Stuff for sending the email
port = 465
password = 'password'
sender_email = 'email'
receiver_email = 'email'
message = """\
Subject: Product is back in stock!
Sent with Python. """
# Keep looping until back in stock
while True:
    driver.get(url)

    # Go to the location of the shop
    btn = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="onetrust-accept-btn-handler"]')))
    btn.click()
    location = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="content"]/div/div/div/div[2]/div[3]/div/div[5]/div[3]/div/span[1]/div/span/a')))
    location.click()
    differentlocation = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="range-modal-mount-node"]/div/div[3]/div/div[2]/div/div[1]/div[2]/a')))
    differentlocation.click()
    searchbar = driver.find_element_by_xpath('//*[@id="change-store-input"]')

    # In this part you can choose the location you want to check
    searchbar.send_keys('heerlen')
    heerlen = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="range-modal-mount-node"]/div/div[3]/div/div[2]/div/div[3]/div')))
    heerlen.click()
    selecteer = driver.find_element_by_xpath('//*[@id="range-modal-mount-node"]/div/div[3]/div/div[3]/button')
    selecteer.click()
    close = driver.find_element_by_xpath('//*[@id="range-modal-mount-node"]/div/div[3]/div/div[1]/button')
    close.click()

    # After you went to the right page, beautifulsoup it
    source = driver.page_source
    soup = BeautifulSoup(source, 'lxml')

    # Locate the part where the availability stuff is
    availabilitypanel = soup.find('div', {"class": "range-revamp-product-availability"})

    # Get the text of the things inside of that panel
    availabilitysectiontext = [part.getText() for part in availabilitypanel]

    # Check whether it is still out of stock, if so wait half an hour and continue
    if 'Niet op voorraad in Heerlen' in availabilitysectiontext:
        time.sleep(1800)
        continue
    # If not, send me an email that it is back in stock
    else:
        print('Email is being sent...')
        context = ssl.create_default_context()
        with smtplib.SMTP_SSL('smtp.gmail.com', port, context=context) as server:
            server.login(sender_email, password)
            server.sendmail(sender_email, receiver_email, message)
        break
The page markup is added by JavaScript after the initial server response. BeautifulSoup only sees that initial response and doesn't execute JavaScript to get the complete page. If you want to run the JavaScript, you'll need to use a headless browser; otherwise, you'll have to disassemble the JavaScript and see what it does.
You could get this to work with Selenium. I modified your code a bit and got it to work.
Get Selenium:
pip3 install selenium
Download Firefox + geckodriver or Chrome + chromedriver:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
# Get the url of the IKEA page and set up the bs4 stuff
url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
#uncomment the following line if using firefox + geckodriver
#driver = webdriver.Firefox(executable_path='/Users/ralwar/Downloads/geckodriver') # Downloaded from https://github.com/mozilla/geckodriver/releases
# using chrome + chromedriver
op = webdriver.ChromeOptions()
op.add_argument('headless')
driver = webdriver.Chrome(options=op, executable_path='/Users/ralwar/Downloads/chromedriver') # Downloaded from https://chromedriver.chromium.org/downloads
driver.get(url)
time.sleep(5) #adding delay to finish loading the page + javascript completely, you can adjust this
source = driver.page_source
soup = BeautifulSoup(source, 'lxml')
# Locate the part where the availability stuff is
availabilitypanel = soup.find('div', {"class" : "range-revamp-product-availability"})
# Get the text of the things inside of that panel
availabilitysectiontext = [part.getText() for part in availabilitypanel]
print(availabilitysectiontext)
The above code prints:
['Niet beschikbaar voor levering', 'Alleen beschikbaar in de winkel', 'Niet op voorraad in Amersfoort', 'Je hebt 365 dagen om van gedachten te veranderen. ']
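As a side note, the fixed time.sleep(5) works but wastes time when the page loads faster and can fail when it loads slower. A hedged alternative is to wait explicitly for the availability panel, reusing the same class name the code already scrapes:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 20 seconds for the availability panel to appear, then continue immediately
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'range-revamp-product-availability'))
)
source = driver.page_source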

how to scrape data from shopee using beautiful soup

I'm currently a student and I recently studied BeautifulSoup, so my lecturer asked me to scrape data from Shopee. However, I cannot scrape the details of the products. Currently, I'm trying to scrape data from https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I only want to scrape the name and price of the products. Can someone tell me why I cannot scrape the data using BeautifulSoup?
Here is my code:
from requests import get
from bs4 import BeautifulSoup
url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
This question is a bit tricky (for Python beginners) because it involves a combination of Selenium (for headless browsing) and BeautifulSoup (for HTML data extraction). Moreover, the problem becomes difficult because the Document Object Model (DOM) is built by JavaScript. We know JavaScript is involved because we get an empty result from the website when it is accessed with BeautifulSoup alone, for example with:
for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
    print(item_n.get_text())
Therefore, to extract data from such a webpage, whose DOM is controlled by a scripting language, we have to use Selenium for headless browsing (this tells the website that a browser is accessing it). We also have to use some sort of delay, which tells the website that it is being accessed by a human. For this, the function WebDriverWait() from the Selenium library will help.
I now present snippets of code that explain the process.
First, import the requisite libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
Next, initialize the settings for the headless browser. I'm using chrome.
# create object for chrome options
chrome_options = Options()
base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'
# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})

# invoke the webdriver
browser = webdriver.Chrome(executable_path=r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                           options=chrome_options)
browser.get(base_url)
delay = 5  # seconds
Next, I declare empty list variables to hold the data.
# declare empty lists
item_cost, item_init_cost, item_loc = [],[],[]
item_name, items_sold, discount_percent = [], [], []
while True:
    try:
        WebDriverWait(browser, delay)
        print("Page is ready")
        sleep(5)
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        #print(html)
        soup = BeautifulSoup(html, "html.parser")

        # find_all() returns an array of elements.
        # We have to go through all of them, select the one we need, and then call get_text()
        for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
            print(item_n.get_text())
            item_name.append(item_n.text)

        # find the price of items
        for item_c in soup.find_all('span', class_='_341bF0'):
            print(item_c.get_text())
            item_cost.append(item_c.text)

        # find initial item cost
        for item_ic in soup.find_all('div', class_='_1w9jLI QbH7Ig U90Nhh'):
            print(item_ic.get_text())
            item_init_cost.append(item_ic.text)

        # find total number of items sold/month
        for items_s in soup.find_all('div', class_='_18SLBt'):
            print(items_s.get_text())
            items_sold.append(items_s.text)

        # find item discount percent
        for dp in soup.find_all('span', class_='percent'):
            print(dp.get_text())
            discount_percent.append(dp.text)

        # find item location
        for il in soup.find_all('div', class_='_3amru2'):
            print(il.get_text())
            item_loc.append(il.text)

        break  # it will break from the loop once the specific element is present
    except TimeoutException:
        print("Loading took too much time! - Try again")
Thereafter, I use the zip function to combine the different list items.
rows = zip(item_name, item_init_cost,discount_percent,item_cost,items_sold,item_loc)
Finally, I write this data to disk:
import csv

newFilePath = 'shopee_item_list.csv'
with open(newFilePath, "w") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
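An equivalent, slightly shorter option (a sketch reusing the rows zip object from above; the column names are just illustrative labels) is to let pandas write the file:
import pandas as pd

df = pd.DataFrame(rows, columns=['name', 'initial_cost', 'discount_percent', 'cost', 'sold_per_month', 'location'])
df.to_csv('shopee_item_list.csv', index=False)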
As good practice, it's wise to close the headless browser once the task is complete, so I code it as:
# close the automated browser
browser.close()
Result
Nestle MILO Activ-Go Chocolate Malt Powder (2kg)
NESCAFE GOLD Refill (170g)
Nestle MILO Activ-Go Chocolate Malt Powder (1kg)
MAGGI Hot Cup - Asam Asam Laksa (60g)
MAGGI 2-Minit Curry (79g x 5 Packs x 2)
MAGGI PAZZTA Cheese Macaroni 70g
.......
29.90
21.90
16.48
1.69
8.50
3.15
5.90
.......
RM40.70
RM26.76
RM21.40
RM1.80
RM9.62
........
9k sold/month
2.3k sold/month
1.8k sold/month
1.7k sold/month
.................
27%
18%
23%
6%
.............
Selangor
Selangor
Selangor
Selangor
Note to the readers
The OP brought to my attention that the xpath was not working as given in my answer. I checked the website again after 2 days and noticed a strange phenomenon: the class_ attribute of the div had indeed changed. I found a similar question, but it did not help much. So for now, I'm concluding that the div attributes on the Shopee website can change again. I leave this as an open problem to solve later.
Note to the OP
Ana, the above code will work for just one page, i.e., only for the webpage https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I invite you to further enhance your skills by working out how to scrape data for multiple webpages under the sales tag. Your hint is the 1/9 seen at the top right of this page and/or the 1 2 3 4 5 links at the bottom of the page. Another hint for you is to look at urljoin in the urllib.parse module. Hope this gets you started.
Helpful resources
XPATH tutorial
The page content is loaded after the first request by async AJAX calls, so sending one request and reading the raw page source will not get you what you want.
You should simulate a browser; then you can get the rendered source and use BeautifulSoup. See the code:
BeautifulSoup way
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # create the browser driver (Chrome assumed here; any Selenium webdriver works)
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
search = soup.select_one('.shop-search-result-view')
products = search.find_all('a')

for p in products:
    name = p.select('div[data-sqe="name"] > div')[0].get_text()
    price = p.select('div > div:nth-child(2) > div:nth-child(2)')[0].get_text()
    product = p.select('div > div:nth-child(2) > div:nth-child(4)')[0].get_text()
    print('name: ' + name)
    print('price: ' + price)
    print('product: ' + product + '\n')
However, using selenium is a good approach to get everything you want. See the example below:
Selenium Way
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # create the browser driver (Chrome assumed here; any Selenium webdriver works)
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

search = driver.find_element_by_css_selector('.shop-search-result-view')
products = search.find_elements_by_css_selector('a')

for p in products:
    name = p.find_element_by_css_selector('div[data-sqe="name"] > div').text
    price = p.find_element_by_css_selector('div > div:nth-child(2) > div:nth-child(2)').text
    product = p.find_element_by_css_selector('div > div:nth-child(2) > div:nth-child(4)').text
    print('name: ' + name)
    print('price: ' + price.replace('\n', ' | '))
    print('product: ' + product + '\n')
Please post your code so we can help, or you can start like this :)
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReg
my_url = "<url>"
uClient = uReg(my_url)
page_html = uClient.read()

How to get all the data from a webpage manipulating lazy-loading method?

I've written a script in Python using Selenium to scrape the name and price of different products from the Redmart website. My scraper clicks a link, goes to its target page and parses data from there. However, the issue I'm facing with this crawler is that it scrapes very few items from a page because of the webpage's lazy-loading behaviour. How can I get all the data from each page by controlling the lazy-loading process? I tried the "execute script" method but did it wrongly. Here is the script I'm trying with:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://redmart.com/bakery")
wait = WebDriverWait(driver, 10)
counter = 0

while True:
    try:
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li.image-facets-pill")))
        driver.find_elements_by_css_selector('img.image-facets-pill-image')[counter].click()
        counter += 1
    except IndexError:
        break

    # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    for elems in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.productPreview"))):
        name = elems.find_element_by_css_selector('h4[title] a').text
        price = elems.find_element_by_css_selector('span[class^="ProductPrice__"]').text
        print(name, price)

    driver.back()

driver.quit()
I guess you could use Selenium for this, but if speed is your concern, and since @Andersson already crafted the Selenium code for you in another question on Stack Overflow, you should instead replicate the API calls that the site uses and extract the data from the JSON, like the site itself does.
If you use the Chrome Inspector you'll see that, for each of the categories in your outer while-loop (the try-block in your original code), the site calls an API that returns the overall categories of the site. All this data can be retrieved like so:
import requests

categories_api = 'https://api.redmart.com/v1.5.8/catalog/search?extent=0&depth=1'
r = requests.get(categories_api).json()
For the next API calls you need to grab the uris concerning the bakery stuff. This can be done like so:
bakery_item = [e for e in r['categories'] if e['title'] == 'Bakery']
children = bakery_item[0]['children']
uris = [c['uri'] for c in children]
Uris will now be a list of strings (['bakery-bread', 'breakfast-treats-212', 'sliced-bread-212', 'wraps-pita-indian-breads', 'rolls-buns-212', 'baked-goods-desserts', 'loaves-artisanal-breads-212', 'frozen-part-bake', 'long-life-bread-toast', 'speciality-212']) that you'll pass on to another API found by Chrome Inspector, and that the site uses to load content.
This API has the following form (default returns a smaller pageSize but I bumped it to 500 to be somewhat sure you get all data in one request):
items_API = 'https://api.redmart.com/v1.5.8/catalog/search?pageSize=500&sort=1024&category={}'
for uri in uris:
    r = requests.get(items_API.format(uri)).json()
    products = r['products']
    for product in products:
        name = product['title']
        # testing for promo_price - if it's 0.0 go with the normal price
        price = product['pricing']['promo_price']
        if price == 0.0:
            price = product['pricing']['price']
        print("Name: {}. Price: {}".format(name, price))
Edit: If you still want to stick with Selenium, you could insert something like this to handle the lazy loading. Questions on scrolling have been answered several times before, so yours is actually a duplicate. In the future you should show what you tried (your own effort on the execute_script part) and include the traceback.
import time

check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = driver.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height
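In the original script, a natural place for that scroll loop is right after clicking a category image and before collecting the li.productPreview elements, so all lazily loaded products are in the DOM when the prices are read. A sketch of that placement, reusing the variables from the question, looks like this:
# inside the while True loop of the original script, after the category click
driver.find_elements_by_css_selector('img.image-facets-pill-image')[counter].click()
counter += 1

# scroll until the page height stops growing, so all lazy-loaded products are rendered
check_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = driver.execute_script("return document.body.scrollHeight;")
    if height == check_height:
        break
    check_height = height

# now every product preview should be present
for elems in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.productPreview"))):
    name = elems.find_element_by_css_selector('h4[title] a').text
    price = elems.find_element_by_css_selector('span[class^="ProductPrice__"]').text
    print(name, price)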
