Why does BeautifulSoup give me the wrong text? - python

I've been trying to get the availability status of a product on IKEA's website. The page says, in Dutch: 'not available for delivery', 'only available in the shop', 'not in stock' and 'you've got 365 days of warranty'.
But my code gives me: 'not available for delivery', 'only available for order and pickup', 'checking inventory' and 'you've got 365 days of warranty'.
What am I doing wrong that causes the text not to match?
This is my code:
import requests
from bs4 import BeautifulSoup
# Get the url of the IKEA page and set up the bs4 stuff
url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
thepage = requests.get(url)
soup = BeautifulSoup(thepage.text, 'lxml')
# Locate the part where the availability stuff is
availabilitypanel = soup.find('div', {'class' : 'range-revamp-product-availability'})
# Get the text of the things inside of that panel
availabilitysectiontext = [part.getText() for part in availabilitypanel]
print(availabilitysectiontext)

With the help of Rajesh, I created a script that does exactly what I want. It selects a specific shop (the one located in Heerlen), checks an out-of-stock item, and sends me an email as soon as it is back in stock.
The script used for this is:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import smtplib, ssl

# Fill in the url of the product
url = 'https://www.ikea.com/nl/nl/p/vittsjo-stellingkast-zwartbruin-glas-20213312/'
op = webdriver.ChromeOptions()
op.add_argument('headless')
driver = webdriver.Chrome(options=op, executable_path='/Users/Jem/Downloads/chromedriver')

# Stuff for sending the email
port = 465
password = 'password'
sender_email = 'email'
receiver_email = 'email'
message = """\
Subject: Product is back in stock!

Sent with Python. """

# Keep looping until back in stock
while True:
    driver.get(url)
    # Accept cookies and go to the location of the shop
    btn = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="onetrust-accept-btn-handler"]')))
    btn.click()
    location = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="content"]/div/div/div/div[2]/div[3]/div/div[5]/div[3]/div/span[1]/div/span/a')))
    location.click()
    differentlocation = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="range-modal-mount-node"]/div/div[3]/div/div[2]/div/div[1]/div[2]/a')))
    differentlocation.click()
    searchbar = driver.find_element_by_xpath('//*[@id="change-store-input"]')
    # In this part you can choose the location you want to check
    searchbar.send_keys('heerlen')
    heerlen = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="range-modal-mount-node"]/div/div[3]/div/div[2]/div/div[3]/div')))
    heerlen.click()
    selecteer = driver.find_element_by_xpath('//*[@id="range-modal-mount-node"]/div/div[3]/div/div[3]/button')
    selecteer.click()
    close = driver.find_element_by_xpath('//*[@id="range-modal-mount-node"]/div/div[3]/div/div[1]/button')
    close.click()

    # After you are on the right page, parse it with BeautifulSoup
    source = driver.page_source
    soup = BeautifulSoup(source, 'lxml')

    # Locate the part where the availability stuff is
    availabilitypanel = soup.find('div', {"class": "range-revamp-product-availability"})

    # Get the text of the things inside of that panel
    availabilitysectiontext = [part.getText() for part in availabilitypanel]

    # Check whether it is still out of stock; if so, wait half an hour and continue
    if 'Niet op voorraad in Heerlen' in availabilitysectiontext:
        time.sleep(1800)
        continue
    # If not, send me an email that it is back in stock
    else:
        print('Email is being sent...')
        context = ssl.create_default_context()
        with smtplib.SMTP_SSL('smtp.gmail.com', port, context=context) as server:
            server.login(sender_email, password)
            server.sendmail(sender_email, receiver_email, message)
        break

The page markup is added by JavaScript after the initial server response. BeautifulSoup only sees that initial response and doesn't execute JavaScript, so it never gets the complete page. If you want the JavaScript to run, you'll need a headless browser. Otherwise, you'll have to disassemble the JavaScript and see which requests it makes.
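As a hedged sketch of that second route (not part of the original answer): if the browser's Network tab shows the availability being fetched from a JSON endpoint, you could call that endpoint directly with requests. The URL below is a placeholder, not a real IKEA API:
import requests

# Placeholder endpoint; replace with the XHR/fetch URL you find in the browser's Network tab.
AVAILABILITY_ENDPOINT = 'https://www.example.com/availability/20336841'

resp = requests.get(AVAILABILITY_ENDPOINT, headers={'Accept': 'application/json'}, timeout=10)
resp.raise_for_status()

# The structure of the JSON depends entirely on the real endpoint.
print(resp.json())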
You could get this to work with Selenium. I modified your code a bit and got it to work.
Get Selenium:
pip3 install selenium
Download Firefox + geckodriver or Chrome + chromedriver:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
# Get the url of the IKEA page and set up the bs4 stuff
url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
#uncomment the following line if using firefox + geckodriver
#driver = webdriver.Firefox(executable_path='/Users/ralwar/Downloads/geckodriver') # Downloaded from https://github.com/mozilla/geckodriver/releases
# using chrome + chromedriver
op = webdriver.ChromeOptions()
op.add_argument('headless')
driver = webdriver.Chrome(options=op, executable_path='/Users/ralwar/Downloads/chromedriver') # Downloaded from https://chromedriver.chromium.org/downloads
driver.get(url)
time.sleep(5) #adding delay to finish loading the page + javascript completely, you can adjust this
source = driver.page_source
soup = BeautifulSoup(source, 'lxml')
# Locate the part where the availability stuff is
availabilitypanel = soup.find('div', {"class" : "range-revamp-product-availability"})
# Get the text of the things inside of that panel
availabilitysectiontext = [part.getText() for part in availabilitypanel]
print(availabilitysectiontext)
The above code prints:
['Niet beschikbaar voor levering', 'Alleen beschikbaar in de winkel', 'Niet op voorraad in Amersfoort', 'Je hebt 365 dagen om van gedachten te veranderen. ']

Related

Web Scraping price AirBnB data with Python

I have been trying to web scrape an Airbnb page to obtain the price, without much luck. I have successfully been able to bring in the other areas of interest (home description, home location, reviews, etc.). Below is what I've tried unsuccessfully. I think the fact that the "price" on the web page is in a 'span class', as opposed to the others which are in a 'div class', is where my issue lies, but I'm speculating.
The URL I'm using is: https://www.airbnb.com/rooms/52361296?category_tag=Tag%3A8173&adults=4&children=0&infants=0&check_in=2022-12-11&check_out=2022-12-18&federated_search_id=6174a078-a823-4fad-827a-7ca652b5e786&source_impression_id=p3_1645454076_foOVSAshSYvdbpbS
This can be placed as the input in the below code.
Any assistance would be greatly appreciated.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from bs4 import BeautifulSoup
import requests
from IPython.display import IFrame

input_string = input("""Enter URLs for AirBnB sites that you want webscraped AND separate by a ',' : """)
airbnb_list = []
try:
    airbnb_list = input_string.split(",")
    x = 0
    y = len(airbnb_list)
    while y >= x:
        print(x+1 , '.) ' , airbnb_list[x])
        x = x+1
        if y == x:
            break
    #print(airbnb_list[len(airbnb_list)])
except:
    print("""Please separate list by a ','""")

a = pd.DataFrame([{"Title":'', "Stars": '', "Size":'', "Check In":'', "Check Out":'', "Rules":'',
                   "Location":'', "Home Type":'', "House Desc":''}])

for x in range(len(airbnb_list)):
    url = airbnb_list[x]
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    stars = soup.find(class_='_c7v1se').get_text()
    desc = soup.find(class_='_12nksyy').get_text()
    size = soup.find(class_='_jro6t0').get_text()
    #checkIn = soup.find(class_='_1acx77b').get_text()
    checkIn = soup.find(class_='_12aeg4v').get_text()
    #checkOut = soup.find(class_='_14tl4ml5').get_text()
    checkOut = soup.find(class_='_12aeg4v').get_text()
    Rules = soup.find(class_='cihcm8w dir dir-ltr').get_text()
    #location = soup.find(class_='_9ns6hl').get_text()
    location = soup.find(class_='_152qbzi').get_text()
    HomeType = soup.find(class_='_b8stb0').get_text()
    title = soup.title.string

    print('Stars: ', stars)
    print('')
    #Home Type
    print('Home Type: ', HomeType)
    print('')
    #Space Description
    print('Description: ', desc)
    print('')
    print('Rental size: ', size)
    print('')
    #CheckIn
    print('Check In: ', checkIn)
    print('')
    #CheckOut
    print('Check Out: ', checkOut)
    print('')
    #House Rules
    print('House Rules: ', Rules)
    print('')
    #print(soup.find("button", {"id":"#Id name of the button"}))
    #Home Location
    print('Home location: ', location)
    #Dates available
    #print('Dates available: ', soup.find(class_='_1yhfti2').get_text())
    print('===================================================================================')

    df = pd.DataFrame([{"Title":title, "Stars": stars, "Size":size, "Check In":checkIn, "Check Out":checkOut, "Rules":Rules,
                        "Location":location, "Home Type":HomeType, "House Desc":desc}])
    a = a.append(df)

#Attempting to print the price tag on the website
print(soup.find_all('span', {'class': '_tyxjp1'}))
print(soup.find(class_='_tyxjp1').get_text())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-2d9689dbc836> in <module>
1 #print(soup.find_all('span', {'class': '_tyxjp1'}))
----> 2 print(soup.find(class_='_tyxjp1').get_text())
AttributeError: 'NoneType' object has no attribute 'get_text'
I see you are using the requests module to scrape airbnb.
That module is extremely versatile and works on websites that have static content.
However, it has one major drawback: it doesn't render content created by javascript.
This is a problem, as most of the websites these days create additional html elements using javascript once the user lands on the web page.
The airbnb price block is created exactly like that - using javascript.
There are many ways to scrape that kind of content.
My favourite way is to use selenium.
It's basically a library that allows you to launch a real browser and communicate with it using your programming language of choice.
Here's how you can easily use selenium.
First, set it up. Notice the headless option, which can be toggled on and off.
Toggle it off if you want to see how the browser loads the webpage.
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
# set this to False if you want to see how the chrome window loads airbnb - useful for debugging
options.headless = True
driver = webdriver.Chrome(options=options)
Then, navigate to the website
# navigate to airbnb
driver.get(url)
Next, wait until the price block loads.
It might appear near-instantaneous to us, but depending on the speed of your internet connection it might take a few seconds.
# wait until the price block loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '._tyxjp1'))
price_element = WebDriverWait(driver, timeout).until(expectation)
And finally, print the price
# print the price
print(price_element.get_attribute('innerHTML'))
I added my code to your example so you can play around with it:
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.common.by import By
input_string = input("""Enter URLs for AirBnB sites that you want webscraped AND separate by a ',' : """)
airbnb_list = []
try:
    airbnb_list = input_string.split(",")
    x = 0
    y = len(airbnb_list)
    while y >= x:
        print(x+1 , '.) ' , airbnb_list[x])
        x = x+1
        if y == x:
            break
    #print(airbnb_list[len(airbnb_list)])
except:
    print("""Please separate list by a ','""")

a = pd.DataFrame([{"Title":'', "Stars": '', "Size":'', "Check In":'', "Check Out":'', "Rules":'',
                   "Location":'', "Home Type":'', "House Desc":''}])

# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
# set this to False if you want to see how the chrome window loads airbnb - useful for debugging
options.headless = True
driver = webdriver.Chrome(options=options)

for x in range(len(airbnb_list)):
    url = airbnb_list[x]
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    # navigate to airbnb
    driver.get(url)

    # wait until the price block loads
    timeout = 10
    expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '._tyxjp1'))
    price_element = WebDriverWait(driver, timeout).until(expectation)

    # print the price
    print(price_element.get_attribute('innerHTML'))
Keep in mind that your IP might eventually get banned for scraping Airbnb.
To work around that, it is always a good idea to use proxy IPs and rotate them; a minimal sketch of that idea is shown below.
Follow a rotating proxies tutorial to avoid getting blocked.
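Here is a minimal sketch of proxy rotation with plain requests; the proxy addresses are placeholders you would replace with proxies you actually control or rent:
import random
import requests

# Placeholder proxy pool; substitute real proxies.
PROXIES = [
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
    'http://203.0.113.12:8080',
]

def get_with_rotating_proxy(url, retries=3):
    """Try the request through a randomly chosen proxy, moving on when one fails."""
    for _ in range(retries):
        proxy = random.choice(PROXIES)
        try:
            return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        except requests.RequestException:
            continue  # this proxy failed or was blocked; try another one
    raise RuntimeError('All proxy attempts failed')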
Hope that helps!

how to scrape data from shopee using beautiful soup

I'm a student and I'm currently studying BeautifulSoup, so my lecturer asked me to scrape data from Shopee. However, I cannot scrape the details of the products. Currently, I'm trying to scrape data from https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I only want to scrape the name and price of the products. Can someone tell me why I cannot scrape the data using BeautifulSoup?
Here is my code:
from requests import get
from bs4 import BeautifulSoup

url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
This question is a bit tricky (for Python beginners) because it involves a combination of Selenium (for headless browsing) and BeautifulSoup (for HTML data extraction). Moreover, the problem becomes difficult because the Document Object Model (DOM) is built by JavaScript. We know JavaScript is involved because we get an empty response from the website when it is accessed using BeautifulSoup alone, for example with:
for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
    print(item_n.get_text())
Therefore, to extract data from such a webpage, where a scripting language controls the DOM, we have to use Selenium for headless browsing (this tells the website that a browser is accessing it). We also have to use some sort of delay parameter (which tells the website that it's accessed by a human). For this, the function WebDriverWait() from the Selenium library will help.
I now present snippets of code that explain the process.
First, import the requisite libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
Next, initialize the settings for the headless browser. I'm using chrome.
# create object for chrome options
chrome_options = Options()
base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'
# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})
# invoke the webdriver
browser = webdriver.Chrome(executable_path = r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                           options = chrome_options)
browser.get(base_url)
delay = 5  # seconds
Next, I declare empty list variables to hold the data.
# declare empty lists
item_cost, item_init_cost, item_loc = [],[],[]
item_name, items_sold, discount_percent = [], [], []
while True:
    try:
        WebDriverWait(browser, delay)
        print("Page is ready")
        sleep(5)
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        #print(html)
        soup = BeautifulSoup(html, "html.parser")

        # find_all() returns an array of elements.
        # We have to go through all of them, select the one we need, and then call get_text()
        for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
            print(item_n.get_text())
            item_name.append(item_n.text)

        # find the price of items
        for item_c in soup.find_all('span', class_='_341bF0'):
            print(item_c.get_text())
            item_cost.append(item_c.text)

        # find initial item cost
        for item_ic in soup.find_all('div', class_='_1w9jLI QbH7Ig U90Nhh'):
            print(item_ic.get_text())
            item_init_cost.append(item_ic.text)

        # find total number of items sold/month
        for items_s in soup.find_all('div', class_='_18SLBt'):
            print(items_s.get_text())
            items_sold.append(items_s.text)

        # find item discount percent
        for dp in soup.find_all('span', class_='percent'):
            print(dp.get_text())
            discount_percent.append(dp.text)

        # find item location
        for il in soup.find_all('div', class_='_3amru2'):
            print(il.get_text())
            item_loc.append(il.text)

        break  # break from the loop once the specific elements are present
    except TimeoutException:
        print("Loading took too much time! - Try again")
Thereafter, I use the zip function to combine the different list items.
rows = zip(item_name, item_init_cost,discount_percent,item_cost,items_sold,item_loc)
Finally, I write this data to disk:
import csv
newFilePath = 'shopee_item_list.csv'
with open(newFilePath, "w") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
As good practice, it's wise to close the headless browser once the task is complete, so I code it as:
# close the automated browser
browser.close()
Result
Nestle MILO Activ-Go Chocolate Malt Powder (2kg)
NESCAFE GOLD Refill (170g)
Nestle MILO Activ-Go Chocolate Malt Powder (1kg)
MAGGI Hot Cup - Asam Asam Laksa (60g)
MAGGI 2-Minit Curry (79g x 5 Packs x 2)
MAGGI PAZZTA Cheese Macaroni 70g
.......
29.90
21.90
16.48
1.69
8.50
3.15
5.90
.......
RM40.70
RM26.76
RM21.40
RM1.80
RM9.62
........
9k sold/month
2.3k sold/month
1.8k sold/month
1.7k sold/month
.................
27%
18%
23%
6%
.............
Selangor
Selangor
Selangor
Selangor
Note to the readers
The OP brought to my attention that the xpath was not working as given in my answer. I checked the website again after 2 days and noticed a strange phenomenon: the class_ attribute of the div had indeed changed. I found a similar question, but it did not help much. So for now, I'm concluding that the div attributes on the Shopee website can change again. I leave this as an open problem to solve later.
Note to the OP
Ana, the above code will work for just one page, i.e., it will work only for the webpage https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I invite you to further enhance your skills by solving how to scrape data for multiple webpages under the sales tag. Your hint is the 1/9 seen on the top right of this page and/or the 1 2 3 4 5 links at the bottom of the page. Another hint is to look at urljoin in the urllib.parse library; a rough sketch of that idea follows below. Hope this gets you started.
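A hedged sketch of that pagination hint, assuming the shop really has 9 result pages (taken from the 1/9 indicator, not verified) and that the page number is controlled by the page query parameter seen in the URL:
from urllib.parse import urljoin, urlencode

BASE = 'https://shopee.com.my/shop/13377506/search'

# Build one URL per results page by changing the "page" query parameter.
page_urls = [urljoin(BASE, '?' + urlencode({'page': n, 'sortBy': 'sales'})) for n in range(9)]

for url in page_urls:
    print(url)  # feed each of these to browser.get(url) and rerun the parsing loop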
Helpful resources
XPATH tutorial
The page is loaded asynchronously via AJAX after the first request, so sending one request and getting the source of the page you want is not possible.
You should simulate a browser; then you can get the source and use BeautifulSoup on it. See the code:
BeautifulSoup way
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # make sure chromedriver is on your PATH
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
search = soup.select_one('.shop-search-result-view')
products = search.find_all('a')

for p in products:
    name = p.select('div[data-sqe="name"] > div')[0].get_text()
    price = p.select('div > div:nth-child(2) > div:nth-child(2)')[0].get_text()
    product = p.select('div > div:nth-child(2) > div:nth-child(4)')[0].get_text()
    print('name: ' + name)
    print('price: ' + price)
    print('product: ' + product + '\n')
However, using selenium is a good approach to get everything you want. See the example below:
Selenium Way
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # make sure chromedriver is on your PATH
driver.get("https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales")
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.shop-search-result-view')))

search = driver.find_element_by_css_selector('.shop-search-result-view')
products = search.find_elements_by_css_selector('a')

for p in products:
    name = p.find_element_by_css_selector('div[data-sqe="name"] > div').text
    price = p.find_element_by_css_selector('div > div:nth-child(2) > div:nth-child(2)').text
    product = p.find_element_by_css_selector('div > div:nth-child(2) > div:nth-child(4)').text
    print('name: ' + name)
    print('price: ' + price.replace('\n', ' | '))
    print('product: ' + product + '\n')
Please post your code so we can help.
Or you can start like this.. :)
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReg

my_url = "<url>"
uClient = uReg(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

Issue when scraping data on McMaster-Carr website

I'm writing a crawler for McMaster-Carr. For example, take the page https://www.mcmaster.com/98173A200 : if I open it directly in a browser, I can view all the product data.
Because the data is dynamically loaded, I'm using Selenium + bs4.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

if __name__ == "__main__":
    url = "https://www.mcmaster.com/98173A200"
    options = webdriver.ChromeOptions()
    options.add_argument("--enable-javascript")
    driver = webdriver.Chrome("C:/chromedriver/chromedriver.exe", options=options)
    driver.set_page_load_timeout(20)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    delay = 20
    try:
        email_input = WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.ID, 'MainContent')))
    except TimeoutException:
        print("Timeout loading DOM!")
    print(soup)
However, if I run the code I get a login dialog, which I don't get when I open the page directly in a browser, as mentioned above.
I also tried logging in with the code below:
try:
    email_input = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.ID, 'Email')))
    print("Page is ready!!")
    input("Press Enter to continue...")
except TimeoutException:
    print("Loading took too much time!")

email_input.send_keys(email)
password_input = driver.find_element_by_id('Password')
password_input.send_keys(password)
login_button = driver.find_element_by_class_name("FormButton_primaryButton__1kNXY")
login_button.click()
Then it shows access restricted.
I compared the request headers of the page opened by Selenium and the page in my browser, and I couldn't find anything wrong. I also tried other webdrivers like PhantomJS and Firefox, and I got the same result.
I also tried using random user-agent using the code below
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem

software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names,
                               operating_systems=operating_systems,
                               limit=100)
user_agent = user_agent_rotator.get_random_user_agent()
chrome_options = Options()
chrome_options.add_argument('user-agent=' + user_agent)
Still same result.
The developer tools in the page opened by Selenium showed a bunch of errors. I guess the tokenauthorization one is the key to this issue, but I don't know what I should do with it.
Any help would be appreciated!
The reason you saw a login window is that you were accessing McMaster-Carr via a chromedriver. When the server recognizes this behaviour, it requires you to sign in.
A typical login won't work if you haven't been authenticated by McMaster (you need to sign an NDA).
You should look into the McMaster API. With the API, you can access the database directly. However, you need to sign an NDA with McMaster-Carr before obtaining access to the API. https://www.mcmaster.com/help/api/
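Purely as a hypothetical sketch of what an API-based lookup could look like once you have access (the endpoint, authentication scheme, and fields below are placeholders, not McMaster-Carr's actual API):
import requests

API_BASE = 'https://api.example.com/v1'  # placeholder; the real base URL comes from McMaster-Carr after the NDA
API_TOKEN = 'your-token-here'            # placeholder credential

def get_product(part_number):
    """Fetch product data for one part number from the hypothetical API."""
    resp = requests.get(
        f'{API_BASE}/products/{part_number}',
        headers={'Authorization': f'Bearer {API_TOKEN}'},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

print(get_product('98173A200'))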

Why do I only get first page data when using selenium?

I want to crawl reviews from IMDb using Python. The page only displays 25 reviews until I click the "load more" button. I use the Python package Selenium to click the "load more" button automatically, which works. But why can I not get the data after "load more", and instead just get the first 25 reviews repeatedly?
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time

seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed)
PATIENCE_TIME = 60
LOAD_MORE_BUTTON_XPATH = '//*[@id="browse-itemsprimary"]/li[2]/button/span/span[2]'

driver = webdriver.Chrome('D:/chromedriver_win32/chromedriver.exe')
driver.get(seed)

while True:
    try:
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(movie_review.text, 'html.parser')
        review_containers = review_soup.find_all('div', class_='imdb-user-review')
        print('length: ', len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_='title').text
            print(review_title)

        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break

print("Complete")
I want all the reviews, but now I can only get the first 25.
You have several issues in your script. A hardcoded wait is very inconsistent and certainly the worst option to use. The way you have written your scraping logic within the while True: loop slows down the parsing process by collecting the same items over and over again. Moreover, every title produces a huge line gap in the output, which needs to be properly stripped. I've slightly changed your script to reflect the suggestions above.
Try this to get the required output:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

URL = "https://www.imdb.com/title/tt4209788/reviews"

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(URL)

soup = BeautifulSoup(driver.page_source, 'lxml')

while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break

for elem in soup.find_all(class_='imdb-user-review'):
    name = elem.find(class_='title').get_text(strip=True)
    print(name)

driver.quit()
Your code is fine. Great even. But, you never fetch the 'updated' HTML for the web page after hitting the 'Load More' button. That's why you are getting the same 25 reviews listed all the time.
When you use Selenium to control the web browser, you are clicking the 'Load More' button. This creates an XHR request (or more commonly called AJAX request) that you can see in the 'Network' tab of your web browser's developer tools.
The bottom line is that JavaScript (which is run in the web browser) updates the page. But in your Python program, you only get the HTML once for the page statically using the Requests library.
seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed) #<-- SEE HERE? This is always the same HTML. You fetched it once in the beginning.
PATIENCE_TIME = 60
To fix this problem, you need to use Selenium to get the innerHTML of the div box containing the reviews. Then, have BeautifulSoup parse the HTML again. We want to avoid picking up the entire page's HTML again and again because it takes computation resources to have to parse that updated HTML over and over again.
So, find the div on the page that contains the reviews, and parse it again with BeautifulSoup. Something like this should work:
while True:
    try:
        allReviewsDiv = driver.find_element_by_xpath("//div[@class='lister-list']")
        allReviewsHTML = allReviewsDiv.get_attribute('innerHTML')

        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(allReviewsHTML, 'html.parser')
        review_containers = review_soup.find_all('div', class_='imdb-user-review')
        print('length: ', len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_='title').text
            print(review_title)

        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break

Loop through url with Selenium Webdriver

The request below finds the contest ids for the day. I am trying to pass each id into the driver.get URL so it will go to each individual contest URL and download each contest's CSV. I imagine you have to write a loop, but I'm not sure what that would look like with a webdriver.
import time
from selenium import webdriver
import requests
import datetime

req = requests.get('https://www.draftkings.com/lobby/getlivecontests?sport=NBA')
data = req.json()
for ids in data:
    contest = ids['id']

driver = webdriver.Chrome()  # Optional argument, if not specified will search path.
driver.get('https://www.draftkings.com/account/sitelogin/false?returnurl=%2Flobby')
time.sleep(2)  # Let DK Load!
search_box = driver.find_element_by_name('username')
search_box.send_keys('username')
search_box2 = driver.find_element_by_name('password')
search_box2.send_keys('password')
submit_button = driver.find_element_by_xpath('//*[@id="react-mobile-home"]/section/section[2]/div[3]/button/span')
submit_button.click()
time.sleep(2)  # Let Page Load, If not it will go to Account!
driver.get('https://www.draftkings.com/contest/exportfullstandingscsv/' + str(contest) + '')
Try it in the following order:
import time
from selenium import webdriver
import requests
import datetime

req = requests.get('https://www.draftkings.com/lobby/getlivecontests?sport=NBA')
data = req.json()

driver = webdriver.Chrome()  # Optional argument, if not specified will search path.
driver.get('https://www.draftkings.com/account/sitelogin/false?returnurl=%2Flobby')
time.sleep(2)  # Let DK Load!
search_box = driver.find_element_by_name('username')
search_box.send_keys('Pr0c3ss')
search_box2 = driver.find_element_by_name('password')
search_box2.send_keys('generic1!')
submit_button = driver.find_element_by_xpath('//*[@id="react-mobile-home"]/section/section[2]/div[3]/button/span')
submit_button.click()
time.sleep(2)  # Let Page Load, If not it will go to Account!

for ids in data:
    contest = ids['id']
    driver.get('https://www.draftkings.com/contest/exportfullstandingscsv/' + str(contest) + '')
You do not need to launch Selenium x number of times to download x number of files. Requests and Selenium can share cookies. This means you can log in to the site with Selenium, retrieve the login cookies and share them with Requests or any other application. Take a moment to check out httpie, https://httpie.org/doc#sessions ; it lets you manually control sessions the way Requests does.
For Requests look at: http://docs.python-requests.org/en/master/user/advanced/?highlight=sessions
For Selenium look at: http://selenium-python.readthedocs.io/navigating.html#cookies
Looking at the webdriver block, you can add proxies and load the browser headless or live: just comment out the headless line and it will load the browser live, which makes debugging easier and helps you follow movements and changes to the site's API/HTML.
import time
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
import requests
import datetime
import shutil

LOGIN = 'https://www.draftkings.com/account/sitelogin/false?returnurl=%2Flobby'
BASE_URL = 'https://www.draftkings.com/contest/exportfullstandingscsv/'
USER = ''
PASS = ''

try:
    data = requests.get('https://www.draftkings.com/lobby/getlivecontests?sport=NBA').json()
except BaseException as e:
    print(e)
    exit()

ids = [str(item['id']) for item in data]

# Webdriver block
options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_argument('window-size=800x600')
# options.add_argument('--proxy-server= IP:PORT')
# options.add_argument('--user-agent=' + USER_AGENT)
driver = webdriver.Chrome(options=options)

try:
    driver.get(LOGIN)
    driver.implicitly_wait(2)
except WebDriverException:
    exit()

def login(USER, PASS):
    '''
    Login to draftkings.
    Retrieve authentication/authorization.
    http://selenium-python.readthedocs.io/waits.html#implicit-waits
    http://selenium-python.readthedocs.io/api.html#module-selenium.common.exceptions
    '''
    search_box = driver.find_element_by_name('username')
    search_box.send_keys(USER)
    search_box2 = driver.find_element_by_name('password')
    search_box2.send_keys(PASS)
    submit_button = driver.find_element_by_xpath('//*[@id="react-mobile-home"]/section/section[2]/div[3]/button/span')
    submit_button.click()
    driver.implicitly_wait(2)
    cookies = driver.get_cookies()
    return cookies

site_cookies = login(USER, PASS)

def get_csv_files(contest_id):
    '''
    Get each id and download the file.
    '''
    session = requests.session()
    for cookie in site_cookies:
        session.cookies.set(cookie['name'], cookie['value'])
    try:
        _data = session.get(BASE_URL + contest_id, stream=True)
        with open(contest_id + '.csv', 'wb') as f:
            shutil.copyfileobj(_data.raw, f)
    except BaseException:
        return

for contest_id in ids:
    get_csv_files(contest_id)
Will this help?
for ids in data:
    contest = ids['id']
    driver.get('https://www.draftkings.com/contest/exportfullstandingscsv/' + str(contest) + '')
Maybe it's time to decompose it a bit.
Create a few isolated functions, which are:
0. (optional) Provide authorization to the target URL.
1. Collect all needed ids (first part of your code).
2. Export the CSV for a specific id (second part of your code).
3. Loop through the list of ids and call function #2 for each one.
Share the chromedriver instance as an input argument for each of them to preserve driver state and auth cookies.
It works fine and keeps the code clear and readable; a rough sketch of this structure is shown below.
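A minimal sketch of that decomposition, reusing the URLs, selectors and placeholder credentials from the code above (none of it verified against the live site):
import time
import requests
from selenium import webdriver

LOBBY_API = 'https://www.draftkings.com/lobby/getlivecontests?sport=NBA'
LOGIN_URL = 'https://www.draftkings.com/account/sitelogin/false?returnurl=%2Flobby'
EXPORT_URL = 'https://www.draftkings.com/contest/exportfullstandingscsv/'

def authorize(driver, username, password):
    """Log in once; the driver keeps the auth cookies for later requests."""
    driver.get(LOGIN_URL)
    time.sleep(2)
    driver.find_element_by_name('username').send_keys(username)
    driver.find_element_by_name('password').send_keys(password)
    driver.find_element_by_xpath('//*[@id="react-mobile-home"]/section/section[2]/div[3]/button/span').click()
    time.sleep(2)

def collect_ids():
    """Collect all live contest ids from the lobby feed."""
    return [item['id'] for item in requests.get(LOBBY_API).json()]

def export_csv(driver, contest_id):
    """Download the standings CSV for a single contest id."""
    driver.get(EXPORT_URL + str(contest_id))

def main():
    driver = webdriver.Chrome()
    authorize(driver, 'username', 'password')  # placeholder credentials
    for contest_id in collect_ids():
        export_csv(driver, contest_id)

if __name__ == '__main__':
    main()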
I think you can set the URL of a contest on an <a> element in the landing page and then click on it, then repeat the step with the other IDs.
See my code below.
req = requests.get('https://www.draftkings.com/lobby/getlivecontests?sport=NBA')
data = req.json()
contests = []
for ids in data:
    contests.append(ids['id'])

driver = webdriver.Chrome()  # Optional argument, if not specified will search path.
driver.get('https://www.draftkings.com/account/sitelogin/false?returnurl=%2Flobby')
time.sleep(2)  # Let DK Load!
search_box = driver.find_element_by_name('username')
search_box.send_keys('username')
search_box2 = driver.find_element_by_name('password')
search_box2.send_keys('password')
submit_button = driver.find_element_by_xpath('//*[@id="react-mobile-home"]/section/section[2]/div[3]/button/span')
submit_button.click()
time.sleep(2)  # Let Page Load, If not it will go to Account!

for id in contests:
    element = driver.find_element_by_css_selector('a')
    script1 = "arguments[0].setAttribute('download',arguments[1]);"
    driver.execute_script(script1, element, str(id) + '.pdf')
    script2 = "arguments[0].setAttribute('href',arguments[1]);"
    driver.execute_script(script2, element, 'https://www.draftkings.com/contest/exportfullstandingscsv/' + str(id))
    time.sleep(1)
    element.click()
    time.sleep(3)
