Issue when scraping data from the McMaster-Carr website - Python

I'm writing a crawler for McMaster-Carr. Take the page https://www.mcmaster.com/98173A200 as an example: if I open it directly in a browser, I can view all the product data.
Because the data is loaded dynamically, I'm using Selenium + bs4.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

if __name__ == "__main__":
    url = "https://www.mcmaster.com/98173A200"
    options = webdriver.ChromeOptions()
    options.add_argument("--enable-javascript")
    driver = webdriver.Chrome("C:/chromedriver/chromedriver.exe", options=options)
    driver.set_page_load_timeout(20)
    driver.get(url)
    delay = 20
    try:
        # Wait for the dynamically loaded content before parsing
        WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.ID, 'MainContent')))
    except TimeoutException:
        print("Timeout loading DOM!")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup)
However, when I run the code I get a login dialog, which I don't get when I open the page directly in a browser, as mentioned above.
I also tried logging in with the code below:
try:
    email_input = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.ID, 'Email')))
    print("Page is ready!!")
    input("Press Enter to continue...")
except TimeoutException:
    print("Loading took too much time!")

email_input.send_keys(email)
password_input = driver.find_element_by_id('Password')
password_input.send_keys(password)
login_button = driver.find_element_by_class_name("FormButton_primaryButton__1kNXY")
login_button.click()
Then it shows "access restricted".
I compared the request headers of the page opened by Selenium with those in my regular browser and couldn't find any difference. I also tried other webdrivers such as PhantomJS and Firefox, and got the same result.
I also tried using a random user agent, with the code below:
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem

software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names,
                               operating_systems=operating_systems,
                               limit=100)
user_agent = user_agent_rotator.get_random_user_agent()

chrome_options = Options()
chrome_options.add_argument('user-agent=' + user_agent)
Still the same result.
The developer tools in the page opened by Selenium showed a bunch of errors. I guess the token-authorization one is the key to this issue, but I don't know what to do about it.
Any help would be appreciated!

The reason you see a login window is that you are accessing McMaster-Carr through ChromeDriver. When the server recognizes automated behaviour, it requires you to sign in.
A typical login won't work unless you have been authenticated by McMaster-Carr (you need to sign an NDA).
You should look into the McMaster-Carr API. With the API you can access the product data directly. However, you need to sign an NDA with McMaster-Carr before obtaining access to the API: https://www.mcmaster.com/help/api/
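If you still want to experiment with Selenium in the meantime, there are ChromeOptions settings that reduce (but do not eliminate) the obvious automation fingerprints. A minimal sketch using real Chrome flags; whether McMaster-Carr's detection stops at these is an assumption, not something the thread confirms:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Drop the "Chrome is being controlled by automated test software" banner
# and ChromeDriver's automation extension.
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Stop Blink from exposing navigator.webdriver = true, a common tell.
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)
driver.get("https://www.mcmaster.com/98173A200")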

Related

How do I get the current page URL with Python Selenium?

I want to know from which page the API request came, so I tried:
@test_bp.route("/test2")
def test2():
    options = webdriver.ChromeOptions()
    options.add_experimental_option("excludeSwitches", ["enable-logging"])
    driver = webdriver.Chrome(executable_path="D:\pathto/chromedriver.exe", options=options)
    path = driver.current_url
    print(path)
    return ""
and got this for print(path):
data:,
I do get the proper URL if I call driver.get(URL) before path = driver.current_url. However, I can't pass in that URL, because I don't know it at the moment I'm handling the API request.
Is there any other way? Or should I just give up and ask for the additional page information from the client side?
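For what it's worth, a fresh WebDriver session always starts on the blank page data:, and has no connection to the browser that hit your Flask route. If the goal is just to know which page the request came from, the Referer request header (when the client sends it) may be enough; a minimal sketch, reusing the test_bp blueprint name from the question:

from flask import Blueprint, request

test_bp = Blueprint("test", __name__)

@test_bp.route("/test2")
def test2():
    # The Referer header names the page that issued the request,
    # but it is optional and browsers/privacy settings may strip it.
    referrer = request.headers.get("Referer")  # also available as request.referrer
    print(referrer)
    return referrer or ""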

Why does BeautifulSoup give me the wrong text?

I've been trying to get the availability status of a product on IKEA's website. On IKEA's website, it says in Dutch: 'not available for delivery', 'only available in the shop', 'not in stock' and 'you've got 365 days of warranty'.
But my code gives me: 'not available for delivery', 'only available for order and pickup', 'checking inventory' and 'you've got 365 days of warranty'.
What am I doing wrong that causes the text not to match?
This is my code:
import requests
from bs4 import BeautifulSoup
# Get the url of the IKEA page and set up the bs4 stuff
url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
thepage = requests.get(url)
soup = BeautifulSoup(thepage.text, 'lxml')
# Locate the part where the availability stuff is
availabilitypanel = soup.find('div', {'class' : 'range-revamp-product-availability'})
# Get the text of the things inside of that panel
availabilitysectiontext = [part.getText() for part in availabilitypanel]
print(availabilitysectiontext)
With Rajesh's help, I created a script that does exactly what I want. It goes to a specific shop (the one located in Heerlen), checks an out-of-stock item until it comes back in stock, and then sends you an email.
The script used for this is:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import smtplib, ssl

# Fill in the url of the product
url = 'https://www.ikea.com/nl/nl/p/vittsjo-stellingkast-zwartbruin-glas-20213312/'

op = webdriver.ChromeOptions()
op.add_argument('headless')
driver = webdriver.Chrome(options=op, executable_path='/Users/Jem/Downloads/chromedriver')

# Stuff for sending the email
port = 465
password = 'password'
sender_email = 'email'
receiver_email = 'email'
message = """\
Subject: Product is back in stock!

Sent with Python. """

# Keep looping until back in stock
while True:
    driver.get(url)
    # Accept cookies, then go to the location of the shop
    btn = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="onetrust-accept-btn-handler"]')))
    btn.click()
    location = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="content"]/div/div/div/div[2]/div[3]/div/div[5]/div[3]/div/span[1]/div/span/a')))
    location.click()
    differentlocation = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="range-modal-mount-node"]/div/div[3]/div/div[2]/div/div[1]/div[2]/a')))
    differentlocation.click()
    searchbar = driver.find_element_by_xpath('//*[@id="change-store-input"]')
    # In this part you can choose the location you want to check
    searchbar.send_keys('heerlen')
    heerlen = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="range-modal-mount-node"]/div/div[3]/div/div[2]/div/div[3]/div')))
    heerlen.click()
    selecteer = driver.find_element_by_xpath('//*[@id="range-modal-mount-node"]/div/div[3]/div/div[3]/button')
    selecteer.click()
    close = driver.find_element_by_xpath('//*[@id="range-modal-mount-node"]/div/div[3]/div/div[1]/button')
    close.click()
    # After you went to the right page, parse it with BeautifulSoup
    source = driver.page_source
    soup = BeautifulSoup(source, 'lxml')
    # Locate the part where the availability stuff is
    availabilitypanel = soup.find('div', {"class": "range-revamp-product-availability"})
    # Get the text of the things inside of that panel
    availabilitysectiontext = [part.getText() for part in availabilitypanel]
    # Check whether it is still out of stock; if so, wait half an hour and continue
    if 'Niet op voorraad in Heerlen' in availabilitysectiontext:
        time.sleep(1800)
        continue
    # If not, send me an email that it is back in stock
    else:
        print('Email is being sent...')
        context = ssl.create_default_context()
        with smtplib.SMTP_SSL('smtp.gmail.com', port, context=context) as server:
            server.login(sender_email, password)
            server.sendmail(sender_email, receiver_email, message)
        break
The page markup is added by JavaScript after the initial server response. BeautifulSoup only sees that initial response and doesn't execute JavaScript, so it never gets the complete page. If you want the JavaScript to run, you'll need to use a (headless) browser; otherwise you'd have to pick apart the JavaScript and replicate what it does.
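You can see this with plain requests: the panel that comes back in the raw HTML still contains the server-rendered placeholder text ('checking inventory') rather than the final stock status. A quick check, assuming the same product URL as the question:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
raw = requests.get(url).text
soup = BeautifulSoup(raw, 'lxml')

# The availability values are filled in client-side, so the raw response
# only contains the placeholders, not the final stock status.
print(soup.find('div', {'class': 'range-revamp-product-availability'}))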
You could get this to work with Selenium. I modified your code a bit and got it to work.
Get Selenium:
pip3 install selenium
Download Firefox + geckodriver or Chrome + chromedriver:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
# Get the url of the IKEA page and set up the bs4 stuff
url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
#uncomment the following line if using firefox + geckodriver
#driver = webdriver.Firefox(executable_path='/Users/ralwar/Downloads/geckodriver') # Downloaded from https://github.com/mozilla/geckodriver/releases
# using chrome + chromedriver
op = webdriver.ChromeOptions()
op.add_argument('headless')
driver = webdriver.Chrome(options=op, executable_path='/Users/ralwar/Downloads/chromedriver') # Downloaded from https://chromedriver.chromium.org/downloads
driver.get(url)
time.sleep(5) #adding delay to finish loading the page + javascript completely, you can adjust this
source = driver.page_source
soup = BeautifulSoup(source, 'lxml')
# Locate the part where the availability stuff is
availabilitypanel = soup.find('div', {"class" : "range-revamp-product-availability"})
# Get the text of the things inside of that panel
availabilitysectiontext = [part.getText() for part in availabilitypanel]
print(availabilitysectiontext)
The above code prints:
['Niet beschikbaar voor levering', 'Alleen beschikbaar in de winkel', 'Niet op voorraad in Amersfoort', 'Je hebt 365 dagen om van gedachten te veranderen. ']

How to save a browser session using Selenium in Python

I am trying to write a Python script that changes the language on YouTube, so that the next time I open the browser with the webdriver, YouTube shows up in the language I saved before. The problem is that my script is not working as expected, and I am not sure why.
Selecting the language and saving the cookies with the following code:
url = 'https://www.youtube.com'
print(url)

#open web browser
browser = webdriver.Firefox()
#load specific url
browser.get(url)
#wait to load js
time.sleep(5)
#find language picker and click
browser.find_element_by_xpath('//*[@id="yt-picker-language-button"]').click()
#wait to open language list
time.sleep(2)
#find and click specific language
browser.find_element_by_xpath('//*[@id="yt-picker-language-footer"]/div[2]/form/div/div[1]/button[1]').click()

pickle.dump(browser.get_cookies(), open("youtubeCookies.pkl", "wb"))
Loading the data back from the cookies:
url = 'https://www.youtube.com'
print(url)

driver = webdriver.Firefox()
driver.get(url)

for cookie in pickle.load(open("youtubeCookies.pkl", "rb")):
    driver.add_cookie(cookie)

time.sleep(3)
driver.refresh()
Please guide me on what I am doing wrong.
Thank you.
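One alternative to pickling cookies by hand: a persistent browser profile keeps cookies and site preferences between runs automatically. The question uses Firefox, but the mechanism I'm confident persists across runs is Chrome's --user-data-dir flag; a minimal sketch, with a hypothetical profile directory:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Hypothetical directory; Chrome reads and writes cookies/preferences here,
# so whatever you change in one run is still there on the next run.
options.add_argument("--user-data-dir=/path/to/selenium-profile")

driver = webdriver.Chrome(options=options)
driver.get("https://www.youtube.com")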

How to get HTML code after logging in?

I'm quite new to Selenium; it would be great if you could point me in the right direction.
I'm trying to access the HTML code of a website AFTER the login sequence.
I've used Selenium to drive the browser through the login sequence, and the part of the HTML I need only shows up after I log in. But when I call page_source after the login sequence, it just gives me the HTML of the page from before logging in.
from selenium import webdriver

def test_script(ticker):
    base_url = "http://amigobulls.com/stocks/%s/income-statement/quarterly" % ticker
    driver = webdriver.Firefox()
    verificationErrors = []
    accept_next_alert = True
    driver.get(base_url)
    driver.maximize_window()
    driver.implicitly_wait(30)
    # Open the login dialog and fill in the credentials
    driver.find_element_by_xpath("//header[@id='header_cont']/nav/div[4]/div/span[3]").click()
    driver.find_element_by_id("login_email").clear()
    driver.find_element_by_id("login_email").send_keys(email)
    driver.find_element_by_id("login_pswd").clear()
    driver.find_element_by_id("login_pswd").send_keys(pwd)
    driver.find_element_by_id("loginbtn").click()
    amigo_script = driver.page_source
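Here page_source is read immediately after the click, before the post-login page has finished loading, which matches the symptom described. One fix is to wait explicitly for an element that only exists after login before grabbing the source; a sketch, where ".logout-link" is a hypothetical placeholder selector to replace with a real post-login element on the site:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_id("loginbtn").click()

# Wait until something that only appears for logged-in users is present;
# ".logout-link" is a hypothetical placeholder selector.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".logout-link")))

amigo_script = driver.page_source  # now reflects the logged-in page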

Retrieving url from google image search for first entry, using python and selenium

Ever since the API was deprecated, it has been very hard to retrieve the Google image-search URL using Selenium. I've scoured Stack Overflow, but most of the answers to this question are from years ago, when scraping search engines was simpler.
I'm looking for a way to return the URL of the first image in a Google search query. I've tried everything in Selenium, from clicks to retrieving the inner HTML of elements, to my most recent attempt: using ActionChains to navigate to the URL of the picture and then returning the current URL.
def GoogleImager(searchterm, musedict):
    page = "http://www.google.com/"
    landing = driver.get(page)
    actions = ActionChains(driver)
    WebDriverWait(landing, '10')
    images = driver.find_element_by_link_text('Images').click()
    actions.move_to_element(images)
    searchbox = driver.find_element_by_css_selector('#lst-ib')
    WebDriverWait(searchbox, '10')
    sendsearch = searchbox.send_keys('{} "logo" {}'.format('Museum of Bad Art', 'bos') + Keys.ENTER)
    WebDriverWait(sendsearch, '10')
    logo = driver.find_element_by_xpath('//*[@id="rg_s"]/div[1]/a').click()
    WebDriverWait(logo, '10')
    logolink = driver.find_element_by_xpath('//*[@id="irc_cc"]/div[3]/div[1]/div[2]/div[2]/a')
    WebDriverWait(logolink, '10')
    actions.move_to_element(logolink).click(logolink)
    print(driver.current_url)
    return driver.current_url
I'm using this to return the first image for a museum name and city in the search.
I tried to make your code work with Google, got frustrated, and switched to Yahoo instead. I couldn't make heads or tails of your musedict access loops, so I substituted a simple dictionary for demonstration purposes:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

museum_dictionary = {"louvre": "Paris", "prado": "Madrid"}

driver = webdriver.Firefox()

def YahooImager(searchterm):
    page = "https://images.search.yahoo.com"
    landing = driver.get(page)
    WebDriverWait(driver, 4)
    assert "Yahoo Image Search" in driver.title
    searchbox = driver.find_element_by_name("p")  # Find the query box
    city = museum_dictionary[searchterm]
    searchbox.send_keys("{} {}".format(searchterm, city) + Keys.RETURN)
    WebDriverWait(driver, 4)
    try:
        driver.find_element_by_xpath('//*[@id="resitem-0"]/a').click()
    except NoSuchElementException:
        assert 0, '//*[@id="resitem-0"]/a'
        driver.close()
    WebDriverWait(driver, 4)
    try:
        driver.find_element_by_link_text("View Image").click()
    except NoSuchElementException:
        assert 0, "View Image"
        driver.close()
    WebDriverWait(driver, 4)
    # driver.close()
    return driver.current_url

image_url = YahooImager("prado")
print(repr(image_url))
It works, but takes quite a while. (That's probably something someone who knows these libraries better could optimize -- I just wanted to see it work at all.) This example is fragile and occasionally just fails.
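Part of the fragility is that a bare WebDriverWait(driver, 4) only constructs a waiter; without .until(...) it returns immediately and nothing actually waits. A hedged tweak for the first result click (the other waits would follow the same pattern):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the first result is actually clickable,
# instead of hoping a bare WebDriverWait call pauses the script.
first_result = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="resitem-0"]/a')))
first_result.click()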
