Issue while web scraping with Selenium and Python

Issue while web scraping with Selenium and Python - python

I'm trying to scrape this website
https://maroof.sa/BusinessType/BusinessesByTypeList?bid=14&sortProperty=BestRating&DESC=True
There is a button to load more content when you click it, it displays more content without changing the URL
I had made a piece of code to load all the content first then extract all the URLs of the data I need then go to each link and scrape the data
url = "https://maroof.sa/BusinessType/BusinessesByTypeList?bid=26&sortProperty=BestRating&DESC=True"
driver = webdriver.Chrome()
driver.get(url)
# button = driver.find_element_by_xpath('//*[#id="loadMore"]/button')
num = 1
while num <= 507:
sleep(4)
button = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '//*[#id="loadMore"]/button')))
button.click()
print(num)
num += 1
links = [l.get_attribute('href') for l in WebDriverWait(driver, 40).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[#id="list"]/a')))]
it seems to work but sometimes it doesn't click on the button that loads the content it accidentally
click on something else and makes an error and i have to start over again
Can you help me?

To scrape the website clicking on the button to load more content you need to induce WebDriverWait for the element_to_be_clickable() and you can use the following Locator Strategy:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://maroof.sa/BusinessType/BusinessesByTypeList?bid=26&sortProperty=BestRating&DESC=True')
while True:
try:
driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//button[#class='btn btn-primary']"))))
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[#class='btn btn-primary']"))).click()
except TimeoutException:
break
print([l.get_attribute('href') for l in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[#id="list"]/a')))])
driver.quit()

Related

How to scrape Next button on Linkedin with Selenium using Python?

I am trying to scrape LinkedIn website using Selenium. I can't parse Next button. It resists as much as it can. I've spent a half of a day to adress this, but all in vain.
I tried absolutely various options, with text and so on. Only work with start ID but scrape other button.
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//button[#aria-label='Далее']"}
This is quite common for this site:
//*[starts-with(#id,'e')]
My code:
from selenium import webdriver
from selenium.webdriver import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep
chrome_driver_path = Service("E:\programming\chromedriver_win32\chromedriver.exe")
driver = webdriver.Chrome(service=chrome_driver_path)
url = "https://www.linkedin.com/feed/"
driver.get(url)
SEARCH_QUERY = "python developer"
LOGIN = "EMAIL"
PASSWORD = "PASSWORD"
sleep(10)
sign_in_link = driver.find_element(By.XPATH, '/html/body/div[1]/main/p[1]/a')
sign_in_link.click()
login_input = driver.find_element(By.XPATH, '//*[#id="username"]')
login_input.send_keys(LOGIN)
sleep(1)
password_input = driver.find_element(By.XPATH, '//*[#id="password"]')
password_input.send_keys(PASSWORD)
sleep(1)
enter_button = driver.find_element(By.XPATH, '//*[#id="organic-div"]/form/div[3]/button')
enter_button.click()
sleep(25)
lens_button = driver.find_element(By.XPATH, '//*[#id="global-nav-search"]/div/button')
lens_button.click()
sleep(5)
search_input = driver.find_element(By.XPATH, '//*[#id="global-nav-typeahead"]/input')
search_input.send_keys(SEARCH_QUERY)
search_input.send_keys(Keys.ENTER)
sleep(5)
people_button = driver.find_element(By.XPATH, '//*[#id="search-reusables__filters-bar"]/ul/li[1]/button')
people_button.click()
sleep(5)
page_button = driver.find_element(By.XPATH, "//button[#aria-label='Далее']")
page_button.click()
sleep(60)
Chrome inspection of button
Next Button

OK, there are several issues here:
The main problem why your code not worked is because the "next" pagination is initially even not created on the page until you scrolling the page, so I added the mechanism, to scroll the page until that button can be clicked.
it's not good to create locators based on local language texts.
You should use WebDriverWait expected_conditions explicit waits, not hardcoded pauses.
I used mixed locators types to show that sometimes it's better to use By.ID and sometimes By.XPATH etc.
the following code works:
import time
from selenium import webdriver
from selenium.webdriver import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
webdriver_service = Service('C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(options=options, service=webdriver_service)
wait = WebDriverWait(driver, 10)
url = "https://www.linkedin.com/feed/"
driver.get(url)
wait.until(EC.element_to_be_clickable((By.XPATH, "//a[contains(#href,'login')]"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "username"))).send_keys(my_email)
wait.until(EC.element_to_be_clickable((By.ID, "password"))).send_keys(my_password)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']"))).click()
search_input = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[contains(#class,'search-global')]")))
search_input.click()
search_input.send_keys("python developer" + Keys.ENTER)
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[#id="search-reusables__filters-bar"]/ul/li[1]/button'))).click()
wait = WebDriverWait(driver, 4)
while True:
try:
next_btn = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "button.artdeco-pagination__button.artdeco-pagination__button--next")))
next_btn.location_once_scrolled_into_view
time.sleep(0.2)
next_btn.click()
break
except:
driver.execute_script("window.scrollBy(0, arguments[0]);", 600)

How to access the iframe using Selenium and Python on Glassdoor

I am trying to automate application process on Glassdoor using the EasyApply button. Now after identifying the EasyApply Button and successfully clicking it, I need to switch to the Frame so as to access the HTML content of the form to be able to send in my form details.
I used:
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "#indeedapply-modal-preload-1658752913396-iframe")))
to perform the switching but still could not access the frame's html content.
Here is the block that performs this operation:
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
#if it has easy-apply, then perform application
if len(driver.find_elements_by_xpath('//*[#id="JDCol"]/div/article/div/div[1]/div/div/div[1]/div[3]/div[2]/div/div[1]/div[1]/button')) > 0:
driver.find_element_by_xpath('//*[#id="JDCol"]/div/article/div/div[1]/div/div/div[1]/div[3]/div[2]/div/div[1]/div[1]/button').click()
wait = WebDriverWait(driver, 50)
time.sleep(5)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "#indeedapply-modal-preload-1658752913396-iframe")))
name = driver.find_element(By.CSS_SELECTOR, '#input-applicant.name')
name.send_keys('Oluyele Anthony')
elif len(driver.find_elements_by_xpath('//*[#id="JDCol"]/div/article/div/div[1]/div/div/div[1]/div[3]/div[2]/div/div[1]/div[1]/a')) > 0:
driver.find_element_by_xpath('//*[#id="JDCol"]/div/article/div/div[1]/div/div/div[1]/div[3]/div[2]/div/div[1]/div[1]/a').click()
Here is the HTML content for the frame
<iframe name="indeedapply-modal-preload-1659146630884-iframe" id="indeedapply-modal-preload-1659146630884-iframe" scrolling="no" frameborder="0" title="Job application form container" src="https://apply.indeed.com/indeedapply/xpc?v=5#%7B%22cn%22:%224AaHlXdnW4%22,%22ppu%22:%22https://www.glassdoor.com/robots.txt%22,%22lpu%22:%22https://apply.indeed.com/robots.txt%22,%22setupms%22:1659146630959,%22preload%22:true,%22iaUid%22:%221g96dgr7uii3h800%22,%22parentURL%22:%22https://www.glassdoor.com/Job/nigeria-data-science-jobs-SRCH_IL.0,7_IN177_KO8,20.htm?clickSource=searchBox%22%7D" style="border: 0px; vertical-align: bottom; width: 100%; height: 100%;"></iframe>
Apparently, after running the cell, the:
TimeoutException: Message:
Error occurs which shows that the frame is not being switched to.

The middle part of the value of id attribute i.e. 1658752913396 is dynamically generated and is bound to change sooner/later. They may change next time you access the application afresh or even while next application startup. So can't be used in locators.
Solution
To switch to the <iframe> you can use either of the following Locator Strategies:
Using id attribute:
Using CSS_SELECTOR:
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[title='Job application form container'][id^='indeedapply-modal-preload']")))
Using XPATH:
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//iframe[title='Job application form container' and starts-with(#id, 'indeedapply-modal-preload')]")))
Using name attribute:
Using CSS_SELECTOR:
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[title='Job application form container'][name^='indeedapply-modal-preload']")))
Using XPATH:
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//iframe[title='Job application form container' and starts-with(#name, 'indeedapply-modal-preload')]")))
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

There are 2 iframes in that page and the second iframe contains another (nested) iframe. The following code will sort out your problem (setup is for linux, but you can figure it out - pay attention to imports, and to the code after getting the url):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url = 'https://www.glassdoor.com/Job/nigeria-data-science-jobs-SRCH_IL.0,7_IN177_KO8,20.htm?clickSource=searchBox'
browser.get(url)
button = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//button[#data-test="applyButton"]')))
button.click()
iframes = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.TAG_NAME, "iframe")))
print(len(iframes))
for iframe in iframes:
print(iframe.get_attribute('id'))
browser.switch_to.frame(iframes[1])
t.sleep(2)
WebDriverWait(browser, 20).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//*[#title='Job application form']")))
t.sleep(2)
applicant_name = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.ID, 'input-applicant.name')))
applicant_name.click()
applicant_name.send_keys('hello dolly')
Let me know if it works for you.
UPDATE: It appears that website changed its structure since yesterday, and now there are 11/13 iframes per page. Also, if the job selected from the main column does not offer Easy Apply, you get an error. The updated code (below) is selecting the Easy Apply jobs from page, goes through each of them, clicks easyapply button, select the correct iframe/nested iframe, and does stuff to that form. It currently works, until that website will change its format again:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url = 'https://www.glassdoor.com/Job/nigeria-data-science-jobs-SRCH_IL.0,7_IN177_KO8,20.htm?clickSource=searchBox'
browser.get(url)
jobs = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//li[#data-is-easy-apply='true']")))
while True:
for job in jobs:
try:
print(job.text)
job.click()
try:
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[#alt='Close']"))).click()
except Exception as e:
print('no request to login')
t.sleep(2)
button = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//button[#data-test="applyButton"]')))
button.click()
WebDriverWait(browser, 20).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//*[#title='Job application form container']")))
t.sleep(2)
WebDriverWait(browser, 20).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//*[#title='Job application form']")))
t.sleep(2)
applicant_name = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.ID, 'input-applicant.name')))
applicant_name.click()
applicant_name.send_keys('hello dolly')
t.sleep(2)
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[#id='form-action-cancel']"))).click()
t.sleep(2)
browser.switch_to.default_content()
t.sleep(2)
except Exception as e:
print(e)
break

Can you click the reCapatcha 'Audio Challenge' Button Using Selenium + Python? [duplicate]

I want to, click on the button to resolve the captcha through the audio, but selenium does not detect the specified "id".
browser.get("https://www.google.com/recaptcha/api2/demo")
mainWin = browser.current_window_handle
iframe = browser.find_elements_by_tag_name("iframe")[0]
browser.switch_to_frame(iframe)
CheckBox = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID ,"recaptcha-anchor"))).click()
sleep(4)
audio = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID ,"recaptcha-audio-button"))).click()

To click() on the button to resolve the captcha through the audio as the desired elements are within an <iframe> so you have to:
Induce WebDriverWait for the desired frame to be available and switch to it.
Induce WebDriverWait for the desired element to be clickable.
You can use the following Locator Strategies:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.google.com/recaptcha/api2/demo")
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[src^='https://www.google.com/recaptcha/api2/anchor']")))
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span#recaptcha-anchor"))).click()
driver.switch_to.default_content()
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[title='recaptcha challenge']")))
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#recaptcha-audio-button"))).click()
Browser Snapshot:
Reference
Ways to deal with #document under iframe
Outro
You can find a couple of relevant discussions in:
How to click on the reCaptcha using Selenium and Java
CSS selector for reCaptcha checkbok using Selenium and vba excel
Find the reCAPTCHA element and click on it — Python + Selenium

Very useful, just put your attention, that text: 'recaptcha challenge' in selector below depends from regional settings/language:
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[title='recaptcha challenge']")))

"NoSuchElementException" error when searching for html element with selenium [duplicate]

I want to, click on the button to resolve the captcha through the audio, but selenium does not detect the specified "id".
browser.get("https://www.google.com/recaptcha/api2/demo")
mainWin = browser.current_window_handle
iframe = browser.find_elements_by_tag_name("iframe")[0]
browser.switch_to_frame(iframe)
CheckBox = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID ,"recaptcha-anchor"))).click()
sleep(4)
audio = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID ,"recaptcha-audio-button"))).click()

To click() on the button to resolve the captcha through the audio as the desired elements are within an <iframe> so you have to:
Induce WebDriverWait for the desired frame to be available and switch to it.
Induce WebDriverWait for the desired element to be clickable.
You can use the following Locator Strategies:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.google.com/recaptcha/api2/demo")
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[src^='https://www.google.com/recaptcha/api2/anchor']")))
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span#recaptcha-anchor"))).click()
driver.switch_to.default_content()
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[title='recaptcha challenge']")))
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#recaptcha-audio-button"))).click()
Browser Snapshot:
Reference
Ways to deal with #document under iframe
Outro
You can find a couple of relevant discussions in:
How to click on the reCaptcha using Selenium and Java
CSS selector for reCaptcha checkbok using Selenium and vba excel
Find the reCAPTCHA element and click on it — Python + Selenium

Very useful, just put your attention, that text: 'recaptcha challenge' in selector below depends from regional settings/language:
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe[title='recaptcha challenge']")))

How to click on the Next button with doPostBack() call to browse to the next page while fetching data with Selenium and Python?

I have written this code but its not going on the next page its fetching data from the same page repeatedly.
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver import ActionChains
url="http://www.4docsearch.com/Delhi/Doctors"
driver = webdriver.Chrome(r'C:\chromedriver.exe')
driver.get(url)
next_page = True
while next_page == True:
soup = BeautifulSoup(driver.page_source, 'html.parser')
div = soup.find('div',{"id":"ContentPlaceHolder1_divResult"})
for heads in div.find_all('h2'):
links = heads.find('a')
print(links['href'])
try:
driver.find_element_by_xpath("""//* [#id="ContentPlaceHolder1_lnkNext"]""").click()
except:
print ('No more pages')
next_page=False
driver.close()

To browse to the Next page as the desired element is a JavaScript enabled element with __doPostBack() you have to:
Induce WebDriverWait for the staleness_of() the element first.
Induce WebDriverWait for the element_to_be_clickable() the element next.
You can use the following Locator Strategies:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
driver = webdriver.Chrome(options=chrome_options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("http://www.4docsearch.com/Delhi/Doctors")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[#id='ContentPlaceHolder1_lnkNext' and not(#class='aspNetDisabled')]"))).click()
while True:
try:
WebDriverWait(driver, 20).until(EC.staleness_of((driver.find_element_by_xpath("//a[#id='ContentPlaceHolder1_lnkNext' and not(#class='aspNetDisabled')]"))))
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[#id='ContentPlaceHolder1_lnkNext' and not(#class='aspNetDisabled')]"))).click()
print ("Next")
except:
print ("No more pages")
break
print ("Exiting")
driver.quit()
Console Output
Next
Next
Next
.
.
.
No more pages
Exiting

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Issue while web scraping with Selenium and Python - python

Related

How to scrape Next button on Linkedin with Selenium using Python?

How to access the iframe using Selenium and Python on Glassdoor

Can you click the reCapatcha 'Audio Challenge' Button Using Selenium + Python? [duplicate]

"NoSuchElementException" error when searching for html element with selenium [duplicate]

How to click on the Next button with doPostBack() call to browse to the next page while fetching data with Selenium and Python?

Categories

Resources