Struggling to click the load more button with Selenium - python

I plan to build a scraper that'll utilize both Selenium and BeautifulSoup.
I'm struggling to click the "load more" button with Selenium. I've managed to detect the button, scroll to it, etc., but I can't seem to figure out a way to click it repeatedly.
Any suggestions on how to pass this hurdle?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time, requests
from bs4 import BeautifulSoup
def search_agent(zip):
    location = bot.find_element_by_name('hheroquotezip')
    time.sleep(3)
    location.clear()
    location.send_keys(zip)
    location.submit()

def load_all_agents():
    # click "load more" until no more results to load
    while True:
        try:
            #more_button = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'results.length'))).click()
            more_button = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="searchResults"]/div[3]/button'))).click()
        except TimeoutException:
            break
        # wait for results to load
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.seclection-result .partners-detail')))
    print("Complete")
    bot.quit()
#define Zip for search query
zip = 20855
bot = webdriver.Safari()
wait = WebDriverWait(bot, 10)
#fetch agents page
bot.get('https://www.erieinsurance.com/find-an-insurance-agent')
search_agent(zip)
load_all_agents()
With the above approach, the console spits out these errors:
[Error] Refused to load https://9203275.fls.doubleclick.net/activityi;src=9203275;type=agent0;cat=agent0;ord=7817740349177;gtm=2wg783;auiddc=373080108.1594822533;~oref=https%3A%2F%2Fwww.erieinsurance.com%2Ffind-an-insurance-agent-results%3Fzipcode%3D20855? because it does not appear in the frame-src directive of the Content Security Policy.
[Error] Refused to connect to https://api.levelaccess.net/analytics/3.0/results because it does not appear in the connect-src directive of the Content Security Policy.

Creating an answer to post a couple of images.
When I ran the attached script in Chrome, it worked fine.
When @furas did the same in Firefox, he had the same result.
I ran the same script 10 times back to back and I wasn't refused.
What I note based on the error is that the iframe seems browser-sensitive:
In Chrome this header contains Chromium scripts:
In Firefox it contains no scripts:
Have a look and see what you get manually in your Safari.
A simple answer might be to not use Safari - use Chrome or Firefox. Is that an option? (If it MUST be Safari, just say so and I'll look again.)
Finally, a couple of quick additional notes.
The site is using Angular, so you might want to consider Protractor if you're struggling with synchronisation. (Protractor helps with some script-syncing capabilities.)
Also worth a note: don't feel you have to land on the home page and then navigate as a user. Update your URL to the search results page, feed in the zip code, and save yourself some time:
https://www.erieinsurance.com/find-an-insurance-agent-results?zipcode=20855
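As a sketch, reusing the bot and load_all_agents from the question, that direct navigation would look something like:
zip_code = 20855
bot.get('https://www.erieinsurance.com/find-an-insurance-agent-results?zipcode={}'.format(zip_code))
load_all_agents()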
[edit/update]
Is this the same thing? https://github.com/SeleniumHQ/selenium/issues/458
A bug around "Content Security Policies" closed in 2016 - logged as an Apple thing.

Unable to Open .htm link with Selenium

I cannot open the link described in the picture with Selenium.
I have tried to find the element by css_selector, link text, partial link text, and xpath. Still no success: the program shows no error, but it does not click the last link. Here is the picture of the inspected code from the SEC website: Picture of Inspect Code. The lines that attempt to open it follow the ## Cannot open file comment below.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq  # needed for uReq() at the bottom
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

PATH = r"C:\Program Files (x86)\Misc Programs\chromedriver.exe"
stock = 'KO'
#stock = input("Enter stock ticker: ")
browser = webdriver.Chrome(PATH)

# First SEC search
sec_url = 'https://www.sec.gov/search/search.htm'
browser.get(sec_url)
tikSearch = browser.find_element_by_css_selector('#cik')
tikSearch.click()
tikSearch.send_keys(stock)
Sclick = browser.find_element_by_css_selector('#searchFormDiv > form > fieldset > span > input[type=submit]')
Sclick.click()
formDesc = browser.find_element_by_css_selector('#seriesDiv > table > tbody > tr:nth-child(2) > td:nth-child(1)')
print(formDesc)
doc = browser.find_element_by_css_selector('#documentsbutton')
doc.click()
## Cannot open file
form = browser.find_element_by_xpath('//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a')
form.click()
uClient = uReq(sec_url)
page_html = uClient.read()
On Firefox this worked and got https://www.sec.gov/Archives/edgar/data/21344/000002134421000018/a20201231crithrifplan.htm
Pasting that into Chrome directly also works.
But in the script, it indeed did not open and left one stuck at:
https://www.sec.gov/Archives/edgar/data/21344/000002134421000018/0000021344-21-000018-index.htm
where, oddly, clicking on the link by hand works in the browser that Selenium launched.
It's better with a wait, but if I put time.sleep(5) before your
form = browser.find_element_by_xpath('//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a')
it opens in Chrome.
EDIT: And here it is done properly with no sleep:
wait = WebDriverWait(browser, 20)
wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a'))).click()
This assumes you have the imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Possibly useful addition:
I am surprised there is no Selenium Test Helper out there with methods that wrap in some bulletproofing (or maybe there are and I do not know), like what Hetzner Cloud did in its Protractor Test Helper. So I wrote my own little wrapper method for the click (also for send keys, which calls this one). If it's useful to you or readers, enjoy. It could be enhanced to build in retries or take the wait time or whether to scroll the field into the top or bottom of the window (or at all) as parameters. It is working in my context as is.
def safe_click(driver, locate_method, locate_string):
    """
    Parameters
    ----------
    driver : webdriver
        initialized browser object
    locate_method : Locator
        By.something
    locate_string : string
        how to find it

    Returns
    -------
    WebElement
        returns whatever click() does.
    """
    wait = WebDriverWait(driver, 15)
    wait.until(EC.presence_of_element_located((locate_method, locate_string)))
    driver.execute_script("arguments[0].scrollIntoView(false);",
                          driver.find_element(locate_method, locate_string))
    return wait.until(EC.element_to_be_clickable((locate_method, locate_string))).click()
If you use it, then the call (which I just tested and it worked) would be:
safe_click(browser, By.XPATH, '//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a')
You could be using it elsewhere, too, but it does not seem like there is a need.
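For what it's worth, the send-keys variant mentioned above might look roughly like this (a sketch; the author's actual version is not shown in the answer):
def safe_send_keys(driver, locate_method, locate_string, text):
    # reuse safe_click to make sure the field is present, scrolled into view, and clickable
    safe_click(driver, locate_method, locate_string)
    # then type into the located field
    return driver.find_element(locate_method, locate_string).send_keys(text)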

How do I test every link on a webpage with Selenium using Python and pytest or Selenium Firefox IDE?

So I'm trying to learn Selenium for automated testing. I have the Selenium IDE and the WebDrivers for Firefox and Chrome, both in my PATH, on Windows. I've been able to get basic testing working, but this part of the testing is eluding me. I've switched to using Python because the IDE doesn't have enough features; you can't even click the back button.
I'm pretty sure this has been answered elsewhere but none of the recommended links provided an answer that worked for me. I've searched Google and YouTube with no relevant results.
I'm trying to find every link on a page, which I've been able to accomplish, even listing them; I would think this would be just a default test. I even got it to PRINT the text of each link, but when I try to click the link it doesn't work. I've tried waits of various sorts, including visibility_of_any_elements_located and time.sleep(5), before trying to click the link.
I've tried clicking the link after waiting with self.driver.find_element(By.LINK_TEXT, ("lnktxt")).click(). But none of that works. The code below does work: it lists the URL text, the URL, and the URL text again, the latter via a variable.
I guess I'm not sure how to get a variable into the By.LINK_TEXT or ...by_link_text call, assuming that would work. I figured if I got it into a variable I could reuse it. That worked for print but not for click().
I basically want to be able to load a page, list all links, click a link, go back and click the next link, etc.
The only post this site recommended that might be helpful was...
How can I test EVERY link on the WEBSITE with Selenium
But it's Java based and I've been trying to learn Python for the past month so I'm not ready to learn Java just to make this work. The IDE does not seem to have an easy option for this, or from all my searches it's not documented well.
Here is my current Selenium code in Python.
import pytest
import time
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

wait_time_out = 15

class TestPazTestAll2():
    def setup_method(self, method):
        self.driver = webdriver.Firefox()
        self.vars = {}

    def teardown_method(self, method):
        self.driver.quit()

    def test_pazTestAll(self):
        self.driver.get('https://poetaz.com/poems/')
        lnks = self.driver.find_elements_by_tag_name("a")
        print("Total Links", len(lnks))
        # traverse list
        for lnk in lnks:
            # get_attribute() to get all href
            print(lnk.get_attribute("text"))
            lnktxt = lnk.get_attribute("text")
            print(lnk.get_attribute("href"))
            print(lnktxt)
        # teardown_method quits the driver
Again, I'm sure I missed something in my searches but after hours of searching I'm reaching out.
Any help is appreciated.
I basically want to be able to load a page, list all links, click a link, go back and click the next link, etc.
I don't recommend doing this. Driving the browser with Selenium is slow, and you're not really using the browser for anything that actually requires one.
What I recommend is simply sending requests to those scraped links and asserting response status codes.
import requests
link_elements = self.driver.find_elements_by_tag_name("a")
urls = map(lambda l: l.get_attribute("href"), link_elements)
for url in urls:
    response = requests.get(url)
    assert response.status_code == 200
(You also might need to prepend some base url to those strings found in href attributes.)
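For example, one way to do that normalization with urllib.parse.urljoin (base_url here is just the page under test; hrefs that are already absolute pass through unchanged):
from urllib.parse import urljoin

base_url = "https://poetaz.com/poems/"
for url in urls:
    if not url:
        continue  # skip anchors without an href
    response = requests.get(urljoin(base_url, url))  # resolves relative hrefs
    assert response.status_code == 200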

no such element: Unable to locate element using chromedriver and Selenium in production environment

I have a problem with selenium chromedriver and I cannot figure out what's causing it. Some weeks ago everything was working OK, and suddenly this error started to show up.
The problem is coming from the following function.
def login_(browser):
    try:
        browser.get("some_url")
        # user credentials
        user = browser.find_element_by_xpath('//*[@id="username"]')
        user.send_keys(config('user'))
        password = browser.find_element_by_xpath('//*[@id="password"]')
        password.send_keys(config('pass'))
        login = browser.find_element_by_xpath('/html/body/div[1]/div/button')
        login.send_keys("\n")
        time.sleep(1)
        sidebar = browser.find_element_by_xpath('//*[@id="sidebar"]/ul/li[1]/a')
        sidebar.send_keys("\n")
        app_submit = browser.find_element_by_xpath('//*[@id="sidebar"]/ul/li[1]/ul/li[1]/a')
        app_submit.send_keys("\n")
    # a tuple is needed here; "except A or B" would only catch A
    except (TimeoutException, NoSuchElementException):
        raise LoginException
This function works with no problem in the development environment (macOS 10.11), but throws the following error in the production environment:
Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="sidebar"]/ul/li[1]/a"}
(Session info: headless chrome=67.0.3396.79)
(Driver info: chromedriver=2.40.565383 (76257d1ab79276b2d53ee97XXX),platform=Linux 4.4.0-116-generic x86_64)
I already updated both Chrome and chromedriver (v67 and 2.40, respectively) in each environment. I also gave it more time with time.sleep(15). But the problem persists. My latest guess is that maybe the initialization of the webdriver is not working properly:
def initiate_webdriver():
    option = webdriver.ChromeOptions()
    option.binary_location = config('GOOGLE_CHROME_BIN')
    option.add_argument('--disable-gpu')
    option.add_argument('window-size=1600,900')
    option.add_argument('--no-sandbox')
    if not config('DEBUG', cast=bool):
        display = Display(visible=0, size=(1600, 900))
        display.start()
        option.add_argument("--headless")
    else:
        option.add_argument("--incognito")
    return webdriver.Chrome(executable_path=config('CHROMEDRIVER_PATH'), chrome_options=option)
Because, if the Display is not working, the page may not contain the mentioned sidebar but some other button.
So my questions are: has anybody had a similar issue? Is there a way to know what page is showing at the time the driver is looking for such an element?
The element-not-found error is reported after you submit the login, so I think the login failed and the page redirected somewhere else. You can use the screenshot option to take a screenshot of the page and then see which page the driver loaded:
driver.save_screenshot("path/to/screen.jpeg")
You can also save the raw HTML code and inspect the same page.
Webdriver Screenshot
Using Selenium in Python to save a webpage on Firefox
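For instance, a minimal sketch of both captures, assuming an initialized webdriver named browser (the file names are arbitrary):
browser.save_screenshot("debug_page.png")  # what the page currently looks like
with open("debug_page.html", "w") as f:
    f.write(browser.page_source)  # raw HTML for offline inspection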
A couple of things as per the login_(browser) method:
As you have identified the Login button through:
login = browser.find_element_by_xpath('/html/body/div[1]/div/button')
rather than invoking send_keys("\n"), I would suggest taking help of the onclick() event through login.click() to mock clicking the Login button, as follows:
login = browser.find_element_by_xpath('/html/body/div[1]/div/button')
login.click()
Next, when you identify the sidebar, induce WebDriverWait for the element to be clickable, as follows:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="sidebar"]/ul/li[1]/a'))).click()
As you mentioned, your code block works perfectly in the macOS 10.11 environment but throws the error in the production environment (Linux); it is highly possible that different browsers render the HTML DOM differently on different OS architectures. So instead of an absolute xpath you should use a relative xpath, as follows:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[#attribute='value']"))).click()
A couple of things as per the initiate_webdriver() method:
As per Getting Started with Headless Chrome, the argument --disable-gpu is applicable only on Windows and is not a valid configuration for Linux. So you need to remove:
option.add_argument('--disable-gpu')
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Whenever I encounter strange issues in Selenium like this, I prefer retrying to find the particular element which is causing intermittent troubles. One way is to wrap it around a try-except block:
try:
    sidebar = browser.find_element_by_xpath('//*[@id="sidebar"]/ul/li[1]/a')
except NoSuchElementException:
    time.sleep(10)
    print("Unable to find element the first time; trying again")
    sidebar = browser.find_element_by_xpath('//*[@id="sidebar"]/ul/li[1]/a')
You could also put the try code in a loop with a suitable count variable to make the automation work (check this). In my experience with Java, this idea has resolved multiple issues.
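For example, a minimal retry-loop sketch along those lines (the attempt count and delay here are arbitrary choices):
def find_with_retries(browser, xpath, attempts=3, delay=10):
    # try locating the element several times before giving up
    for attempt in range(attempts):
        try:
            return browser.find_element_by_xpath(xpath)
        except NoSuchElementException:
            if attempt == attempts - 1:
                raise  # out of retries
            time.sleep(delay)

sidebar = find_with_retries(browser, '//*[@id="sidebar"]/ul/li[1]/a')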
You need to wait until the element is visible or else you will get this error. Try something like this:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.common.by import By
TIMEOUT = 5
...
xpath = '//*[@id="sidebar"]/ul/li[1]/a'
WebDriverWait(browser, TIMEOUT).until(visibility_of_element_located((By.XPATH, xpath)))
browser.find_element_by_xpath(xpath)
...

Python - How to Disable an Extension, After Opening a Chrome Window, with Selenium

I am trying to disable AdBlock for a specific website only, but I can't find a way to do it. I tried looking in the selenium documentation, but I couldn't find any methods to disable extensions afterward. However, I am still pretty new at reading documentation, so I may have missed something. I also tried to automate disabling the AdBlock extension using selenium, but it didn't work. The plan was to go to the extensions section of Chrome (chrome://extensions/), get the "enabled" checkbox, and click it without my intervention. Here is my attempt:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException
def main():
    opening = True
    while opening:
        try:
            chrome_options = Options()
            # Path to AdBlock
            chrome_options.add_extension('/usr/local/bin/AdBlock_v.crx')
            driver = webdriver.Chrome(chrome_options=chrome_options)
        except:
            print('An unknown error has occurred. Trying again...')
        else:
            opening = False
    disable_adblocker(driver)

def click_element(driver, xpath, index):
    getting = True
    not_found_times = 0
    while getting:
        try:
            getting = False
            element = WebDriverWait(driver, 5).until(
                EC.presence_of_all_elements_located((By.XPATH, xpath)))[index]
            element.click()
            #driver.get(element.get_attribute("href"))
        # In case the page does not load properly
        except TimeoutException:
            not_found_times += 1
            if not_found_times < 2:
                driver.refresh()
                getting = True
            else:
                raise
        # In case the DOM updates, which makes elements stale
        except StaleElementReferenceException:
            getting = True

def disable_adblocker(driver):
    driver.get('chrome://extensions')
    ad_blocker_xpath = '//div[@id="gighmmpiobklfepjocnamgkkbiglidom"]//div[@class="enable-controls"]//input'
    click_element(driver, ad_blocker_xpath, 0)
    print('')

main()
The reason my attempt failed is that selenium couldn't use the XPath I specified to get the checkbox element. I believe the path is correct.
The only solution I can think of is creating two chrome windows: one with AdBlock and another without it. However, I don't want two windows, as this will make things more complicated.
It doesn't look like this is possible using any settings in selenium. However... you can automate adding the domain you wish to exclude after creating the driver.
Before your test actually starts, but after you've initialized the browser, navigate to chrome-extension://[your AdBlock extension ID]/options.html. The AdBlock extension ID is unique to the crx file, so go into Chrome and find the value in the extension manager. For example, mine is gighmmpiobklfepjocnamgkkbiglidom.
After you've navigated to that page, click 'Customize', then 'Show ads everywhere except for these domains...', then input the domain into the field, then click 'OK'. Boom! Now the domain is added and will show ads! Just make sure
I know it's not the ideal quick, easy, one-line-of-code solution... but it seems like the best option, unless you want to go digging in the local storage files and find where this data is added.
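If you want to script those same steps rather than clicking by hand, a rough sketch might look like the following; note that the locators (the link texts, the input selector, the OK button) are assumptions you would need to verify against your version of the AdBlock options page:
# rough sketch: all locators below are assumptions; verify them on your options page
adblock_id = 'gighmmpiobklfepjocnamgkkbiglidom'  # replace with your extension ID
driver.get('chrome-extension://{}/options.html'.format(adblock_id))
wait = WebDriverWait(driver, 10)
wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'Customize'))).click()
wait.until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'Show ads everywhere except'))).click()
domain_input = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[type='text']")))
domain_input.send_keys('example.com')  # the domain you want to exclude
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[text()='OK']"))).click()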

scrape websites with infinite scrolling

I have written many scrapers, but I am not really sure how to handle infinite scrollers. These days most websites - Facebook, Pinterest, etc. - have infinite scrollers.
You can use selenium to scrape an infinitely scrolling website like Twitter or Facebook.
Step 1: Install Selenium using pip
pip install selenium
Step 2: Use the code below to automate infinite scroll and extract the source code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import sys
import unittest, time, re
class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"
        self.verificationErrors = []
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "/search?q=stckoverflow&src=typd")
        driver.find_element_by_link_text("All").click()
        for i in range(1, 100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')

if __name__ == "__main__":
    unittest.main()
Step 3: Print the data if required.
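As a refinement, rather than a fixed 100 scrolls you can stop once the page height stops growing. A minimal sketch (the pause length is an arbitrary choice):
def scroll_to_end(driver, pause=2):
    # keep scrolling until the document height stops increasing
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give new content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height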
Most sites that have infinite scrolling do (as Lattyware notes) have a proper API as well, and you will likely be better served by using this rather than scraping.
But if you must scrape...
Such sites are using JavaScript to request additional content from the site when you reach the bottom of the page. All you need to do is figure out the URL of that additional content and you can retrieve it. Figuring out the required URL can be done by inspecting the script, by using the Firefox Web console, or by using a debug proxy.
For example, open the Firefox Web Console, turn off all the filter buttons except Net, and load the site you wish to scrape. You'll see all the files as they are loaded. Scroll the page while watching the Web Console and you'll see the URLs being used for the additional requests. Then you can request that URL yourself and see what format the data is in (probably JSON) and get it into your Python script.
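Once you've spotted the request, you can fetch it directly. A sketch with a purely hypothetical endpoint and paging parameters (yours will differ):
import requests

# hypothetical endpoint discovered in the Web Console
url = "https://example.com/api/items"
params = {"offset": 20, "limit": 20}  # paging parameters are site-specific
response = requests.get(url, params=params)
data = response.json()  # most such endpoints return JSON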
Finding the URL of the AJAX source will be the best option, but it can be cumbersome for certain sites. Alternatively, you could use a headless browser like QWebKit from PyQt and send keyboard events while reading the data from the DOM tree. QWebKit has a nice and simple API.
