I'm trying to scrape a website with Selenium, but I think it is blocking the access in several ways.
The error message shown is: "selenium.common.exceptions.NoSuchWindowException: Message: Browsing context has been discarded", but sometimes an error appears saying that the page load timed out.
Furthermore, Firefox consumes a huge share of CPU and memory when loading this page.
I've already tried changing the user agent and running it headlessly, but with no results.
Below is the code:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://www.bet365.com/#/HO/')
matches = browser.find_elements_by_class_name('him-Fixture')
browser.quit()
Any tips on how to bypass this?
Sometimes the browser loads the page slowly, so add a time.sleep() call to your code.
Example:
from selenium import webdriver
import time
browser = webdriver.Firefox()
browser.get('https://www.bet365.com/#/HO/')
time.sleep(5)
matches = browser.find_elements_by_class_name('him-Fixture')
browser.quit()
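If a fixed sleep turns out to be flaky, a minimal sketch of an explicit wait instead (assuming the him-Fixture class name from the question is correct and the site lets the page finish loading at all):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get('https://www.bet365.com/#/HO/')

# wait up to 30 seconds for at least one fixture element to appear,
# instead of sleeping for a fixed amount of time
wait = WebDriverWait(browser, 30)
matches = wait.until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'him-Fixture'))
)

browser.quit()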
I am using selenium and python in order to scrape data on a website.
The problem is I need to manually log in because there is a CAPTCHA after the login.
My question is the following: is there a way to start the program on a page that is already loaded? (For example, here I would log in to the website, solve the CAPTCHA manually, and then launch the program that would scrape the data.)
Note: I have already looked for an answer on SO but did not find one; I might have missed it, as it seems like an obvious question.
Don't open the browser in headless mode; open it in headed (non-headless) mode.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.headless = False # Set false here
driver = webdriver.Chrome(options=options, executable_path=r'C:\path\to\chromedriver.exe')
driver.get("http://google.com/")
print ("Headless Chrome Initialized")
time.sleep(30) # wait 30 seconds, this should give enough time to manually do the capture
# do other code here
driver.quit()
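If 30 seconds is not always enough, one hedged alternative is to block the script until you confirm the CAPTCHA has been solved, for example with input() (the login URL below is just a placeholder):
from selenium import webdriver

driver = webdriver.Chrome()  # headed mode is the default
driver.get("https://example.com/login")  # placeholder login URL

# log in and solve the CAPTCHA manually in the opened window,
# then come back to the terminal and press Enter to continue
input("Press Enter once you are logged in...")

# ... continue scraping with the already-authenticated session ...
driver.quit()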
I plan to build a scraper that'll utilize both Selenium and BeautifulSoup.
I'm struggling to click the "load more" button with Selenium. I've managed to detect the button, scroll to it, etc., but I can't seem to figure out a way to keep clicking it.
Any suggestions on how to get past this hurdle?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time, requests
from bs4 import BeautifulSoup
def search_agent(zip):
    location = bot.find_element_by_name('hheroquotezip')
    time.sleep(3)
    location.clear()
    location.send_keys(zip)
    location.submit()

def load_all_agents():
    # click "load more" until there are no more results to load
    while True:
        try:
            #more_button = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'results.length'))).click()
            more_button = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="searchResults"]/div[3]/button'))).click()
        except TimeoutException:
            break
    # wait for results to load
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.seclection-result .partners-detail')))
    print("Complete")
    bot.quit()

# define the zip code for the search query
zip = 20855

bot = webdriver.Safari()
wait = WebDriverWait(bot, 10)

# fetch the agents page
bot.get('https://www.erieinsurance.com/find-an-insurance-agent')
search_agent(zip)
load_all_agents()
With the above approach, the console spits out these errors:
[Error] Refused to load https://9203275.fls.doubleclick.net/activityi;src=9203275;type=agent0;cat=agent0;ord=7817740349177;gtm=2wg783;auiddc=373080108.1594822533;~oref=https%3A%2F%2Fwww.erieinsurance.com%2Ffind-an-insurance-agent-results%3Fzipcode%3D20855? because it does not appear in the frame-src directive of the Content Security Policy.
[Error] Refused to connect to https://api.levelaccess.net/analytics/3.0/results because it does not appear in the connect-src directive of the Content Security Policy.
Creating an answer to post a couple of observations.
When I ran the attached script in Chrome it worked fine, and when @furas did the same in Firefox he had the same result. I ran the same script 10 times back to back and I wasn't refused.
What I note, based on the error, is that the iframe seems browser-sensitive: in Chrome its header contains Chromium scripts, while in Firefox it contains no scripts. Have a look and see what you get manually in your Safari.
A simple answer might be to not use Safari - use Chrome or Firefox. Is that an option? (If it MUST be Safari, just say so and I'll look again.)
Finally, a couple of quick additional notes.
The site is using Angular, so you might want to consider Protractor if you're struggling with synchronisation (Protractor helps with some script-syncing capabilities).
Also worth noting: you don't have to land on the home page and then navigate like a user. Point your URL at the search results page, feed in the zip code, and save yourself some time:
https://www.erieinsurance.com/find-an-insurance-agent-results?zipcode=20855
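A minimal, untested sketch of that shortcut, reusing the zip variable and the load_all_agents() helper from the question:
# go straight to the results page instead of submitting the search form
bot.get('https://www.erieinsurance.com/find-an-insurance-agent-results?zipcode={}'.format(zip))

# then keep clicking "load more" until nothing is left
load_all_agents()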
[edit/update]
Is this the same thing? https://github.com/SeleniumHQ/selenium/issues/458
It's a bug around "Content Security Policies" that was closed in 2016 and logged as an Apple issue.
I have been trying to open multiple browser windows in Internet Explorer using WebDriver in Selenium. Once it reaches the get(url) line, it just halts there and eventually times out. I've added a print line, which does not execute. I've tried various methods, and the one below is the IE version of the code I used to open multiple tabs in Chrome. Even if I remove the first 3 lines, it still only goes as far as opening google.com. I've googled this issue and looked through other posts, but nothing has helped. Would really appreciate any advice, thanks!
options = webdriver.IeOptions()
options.add_additional_option("detach", True)
driver = webdriver.Ie(options = options, executable_path=r'blahblah\IEDriverServer.exe')
driver.get("http://google.com")
print("syrfgf")
driver.execute_script("window.open('about:blank', 'tab2');")
driver.switch_to.window("tab2")
driver.get("http://yahoo.com")
You need to replace the URL you have provided:
http://google.com
with a properly formed URL as follows:
https://www.google.com/
i.e. including the scheme, the www subdomain and the trailing slash, as per the standard URL syntax.
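A minimal sketch of the original IE snippet with fully qualified URLs (the IEDriverServer path is a placeholder):
from selenium import webdriver

options = webdriver.IeOptions()
options.add_additional_option("detach", True)
driver = webdriver.Ie(options=options, executable_path=r'C:\path\to\IEDriverServer.exe')

# fully qualified URLs, including the scheme and the www subdomain
driver.get("https://www.google.com/")
driver.execute_script("window.open('about:blank', 'tab2');")
driver.switch_to.window("tab2")
driver.get("https://www.yahoo.com/")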
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://github.com")

# open the sign-in page
signin_link = driver.find_element(By.LINK_TEXT, "Sign in")
signin_link.click()
time.sleep(1)

# fill in the credentials and submit the form
user = driver.find_element(By.ID, "login_field")
user.send_keys("X")
passw = driver.find_element(By.ID, "password")
passw.send_keys("X")
passw.submit()

time.sleep(5)
driver.close()
I had this issue, and writing this code seems to have made it work flawlessly. Adjust the sleep times as you like. Putting my chromedriver.exe into my project folder also helped with some errors.
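For what it's worth, a minimal sketch of pointing Selenium at a chromedriver.exe kept in the project folder (relative path, assuming the script runs from that folder):
from selenium import webdriver

# chromedriver.exe sits next to this script, so a relative path is enough
driver = webdriver.Chrome(executable_path='./chromedriver.exe')
driver.get("https://github.com")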
I've come across a problematic page that causes Selenium Chrome (Selenium version 3.10.0 in Python 3, chromedriver version 2.35.528157) on macOS to time out, I think because something on the page loads indefinitely. The problem is that after that timeout, all future requests to the driver to .get() a new URL also fail with a timeout, even if they worked before. In fact, observing the browser, it is never sent to the new URL. This, of course, renders the browser useless for further sessions.
How can I "reset" the driver so that I can carry on using it? Or, failing that, how can I debug why the .get() command doesn't seem to work after visiting the problematic page? The code and my output are below (the problematic page is http://coastalpathogens.wordpress.com/2012/11/25/onezoom/): I'd be interested to know whether other people see the same thing, and with other pages too.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
browser = webdriver.Chrome()
browser.set_page_load_timeout(10)
browser.implicitly_wait(1)
for link in ("http://www.google.com", "http://coastalpathogens.wordpress.com/2012/11/25/onezoom/", "http://www.google.com"):
    try:
        print("getting {}".format(link))
        browser.get(link)
        print("done!")
    except TimeoutException:
        print("Timed out")
        continue
result:
getting http://www.google.com
done!
getting http://coastalpathogens.wordpress.com/2012/11/25/onezoom/
Timed out
getting http://www.google.com
Timed out
As per your question and your own code block, I have executed your code after tweaking a few ChromeDriver settings through the chrome.options class as below, and it works perfectly:
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
browser = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
browser.set_page_load_timeout(10)
for link in ("http://www.google.com", "http://coastalpathogens.wordpress.com/2012/11/25/onezoom/", "http://www.google.com"):
    try:
        print("getting {}".format(link))
        browser.get(link)
        print("done!")
    except TimeoutException:
        print("Timed out")
        continue
Console Output:
getting http://www.google.com
done!
getting http://coastalpathogens.wordpress.com/2012/11/25/onezoom/
done!
getting http://www.google.com
done!
Issue at your end and the solution
There are a couple of things you need to consider:
Unless your use case has a constraint on the page load timeout, you should not use set_page_load_timeout(): on slow networks, while invoking URLs such as http://coastalpathogens.wordpress.com/2012/11/25/onezoom/, the browser client may need more than 10 seconds (i.e. the time configured through set_page_load_timeout(10)) to send document.readyState equal to "complete" back to Selenium.
If your use case does depend on a page load timeout, catch the exception and invoke quit() to shut down gracefully, as follows:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome(executable_path=r'C:\path\to\chromedriver.exe')
driver.set_page_load_timeout(2)
try:
    driver.get("https://www.booking.com/hotel/in/the-taj-mahal-palace-tower.html?label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaGyIAQGYATG4AQbIAQzYAQHoAQH4AQKSAgF5qAID;sid=338ad58d8e83c71e6aa78c67a2996616;dest_id=-2092174;dest_type=city;dist=0;group_adults=2;hip_dst=1;hpos=1;room1=A%2CA;sb_price_type=total;srfid=ccd41231d2f37b82d695970f081412152a59586aX1;srpvid=c71751e539ea01ce;type=total;ucfs=1&#hotelTmpl")
    print("URL successfully Accessed")
    driver.quit()
except TimeoutException:
    print("Page load Timeout Occurred. Quitting !!!")
    driver.quit()
Console Output:
Page load Timeout Occurred. Quitting !!!
You can find a detailed discussion on set_page_load_timeout() in How to set the timeout of 'driver.get' for python selenium 3.8.0?
Consider replacing implicitly_wait() with an explicit wait. Modern websites use JavaScript, Ajax calls and frameworks such as React, which is where WebDriverWait comes into play, and you shouldn't mix implicitly_wait() with WebDriverWait().
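For example, a minimal sketch of an explicit wait in place of implicitly_wait() (the Google search box locator here is just an illustrative assumption):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.set_page_load_timeout(10)
browser.get("http://www.google.com")

# wait up to 10 seconds for a specific element instead of a blanket implicit wait
search_box = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.NAME, "q"))
)
search_box.send_keys("selenium")
browser.quit()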
I am having a strange issue with PhantomJS, or maybe I am just a newbie. I am trying to log in to NewEgg.com via Selenium using PhantomJS, with Python. The issue is that when I use Firefox as the driver it works well, but as soon as I set PhantomJS as the driver it does not go to the next page and instead gives this message:
Exception Message: u'{"errorMessage":"Unable to find element with id \'UserName\'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"89","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:55372","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"POST","post":"{\\"using\\": \\"id\\", \\"sessionId\\": \\"aaff4c40-6aaa-11e4-9cb1-7b8841e74090\\", \\"value\\": \\"UserName\\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/aaff4c40-6aaa-11e4-9cb1-7b8841e74090/element"}}' ; Screenshot: available via screen
The reason, I found after taking a screenshot, is that PhantomJS could not navigate to the page and the script just finished. How do I sort this out? The code snippet I tried is given below:
import requests
from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver
import datetime
my_username = "user@mail.com"
my_password = "password"
driver = webdriver.PhantomJS('/Setups/phantomjs-1.9.7-macosx/bin/phantomjs')
firefox_profile = webdriver.FirefoxProfile()
#firefox_profile.set_preference('permissions.default.stylesheet', 2)
firefox_profile.set_preference('permissions.default.image', 2)
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
#driver = webdriver.Firefox(firefox_profile)
driver.set_window_size(1120, 550)
driver.get('http://newegg.com')
driver.find_element_by_link_text('Log in or Register').click()
driver.save_screenshot('screen.png')
I even added a sleep, but it does not make any difference.
I experienced this with PhantomJS when the content type of the second page is not correct. A normal browser would just interpret the content dynamically, but Phantom just dies, silently.