Python Selenium. How to use driver.set_page_load_timeout() properly?

from selenium import webdriver

driver = webdriver.Chrome()
driver.set_page_load_timeout(7)

def urlOpen(url):
    try:
        driver.get(url)
        print(driver.current_url)
    except:
        return
Then I have a list of URLs and call the method above:
if __name__ == "__main__":
    urls = ['http://motahari.ir/', 'http://facebook.com', 'http://google.com']
    # It doesn't print anything
    # urls = ['http://facebook.com', 'http://google.com', 'http://motahari.ir/']
    # This prints https://www.facebook.com/ https://www.google.co.kr/?gfe_rd=cr&dcr=0&ei=3bfdWdzWAYvR8geelrqQAw&gws_rd=ssl
    for url in urls:
        urlOpen(url)
The problem is that when 'http://motahari.ir/' throws a TimeoutException, 'http://facebook.com' and 'http://google.com' always throw a TimeoutException as well.
The browser keeps waiting for 'motahari.ir/' to load, but the loop just goes on (it doesn't actually open 'facebook.com'; it keeps waiting for 'motahari.ir/') and keeps throwing timeout exceptions.
Initializing a webdriver instance takes a long time, so I pulled it out of the method, and I think that is what caused the problem. So, should I reinitialize the webdriver instance whenever there's a TimeoutException? And how? (Since I initialized the driver outside of the function, I can't reinitialize it in the except block.)
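For reference, one way to do what the question describes (recreating the driver after a timeout even though it was created at module level) is to declare it global inside the handler. This is only a sketch of that idea, not necessarily the best fix:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.set_page_load_timeout(7)

def urlOpen(url):
    global driver
    try:
        driver.get(url)
        print(driver.current_url)
    except TimeoutException:
        # Throw away the stuck session and start a fresh one
        driver.quit()
        driver = webdriver.Chrome()
        driver.set_page_load_timeout(7)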

You will just need to clear the browser's cookies before continuing. (Sorry, I missed seeing this in your previous code)
from selenium import webdriver

driver = webdriver.Chrome()
driver.set_page_load_timeout(7)

def urlOpen(url):
    try:
        driver.get(url)
        print(driver.current_url)
    except:
        driver.delete_all_cookies()
        print("Failed")
        return

urls = ['http://motahari.ir/', 'https://facebook.com', 'https://google.com']

for url in urls:
    urlOpen(url)
Output:
Failed
https://www.facebook.com/
https://www.google.com/?gfe_rd=cr&dcr=0&ei=o73dWfnsO-vs8wfc5pZI
P.S. It is not very wise to do try...except... without a specific exception type, as this might mask different unexpected errors.
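Following that advice, a narrower handler could catch Selenium's TimeoutException explicitly. A small sketch of the same urlOpen with a named exception type, keeping the cookie-clearing step from the answer above:

from selenium.common.exceptions import TimeoutException

def urlOpen(url):
    try:
        driver.get(url)
        print(driver.current_url)
    except TimeoutException:
        # Only page-load timeouts are handled here; any other error still surfaces
        driver.delete_all_cookies()
        print("Failed")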

Related

get() missing 1 required positional argument: 'url'

I keep getting the error "get() missing 1 required positional argument: 'url'" when running the following code:
import selenium.webdriver as webdriver

def get_results(search_term):
    url = "https://www.google.com"
    browser = webdriver.Chrome
    browser.get(url)

    search_box = browser.find_element_by_class_name('gLFyf gsfi')
    search_box.send_keys(search_term)
    search_box.submit()

    try:
        links = browser.find_element_by_xpath('//ol[@class="web_regular_results"]//h3//a')
    except:
        links = browser.find_element_by_xpath('//h3//a')

    results = []
    for link in links:
        href = link.get_attribute('href')
        print(href)
        results.append(href)

    browser.close()
    return results

get_results('dog')
The code is supposed to return the search results for 'dog' from Google, but it gets stuck on
browser.get(url)
All help is appreciated.
The issue is in the assignment of browser: browser = webdriver.Chrome. It needs to be browser = webdriver.Chrome().
In your code you are not assigning an instance of the Chrome webdriver to browser, but the class itself. So when you call get(url), your url argument is bound to the self parameter and no value is supplied for url, hence the missing-positional-argument error.
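To illustrate the difference, a minimal sketch (assuming chromedriver is available on PATH):

from selenium import webdriver

# This binds the class object, not a driver; calling .get(url) on it then
# treats url as `self` and complains about the missing 'url' argument.
broken = webdriver.Chrome

# This starts a browser session and returns an instance whose get() takes a URL.
browser = webdriver.Chrome()
browser.get("https://www.google.com")
browser.quit()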
You should modify your code a bit.
Change the import statement:
from selenium import webdriver
Also, you need to create a Chrome driver instance and provide the path to the chromedriver executable:
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")

How to end a session in Selenium and start a new one?

I'm trying to quit the browser session and start a new one when I hit an exception. Normally I wouldn't do this, but in this specific case it seems to make sense.
def get_info(url):
    browser.get(url)
    try:
        # get page data
        business_type_x = '//*[@id="page-desc"]/div[2]/div'
        business_type = browser.find_element_by_xpath(business_type_x).text
        print(business_type)
    except Exception as e:
        print(e)
        # new session
        browser.quit()
        return get_info(url)
This results in this error: http.client.RemoteDisconnected: Remote end closed connection without response
I expected it to open a new browser window with a new session. Any tips are appreciated. Thanks!
You need to create the driver object again once you quit it. Initiate the driver in the get_info method again.
You can replace webdriver.Firefox() with whatever driver you are using.
def get_info(url):
    browser = webdriver.Firefox()
    browser.get(url)
    try:
        # get page data
        business_type_x = '//*[@id="page-desc"]/div[2]/div'
        business_type = browser.find_element_by_xpath(business_type_x).text
        print(business_type)
    except Exception as e:
        print(e)
        # new session
        browser.quit()
        return get_info(url)
You can also use the close method instead of quit, so that you do not have to recreate the browser object.
def get_info(url):
    browser.get(url)
    try:
        # get page data
        business_type_x = '//*[@id="page-desc"]/div[2]/div'
        business_type = browser.find_element_by_xpath(business_type_x).text
        print(business_type)
    except Exception as e:
        print(e)
        # new session
        browser.close()
        return get_info(url)
The difference between quit and close can be found in the documentation as well: quit, close.
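As a quick illustration of that difference (a sketch, assuming geckodriver is available):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://example.com")
# close() closes only the current window; if other windows or tabs are open,
# the WebDriver session keeps running.
browser.close()

browser = webdriver.Firefox()
browser.get("https://example.com")
# quit() closes every window and ends the WebDriver session and driver process;
# the browser object must be recreated before further use.
browser.quit()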
This error message...
http.client.RemoteDisconnected: Remote end closed connection without response
...implies that the WebDriver instance, i.e. browser, was unable to communicate with the browsing context, i.e. the web browser session.
If your use case is to keep trying to invoke the same url in a loop until the desired element is located, you can use the following solution:
from selenium.common.exceptions import NoSuchElementException

def get_info(url):
    while True:
        browser.get(url)
        try:
            # get page data
            business_type_x = '//*[@id="page-desc"]/div[2]/div'
            business_type = browser.find_element_by_xpath(business_type_x).text
            print(business_type)
            break
        except NoSuchElementException as e:
            print(e)
            continue
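If an unbounded retry loop is a concern, the same idea can be capped. A sketch with an assumed maximum of five attempts, reusing the question's browser object:

from selenium.common.exceptions import NoSuchElementException

def get_info(url, max_attempts=5):
    for attempt in range(max_attempts):
        browser.get(url)
        try:
            business_type_x = '//*[@id="page-desc"]/div[2]/div'
            business_type = browser.find_element_by_xpath(business_type_x).text
            print(business_type)
            return business_type
        except NoSuchElementException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
    return None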

ReadTimeoutError when opening multiple webdrivers in selenium

My problem arises when multiple instances of the Selenium webdriver get launched. I tried several things, like changing the method of requesting and running with and without headless mode, but the problem still remains. My program tries to parallelize the Selenium webdriver and automate web interaction. Could somebody please help me resolve this issue, either by handling the error or by changing the code so that the error does not occur anymore? Thanks in advance.
if url:
    options = Options()
    # options.headless = True
    options.set_preference('dom.block_multiple_popups', False)
    options.set_preference('dom.popup_maximum', 100000000)
    driver = webdriver.Firefox(options=options)
    driver.set_page_load_timeout(30)
    pac = dict()
    try:
        # driver.get(url)
        # driver.execute_script('''window.location.href = '{0}';'''.format(url))
        driver.execute_script('''window.location.replace('{0}');'''.format(url))
        WebDriverWait(driver, 1000).until(lambda x: self.onload(pac, driver))
        pac['code'] = 200
    except ReadTimeoutError as ex:
        pac['code'] = 404
        print("Exception has been thrown. " + str(ex))
    return pac
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=61322): Read timed out. (read timeout=)
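One defensive pattern when many drivers are launched in parallel is to make sure each driver is quit even when navigation fails, so stale browser and driver processes do not accumulate across workers. A rough sketch only, with the ReadTimeoutError import path taken from the traceback above and everything else assumed:

from urllib3.exceptions import ReadTimeoutError
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

def fetch(url):
    options = Options()
    driver = webdriver.Firefox(options=options)
    driver.set_page_load_timeout(30)
    pac = {}
    try:
        driver.get(url)
        pac['code'] = 200
    except ReadTimeoutError as ex:
        pac['code'] = 404
        print("Exception has been thrown. " + str(ex))
    finally:
        # Always tear down the session, even after a timeout, so parallel
        # runs do not leak browser processes.
        driver.quit()
    return pac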

Can't make my script fetch desired content using proxies

I've written a script in Python in combination with Selenium, using proxies, to get the text of the different links populated upon navigating to a url, as in this one. What I want to parse from there is the visible text connected to each link.
The script I've tried so far is capable of producing new proxies when the start_script() function is called within it. The problem is that the very url leads me to this redirected link. I can get rid of this redirection only if I keep trying until a proxy that satisfies the url is found. My current script can only try twice, with two new proxies.
How can I use a loop within the get_texts() function so that it keeps trying with new proxies until it parses the required content?
My attempt so far:
import requests
import random
from itertools import cycle
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'http://www.google.com/search?q=python'

def get_proxies():
    response = requests.get('https://www.us-proxy.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
    return proxies

def start_script():
    proxies = get_proxies()
    random.shuffle(proxies)
    proxy = next(cycle(proxies))
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f'--proxy-server={proxy}')
    driver = webdriver.Chrome(chrome_options=chrome_options)
    return driver

def get_texts(url):
    driver = start_script()
    driver.get(url)
    if "index?continue" not in driver.current_url:
        for item in [items.text for items in driver.find_elements_by_tag_name("h3")]:
            print(item)
    else:
        get_texts(url)

if __name__ == '__main__':
    get_texts(link)
The code below works well for me; however, it can't help you with bad proxies. It loops through the list of proxies and tries each one until it succeeds or the list runs out.
It prints which proxy it is using, so you can see that it tries more than once.
However, as https://www.us-proxy.org/ points out:

What is Google proxy? Proxies that support searching on Google are called Google proxy. Some programs need them to make large number of queries on Google. Since year 2016, all the Google proxies are dead. Read that article for more information.

Article:

Google Blocks Proxy in 2016. Google shows a page to verify that you are a human instead of the robot if a proxy is detected. Before the year 2016, Google allows using that proxy for some time if you can pass this human verification.
from contextlib import contextmanager
import random
from bs4 import BeautifulSoup
import requests
from selenium import webdriver

def get_proxies():
    response = requests.get('https://www.us-proxy.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
    random.shuffle(proxies)
    return proxies

# Only need to fetch the proxies once
PROXIES = get_proxies()

@contextmanager
def proxy_driver():
    try:
        proxy = PROXIES.pop()
        print(f'Running with proxy {proxy}')
        chrome_options = webdriver.ChromeOptions()
        # chrome_options.add_argument("--headless")
        chrome_options.add_argument(f'--proxy-server={proxy}')
        driver = webdriver.Chrome(options=chrome_options)
        yield driver
    finally:
        driver.close()

def get_texts(url):
    with proxy_driver() as driver:
        driver.get(url)
        if "index?continue" not in driver.current_url:
            return [items.text for items in driver.find_elements_by_tag_name("h3")]
        print('recaptcha')

if __name__ == '__main__':
    link = 'http://www.google.com/search?q=python'

    while True:
        links = get_texts(link)
        if links:
            break

    print(links)
while True:
    driver = start_script()
    driver.get(url)
    if "index?continue" in driver.current_url:
        continue
    else:
        break
This will loop until index?continue is not in the url, and then break out of the loop.
This answer only addresses your specific question - it doesn't address the problem that you might be creating a large number of web drivers, but you never destroy the unused / failed ones. Hint: you should.
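Following that hint, a small sketch of the same loop with failed drivers quit before retrying (start_script() and url taken from the question, the rest assumed):

while True:
    driver = start_script()
    driver.get(url)
    if "index?continue" not in driver.current_url:
        break  # this proxy worked; keep the driver and use it below
    # This proxy was redirected to the captcha page; quit the session
    # so failed browsers do not pile up before the next attempt.
    driver.quit()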

How to handle TimeoutException in selenium, python

First of all, I created several functions to use instead of the default find_element_by_... methods, and a login() function to create the browser. This is how I use them:
def login():
    browser = webdriver.Firefox()
    return browser

def find_element_by_id_u(browser, element):
    try:
        obj = WebDriverWait(browser, 10).until(
            lambda browser: browser.find_element_by_id(element)
        )
        return obj

#########
driver = login()
find_element_by_id_u(driver, 'the_id')
Now I run these tests through Jenkins (and launch them on a virtual machine). If a TimeoutException occurs, the browser session is not killed, and I have to go to the VM manually and kill the Firefox process. Jenkins will not stop its job while the browser process is active.
So I faced this problem, and I expect it can be resolved with exception handling.
I tried to add this to my custom functions, but it's not clear where exactly the exception occurred. Even if I get a line number, it points to my custom function, not to the place where it was called:
def find_element_by_id_u(browser, element):
    try:
        obj = WebDriverWait(browser, 1).until(
            lambda browser: browser.find_element_by_id(element)
        )
        return obj
    except TimeoutException, err:
        print "Timeout Exception for element '{elem}' using find_element_by_id\n".format(elem = element)
        print traceback.format_exc()
        browser.close()
        sys.exit(1)

#########
driver = login()
driver.get(host)
find_element_by_id_u(driver, 'jj_username').send_keys('login' + Keys.TAB + 'passwd' + Keys.RETURN)
This prints the line number of the string "lambda browser: browser.find_element_by_id(element)", which is useless for debugging. In my case I have nearly 3000 lines, so I need the proper line number.
Can you please share your experience with me?
PS: I divided my program into a few scripts, one of which contains only the Selenium part; that's why I need the login() function, to call it from another script and use the returned object there.
Well, after spending some time thinking it over, I've found a proper solution.
def login():
    browser = webdriver.Firefox()
    return browser

def find_element_by_id_u(browser, element):
    # Let any TimeoutException propagate to the caller
    obj = WebDriverWait(browser, 10).until(
        lambda browser: browser.find_element_by_id(element)
    )
    return obj

#########
try:
    driver = login()
    find_element_by_id_u(driver, 'the_id')
except TimeoutException:
    print traceback.format_exc()
    driver.close()
    sys.exit(1)
It was so obvious that I missed it :(
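Since the original problem was the Firefox process surviving a failure and keeping the Jenkins job alive, a try/finally around the test body is another way to guarantee cleanup. A sketch in Python 3 syntax, with the find_element_by_id_u helper and the element id taken from the question and everything else assumed:

import sys
import traceback

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
try:
    find_element_by_id_u(driver, 'the_id')  # helper from the question
except TimeoutException:
    traceback.print_exc()
    sys.exit(1)
finally:
    # Runs on success, on TimeoutException, and even after sys.exit(),
    # so the Firefox/geckodriver processes never outlive the job.
    driver.quit()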
