I'm attempting to run a Selenium script locally and open three non-headless browsers. I'm using multiprocessing Pools (and have also tried plain multiprocessing) and have come across an interesting issue: three browser sessions open, but only the first one actually navigates to the target_url and attempts control. The other two just sit and wait and do nothing.
Here is the relevant execution code:
run_id = str(uuid.uuid4())
options = Options()
#options.binary_location = '/opt/headless-chromium' #works for lambda
start_time = time.time()
options.binary_location = '/usr/bin/google-chrome' #testing
#options.add_argument('--headless') don't need headless
options.add_argument('--no-sandbox')
options.add_argument('--verbose')
#options.add_argument('--single-process')
options.add_argument('--user-data-dir=/tmp/user-data')# test add
options.add_argument('--data-path=/tmp/data-path')
options.add_argument('--disk-cache-dir=/tmp/cache-dir')
options.add_argument('--homedir=/tmp')
#options.add_argument('--disable-gpu')# test add
#options.add_argument("--remote-debugging-port=9222") test remove
#options.add_argument('--disable-dev-shm-usage')
#'/opt/chromedriver' not found
logger.info("Before driver initiated")
# job_id = event['job_id']
# run_id = event['run_id']
send_log(job_id, run_id, "JOB START", True, "", time.time() - start_time)
retries = 0
drivers = []
try:
    #driver = webdriver.Chrome('/opt/chromedriver_89', chrome_options=options)
    driver = webdriver.Chrome('/opt/chromedriver90', chrome_options=options)
    #driver2 = webdriver.Chrome('/opt/chromedriver90', chrome_options=options)
    break
except Exception as e:
    print(str(e))
    logger.info('exception with driver instantiation, retrying... ' + str(e))
    #time.sleep(5)
    driver = None
driver = webdriver.Chrome('/opt/chromedriver', chrome_options=options)
....
And here is how I'm invoking each process:
from multiprocessing import Pool
pool = Pool(processes=3)
for i in range(3):
    pool.apply_async(invoke, args=("https://macau-flash-sale.myshopify.com/",))
pool.close()
pool.join()
Is it possible that despite the multiple processes, selenium is not communicating with the other two browser instances properly?
Any ideas are greatly appreciated!
I don't think selenium can handle multiple browsers at once. I recommend writing 3 different python scripts and running them all at once through the terminal. If they need to communicate information to each other, probably the easiest way is to just get them to write to a text file.
For anyone else struggling with this, Selenium CAN be used with async/multiprocessing.
But you cannot specify the same user-data and data-path directories for every session. I had to remove these parameters so that each Chrome session creates unique directories for itself.
Just remove the lines below and it will work:
options.add_argument('--user-data-dir=/tmp/user-data')# test add
options.add_argument('--data-path=/tmp/data-path')
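If a custom profile location is still needed, one option is to give each worker its own directory instead of dropping the flag altogether. The following is only a minimal sketch: unique_profile_args is a hypothetical helper (it is not part of Selenium), and it uses just the standard library to create the directories.

```python
import os
import tempfile


def unique_profile_args(prefix="chrome-profile-"):
    """Return Chrome arguments that point at a fresh, private profile directory.

    Hypothetical helper: call it once per worker process so that no two
    Chrome sessions share the same --user-data-dir.
    """
    profile_dir = tempfile.mkdtemp(prefix=prefix)
    cache_dir = os.path.join(profile_dir, "cache")
    os.makedirs(cache_dir, exist_ok=True)
    return [
        "--user-data-dir=" + profile_dir,
        "--disk-cache-dir=" + cache_dir,
    ]


if __name__ == "__main__":
    # Three workers, three distinct directories.
    for worker in range(3):
        print(worker, unique_profile_args())
```

Inside invoke, each returned string would be passed to options.add_argument(...) before the driver is created. Cleaning up the temporary directories afterwards is left out of this sketch.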
Related
I am trying to use explicit waits with Undetected Chromedriver (v2). Rather than executing the statements as soon as the element has loaded, it appears to pause until the wait time expires.
When I use the normal selenium chromedriver everything works as expected ("opt-in" is closed in 1-2 seconds) and when I use sleeps instead of waits the statements are executed much quicker.
Can anyone see the problem?
Here's the code:
class My_Chrome(uc.Chrome):
    def __del__(self):
        pass
options = uc.ChromeOptions()
arguments = [
    '--log-level=3', '--no-first-run', '--no-service-autorun', '--password-store=basic',
    '--start-maximized',
    '--window-size=1920, 1080',
    '--credentials_enable_service=False',
    '--profile.password_manager_enabled=False',
    '--add_experimental_option("detach", True)'
]
for argument in arguments:
    options.add_argument(argument)
driver = My_Chrome(options=options)
wait = WebDriverWait(driver, 20)
driver.get('https://www.oddschecker.com')
try:
    opt_in = wait.until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Not Now']/..")))
    VirtualClick(driver, opt_in)
    current_time('Closing opt-in')
except:
    pass
I'm trying to web scrape this page https://www.bumeran.com.pe/empleos-publicacion-menor-a-7-dias.html
I have code that collects the URLs I need to scrape. However, the sequential version takes about 2 hours to run, so I want to optimize it. I read that I could use multi-threading, but I'm having trouble implementing it.
My simplified code is as follows:
import concurrent.futures
import time
from selenium import webdriver
options = webdriver.ChromeOptions()
#options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
PATH = r"C:\Program Files (x86)\chromedriver.exe"
wd = webdriver.Chrome(PATH, options=options)
wd.maximize_window()
def info(url):
    wd.get(url)
    try:
        wd.implicitly_wait(20)
        titulo = wd.find_element_by_xpath("//h1").text
        empresa = wd.find_element_by_class_name('FichaAvisoSubHeader__Company-sc-1poii24-2.bkFDjf').text
        detalle = wd.find_element_by_class_name('FichaAviso__DescripcionAviso-sc-b2a7hd-11.kjUXkd').text
        print('URL completada')
    except Exception as ex:
        titulo=''
        empresa=''
        detalle=''
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(ex).__name__, ex.args)
        print(message)
    vtitulo.append(titulo)
    vempr.append(empresa)
    vdetalle.append(detalle)
vtitulo = []
vempr = []
vdetalle = []
vurl = ['https://www.bumeran.com.pe/empleos/asistente-contable-exp.-en-concar-ssp-1114585777.html',
'https://www.bumeran.com.pe/empleos/asesor-a-comercial-digital-de-seguro-vehicular-1114584904.html',
'https://www.bumeran.com.pe/empleos/mecanico-de-mantenimiento-arequipa-1114585709.html',
'https://www.bumeran.com.pe/empleos/almacenero-l.o.-electronics-s.a.c.-1114585629.html',
'https://www.bumeran.com.pe/empleos/analista-de-comunicaciones-ingles-avanzado-teleperformance-peru-s.a.c.-1114564863.html',
'https://www.bumeran.com.pe/empleos/vendedores-adn-retail-s.a.c.-1114585422.html',
'https://www.bumeran.com.pe/empleos/especialista-de-intervencion-de-proyectos-mondelez-international-1114585461.html',
'https://www.bumeran.com.pe/empleos/desarrollador-java-senior-inetum-peru-1114584840.html',
'https://www.bumeran.com.pe/empleos/practicante-legal-coes-sinac-1114584788.html',
'https://www.bumeran.com.pe/empleos/concurso-publico-n-143-especialista-en-presupuesto-banco-central-de-reserva-del-peru-1114584538.html']
start_time = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(info, vurl)
print("--- %i seconds ---" % (time.time() - start_time))
For some URLs I get a StaleElementReferenceException, so I think the problem is that it runs so fast that the page doesn't finish loading. I've tried introducing waits, but that doesn't seem to work. I also don't know whether I'm using threading correctly: I thought it would open 4 Chrome browsers, but it doesn't seem to.
Any help is welcomed :)
Thanks in advance and sorry if it's a dumb question
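One likely reason only a single browser appears is that every thread shares the one module-level wd, so the four workers drive the same window and navigate away from each other's pages. A common pattern is to give each thread its own driver. The sketch below shows that pattern with threading.local. Note that make_driver and FakeDriver are hypothetical stand-ins so the example runs without a browser; in a real script the factory would build and return webdriver.Chrome(...) instead.

```python
import concurrent.futures
import threading

# Each thread sees its own independent attributes on this object.
_thread_state = threading.local()


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver (records visited URLs)."""

    def __init__(self):
        self.visited = []

    def get(self, url):
        self.visited.append(url)


def make_driver():
    # Assumption: a real script would create webdriver.Chrome(...) here.
    return FakeDriver()


def get_driver():
    """Return this thread's driver, creating it on first use."""
    driver = getattr(_thread_state, "driver", None)
    if driver is None:
        driver = make_driver()
        _thread_state.driver = driver
    return driver


def info(url):
    driver = get_driver()
    driver.get(url)
    # Return the result instead of appending to shared lists, so the
    # output order always matches the input order.
    return url, threading.get_ident()


def scrape(urls, max_workers=4):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(info, urls))
```

Calling scrape(vurl) would return one tuple per URL in input order. Quitting each driver when its thread is finished is left out of this sketch.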
I have a crawling process that kicks off selenium in a custom class that looks like this:
class BrowserInterface:
    def __init__(self, base_url, proxy_settings):
        self.base_url = base_url
        self.display = Display(visible=0, size=(1024, 768))
        self.display.start()
        proxy_argument = '--proxy-server={0}'.format(PROXY_URL.format(
            proxy_settings.get('proxy_host'),
            proxy_settings.get('proxy_port')
        ))
        logger.debug(proxy_argument)
        options = webdriver.ChromeOptions()
        options.add_argument('--no-sandbox')
        options.add_argument(proxy_argument)
        selenium_chrome_driver_path = os.path.join(settings.DEFAULT_DRIVER_PATH,
                                                   settings.CHROME_DRIVERS[settings.CURRENT_OS])
        self.driver = webdriver.Chrome(executable_path=selenium_chrome_driver_path, chrome_options=options)
    def visit(self, url):
        url = urljoin(self.base_url, url)
        self.driver.get(url)
    def body(self):
        soup = BeautifulSoup(self.driver.page_source)
        return soup.find("body").text
    def quit(self):
        self.driver.quit()
        self.display.stop()
This BrowserInterface class is initialized in a batch queue, and the quit() method is called at the end of the batch. There are no issues starting Chrome and getting the data. The trouble is that at the end of each job, when quit() is called, Chrome goes into zombie mode. When the next BrowserInterface is initialized, it starts a new Chrome instance, so the box eventually runs out of memory. I've also tried running a kill command on the Chrome process, but it stays running. Any direction would be greatly appreciated, as I'm about to pull my hair out over this.
Running on Ubuntu 18.04, Google Chrome 70.0.3538.110, ChromeDriver 2.44, Python3.6.6
Thanks in advance!
From your code trials it is evident that you invoke self.driver.quit(), which should have worked perfectly.
However, as the box is still running out of memory due to zombie Chrome processes, running a kill command is the right approach, and you can add the following solution within the quit() method:
from selenium import webdriver
import psutil
driver = webdriver.Chrome()
driver.get('http://google.com/')
PROCNAME = "chrome" # to clean up zombie Chrome browser
#PROCNAME = "chromedriver" # to clean up zombie ChromeDriver
for proc in psutil.process_iter():
    # check whether the process name matches
    if proc.name() == PROCNAME:
        proc.kill()
Please see: https://stackoverflow.com/a/49756925
Creating bash as the root process allows for better 'zombie handling'. Python isn't meant to run as the top-level process, which is what causes the zombies.
I am using Selenium to execute the script "window.performance.timing" to get the full load time of a page. It can run without opening a browser. I want to keep this running 24x7 and report the loading time.
Here is my code:
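import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
import getpass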
u=getpass.getuser()
print(u)
# # initialize Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('user-data-dir=C:\\Users\\%s\\AppData\\Local\\Google\\Chrome\\User Data'%(u))
source = "https://na66.salesforce.com/5000y00001SgXm0?srPos=0&srKp=500"
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(source)
driver.find_element_by_id("Login").click()
while True:
    try:
        navigationStart = driver.execute_script("return window.performance.timing.navigationStart")
        domComplete = driver.execute_script("return window.performance.timing.domComplete")
        loadEvent = driver.execute_script("return window.performance.timing.loadEventEnd")
        onloadPerformance = loadEvent - navigationStart
        print("%s--It took about: %s" % (time.ctime(), onloadPerformance))
        driver.refresh()
    except TimeoutException:
        print("It took too long")
driver.quit()
I have two questions:
Is it a good idea to keep refreshing the page and printing the page loading time? Does it carry any risk?
Is there anything in my code that needs improvement?
When I searched Google for suggestions, someone recommended Docker and Jenkins, but that would mean downloading more things. This code will eventually be packaged into an exe for others to use, so it would be good if it did not require many software packages.
Thank you very much; I am a newcomer to web development, and any suggestions will be appreciated.
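One detail worth guarding against in a loop like this: in the Navigation Timing API, loadEventEnd stays 0 until the load event has finished, so subtracting too early yields a large negative number rather than a load time. A minimal sketch of the arithmetic, assuming a hypothetical helper load_time_ms that receives the timing values as a plain dict, however they were collected:

```python
def load_time_ms(timing):
    """Return loadEventEnd - navigationStart in milliseconds.

    Returns None when either value is missing or still 0, which means the
    page has not finished loading yet.
    """
    start = timing.get("navigationStart", 0)
    end = timing.get("loadEventEnd", 0)
    if not start or not end:
        return None
    return end - start


if __name__ == "__main__":
    finished = {"navigationStart": 1600000000000, "loadEventEnd": 1600000002350}
    pending = {"navigationStart": 1600000000000, "loadEventEnd": 0}
    print(load_time_ms(finished))  # 2350
    print(load_time_ms(pending))   # None
```

In the loop above, a None result would be a signal to wait and read the values again before calling driver.refresh().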
TL;DR: Right now it launches 2 browsers but only runs the test in 1. What am I missing?
So I'm trying to get selenium hub working on a mac (OS X 10.11.5). I installed with this, then launch hub in a terminal tab with:
selenium-standalone start -- -role hub
Then in another tab of terminal on same machine register a node.
selenium-standalone start -- -role node -hub http://localhost:4444/grid/register -port 5556
It shows up in console with 5 available firefox and chrome browsers.
So here's my code. In a file named globes.py I have this.
class globes:
    def __init__(self, number):
        self.number = number
    base_url = "https://fake-example.com"
    desired_cap = []
    desired_cap.append({'browserName':'chrome', 'javascriptEnabled':'true', 'version':'', 'platform':'ANY'})
    desired_cap.append({'browserName':'firefox', 'javascriptEnabled':'true', 'version':'', 'platform':'ANY'})
    selenium_server_url = 'http://127.0.0.1:4444/wd/hub'
Right now I'm just trying to run a single test that looks like this.
import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from globes import *
class HeroCarousel(unittest.TestCase):
    def setUp(self):
        for driver_instance in globes.desired_cap:
            self.driver = webdriver.Remote(
                command_executor=globes.selenium_server_url,
                desired_capabilities=driver_instance)
        self.verificationErrors = []
    def test_hero_carousel(self):
        driver = self.driver
        driver.get(globes.base_url)
        hero_carousel = driver.find_element(By.CSS_SELECTOR, 'div.carousel-featured')
        try: self.assertTrue(hero_carousel.is_displayed())
        except AssertionError as e: self.verificationErrors.append("home_test: Hero Carousel was not visible")
    def tearDown(self):
        self.driver.close()
        self.assertEqual([], self.verificationErrors)
if __name__ == "__main__":
    unittest.main()
Right now it launches both Firefox and Chrome, but only runs the test in Firefox. Chrome opens and just sits on a blank page and doesn't close. So I figure there's something wrong with how I wrote the test. What am I missing? I apologize if this is obvious, but I'm just learning how to set up the hub and only learned enough Python to write Selenium tests a couple of weeks ago.
I think the hub is working, since it launches both browsers, but I did try adding a second node on the same machine on a different port and got the same result. Just in case, here's what the hub prints out.
INFO - Got a request to create a new session: Capabilities [{browserName=chrome, javascriptEnabled=true, version=, platform=ANY}]
INFO - Available nodes: [http://192.168.2.1:5557]
INFO - Trying to create a new session on node http://192.168.2.1:5557
INFO - Trying to create a new session on test slot {seleniumProtocol=WebDriver, browserName=chrome, maxInstances=5, platform=MAC}
INFO - Got a request to create a new session: Capabilities [{browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]
INFO - Available nodes: [http://192.168.2.1:5557]
INFO - Trying to create a new session on node http://192.168.2.1:5557
INFO - Trying to create a new session on test slot {seleniumProtocol=WebDriver, browserName=firefox, maxInstances=5, platform=MAC}
Forgive me if I am way off, as I haven't actually worked with Selenium; this answer is based purely on the fact that setUp only keeps a reference to the last driver it creates.
Instead of keeping one self.driver you need a list of all drivers, let's say self.drivers. Then, when dealing with them, instead of driver = self.driver you would write for driver in self.drivers: and indent all the relevant code into the for loop, something like this:
class HeroCarousel(unittest.TestCase):
    def setUp(self):
        self.drivers = [] #could make this with list comprehension
        for driver_instance in globes.desired_cap:
            driver = webdriver.Remote(
                command_executor=globes.selenium_server_url,
                desired_capabilities=driver_instance)
            self.drivers.append(driver)
        self.verificationErrors = []
    def test_hero_carousel(self):
        for driver in self.drivers:
            driver.get(globes.base_url)
            hero_carousel = driver.find_element(By.CSS_SELECTOR, 'div.carousel-featured')
            try: self.assertTrue(hero_carousel.is_displayed())
            except AssertionError as e: self.verificationErrors.append("home_test: Hero Carousel was not visible")
    def tearDown(self):
        for driver in self.drivers:
            driver.close()
        self.assertEqual([], self.verificationErrors)
You need to use self.driver.quit(), because close() only closes the current window and leaves the browser running.
Otherwise you will soon end up with multiple browsers running, and you will have to pay for them.
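Combining the two answers above, tearDown could call quit() on every driver in the list and keep going even if one of them fails, so that no session is left behind. A minimal sketch, assuming a hypothetical quit_all helper and a FakeDriver stand-in so that it runs without a grid:

```python
class FakeDriver:
    """Hypothetical stand-in for a WebDriver session."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.quit_called = False

    def quit(self):
        self.quit_called = True
        if self.fail:
            raise RuntimeError(self.name + " failed to quit")


def quit_all(drivers):
    """Call quit() on every driver; collect errors instead of stopping early."""
    errors = []
    for driver in drivers:
        try:
            driver.quit()
        except Exception as exc:
            errors.append(str(exc))
    return errors


if __name__ == "__main__":
    sessions = [FakeDriver("chrome", fail=True), FakeDriver("firefox")]
    print(quit_all(sessions))
    print([d.quit_called for d in sessions])
```

In the test case, tearDown would call quit_all(self.drivers) and could then assert that the returned list is empty.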