Hi, I'm trying to open multiple Chrome instances in parallel from Python using WebDriver and multiprocessing.
The processes start and the browser windows open fine, but the drivers never end up in my "instance" list, so I can't access them afterwards. Please help. Here is my code:
from selenium import webdriver
from multiprocessing import Process
import time

num = 3
process = [None] * num
instance = [None] * num

def get():
    for i in range(num):
        try:
            instance[i].get("https://www.youtube.com")
        except:
            print("Can't connect to the driver")
            time.sleep(1)
            get()

def create_instance(i):
    instance[i] = webdriver.Chrome()

if __name__ == '__main__':
    for i in range(num):
        process[i] = Process(target = create_instance, args = [i])
        process[i].start()
    for i in range(num):
        process[i].join()
    get()
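Each child process works on its own copy of the "instance" list, so the driver that create_instance assigns there never reaches the parent process. Sending the driver back is not an option either: multiprocessing would have to pickle the WebDriver object, which fails with an obscure error. Instead of passing the object, pass the class and build the driver inside the new process.
However, the parent then has no direct access to the driver instances; to control them you would have to send messages to the worker processes.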
from selenium import webdriver
from multiprocessing import Process
import time

num = 3
process = [None] * num

def get(id, Driver):
    driver = Driver()
    driver.get(f"https://www.google.com?id={id}")
    time.sleep(10)
    driver.close()

if __name__ == '__main__':
    for i in range(num):
        process[i] = Process(target=get, args = [i, webdriver.Chrome])
        process[i].start()
    for i in range(num):
        process[i].join()
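One way to keep controlling a driver that lives in another process is to send it plain, picklable commands through a queue rather than sharing the driver object. The following is only a sketch under assumptions: browser_worker, make_chrome, start_workers, and the ("get", url) / ("quit", None) command tuples are names invented for this illustration, not part of Selenium.

```python
import multiprocessing as mp


def make_chrome():
    # Imported lazily so the driver is only created inside the worker process.
    from selenium import webdriver
    return webdriver.Chrome()


def browser_worker(commands, results, driver_factory=make_chrome):
    """Own one driver and execute (action, argument) tuples read from `commands`."""
    driver = driver_factory()
    try:
        while True:
            action, argument = commands.get()
            if action == "quit":
                break
            if action == "get":
                driver.get(argument)
                # Only picklable data (here a string) is sent back to the parent.
                results.put(driver.title)
    finally:
        driver.quit()


def start_workers(num):
    """Start `num` worker processes; return (process, commands, results) triples."""
    workers = []
    for _ in range(num):
        commands, results = mp.Queue(), mp.Queue()
        proc = mp.Process(target=browser_worker, args=(commands, results))
        proc.start()
        workers.append((proc, commands, results))
    return workers
```

From code guarded by if __name__ == '__main__':, the parent could call start_workers(3), then commands.put(("get", "https://www.youtube.com")) and results.get() for each worker, and finish with commands.put(("quit", None)) followed by proc.join().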
I found this simple example demonstrating how to use threading to parallelize opening multiple chrome sessions with selenium.
from selenium import webdriver
import threading
import time

def test_logic():
    driver = webdriver.Chrome()
    url = 'https://www.google.de'
    driver.get(url)
    # Implement your test logic
    time.sleep(2)
    driver.quit()

N = 5 # Number of browsers to spawn
thread_list = list()
# Start test
for i in range(N):
    t = threading.Thread(name='Test {}'.format(i), target=test_logic)
    t.start()
    time.sleep(1)
    print(t.name + ' started!')
    thread_list.append(t)
# Wait for all threads to complete
for thread in thread_list:
    thread.join()
print('Test completed!')
I tested it and it works. However if I modify the test_logic function to include a variable, i.e. j:
def test_logic(j):
    driver = webdriver.Chrome()
    url = 'https://www.google.de'
    driver.get(url)
    # Implement your test logic
    time.sleep(j)
    driver.quit()
and the corresponding part of threading to:
t = threading.Thread(name='Test {}'.format(i), target=test_logic(i))
the code stops working in parallel and just runs sequentially.
I don't know what I have overlooked, so I would be very grateful for any advice. Many thanks!
target=test_logic(i) calls test_logic immediately, in the main thread, and hands its return value (None) to the Thread as target. That is why the browsers run one after another.
You probably want:
t = threading.Thread(name='Test {}'.format(i), target=test_logic, args=[i])
where target is the function itself, and args is the list of arguments to call it with.
If your function takes two arguments, like def test_logic(a, b), args should contain two values.
More info in the Python Thread documentation.
You have to pass the arguments to the function as below:
t = threading.Thread(name='Test {}'.format(i), target=test_logic, args=(i,))
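The difference can be checked without a browser. In this small sketch, job is a hypothetical stand-in for test_logic that records the name of the thread it ran in; with target=job and args=(...) each call runs in its own named thread.

```python
import threading
import time


def job(j, log):
    # Stand-in for the browser work; records which thread actually ran it.
    time.sleep(0.01 * j)
    log.append((j, threading.current_thread().name))


log = []
threads = []
for i in range(3):
    # Pass the function itself plus its arguments; do not call it here.
    t = threading.Thread(name='Test {}'.format(i), target=job, args=(i, log))
    t.start()
    threads.append(t)

# Wait for all threads to complete
for t in threads:
    t.join()

# Every entry carries a 'Test N' thread name, none of them 'MainThread'.
print(sorted(log))
```

Had the thread been built with target=job(i, log), the call would have happened in the main thread before the Thread object even existed, and the thread itself would have had nothing to run.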
I'm trying to build a multithreaded Selenium scraper. Let's say I want to get 100,000 websites and print their page sources, using 20 ChromeDriver instances. So far, I have the following code:
from queue import Queue
from threading import Thread
from selenium import webdriver
from numpy.random import randint

selenium_data_queue = Queue()
worker_queue = Queue()

# Start 20 ChromeDriver instances
worker_ids = list(range(20))
selenium_workers = {i: webdriver.Chrome() for i in worker_ids}
for worker_id in worker_ids:
    worker_queue.put(worker_id)

def selenium_task(worker, data):
    # Open website
    worker.get(data)
    # Print website page source
    print(worker.page_source)

def selenium_queue_listener(data_queue, worker_queue):
    while True:
        url = data_queue.get()
        worker_id = worker_queue.get()
        worker = selenium_workers[worker_id]
        # Assign current worker and url to your selenium function
        selenium_task(worker, url)
        # Put the worker back into the worker queue as it has completed its task
        worker_queue.put(worker_id)
        data_queue.task_done()
    return

if __name__ == '__main__':
    selenium_processes = [Thread(target=selenium_queue_listener,
                                 args=(selenium_data_queue, worker_queue)) for _ in worker_ids]
    for p in selenium_processes:
        p.daemon = True
        p.start()
    # Adding urls to the data queue
    # Generating dummy urls just for testing
    for i in range(100000):
        d = f'http://www.website.com/{i}'
        selenium_data_queue.put(d)
    # Wait for all selenium queue listening processes to complete
    selenium_data_queue.join()
    # Tearing down web workers
    for b in selenium_workers.values():
        b.quit()
My question is: if any ChromeDriver abruptly shuts down (i.e. with a non-recoverable exception like InvalidSessionIdException), is there a way to remove it from the worker queue and insert a new ChromeDriver in its place, so that I still have 20 usable instances? If so, is there a good practice for accomplishing it?
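One possible approach, sketched here under assumptions rather than offered as an established practice, is to catch the fatal exception around each task, discard the broken driver, and create a new one under the same worker id before the id goes back into the queue. The function name run_with_replacement and the make_driver and fatal_errors parameters are hypothetical.

```python
from queue import Queue


def run_with_replacement(worker_queue, workers, url, make_driver, fatal_errors):
    """Fetch `url` with a free driver; swap the driver out if its session died."""
    worker_id = worker_queue.get()
    try:
        workers[worker_id].get(url)
        return workers[worker_id].page_source
    except fatal_errors:
        # The session is gone: discard the old driver and start a fresh one
        # under the same id, so the pool keeps its original size.
        try:
            workers[worker_id].quit()
        except Exception:
            pass  # the browser may already be dead
        workers[worker_id] = make_driver()
        return None  # the caller may put the url back on the data queue
    finally:
        # Return the id to the pool whether or not the task succeeded.
        worker_queue.put(worker_id)
```

With Selenium, fatal_errors could be a tuple such as (InvalidSessionIdException,) imported from selenium.common.exceptions, make_driver could be webdriver.Chrome, and workers would be the selenium_workers dictionary from the question.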
I am writing a Tkinter program with a button that starts 6 Selenium Chrome instances, each with a different URL. The goal is that the drivers are initiated as fast as possible and that every driver instance reloads its specific URL every 4 seconds. Every driver is attached to an object that holds the wanted URL. I tried this with threads:
import tkinter as tk
import threading
import os
import time
import random, string  # needed for the session_file names below
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

browsers = []

class Browser:
    def __init__(self, url, session_file):
        self.url = url
        self.session_file = session_file

def manipulate_browser(browser):
    browser.driver = webdriver.Chrome(ChromeDriverManager().install())
    while True:
        browser.driver.get(browser.url)
        time.sleep(4)

def start_browsers():
    for browser in browsers:
        browser.thread = threading.Thread(target=manipulate_browser, args=(browser,))
        browser.thread.start()

if __name__=='__main__':
    lock = threading.Lock()
    threads = []
    urls = 'https://google.com', 'https://facebook.com', 'https://instagram.com', 'https://snapchat.com', 'https://stackoverflow.com', 'https://amazon.com', 'https://microsoft.com'#, 'https://stackoverflow.com', 'https://youtube.com', 'https://yahoo.com'
    for url in urls:
        session_file = 'session_' + ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(10))
        newBrowser = Browser(url, session_file)
        browsers.append(newBrowser)
    root = tk.Tk()
    button_start_browsers = tk.Button(root, command=start_browsers, width=50, height=4, bg='red', text='Start Browsers')
    button_start_browsers.pack()
This works just fine, but I want to add some options and capabilities to the driver in the manipulate_browser function, like so:
import tkinter as tk
import threading
import os
import time
import random, string
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from multiprocessing.pool import ThreadPool

browsers = []

class Browser:
    def __init__(self, url, session_file):
        self.url = url
        self.session_file = session_file

def manipulate_browser(browser):
    global lock
    with lock:
        caps = DesiredCapabilities().CHROME
        caps["pageLoadStrategy"] = "none"
        chrome_options = webdriver.ChromeOptions()
        current_dir = os.path.dirname(os.path.abspath(__file__))
        os.mkdir(current_dir + '\\' + browser.session_file)
        chrome_options.add_argument(r'--user-data-dir=' + current_dir + '\\' + browser.session_file + '\\selenium')
        chrome_options.add_argument("--window-size=750,750")
    browser.driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options, desired_capabilities=caps)
    while True:
        browser.driver.get(browser.url)
        time.sleep(4)

def start_browsers():
    for browser in browsers:
        browser.thread = threading.Thread(target=manipulate_browser, args=(browser,))
        browser.thread.start()

if __name__=='__main__':
    lock = threading.Lock()
    threads = []
    urls = 'https://google.com', 'https://facebook.com', 'https://instagram.com', 'https://snapchat.com', 'https://stackoverflow.com', 'https://amazon.com', 'https://microsoft.com'#, 'https://stackoverflow.com', 'https://youtube.com', 'https://yahoo.com'
    for url in urls:
        session_file = 'session_' + ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(10))
        newBrowser = Browser(url, session_file)
        browsers.append(newBrowser)
    root = tk.Tk()
    button_start_browsers = tk.Button(root, command=start_browsers, width=50, height=4, bg='red', text='Start Browsers')
    button_start_browsers.pack()
After this change it no longer works reliably: sometimes I get the error selenium.common.exceptions.WebDriverException: Message: unknown error: cannot parse internal JSON template: Line: 1, column: 1, Unexpected token, and all the threads seem to work on just one driver (the last one). I believe the threads interfere with each other, so I need a lock, but putting the line browser.driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options, desired_capabilities=caps) under the lock would drastically reduce the startup speed that I need.
I also understand that a ThreadPool could avoid the interference; I tried it, but it froze the GUI. Another option others suggest is a queue with threads, which I am not very familiar with. I also considered multiprocessing, but as I understand it the maximum number of processes depends on the machine, and I may want to start more than that. How can I work around this to achieve my goal? What is the best way?
(Note: I've done plenty of research on this issue, but I still couldn't figure it out. Any help is welcome!)
EDIT:
After more testing, I figured out that the line chrome_options.add_argument(r'--user-data-dir=' + current_dir + '\\' + browser.session_file + '\\selenium') may be causing all the problems. This line keeps all the Chrome sessions under one icon, a feature I really like and would rather keep. Knowing this, how can I overcome the issue? Any help would be appreciated!
I have a list of article titles and ids that are used to generate the urls of the articles and scrape the contents. I'm using multiprocessing.Pool to parallelize the work. Here's my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from article import Article
from signal import signal, SIGTERM
import multiprocessing as mp
import sys

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.binary_location = '*path*\chrome.exe'
driver = webdriver.Chrome(executable_path="chromedriver", chrome_options=chrome_options)

def get_article(args):
    title, id, q = args
    article = Article.from_url('https://*url*/article/{}'.format(id), driver, title=title, id=id)
    print('parsed article: ', title)
    q.put(article.to_json())

def file_writer(q):
    with open('data/articles.json', 'w+') as file:
        while True:
            line = q.get()
            if line == 'END':
                break
            file.write(line + '\n')
            file.flush()

if __name__ == '__main__':
    manager = mp.Manager()
    queue = manager.Queue()
    pool_size = mp.cpu_count() - 2
    pool = mp.Pool(pool_size)
    writer = mp.Process(target=file_writer, args=(queue,))
    writer.start()
    with open('data/article_list.csv', 'r') as article_list:
        article_list_with_queue = [(*line.split('|'), queue) for line in article_list]
        pool.map(get_article, article_list_with_queue)
    queue.put('END')
    pool.close()
    pool.join()
    driver.close()
The code executes fine, but after it is finished I have about 80 child processes in PyCharm.exe. Most are chrome.exe, some - chromedriver.exe.
I tried to put
signal(SIGTERM, terminate)
in the worker function and quit the drivers in terminate(), but that doesn't work.
You can create a .bat file to kill all the processes:
@echo off
rem just kills stray local chromedriver.exe instances.
rem useful if you are trying to clean your project, and your ide is complaining.
taskkill /im chromedriver.exe /f
And run it after all the tests.
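Killing processes afterwards treats the symptom. A plausible cause, stated here as an assumption because it depends on the platform and the multiprocessing start method, is that the module-level driver = webdriver.Chrome(...) runs again in every worker process when the module is re-imported, and those drivers are never quit. The sketch below has each batch of work create its own driver and always quit it; scrape_batch, make_driver, and parse are hypothetical names.

```python
def scrape_batch(article_ids, make_driver, parse):
    """Open one driver, parse every id with it, and always quit the driver."""
    driver = make_driver()
    try:
        return [parse(driver, article_id) for article_id in article_ids]
    finally:
        # quit() ends the whole session, including the chromedriver process;
        # close() only closes the current window.
        driver.quit()
```

With a pool, the id list could be split into chunks and each chunk handed to scrape_batch, so that every driver a worker opens is also shut down by that worker.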
I'm incredibly new to separating modules. I have a long Python script that I want to split into different files by class and run collectively in the same browser instance/window. The reason is that all the tests rely on being logged into the same session. I'd like to do a universal setUp, then log in, and then pull in the different tests one after another.
Folder structure is:
ContentCreator
- main.py
- __init__.py
- Features
  - login.py
  - pytest.py
  - __init__.py
Here is my code:
login.py
import unittest
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
import json

driver = webdriver.Chrome()

class logIn(unittest.TestCase):
    @classmethod
    def test_login(self):
        """Login"""
        driver.get("sign_in_url")
        # load username and pw through a json file
        with open('path/to/file.json', 'r') as f:
            config = json.load(f)
        # login
        driver.find_element_by_id("email").click()
        driver.find_element_by_id("email").clear()
        driver.find_element_by_id("email").send_keys(config['user']['name'])
        driver.find_element_by_id("password").click()
        driver.find_element_by_id("password").clear()
        driver.find_element_by_id("password").send_keys(config['user']['password'])
        driver.find_element_by_id("submit").click()
        time.sleep(3)
        print("You are Logged In!")
pytest.py
import time
import unittest
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from displays import DISPLAY_TYPES, DISPLAY_NAMES

driver = webdriver.Chrome()
#driver.get("url")

class createContent(unittest.TestCase):
    @classmethod
    def test_add_required(self):
        """Test adding all the required fields across all sites:"""
        for i in range(1):
            """This is the number of each type of article that will be created."""
            for i in range(1):
                """This is how many different article types that will be created."""
                print("create new content")
                time.sleep(1)
                driver.find_element_by_link_text("Content").click()
                time.sleep(1)
                driver.find_element_by_link_text("Create New").click()
                print("select a display type:")
                display = DISPLAY_TYPES
                display_type = driver.find_element_by_id(display[i])
                display_type.click()
                names = (DISPLAY_NAMES[i])
                print(names, " created and saved successfully!")

    @classmethod
    def tearDownClass(cls):
        # close the browser window
        driver.quit()

    def is_element_present(self, how, what):
        """
        Helper method to confirm the presence of an element on page
        :params how: By locator type
        :params what: locator value
        """
        try:
            driver.find_element(by=how, value=what)
        except NoSuchElementException:
            return False
        return True
main.py
import unittest
from HtmlTestRunner import HTMLTestRunner
from features.login import logIn
from features.pytest import createContent
login_script = unittest.TestLoader().loadTestsFromTestCase(logIn)
add_pytest = unittest.TestLoader().loadTestsFromTestCase(createContent)
# create a test suite combining all tests
test_suite = unittest.TestSuite([login_script, add_pytest])
# create output
runner = HTMLTestRunner(output='Test Results')
# run the suite
runner.run(test_suite)
When I run the above code, it opens two browser sessions and only the login script gets executed. The test fails due to not finding the elements used in the next script.
EDIT:
Alfonso Jimenez or anyone else, here's what I have so far...
Folder structure:
- Files
- singleton.py
- singleton2.py
New Singleton code...
singleton.py:
from robot.api import logger
from robot.utils import asserts
from selenium import webdriver

class Singleton(object):
    instance = None

    def __new__(cls, base_url, browser='chrome'):
        if cls.instance is None:
            i = object.__new__(cls)
            cls.instance = i
            cls.base_url = base_url
            cls.browser = browser
            if browser == "chrome":
                # Create a new instance of the Chrome driver
                cls.driver = webdriver.Chrome()
            else:
                # Sorry, we can't help you right now.
                asserts.fail("Support for Chrome only!")
        else:
            i = cls.instance
        return i
singleton2.py:
import time
import json
from datetime import datetime
from singleton import Singleton

driver = Singleton('base_url')

def teardown_module(module):
    driver.quit()

class logIn(object):
    def test_login(self):
        """Login"""
        driver.get("url.com")
        # load username and pw through a json file
        with open('file.json', 'r') as f:
            config = json.load(f)
        # login
        driver.find_element_by_id("email").click()
        driver.find_element_by_id("email").clear()
        driver.find_element_by_id("email").send_keys(config['user']['name'])
        driver.find_element_by_id("password").click()
        driver.find_element_by_id("password").clear()
        driver.find_element_by_id("password").send_keys(config['user']['password'])
        driver.find_element_by_id("submit").click()
        time.sleep(3)
        print("You are Logged In!")
        # take screenshot
        driver.save_screenshot('path/screenshot_{}.png'.format(datetime.now()))
The result is that an instance of Chrome kicks off, but nothing happens. The base_url (or any other URL defined in my test) doesn't come up. The blank window is all I get. Any insights on what I'm doing wrong?
You're instantiating the Selenium driver twice.
If you want to keep the same session open, you should pass the same driver object to both scripts, or import it, which would also work but is a dirtier solution.
The best thing to do is to create a singleton class that initiates the driver. Once you have done this, every time you create an object from this class you get the same single webdriver object.
You can get an example from this answer.
Singletons are a very common and very useful pattern; you can read more about them here.
I don't understand what you mean by robot; perhaps the testing framework?
You can write the singleton class wherever you want. You then import the class from that place and instantiate the object. For example:
lib/singleton_web_driver.py
from robot.api import logger
from robot.utils import asserts
from selenium import webdriver

class Singleton(object):
    instance = None

    def __new__(cls, base_url, browser='firefox'):
        if cls.instance is None:
            i = object.__new__(cls)
            cls.instance = i
            cls.base_url = base_url
            cls.browser = browser
            if browser == "firefox":
                # Create a new instance of the Firefox driver
                cls.driver = webdriver.Firefox()
            elif browser == "remote":
                # Create a new instance of the Remote driver
                cls.driver = webdriver.Remote("http://localhost:4444/wd/hub", webdriver.DesiredCapabilities.HTMLUNITWITHJS)
            else:
                # Sorry, we can't help you right now.
                asserts.fail("Support for Firefox or Remote only!")
        else:
            i = cls.instance
        return i
and then in every script that needs the webdriver:
test_script_file.py
from lib.singleton_web_driver import Singleton
driver = Singleton('base_url')
This is just dummy code; I didn't test it. The important point is to create the class with a __new__ method in which you check whether the class has already been instantiated. The import works like any other class import: you write the class in a folder and then import it in the scripts that use it.
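Note that in the class above the WebDriver is stored in the driver attribute, so Singleton('base_url') returns the holder object and browser calls would go through Singleton('base_url').driver. A smaller variant of the same idea, sketched here with hypothetical names (get_driver, quit_driver) and no Robot Framework dependency, keeps the shared driver in a module-level variable:

```python
_driver = None


def get_driver(factory):
    """Return the shared driver, creating it with `factory` on first use only."""
    global _driver
    if _driver is None:
        _driver = factory()
    return _driver


def quit_driver():
    """Quit and forget the shared driver, if one was created."""
    global _driver
    if _driver is not None:
        _driver.quit()
        _driver = None
```

Each test module could then call get_driver(webdriver.Chrome) and receive the same browser session, with quit_driver() called once after the last test.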
I had a similar problem. My solution was simply to initiate the driver in the main file and then use that driver within the imported files and functions. In this example, that means changing createContent(unittest.TestCase) to createContent(unittest.TestCase, driver).