Run multithreaded Selenium automation with multiple browser instances - Python

I'm trying to upload files. I have written a function that does this job, and I'm trying to make the program multithreaded so it uploads, for example, 10 files at once instead of 1.
This is my code; I have tried to clean it up, leave only the important parts, and clarify the structure. I'm trying to find a way to open more browser instances and upload the files faster.
import time
import threading
from selenium import webdriver

def upload():
    # DRIVERS
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(executable_path=path, options=options)
    # ACTION START
    driver.get(url)
    time.sleep(2)
    # UPLOAD CODE FOR 1 FILE
    # I USE THE VARIABLE N TO LOOP THROUGH THE FILES
    # ALL FILES ARE NAMED FROM 0 UP
    # FILE UPLOADED.. SLEEPING
    time.sleep(1200)
    obj = driver.switch_to.alert
    obj.accept()
    driver.quit()

# FILES LIST IN A TEXT FORMAT TO LOOP THROUGH
with open(r'path\for\fileslist') as f:
    lines = f.read().splitlines()

threads = []
for line in lines:
    t = threading.Thread(target=upload)
    t.start()
    threads.append(t)
    # ADD 1 EACH TIME TO UPLOAD THE NEXT FILE
    N += 1
for thread in threads:
    thread.join()

Your case is quite specific. You should be using the multiprocessing module instead of threading for this purpose. Or, better yet, instead of having multiple Chrome Selenium windows, open 10 tabs with the same URL and start uploading in all of them. You can find how to use this functionality here
If you need a confirmation of the upload afterwards, you might want to use multiprocessing so that each process handles a separate tab
Make sure you have enough CPU and RAM, since Selenium tends to spike both quite often

Related

How can I run two (or more) Selenium webdrivers at the same time in Python? [duplicate]

This question already has answers here:
Python selenium multiprocessing
I'm trying to run two (or more) Selenium webdrivers with Python at the same time.
So far I have tried using Python's multiprocessing module, like this:
def _():
    sets = list()
    pool = Pool()
    for i in range(len(master)):
        driver = setProxy(proxy, f'Thread#{i+1}')
        sets.append(
            [f'Thread#{i+1}',
             driver,
             master[i]]
        )
    for i in range(len(sets)):
        pool.apply_async(enterPoint, args=(sets[i][0], sets[i][1], sets[i][2]))
    pool.close()
    pool.join()
The function above calls setProxy() to get a driver instance with a proxy set on it, which works perfectly: it opens a chromedriver len(master) times and accesses a link to check the IP. The sets list is a list of lists, each consisting of 3 objects: the thread number, the driver that will run, and a list with the data that the driver will use. Pool's apply_async() should run enterPoint() len(sets) times, with the thread number, driver, and data as args.
Here's enterPoint code:
def enterPoint(thread, driver, accounts):
    print('I exist!')
    for account in accounts:
        cEEPE(thread, driver, account)
But the 'I exist!' statement never gets printed in the CLI I'm running the application from.
cEEPE() is where the magic happens. I've tested my code without multiprocessing and it works as it should.
I suspect there's a problem in Pool's apply_async() method, which I might have used the wrong way.
The code provided in the question is in isolation, so it's harder to comment on, but given the problem described I would:
import multiprocessing & selenium
use the start & join methods.
This produces the two (or more) processes you ask for.
import multiprocessing
from selenium import webdriver

def open_browser(name):
    driver = webdriver.Firefox()
    driver.get("http://www.google.com")
    print(name, driver.title)
    driver.quit()

if __name__ == '__main__':
    process1 = multiprocessing.Process(target=open_browser, args=("Process-1",))
    process2 = multiprocessing.Process(target=open_browser, args=("Process-2",))
    process1.start()
    process2.start()
    process1.join()
    process2.join()
So, I got the code above to work; here's how I fixed it:
instead of writing the apply_async() call like this:
pool.apply_async(enterPoint, args=(sets[i][0], sets[i][1], sets[i][2]))
I wrote it like this:
pool.apply_async(enterPoint(sets[i][0], sets[i][1], sets[i][2]))
But this still doesn't fix my issue, since I would like enterPoint to run twice at the same time (that version calls enterPoint synchronously in the main process and only passes its return value to apply_async).
It can be done easily with SeleniumBase, which can multi-thread tests (e.g. -n=3 for 3 threads), or even set a proxy server (--proxy=USER:PASS@SERVER:PORT)
pip install seleniumbase, then run with python:
from parameterized import parameterized
from seleniumbase import BaseCase
BaseCase.main(__name__, __file__, "-n=3")

class GoogleTests(BaseCase):
    @parameterized.expand(
        [
            ["Download Python", "Download Python", "img.python-logo"],
            ["Wikipedia", "www.wikipedia.org", "img.central-featured-logo"],
            ["SeleniumBase.io Docs", "SeleniumBase", 'img[alt*="SeleniumB"]'],
        ]
    )
    def test_parameterized_google_search(self, search_key, expected_text, img):
        self.open("https://google.com/ncr")
        self.hide_elements("iframe")
        self.type('input[title="Search"]', search_key + "\n")
        self.assert_text(expected_text, "#search")
        self.click('a:contains("%s")' % expected_text)
        self.assert_element(img)
(This example uses parameterized to turn one test into three different ones.) You can also apply the multi-threading to multiple files, etc.

Creating threads in Python iterations [duplicate]

I have done some research and the consensus appears to state that this is impossible without a lot of knowledge and work. However:
Would it be possible to run the same test in different tabs simultaneously?
If so, how would I go about that? I'm using python and attempting to run 3-5 of the same test at once.
This is not a generic test, hence I do not care if it interrupts a clean testing environment.
I think you can do that, but I feel the better and easier way is to use separate windows. That said, we can use the multithreading, multiprocessing, or subprocess modules to run the tasks in parallel (or near parallel).
Multithreading example
Let me show you a simple example as to how to spawn multiple tests using threading module.
from selenium import webdriver
import threading
import time

def test_logic():
    driver = webdriver.Firefox()
    url = 'https://www.google.co.in'
    driver.get(url)
    # Implement your test logic
    time.sleep(2)
    driver.quit()

N = 5  # Number of browsers to spawn
thread_list = list()

# Start test
for i in range(N):
    t = threading.Thread(name='Test {}'.format(i), target=test_logic)
    t.start()
    time.sleep(1)
    print(t.name + ' started!')
    thread_list.append(t)

# Wait for all threads to complete
for thread in thread_list:
    thread.join()
print('Test completed!')
Here I am spawning 5 browsers to run test cases at the same time. Instead of implementing the test logic I have put in a sleep of 2 seconds for demonstration purposes. The code will fire up 5 Firefox browsers (tested with Python 2.7), open Google and wait for 2 seconds before quitting.
Logs:
Test 0 started!
Test 1 started!
Test 2 started!
Test 3 started!
Test 4 started!
Test completed!
Process finished with exit code 0
Python 3.2+
Threads with their own webdriver instances (different windows)
Threads can solve your problem with a good performance boost (some explanation here) using different windows. Threads are also lighter than processes.
You should use a concurrent.futures.ThreadPoolExecutor, with each thread using its own webdriver.
Also consider adding the headless option to your webdriver.
The example below uses a Chrome webdriver. To illustrate, it runs the test function selenium_test 6 times, passing an integer as the test_url argument.
from concurrent import futures
from selenium import webdriver

def selenium_test(test_url):
    chromeOptions = webdriver.ChromeOptions()
    # chromeOptions.add_argument("--headless")  # make it not visible
    driver = webdriver.Chrome(options=chromeOptions)
    print("testing url {} started".format(test_url))
    driver.get("https://www.google.com")  # replace here by driver.get(test_url)
    # <actual work that needs to be done by selenium>
    driver.quit()

# default number of threads is optimized for cpu cores
# but you can set it with `max_workers` like `futures.ThreadPoolExecutor(max_workers=...)`
with futures.ThreadPoolExecutor() as executor:
    future_test_results = [executor.submit(selenium_test, i)
                           for i in range(6)]  # running the same test 6 times, using the test number as url
    for future_test_result in future_test_results:
        try:
            test_result = future_test_result.result()  # can use `timeout` to wait max seconds for each thread
            # ... do something with the test_result
        except Exception as exc:  # a thread may raise an exception
            print('thread generated an exception: {}'.format(exc))
Outputs:
testing url 1 started
testing url 5 started
testing url 3 started
testing url 4 started
testing url 0 started
testing url 2 started
Look at TestNG; you should be able to find frameworks that achieve this.
I did a brief check and here are a couple of links to get you started:
Parallel Execution & Session Handling in Selenium
Parallel Execution using Selenium Webdriver and TestNG
If you want a reliable, robust framework that can do parallel execution as well as load testing at scale, then look at TurboSelenium: https://butlerthing.io/products#demovideo. Drop us a message and we will be happy to discuss this with you.

How to appropriately use selenium and parallel processing

I am trying to scrape a bunch of URLs using Selenium and BeautifulSoup. Because there are thousands of them and the processing I need to do is complex and uses a lot of CPU, I need multiprocessing (as opposed to multithreading).
The problem right now is that I am opening and closing a Chromedriver instance once for each URL, which adds a lot of overhead and makes the process slow.
What I want to do is instead have a chromedriver instance for each subprocess, only open it once and keep it open until the subprocess finishes. However, my attempts to do it have been unsuccessful.
I tried creating the instances in the main process, dividing the set of URLs by the number of processes and sending each subprocess its subset of URLs and a single driver as arguments, so that each subprocess would cycle through the URLs it got. But that didn't run at all: it gave neither results nor an error.
A solution similar to this one, with multiprocessing instead of threading, got me a recursion-limit error (raising the recursion limit with sys did not help at all).
What else could I do to make this faster?
Below are the relevant parts of the code that actually works.
from bs4 import BeautifulSoup
import re
import csv
from datetime import datetime
import numpy as np
import concurrent.futures
import multiprocessing
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920x1080')
options.add_argument('--no-sandbox')
def runit(row):
    driver = webdriver.Chrome(chrome_options=options)
    driver.set_page_load_timeout(500)
    driver.implicitly_wait(500)
    url = row[1]
    driver.get(url)
    html_doc = driver.page_source
    driver.quit()
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Some long processing code that uses the soup object and generates the result object that is returned below with what I want
    return result, row

if __name__ == '__main__':
    multiprocessing.freeze_support()
    print(datetime.now())
    # The file below has the list of all the pages that I need to process, along with some other pieces of relevant data
    # The URL is the second field in the csv file
    with open('D:\\Users\\shina\\OneDrive\\testTiles.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        # I have 4 cores but Windows shows 8 logical processors; I have tried other numbers below 8, but 8 seems to bring the fastest results
        with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
            results = executor.map(runit, csv_reader)
        # At a later time I will code here what I will do with the results after all the processes finish.
    print(datetime.now())
At the end of the day, you need more compute power to run these types of tests, i.e. multiple computers, BrowserStack, Sauce Labs, etc. Also look into Docker, where you can use a Selenium Grid implementation to run tests on more than one browser:
https://github.com/SeleniumHQ/docker-selenium
I found a possible solution to my question myself.
The mistake in my alternative attempts (not shown above) was trying to create all the drivers in the main process and pass them as arguments to each subprocess. This did not work well. So what I did instead was create each chromedriver instance inside each subprocess, as you will see in my code below.
Please note, however, that this code is not entirely efficient. The rows are divided evenly by count between the subprocesses, and not all pages are equal, so some subprocesses finish earlier than others, leaving the CPU underutilized at the end. Even so, this takes 42% less time than opening and quitting a chromedriver instance for each URL. If anyone has a solution that allows both things (efficient use of the CPU and each subprocess having its own chromedriver instance), I would be thankful.
def runit(part):
    driver = webdriver.Chrome(options=options)
    driver.implicitly_wait(500)
    driver.set_page_load_timeout(500)
    debug = False
    results = []
    keys = []
    # the subprocess now receives a bunch of rows instead of just one
    # so I have to cycle through them now
    for row in part:
        result = None
        try:
            # processFile is a function that does the processing of each URL
            result = processFile(row[1], debug, driver)
        except Exception as e:
            exc = str(e)
            print(f"EXCEPTION: {row[0]} caused {exc}")
        results.append(result)
        keys.append(row[0])
    driver.quit()
    return results, keys

if __name__ == '__main__':
    multiprocessing.freeze_support()
    maxprocessors = 8
    print(datetime.now())
    rows = []
    with open('D:\\Users\\shina\\OneDrive\\testTiles.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            rows.append(row)
    parts = []
    # I separate the rows into equal parts by count
    # However these parts are not equal in terms of required CPU time
    # which leads to CPU underutilization at the end
    for i in range(0, maxprocessors):
        parts.append(rows[i::maxprocessors])
    with concurrent.futures.ProcessPoolExecutor(max_workers=maxprocessors) as executor:
        results = executor.map(runit, parts)
    print(datetime.now())

Selenium Multiprocessing Help in Python 3.4

I am in over my head trying to use Selenium to get the number of results for specific searches on a website. Basically, I'd like to make the process run faster. I have code that works by iterating over search terms and then by newspapers and outputs the collected data into a CSV. Currently, this runs to produce 3 search terms x 3 newspapers over 3 years giving me 9 CSVs in about 10 minutes per CSV.
I would like to use multiprocessing to run each search and newspaper combination simultaneously or at least faster. I've tried to follow other examples on here, but have not been able to successfully implement them. Below is my code so far:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import pandas as pd
from multiprocessing import Pool
def websitesearch(search):
    try:
        start = list_of_inputs[0]
        end = list_of_inputs[1]
        newsabbv = list_of_inputs[2]
        directory = list_of_inputs[3]
        os.chdir(directory)
        if search == broad:
            specification = "broad"
            relPapers = newsabbv
        elif search == narrow:
            specification = "narrow"
            relPapers = newsabbv
        elif search == general:
            specification = "allarticles"
            relPapers = newsabbv
        else:
            for newspapers in relPapers:
                # ...rest of code here that gets the data and puts it in a list named all_Data...
            browser.close()
            df = pd.DataFrame(all_Data)
            df.to_csv(filename, index=False)
    except:
        print('error with item')

if __name__ == '__main__':
    # ...Initializing values and things like that go here. This helps with the setup for search...
    # These are things that go into the function
    start = ["January", 2015]
    end = ["August", 2017]
    directory = "STUFF GOES HERE"
    newsabbv = all_news_abbv
    search_list = [narrow, broad, general]
    list_of_inputs = [start, end, newsabbv, directory]
    pool = Pool(processes=4)
    for search in search_list:
        pool.map(websitesearch, search_list)
    print(list_of_inputs)
If I add a print statement in the main() function, it prints, but nothing else really happens. I'd appreciate any and all help. I left out the code that gets the values and puts them in a list, since it's convoluted, but I know it works.
Thanks in advance for any and all help! Let me know if there is more information I can provide.
Isaac
EDIT: I have looked into more help online and realized that I misunderstood the purpose of mapping a list to a function using pool.map(fn, list). I have updated my code to reflect my current approach, which is still not working. I also moved the initialization of values into the main function.
I don't think it can be multiprocessed your way, because Selenium itself still serializes the work (this is not about the queue module): a single Selenium script can effectively drive only one window at a time (a limitation of the window_handle feature). That means your multiple processes only parallelize the in-memory data processing of what Selenium has crawled; by keeping all the Selenium crawling in one script file, you make Selenium the bottleneck.
The best way to get real multiprocessing is:
Make a script that uses Selenium to crawl the given URL and save the result to a file, e.g. crawler.py, and make sure the script has a print command to output the result.
e.g:
import sys
# import all the modules that you need to run selenium
url = sys.argv[1]  # you will catch the url here
driver = ......  # open browser
driver.get(url)
# just continue the script based on your method
print(--the result that you want--)
sys.exit(0)
I can't give more explanation here, because this is the main core of the process, and only you understand what you want to do on that website.
Make another script file that:
a. divides the URLs. Multiprocessing means spawning several processes and running them together across the CPU cores, and the best way to start is by dividing the input. In your case that is the target URLs (you didn't tell us which website you want to crawl), and every page of the website has a different URL: just collect all the URLs and divide them into several groups (best practice: your CPU cores - 1).
e.g:
import multiprocessing as mp
cpucore = int(mp.cpu_count()) - 1
b. sends each URL for processing to the crawler.py you made before (via subprocess, or another module, e.g. os.system). Make sure you run at most cpucore instances of crawler.py at once.
e.g:
e.g:
crawler = r'YOUR FILE DIRECTORY\crawler.py'

def devideurl():
    global url1, url2, url3, url4
    # make a script that results in:
    # url1 = group or list of urls
    # url2 = group or list of urls
    # url3 = group or list of urls
    # url4 = group or list of urls

def target1():
    for url in url1:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your need...
        # do you see the combination between python crawler and url?
        # the cmd command will be: python crawler.py "value",
        # where "value" is captured by sys.argv[1] in crawler.py

def target2():
    for url in url2:
        t2 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your need...

def target3():
    for url in url3:
        t3 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your need...

def target4():
    for url in url4:
        t4 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your need...

cpucore = int(mp.cpu_count()) - 1
pool = Pool(processes=cpucore)  # max is the value of cpucore
for search in search_list:
    pool.map(target1, devideurl)
    pool.map(target2, devideurl)
    pool.map(target3, devideurl)
    pool.map(target4, devideurl)
    # you can make more, depending on your cpu cores
c. gets the printed results back into the memory of the main script
d. continues the script to process the data you already got.
And lastly, make the multiprocess script for the whole process in the main script.
With this method:
you can open many browser windows and handle them at the same time, and because crawling data from a website is slower than processing data in memory, this method at least reduces the bottleneck in the data flow, which means it's faster than your method before.
Hopefully helpful... cheers
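Steps (b) and (c) above can be sketched concretely with subprocess: fan out one child process per URL and read each child's printed result back from its stdout. Here `python -c` stands in for crawler.py so the sketch is self-contained, and the URLs are placeholders:

```python
import subprocess
import sys

urls = ['http://site.example/page1', 'http://site.example/page2']

# step (b): launch one child per url; each child "crawls" and prints its result
procs = [subprocess.Popen(
             [sys.executable, '-c',
              'import sys; print("crawled:", sys.argv[1])',  # stand-in for crawler.py
              url],
             stdout=subprocess.PIPE, text=True)
         for url in urls]

# step (c): collect each child's printed output into the main script's memory
results = [p.communicate()[0].strip() for p in procs]
```

In the real version you would cap the number of simultaneous Popen calls at your cpucore value, as the answer advises.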

How to call some functions at the same time?

I'm wondering what the best way is to run several functions at the same time.
I wrote a Python module that runs 3 instances of Firefox with the Selenium webdriver, which should load the same page in each of them.
my code looks like :
url = "http://google.com"
firefox1 = webdriver.Firefox()
firefox2 = webdriver.Firefox()
firefox3 = webdriver.Firefox()
firefox1.get(url)
firefox2.get(url)
firefox3.get(url)
Selenium is very(!) slow, and each page load takes about 30-60 seconds.
I want to run all of the firefox*.get(url) calls in parallel.
What is the best way to do that?
1) If it's not that big a process, you can use threads (although that wouldn't be perfectly parallel due to Python's GIL, it would still do your job to some extent).
2) You can use asynchronous programming for this purpose. If it's Python 3 you can use the built-in library asyncio.
Here is a sample program (I've not tested it, but it should give you an idea about asyncio):
import asyncio

async def func1():
    print('func1')

async def func2():
    print('func2')

async def func3():
    print('func3')

async def main():
    # run all three concurrently
    await asyncio.gather(func1(), func2(), func3())

asyncio.run(main())
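A caveat on the asyncio suggestion: webdriver calls such as firefox1.get(url) are blocking, so wrapping them in coroutines alone will not overlap them; you have to push each call onto a thread, e.g. with asyncio.to_thread (Python 3.9+). A minimal sketch, with time.sleep standing in for the blocking page load:

```python
import asyncio
import time

def blocking_get(name):
    # stand-in for a blocking call like firefox1.get(url)
    time.sleep(0.2)
    return f"{name} loaded"

async def main():
    # each blocking call runs in its own thread, so all three overlap
    return await asyncio.gather(
        asyncio.to_thread(blocking_get, "firefox1"),
        asyncio.to_thread(blocking_get, "firefox2"),
        asyncio.to_thread(blocking_get, "firefox3"),
    )

results = asyncio.run(main())
```

With three 0.2-second loads this finishes in roughly 0.2 seconds instead of 0.6, which is the same speedup the question is after for the three page loads.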
