Creating threads in Python iterations [duplicate] - python

I have done some research, and the consensus appears to be that this is impossible without a lot of knowledge and work. However:
Would it be possible to run the same test in different tabs simultaneously?
If so, how would I go about that? I'm using python and attempting to run 3-5 of the same test at once.
This is not a generic test, hence I do not care if it interrupts a clean testing environment.

I think you can do that, but I feel the better (or easier) way is to use separate browser windows rather than tabs. That said, we can use the threading, multiprocessing, or subprocess module to trigger the tasks in parallel (or near-parallel); a multiprocessing variant is sketched after the log output below.
Multithreading example
Let me show you a simple example of how to spawn multiple tests using the threading module.
from selenium import webdriver
import threading
import time

def test_logic():
    driver = webdriver.Firefox()
    url = 'https://www.google.co.in'
    driver.get(url)
    # Implement your test logic
    time.sleep(2)
    driver.quit()

N = 5  # Number of browsers to spawn
thread_list = list()

# Start tests
for i in range(N):
    t = threading.Thread(name='Test {}'.format(i), target=test_logic)
    t.start()
    time.sleep(1)
    print(t.name + ' started!')
    thread_list.append(t)

# Wait for all threads to complete
for thread in thread_list:
    thread.join()

print('Test completed!')
Here I am spawning 5 browsers to run test cases at once. Instead of implementing the test logic, I have put in a sleep of 2 seconds for demonstration purposes. The code will fire up 5 Firefox browsers (tested with Python 2.7), open Google, and wait for 2 seconds before quitting.
Logs:
Test 0 started!
Test 1 started!
Test 2 started!
Test 3 started!
Test 4 started!
Test completed!
Process finished with exit code 0
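For comparison, here is a minimal sketch (not from the original answer) of the same test driven by a multiprocessing.Pool instead of threads; each worker process gets its own browser:

from multiprocessing import Pool
from selenium import webdriver
import time

def test_logic(i):
    driver = webdriver.Firefox()
    driver.get('https://www.google.co.in')
    time.sleep(2)  # stand-in for the real test logic
    driver.quit()
    return 'Test {} completed'.format(i)

if __name__ == '__main__':
    with Pool(processes=5) as pool:
        print(pool.map(test_logic, range(5)))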

Python 3.2+
Threads with their own webdriver instances (different windows)
Threads, each driving its own browser window, can solve your problem with a good performance boost; threads are also lighter than processes.
You should use a concurrent.futures.ThreadPoolExecutor, with each thread using its own webdriver.
Also consider adding the headless option to your webdriver.
The example below uses a Chrome webdriver. To illustrate, it passes an integer as the test_url argument to the test function selenium_test and runs it 6 times.
from concurrent import futures
from selenium import webdriver

def selenium_test(test_url):
    chromeOptions = webdriver.ChromeOptions()
    # chromeOptions.add_argument("--headless")  # make it not visible
    driver = webdriver.Chrome(options=chromeOptions)
    print("testing url {} started".format(test_url))
    driver.get("https://www.google.com")  # replace here by driver.get(test_url)
    # <actual work that needs to be done by selenium>
    driver.quit()

# The default number of threads is chosen based on the CPU count,
# but you can set it with `max_workers`, e.g. `futures.ThreadPoolExecutor(max_workers=...)`
with futures.ThreadPoolExecutor() as executor:
    future_test_results = [executor.submit(selenium_test, i)
                           for i in range(6)]  # running the same test 6 times, using the test number as the url
    for future_test_result in future_test_results:
        try:
            test_result = future_test_result.result()  # can use `timeout` to wait a max number of seconds for each thread
            # ... do something with the test_result
        except Exception as exc:  # a thread can raise an exception; handle it here
            print('thread generated an exception: {}'.format(exc))
Outputs:
testing url 1 started
testing url 5 started
testing url 3 started
testing url 4 started
testing url 0 started
testing url 2 started

Look at TestNG; you should be able to find frameworks that achieve this.
I did a brief check and here are a couple of links to get you started:
Parallel Execution & Session Handling in Selenium
Parallel Execution using Selenium Webdriver and TestNG
If you want a reliable, robust framework that can do parallel execution as well as load testing at scale, then look at TurboSelenium: https://butlerthing.io/products#demovideo. Drop us a message and we will be happy to discuss this with you.

Related

How can I run two (or more) Selenium webdrivers at the same time in Python? [duplicate]

I'm trying to run two (or more) Selenium webdrivers with Python at the same time.
So far I have tried using Python's multiprocessing module, like this:
def _():
    sets = list()
    pool = Pool()
    for i in range(len(master)):
        driver = setProxy(proxy, f'Tread#{i+1}')
        sets.append(
            [f'Thread#{i+1}',
             driver,
             master[i]]
        )
    for i in range(len(sets)):
        pool.apply_async(enterPoint, args=(sets[i][0], sets[i][1], sets[i][2]))
    pool.close()
    pool.join()
The function above calls setProxy() to get a driver instance with a proxy set on it, which works perfectly: it opens chromedriver len(master) times and visits a link to check the IP. The sets list is a list of lists, each consisting of 3 objects: the thread number, the driver that will run, and a list of the data that driver will use. Pool's apply_async() should run enterPoint() len(sets) times, with the thread number, driver, and data as args.
Here's the enterPoint code:
def enterPoint(thread, driver, accounts):
    print('I exist!')
    for account in accounts:
        cEEPE(thread, driver, account)
But the 'I exist!' statement never gets printed in the CLI I'm running the application from.
cEEPE() is where the magic happens. I've tested my code without multiprocessing and it works as it should.
I suspect there's a problem with how I'm calling Pool's apply_async() method; I might have used it the wrong way.
The code provided in the question is shown in isolation, so it's harder to comment on, but given the problem described I would set about it like this:
import multiprocessing and selenium;
use the start and join methods.
This produces the two (or more) processes that you asked for.
import multiprocessing
from selenium import webdriver

def open_browser(name):
    driver = webdriver.Firefox()
    driver.get("http://www.google.com")
    print(name, driver.title)
    driver.quit()

if __name__ == '__main__':
    process1 = multiprocessing.Process(target=open_browser, args=("Process-1",))
    process2 = multiprocessing.Process(target=open_browser, args=("Process-2",))
    process1.start()
    process2.start()
    process1.join()
    process2.join()
So, I got the code above to work. Here's how I fixed it:
instead of writing the apply_async() call like this:
pool.apply_async(enterPoint, args=(sets[i][0], sets[i][1], sets[i][2]))
I wrote it like this:
pool.apply_async(enterPoint(sets[i][0], sets[i][1], sets[i][2]))
But this still doesn't fix my issue, since I would like enterPoint to run twice at the same time.
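A note on the snippet above: passing enterPoint(...) to apply_async calls the function immediately in the main process and submits only its return value, so nothing runs in parallel. A likely problem with the original attempt is that webdriver objects generally cannot be pickled and sent to pool workers. Below is a minimal sketch (an assumption, not from the thread) of the usual pattern, creating the driver inside the worker; the names setProxy, cEEPE, master and proxy are taken from the question and assumed to exist:

from multiprocessing import Pool

def worker(thread_name, accounts):
    # Create the driver inside the worker process instead of passing it in,
    # because webdriver objects are not picklable.
    driver = setProxy(proxy, thread_name)
    try:
        for account in accounts:
            cEEPE(thread_name, driver, account)
    finally:
        driver.quit()

if __name__ == '__main__':
    pool = Pool()
    results = [pool.apply_async(worker, args=(f'Thread#{i+1}', master[i]))
               for i in range(len(master))]
    pool.close()
    pool.join()
    for r in results:
        r.get()  # re-raises any exception that happened in a worker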
It can be done easily with SeleniumBase, which can multi-thread tests (e.g. -n=3 for 3 threads) and even set a proxy server (--proxy=USER:PASS@SERVER:PORT).
pip install seleniumbase, then run with python:
from parameterized import parameterized
from seleniumbase import BaseCase
BaseCase.main(__name__, __file__, "-n=3")

class GoogleTests(BaseCase):
    @parameterized.expand(
        [
            ["Download Python", "Download Python", "img.python-logo"],
            ["Wikipedia", "www.wikipedia.org", "img.central-featured-logo"],
            ["SeleniumBase.io Docs", "SeleniumBase", 'img[alt*="SeleniumB"]'],
        ]
    )
    def test_parameterized_google_search(self, search_key, expected_text, img):
        self.open("https://google.com/ncr")
        self.hide_elements("iframe")
        self.type('input[title="Search"]', search_key + "\n")
        self.assert_text(expected_text, "#search")
        self.click('a:contains("%s")' % expected_text)
        self.assert_element(img)
(This example uses parameterized to turn one test into three different ones.) You can also apply the multi-threading to multiple files, etc.

Handling multiple HTTP requests in Python

I am mining data from a website through data scraping in Python. I am using the requests package for sending the parameters.
Here is the code snippet in Python:
import requests

def get_url_data(param):
    post_data = get_post_data(param)
    headers = {}
    headers["Content-Type"] = "text/xml; charset=UTF-8"
    headers["Content-Length"] = len(post_data)
    headers["Connection"] = 'Keep-Alive'
    headers["Cache-Control"] = 'no-cache'
    page = requests.post(url, data=post_data, headers=headers, timeout=10)
    data = parse_page(page.content)
    return data

for param in paramList:
    data = get_url_data(param)
The variable paramList is a list of more than 1000 elements, and the endpoint url stays the same. I was wondering if there is a better and faster way to do this?
Thanks
As there is a significant amount of network I/O involved, threading should improve the overall performance significantly.
You can try using a ThreadPool; test and tweak the number of threads to find the one that is best suited to the situation and gives the highest overall performance.
from multiprocessing.pool import ThreadPool

# Remove the 'for param in paramList' iteration

def get_url_data(param):
    # Rest of code here

if __name__ == '__main__':
    pool = ThreadPool(15)
    pool.map(get_url_data, paramList)  # Will split the load between the threads nicely
    pool.close()
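As a variation on the same idea (a sketch, not from the original answer), concurrent.futures.ThreadPoolExecutor combined with a requests.Session reuses keep-alive connections to the single endpoint, which usually helps when 1000+ requests hit the same domain. The names get_post_data, parse_page, url and paramList are assumed to be defined as in the question; sharing a Session across threads is common in practice, though not an official guarantee of the library.

from concurrent.futures import ThreadPoolExecutor
import requests

session = requests.Session()  # reuses TCP connections to the same host

def get_url_data(param):
    post_data = get_post_data(param)
    headers = {
        "Content-Type": "text/xml; charset=UTF-8",
        "Connection": "Keep-Alive",
        "Cache-Control": "no-cache",
    }
    page = session.post(url, data=post_data, headers=headers, timeout=10)
    return parse_page(page.content)

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=15) as executor:
        results = list(executor.map(get_url_data, paramList))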
I need to make 1000 post request to same domain, I was wondering if
there is a better and more faster way to do this ?
It depends. If it's a static asset, or a servlet whose behaviour you know, and the same parameters return the same response each time, you can implement an LRU cache or some other caching mechanism. If not, 1K POST requests to some servlet don't matter, even if they share the same domain.
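As an illustration of the caching idea (a sketch, not part of the original answer): if get_url_data is a pure function of its parameter, functools.lru_cache can skip repeated requests entirely. The names url, get_post_data and parse_page are assumed from the question.

from functools import lru_cache
import requests

@lru_cache(maxsize=None)  # cache every distinct param -> parsed response
def get_url_data(param):
    # param must be hashable (e.g. a string or tuple) for lru_cache to work
    page = requests.post(url, data=get_post_data(param), timeout=10)
    return parse_page(page.content)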
There is an answer using multiprocessing's ThreadPool interface, which actually uses the main process with 15 threads. Does it run on a 15-core machine? A core can only run one thread at a time (hyper-threading aside: does it run on 8 hyper-cores?).
The ThreadPool interface lives inside a library with a misleading name, multiprocessing, while Python also has a threading module, which is confusing. Let's benchmark some lower-level code:
import psutil
from multiprocessing.pool import ThreadPool
from time import sleep

def get_url_data(param):
    print(param)  # just for convenience
    sleep(1)  # let's assume it will take one second each time

if __name__ == '__main__':
    paramList = [i for i in range(100)]  # 100 urls
    pool = ThreadPool(psutil.cpu_count())  # each core can run one thread (hyper-threading aside)
    pool.map(get_url_data, paramList)  # splitting the jobs
    pool.close()
The code above will use the main process with 4 threads in my case, because my laptop has 4 CPUs. Benchmark result:
$ time python3_5_2 test.py
real 0m28.127s
user 0m0.080s
sys 0m0.012s
Let's try spawning processes with multiprocessing:
import psutil
import multiprocessing
from time import sleep
import numpy

def get_url_data(urls):
    for url in urls:
        print(url)
        sleep(1)  # let's assume it will take one second each time

if __name__ == "__main__":
    jobs = []
    # Split URLs into chunks, as many as there are CPUs
    chunks = numpy.array_split(range(100), psutil.cpu_count())
    # Pass each chunk into a process
    for url_chunk in chunks:
        jobs.append(multiprocessing.Process(target=get_url_data, args=(url_chunk, )))
    # Start the processes
    for j in jobs:
        j.start()
    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
Benchmark result: about 3 seconds less.
$ time python3_5_2 test2.py
real 0m25.208s
user 0m0.260s
sys 0m0.216
If you execute ps -aux | grep "test.py" you will see 5 processes, because one is the main process, which manages the others.
There are some drawbacks:
You did not explain in depth what your code is doing, but if you are doing work that needs to be synchronized, you need to know that multiprocessing is NOT thread safe.
Spawning extra processes introduces I/O overhead, as data has to be shuffled around between processes.
Assuming the data is restricted to each process, it is possible to gain a significant speedup; be aware of Amdahl's Law.
If you reveal what your code does afterwards (save it to a file? a database? stdout?) it will be easier to give a better answer/direction. A few ideas come to mind: immutable infrastructure with Bash or Java to handle synchronization; or, if it is a memory-bound issue, an object pool to process the JSON responses; it might even be a job for fault-tolerant Elixir. A sketch of returning results to the parent process without shared state follows below.
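A minimal sketch (an assumption, not from the original answer) of collecting per-process results through a multiprocessing.Queue, so no state is shared between workers:

import multiprocessing
from time import sleep

def get_url_data(urls, result_queue):
    for url in urls:
        sleep(1)  # stand-in for the real request
        result_queue.put((url, "some parsed data"))  # send results back to the parent

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    chunks = [range(0, 50), range(50, 100)]  # two chunks for brevity
    jobs = [multiprocessing.Process(target=get_url_data, args=(chunk, queue))
            for chunk in chunks]
    for j in jobs:
        j.start()
    results = [queue.get() for _ in range(100)]  # drain before join to avoid blocking
    for j in jobs:
        j.join()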

Selenium Multiprocessing Help in Python 3.4

I am in over my head trying to use Selenium to get the number of results for specific searches on a website. Basically, I'd like to make the process run faster. I have code that works by iterating over search terms and then by newspapers and outputs the collected data into a CSV. Currently, this runs to produce 3 search terms x 3 newspapers over 3 years giving me 9 CSVs in about 10 minutes per CSV.
I would like to use multiprocessing to run each search and newspaper combination simultaneously or at least faster. I've tried to follow other examples on here, but have not been able to successfully implement them. Below is my code so far:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import pandas as pd
from multiprocessing import Pool

def websitesearch(search):
    try:
        start = list_of_inputs[0]
        end = list_of_inputs[1]
        newsabbv = list_of_inputs[2]
        directory = list_of_inputs[3]
        os.chdir(directory)
        if search == broad:
            specification = "broad"
            relPapers = newsabbv
        elif search == narrow:
            specification = "narrow"
            relPapers = newsabbv
        elif search == general:
            specification = "allarticles"
            relPapers = newsabbv
        else:
            for newspapers in relPapers:
                # ...rest of code here that gets the data and puts it in a list named all_Data...
                browser.close()
                df = pd.DataFrame(all_Data)
                df.to_csv(filename, index=False)
    except:
        print('error with item')

if __name__ == '__main__':
    # ...Initializing values and things like that go here. This helps with the setup for search...

    # These are things that go into the function
    start = ["January", 2015]
    end = ["August", 2017]
    directory = "STUFF GOES HERE"
    newsabbv = all_news_abbv
    search_list = [narrow, broad, general]
    list_of_inputs = [start, end, newsabbv, directory]

    pool = Pool(processes=4)
    for search in search_list:
        pool.map(websitesearch, search_list)
    print(list_of_inputs)
If I add in a print statement in the main() function, it will print, but nothing really ends up happening. I'd appreciate any and all help. I left out the code that gets the values and puts it into a list since its convoluted but I know it works.
Thanks in advance for any and all help! Let me know if there is more information I can provide.
Isaac
EDIT: I have looked into more help online and realize that I misunderstood the purpose of mapping a list to the function using pool.map(fn, list). I have updated my code to reflect my current approach that is still not working. I also moved the initializing values into the main function.
I don't think it can be multiprocessed your way, because Selenium itself still forces the work through a queue-like, one-at-a-time process (not the queue module).
The reason is that a Selenium driver can only drive one window at a time; it cannot handle several windows or browser tabs simultaneously (a limitation of the window_handle feature). That means your multiple processes only parallelize the in-memory data processing that is sent to, or crawled by, Selenium. By trying to run the whole Selenium crawl in one script file, you make Selenium itself the bottleneck.
The best way to get real multiprocessing is:
Make a script that uses Selenium to crawl a given URL, save it as a file, e.g. crawler.py, and make sure the script has a print command to print the result.
e.g.:
# import all the modules that you need to run selenium
import sys

url = sys.argv[1]  # you will catch the url here
driver = ...       # open the browser
driver.get(url)
# just continue the script based on your method
print(...)  # print the result that you want
sys.exit(0)
I can't give more detail here, because this is the core of the process and only you know exactly what you want to do on that site.
Make another script file that:
a. divides the URLs. Multiprocessing means spawning several processes and running them together across all CPU cores, and the best way to start is by dividing the input. In your case that is probably the target URLs (you don't tell us which website you want to crawl), but every page of the website has a different URL. Just collect all the URLs and divide them into several groups (best practice: your CPU cores - 1).
e.g.:
import multiprocessing as mp
cpucore = int(mp.cpu_count()) - 1
b. sends the URLs to the crawler.py you made before (via subprocess, or another module, e.g. os.system). Make sure you run at most cpucore instances of crawler.py at once.
e.g.:
import multiprocessing as mp
import subprocess
from subprocess import PIPE

crawler = r'YOUR FILE DIRECTORY\crawler.py'

def devideurl(urls):
    # split the full url list into 4 groups (or cpu_count() - 1 groups)
    n = 4
    return [urls[i::n] for i in range(n)]

def target1(urls1):
    for url in urls1:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your need...
        # do you see the combination between the python crawler and the url?
        # the cmd command will be: python crawler.py "value", and the "value"
        # is captured by the sys.argv[1] command in crawler.py

def target2(urls2):
    for url in urls2:
        t2 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your need...

def target3(urls3):
    for url in urls3:
        t3 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your need...

def target4(urls4):
    for url in urls4:
        t4 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your need...

if __name__ == '__main__':
    cpucore = int(mp.cpu_count()) - 1
    urls1, urls2, urls3, urls4 = devideurl(all_urls)  # all_urls: the full list of urls you collected
    pool = mp.Pool(processes=cpucore)
    pool.apply_async(target1, (urls1,))
    pool.apply_async(target2, (urls2,))
    pool.apply_async(target3, (urls3,))
    pool.apply_async(target4, (urls4,))
    pool.close()
    pool.join()
    # you can add more targets, depending on your cpu cores
c. gets the printed result back into the memory of the main script (see the sketch below).
d. continues your script to process the data that you already got.
Finally, wrap the whole thing in a multiprocessing script inside the main script.
With this method:
You can open many browser windows and handle them at the same time, and because crawling data from a website is slower than processing data in memory, this method at least reduces the bottleneck in the data flow. That means it is faster than your previous method.
Hopefully helpful... cheers
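A minimal sketch of step c (an assumption, not from the original answer): reading what crawler.py printed by capturing its stdout with communicate():

import subprocess
from subprocess import PIPE

def run_crawler(crawler_path, url):
    # launch crawler.py for one url and capture whatever it printed
    proc = subprocess.Popen(['python', crawler_path, url], stdout=PIPE)
    out, _ = proc.communicate()   # waits for the subprocess to finish
    return out.decode().strip()   # the printed result as a string

result = run_crawler(r'YOUR FILE DIRECTORY\crawler.py', 'https://example.com')
print(result)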

How to call some functions at the same time?

I'm wondering what the best way is to run several functions at the same time.
I wrote a Python module that runs 3 instances of Firefox with the Selenium webdriver, which should load the same page in each of them.
My code looks like:
url = "http://google.com"
firefox1 = webdriver.Firefox()
firefox2 = webdriver.Firefox()
firefox3 = webdriver.Firefox()
firefox1.get(url)
firefox2.get(url)
firefox3.get(url)
Selenium is very(!) slow, and each of the page loads takes about 30-60 seconds.
I want to run all of the firefox*.get(url) parallel.
What is the best way to do that?
1) If it's not that big a process, you can use threads (that wouldn't be perfectly parallel due to Python's GIL, but since the work is mostly waiting on the browser it would still do the job to a large extent).
2) You can use asynchronous programming for this purpose. If it's Python 3 you can use the built-in library asyncio.
Here is a sample program (I've not tested it, but it should give you an idea about asyncio):
import asyncio

async def func1():
    print('func1')

async def func2():
    print('func2')

async def func3():
    print('func3')

async def main():
    # schedule all three coroutines and wait for them to finish
    await asyncio.gather(func1(), func2(), func3())

asyncio.run(main())
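Note that asyncio only helps if the work inside the coroutines is itself non-blocking; a plain blocking call like firefox1.get(url) will still run one at a time. For the Selenium case in the question, here is a sketch of option 1 (threads), which is usually the simplest fit because the time is spent waiting on the browser:

import threading
from selenium import webdriver

url = "http://google.com"
drivers = [webdriver.Firefox() for _ in range(3)]

# run the three blocking get() calls in parallel threads, one per driver
threads = [threading.Thread(target=d.get, args=(url,)) for d in drivers]
for t in threads:
    t.start()
for t in threads:
    t.join()  # all three pages load at the same time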

Problems with Speed during web-crawling (Python)

I would love to improve the speed of this program a lot. It reads about 12,000 pages in 10 minutes. I was wondering if there is something that would help the speed a lot? I hope you guys know some tips. I am supposed to read millions of pages... so that would take way too long :( Here is my code:
from eventlet.green import urllib2
import httplib
import time
import eventlet

# Create the URLs in groups of 400 (+- max for eventlet)
def web_CreateURLS():
    print str(str(time.asctime( time.localtime(time.time()) )).split(" ")[3])
    for var_indexURLS in xrange(0, 2000000, 400):
        var_URLS = []
        for var_indexCRAWL in xrange(var_indexURLS, var_indexURLS+400):
            var_URLS.append("http://www.nu.nl")
        web_ScanURLS(var_URLS)

# Return the HTML source per URL
def web_ReturnHTML(url):
    try:
        return [urllib2.urlopen(url[0]).read(), url[1]]
    except urllib2.URLError:
        time.sleep(10)
        print "UrlError"
        web_ReturnHTML(url)

# Analyse the HTML source
def web_ScanURLS(var_URLS):
    pool = eventlet.GreenPool()
    try:
        for var_HTML in pool.imap(web_ReturnHTML, var_URLS):
            # do something etc..
    except TypeError:
        pass

web_CreateURLS()
I like using greenlets... but I often benefit from using multiple processes spread over lots of systems, or even just one system, letting the OS take care of all the checks and balances of running multiple processes.
Check out ZeroMQ at http://zeromq.org/ for some good examples of how to make a dispatcher with a ton of listeners that do whatever the dispatcher says. Alternatively, check out execnet for a way to quickly get started with executing remote or local tasks in parallel.
I also use http://spread.org/ a lot and have lots of systems listening to a common Spread daemon. It's a very useful message bus where results can be pooled back and dispatched from a single thread pretty easily.
And then of course there is always Redis pub/sub or sync. :)
"Share the load"
