How to appropriately use Selenium and parallel processing in Python

I am trying to scrape a bunch of URLs using Selenium and BeautifulSoup. Because there are thousands of them and the processing I need to do is complex and uses a lot of CPU, I need multiprocessing (as opposed to multithreading).
The problem right now is that I open and close a chromedriver instance for each URL, which adds a lot of overhead and makes the process slow.
What I want instead is one chromedriver instance per subprocess, opened only once and kept open until the subprocess finishes. However, my attempts to do this have been unsuccessful.
I tried creating the instances in the main process, dividing the set of URLs by the number of processes, and sending each subprocess its subset of URLs and a single driver as arguments, so that each subprocess would cycle through the URLs it got. But that didn't run at all; it gave neither results nor an error.
A solution similar to this, with multiprocessing instead of threading, got me a recursion-limit error (changing the recursion limit using sys did not help at all).
What else could I do to make this faster?
Below are the relevant parts of the code that actually works.
from bs4 import BeautifulSoup
import re
import csv
from datetime import datetime
import numpy as np
import concurrent.futures
import multiprocessing
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920x1080')
options.add_argument('--no-sandbox')

def runit(row):
    driver = webdriver.Chrome(chrome_options=options)
    driver.set_page_load_timeout(500)
    driver.implicitly_wait(500)
    url = row[1]
    driver.get(url)
    html_doc = driver.page_source
    driver.quit()
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Some long processing code that uses the soup object and generates the result object that is returned below with what I want
    return result, row

if __name__ == '__main__':
    multiprocessing.freeze_support()
    print(datetime.now())
    # The file below has the list of all the pages that I need to process, along with some other pieces of relevant data
    # The URL is the second field in the csv file
    with open('D:\\Users\\shina\\OneDrive\\testTiles.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        # I have 4 cores but Windows shows 8 logical processors, I have tried other numbers below 8, but 8 seems to bring the fastest results
        with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
            results = executor.map(runit, csv_reader)
            # At a later time I will code here what I will do with the results after all the processes finish.
    print(datetime.now())

At the end of the day, you need more compute power to run these kinds of jobs, i.e. multiple machines, BrowserStack, Sauce Labs, etc. Also, look into Docker, where you can use a Selenium Grid implementation to run tests on more than one browser:
https://github.com/SeleniumHQ/docker-selenium
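For reference, once a Grid or docker-selenium node is up, pointing the scraping code at it is mostly a matter of swapping webdriver.Chrome for webdriver.Remote. A minimal sketch, assuming Selenium 4 and a hub listening on localhost:4444 (started, for example, with docker run -d -p 4444:4444 --shm-size=2g selenium/standalone-chrome):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')

# connect to the remote node instead of starting a local chromedriver
driver = webdriver.Remote(
    command_executor='http://localhost:4444/wd/hub',
    options=options,
)
driver.get('https://example.com')
print(driver.title)
driver.quit()

Each worker process can hold one such remote session, so the same pool layout as in the question still applies; only the driver construction changes.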

I found a possible solution to my question myself.
The mistake I was making in my alternative solutions (not shown above) was trying to create all the drivers in the main process and pass them as arguments to each subprocess. That did not work well. So what I did instead was to create each chromedriver instance inside each subprocess, as you will see in my code below. Please note, however, that this code is not entirely efficient: the rows are divided evenly by count between the subprocesses, and not all pages are equal, so some subprocesses finish earlier than others, leading to underutilization of the CPU at the end. Even so, it takes 42% less time than opening and quitting a chromedriver instance for each URL. If anyone has a solution that allows both things (efficient use of the CPU and each subprocess having its own chromedriver instance), I would be thankful. (One possible approach is sketched after the code below.)
def runit(part):
    driver = webdriver.Chrome(options=options)
    driver.implicitly_wait(500)
    driver.set_page_load_timeout(500)
    debug = False
    results = []
    keys = []
    # the subprocess now receives a bunch of rows instead of just one
    # so I have to cycle through them now
    for row in part:
        result = None
        try:
            # processFile is a function that does the processing of each URL
            result = processFile(row[1], debug, driver)
        except Exception as e:
            exc = str(e)
            print(f"EXCEPTION: {row[0]} caused {exc}")
        results.append(result)
        keys.append(row[0])
    driver.quit()
    return results, keys

if __name__ == '__main__':
    multiprocessing.freeze_support()
    maxprocessors = 8
    print(datetime.now())
    rows = []
    with open('D:\\Users\\shina\\OneDrive\\testTiles.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            rows.append(row)
    parts = []
    # I separate the rows into equal parts by count
    # However these parts are not equal in terms of required CPU time
    # Which creates CPU underutilization at the end
    for i in range(0, maxprocessors):
        parts.append(rows[i::maxprocessors])
    with concurrent.futures.ProcessPoolExecutor(max_workers=maxprocessors) as executor:
        results = executor.map(runit, parts)
    print(datetime.now())
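A possible way to get both goals (full CPU use and one long-lived driver per subprocess) is the initializer hook of ProcessPoolExecutor, available since Python 3.7: each worker process opens its own driver exactly once, and the pool then hands out rows one at a time, so faster workers keep pulling new work instead of idling. This is only a sketch, reusing the options, processFile and CSV layout from the code above; the worker drivers are not explicitly quit when the pool shuts down.

import concurrent.futures
import csv
import multiprocessing
from selenium import webdriver

driver = None  # one driver per worker process, created in init_worker()

def init_worker():
    global driver
    driver = webdriver.Chrome(options=options)  # 'options' as defined in the question
    driver.implicitly_wait(500)
    driver.set_page_load_timeout(500)

def runit(row):
    # each call handles a single row, so one slow page no longer pins a whole chunk
    try:
        return row[0], processFile(row[1], False, driver)
    except Exception as e:
        print(f"EXCEPTION: {row[0]} caused {e}")
        return row[0], None

if __name__ == '__main__':
    multiprocessing.freeze_support()
    with open('D:\\Users\\shina\\OneDrive\\testTiles.csv') as csv_file:
        rows = list(csv.reader(csv_file, delimiter=','))
    with concurrent.futures.ProcessPoolExecutor(max_workers=8,
                                                initializer=init_worker) as executor:
        for key, result in executor.map(runit, rows, chunksize=1):
            pass  # collect the results here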

Related

How can I run two(or more) selenium's webdrivers at the same time in Python? [duplicate]

I'm trying to run two (or more) Selenium webdrivers with Python at the same time.
So far I have tried using Python's multiprocessing module, in this way:
def _():
    sets = list()
    pool = Pool()
    for i in range(len(master)):
        driver = setProxy(proxy, f'Thread#{i+1}')
        sets.append(
            [f'Thread#{i+1}',
             driver,
             master[i]]
        )
    for i in range(len(sets)):
        pool.apply_async(enterPoint, args=(sets[i][0], sets[i][1], sets[i][2]))
    pool.close()
    pool.join()
The function above calls setProxy() to get a driver instance with a proxy set on it; that part works perfectly and opens a chromedriver len(master) times, accessing a link to check the IP. The sets list is a list of lists, each consisting of 3 objects: the thread number, the driver that will run, and a list with the data that the driver will use. Pool's apply_async() should run enterPoint() len(sets) times, with the thread number, driver and data as args.
Here's the enterPoint code:
def enterPoint(thread, driver, accounts):
    print('I exist!')
    for account in accounts:
        cEEPE(thread, driver, account)
But the 'I exist!' statement never gets printed in the CLI I'm running the application from.
cEEPE() is where the magic happens. I've tested my code without multiprocessing and it works as it should.
I suspect there's a problem with Pool's apply_async() method, which I might have used the wrong way.
The code provided in the question is in isolation, so it's harder to comment on, but given the problem described I would set about it like this:
import multiprocessing & selenium
use the start & join methods.
This would produce the two (or more) processes that you ask for.
import multiprocessing
from selenium import webdriver

def open_browser(name):
    driver = webdriver.Firefox()
    driver.get("http://www.google.com")
    print(name, driver.title)
    driver.quit()

if __name__ == '__main__':
    process1 = multiprocessing.Process(target=open_browser, args=("Process-1",))
    process2 = multiprocessing.Process(target=open_browser, args=("Process-2",))
    process1.start()
    process2.start()
    process1.join()
    process2.join()
So, I got the code above to work. Here's how I fixed it: instead of writing the apply_async() call like this:
pool.apply_async(enterPoint, args=(sets[i][0], sets[i][1], sets[i][2]))
I wrote it like this:
pool.apply_async(enterPoint(sets[i][0], sets[i][1], sets[i][2]))
But this still does not fix my issue, since I would like enterPoint to run twice at the same time.
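A note on why that happens: pool.apply_async(enterPoint(...)) calls enterPoint immediately in the main process and only passes its return value to the pool, so nothing runs in parallel, while the original apply_async(enterPoint, args=(...)) form is the right one. The likely reason the original form printed nothing is that a WebDriver object created in the parent cannot be pickled and sent to a worker, and apply_async swallows that error unless you keep the AsyncResult and call .get(). Below is only a sketch that reuses the names from the question (enterPoint's signature changes so the driver is created inside the child process):

from multiprocessing import Pool

def enterPoint(thread, proxy, accounts):
    driver = setProxy(proxy, thread)   # open the browser inside this process
    print('I exist!', thread)
    for account in accounts:
        cEEPE(thread, driver, account)
    driver.quit()

if __name__ == '__main__':
    pool = Pool()
    async_results = []
    for i, accounts in enumerate(master):
        r = pool.apply_async(enterPoint, args=(f'Thread#{i+1}', proxy, accounts))
        async_results.append(r)
    pool.close()
    pool.join()
    for r in async_results:
        r.get()   # re-raises any exception that happened in a worker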
It can be done easily with SeleniumBase, which can multi-thread tests (e.g. -n=3 for 3 threads) and can even set a proxy server (--proxy=USER:PASS@SERVER:PORT).
pip install seleniumbase, then run with python:
from parameterized import parameterized
from seleniumbase import BaseCase
BaseCase.main(__name__, __file__, "-n=3")

class GoogleTests(BaseCase):
    @parameterized.expand(
        [
            ["Download Python", "Download Python", "img.python-logo"],
            ["Wikipedia", "www.wikipedia.org", "img.central-featured-logo"],
            ["SeleniumBase.io Docs", "SeleniumBase", 'img[alt*="SeleniumB"]'],
        ]
    )
    def test_parameterized_google_search(self, search_key, expected_text, img):
        self.open("https://google.com/ncr")
        self.hide_elements("iframe")
        self.type('input[title="Search"]', search_key + "\n")
        self.assert_text(expected_text, "#search")
        self.click('a:contains("%s")' % expected_text)
        self.assert_element(img)
(This example uses parameterized to turn one test into three different ones.) You can also apply the multi-threading to multiple files, etc.

Run Multithreaded Selenium Automation with multiple browser instances

I'm trying to upload files. I have made a function which does this job, and I'm trying to make the program run multithreaded and upload, for example, 10 files at a time instead of 1.
This is my code. I have tried to clean it up, leave only the important part, and clarify the structure. I'm trying to find a way to open more browser instances and upload the files faster.
def upload():
    # DRIVERS
    options = webdriver.ChromeOptions()
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(executable_path=path, options=options)
    # ACTION START
    driver.get(url)
    time.sleep(2)
    # UPLOAD CODE FOR 1 FILE
    # I USE THE VARIABLE N TO LOOP THROUGH THE FILES
    # ALL FILES ARE NAMED FROM 0 UP
    # FILE UPLOADED.. SLEEPING
    time.sleep(1200)
    driver.quit()
    obj = driver.switch_to.alert
    obj.accept

# FILES LIST IN A TEXT FORMAT TO LOOP THROUGH
with open(r'path\for\fileslist') as f:
    lines = f.read().splitlines()

threads = []
for line in lines:
    t = threading.Thread(target=upload, args=(webdriver.Chrome(path),))
    t.start()
    threads.append(t)
    # ADD 1 EACH TIME TO UPLOAD THE NEXT FILE
    N += 1

for thread in threads:
    thread.join()
Your case is quite specific. You should be using the multiprocessing module instead of threading for this purpose. Or, better yet, instead of having multiple Chrome Selenium windows, open 10 tabs with the same URL and start uploading in all of them (sketched below). You can find how to use this functionality here.
If you need a confirmation of the upload afterwards, you might want to use multiprocessing so each process handles a separate tab.
Make sure you have enough CPU and RAM, since Selenium likes to spike both quite often.
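A rough sketch of the tab idea mentioned above. Selenium still drives one tab at a time, so you switch between window handles, but you avoid launching a separate Chrome per file. The url comes from the question; file_paths and the file-input selector are assumptions:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get(url)
for _ in range(9):                        # 9 extra tabs, 10 in total
    driver.execute_script("window.open(arguments[0]);", url)

for handle, file_path in zip(driver.window_handles, file_paths):
    driver.switch_to.window(handle)       # focus that tab
    driver.find_element(By.CSS_SELECTOR, "input[type=file]").send_keys(file_path)

driver.quit()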

Python Multithreading Rest API

I download data over a REST API and wrote a module for it. The download takes, let's say, 10 seconds. During this time, the rest of the script in main and in the module does not run until the download is finished.
How can I change this, e.g. by processing it on another core?
I tried this code but it does not do the trick (same lag). Then I tried to implement this approach and it just gives me errors; I suspect that map does not work with wget.download?
My code from the module:
from multiprocessing.dummy import Pool as ThreadPool
import urllib.parse
#define the needed data
function='TIME_SERIES_INTRADAY_EXTENDED'
symbol='IBM'
interval='1min'
slice='year1month1'
adjusted='true'
apikey= key[0].rstrip()
#create URL
SCHEME = os.environ.get("API_SCHEME", "https")
NETLOC = os.environ.get("API_NETLOC", "www.alphavantage.co") #query?
PATH = os.environ.get("API_PATH","query")
query = urllib.parse.urlencode(dict(function=function, symbol=symbol, interval=interval, slice=slice, adjusted=adjusted, apikey=apikey))
url = urllib.parse.urlunsplit((SCHEME, NETLOC,PATH, query, ''))
#this is my original code to download the data (working but slow and stopping the rest of the script)
wget.download(url, 'C:\\Users\\x\\Desktop\\Tool\\RAWdata\\test.csv')
#this is my attempt to speed things up via multithreading from code
pool = ThreadPool(4)
if __name__ == '__main__':
futures = []
for x in range(1):
futures.append(pool.apply_async(wget.download, url,'C:\\Users\\x\\Desktop\\Tool\\RAWdata\\test.csv']))
# futures is now a list of 10 futures.
for future in futures:
print(future.get())
Any suggestions, or do you see the error I make?
OK, I figured it out, so I will leave it here in case someone else needs it.
I made a module called APIcall which has a function APIcall() that uses wget.download() to download my data.
In main, I create a function (called threaded_APIfunc) which calls the APIcall() function in my APIcall module:
import threading
import APIcall

def threaded_APIfunc():
    APIcall.APIcall(function, symbol, interval, slice, adjusted, apikey)
    print("Data Download complete for ${}".format(symbol))

and then I run threaded_APIfunc within a thread, like so:

threading.Thread(target=threaded_APIfunc).start()
print('Start Downloading Data for ${}'.format(symbol))

What happens is that the .csv file gets downloaded in the background, while the main loop doesn't wait until the download is completed; it runs the code that comes after the threading call right away.
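As a side note, the original ThreadPool attempt can also be made to work; the call that failed needed args passed as a tuple. A sketch, reusing the url and target path from the question:

from multiprocessing.dummy import Pool as ThreadPool
import wget

pool = ThreadPool(4)
future = pool.apply_async(
    wget.download,
    args=(url, 'C:\\Users\\x\\Desktop\\Tool\\RAWdata\\test.csv'),
)
# ... the rest of the script keeps running here ...
result = future.get()   # blocks only when the file is actually needed
pool.close()
pool.join()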

Web Scraping with Multiprocessing in Python Stalls Randomly

I am attempting to scrape a list of urls using a function called data_finder, where a url is the only argument. The list of urls is called urls.
To speed up the process, I am using the multiprocessing package in Python 3 on Windows 10. The code I am using is below:
if __name__ == '__main__':
    multiprocessing.freeze_support()
    p = multiprocessing.Pool(10)
    records = p.map(data_finder, urls)
    p.close()
    p.join()
    print('Successfully exported.')
    with open('test.json', 'w') as outfile:
        json.dump(records, outfile)
The problem I am having is that sometimes the code freezes and is unable to continue, but other times it works as expected. When it does freeze, though, it is usually within the last 10 URLs. Is this a common occurrence, or is there a solution to it?
Have you tried timing the request call to check whether that is what is stalling? From your description of 'sometimes', it looks to me like the network is causing the delay.
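One way to check this is to wrap data_finder with a timer and stream results back as they finish. This is only a sketch; data_finder and urls are the ones from the question, and it assumes data_finder makes an HTTP request somewhere (adding a timeout inside it, e.g. requests.get(url, timeout=30), would stop a hung connection from blocking a worker forever):

import time
import multiprocessing

def timed_data_finder(url):
    start = time.monotonic()
    try:
        return data_finder(url)
    finally:
        print(f'{url} took {time.monotonic() - start:.1f}s')

if __name__ == '__main__':
    with multiprocessing.Pool(10) as p:
        # imap_unordered yields results as they complete, so a few slow URLs
        # at the end no longer hide which worker is stuck
        records = list(p.imap_unordered(timed_data_finder, urls))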

Selenium Multiprocessing Help in Python 3.4

I am in over my head trying to use Selenium to get the number of results for specific searches on a website. Basically, I'd like to make the process run faster. I have code that works by iterating over search terms and then by newspapers and outputs the collected data into a CSV. Currently, this runs to produce 3 search terms x 3 newspapers over 3 years giving me 9 CSVs in about 10 minutes per CSV.
I would like to use multiprocessing to run each search and newspaper combination simultaneously or at least faster. I've tried to follow other examples on here, but have not been able to successfully implement them. Below is my code so far:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import pandas as pd
from multiprocessing import Pool

def websitesearch(search):
    try:
        start = list_of_inputs[0]
        end = list_of_inputs[1]
        newsabbv = list_of_inputs[2]
        directory = list_of_inputs[3]
        os.chdir(directory)
        if search == broad:
            specification = "broad"
            relPapers = newsabbv
        elif search == narrow:
            specification = "narrow"
            relPapers = newsabbv
        elif search == general:
            specification = "allarticles"
            relPapers = newsabbv
        else:
            for newspapers in relPapers:
                # ...rest of code here that gets the data and puts it in a list named all_Data...
                browser.close()
                df = pd.DataFrame(all_Data)
                df.to_csv(filename, index=False)
    except:
        print('error with item')

if __name__ == '__main__':
    # ...Initializing values and things like that go here. This helps with the setup for search...
    # These are things that go into the function
    start = ["January", 2015]
    end = ["August", 2017]
    directory = "STUFF GOES HERE"
    newsabbv = all_news_abbv
    search_list = [narrow, broad, general]
    list_of_inputs = [start, end, newsabbv, directory]
    pool = Pool(processes=4)
    for search in search_list:
        pool.map(websitesearch, search_list)
        print(list_of_inputs)
If I add a print statement in the main() function, it will print, but nothing really ends up happening beyond that. I'd appreciate any and all help. I left out the code that gets the values and puts them into a list, since it's convoluted, but I know it works.
Thanks in advance for any and all help! Let me know if there is more information I can provide.
Isaac
EDIT: I have looked into more help online and realized that I misunderstood the purpose of mapping a list onto a function with pool.map(fn, list). I have updated my code to reflect my current approach, which is still not working. I also moved the initializing values into the main function.
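For what it's worth, the usual shape for pool.map here would be to build one task tuple per search, call map once (not inside a for loop), and create the browser inside the worker so nothing unpicklable crosses process boundaries. A sketch that mirrors the names from the question (the scraping body itself is elided):

from multiprocessing import Pool
from selenium import webdriver

def websitesearch(task):
    search, start, end, newsabbv, directory = task
    browser = webdriver.Chrome()          # each worker owns its own browser
    try:
        ...                               # scraping code from the question goes here
    finally:
        browser.quit()

if __name__ == '__main__':
    tasks = [(search, start, end, newsabbv, directory) for search in search_list]
    with Pool(processes=4) as pool:
        pool.map(websitesearch, tasks)    # one call; the pool spreads the tasks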
I don't think it can be multiprocessing done your way, because Selenium still introduces a queue-like bottleneck (not the queue module).
The reason is that Selenium can only drive one window at a time; it cannot handle several windows or browser tabs at the same time (a limitation of the window_handle feature). That means your multiple processes only parallelize the in-memory data processing that is sent to, or crawled by, Selenium; keeping the Selenium crawling itself in one script file makes Selenium the bottleneck of the whole process.
The best way to get real multiprocessing is:
make a script that uses Selenium to handle the URL it is given, crawl it, and save the result as a file, e.g. crawler.py, and make sure the script has a print command to output the result,
e.g.:
# import all the modules that you need to run selenium
import sys

url = sys.argv[1]  # you will catch the url passed on the command line
driver = ...       # open the browser
driver.get(url)
# just continue the script based on your method
print(...)         # the result that you want
sys.exit(0)
I can't give more explanation here, because this is the main core of the process, and only you understand what you want to do on that website.
Then make another script file that:
a. divides the URLs. Multiprocessing means starting several processes and running them together on all CPU cores, and the best way to do that is to start by dividing the input. In your case that is probably the target URLs (you didn't tell us which website you want to crawl). Every page of the website has a different URL, so just collect all the URLs and divide them into several groups (best practice: your CPU cores - 1),
e.g.:
import multiprocessing as mp
cpucore = int(mp.cpu_count()) - 1
b. send the URLs to be processed by the crawler.py you already made (via subprocess, or another module, e.g. os.system). Make sure you run at most cpucore instances of crawler.py at the same time,
e.g.:
crawler = r'YOUR FILE DIRECTORY\crawler.py'

def devideurl():
    global url1, url2, url3, url4
    # make a script that results in:
    # url1 = group or list of urls
    # url2 = group or list of urls
    # url3 = group or list of urls
    # url4 = group or list of urls

def target1():
    for url in url1:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your needs...
        # do you see the combination between the python crawler and the url?
        # the cmd command will be: python crawler.py "value", and the "value"
        # is captured by the sys.argv[1] command in crawler.py

def target2():
    for url in url2:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your needs...

def target3():
    for url in url3:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your needs...

def target4():
    for url in url4:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script, based on your needs...

cpucore = int(mp.cpu_count()) - 1
pool = Pool(processes=cpucore)  # max is the value of cpucore
for search in search_list:
    pool.map(target1, devideurl)
    pool.map(target2, devideurl)
    pool.map(target3, devideurl)
    pool.map(target4, devideurl)
    # you can add more, depending on your cpu cores
c. get the printed result back into the memory of the main script,
d. continue your script to process the data that you already got.
And last, make the multiprocessing script for the whole process in the main script.
With this method, you can open many browser windows and handle them at the same time; and because the data crawled from the website arrives more slowly than data can be processed in memory, this method at least reduces the bottleneck in the data flow. That means it is faster than your previous method.
Hopefully helpful... cheers
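A condensed sketch of the crawler.py + subprocess idea described above, using a process pool instead of hand-written target1..target4 functions; the crawler path and the output format are assumptions:

import multiprocessing as mp
import subprocess
import sys

CRAWLER = r'YOUR FILE DIRECTORY\crawler.py'

def crawl(url):
    # one python process per URL; crawler.py prints its result to stdout
    completed = subprocess.run(
        [sys.executable, CRAWLER, url],
        capture_output=True, text=True,
    )
    return url, completed.stdout

if __name__ == '__main__':
    urls = []                              # fill with the pages you want to crawl
    workers = max(1, mp.cpu_count() - 1)
    with mp.Pool(processes=workers) as pool:
        for url, output in pool.imap_unordered(crawl, urls):
            print(url, output)             # continue processing in memory here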
