I have an asyncio-based crawler that occasionally offloads crawling that requires the browser to a ThreadPoolExecutor, as follows:
import concurrent.futures

from selenium import webdriver

def browserfetch(url):
    browser = webdriver.Chrome()
    browser.get(url)
    # Some explicit wait stuff that can take up to 20 seconds.
    return browser.page_source

async def fetch(url, loop):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, browserfetch, url)
    return result
My issue is that I believe this respawns the headless browser each time I call fetch, which incurs browser startup time on each call to webdriver.Chrome. Is there a way for me to refactor browserfetch or fetch so that the same headless driver can be used on multiple fetch calls?
What have I tried?
I've considered more explicit use of threads/pools to start the Chrome instance in a separate thread/process, communicating with it from the fetch call via queues, pipes, etc. (all run in executors to keep the calls from blocking). I'm not sure how to make this work, though.
I believe that starting browsers in separate processes and communicating with them via queues is a good approach (and more scalable). The pseudo-code might look like this:
# worker.py
def entrypoint(in_queue, out_queue):  # run in process
    crawler = Crawler()
    browser = Browser()
    while not stop:
        command = in_queue.get()
        result = crawler.process(command, browser)
        out_queue.put(result)

# main.py
import worker

in_queue, out_queue = Process(worker.entrypoint)
while not stop:
    in_queue.put(new_task)
    result = out_queue.get()
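A more concrete sketch of that pseudo-code (illustrative only: it assumes webdriver.Chrome for the browser, plain URLs as tasks, and a None sentinel to stop the worker):

import multiprocessing as mp

from selenium import webdriver

def entrypoint(in_queue, out_queue):
    # One long-lived browser per worker process, reused for every URL.
    browser = webdriver.Chrome()
    try:
        while True:
            url = in_queue.get()
            if url is None:  # sentinel: shut the worker down
                break
            browser.get(url)
            out_queue.put((url, browser.page_source))
    finally:
        browser.quit()

if __name__ == '__main__':
    in_queue, out_queue = mp.Queue(), mp.Queue()
    worker = mp.Process(target=entrypoint, args=(in_queue, out_queue))
    worker.start()

    in_queue.put('https://example.com')
    url, html = out_queue.get()

    in_queue.put(None)  # stop the worker
    worker.join()

On the asyncio side, the blocking out_queue.get() would still be wrapped in loop.run_in_executor so that fetch does not block the event loop.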
Related
I am using ThreadPoolExecutor to get a lot of requests from websites quickly, but sometimes, maybe 1 in 5 times, ThreadPoolExecutor finishes running all of the thread functions and then just freezes without moving on to the rest of my code. I need this to be reliable for a project I'm working on.
from concurrent.futures import ThreadPoolExecutor
import ballotpedialinks as bl

data = [[link, 0], [link, 1], [link, 2] ... [link, 500]]

def threadFunction(data):
    page = data[0]
    counter = data[1]
    a = bl.checkLink(page)
    print(a[0])
    if a[0] == '':
        links = bl.generateNewLinks(page, state)
        for link in links:
            a = bl.checkLink(link)
            if a[0] != '':
                print(f'{a[0]} is a fixed link')
                break

def quickRun(threads):
    with ThreadPoolExecutor(threads) as pool:
        pool.map(threadFunction, data[0:-1])

quickRun(32)
print('scraper complete')
This is basically what I'm doing, but the thread function is sending requests to websites. The executor finishes all the tasks I give it, but sometimes it just freezes once it's done. Is there anything I can do to make the executor not freeze?
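One thing that may help, sketched under the assumption that the hang comes from HTTP calls inside bl.checkLink that have no timeout (threadFunction and data are the ones from the snippet above; the 60-second deadline is arbitrary): submit the tasks individually and give each future a deadline, so a hung task gets reported instead of silently blocking forever.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

def quickRun(threads):
    pool = ThreadPoolExecutor(threads)
    futures = {pool.submit(threadFunction, item): item for item in data}
    for future, item in futures.items():
        try:
            future.result(timeout=60)  # per-task deadline
        except TimeoutError:
            print(f'{item[0]} timed out')
        except Exception as exc:
            print(f'{item[0]} raised {exc!r}')
    pool.shutdown(wait=False)  # don't wait here for stragglers

Note that a timed-out future's thread keeps running in the background (and can still keep the process alive at exit), so the more reliable fix is usually to pass a timeout to the underlying requests inside bl.checkLink.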
I found this simple example demonstrating how to use threading to parallelize opening multiple chrome sessions with selenium.
from selenium import webdriver
import threading
import time

def test_logic():
    driver = webdriver.Chrome()
    url = 'https://www.google.de'
    driver.get(url)
    # Implement your test logic
    time.sleep(2)
    driver.quit()

N = 5  # Number of browsers to spawn
thread_list = list()

# Start test
for i in range(N):
    t = threading.Thread(name='Test {}'.format(i), target=test_logic)
    t.start()
    time.sleep(1)
    print(t.name + ' started!')
    thread_list.append(t)

# Wait for all threads to complete
for thread in thread_list:
    thread.join()

print('Test completed!')
I tested it and it works. However, if I modify the test_logic function to take a variable, i.e. j:
def test_logic(j):
    driver = webdriver.Chrome()
    url = 'https://www.google.de'
    driver.get(url)
    # Implement your test logic
    time.sleep(j)
    driver.quit()
and the corresponding part of threading to:
t = threading.Thread(name='Test {}'.format(i), target=test_logic(i))
the code stops working in parallel and just runs sequentially.
I don't know what I might have overlooked and would therefore be very grateful if anybody can give me some advice. Many thanks!
target=test_logic(i) invokes the function test_logic right there and gives its return value to the thread, so the browser work happens in the calling thread before the new thread even starts (and the thread's target ends up being None).
You may want to do:
t = threading.Thread(name='Test {}'.format(i), target=test_logic, args=[i])
where target is the name of the function, and args is the arguments list for the function.
If your function has 2 args, like def test_logic(a, b), then args should contain 2 values.
More info in Python Thread Documentation
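For instance, with a two-argument function the call would look like this (a small sketch; the url parameter is made up purely for illustration):

def test_logic(j, url):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(j)
    driver.quit()

# The new thread itself calls test_logic(i, url); nothing runs in the main thread here.
t = threading.Thread(name='Test {}'.format(i),
                     target=test_logic,
                     args=(i, 'https://www.google.de'))
t.start()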
You have to pass the arguments to the function as below:
t = threading.Thread(name='Test {}'.format(i), target=test_logic, args=(i,))
I'm trying to build a multithreading Selenium scraper. Let's say I want to get 100,000 websites and print their page sources, using 20 ChromeDriver instances. So far, I have the following code:
from queue import Queue
from threading import Thread
from selenium import webdriver
from numpy.random import randint

selenium_data_queue = Queue()
worker_queue = Queue()

# Start 20 ChromeDriver instances
worker_ids = list(range(20))
selenium_workers = {i: webdriver.Chrome() for i in worker_ids}
for worker_id in worker_ids:
    worker_queue.put(worker_id)

def selenium_task(worker, data):
    # Open website
    worker.get(data)
    # Print website page source
    print(worker.page_source)

def selenium_queue_listener(data_queue, worker_queue):
    while True:
        url = data_queue.get()
        worker_id = worker_queue.get()
        worker = selenium_workers[worker_id]
        # Assign current worker and url to your selenium function
        selenium_task(worker, url)
        # Put the worker back into the worker queue as it has completed its task
        worker_queue.put(worker_id)
        data_queue.task_done()
    return

if __name__ == '__main__':
    selenium_processes = [Thread(target=selenium_queue_listener,
                                 args=(selenium_data_queue, worker_queue))
                          for _ in worker_ids]
    for p in selenium_processes:
        p.daemon = True
        p.start()

    # Adding urls indefinitely to data queue
    # Generating random url just for testing
    for i in range(100000):
        d = f'http://www.website.com/{i}'
        selenium_data_queue.put(d)

    # Wait for all selenium queue listening processes to complete
    selenium_data_queue.join()

    # Tearing down web workers
    for b in selenium_workers.values():
        b.quit()
My question is: if any ChromeDriver shuts down abruptly (i.e. with a non-recoverable exception like InvalidSessionIdException), is there a way to remove it from the worker queue and insert a new ChromeDriver in its place, so that I still have 20 usable instances? If so, is there a good practice to accomplish this?
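One possible approach (a sketch, not a tested solution): have the listener treat a dead session as a signal to replace that worker before returning its id to the queue. InvalidSessionIdException is a subclass of WebDriverException, so catching the latter also covers other fatal driver errors. This reuses selenium_workers, selenium_task and webdriver from the snippet above.

from selenium.common.exceptions import WebDriverException

def selenium_queue_listener(data_queue, worker_queue):
    while True:
        url = data_queue.get()
        worker_id = worker_queue.get()
        worker = selenium_workers[worker_id]
        try:
            selenium_task(worker, url)
        except WebDriverException:
            # The driver is unusable: discard it and start a fresh one under the same id.
            try:
                worker.quit()
            except WebDriverException:
                pass
            selenium_workers[worker_id] = webdriver.Chrome()
        finally:
            # The id (now backed by a live driver) goes back into rotation.
            worker_queue.put(worker_id)
            data_queue.task_done()

If a failed url should be retried rather than dropped, it could also be put back with data_queue.put(url) inside the except block.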
I'm trying to scrape some urls with Scrapy and Selenium.
Some of the urls are processed by Scrapy directly and the others are handled with Selenium first.
The problem is: while Selenium is handling a url, Scrapy is not processing the others in parallel. It waits for the webdriver to finish its work.
I have tried to run multiple spiders with different init parameters in separate processes (using a multiprocessing pool), but I got twisted.internet.error.ReactorNotRestartable. I also tried to spawn another process in the parse method, but it seems that I don't have enough experience to make it right.
In the example below all the urls are printed only when the webdriver is closed. Please advise, is there any way to make it run "in parallel"?
import time

import scrapy
from selenium.webdriver import Firefox

def load_with_selenium(url):
    with Firefox() as driver:
        driver.get(url)
        time.sleep(10)  # Do something
        page = driver.page_source
    return page

class TestSpider(scrapy.Spider):
    name = 'test_spider'
    tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
             {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

    def start_requests(self):
        for task in self.tasks:
            yield scrapy.Request(url=task['start_url'], callback=self.parse, meta=task)

    def parse(self, response):
        if response.meta['selenium']:
            response = response.replace(body=load_with_selenium(response.meta['start_url']))
        for url in response.xpath('//a/@href').getall():
            print(url)
It seems that I've found a solution.
I decided to use multiprocessing, running one spider in each process and passing a task as its init parameter. In some cases this approach may be inappropriate, but it works for me.
I tried this way before but I was getting the twisted.internet.error.ReactorNotRestartable exception. It was caused by calling the start() method of the CrawlerProcess in each process multiple times, which is incorrect. Here I found a simple and clear example of running a spider in a loop using callbacks.
So I split my tasks list between the processes. Then inside the crawl(tasks) method I make a chain of callbacks to run my spider multiple times passing a different task as its init parameter every time.
import multiprocessing

import numpy as np
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

def crawl(tasks):
    process = CrawlerProcess(get_project_settings())

    def run_spider(_, index=0):
        if index < len(tasks):
            deferred = process.crawl('test_spider', task=tasks[index])
            deferred.addCallback(run_spider, index + 1)
            return deferred

    run_spider(None)
    process.start()

def main():
    processes = 2
    with multiprocessing.Pool(processes) as pool:
        pool.map(crawl, np.array_split(tasks, processes))

if __name__ == '__main__':
    main()
The code of TestSpider in my question post must be modified accordingly to accept a task as an init parameter.
def __init__(self, task):
    scrapy.Spider.__init__(self)
    self.task = task

def start_requests(self):
    yield scrapy.Request(url=self.task['start_url'], callback=self.parse, meta=self.task)
I'm trying to deal with the creation of a webdriver timing out (which happens once in a while, as covered here). I can't use a signal-based timeout because my server is running on Windows, so I've been trying to find an alternative.
I looked at the timeout from eventlet but I don't think that will cut it. A time.sleep(10000) doesn't trigger the timeout so I don't think the timeout itself would.
What I'm thinking is calling a thread to create and return the browser and then setting a join timeout. So something like:
def SpawnPhantomJS(dcap, service_args):
    browser = webdriver.PhantomJS('C:\phantomjs.exe', desired_capabilities=dcap, service_args=service_args)
    print "browser made!"
    return browser

proxywrite = '--proxy=' + nextproxy
service_args = [
    proxywrite,
    '--proxy-type=http',
    '--ignore-ssl-errors=true',
]

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (nextuseragent)

newDriver = Thread(target=SpawnPhantomJS, args=[dcap, service_args]).start().join(20)
So I'm having some issues with the syntax on how to do this properly; in theory it should work. If the creation stalls, the SpawnPhantomJS thread will stall, not the main one, so the timeout on join should help it move on.
Is this possible though? Can I create a webdriver in a thread and return it? Any pointers appreciated.
Updates:
Just calling the function directly returned a working driver, so that bodes well for what I'm trying to do:
newDriver = SpawnPhantomJS(dcap, service_args)
So I'm hoping it's just a syntax issue I have running this as a thread with a timeout.
This didn't do it however:
spawnthread = Thread(target=SpawnPhantomJS, args=[dcap, service_args])
spawnthread.start()
newDriver = spawnthread.join()
Wishful thinking there.
Thread pooling.
from multiprocessing.pool import ThreadPool
pool = ThreadPool(processes=1)
async_result = pool.apply_async(SpawnPhantomJS, (dcap, service_args))
newDriver = async_result.get(10)
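One caveat worth noting (a sketch; SpawnPhantomJS, dcap and service_args are from the question above): async_result.get(10) raises multiprocessing.TimeoutError if the browser has not been created within 10 seconds, so it probably wants a wrapper along these lines:

from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool

pool = ThreadPool(processes=1)
async_result = pool.apply_async(SpawnPhantomJS, (dcap, service_args))
try:
    newDriver = async_result.get(10)
except TimeoutError:
    # Creation stalled; the worker thread is still stuck on it, but the main
    # thread is free to retry (e.g. with a different proxy) instead of hanging.
    newDriver = None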