I am using selenium and Python for a big project. I have to go through 320,000 webpages (320K) one by one, scrape details, sleep for a second, and move on.
Like below:
import time
from selenium import webdriver

browser = webdriver.Chrome()
links = ["https://www.thissite.com/page=1", "https://www.thissite.com/page=2", "https://www.thissite.com/page=3"]

for link in links:
    browser.get(link)
    scrapedinfo = browser.find_element_by_xpath("*//div/productprice").text
    with open("file.csv", "a+") as f:
        f.write(scrapedinfo + "\n")
    time.sleep(1)
The biggest problem: it is too slow!
With this script it will take days or maybe weeks.
Is there a way to increase the speed, for example by visiting multiple links at the same time and scraping them all at once?
I have spent hours searching on Google and Stack Overflow and only found material about multiprocessing.
But I am unable to apply it in my script.
Threading approach
You should start with threading.Thread; it will give you a considerable performance boost (explained here). Threads are also lighter than processes. You can use a futures.ThreadPoolExecutor, with each thread using its own webdriver. Also consider adding the headless option for your webdriver. Example below using a Chrome webdriver:
from concurrent import futures
from selenium import webdriver

def selenium_work(url):
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    # <actual work that needs to be done by selenium>

# the default number of threads is based on the cpu cores,
# but you can set it with `max_workers`, e.g. `futures.ThreadPoolExecutor(max_workers=...)`
with futures.ThreadPoolExecutor() as executor:
    # store the url for each future in a dict, so we know which one fails
    future_results = {url: executor.submit(selenium_work, url) for url in links}
    for url, future in future_results.items():
        try:
            future.result()  # `timeout` can be used to wait a maximum number of seconds for each thread
        except Exception as exc:  # a thread may raise an exception
            print(f'url {url} generated an exception: {exc}')
Consider also storing the chrome-driver instance initialized on each thread using threading.local(). From here they reported a reasonable performance improvement.
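A minimal sketch of that idea (the thread_local object and the get_driver helper are only illustrative names, not from the linked post):
import threading
from selenium import webdriver

thread_local = threading.local()  # one private storage slot per thread

def get_driver():
    """Reuse the chrome-driver already created for the current thread, if any."""
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(options=chromeOptions)
        thread_local.driver = driver
    return driver

def selenium_work(url):
    driver = get_driver()  # created once per thread, then reused on every call
    driver.get(url)
    # <actual scraping work>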
Consider whether using BeautifulSoup directly on the page source from selenium gives some additional speed-up. It's a very fast and well-established package. Example: something like driver.get(url) ... soup = BeautifulSoup(driver.page_source, "lxml") ... result = soup.find('a')
Other approaches
Although I personally did not see much benefit in using concurrent.futures.ProcessPoolExecutor(), you could experiment with it. In fact, it was slower than threads in my experiments on Windows. Also, on Windows you have many limitations for the Python Process class.
Consider whether your use case can be satisfied by arsenic, an asynchronous webdriver client built on asyncio. It sounds really promising, though it has many limitations.
Consider whether Requests-HTML solves your problem with JavaScript loading, since it claims full JavaScript support. In that case you could use it together with BeautifulSoup following a standard data scraping methodology.
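A rough sketch of that combination, assuming the pages need JavaScript rendering (the selector is only illustrative; render() downloads Chromium the first time it runs):
from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.thissite.com/page=1")
r.html.render()  # executes the page JavaScript
soup = BeautifulSoup(r.html.html, "lxml")
price = soup.find("div", class_="productprice")  # illustrative selector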
You can use parallel execution. Divide the list of sites into, e.g., ten test cases that use the same code, just with different method names (method1, method2, method3, ...). You will increase the speed. The number of browsers depends on your hardware performance.
See more on https://www.guru99.com/sessions-parallel-run-and-dependency-in-selenium.html
The main thing is to use TestNG and edit the .xml file to set how many threads you want to use, like this:
<suite name="TestSuite" thread-count="10" parallel="methods" >
If the website you are scraping is not heavily protected against bots, it is better to use Requests; it will reduce your time from days to a couple of hours, and you can combine multi-threading with multi-processing. The steps are too long to go over in full; here is just the idea:
import concurrent.futures
from multiprocessing import Process

def threader_run(data):
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        for i in data:
            futures.append(executor.submit(scrapper, i))  # `scrapper` is your scraping function
        for future in concurrent.futures.as_completed(futures):
            print(future.result())

data = {}
data['process1'] = []  # fill each list with one chunk of your links
data['process2'] = []
data['process3'] = []

if __name__ == "__main__":
    jobs = []
    for x in data:
        p = Process(target=threader_run, args=(data[x],))
        jobs.append(p)
        p.start()
        print(f'Started - {x}')
Basically, what this does is: first compile all the links, then split them into 3 arrays so 3 processes run simultaneously (you could run more processes depending on your CPU cores and how data-intensive these jobs are). After that, you can split those arrays further, into more than 10 or even 100 chunks depending on your project size; see the splitting sketch below. Each process then runs a thread pool with a maximum of 8 workers, and each worker runs your final function.
Here, with 3 processes and 8 workers, you are looking at roughly a 24x speed boost. However, using the Requests library is necessary; if you use selenium for this, a normal computer/laptop will freeze, because it would mean 24 browsers running simultaneously.
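A minimal sketch of the splitting step described above, assuming links holds the full list of URLs (the helper name and the chunk count are illustrative):
import math

def split_into_chunks(links, n_chunks):
    """Split the full link list into n_chunks roughly equal parts."""
    chunk_size = math.ceil(len(links) / n_chunks)
    return [links[i:i + chunk_size] for i in range(0, len(links), chunk_size)]

# build the `data` dict used above, one entry per process
chunks = split_into_chunks(links, 3)
data = {f'process{i + 1}': chunk for i, chunk in enumerate(chunks)}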
Related
I am using selenium and chrome-driver to scrape data from some pages and then run some additional tasks with that information (for example, typing some comments on some pages).
My program has a button. Every time it's pressed it calls thread_(self) (below), starting a new thread. The target function self.main contains the code that runs all the selenium work on a chrome-driver.
def thread_(self):
    th = threading.Thread(target=self.main)
    th.start()
My problem is that after the user presses the button the first time, this th thread will open browser A and do some stuff. While browser A is doing its work, the user will press the button again and open browser B, which runs the same self.main. I want each opened browser to run simultaneously. The problem I face is that when I run that thread function, the first browser stops and the second browser opens.
I know my code can create threads infinitely, and I know that this will affect PC performance, but I am OK with that. I want to speed up the work done by self.main!
Threading for selenium speed up
Consider the following functions to exemplify how threads with selenium give some speed-up compared to a single-driver approach. The code below scrapes the html title from a page opened by selenium, using BeautifulSoup. The list of pages is links.
import time
import threading

from bs4 import BeautifulSoup
from selenium import webdriver

def create_driver():
    """returns a new chrome webdriver"""
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")  # make it not visible; comment this out if you like seeing the opened browsers
    return webdriver.Chrome(options=chromeOptions)

def get_title(url, driver=None):
    """get the url html title using BeautifulSoup
    if driver is None, creates a new chrome-driver and quit()s it after,
    otherwise uses the driver provided and doesn't quit() it after"""
    def print_title(driver):
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")
        item = soup.find('title')
        print(item.string.strip())

    if driver:
        print_title(driver)
    else:
        driver = create_driver()
        print_title(driver)
        driver.quit()
links = ["https://www.amazon.com", "https://www.google.com", "https://www.youtube.com/", "https://www.facebook.com/", "https://www.wikipedia.org/",
"https://us.yahoo.com/?p=us", "https://www.instagram.com/", "https://www.globo.com/", "https://outlook.live.com/owa/"]
Now calling get_title on the links above.
Sequential approach
A single chrome-driver, passing all links sequentially. It takes 22.3 s on my machine (note: Windows).
start_time = time.time()
driver = create_driver()

for link in links:  # could be 'like' clicks
    get_title(link, driver)

driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")
Multiple threads approach
Using one thread for each link. Results in 10.5 s, more than 2x faster.
start_time = time.time()
threads = []

for link in links:  # each thread could be like a new 'click'
    th = threading.Thread(target=get_title, args=(link,))
    th.start()  # could `time.sleep` between 'clicks' to see what's up without the headless option
    threads.append(th)

for th in threads:
    th.join()  # main thread waits for the threads to finish

print("multiple threads took ", (time.time() - start_time), " seconds")
This and this better one are some other working examples. The second uses a fixed number of threads in a ThreadPool, and suggests that storing the chrome-driver instance initialized on each thread is faster than creating and starting it every time.
Still, I was not sure this was the optimal approach to get considerable speed-ups with selenium, since threading code that is not I/O bound ends up executing sequentially (one thread after another). Due to the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel (utilize multiple CPU cores).
Processes for selenium speed up
To try to overcome the Python GIL limitation, I wrote the following code using the multiprocessing package and its Process class and ran multiple tests. I even added random page hyperlink clicks to the get_title function above. The additional code is here.
import multiprocessing

start_time = time.time()
processes = []

for link in links:  # each process a new 'click'
    ps = multiprocessing.Process(target=get_title, args=(link,))
    ps.start()  # could sleep between 'clicks' with `time.sleep(1)`
    processes.append(ps)

for ps in processes:
    ps.join()  # main process waits for the processes to finish

print("multiple processes took ", (time.time() - start_time), " seconds")
Contrary to what I expected, Python multiprocessing.Process-based parallelism for selenium was on average around 8% slower than threading.Thread. But obviously both were on average more than twice as fast as the sequential approach. I just found out that selenium chrome-driver commands use HTTP requests (like POST, GET), so they are I/O bound and therefore release the Python GIL, which indeed makes the threads run in parallel.
Threading is a good start for selenium speed-up
This is not a definitive answer, as my tests were only a tiny example. Also, I'm using Windows, and multiprocessing has many limitations in this case. Each new Process is not a fork like on Linux, which means, among other downsides, that a lot of memory is wasted.
Taking all that into account: it seems that, depending on the use case, threads may be as good as or better than trying the heavier approach of processes (especially for Windows users).
try this:
def thread_(self):
    th = threading.Thread(target=self.main)
    self.jobs.append(th)
    th.start()
info: https://pymotw.com/2/threading/
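A minimal sketch of that idea, assuming each call of self.main creates its own driver so the opened browsers do not interfere with each other (the class name and the URL are placeholders):
import threading
from selenium import webdriver

class Scraper:
    def __init__(self):
        self.jobs = []  # keep references to the started threads

    def main(self):
        driver = webdriver.Chrome()  # one driver per thread, so each button press gets its own browser
        try:
            driver.get("https://www.example.com")  # placeholder work
            # <selenium work for this browser>
        finally:
            driver.quit()

    def thread_(self):
        th = threading.Thread(target=self.main)
        self.jobs.append(th)
        th.start()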
I'm trying to send simultaneous get requests with the Python requests module.
While searching for a solution I've come across lots of different approaches, including grequests, gevent.monkey, requests-futures, threading, multiprocessing...
I'm a little overwhelmed and not sure which one to pick, regarding speed and code readability.
The task is to download fewer than 400 files as fast as possible, all from the same server. Ideally it should output the status of the downloads in the terminal, e.g. print an error or success message per request.
import threading
import requests

def download(webpage):
    requests.get(webpage)
    # Whatever else you need to do to download your resource, put it in here

urls = ['https://www.example.com', 'https://www.google.com', 'https://yahoo.com']  # Populate with resources you wish to download
threads = {}

if __name__ == '__main__':
    for i in urls:
        print(i)
        threads[i] = threading.Thread(target=download, args=(i,))
    for i in threads:
        threads[i].start()
    for i in threads:
        threads[i].join()
    print('successfully done.')
The code above contains a function called download that represents whatever code you have to run to download the resource you're after. Then a list is made, populated with the URLs you wish to download; change these values as you please. These are assembled into a dictionary that holds the threads, so that you can have as many URLs in the list as you want, with a separate thread made for each of them. The threads are each started, then joined.
I would use threading, as it is not necessary to run the downloads on multiple cores like multiprocessing does.
So write a function with requests.get() in it and then start it as a thread.
But remember that your internet connection has to be fast enough, otherwise it won't be worth it.
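A short sketch of that idea which also prints a success or error message per request, as the question asks (the URLs and worker count are placeholders):
import concurrent.futures
import requests

def download(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise if the server returned an error status
    return url, len(response.content)

urls = ['https://www.example.com/file1', 'https://www.example.com/file2']  # placeholders

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(download, url): url for url in urls}
    for future in concurrent.futures.as_completed(futures):
        url = futures[future]
        try:
            _, size = future.result()
            print(f'success: {url} ({size} bytes)')
        except Exception as exc:
            print(f'error: {url} -> {exc}')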
I want to scrape data from a webpage in a more efficient way. I read about concurrent.futures but I have no idea how to use it in my script.
My function to take data from each link takes four arguments:
def scrape_data_for_offer(b, m, url, loc):
It then saves the scraped data to a pandas data frame.
It's called in a loop:
for link, location in cars_link_dict.items():
    scrape_data_for_offer(brand, model, link, location)
I want to speed up this scraping process.
I tried to solve it like this:
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    executor.map(scrape_data_for_offer, brand, model, cars_link_dict.items())
But it doesn't work, do you have any ideas of how to solve this problem?
In your futures case, you're only passing three arguments to map, and the last one yields two-element (url, location) tuples. So, change your function to:
def scrape_data_for_offer(b, m, info):
    url, loc = info
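For example, assuming brand and model are single values that apply to every link, the map call could pair them with each (url, location) item using itertools.repeat; this is only a sketch, not your exact code:
import concurrent.futures
import itertools

def scrape_data_for_offer(b, m, info):
    url, loc = info
    # ... scrape the offer and store it in the data frame ...

with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    executor.map(scrape_data_for_offer,
                 itertools.repeat(brand),   # same brand for every item
                 itertools.repeat(model),   # same model for every item
                 cars_link_dict.items())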
By the way, the words are "scrape", "scraped" and "scraping". Many, many, many people write "scrap", "scrapped" and "scrapping", but those words all refer to throwing things away.
As another aside, the concurrent stuff is not really going to help you much. I assume you are using BeautifulSoup for the scraping. BeautifulSoup is all Python code, and the Global Interpreter Lock means that only one of the threads will be able to execute at any given time. You'll get a small amount of overlap while waiting for the web site responses to be delivered.
Also, running 50 workers is pointless unless you have 50 processors; they'll all fight for resources. If you have 8 processors, use about 12 workers. In most cases you should just leave off that parameter; it will default to a value based on the number of processors in your machine.
I'm new to Python and I have a basic question, but I'm struggling to find an answer online because a lot of the examples seem to refer to deprecated APIs. Sorry if this has been asked before.
I'm looking for a way to execute multiple (similar) web requests in parallel, and retrieve the result in a list.
The synchronous version I have right now is something like:
from requests import get  # assuming `get` is requests.get

urls = ['http://example1.org', 'http://example2.org', '...']

def getResult(urls):
    result = []
    for url in urls:
        result.append(get(url).json())
    return result
I'm looking for the asynchronous equivalent (where all the requests are made in parallel, but I then wait for all of them to be finished before returning the global result).
From what I saw I have to use async/await and aiohttp but the examples seemed way too complicated for the simple task I'm looking for.
Thanks
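For reference, a minimal sketch of the aiohttp/asyncio approach mentioned in the question (the URLs are the question's placeholders):
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.json()

async def get_results(urls):
    async with aiohttp.ClientSession() as session:
        # run all requests concurrently and wait for all of them to finish
        return await asyncio.gather(*(fetch(session, url) for url in urls))

urls = ['http://example1.org', 'http://example2.org']
results = asyncio.run(get_results(urls))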
I am going to try to explain the simplest possible way to achieve what you want. I'm sure there are cleaner/better ways to do this, but here it goes.
You can perform what you want using the Python threading library. You can use it to create a separate thread for each request, run all the threads concurrently, and collect the answers.
Since you are new to Python, to simplify things further I am using a global list called RESULTS to store the results of get(url) rather than returning them from the function.
import threading
from requests import get  # assuming `get` is requests.get, as in the question

RESULTS = []  # List to store the results

# Request a single url and store the result in the global RESULTS
def getSingleResult(url):
    global RESULTS
    RESULTS.append((url, get(url).json()))

# Your original function
def getResult(urls):
    ths = []
    for url in urls:
        th = threading.Thread(target=getSingleResult, args=(url,))  # Create a thread
        th.start()  # Start it
        ths.append(th)  # Add it to a thread list
    for th in ths:
        th.join()  # Wait for all threads to finish
The usage of the global RESULTS list is to make things easier, rather than collecting results from the threads directly. If you wish to do that, you can check out this answer: How to get the return value from a thread in python?
Of course, one thing to note is that multi-threading in Python doesn't provide true parallelism but rather concurrency, especially if you are using the standard Python implementation, due to what is known as the Global Interpreter Lock.
However, for your use case it would still provide the speed-up you need.
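If you prefer to avoid the global list, a small sketch with concurrent.futures collects the results directly (again assuming get is requests.get):
import concurrent.futures
from requests import get

def get_single_result(url):
    return url, get(url).json()

def get_results(urls):
    # map preserves the order of `urls` and returns the results when all threads are done
    with concurrent.futures.ThreadPoolExecutor() as executor:
        return list(executor.map(get_single_result, urls))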
Say I have a list of 1000 unique URLs, and I need to open each one and assert that something on the page is there. Doing this sequentially is obviously a poor choice, as most of the time the program will be sitting idle just waiting for a response. So I added a thread pool where each worker reads from a main Queue and opens a URL to do a check. My question is, how big do I make the pool? Is it based on my network bandwidth, or some other metric? Are there any rules of thumb for this, or is it simply trial and error to find an effective size?
This is more of a theoretical question, but here's the basic outline of the code I'm using.
if __name__ == '__main__':
    # get the stuff I've already checked
    ID = 0
    already_checked = [i[ID] for i in load_csv('already_checked.csv')]

    # make sure I don't duplicate the effort
    to_check = load_csv('urls_to_check.csv')
    links = [url[:3] for url in to_check if url[ID] not in already_checked]

    in_queue = Queue.Queue()
    out_queue = Queue.Queue()

    threads = []
    for i in range(5):
        t = SubProcessor(in_queue, out_queue)
        t.setDaemon(True)
        t.start()
        threads.append(t)

    writer = Writer(out_queue)
    writer.setDaemon(True)
    writer.start()

    for link in links:
        in_queue.put(link)
Your best bet is probably to write some code that runs some tests using the number of threads you specify, and see how many threads produce the best result. There are too many variables (speed of processor, speed of the buses, thread overhead, number of cores, and the nature of the code itself) for us to hazard a guess.
My experience (using .NET, but it should apply to any language) is that DNS resolution ends up being the limiting factor. I found that a maximum of 15 to 20 concurrent requests is all that I could sustain. DNS resolution is typically very fast, but sometimes can take hundreds of milliseconds. Without some custom DNS caching or other way to quickly do the resolution, I found that it averages about 50 ms.
If you can do multi-threaded DNS resolution, 100 or more concurrent requests is certainly possible on modern hardware (a quad-core machine). How your OS handles that many individual threads is another question entirely. But, as you say, those threads are mostly doing nothing but waiting for responses. The other consideration is how much work those threads are doing. If it's just downloading a page and looking for something specific, 100 threads is probably well within the bounds of reason. Provided that "looking" doesn't involve much more than just parsing an HTML page.
Other considerations involve the total number of unique domains you're accessing. If those 1,000 unique URLs are all from different domains (i.e. 1,000 unique domains), then you have a worst-case scenario: every request will require a DNS resolution (a cache miss).
If those 1,000 URLs represent only 100 domains, then you'll only have 100 cache misses. Provided that your machine's DNS cache is reasonable. However, you have another problem: hitting the same server with multiple concurrent requests. Some servers will be very unhappy if you make many (sometimes "many" is defined as "two or more") concurrent requests. Or too many requests over a short period of time. So you might have to write code to prevent multiple or more-than-X concurrent requests to the same server. It can get complicated.
One simple way to prevent the multiple-requests problem is to sort the URLs by domain and then ensure that all the URLs from the same domain are handled by the same thread. This is less than ideal from a performance perspective, because you'll often find that one or two domains have many more URLs than the others, and you'll end up with most of the threads finished while those few are still plugging away at their very busy domains. You can alleviate these problems by examining your data and assigning the threads' work items accordingly.
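A minimal sketch of that grouping step, using urllib.parse to key the URLs by domain (the helper name is illustrative):
from collections import defaultdict
from urllib.parse import urlparse

def group_by_domain(urls):
    """Group URLs so that all URLs from one domain end up in the same work item."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return groups

# each work item put on the queue is now a whole domain's worth of URLs,
# so one thread handles all requests to a given server:
# for url_list in group_by_domain(links).values():
#     in_queue.put(url_list)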