Multiprocessing beautifulsoup4 function to increase performance - python

As is:
I built a function that takes a URL as an argument, scrapes the page, and puts the parsed info into a list. I also have a list of URLs, and I map that list to the parser function, iterating through each URL. The issue is that I have around 7000-8000 links, so parsing them one by one takes a long time. This is the current iterative solution:
import itertools as it

mapped_parse_links = map(parse, my_new_list)
all_parsed = list(it.chain.from_iterable(mapped_parse_links))
'parse' is the scraper function and 'my_new_list' is the list of URLs.
To be:
I want to implement multiprocessing so that, instead of iterating through the list of URLs one at a time, multiple CPUs pick up links at the same time and parse them with the parse function. I tried the following:
import multiprocessing

with multiprocessing.Pool() as p:
    mapped_parse_links = p.map(parse, my_new_list)

all_parsed = list(it.chain.from_iterable(mapped_parse_links))
I tried different variations using Pool as well; however, all of them run forever. Can someone give me pointers on how to solve this?
Thanks.

Taken, with minor alterations, from the docs for concurrent.futures:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

if __name__ == '__main__':
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
                # Do something with the scraped data here
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
You will have to substitute your parse function in for load_url.
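For instance, here is a minimal sketch (my own adaptation, not part of the original answer) of how the asker's parse and my_new_list could be dropped into this pattern while keeping the flattening step from the question; the worker count of 20 is an arbitrary illustration:

import concurrent.futures
import itertools as it

# parse and my_new_list are assumed to exist exactly as in the question
if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        # executor.map preserves input order, just like the built-in map
        mapped_parse_links = executor.map(parse, my_new_list)
        all_parsed = list(it.chain.from_iterable(mapped_parse_links))

Since the work is dominated by network I/O, threads are usually enough here; a ProcessPoolExecutor would only pay off if the parsing itself became the bottleneck.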

Related

How to remove a URL from monitoring while the script is running?

I have written a script that monitors some webpages, and whenever a specific HTML tag is found, it should print a notification. The point is to run the script 24/7, and while it is running, I want to be able to remove a URL. For now I have a database from which I am going to read the URLs that are added/removed.
import threading
import requests
from bs4 import BeautifulSoup

# Replacement for database for now
URLS = [
    'https://github.com/search?q=hello+world',
    'https://github.com/search?q=python+3',
    'https://github.com/search?q=world',
    'https://github.com/search?q=i+love+python',
]

def doRequest(url):
    while True:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)

def sendNotifications(data):
    ...

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    for url in URLS:
        threading.Thread(target=doRequest, args=(url,)).start()
The current problem I'm facing is that doRequest sits in a while loop that runs all the time, and I wonder how I can remove a specific URL (e.g. https://github.com/search?q=world) while the script is running.
Method 1: A simple approach
What you want is to insert some termination logic in the while True loop so that it constantly checks for a termination signal.
To this end, you can use threading.Event().
For example, you can add a stopping_event argument:
def doRequest(url, stopping_event):
    while not stopping_event.is_set():
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)
And you create these events when starting the threads:
if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    stopping_events = {url: threading.Event() for url in URLS}
    for url in URLS:
        threading.Thread(target=doRequest, args=(url, stopping_events[url])).start()
Whenever you want to stop/remove a particular url, you can just call
stopping_events[url].set()
That particular while loop will stop and exit.
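For the concrete example from the question, stopping https://github.com/search?q=world would look like this:

stopping_events['https://github.com/search?q=world'].set()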
You can even create a separate thread that waits for user input to stop a particular URL:
def manager(stopping_events):
    while True:
        url = input('url to stop: ')
        if url in stopping_events:
            stopping_events[url].set()

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    stopping_events = {url: threading.Event() for url in URLS}
    for url in URLS:
        threading.Thread(target=doRequest, args=(url, stopping_events[url])).start()
    threading.Thread(target=manager, args=(stopping_events,)).start()
Method 2: A cleaner approach
Instead of having a fixed list of URLs, you can have a thread that keeps reading the list of URLs from the database and feeds it to the processing threads. This is the producer-consumer pattern. Now you don't really remove any URL; you simply keep processing the latest list of URLs from the database. That automatically takes care of newly added/deleted URLs.
import queue
import threading
import requests
from bs4 import BeautifulSoup

def get_urls_from_db(q: queue.Queue):
    while True:
        url_list = ...  # some db read logic
        for url in url_list:  # put newly read URLs into the queue
            q.put(url)        # (a bare map(q.put, url_list) is lazy and would never enqueue anything)

def doRequest(q: queue.Queue):
    while True:
        url = q.get()  # wait for and get a url from the queue
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)

def sendNotifications(data):
    ...

if __name__ == '__main__':
    url_queue = queue.Queue()
    for _ in range(10):  # start 10 worker threads
        threading.Thread(target=doRequest, args=(url_queue,)).start()
    threading.Thread(target=get_urls_from_db, args=(url_queue,)).start()
get_urls_from_db keeps reading URLs from the database and adds the current list to url_queue to be processed.
In doRequest, each iteration of the loop now grabs one url from the url_queue and processes it.
One thing to watch out for is adding URLs faster than they can be processed; the queue length will then grow over time and consume a lot of memory (a bounded queue, sketched below, can guard against this).
This is arguably better since now you do have great control over what URLs to process and have a fixed number of threads.
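To address the memory concern above, one option (my suggestion, not part of the original answer) is to bound the queue, so the database-reading producer blocks instead of letting the backlog grow without limit:

import queue

# q.put() blocks once 100 items are waiting, which naturally throttles
# get_urls_from_db; the value 100 is an arbitrary illustration
url_queue = queue.Queue(maxsize=100)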

Multithreading hangs when used with the requests module and a large number of threads

I have about 2000 URLs that I am trying to scrape using the requests module. To speed up the process, I am using ThreadPoolExecutor from concurrent.futures. The execution hangs partway through when I run this, and the issue is inconsistent too. Sometimes it finishes smoothly within 2 minutes, but other times it gets stuck at one point for over 30 minutes and I eventually have to kill the process.
# scraper.py
import requests

def get_content(url):
    try:
        res = requests.get(url)
        res = res.content
        return res
    except:
        return ""

# main.py
import concurrent.futures
from scraper import get_content

if __name__ == "__main__":
    content = []  # an empty list for output
    # urls > a list of urls
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        results = executor.map(get_content, urls)
        for res in results:
            content.append(res)
    print(content)
I want to understand how to debug this. Why and where is it getting stuck? And also, why is it inconsistent?
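One likely culprit worth checking (an assumption on my part, not something stated in the question) is that requests.get is called without a timeout, so a single stalled connection can block a worker forever. A sketch of a more defensive get_content:

import requests

def get_content(url):
    try:
        # A finite timeout keeps one stalled server from hanging a worker forever;
        # the 10-second value is an arbitrary illustration
        res = requests.get(url, timeout=10)
        return res.content
    except requests.RequestException:
        return ""

A more modest max_workers (a few tens) is also usually enough for I/O-bound scraping; 1000 threads mostly adds scheduling and connection overhead without much benefit.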

Threading: function seems to run as a blocking loop although I am using threading

I am trying to speed up web scraping by running my HTTP requests in a ThreadPoolExecutor from the concurrent.futures library.
Here is the code:
import concurrent.futures
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=ibfxcfd&showcategories=CFD',
    'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=chix_ca',
    'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=tase',
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=chixen-be&showcategories=STK',
    'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=bvme&showcategories=STK'
]

def get_url(url):
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    a = soup.select_one('a')
    print(a)

with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    results = {executor.submit( get_url(url)) : url for url in urls}
    for future in concurrent.futures.as_completed(results):
        try:
            pass
        except Exception as exc:
            print('ERROR for symbol:', results[future])
            print(exc)
However, looking at how the script prints in the CLI, it seems that the requests are sent in a blocking loop.
Additionally, if I run the code as below, I can see that it takes roughly the same time:
for u in urls:
    get_url(u)
I have had some success implementing concurrency with this library before, and I am at a loss as to what is going wrong here.
I am aware of the asyncio library as an alternative, but I would be keen on using threading instead.
You're not actually running your get_url calls as tasks; you call them in the main thread, and pass the result to executor.submit, experiencing the concurrent.futures analog to this problem with raw threading.Thread usage. Change:
results = {executor.submit( get_url(url)) : url for url in urls}
to:
results = {executor.submit(get_url, url) : url for url in urls}
so you pass the function to call and its arguments to the submit call (which then runs them in threads for you) and it should parallelize your code.
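As a side note (my observation, not part of the original answer), the as_completed loop in the question only has pass inside the try, so exceptions raised in the workers are never surfaced; calling future.result() there makes the except clause actually useful:

for future in concurrent.futures.as_completed(results):
    try:
        future.result()  # re-raises any exception from the worker thread
    except Exception as exc:
        print('ERROR for symbol:', results[future])
        print(exc)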

Scraping JSON content from a site ordered in pages

I'm trying to scrape a site. When I run the following code without region_id=[any number from 1 to 32] I get a [500], but if I set region_id=1 I only get the first page by default (in the URL this is pagina=&); pages go up to 500. Is there a command or parameter for retrieving every page (every possible value of pagina=), avoiding for loops?
import requests
url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id=&parent_id=&pagina=&nombre="
resp = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
data = resp.json()
Even without a for loop, you are still going to need iteration. You could do it with recursion or map as I've done below, but the iteration is still there. This solution has the advantage that everything is a generator, so only when you ask for a page's json from all_data will url be formatted, the request made, checked and converted to json. I added a filter to make sure you got a valid response before trying to get the json out. It still makes every request sequentially, but you could replace map with a parallel implementation quite easily.
import requests
from itertools import product, starmap
from functools import partial

def is_valid_resp(resp):
    return resp.status_code == requests.codes.ok

def get_json(resp):
    return resp.json()

# There's a .format hiding on the end of this really long url,
# with {} in appropriate places
url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id={}&parent_id=&pagina={}&nombre=".format

regions = range(1, 33)
pages = range(1, 501)
urls = starmap(url, product(regions, pages))

moz_get = partial(requests.get, headers={'User-Agent': 'Mozilla/5.0'})
responses = map(moz_get, urls)
valid_responses = filter(is_valid_resp, responses)
all_data = map(get_json, valid_responses)
# all_data is a generator that will give you each page's json.
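For example, a minimal sketch of the parallel swap mentioned above, replacing the built-in map with ThreadPoolExecutor.map for the request step (the names moz_get, urls, is_valid_resp and get_json come from the snippet above); the worker count is an arbitrary choice:

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    # executor.map issues the requests concurrently but still yields responses in order
    responses = executor.map(moz_get, urls)
    valid_responses = filter(is_valid_resp, responses)
    all_data = list(map(get_json, valid_responses))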

How do I use multiprocessing to extract links from webpages with Beautiful Soup?

I have a list of links and I create a Beautiful Soup object for each link and scrape all the links within paragraph tags from the page. Because I have hundreds of links I'd like to scrape from, a single process would take more time than I'd like so multiprocessing seems to be the ideal solution.
Here's my code:
import requests
import traceback
from bs4 import BeautifulSoup
from multiprocessing import Process, Queue

urls = ['https://hbr.org/2011/05/the-case-for-executive-assistants','https://signalvnoise.com/posts/3450-when-culture-turns-into-policy']

def collect_links(urls):
    extracted_urls = []
    bsoup_objects = []
    p_tags = []  # store language between paragraph tags in each beautiful soup object
    workers = 4
    processes = []
    links = Queue()  # store links extracted from urls variable
    web_connection = Queue()  # store beautiful soup objects that are created for each url in urls variable
    # dump each url from urls variable into links Queue for all processes to use
    for url in urls:
        links.put(url)
    for w in xrange(workers):
        p = Process(target=create_bsoup_object, args=(links, web_connection))
        p.start()
        processes.append(p)
        links.put('STOP')
    for p in processes:
        p.join()
    web_connection.put('STOP')
    for beaut_soup_object in iter(web_connection.get, 'STOP'):
        p_tags.append(beaut_soup_object.find_all('p'))
    for paragraphs in p_tags:
        bsoup_objects.append(BeautifulSoup(str(paragraphs)))
    for beautiful_soup_object in bsoup_objects:
        for link_tag in beautiful_soup_object.find_all('a'):
            extracted_urls.append(link_tag.get('href'))
    return extracted_urls

def create_bsoup_object(links, web_connection):
    for link in iter(links.get, 'STOP'):
        try:
            web_connection.put(BeautifulSoup(requests.get(link, timeout=3.05).content))
        except requests.exceptions.Timeout as e:
            # client couldn't connect to server or return data in time period specified in timeout parameter in requests.get()
            pass
        except requests.exceptions.ConnectionError as e:
            # in case of faulty url
            pass
        except Exception, err:
            # catch regular errors
            print(traceback.format_exc())
            pass
        except requests.exceptions.HTTPError as e:
            pass
    return True
And when I run collect_links(urls), rather than getting a list of links, I get an empty list with the following error:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 266, in _feed
send(obj)
RuntimeError: maximum recursion depth exceeded while calling a Python object
[]
I'm not sure what that's referring to. I read somewhere that Queues work best with simple objects. Does the size of the beautiful soup objects I'm storing in them have anything to do with this? I would appreciate any insight.
The objects that you place on the queue need to be pickleable, e.g.:
import pickle
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('http://httpbin.org').text)
print type(soup)
p = pickle.dumps(soup)
This code raises RuntimeError: maximum recursion depth exceeded while calling a Python object.
Instead you could put the actual HTML text on the queue, and pass that through BeautifulSoup in the main thread. This will still improve performance as your application is likely to be I/O bound due to its networking component.
Do this in create_bsoup_object():
web_connection.put(requests.get(link, timeout=3.05).text)
which will add the HTML onto the queue instead of the BeautifulSoup object. Then parse the HTML in the main process.
Alternatively, parse and extract the URLs in the child processes, and put the extracted_urls on the queue.
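A minimal sketch of that alternative (my adaptation, reusing the links and web_connection queues from the question): each worker pulls the href strings out of the paragraph tags itself, so only plain lists of strings, which pickle cheaply, ever cross the process boundary:

def create_bsoup_object(links, web_connection):
    for link in iter(links.get, 'STOP'):
        try:
            soup = BeautifulSoup(requests.get(link, timeout=3.05).content)
            # extract hrefs from links inside paragraph tags and ship only strings
            hrefs = [a.get('href') for p in soup.find_all('p') for a in p.find_all('a')]
            web_connection.put(hrefs)
        except requests.exceptions.RequestException:
            pass
    return True

The main process then only needs to drain web_connection and extend its result list, with no parsing left to do.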
