I have written a script that monitors some webpages and prints a notification whenever a specific HTML tag is found. The script is meant to run 24/7, and I want to be able to remove a URL while it is running. I currently have a database from which I will read the URLs that are added or removed.
import threading
import requests
from bs4 import BeautifulSoup

# Replacement for database for now
URLS = [
    'https://github.com/search?q=hello+world',
    'https://github.com/search?q=python+3',
    'https://github.com/search?q=world',
    'https://github.com/search?q=i+love+python',
]

def doRequest(url):
    while True:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip(): # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)

def sendNotifications(data):
    ...

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    for url in URLS:
        threading.Thread(target=doRequest, args=(url,)).start()
The problem I'm currently facing is that doRequest runs in a while loop all the time, and I wonder how I can remove a specific URL while the script is running, e.g. https://github.com/search?q=world
Method 1: A simple approach
What you want is to insert some termination logic in the while True loop so that it constantly checks for a termination signal.
To this end, you can use threading.Event().
For example, you can add a stopping_event argument:
def doRequest(url, stopping_event):
    # loop until the event for this URL is set
    while not stopping_event.is_set():
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip(): # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)
You create these events, one per URL, when starting the threads:
if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    stopping_events = {url: threading.Event() for url in URLS}
    for url in URLS:
        threading.Thread(target=doRequest, args=(url, stopping_events[url])).start()
Whenever you want to stop/remove a particular url, you can just call
stopping_events[url].set()
That particular while loop will stop and exit.
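As a rough, network-free sketch of the same idea: the names monitor, run_demo and check below are hypothetical stand-ins (check replaces the request-and-parse step), and the example.com URLs are placeholders. It also uses Event.wait with a timeout between checks, which is an addition not in the code above, so that a stop request is noticed without waiting out a full pause.

```python
import threading
import time


def monitor(url, stopping_event, check, interval=0.05):
    """Call check(url) repeatedly until stopping_event is set."""
    while not stopping_event.is_set():
        check(url)
        # wait() returns as soon as the event is set, so the thread
        # stops promptly instead of sleeping for the whole interval
        stopping_event.wait(interval)


def run_demo():
    urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs
    seen = []  # records every URL that was checked
    events = {url: threading.Event() for url in urls}
    threads = {
        url: threading.Thread(target=monitor, args=(url, events[url], seen.append))
        for url in urls
    }
    for thread in threads.values():
        thread.start()

    time.sleep(0.2)
    events[urls[0]].set()             # stop only the first URL
    threads[urls[0]].join(timeout=2)
    first_alive = threads[urls[0]].is_alive()
    second_alive = threads[urls[1]].is_alive()

    # clean up: stop the remaining thread as well
    for event in events.values():
        event.set()
    for thread in threads.values():
        thread.join(timeout=2)
    return first_alive, second_alive, seen
```

Calling run_demo() should show the first thread finished and the second still running at the moment of the check, which is the per-URL removal the question asks for.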
You can even create a separate thread that waits for user input to stop a particular URL:
def manager(stopping_events):
    while True:
        url = input('url to stop: ')
        if url in stopping_events:
            stopping_events[url].set()

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    stopping_events = {url: threading.Event() for url in URLS}
    for url in URLS:
        threading.Thread(target=doRequest, args=(url, stopping_events[url])).start()
    threading.Thread(target=manager, args=(stopping_events,)).start()
Method 2: A cleaner approach
Instead of having a fixed list of URLs, you can have a thread that keeps reading the list of URLs and feeds it to the processing threads. This is the producer-consumer pattern. Now you don't really remove any URL; you simply keep processing the latest list of URLs from the database, which automatically takes care of newly added or deleted URLs.
import queue
import threading
import requests
from bs4 import BeautifulSoup
# Replacement for database for now
def get_urls_from_db(q: queue.Queue):
    while True:
        url_list = ... # some db read logic
        # map() is lazy in Python 3 and would never call q.put, so loop explicitly
        for url in url_list: # putting newly read URLs into queue
            q.put(url)

def doRequest(q: queue.Queue):
    while True:
        url = q.get() # waiting and getting url from queue
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip(): # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)

def sendNotifications(data):
    ...

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    url_queue = queue.Queue()
    for _ in range(10): # starts 10 threads
        threading.Thread(target=doRequest, args=(url_queue,)).start()
    threading.Thread(target=get_urls_from_db, args=(url_queue,)).start()
get_urls_from_db keeps reading URLs from the database and adds the current list to url_queue to be processed.
In doRequest, each iteration of the loop now takes one URL from url_queue and processes it.
One thing to watch out for is adding URLs faster than they can be processed; the queue will then grow over time and consume a lot of memory.
This is arguably better, since you now have full control over which URLs are processed and a fixed number of threads.
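One way to keep the queue from growing without bound is to give it a maxsize, so that q.put blocks the producer until a consumer catches up. The sketch below is only an illustration under assumptions: read_urls stands in for the database query, handle stands in for the request-and-notify step, and the producer runs for a fixed number of rounds and then sends a None sentinel to each worker so the example terminates.

```python
import queue
import threading


def producer(q, read_urls, rounds):
    """Put the current URL list on the queue `rounds` times."""
    for _ in range(rounds):
        for url in read_urls():  # read_urls is a stand-in for the database query
            q.put(url)           # blocks while the queue is full (back-pressure)


def consumer(q, handle):
    """Process URLs from the queue until a None sentinel arrives."""
    while True:
        url = q.get()
        try:
            if url is None:      # sentinel: no more work for this worker
                break
            handle(url)
        finally:
            q.task_done()


def run_pipeline(read_urls, handle, rounds=3, workers=4, maxsize=8):
    q = queue.Queue(maxsize=maxsize)  # bounded, so memory use stays capped
    threads = [threading.Thread(target=consumer, args=(q, handle))
               for _ in range(workers)]
    for thread in threads:
        thread.start()
    producer(q, read_urls, rounds)
    for _ in threads:
        q.put(None)              # one sentinel per worker
    q.join()                     # wait until every item has been processed
    for thread in threads:
        thread.join()
```

In a long-running monitor the producer would loop forever with a pause between database reads instead of stopping after a few rounds; the bounded queue is the part that addresses the memory concern.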
Related
I have about 2000 urls that I am trying to scrape using the requests module. To speed up the process, I am using the ThreadPoolExecutor from concurrent.futures. The execution hangs in the middle when I run this and the issue is inconsistent too. Sometimes, it finishes smoothly within 2 minutes but other times, it just gets stuck at a point for over 30 mins and I eventually have to kill the process.
# scraper.py
import requests

def get_content(url):
    try:
        res = requests.get(url)
        res = res.content
        return res
    except:
        return ""

# main.py
import concurrent.futures
from scraper import get_content

if __name__ == "__main__":
    # content > an empty list for output
    # urls > a list of urls
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        results = executor.map(get_content, urls)
        for res in results:
            content.append(res)  # list.append returns None, so do not reassign content
    print(content)
I want to understand how to debug this. Why and where is it getting stuck? And also, why is it inconsistent?
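One plausible cause, which cannot be confirmed from the code shown, is that no timeout is passed: requests does not apply one by default, so a single stalled connection can occupy a worker indefinitely, and whether that happens depends on how the remote servers behave on a given run. The sketch below shows the general shape of a fix using the standard library's urlopen so that it has no third-party dependency; with requests the equivalent is passing timeout= to requests.get. The timeout and max_workers values, and the fetch_all helper, are illustrative assumptions.

```python
import concurrent.futures
import urllib.request


def get_content(url, timeout=10):
    """Return the response body, or b"" if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return b""


def fetch_all(urls, fetch=get_content, max_workers=32):
    """Fetch every URL with a bounded thread pool, preserving input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))
```

To see where a run is stuck, printing or logging the URL at the start and end of get_content narrows it down to the requests that never return.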
First of all, thank you for taking the time to read this post. I'd like to begin by saying that I'm very new to programming in general and am seeking advice to solve a problem.
I'm trying to create a script that checks whether the content of an HTML page has changed, in order to monitor certain website pages. I found a script and altered it so that it goes through a list of URLs and checks whether each page has changed. The problem is that it checks the pages sequentially: it goes through the list one URL at a time, while I want the URLs to be checked in parallel. I'm also using a while loop to keep checking the pages, because even after a change has taken place the page still has to be monitored. I could write a thousand more words explaining what I'm trying to do, so instead have a look at the code:
import requests
import time
import smtplib
from email.message import EmailMessage
import hashlib
from urllib.request import urlopen
url = ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]
i = 0
response = urlopen(url[i]).read()
currentHash = hashlib.sha224(response).hexdigest()
while True:
    try:
        response = urlopen(url[i]).read()
        currentHash = hashlib.sha224(response).hexdigest()
        print('checking')
        time.sleep(10)
        response = urlopen(url[i]).read()
        newHash = hashlib.sha224(response).hexdigest()
        i +=1
        if newHash == currentHash:
            continue
        else:
            print('Change detected')
            print (url[i])
            time.sleep(10)
            continue
    except Exception as e:
        i = 0
        print('resetting increment')
        continue
What you want to do is called multi-threading.
Conceptually this is how it works:
import hashlib
import time
from urllib.request import urlopen
import threading
# Define a function for the thread
def f(url):
    initialHash = None
    while True:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        if not initialHash:
            initialHash = currentHash
        if currentHash != initialHash:
            print('Change detected')
            print (url)
        time.sleep(10)

# Create one thread per URL as follows
for url in ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]:
    t = threading.Thread(target=f, args=(url,))
    t.start()
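The hash comparison inside the loop can also be pulled out into a small helper, which makes it easy to try without any network access. This is a sketch with a hypothetical ChangeDetector class; unlike the loop above, it compares each body against the previous one rather than the first one, so a page that changes once is reported once.

```python
import hashlib


class ChangeDetector:
    """Remembers the last SHA-224 digest seen for each URL."""

    def __init__(self):
        self._last = {}

    def changed(self, url, body):
        """Return True if body differs from the previous body seen for url."""
        digest = hashlib.sha224(body).hexdigest()
        previous = self._last.get(url)
        self._last[url] = digest
        # the first sighting of a URL is not counted as a change
        return previous is not None and previous != digest
```

Inside the thread function this would be called as detector.changed(url, urlopen(url).read()), with each thread checking its own URL.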
Running example of the OP's code using ThreadPoolExecutor
Code
import concurrent.futures
import time
import requests
import hashlib
from urllib.request import urlopen
def check_change(url):
    '''
    Checks for a change in web page contents by comparing current to previous hash
    '''
    try:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        time.sleep(10)
        response = urlopen(url).read()
        newHash = hashlib.sha224(response).hexdigest()
        if newHash != currentHash:
            return "Change to:", url
        else:
            return None
    except Exception as e:
        return "Error", e, url

page_urls = ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]
while True:
    # Using the executor as a context manager ensures threads are cleaned up properly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(check_change, url): url for url in page_urls}
        for future in concurrent.futures.as_completed(future_to_url):
            # Output result of each thread upon its completion
            url = future_to_url[future]
            try:
                status = future.result()
                if status:
                    print(*status)
                else:
                    print(f'No change to: {url}')
            except Exception as exc:
                print('Site %r generated an exception: %s' % (url, exc))
    time.sleep(10) # Wait 10 seconds before rechecking sites
Output
Change to: https://www.google.com
Change to: https://www.google.be
Change to: https://www.youtube.be
Change to: https://www.google.be
Change to: https://www.google.com
Change to: https://www.youtube.be
Change to: https://www.google.be
Change to: https://www.google.com
...
As is:
I built a function that takes an url as argument, scrapes the page and puts the parsed info into a list. Next to this, I have a list of the urls and I'm mapping the list of urls to the url parser function and iterating through each url. The issue is that I have around 7000-8000 links so parsing iteratively takes a lot of time. This is the current iterative solution:
mapped_parse_links = map(parse, my_new_list)
all_parsed = list(it.chain.from_iterable(mapped_parse_links))
'parse' is the scraper function and 'my_new_list' is the list of URLs.
To be:
I want to implement multiprocessing so that instead of iterating through the list of URLs, it would utilize multiple CPUs to pick up more links at the same time and parse the info using the parse function. I tried the following:
import multiprocessing
with multiprocessing.Pool() as p:
    mapped_parse_links = p.map(parse, my_new_list)
    all_parsed = list(it.chain.from_iterable(mapped_parse_links))
I tried different solutions using the Pool function as well, however all of the solutions run for eternity. Can someone give me pointers on how to solve this?
Thanks.
Taken, with minor alterations, from the docs for concurrent.futures:
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

if __name__ == '__main__':
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
                # Do something with the scraped data here
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
You will have to substitute your parse function in for load_url.
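If the goal is the same flattened list that the question builds with map and chain.from_iterable, executor.map is a fairly direct drop-in. The sketch below assumes parse takes one URL and returns a list of items, as the question describes; parse_all and the max_workers value are hypothetical.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def parse_all(parse, urls, max_workers=16):
    """Run parse(url) for every URL in a thread pool and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        per_url = executor.map(parse, urls)  # results come back in input order
        return list(itertools.chain.from_iterable(per_url))
```

It would be called as all_parsed = parse_all(parse, my_new_list). Threads tend to suit this kind of work because most of the time is spent waiting on the network rather than on the CPU.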
I have written a program to crawl a single website and scrape certain data. I would like to speed up its execution by using ProcessPoolExecutor. However, I am having trouble understanding how to convert it from single-threaded to concurrent.
Specifically, when creating a job (via ProcessPoolExecutor.submit()), can I pass a class/object and args instead of a function and args?
And, if so, how do I return data from those jobs to the queue for tracking visited pages AND to a structure for holding scraped content?
I have been using this as a jumping off point, as well as reviewing the Queue and concurrent.futures docs (with, frankly, the latter going a bit over my head). I've also Googled/Youtubed/SO'ed around quite a bit to no avail.
from queue import Queue, Empty
from concurrent.futures import ProcessPoolExecutor
class Scraper:
    """
    Scrapes a single url
    """

    def __init__(self, url):
        self.url = url # url of page to scrape
        self.internal_urls = None
        self.content = None
        self.scrape()

    def scrape(self):
        """
        Method(s) to request a page, scrape links from that page
        to other pages, and finally scrape actual content from the current page
        """
        # assume that code in this method would yield urls linked in current page
        self.internal_urls = set(scraped_urls)
        # and that code in this method would scrape a bit of actual content
        self.content = {'content1': content1, 'content2': content2, 'etc': etc}

class CrawlManager:
    """
    Manages a multiprocess crawl and scrape of a single site
    """

    def __init__(self, seed_url):
        self.seed_url = seed_url
        self.pool = ProcessPoolExecutor(max_workers=10)
        self.processed_urls = set([])
        self.queued_urls = Queue()
        self.queued_urls.put(self.seed_url)
        self.data = {}

    def crawl(self):
        while True:
            try:
                # get a url from the queue
                target_url = self.queued_urls.get(timeout=60)
                # check that the url hasn't already been processed
                if target_url not in self.processed_urls:
                    # add url to the processed list
                    self.processed_urls.add(target_url)
                    print(f'Processing url {target_url}')
                    # passing an object to the
                    # ProcessPoolExecutor... can this be done?
                    job = self.pool.submit(Scraper, target_url)
                    """
                    How do I 1) return the data from each
                    Scraper instance into self.data?
                    and 2) put scraped links to self.queued_urls?
                    """
            except Empty:
                print("All done.")
            except Exception as e:
                print(e)

if __name__ == '__main__':
    crawler = CrawlManager('www.mywebsite.com')
    crawler.crawl()
For anyone who comes across this page, I was able to figure this out for myself.
Per #brad-solomon's advice, I switched from ProcessPoolExecutor to ThreadPoolExecutor to manage the concurrent aspects of this script (see his comment for further details).
W.r.t. the original question, the key was to use the add_done_callback method of the future returned by ThreadPoolExecutor.submit, in conjunction with a modification to Scraper.scrape and a new method CrawlManager.proc_scraper_results, as in the following:
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
class Scraper:
    """
    Scrapes a single url
    """

    def __init__(self, url):
        self.url = url # url of page to scrape
        self.internal_urls = None
        self.content = None
        # scrape() is not called here: it is submitted to the pool in
        # CrawlManager.crawl, so calling it here too would scrape each page twice

    def scrape(self):
        """
        Method(s) to request a page, scrape links from that page
        to other pages, and finally scrape actual content from the current page
        """
        # assume that code in this method would yield urls linked in current page
        self.internal_urls = set(scraped_urls)
        # and that code in this method would scrape a bit of actual content
        self.content = {'content1': content1, 'content2': content2, 'etc': etc}
        # these three items will be passed to the callback
        # function within a future object
        return self.internal_urls, self.url, self.content

class CrawlManager:
    """
    Manages a multithreaded crawl and scrape of a single website
    """

    def __init__(self, seed_url):
        self.seed_url = seed_url
        self.pool = ThreadPoolExecutor(max_workers=10)
        self.processed_urls = set([])
        self.queued_urls = Queue()
        self.queued_urls.put(self.seed_url)
        self.data = {}

    def proc_scraper_results(self, future):
        # get the items of interest from the future object
        # (result() is the public accessor; it re-raises if scrape failed)
        internal_urls, url, content = future.result()
        # assign scraped data/content
        self.data[url] = content
        # also add scraped links to queue if they
        # aren't already queued or already processed
        for link_url in internal_urls:
            if link_url not in self.queued_urls.queue and link_url not in self.processed_urls:
                self.queued_urls.put(link_url)

    def crawl(self):
        while True:
            try:
                # get a url from the queue
                target_url = self.queued_urls.get(timeout=60)
                # check that the url hasn't already been processed
                if target_url not in self.processed_urls:
                    # add url to the processed list
                    self.processed_urls.add(target_url)
                    print(f'Processing url {target_url}')
                    # add a job to the ThreadPoolExecutor (note, unlike original question, we pass a method, not an object)
                    job = self.pool.submit(Scraper(target_url).scrape)
                    # to add_done_callback we send another function, this one from CrawlManager
                    # when this function is itself called, it will be passed a `future` object
                    job.add_done_callback(self.proc_scraper_results)
            except Empty:
                print("All done.")
                break # nothing was queued for 60 seconds, so stop crawling
            except Exception as e:
                print(e)

if __name__ == '__main__':
    crawler = CrawlManager('www.mywebsite.com')
    crawler.crawl()
The result of this is a very significant reduction in duration of this program.
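The callback mechanism can be tried in isolation with a stand-in for the scraping step. In this sketch scrape and crawl_once are hypothetical names, scrape returns made-up data in the same (links, url, content) shape as above, and the callback reads the outcome through future.result(), the public accessor on Future.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def scrape(url):
    # stand-in for Scraper.scrape: returns (links, url, content)
    return {url + "/next"}, url, {"length": len(url)}


def crawl_once(urls):
    """Scrape each URL once and collect content and discovered links."""
    data, discovered = {}, set()
    lock = threading.Lock()  # callbacks may run on different worker threads

    def on_done(future):
        links, url, content = future.result()  # re-raises if scrape failed
        with lock:
            data[url] = content
            discovered.update(links)

    with ThreadPoolExecutor(max_workers=4) as pool:
        for url in urls:
            pool.submit(scrape, url).add_done_callback(on_done)
    # leaving the with block waits for all jobs, and their callbacks, to finish
    return data, discovered
```

An exception raised inside a done callback is logged and otherwise ignored by the executor, so a failing scrape will not stop the crawl but is also easy to miss.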
I am writing a tool which fetches multiple HTML files and processes them as text:
for url in url_list:
    url_response = requests.get(url)
    text = url_response.text
    # Process text here (put in database, search, etc)
The problem is that this is pretty slow. If I just needed a simple response I could have used grequests, but since I need to get the content of the HTML file, that seems not to be an option. How can I speed this up?
Thanks in advance!
import requests
from multiprocessing import Pool

def process_html(url):
    url_response = requests.get(url)
    text = url_response.text
    print(text[:500])
    print('-' * 30)

urls = [
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
]

# The guard is needed where worker processes are spawned rather than forked (e.g. Windows)
if __name__ == '__main__':
    with Pool(None) as p: # None => one worker process per CPU
        p.map(process_html, urls) # This blocks until all return values from process_html() have been collected.
Use a thread for each request:
import threading
import requests

url_list = ["url1", "url2"]

def fetch_url(url):
    url_response = requests.get(url)
    text = url_response.text

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in url_list]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
You need to use threading and run requests.get(...) for each URL in a different thread, i.e. in parallel.
See these two answers on SO for example and usage:
Python - very simple multithreading parallel URL fetching (without queue)
Multiple requests using urllib2.urlopen() at the same time