I have about 2,000 URLs that I am trying to scrape using the requests module. To speed up the process, I am using ThreadPoolExecutor from concurrent.futures. When I run this, execution sometimes hangs in the middle, and the issue is inconsistent too: sometimes it finishes smoothly within 2 minutes, but other times it gets stuck at one point for over 30 minutes and I eventually have to kill the process.
# scraper.py
import requests

def get_content(url):
    try:
        res = requests.get(url)
        res = res.content
        return res
    except:
        return ""

# main.py
import concurrent.futures
from scraper import get_content

if __name__ == "__main__":
    # content > an empty list for output
    # urls > a list of urls
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        results = executor.map(get_content, urls)
        for res in results:
            content = content.append(res)
    print(content)
I want to understand how to debug this. Why and where is it getting stuck? And also, why is it inconsistent?
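For what it's worth, requests.get has no default timeout, so a single unresponsive server can block a worker thread forever; that alone can explain both the hang and its inconsistency (it only happens when you hit a bad host). There is also a separate bug: list.append returns None, so `content = content.append(res)` rebinds content to None on the first iteration. A minimal sketch of both fixes (the timeout values and worker count here are illustrative, not tuned):

```python
import concurrent.futures
import requests

def get_content(url):
    try:
        # Without a timeout, requests.get can wait forever on a stalled host.
        res = requests.get(url, timeout=(5, 15))  # (connect, read) in seconds
        return res.content
    except requests.RequestException:
        return b""

if __name__ == "__main__":
    urls = []  # your ~2,000 URLs go here
    # A few dozen workers is usually plenty; 1000 threads mostly adds overhead.
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
        content = list(executor.map(get_content, urls))  # no .append() rebinding
    print(len(content))
```

With a timeout in place, a bad URL fails after a bounded wait instead of wedging the whole pool.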
Related
I have written a script that monitors some web pages and prints a notification whenever a specific HTML tag is found. The point is to run the script 24/7, and while it is running, I want to be able to remove a URL from the set being monitored. Eventually I will have a database from which I read which URLs have been added or removed.
import threading
import requests
from bs4 import BeautifulSoup

# Replacement for database for now
URLS = [
    'https://github.com/search?q=hello+world',
    'https://github.com/search?q=python+3',
    'https://github.com/search?q=world',
    'https://github.com/search?q=i+love+python',
]

def doRequest(url):
    while True:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)

def sendNotifications(data):
    ...

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    for url in URLS:
        threading.Thread(target=doRequest, args=(url,)).start()
The problem I'm facing is that doRequest sits in a while loop that runs all the time. How can I remove a specific URL (e.g. https://github.com/search?q=world) while the script is running?
Method 1: A simple approach
What you want is to insert some termination logic in the while True loop so that it constantly checks for a termination signal.
To this end, you can use threading.Event().
For example, you can add a stopping_event argument:
def doRequest(url, stopping_event):
    while not stopping_event.is_set():
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)
And you create these events when starting the threads:
if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    stopping_events = {url: threading.Event() for url in URLS}
    for url in URLS:
        threading.Thread(target=doRequest, args=(url, stopping_events[url])).start()
Whenever you want to stop/remove a particular url, you can just call
stopping_events[url].set()
That particular while loop will stop and exit.
You can even create a separate thread that waits for user input to stop a particular url:
def manager(stopping_events):
    while True:
        url = input('url to stop: ')
        if url in stopping_events:
            stopping_events[url].set()
if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    stopping_events = {url: threading.Event() for url in URLS}
    for url in URLS:
        threading.Thread(target=doRequest, args=(url, stopping_events[url])).start()
    threading.Thread(target=manager, args=(stopping_events,)).start()
Method 2: A cleaner approach
Instead of having a fixed list of URLs, you can have a thread that keeps reading the list of URLs and feeds them to the processing threads. This is the producer-consumer pattern. Now you don't really remove any URL; you simply keep processing the latest list of URLs from the database, which automatically takes care of newly added or deleted URLs.
import queue
import threading
import requests
from bs4 import BeautifulSoup

# Replacement for database for now
def get_urls_from_db(q: queue.Queue):
    while True:
        url_list = ...  # some db read logic
        # Note: map(q.put, url_list) is lazy in Python 3 and would enqueue
        # nothing; an explicit loop actually performs the puts.
        for url in url_list:
            q.put(url)  # put newly read URLs into the queue

def doRequest(q: queue.Queue):
    while True:
        url = q.get()  # wait for and get a url from the queue
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)

def sendNotifications(data):
    ...

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    url_queue = queue.Queue()
    for _ in range(10):  # start 10 worker threads
        threading.Thread(target=doRequest, args=(url_queue,)).start()
    threading.Thread(target=get_urls_from_db, args=(url_queue,)).start()
get_urls_from_db keeps reading URLs from database and adds the current list of URLs from database to the url_queue to be processed.
In doRequest, each iteration of the loop now grabs one url from the url_queue and processes it.
One thing to watch out for is adding URLs faster than the workers can process them; the queue length will then grow over time and consume a lot of memory.
This is arguably better, since you now have good control over which URLs get processed and a fixed number of threads.
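On the memory point: one simple safeguard (my addition, not part of the answer above) is to bound the queue. With `queue.Queue(maxsize=N)`, `put()` blocks once N items are waiting, so the database-reading thread automatically slows down to match the workers:

```python
import queue

# A bounded queue applies backpressure: put() blocks when the backlog
# reaches maxsize, instead of growing without limit.
url_queue = queue.Queue(maxsize=1000)

def enqueue_urls(url_list, q):
    for url in url_list:
        q.put(url)  # blocks while the queue is full

enqueue_urls(['https://example.com/a', 'https://example.com/b'], url_queue)
```

The 1000-item limit is arbitrary; pick it based on how much backlog you are happy to hold in memory.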
I'm writing Python code that is supposed to collect all links found at one URL, save them as key:value pairs (url: [links]), then visit those links and repeat until I have enough keys.
I've already done this with a list of threads, removing them from the list when they finished running, but I want to use a thread pool for easier maintenance.
To do this I made 2 functions:
1. get the content from the url and return it
2. extract the links from the content and return them
Now I want to manage those tasks with a thread pool, but I don't know how to do it properly because I don't know how to control the flow:
I can extract the links only after the GET request has returned the HTML page.
These are the functions I will use:
import re
import requests

def extract_links(response):
    arr_starts = [m.start() for m in re.finditer('href="https://', response.content)]
    arr_ends = []
    links = []
    for start in arr_starts:
        end_index = response.content.find('"', start + 6)
        arr_ends.append(end_index)
    for i in range(len(arr_starts)):
        link = response.content[arr_starts[i] + 6:arr_ends[i]]
        links.append(link)
    return links

def get_page(url):
    return requests.get(url)
and this is the code I used the first time:

first_url = r'https://blablabla'
hash_links = {}
thread_list = []
web_crawl(first_url, hash_links)
while len(hash_links.keys()) < 30:
    if len(thread_list) < MAX_THREAD_COUNT:
        for urls in hash_links.values():
            for url in urls:
                if url not in hash_links:
                    new_tread = threading.Thread(target=web_crawl, args=(url, hash_links))
                    thread_list.append(new_tread)
                    new_tread.start()
                    new_tread.join()
    else:
        for t in thread_list:
            if not t.isAlive():
                t.handled = True
        thread_list = [t for t in thread_list if not t.handled]
for key in hash_links.keys():
    print key + ':'
    for link in hash_links[key]:
        print '----' + link
Your problem seems to be that of fetching the content of a URL, recording the links found there as keys, and scheduling those links for further processing, all in parallel using a thread pool and semaphore objects.
If that is the case, I would point you to this article for semaphore objects and the thread pool.
Your problem also sounds a lot like something that would benefit from a producer-consumer architecture, so I would recommend this article as well.
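To make the producer-consumer suggestion concrete, here is a rough sketch (my own illustration, not the linked articles' code): a shared queue holds the crawl frontier, worker threads consume URLs and feed newly found links back in, and a lock protects the shared dict. `fetch_links` stands in for the question's get_page + extract_links pair:

```python
import queue
import threading

def crawl(first_url, fetch_links, max_keys=30, num_workers=8):
    """fetch_links(url) -> list of links found on that page."""
    hash_links = {}
    lock = threading.Lock()
    url_queue = queue.Queue()
    url_queue.put(first_url)

    def worker():
        while True:
            url = url_queue.get()
            try:
                with lock:
                    if url in hash_links or len(hash_links) >= max_keys:
                        continue
                    hash_links[url] = []  # reserve the key so no other thread repeats it
                links = fetch_links(url)
                with lock:
                    hash_links[url] = links
                for link in links:
                    url_queue.put(link)  # schedule newly found links
            finally:
                url_queue.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    url_queue.join()  # returns once the frontier has drained
    return hash_links
```

Because the workers are daemon threads blocked on `get()`, they die with the main thread once `join()` returns; a production version would want explicit shutdown and per-fetch error handling.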
I'm running a program to pull some info from Yahoo! Finance. It runs fine as a for loop, but it takes a long time (about 10 minutes for 7,000 inputs) because it has to process each requests.get(url) individually (or am I mistaken about the main bottleneck?).
Anyway, I came across multithreading as a potential solution. This is what I have tried:
import requests
import pprint
import threading

with open('MFTop30MinusAFew.txt', 'r') as ins:  # input file for tickers
    for line in ins:
        ticker_array = ins.read().splitlines()

ticker = ticker_array
url_array = []
url_data = []
data_array = []
for i in ticker:
    url = 'https://query2.finance.yahoo.com/v10/finance/quoteSummary/'+i+'?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com'
    url_array.append(url)  # loading each complete url at one time

def fetch_data(url):
    urlHandler = requests.get(url)
    data = urlHandler.json()
    data_array.append(data)
    pprint.pprint(data_array)

threads = [threading.Thread(target=fetch_data, args=(url,)) for url in url_array]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
fetch_data(url_array)
The error I get is InvalidSchema: No connection adapters were found for '['https://query2.finance.... [url continues].
PS. I've also read that using multithread approach to scrape websites is bad/can get you blocked. Would Yahoo! Finance mind if I'm pulling data from a couple thousand tickers at once? Nothing happened when I did them sequentially.
If you look carefully at the error you will notice that it doesn't show one URL but all the URLs you appended, enclosed in brackets. Indeed, the last line of your code calls fetch_data with the full array as a parameter, which doesn't make sense. If you remove this last line the code runs just fine, and your threads are called as expected.
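With that last line deleted, the driver reduces to something like this (a sketch; appending to a plain list from several threads is safe enough in CPython, but the order of results is not guaranteed):

```python
import threading
import requests

data_array = []

def fetch_data(url):
    # Each thread handles exactly one URL string.
    data_array.append(requests.get(url).json())

url_array = []  # one complete quoteSummary URL string per ticker

threads = [threading.Thread(target=fetch_data, args=(url,)) for url in url_array]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
# Calling fetch_data(url_array) instead would stringify the whole list into
# "['https://...']", for which requests finds no connection adapter,
# hence the InvalidSchema error.
```

The key point is simply that each thread gets one string from url_array, never the array itself.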
I have a file with 100,000 URLs that I need to request then process. The processing takes a non-negligible amount of time compared to the request, so simply using multithreading seems to only give me a partial speed-up. From what I have read, I think using the multiprocessing module, or something similar, would offer a more substantial speed-up because I could use multiple cores. I'm guessing I want to use some multiple processes, each with multiple threads, but I'm not sure how to do that.
Here is my current code, using threading (based on What is the fastest way to send 100,000 HTTP requests in Python?):
from threading import Thread
from Queue import Queue
import requests
from bs4 import BeautifulSoup
import sys

concurrent = 100

def worker():
    while True:
        url = q.get()
        html = get_html(url)
        process_html(html)
        q.task_done()

def get_html(url):
    try:
        html = requests.get(url, timeout=5, headers={'Connection':'close'}).text
        return html
    except:
        print "error", url
        return None

def process_html(html):
    if html == None:
        return
    soup = BeautifulSoup(html)
    text = soup.get_text()
    # do some more processing
    # write the text to a file

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=worker)
    t.daemon = True
    t.start()
try:
    for url in open('text.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
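Regarding the "multiple processes, each with multiple threads" idea from the question: one common shape (a sketch under my own assumptions, not a drop-in replacement for the code above) is to split the URL list into chunks, hand each chunk to a worker process via multiprocessing.Pool, and let each process run a small thread pool, so the network waits overlap within a process while the CPU-bound parsing spreads across cores:

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

def handle_url(url):
    # Stand-in for the question's get_html(url) + process_html(html).
    return len(url)

def process_chunk(urls):
    # Runs inside one worker process; the threads here overlap the I/O waits.
    with ThreadPoolExecutor(max_workers=20) as pool:
        return list(pool.map(handle_url, urls))

def run_all(urls, processes=4, chunk_size=250):
    # Fan the chunks out across worker processes.
    chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
    with multiprocessing.Pool(processes) as mp_pool:
        parts = mp_pool.map(process_chunk, chunks)
    return [r for part in parts for r in part]
```

On platforms that spawn rather than fork worker processes, run_all must be called from under an `if __name__ == '__main__':` guard; the chunk size and pool sizes are placeholders to tune against your workload.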
If the file isn't bigger than your available memory, instead of opening it with the plain "open" method, use mmap (https://docs.python.org/3/library/mmap.html). It will give roughly the same speed as if you were working with memory and not a file.
import mmap

with open("test.txt", "rb") as f:  # mmap needs a binary-mode file object
    mmap_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # code that does what you need
    mmap_file.close()
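To combine this with the queue-feeding loop above, you can read lines straight off the map; `readline` on an mmap object behaves like a binary file. This helper is my own sketch, not part of the answer:

```python
import mmap

def iter_urls(path):
    """Yield one stripped URL per line of a memory-mapped file."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for line in iter(mm.readline, b""):  # b"" signals end of the map
                yield line.decode().strip()
        finally:
            mm.close()

# Feeding the worker queue then becomes:
# for url in iter_urls('text.txt'):
#     q.put(url)
```

Note that the lines come back as bytes, hence the decode before stripping.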
The following code is a sample of non-asynchronous code, is there any way to get the images asynchronously?
import urllib

for x in range(0, 10):
    urllib.urlretrieve("http://test.com/file %s.png" % (x), "temp/file %s.png" % (x))
I have also seen the Grequests library but I couldn't figure much if that is possible or how to do it from the documentation.
You don't need any third-party library. Just create a thread for every request, start the threads, and then either wait for all of them to finish or let them download in the background while your application continues.
import threading
import urllib

results = []

def getter(url, dest):
    results.append(urllib.urlretrieve(url, dest))

threads = []
for x in range(0, 10):
    t = threading.Thread(target=getter, args=('http://test.com/file %s.png' % x,
                                              'temp/file %s.png' % x))
    t.start()
    threads.append(t)

# Wait for all threads to finish.
# You can continue doing whatever you want and
# join the threads when you finally need the results.
# They will fetch your urls in the background without
# blocking your main application.
for t in threads:
    t.join()
Optionally you can create a thread pool that will get urls and dests from a queue.
If you're using Python 3, this is already implemented for you in the concurrent.futures module.
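In Python 3 that queue-plus-thread-pool pattern collapses to a few lines with concurrent.futures. This sketch mirrors the snippet above (the test.com URLs are the question's placeholders, and `download` is my Python 3 stand-in for urlretrieve):

```python
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def download(url, dest):
    # Python 3 equivalent of urlretrieve: stream the response to disk.
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest

pairs = [("http://test.com/file %s.png" % x, "temp/file %s.png" % x)
         for x in range(10)]
# with ThreadPoolExecutor(max_workers=8) as pool:
#     saved = list(pool.map(lambda p: download(*p), pairs))
```

The executor owns the worker threads and the internal queue, so there is nothing to join manually; leaving the `with` block waits for every download to finish.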
Something like this should help you
import grequests

urls = ['url1', 'url2', ....]  # this should be the list of urls

requests = (grequests.get(u) for u in urls)
responses = grequests.map(requests)

for response in responses:
    if 199 < response.status_code < 400:
        name = generate_file_name()  # generate some name for your image file with extension like example.jpg
        with open(name, 'wb') as f:  # or save to S3 or something like that
            f.write(response.content)
Here only the downloading of the images is parallel; writing each image to a file is still sequential, so you can create a thread (or use some other mechanism) to make the writing parallel or asynchronous as well.