I'm writing a Python program that is supposed to collect all the links found at one URL, save them as key:value pairs (url: [links]), then go over those links and do the same thing over and over until I have enough keys.
I've already done this with a list of threads, removing each thread from the list when it finished running, but I want to use a thread pool for easier maintenance.
To do this I made two functions:
get the content from the URL and return it
extract the links from the content and return them
Now I want to manage those tasks with a thread pool, but I don't know how to do it properly because I don't know how to control the flow.
I can only extract the links after the GET request has returned the HTML page.
These are the functions I will use:
import re
import requests

def extract_links(response):
    # find the start index of every href="https://... occurrence in the page
    arr_starts = [m.start() for m in re.finditer('href="https://', response.content)]
    arr_ends = []
    links = []
    for start in arr_starts:
        # the link ends at the next double quote after href="
        end_index = response.content.find('"', start + 6)
        arr_ends.append(end_index)
    for i in range(len(arr_starts)):
        link = response.content[arr_starts[i] + 6:arr_ends[i]]
        links.append(link)
    return links

def get_page(url):
    return requests.get(url)
And this is the code I wrote the first time:
first_url = r'https://blablabla'
hash_links = {}
thread_list = []

web_crawl(first_url, hash_links)

while len(hash_links.keys()) < 30:
    if len(thread_list) < MAX_THREAD_COUNT:
        for urls in hash_links.values():
            for url in urls:
                if url not in hash_links:
                    new_tread = threading.Thread(target=web_crawl, args=(url, hash_links))
                    thread_list.append(new_tread)
                    new_tread.start()
                    new_tread.join()
    else:
        for t in thread_list:
            if not t.isAlive():
                t.handled = True
        thread_list = [t for t in thread_list if not t.handled]

for key in hash_links.keys():
    print key + ':'
    for link in hash_links[key]:
        print '----' + link
Your problem seems to be that of producing content from a URL, processing the links found in that content as keys, and scheduling those links for further processing, all in parallel using a thread pool and semaphore objects.
If that is the case, I would point you to this article for semaphore objects and the thread pool.
Also, your problem sounds to me a lot like something that would benefit from a producer-consumer architecture, so I would also recommend this article.
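For instance, here is a minimal sketch of that producer-consumer idea built on concurrent.futures.ThreadPoolExecutor (Python 3), reusing the get_page and extract_links functions from the question. MAX_WORKERS, MAX_KEYS and crawl are placeholder names, not anything from the original code:

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

MAX_WORKERS = 8
MAX_KEYS = 30

def crawl(first_url):
    hash_links = {}  # url -> [links]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # the pool "consumes" download jobs; the loop below "produces" new ones
        pending = {pool.submit(get_page, first_url): first_url}
        while pending and len(hash_links) < MAX_KEYS:
            # wait until at least one download has finished, then process it
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                url = pending.pop(future)
                links = extract_links(future.result())
                hash_links[url] = links
                # schedule newly discovered links that have not been crawled yet
                for link in links:
                    if link not in hash_links and link not in pending.values():
                        pending[pool.submit(get_page, link)] = link
    return hash_links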
Related
I have about 2000 urls that I am trying to scrape using the requests module. To speed up the process, I am using the ThreadPoolExecutor from concurrent.futures. The execution hangs in the middle when I run this and the issue is inconsistent too. Sometimes, it finishes smoothly within 2 minutes but other times, it just gets stuck at a point for over 30 mins and I eventually have to kill the process.
# scraper.py
import requests

def get_content(url):
    try:
        res = requests.get(url)
        res = res.content
        return res
    except:
        return ""
# main.py
import concurrent.futures

from scraper import get_content

if __name__ == "__main__":
    # content > an empty list for output
    # urls > a list of urls
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        results = executor.map(get_content, urls)
        for res in results:
            content.append(res)
    print(content)
I want to understand how to debug this. Why and where is it getting stuck? And also, why is it inconsistent?
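For what it's worth, one common way to narrow down a hang like this is to give every request a timeout (so a single unresponsive server cannot block a worker forever) and to iterate with as_completed so you can log how far the run has progressed. A rough sketch along those lines, reusing the urls list from above; the worker count, timeout and logging interval are arbitrary choices:

import concurrent.futures

import requests

def get_content(url):
    try:
        # the timeout is the key change: without it a request can block forever
        return requests.get(url, timeout=10).content
    except requests.RequestException:
        return ""

content = []
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    futures = {executor.submit(get_content, url): url for url in urls}
    for i, future in enumerate(concurrent.futures.as_completed(futures)):
        content.append(future.result())
        if i % 100 == 0:
            print("done:", i, "last url:", futures[future])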
I have written a web scraping program in python. It is working correctly but takes 1.5 hrs to execute. I am not sure how to optimize the code.
The logic of the code is that every country has many ASNs with the client name. I am getting all the ASN links (e.g. https://ipinfo.io/AS2856).
I am using Beautiful Soup and regex to get the data as JSON.
The output is just a simple JSON.
import urllib.request
import bs4
import re
import json

url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'

def url_to_soup(url):
    # bgp.he.net is filtered by user-agent
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries/')):
        pages.append(link.get('href'))
    return pages

def get_each_sites(links):
    mappings = {}
    print("Scraping Pages for ASN Data...")
    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                #print(columns)
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(SITE + '/AS' + current_asn)
                s = str(url_to_soup(SITE + '/AS' + current_asn))
                asn_code, name = re.search(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s(&)]+)', s).groups()
                #print(asn_code[2:])
                #print(name)
                country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>', s).group("COUNTRY")
                print(country)
                registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', s, re.S).group("REGISTRY").strip()
                #print(registry)
                # flag re.S makes the '.' special character match any character at all, including a newline
                mtch = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', s, re.S)
                if mtch:
                    ip = mtch.group("IP").strip()
                    #print(ip)
                mappings[asn_code[2:]] = {'Country': country,
                                          'Name': name,
                                          'Registry': registry,
                                          'num_ip_addresses': ip}
    return mappings

main_page = url_to_soup(url)
country_links = find_pages(main_page)
#print(country_links)
asn_mappings = get_each_sites(country_links)
print(asn_mappings)
The output is as expected, but super slow.
You probably don't want to speed your scraper up. When you scrape a site, or connect in a way that humans don't (24/7), it's good practice to keep requests to a minimum so that:
You blend into the background noise
You don't (D)DoS the website in the hope of finishing faster, while racking up costs for the website owner
What you can do, however, is get the AS names and numbers from this website (see this SO answer), and recover the IPs using PyASN, as sketched below.
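A rough, hedged sketch of the PyASN route; it assumes pyasn is installed and that you have already built an IPASN data file with the helper scripts that ship with the library (the file name below is a placeholder):

import pyasn

# 'ipasn_db.dat' is a placeholder for an IPASN data file built with pyasn's
# own utilities from a BGP RIB dump
asndb = pyasn.pyasn('ipasn_db.dat')

asn, prefix = asndb.lookup('8.8.8.8')   # ASN and matching prefix for an IP
prefixes = asndb.get_as_prefixes(2856)  # prefixes announced by AS2856
print(asn, prefix, len(prefixes))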
I think what you need is to run the scraping in multiple processes. This can be done using the Python multiprocessing package, since multithreaded Python programs are limited by the GIL (Global Interpreter Lock) for CPU-bound work. There are plenty of examples of how to do this. Here are some, and a short sketch follows after the links:
Multiprocessing Spider
Speed up Beautiful soup scraper
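As a sketch of how that could look for the code above, reusing url_to_soup, find_pages and get_each_sites from the question (the pool size is an arbitrary choice, and the functions need to live in an importable module so the worker processes can see them):

from multiprocessing import Pool

def scrape_country(link):
    # one worker process handles one country page
    return get_each_sites([link])

if __name__ == "__main__":
    main_page = url_to_soup(url)
    country_links = find_pages(main_page)
    # scrape the country pages in parallel worker processes
    with Pool(processes=8) as pool:
        per_country = pool.map(scrape_country, country_links)
    asn_mappings = {}
    for mapping in per_country:
        asn_mappings.update(mapping)
    print(asn_mappings)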
I have a list of links and I create a Beautiful Soup object for each link and scrape all the links within paragraph tags from the page. Because I have hundreds of links I'd like to scrape from, a single process would take more time than I'd like so multiprocessing seems to be the ideal solution.
Here's my code:
import requests
import traceback
from bs4 import BeautifulSoup
from multiprocessing import Process, Queue

urls = ['https://hbr.org/2011/05/the-case-for-executive-assistants',
        'https://signalvnoise.com/posts/3450-when-culture-turns-into-policy']

def collect_links(urls):
    extracted_urls = []
    bsoup_objects = []
    p_tags = []  # store language between paragraph tags in each beautiful soup object
    workers = 4
    processes = []
    links = Queue()  # store links extracted from urls variable
    web_connection = Queue()  # store beautiful soup objects that are created for each url in urls variable

    # dump each url from urls variable into links Queue for all processes to use
    for url in urls:
        links.put(url)

    for w in xrange(workers):
        p = Process(target=create_bsoup_object, args=(links, web_connection))
        p.start()
        processes.append(p)
        links.put('STOP')

    for p in processes:
        p.join()

    web_connection.put('STOP')
    for beaut_soup_object in iter(web_connection.get, 'STOP'):
        p_tags.append(beaut_soup_object.find_all('p'))

    for paragraphs in p_tags:
        bsoup_objects.append(BeautifulSoup(str(paragraphs)))

    for beautiful_soup_object in bsoup_objects:
        for link_tag in beautiful_soup_object.find_all('a'):
            extracted_urls.append(link_tag.get('href'))

    return extracted_urls

def create_bsoup_object(links, web_connection):
    for link in iter(links.get, 'STOP'):
        try:
            web_connection.put(BeautifulSoup(requests.get(link, timeout=3.05).content))
        except requests.exceptions.Timeout as e:
            # client couldn't connect to server or return data in time period specified in timeout parameter in requests.get()
            pass
        except requests.exceptions.ConnectionError as e:
            # in case of faulty url
            pass
        except Exception, err:
            # catch regular errors
            print(traceback.format_exc())
            pass
        except requests.exceptions.HTTPError as e:
            pass
    return True
And when I run collect_links(urls), rather than getting a list of links, I get an empty list with the following error:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 266, in _feed
send(obj)
RuntimeError: maximum recursion depth exceeded while calling a Python object
[]
I'm not sure what that's referring to. I read somewhere that Queues work best with simple objects. Does the size of the beautiful soup objects I'm storing in them have anything to do with this? I would appreciate any insight.
The objects that you place on the queue need to be pickleable. For example:
import pickle
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('http://httpbin.org').text)
print type(soup)
p = pickle.dumps(soup)
This code raises RuntimeError: maximum recursion depth exceeded while calling a Python object.
Instead you could put the actual HTML text on the queue, and pass that through BeautifulSoup in the main thread. This will still improve performance as your application is likely to be I/O bound due to its networking component.
Do this in create_bsoup_object():
web_connection.put(requests.get(link, timeout=3.05).text)
which will add the HTML onto the queue instead of the BeautifulSoup object. Then parse the HTML in the main process.
Alternatively parse and extract the URLs in the child processes, and put the extracted_urls on the queue.
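A sketch of that last alternative, keeping the question's function name and queues but sending only plain lists of strings through the queue (so pickling is no longer an issue); the paragraph-tag filtering mirrors the logic in collect_links:

def create_bsoup_object(links, web_connection):
    for link in iter(links.get, 'STOP'):
        try:
            soup = BeautifulSoup(requests.get(link, timeout=3.05).text)
            # hrefs of anchors inside paragraph tags, as collect_links does
            hrefs = [a.get('href')
                     for p in soup.find_all('p')
                     for a in p.find_all('a')]
            web_connection.put(hrefs)  # a plain list of strings pickles fine
        except requests.exceptions.RequestException:
            pass
    return True

# in collect_links, after joining the processes:
for hrefs in iter(web_connection.get, 'STOP'):
    extracted_urls.extend(h for h in hrefs if h)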
I have a small Python program I made myself which scrapes a website for some prices. I am using BeautifulSoup 4 and the Python threading module.
The problem is I don't know how to "control" the threads. As you can see from the code, I made subclasses of the Thread class (something like consumer and producer). In one class I am taking links from the pages, and in the other I am looking for certain classes in the HTML with BS4 and writing to the main file.
When I start the script, I normally start with Thread 1. I scrape every link on the website, taking the name and article price. For every link, I make a thread. As the website has many links (around 3000), after some time I have so many threads that they are killing my computer. python.exe is around 2 GB and I have to kill the program.
This is my fourth day trying to find a solution...... Please.... :)
If I get it right: setDaemon(True) means the threads are killed when the main program exits, and .join() waits for a thread to complete.
I am a total beginner in programming and I'm also aware that the code is a little messy. Any suggestions are welcome.
Don't worry about the last few try blocks. They're just for fun.
Thank you!
import threading
import csv
import urllib2
import time
from bs4 import BeautifulSoup
import re
import Queue

httpLink = "WWW.SOMEWEBSITE.COM"
fn = 'J:\\PRICES\\'
queue = Queue.Queue()
soup_queue = Queue.Queue()
brava = threading.Lock()
links = []
brokenLinks = []
pageLinks = []
fileName = time.strftime("%d_%m_%Y-%H_%M")

class TakeURL(threading.Thread):
    def __init__(self, queue, soup_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.soup_queue = soup_queue

    def run(self):
        while True:
            host = self.queue.get()
            try:
                url = urllib2.urlopen(host)
                chunk = url.read()
            except:
                print ("Broken link " + host)
                writeCSV("BrokenLinks.csv", "ab", host)
                brokenLinks.append(host)
                time.sleep(30)

            writeCSV('Links.csv', 'ab', host)

            if ("class=\"price\"" in chunk):
                self.soup_queue.put(chunk)
            else:
                writeCSV("LinksWithoutPrice.csv", "ab", host)

            try:
                findLinks(chunk, "ul", "mainmenu")
            except:
                print ("Broken Link" + host)
                writeCSV("BrokenLinks.csv", "ab", host)
                brokenLinks.append(host)
                time.sleep(30)

            self.queue.task_done()

class GetDataURL(threading.Thread):
    getDataUrlLock = threading.Lock()

    def __init__(self, soup_queue):
        threading.Thread.__init__(self)
        self.soup_queue = soup_queue

    def run(self):
        while True:
            chunk = self.soup_queue.get()
            soup = BeautifulSoup(chunk)
            dataArticle = soup.findAll("tr", {"class": ""})
            pagination = soup.findAll("a", {"class": "page"})

            self.getDataUrlLock.acquire()
            f = open(fn + fileName + ".csv", "ab")
            filePrice = csv.writer(f)

            for groupData in dataArticle:
                for articleName in groupData.findAll("a", {"class": "noFloat"}):
                    fullName = articleName.string.encode('utf-8')
                    print (fullName)
                for articlePrice in groupData.findAll("div", {"class": "price"}):
                    if (len(articlePrice) > 1):
                        fullPrice = articlePrice.contents[2].strip()
                    else:
                        fullPrice = articlePrice.get_text().strip()
                    print (fullPrice[:-12])
                print ('-')*80
                filePrice.writerow([fullName, fullPrice[:-12]])
            f.close()

            for page in pagination:
                pageLink = page.get('href')
                pageLinks.append('http://www.' + pageLink[1:])

            self.getDataUrlLock.release()
            self.soup_queue.task_done()

def writeCSV(fileName, writeMode, link):
    try:
        brava.acquire()
        f = csv.writer(open(fn + fileName, writeMode))
        f.writerow([link])
    except IOError as e:
        print (e.message)
    finally:
        brava.release()

def findLinks(chunk, tagName, className):
    soup = BeautifulSoup(chunk)
    mainmenu = soup.findAll(tagName, {"class": className})
    for mm in mainmenu:
        for link in mm.findAll('a'):
            href = link.get('href')
            links.insert(0, href)
            print (href)
    print ('-')*80

def startMain(links):
    while (links):
        #time.sleep(10)
        threadLinks = links[-10:]
        print ("Alive Threads: " + str(threading.activeCount()))
        #time.sleep(1)
        for item in range(len(threadLinks)):
            links.pop()
        for i in range(len(threadLinks)):
            tu = TakeURL(queue, soup_queue)
            tu.setDaemon(True)
            tu.start()
        for host in threadLinks:
            queue.put(host)
        for i in range(len(threadLinks)):
            gdu = GetDataURL(soup_queue)
            gdu.setDaemon(True)
            gdu.start()
        queue.join()
        soup_queue.join()

if __name__ == "__main__":
    start = time.time()

    httpWeb = urllib2.urlopen(httpLink)
    chunk = httpWeb.read()
    findLinks(chunk, 'li', 'tab')

    startMain(links)

    pageLinks = list(set(pageLinks))
    startMain(pageLinks)
    startMain(brokenLinks)

    print ('-') * 80
    print ("Seconds: %s") % (time.time() - start)
    print ('-') * 80
Your thread never returns anything, so it never stops; just continually runs the while loop. And since you're starting a new thread for each link, you eventually just keep adding on more and more threads while previous threads may not be doing anything. You essentially wouldn't need a queue with the way you have it. This approach can cause problems with a large number of jobs, as you're noticing.
worker = GetDataURL()
worker.start()
really points to GetDataURL.run()...which is an infinite while loop.
Same is true for TakeURL.start().
You could go a couple routes
1) Just take the while out of your thread, do away with the queues and return the result at the end of the run definition. This way each thread has 1 task, returns the results, then stops. Not the most efficient but would require the least amount of code modification.
2) In your startMain, outside of the while loop, start a group of say 10 threads (i.e. a thread pool). These 10 threads will always run, and instead of starting a new thread for each link, just put the link in the queue. When a thread is available, it will run the next item in the queue. But you still need to manage the cleanup of these threads. (There is a rough sketch of this at the end of this answer.)
3) You could rework your code a bit more and make use of built in functions like Thread Pools and Process Pools. I've posted on Process Pools before: SO MultiProcessing
With this method, you can forget all the mess associated with locks too. After each pool.map (or whatever you use) you can write that chunk of information to the file in your startMain code. Cleans things up a lot.
Hopefully that makes some sense. I chose not to modify your code because I think it's worth you experimenting with the options and choosing a direction.
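A rough sketch of route 2, sticking with the question's Python 2 style: start a fixed group of worker threads once and feed them links through a queue. The process_link helper and the worker count are placeholders, not code from the question; plug your existing fetch-and-parse logic into that spot.

import threading
import Queue

NUM_WORKERS = 10
link_queue = Queue.Queue()

def worker():
    while True:
        host = link_queue.get()
        if host is None:           # sentinel value: no more work
            break
        try:
            process_link(host)     # placeholder: fetch and parse one link
        except Exception:
            pass                   # a failed link should not kill the worker
        finally:
            link_queue.task_done()

threads = []
for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()
    threads.append(t)

for host in links:                 # 'links' as collected by findLinks()
    link_queue.put(host)

link_queue.join()                  # wait until every queued link is handled
for _ in threads:
    link_queue.put(None)           # tell each worker to exit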
The following code is a sample of non-asynchronous code. Is there any way to get the images asynchronously?
import urllib

for x in range(0, 10):
    urllib.urlretrieve("http://test.com/file %s.png" % (x), "temp/file %s.png" % (x))
I have also seen the grequests library, but I couldn't figure out from the documentation whether that is possible or how to do it.
You don't need any third party library. Just create a thread for every request, start the threads, and then wait for all of them to finish in the background, or continue your application while the images are being downloaded.
import threading
import urllib

results = []

def getter(url, dest):
    results.append(urllib.urlretrieve(url, dest))

threads = []
for x in range(0, 10):
    t = threading.Thread(target=getter, args=('http://test.com/file %s.png' % x,
                                              'temp/file %s.png' % x))
    t.start()
    threads.append(t)

# wait for all threads to finish
# You can continue doing whatever you want and
# join the threads when you finally need the results.
# They will fetch your urls in the background without
# blocking your main application.
map(lambda t: t.join(), threads)
Optionally you can create a thread pool that will get urls and dests from a queue.
If you're using Python 3 it's already implemented for you in the futures module.
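For example, a sketch of that thread-pool variant in Python 3 using concurrent.futures (the URL pattern and destination paths just follow the question and are placeholders):

import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(x):
    # download one image; urlretrieve blocks, so each call runs in a pool thread
    urllib.request.urlretrieve("http://test.com/file %s.png" % x,
                               "temp/file %s.png" % x)

with ThreadPoolExecutor(max_workers=5) as pool:
    # list() forces completion here and re-raises any download errors
    list(pool.map(fetch, range(10)))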
Something like this should help you
import grequests

urls = ['url1', 'url2', ....] # this should be the list of urls

requests = (grequests.get(u) for u in urls)
responses = grequests.map(requests)

for response in responses:
    if 199 < response.status_code < 400:
        name = generate_file_name() # generate some name for your image file with extension like example.jpg
        with open(name, 'wb') as f: # or save to S3 or something like that
            f.write(response.content)
Here only the downloading of the images would be parallel; writing each image's content to a file would still be sequential, so you could create a thread or do something else to make that part parallel or asynchronous as well. For example:
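A sketch of one way to do that, handing each response to a small thread pool for saving. It assumes the responses list and the generate_file_name helper from the snippet above; the worker count is arbitrary:

from concurrent.futures import ThreadPoolExecutor

def save(response):
    if 199 < response.status_code < 400:
        name = generate_file_name()
        with open(name, 'wb') as f:
            f.write(response.content)

with ThreadPoolExecutor(max_workers=4) as pool:
    # list() forces all writes to finish and surfaces any errors here
    list(pool.map(save, responses))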