Problems with Speed during web-crawling (Python)

I would love to make this program a lot faster. It currently reads about 12,000 pages in 10 minutes, and I am supposed to read millions of pages, so at this rate it would take far too long. Is there anything that would give a big speed improvement? I hope you have some tips. Here is my code:
from eventlet.green import urllib2
import httplib
import time
import eventlet

# Create the URLs in groups of 400 (roughly the max for eventlet)
def web_CreateURLS():
    print str(str(time.asctime( time.localtime(time.time()) )).split(" ")[3])
    for var_indexURLS in xrange(0, 2000000, 400):
        var_URLS = []
        for var_indexCRAWL in xrange(var_indexURLS, var_indexURLS + 400):
            var_URLS.append("http://www.nu.nl")
        web_ScanURLS(var_URLS)

# Return the HTML source per URL
def web_ReturnHTML(url):
    try:
        return [urllib2.urlopen(url[0]).read(), url[1]]
    except urllib2.URLError:
        time.sleep(10)
        print "UrlError"
        return web_ReturnHTML(url)

# Analyse the HTML source
def web_ScanURLS(var_URLS):
    pool = eventlet.GreenPool()
    try:
        for var_HTML in pool.imap(web_ReturnHTML, var_URLS):
            # do something etc..
            pass
    except TypeError:
        pass

web_CreateURLS()

I like using greenlets, but I often get more benefit from using multiple processes spread over lots of systems, or even just one system, letting the OS take care of all the checks and balances of running multiple processes.
Check out ZeroMQ at http://zeromq.org/ for some good examples of building a dispatcher with a TON of listeners that do whatever the dispatcher says (there is a minimal sketch of that pattern below). Alternatively, check out execnet for a way to quickly get started executing remote or local tasks in parallel.
I also use http://spread.org/ a lot and have LOTS of systems listening to a common Spread daemon. It's a very useful message bus where results can be pooled back and dispatched from a single thread pretty easily.
And then of course there is always Redis pub/sub or sync. :)
"Share the load"

Related

Send lots of requests to several pieces of equipment at the same time

I'm currently working on a project and I need to fetch data from several switches by sending SSH requests as follows:
Switch 1 -> 100 requests
Switch 2 -> 500 requests
Switch 3 -> 1000 requests
…
Switch 70 -> 250 requests
So several requests (5500 in total) spread over 70 switches.
Today, I am using a JSON file built like this:
{
    "ip_address1": [
        {"command": "command1"},
        {"command": "command2"},
        ...
        {"command": "command100"}
    ],
    "ip_address2": [
        {"command": "command1"},
        {"command": "command2"},
        ...
        {"command": "command100"}
    ],
    ...
    "ip_address70": [
        {"command": "command1"},
        {"command": "command2"},
        ...
        {"command": "command100"}
    ]
}
Each command is a CLI command for a switch that I connect to over SSH.
Today, I'm using Python with multithreading and 8 workers because I have only 4 CPUs.
The whole script takes about 1 hour to complete, which is too long.
Is there a way to drastically speed up this process, please?
A friend told me about Golang channels and goroutines, but I'm not sure it's worth moving from Python to Go if it makes no difference to the runtime.
Can you please give me some advice?
Thank you very much.
Python offers a pretty straightforward multiprocessing library. Especially for a straightforward task like yours, I would stick to the language I am most comfortable with.
In Python you would basically generate a list from your list of commands and IP addresses.
Using an example straight from the documentation: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool
With the pool.map function from the multiprocessing module, you can pass each element of your list to a function, where you can pass your commands to the servers. You might also want to look at the different mapping functions provided by Pool.
from multiprocessing import Pool, TimeoutError
import json
import os

def execute_ssh(address_command_mapping):
    # add your logic to pass the commands to the corresponding IP address
    return

if __name__ == '__main__':
    # assuming your IP addresses are stored in a JSON file
    with open("ip_addresses.json", "r") as file:
        ip_addresses = json.load(file)

    # transform the address dictionary into a list of dictionaries
    address_list = [{ip: commands} for ip, commands in ip_addresses.items()]

    # start 4 worker processes
    with Pool(processes=4) as pool:
        # pool.map passes each element to the 'execute_ssh' function
        pool.map(execute_ssh, address_list)
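The execute_ssh body is left open above. Purely as an illustration, here is one way it could look with paramiko; paramiko itself, the placeholder credentials, and the one-key dictionary shape produced by address_list are assumptions of this sketch:

import paramiko

def execute_ssh(address_command_mapping):
    # address_list items look like {ip: [{"command": ...}, ...]}, i.e. a one-key dict
    (ip, commands), = address_command_mapping.items()
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(ip, username="admin", password="secret", timeout=10)  # placeholder credentials
    outputs = []
    for entry in commands:
        stdin, stdout, stderr = client.exec_command(entry["command"])
        outputs.append(stdout.read().decode())
    client.close()
    return ip, outputs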
Thank you, Leon.
Is the pool.map function working the same way as ThreadPoolExecutor?
Here is what I'm using:
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # sending command
    pass

def main():
    print("Starting ThreadPoolExecutor")
    with ThreadPoolExecutor(max_workers=3) as executor:
        for element in mylist:
            executor.submit(task, element)
    print("All tasks complete")

if __name__ == '__main__':
    main()
So is it working the same way?
Thank you
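Not an authoritative answer, but for comparison: ThreadPoolExecutor also has a map() method that mirrors pool.map, applying the function to each element and returning the results in input order; the main difference is threads versus processes. A small sketch, with mylist as a placeholder for your list:

from concurrent.futures import ThreadPoolExecutor

def task(element):
    # send the SSH commands for this element here
    return element

mylist = ["element1", "element2", "element3"]  # placeholder data

with ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map, like pool.map, yields one result per input, in input order
    results = list(executor.map(task, mylist))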

Web crawler returning list vs generator vs producer/consumer

I want to recursively crawl a web server that hosts thousands of files and then check whether they differ from what's in the local repository (this is part of checking the delivery infrastructure for bugs).
So far I've been playing around with various prototypes, and here is what I noticed. If I do a straightforward recursion and put all the files into a list, the operation completes in around 230 seconds. Note that I make only one request per directory, so it makes sense to actually download the files I'm interested in elsewhere:
import urllib.request
import urllib.error

from bs4 import BeautifulSoup

def recurse_links(base):
    result = []
    try:
        f = urllib.request.urlopen(base)
        soup = BeautifulSoup(f.read(), "html.parser")
        for anchor in soup.find_all('a'):
            href = anchor.get('href')
            if href.startswith('/') or href.startswith('..'):
                pass
            elif href.endswith('/'):
                # collect the files found in the sub-directory as well
                result.extend(recurse_links(base + href))
            else:
                result.append(base + href)
    except urllib.error.HTTPError as httperr:
        print('HTTP Error in ' + base + ': ' + str(httperr))
    return result
I figured that if I could start processing the files I'm interested in while the crawler is still working, I could save time. So the next thing I tried was a generator that could further be used as a coroutine. The generator took 260 seconds, slightly more, but still acceptable. Here's the generator:
def recurse_links_gen(base):
    try:
        f = urllib.request.urlopen(base)
        soup = BeautifulSoup(f.read(), "html.parser")
        for anchor in soup.find_all('a'):
            href = anchor.get('href')
            if href.startswith('/') or href.startswith('..'):
                pass
            elif href.endswith('/'):
                yield from recurse_links_gen(base + href)
            else:
                yield base + href
    except urllib.error.HTTPError as http_error:
        print(f'HTTP Error in {base}: {http_error}')
Update
Answering some questions that came up in the comments section:
I've got roughly 370k files, but not all of them will make it to the next step. I will check them against a set or dictionary (to get O(1) lookup) before going ahead and comparing them to the local repo.
After more tests it looks like the sequential crawler takes less time in roughly 4 out of 5 attempts, and the generator took less time once. So at this point it seems like the generator is okay.
At this point the consumer doesn't do anything other than get an item from the queue, since it's a proof of concept. However, I have flexibility in what I do with the file URL I get from the producer. I could, for instance, download only the first 100 KB of the file, calculate its checksum while in memory and then compare it to a pre-calculated local version (see the sketch just below). What is clear, though, is that if simply adding thread creation bumps my execution time up by a factor of 4 to 5, adding work on the consumer thread will not make it any faster.
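As an aside, the "first 100 KB checksum" idea mentioned above could look roughly like this; the choice of MD5 and the exact byte count are placeholders of this sketch:

import hashlib
import urllib.request

def partial_checksum(url, nbytes=100 * 1024):
    # Read at most the first 100 KB of the file and hash it in memory,
    # to be compared against a pre-calculated local value.
    with urllib.request.urlopen(url) as resp:
        head = resp.read(nbytes)
    return hashlib.md5(head).hexdigest()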
Finally, I decided to give producer/consumer/queue a shot, and a simple PoC ran 4 times longer while loading one CPU core at 100%. Here is the brief code (the crawler is the same generator-based crawler from above):
class ProducerThread(threading.Thread):
    def __init__(self, done_event, url_queue, crawler, name):
        super().__init__()
        self._logger = logging.getLogger(__name__)
        self.name = name
        self._queue = url_queue
        self._crawler = crawler
        self._event = done_event

    def run(self):
        for file_url in self._crawler.crawl():
            try:
                self._queue.put(file_url)
            except Exception as ex:
                self._logger.error(ex)
So here are my questions:
Are the threads created with threading library actually threads and is there a way for them to be actually distributed between various CPU cores?
I believe the great deal of performance degradation comes from the producer waiting to put an item into the queue. But can this be avoided?
Is the generator slower because it has to save the function context and then load it again over and over?
What's the best way to start actually doing something with those files while the crawler is still populating the queue/list/whatever and thus make the whole program faster?
1) Are the threads created with threading library actually threads and is there a way for them to be actually distributed between various CPU cores?
Yes, these are real threads, but to utilize multiple cores of your CPU you need to use the multiprocessing package.
2) I believe the great deal of performance degradation comes from the producer waiting to put an item into the queue. But can this be avoided?
It depends on the number of threads you have created; one cause may be the context switches your threads are making. Try a small number of threads, e.g. 2 or 3, and check the performance again.
3) Is the generator slower because it has to save the function context and then load it again over and over?
Generators are not slow; in fact a generator is a good fit for the problem you are working on: as you find a URL, you put it into the queue.
4) What's the best way to start actually doing something with those files while the crawler is still populating the queue/list/whatever and thus make the whole program faster?
Create a ConsumerThread class which fetches the data (a URL in your case) from the queue and starts working on it.
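A minimal sketch of such a consumer, mirroring the ProducerThread above; the one-second queue timeout and the done_event protocol (the producer sets it when crawling finishes) are assumptions of this sketch:

import queue
import threading

class ConsumerThread(threading.Thread):
    def __init__(self, done_event, url_queue, name):
        super().__init__()
        self.name = name
        self._queue = url_queue
        self._event = done_event

    def run(self):
        # Keep consuming until the producer signals it is done and the queue is drained.
        while not (self._event.is_set() and self._queue.empty()):
            try:
                file_url = self._queue.get(timeout=1)
            except queue.Empty:
                continue
            # download / checksum / compare the file here
            self._queue.task_done()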

Parallelizing loop for downloading data

I'm new to Python. I want to run a simple script in Google App Engine that retrieves many files into an object as quickly as possible. Would parallelization be a smart option, and how would I go about doing it? Thanks in advance for the brainstorming.
import requests
...
theData = []
for q in range(0, len(theURLs)):
    r = requests.get(theURLs[q])
    theData.insert(q, r.text)
In "regular" Python this is pretty simple.
from multiprocessing.pool import ThreadPool
import requests
responses = ThreadPool(10).map(requests.get, urls)
Replace 10 with the number of threads that produces the best results for you.
However, you specified GAE, which has restrictions on spawning threads/processes and its own async approach, which consists of using the async functions from the URL Fetch service, something along these lines (untested):
rpcs = [urlfetch.create_rpc() for url in urls]
for (rpc, url) in zip(rpcs, urls):
    urlfetch.make_fetch_call(rpc, url)
results = [rpc.get_result() for rpc in rpcs]
You will need to add error handling...
You should make your code more Pythonic by using list comprehensions:
# A list of tuples
theData = [(q,requests.get(theURLs[q]).text) for q in range(0, len(theURLs))]
# ... or ...
# A list of lists
theData = [[q,requests.get(theURLs[q]).text] for q in range(0, len(theURLs))]
If you want to retrieve the files concurrently, use the threading library; this website has some good examples and might be good practice (a small sketch follows the link):
http://www.tutorialspoint.com/python/python_multithreading.htm
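Setting GAE's threading restrictions aside, here is a small sketch of that idea with the standard threading module; the placeholder URL list and the index-based result list are choices of this sketch:

import threading
import requests

theURLs = ["http://example.com/a", "http://example.com/b"]  # placeholder list

def fetch(q, url, out):
    out[q] = requests.get(url).text  # store each body at its original index

theData = [None] * len(theURLs)
threads = [threading.Thread(target=fetch, args=(q, url, theData))
           for q, url in enumerate(theURLs)]
for t in threads:
    t.start()
for t in threads:
    t.join()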
I seriously doubt it. Parallelization can really only speed up calculations, while the bottleneck here is data transfer.

gevent / requests hangs while making lots of head requests

I need to make 100k HEAD requests, and I'm using gevent on top of requests. My code runs for a while, but then it eventually hangs. I'm not sure why it's hanging, or whether it's hanging inside requests or gevent. I'm using the timeout argument inside both requests and gevent.
Please take a look at my code snippet below, and let me know what I should change.
import datetime

import gevent
from gevent import monkey, pool
monkey.patch_all()

import requests

def get_head(url, timeout=3):
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout)
    except:
        return None

def expand_short_urls(short_urls, chunk_size=100, timeout=60*5):
    chunk_list = lambda l, n: (l[i:i+n] for i in range(0, len(l), n))
    p = pool.Pool(chunk_size)
    print 'Expanding %d short_urls' % len(short_urls)
    results = {}
    for i, _short_urls_chunked in enumerate(chunk_list(short_urls, chunk_size)):
        print '\t%d. processing %d urls # %s' % (i, chunk_size, str(datetime.datetime.now()))
        jobs = [p.spawn(get_head, _short_url) for _short_url in _short_urls_chunked]
        gevent.joinall(jobs, timeout=timeout)
        results.update({_short_url: job.get().url
                        for _short_url, job in zip(_short_urls_chunked, jobs)
                        if job.get() is not None and job.get().status_code == 200})
    return results
I've tried grequests, but it's been abandoned, and I've gone through the github pull requests, but they all have issues too.
The RAM usage you are observing mainly stems from all the data that piles up while storing 100,000 response objects, plus the underlying overhead. I reproduced your application case and fired off HEAD requests against 15,000 URLs from the top Alexa ranking. It did not really matter whether I used a gevent Pool (i.e. one greenlet per connection) or a fixed set of greenlets each requesting multiple URLs, nor how large I set the pool size. In the end, the RAM usage grew over time to considerable amounts. However, I noticed that switching from requests to urllib2 already led to a reduction in RAM usage by about a factor of two. That is, I replaced
result = requests.head(url)
with
request = urllib2.Request(url)
request.get_method = lambda : 'HEAD'
result = urllib2.urlopen(request)
Some other advice: do not use two timeout mechanisms. Gevent's timeout approach is very solid, and you can easily use it like this:
from gevent import Timeout

def gethead(url):
    result = None
    try:
        with Timeout(5, False):
            result = requests.head(url)
    except Exception as e:
        result = e
    return result
It might look tricky, but it either returns None (after quite precisely 5 seconds, indicating a timeout), an exception object representing a communication error, or the response. Works great!
Although this is likely not part of the issue, in such cases I recommend keeping workers alive and letting each of them work on multiple items! The overhead of spawning greenlets is small, indeed. Still, this would be a very simple solution with a set of long-lived greenlets:
from gevent import spawn, joinall
from gevent.queue import Queue, Empty

POOLSIZE = 100  # number of long-lived worker greenlets

def qworker(qin, qout):
    while True:
        try:
            qout.put(gethead(qin.get(block=False)))
        except Empty:
            break

qin = Queue()
qout = Queue()

for url in urls:
    qin.put(url)

workers = [spawn(qworker, qin, qout) for i in xrange(POOLSIZE)]
joinall(workers)
returnvalues = [qout.get() for _ in xrange(len(urls))]
Also, you really need to appreciate that this is a large-scale problem you are tackling, and it yields non-standard issues. When I reproduced your scenario with a timeout of 20 s, 100 workers and 15,000 URLs to be requested, I easily got a large number of sockets:
# netstat -tpn | wc -l
10074
That is, the OS had more than 10,000 sockets to manage, most of them in TIME_WAIT state. I also observed "Too many open files" errors and tuned the limits up via sysctl. When you request 100,000 URLs you will probably hit such limits too, and you need to come up with measures to prevent the system from starving.
Also note the way you are using requests: it automatically follows redirects from HTTP to HTTPS and automatically verifies the certificate, all of which surely costs RAM.
In my measurements, when I divided the number of requested URLs by the runtime of the program, I almost never passed 100 responses/s, which is the result of the high-latency connections to foreign servers all over the world. I guess you are affected by such a limit too. Adjust the rest of the architecture to this limit, and you will probably be able to generate a data stream from the Internet to disk (or a database) without so much RAM usage in between.
I should address your two main questions, specifically:
I think gevent/the way you are using it is not your problem. I think you are just underestimating the complexity of your task. It comes with nasty problems and drives your system to its limits.
Your RAM usage issue: start off by using urllib2, if you can. Then, if things still accumulate too much, you need to work against accumulation. Try to produce a steady state: you might want to start writing data off to disk and generally work towards the situation where objects can become garbage collected (a sketch of that idea follows below).
Your code "eventually hangs": probably this is because of your RAM issue. If it is not, then do not spawn so many greenlets, but reuse them as indicated. Also, further reduce concurrency, monitor the number of open sockets, increase system limits if necessary, and try to find out exactly where your software hangs.
I'm not sure if this will resolve your issue, but you are not using pool.Pool() correctly.
Try this:
def expand_short_urls(short_urls, chunk_size=100):
    # Pool() automatically limits your process to chunk_size greenlets running
    # concurrently, so you don't need all that chunking business from your for loop
    p = pool.Pool(chunk_size)
    print 'Expanding %d short_urls' % len(short_urls)

    # spawn() (both gevent.spawn() and Pool.spawn()) returns a gevent.Greenlet object,
    # NOT the value your function, get_head, will return
    threads = [p.spawn(get_head, short_url) for short_url in short_urls]
    p.join()

    # to access the returned value of your function, use the Greenlet.value property
    results = {short_url: thread.value.url
               for short_url, thread in zip(short_urls, threads)
               if thread.value is not None and thread.value.status_code == 200}
    return results

Asynchronous DNS resolver testing

I want to test a large number of IPs to look for open DNS resolvers. I'm trying to find the most efficient way to parallelize this. At the moment I'm trying to accomplish it with Twisted. I want to have 10 or 20 parallel threads sending queries to avoid blocking through timeouts.
Twisted has a DNSDatagramProtocol that seems suitable, but I just can't figure out how to put it together with the Twisted "reactor" and "threads" facilities to make it run efficiently.
I've read a lot of the Twisted documentation, but I'm still not sure what would be the best way to do it.
Could someone give an example of how this can be accomplished?
Here's a quick example demonstrating the Twisted Names API:
from sys import argv
from itertools import cycle
from pprint import pprint

from twisted.names import client
from twisted.internet.task import react
from twisted.internet.defer import gatherResults, inlineCallbacks

def query(reactor, server, name):
    # Create a new resolver that uses the given DNS server
    resolver = client.Resolver(
        resolv="/dev/null", servers=[(server, 53)], reactor=reactor)
    # Use it to do an A request for the name
    return resolver.lookupAddress(name)

@inlineCallbacks
def main(reactor, *names):
    # Here are some random DNS servers to which to issue requests.
    servers = ["4.2.2.1", "8.8.8.8"]
    # Handy trick to cycle through those servers forever
    next_server = cycle(servers).next
    # Issue queries for all the names given, alternating between servers.
    results = []
    for n in names:
        results.append(query(reactor, next_server(), n))
    # Wait for all the results
    results = yield gatherResults(results)
    # And report them
    pprint(zip(names, results))

if __name__ == '__main__':
    # Run the main program with the reactor going and pass names
    # from the command line arguments to be resolved
    react(main, argv[1:])
Try gevent: spawn many greenlets to do the DNS resolution. gevent also has a nice DNS resolution API: http://www.gevent.org/gevent.dns.html
They even have an example:
https://github.com/gevent/gevent/blob/master/examples/dns_mass_resolve.py
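A rough sketch of the spawn-many-greenlets idea aimed at the original question (probing specific IPs for open resolvers); note that dnspython is an addition of this sketch for sending queries to an arbitrary server, it is not part of this thread:

from gevent import monkey
monkey.patch_all()  # make dnspython's blocking sockets cooperate with gevent

from gevent.pool import Pool
import dns.resolver  # dnspython, an assumption of this sketch

TEST_NAME = "example.com"  # placeholder name to resolve through each candidate IP

def probe(ip):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ip]
    resolver.lifetime = 3  # overall timeout per query, in seconds
    try:
        resolver.resolve(TEST_NAME, "A")  # use .query() on dnspython 1.x
        return (ip, True)   # the server answered: likely an open resolver
    except Exception:
        return (ip, False)  # timeout, refusal, or other failure

def scan(ips, concurrency=20):
    pool = Pool(concurrency)  # 10-20 parallel queries, as in the question
    return list(pool.imap_unordered(probe, ips))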
