gevent / requests hangs while making lots of head requests - python

I need to make 100k head requests, and I'm using gevent on top of requests. My code runs for a while, but then eventually hangs. I'm not sure why it's hanging, or whether it's hanging inside requests or gevent. I'm using the timeout argument inside both requests and gevent.
Please take a look at my code snippet below, and let me know what I should change.
import gevent
from gevent import monkey, pool
monkey.patch_all()
import requests
def get_head(url, timeout=3):
try:
return requests.head(url, allow_redirects=True, timeout=timeout)
except:
return None
def expand_short_urls(short_urls, chunk_size=100, timeout=60*5):
chunk_list = lambda l, n: ( l[i:i+n] for i in range(0, len(l), n) )
p = pool.Pool(chunk_size)
print 'Expanding %d short_urls' % len(short_urls)
results = {}
for i, _short_urls_chunked in enumerate(chunk_list(short_urls, chunk_size)):
print '\t%d. processing %d urls # %s' % (i, chunk_size, str(datetime.datetime.now()))
jobs = [p.spawn(get_head, _short_url) for _short_url in _short_urls_chunked]
gevent.joinall(jobs, timeout=timeout)
results.update({_short_url:job.get().url for _short_url, job in zip(_short_urls_chunked, jobs) if job.get() is not None and job.get().status_code==200})
return results
I've tried grequests, but it's been abandoned, and I've gone through the github pull requests, but they all have issues too.

The RAM usage you are observing mainly stems from all the data that piles up while storing 100.000 response objects, and all the underlying overhead. I have reproduced your application case, and fired off HEAD requests against 15000 URLS from the top Alexa ranking. It did not really matter
whether I used a gevent Pool (i.e. one greenlet per connection) or a fixed set of greenlets, all requesting multiple URLs
how large I set the pool size
In the end, the RAM usage grew over time, to considerable amounts. However, I noticed that changing from requests to urllib2 already lead to a reduction in RAM usage, by about factor two. That is, I replaced
result = requests.head(url)
with
request = urllib2.Request(url)
request.get_method = lambda : 'HEAD'
result = urllib2.urlopen(request)
Some other advice: do not use two timeout mechanisms. Gevent's timeout approach is very solid, and you can easily use it like this:
def gethead(url):
result = None
try:
with Timeout(5, False):
result = requests.head(url)
except Exception as e:
result = e
return result
Might look tricky, but either returns None (after quite precisely 5 seconds, and indicates timeout), any exception object representing a communication error, or the response. Works great!
Although this likely is not part of the issue, in such cases I recommend to keep workers alive and let them work on multiple items each! The overhead of spawning greenlets is small, indeed. Still, this would be a very simple solution with a set of long-lived greenlets:
def qworker(qin, qout):
while True:
try:
qout.put(gethead(qin.get(block=False)))
except Empty:
break
qin = Queue()
qout = Queue()
for url in urls:
qin.put(url)
workers = [spawn(qworker, qin, qout) for i in xrange(POOLSIZE)]
joinall(workers)
returnvalues = [qout.get() for _ in xrange(len(urls))]
Also, you really need to appreciate that this is a large-scale problem you are tackling there, yielding non-standard issues. When I reproduced your scenario with a timeout of 20 s and 100 workers and 15000 URLs to be requested, I easily got a large number of sockets:
# netstat -tpn | wc -l
10074
That is, the OS had more than 10000 sockets to manage, most of them in TIME_WAIT state. I also observed "Too many open files" errors, and tuned the limits up, via sysctl. When you request 100.000 URLs you will probably hit such limits, too, and you need to come up with measures to prevent system starving.
Also note the way you are using requests, it automatically follows redirects from HTTP to HTTPS, and automatically verifies the certificate, all of which surely costs RAM.
In my measurements, when I divided the number of requested URLs by the runtime of the program, I almost never passed 100 responses/s, which is the result of the high-latency connections to foreign servers all over the world. I guess you also are affected by such a limit. Adjust the rest of the architecture to this limit, and you will probably be able to generate a data stream from the Internet to disk (or database) with not so large RAM usage inbetween.
I should address your two main questions, specifically:
I think gevent/the way you are using it is not your problem. I think you are just underestimating the complexity of your task. It comes along with nasty problems, and drives your system to its limits.
your RAM usage issue: Start off by using urllib2, if you can. Then, if things accumulate still too high, you need to work against accumulation. Try to produce a steady state: you might want to start writing off data to disk and generally work towards the situation where objects can become garbage collected.
your code "eventually hangs": probably this is as of your RAM issue. If it is not, then do not spawn so many greenlets, but reuse them as indicated. Also, further reduce concurrency, monitor the number of open sockets, increase system limits if necessary, and try to find out exactly where your software hangs.

I'm not sure if this will resolve your issue, but you are not using pool.Pool() correctly.
Try this:
def expand_short_urls(short_urls, chunk_size=100):
# Pool() automatically limits your process to chunk_size greenlets running concurrently
# thus you don't need to do all that chunking business you were doing in your for loop
p = pool.Pool(chunk_size)
print 'Expanding %d short_urls' % len(short_urls)
# spawn() (both gevent.spawn() and Pool.spawn()) returns a gevent.Greenlet object
# NOT the value your function, get_head, will return
threads = [p.spawn(get_head, short_url) for short_url in short_urls]
p.join()
# to access the returned value of your function, access the Greenlet.value property
results = {short_url: thread.value.url for short_url, thread in zip(short_urls, threads)
if thread.value is not None and thread.value.status_code == 200}
return results

Related

Python Maximum Thread Count [duplicate]

import threading
threads = []
for n in range(0, 60000):
t = threading.Thread(target=function,args=(x, n))
t.start()
threads.append(t)
for t in threads:
t.join()
It is working well for range up to 800 on my laptop, but if I increase range to more than 800 I get the error can't create new thread.
How can I control number to threads to get created or any other way to make it work like timeout? I tried using threading.BoundedSemaphore function but that doesn't seem to work properly.
The problem is that no major platform (as of mid-2013) will let you create anywhere near this number of threads. There are a wide variety of different limitations you could run into, and without knowing your platform, its configuration, and the exact error you got, it's impossible to know which one you ran into. But here are two examples:
On 32-bit Windows, the default thread stack is 1MB, and all of your thread stacks have to fit into the same 2GB of virtual memory space as everything else in your program, so you will run out long before 60000.
On 64-bit linux, you will likely exhaust one of your session's soft ulimit values before you get anywhere near running out of page space. (Linux has a variety of different limits beyond the ones required by POSIX.)
So, how can i control number to threads to get created or any other way to make it work like timeout or whatever?
Using as many threads as possible is very unlikely to be what you actually want to do. Running 800 threads on an 8-core machine means that you're spending a whole lot of time context-switching between the threads, and the cache keeps getting flushed before it ever gets primed, and so on.
Most likely, what you really want is one of the following:
One thread per CPU, serving a pool of 60000 tasks.
Maybe processes instead of threads (if the primary work is in Python, or in C code that doesn't explicitly release the GIL).
Maybe a fixed number of threads (e.g., a web browsers may do, say, 12 concurrent requests at a time, whether you have 1 core or 64).
Maybe a pool of, say, 600 batches of 100 tasks apiece, instead of 60000 single tasks.
60000 cooperatively-scheduled fibers/greenlets/microthreads all sharing one real thread.
Maybe explicit coroutines instead of a scheduler.
Or "magic" cooperative greenlets via, e.g. gevent.
Maybe one thread per CPU, each running 1/Nth of the fibers.
But it's certainly possible.
Once you've hit whichever limit you're hitting, it's very likely that trying again will fail until a thread has finished its job and been joined, and it's pretty likely that trying again will succeed after that happens. So, given that you're apparently getting an exception, you could handle this the same way as anything else in Python: with a try/except block. For example, something like this:
threads = []
for n in range(0, 60000):
while True:
t = threading.Thread(target=function,args=(x, n))
try:
t.start()
threads.append(t)
except WhateverTheExceptionIs as e:
if threads:
threads[0].join()
del threads[0]
else:
raise
else:
break
for t in threads:
t.join()
Of course this assumes that the first task launched is likely to be the one of the first tasks finished. If this is not true, you'll need some way to explicitly signal doneness (condition, semaphore, queue, etc.), or you'll need to use some lower-level (platform-specific) library that gives you a way to wait on a whole list until at least one thread is finished.
Also, note that on some platforms (e.g., Windows XP), you can get bizarre behavior just getting near the limits.
On top of being a lot better, doing the right thing will probably be a lot simpler as well. For example, here's a process-per-CPU pool:
with concurrent.futures.ProcessPoolExecutor() as executor:
fs = [executor.submit(function, x, n) for n in range(60000)]
concurrent.futures.wait(fs)
… and a fixed-thread-count pool:
with concurrent.futures.ThreadPoolExecutor(12) as executor:
fs = [executor.submit(function, x, n) for n in range(60000)]
concurrent.futures.wait(fs)
… and a balancing-CPU-parallelism-with-numpy-vectorization batching pool:
with concurrent.futures.ThreadPoolExecutor() as executor:
batchsize = 60000 // os.cpu_count()
fs = [executor.submit(np.vector_function, x,
np.arange(n, min(n+batchsize, 60000)))
for n in range(0, 60000, batchsize)]
concurrent.futures.wait(fs)
In the examples above, I used a list comprehension to submit all of the jobs and gather their futures, because we're not doing anything else inside the loop. But from your comments, it sounds like you do have other stuff you want to do inside the loop. So, let's convert it back into an explicit for statement:
with concurrent.futures.ProcessPoolExecutor() as executor:
fs = []
for n in range(60000):
fs.append(executor.submit(function, x, n))
concurrent.futures.wait(fs)
And now, whatever you want to add inside that loop, you can.
However, I don't think you actually want to add anything inside that loop. The loop just submits all the jobs as fast as possible; it's the wait function that sits around waiting for them all to finish, and it's probably there that you want to exit early.
To do this, you can use wait with the FIRST_COMPLETED flag, but it's much simpler to use as_completed.
Also, I'm assuming error is some kind of value that gets set by the tasks. In that case, you will need to put a Lock around it, as with any other mutable value shared between threads. (This is one place where there's slightly more than a one-line difference between a ProcessPoolExecutor and a ThreadPoolExecutor—if you use processes, you need multiprocessing.Lock instead of threading.Lock.)
So:
error_lock = threading.Lock
error = []
def function(x, n):
# blah blah
try:
# blah blah
except Exception as e:
with error_lock:
error.append(e)
# blah blah
with concurrent.futures.ProcessPoolExecutor() as executor:
fs = [executor.submit(function, x, n) for n in range(60000)]
for f in concurrent.futures.as_completed(fs):
do_something_with(f.result())
with error_lock:
if len(error) > 1: exit()
However, you might want to consider a different design. In general, if you can avoid sharing between threads, your life gets a lot easier. And futures are designed to make that easy, by letting you return a value or raise an exception, just like a regular function call. That f.result() will give you the returned value or raise the raised exception. So, you can rewrite that code as:
def function(x, n):
# blah blah
# don't bother to catch exceptions here, let them propagate out
with concurrent.futures.ProcessPoolExecutor() as executor:
fs = [executor.submit(function, x, n) for n in range(60000)]
error = []
for f in concurrent.futures.as_completed(fs):
try:
result = f.result()
except Exception as e:
error.append(e)
if len(error) > 1: exit()
else:
do_something_with(result)
Notice how similar this looks to the ThreadPoolExecutor Example in the docs. This simple pattern is enough to handle almost anything without locks, as long as the tasks don't need to interact with each other.

Python3 can't pickle _thread.RLock objects on list with multiprocessing

I'm trying to parse the websites that contain car's properties(154 kinds of properties). I have a huge list(name is liste_test) that consist of 280.000 used car announcement URL.
def araba_cekici(liste_test,headers,engine):
for link in liste_test:
try:
page = requests.get(link, headers=headers)
.....
.....
When I start my code like that:
araba_cekici(liste_test,headers,engine)
It works and getting results. But approximately in 1 hour, I could only obtain 1500 URL's properties. It is very slow, and I must use multiprocessing.
I found a result on here with multiprocessing. Then I applied to my code, but unfortunately, it is not working.
import numpy as np
import multiprocessing as multi
def chunks(n, page_list):
"""Splits the list into n chunks"""
return np.array_split(page_list,n)
cpus = multi.cpu_count()
workers = []
page_bins = chunks(cpus, liste_test)
for cpu in range(cpus):
sys.stdout.write("CPU " + str(cpu) + "\n")
# Process that will send corresponding list of pages
# to the function perform_extraction
worker = multi.Process(name=str(cpu),
target=araba_cekici,
args=(page_bins[cpu],headers,engine))
worker.start()
workers.append(worker)
for worker in workers:
worker.join()
And it gives:
TypeError: can't pickle _thread.RLock objects
I found some kind of responses with respects to this error. But none of them works(at least I can't apply to my code).
Also, I tried python multiprocess Pool but unfortunately it stucks on jupyter notebook and seems this code works infinitely.
Late answer, but since this question turns up when searching on Google: multiprocessing sends the data to the worker processes via a multiprocessing.Queue, which requires all data/objects sent to be picklable.
In your code, you try to pass header and engine, whose implementations you don't show. (Since header holds the HTTP request header, I suspect that engine is the issue here.) To solve your issue, you either have to make engine picklable, or only instantiate engine within the worker process.

Web crawler returning list vs generator vs producer/consumer

I want to recursively crawl a web-server that hosts thousands of files and then check if they are different from what's in the local repository (this is a part of checking the delivery infrastructure for bugs).
So far I've been playing around with various prototypes and here is what I noticed. If I do a straightforward recursion and put all the files into a list, the operation completes in around 230 seconds. Note that I make only one request per directory, so it makes sense to actually download the files I'm interested in elsewhere:
def recurse_links(base):
result = []
try:
f = urllib.request.urlopen(base)
soup = BeautifulSoup(f.read(), "html.parser")
for anchor in soup.find_all('a'):
href = anchor.get('href')
if href.startswith('/') or href.startswith('..'):
pass
elif href.endswith('/'):
recurse_links(base + href)
else:
result.append(base + href)
except urllib.error.HTTPError as httperr:
print('HTTP Error in ' + base + ': ' + str(httperr))
I figured, if I could start processing the files I'm interested in while the crawler is still working, I could save time. So the next thing I tried was a generator that could be further used as a coroutine. The generator took 260 seconds, slightly more, but still acceptable. Here's the generator:
def recurse_links_gen(base):
try:
f = urllib.request.urlopen(base)
soup = BeautifulSoup(f.read(), "html.parser")
for anchor in soup.find_all('a'):
href = anchor.get('href')
if href.startswith('/') or href.startswith('..'):
pass
elif href.endswith('/'):
yield from recurse_links_gen(base + href)
else:
yield base + href
except urllib.error.HTTPError as http_error:
print(f'HTTP Error in {base}: {http_error}')
Update
Answering some questions that came up in the comments section:
I've got roughly 370k files, but not all of them will make it to the next step. I will check them against a set or dictionary (to get O(1) lookup) before going ahead and compare them to local repo
After more tests it looks like sequential crawler takes less time in roughly 4 out of 5 attempts. And generator took less time once. So at this point is seems like generator is okay
At this point consumer doesn't do anything other than get an item from queue, since it's a concept. However I have flexibility in what I will do with the file URL I get from producer. I can for instance, download only first 100KB of file, calculate it's checksum while in memory and then compare to a pre-calculated local version. What's clear though is that if simply adding thread creation bumps my execution time by a factor of 4 to 5, adding work on consumer thread will not make it any faster.
Finally I decided to give producer/consumer/queue a shot and a simple PoC ran 4 times longer while loading 100% of one CPU core. Here is the brief code (the crawler is the same generator-based crawler from above):
class ProducerThread(threading.Thread):
def __init__(self, done_event, url_queue, crawler, name):
super().__init__()
self._logger = logging.getLogger(__name__)
self.name = name
self._queue = url_queue
self._crawler = crawler
self._event = done_event
def run(self):
for file_url in self._crawler.crawl():
try:
self._queue.put(file_url)
except Exception as ex:
self._logger.error(ex)
So here are my questions:
Are the threads created with threading library actually threads and is there a way for them to be actually distributed between various CPU cores?
I believe the great deal of performance degradation comes from the producer waiting to put an item into the queue. But can this be avoided?
Is the generator slower because it has to save the function context and then load it again over and over?
What's the best way to start actually doing something with those files while the crawler is still populating the queue/list/whatever and thus make the whole program faster?
1) Are the threads created with threading library actually threads and is there a way for them to be actually distributed between various CPU cores?
Yes, these are the threads, but to utilize multiple cores of your CPU, you need to use multiprocessing package.
2) I believe the great deal of performance degradation comes from the producer waiting to put an item into the queue. But can this be avoided?
It depends on the number of threads you are created, one reason may be due to context switches, your threads are making. The optimum value for thread should be 2/3, i.e create 2/3 threads and check the performance again.
3) Is the generator slower because it has to save the function context and then load it again over and over?
Generators are not slow, it is rather good for the problem you are working on, as you find a url , you put that into queue.
4) What's the best way to start actually doing something with those files while the crawler is still populating the queue/list/whatever and thus make the whole program faster?
Create a ConsumerThread class, which fetches the data(url in your case) from the queue and start working on it.

Python: How to trigger multiple process at same instant

I am trying to run a process that does a http POST which in turn will send an alert(time taken to send an alert is in nano second) to a server. I am trying to test the capacity of the server in handling alerts in milliseconds. As per the given standard, the server is said to handle 6000 alerts/second.
I created a piece of code using multiprocessing module, which sends 6000 alerts, but I am using a for loop and hence the time taken to execute the for loop exceeds more than a second. And hence all the 6000 process are not triggered at SAME INSTANT.
Is there a way to trigger multiple(N number) process at same instant?
This is my code: flowtesting.py which is a library. And this is followed by my script after '####'
import json
import httplib2
class flowTesting():
def init(self, companyId, deviceIp):
self.companyId = companyId
self.deviceIp = deviceIp
def generate_savedSearchName(self, randNum):
self.randMsgId = randNum
self.savedSearchName = "TEST %s risk31 more than 3" % self.randMsgId
def def_request_body_dict(self):
self.reqBody_dict = \
{ "Header" : {"agid" : "Agent1",
"mid": self.randMsgId,
"ts" : 1253125001
},
"mp":
{
"host" : self.deviceIp,
"index" : self.companyId,
"savedSearchName" : self.savedSearchName,
}
}
self.req_body = json.dumps(self.reqBody_dict)
def get_default_hdrs(self):
self.hdrs = {'Content-type': 'application/json',
'Accept-Language': 'en-US,en;q=0.8'}
def send_request(self, sIp, method="POST"):
self.sIp = sIp
self.url = "http://%s:8080/agent/splunk/messages" % self.sIp
http_cli = httplib2.Http(timeout=180, disable_ssl_certificate_validation=True)
rsp, rsp_body = http_cli.request(uri=self.url, method=method, headers=self.hdrs, body=self.req_body)
print "rsp: %s and rsp_body: %s" % (rsp, rsp_body)
# My testScript
from flowTesting import flowTesting
import random
import multiprocessing
deviceIp = "10.31.421.35"
companyId = "CPY0000909"
noMsgToBeSent = 1000
sIp = "10.31.44.235"
uniq_msg_id_list = random.sample(xrange(1,10000), noMsgToBeSent)
def runner(companyId, deviceIp, uniq_msg_id):
proc = flowTesting(companyId, deviceIp)
proc.generate_savedSearchName(uniq_msg_id)
proc.def_request_body_dict()
proc.get_default_hdrs()
proc.send_request(sIp)
process_list = []
for uniq_msg_id in uniq_msg_id_list:
savedSearchName = "TEST-1000 %s risk31 more than 3" % uniq_msg_id
process = multiprocessing.Process(target=runner, args=(companyId,deviceIp,uniq_msg_id,))
process.start()
process.join()
process_list.append(process)
print "Process list: %s" % process_list
print "Unique Message Id: %s" % uniq_msg_id_list
Making them all happen in the same instant is obviously impossible—unless you have a 6000-core machine and an OS kernel whose scheduler is able to handle them all perfectly (which you don't), you can't get 6000 pieces of code running at once.
And, even if you did, what they're all trying to do is to send a message on a socket. Even if your kernel was that insanely parallel, unless you have 6000 separate NICs, they're going to end up serialized in the NIC buffer. That's the way IP works: one packet after another. And of course there are all the routers on the path, the server's NIC, the server's OS, etc. And even if IP doesn't get in the way, bytes take time to transfer over a cable. So the only way to do this at the same instant, even in theory, would be to have 6000 NICs on each side and wire them up directly to each other with identical fiber.
However, you don't really need them in the same instant, just closer to each other than they are. You didn't show us your code, but presumably you're just starting 6000 Processes that all immediately try to send a message. That means you're including the process startup time—which can be pretty slow (especially on Windows)—in the skew time.
You can reduce that by using threads instead of processes. That may seem counterintuitive, but Python is pretty good at handling I/O-bound threads, and every modern OS is very good at starting new threads.
But really, what you need is a Barrier on your threads or processes, to let all of them complete all the setup work (including process startup) before any of them try to do any work.
It still probably won't be tight enough, but it will be a lot tighter than you probably have right now.
The next limit you're going to face is context-switching time. Modern OSs are pretty good at scheduling, but not 6000-simultaneous-tasks good. So really, you want to reduce this to N processes, each one just spamming 6000/N connections sequentially as fast as possible. That will get them into the kernel/NIC much faster than trying to do 6000 at once and making the OS do the serialization for you. (In fact, on some platforms, depending on your hardware, you might actually be better off with one process doing 6000 in a row than N doing 6000/N. Test it both ways.)
There's still some overhead for the socket library itself. To get around that, you want to pre-craft all of the IP packets, then create a single raw socket and spam those packets. Send the first packet from each connection, then the second packet from each connection, etc.
You need to use an inter-process synchronization primitive. On Linux you would use a Sys-V semaphore, on Windows you would use a Win32 event.
Your 6000 processes would wait on this semaphore/event, and from a different process you would trigger it, thus releasing all your 6000 processes from their waiting state to a ready state, and then the OS would start executing them as quickly as possible.

Problems with Speed during web-crawling (Python)

I would love to have this programm improve a lot in speed. It reads +- 12000 pages in 10 minutes. I was wondering if there is something what would help a lot to the speed? I hope you guys know some tips. I am supposed to read +- millions of pages... so that would take way too long :( Here is my code:
from eventlet.green import urllib2
import httplib
import time
import eventlet
# Create the URLS in groups of 400 (+- max for eventlet)
def web_CreateURLS():
print str(str(time.asctime( time.localtime(time.time()) )).split(" ")[3])
for var_indexURLS in xrange(0, 2000000, 400):
var_URLS = []
for var_indexCRAWL in xrange(var_indexURLS, var_indexURLS+400):
var_URLS.append("http://www.nu.nl")
web_ScanURLS(var_URLS)
# Return the HTML Source per URL
def web_ReturnHTML(url):
try:
return [urllib2.urlopen(url[0]).read(), url[1]]
except urllib2.URLError:
time.sleep(10)
print "UrlError"
web_ReturnHTML(url)
# Analyse the HTML Source
def web_ScanURLS(var_URLS):
pool = eventlet.GreenPool()
try:
for var_HTML in pool.imap(web_ReturnHTML, var_URLS):
# do something etc..
except TypeError: pass
web_CreateURLS()
I like using greenlets.. but I often benefit from using multiple processes spread over lots of systems.. or just one single system letting the OS take care of all the checks and balances of running multiple processes.
Check out ZeroMQ at http://zeromq.org/ for some good examples on how to make a dispatcher with a TON of listeners that do whatever the dispatcher says. Alternatively check out execnet for a method of quickly getting started with executing remote or local tasks in parallel.
I also use http://spread.org/ a lot and have LOTS of systems listening to a common spread daemon.. it's a very useful message bus where results can be pooled back to and dispatched from a single thread pretty easily.
And then of course there is always redis pub/sub or sync. :)
"Share the load"

Categories

Resources