Python Asynchronous Reverse DNS Lookups

I am looking to do a large number of reverse DNS lookups in a small amount of time. I currently have implemented an asynchronous lookup using socket.gethostbyaddr and concurrent.futures thread pool, but am still not seeing the desired performance. For example, the script took about 22 minutes to complete on 2500 IP addresses.
I was wondering if there is any quicker way to do this without resorting to something like adns-python. I found this http://blog.schmichael.com/2007/09/18/a-lesson-on-python-dns-and-threads/ which provided some additional background.
Code Snippet:
import concurrent.futures
import socket

def get_hostname_from_ip(ip):
    try:
        return socket.gethostbyaddr(ip)[0]
    except socket.error:
        return ""

ips = [...]
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(get_hostname_from_ip, ips))
I think part of the issue is that many of the IP addresses are not resolving and timing out. I tried:
socket.setdefaulttimeout(2.0)
but it seems to have no effect.

I discovered that my main issue was IPs failing to resolve, with the sockets then not obeying their set timeouts and failing only after 30 seconds. See Python 2.6 urllib2 timeout issue.
adns-python was a no-go because of its lack of support for IPv6 (without patches).
After searching around I found this: Reverse DNS Lookups with dnspython and implemented a similar version in my code (his code also uses an optional thread pool and implements a timeout).
In the end I used dnspython with a concurrent.futures thread pool for asynchronous reverse DNS lookups (see Python: Reverse DNS Lookup in a shared hosting and Dnspython: Setting query timeout/lifetime). With a timeout of 1 second this cut the runtime from about 22 minutes to about 16 seconds on 2500 IP addresses. The large difference can probably be attributed to the resolver calls being serialized (the blog post above explains the locking issue around the Global Interpreter Lock and the socket resolver functions) and to the 30-second timeouts.
Code Snippet:
import concurrent.futures
from dns import resolver, reversename

dns_resolver = resolver.Resolver()
dns_resolver.timeout = 1
dns_resolver.lifetime = 1

def get_hostname_from_ip(ip):
    try:
        reverse_name = reversename.from_address(ip)
        # note: dnspython >= 2.0 renamed query() to resolve()
        return dns_resolver.query(reverse_name, "PTR")[0].to_text()[:-1]
    except Exception:
        return ""

ips = [...]
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(get_hostname_from_ip, ips))

Because of the Global Interpreter Lock, you should use ProcessPoolExecutor instead.
https://docs.python.org/dev/library/concurrent.futures.html#processpoolexecutor
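For reference, the swap is a one-line change. A minimal sketch, reusing the question's get_hostname_from_ip; whether it actually helps depends on whether the lookups really are serialized by the GIL rather than simply waiting on the network:

import concurrent.futures
import socket

def get_hostname_from_ip(ip):
    try:
        return socket.gethostbyaddr(ip)[0]
    except socket.error:
        return ""

if __name__ == "__main__":  # required so worker processes can import this module cleanly
    ips = ["8.8.8.8", "1.1.1.1"]  # placeholder addresses
    with concurrent.futures.ProcessPoolExecutor(max_workers=16) as pool:
        print(list(pool.map(get_hostname_from_ip, ips)))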

Please use asynchronous DNS; everything else will give you very poor performance.
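For example, a minimal sketch with dnspython's asyncio-based resolver; this assumes dnspython 2.x, where query() became resolve() and dns.asyncresolver was added:

import asyncio
import dns.asyncresolver
import dns.reversename

async def reverse_lookup(resolver, ip):
    try:
        answer = await resolver.resolve(dns.reversename.from_address(ip), "PTR")
        return str(answer[0]).rstrip(".")
    except Exception:  # NXDOMAIN, timeout, etc.
        return ""

async def main(ips):
    resolver = dns.asyncresolver.Resolver()
    resolver.timeout = 1
    resolver.lifetime = 1
    # gather() issues all queries concurrently on a single thread
    return await asyncio.gather(*(reverse_lookup(resolver, ip) for ip in ips))

print(asyncio.run(main(["8.8.8.8", "1.1.1.1"])))  # placeholder addresses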


Implement gethostbyaddr() with asyncore

I was having fun with socket.gethostbyaddr(), looking for how to speed up a really simple piece of code that generates some IP addresses randomly and tries to resolve them. The problem comes when no host can be found: there is a timeout that can be really long (about 10 seconds...).
By chance, I found this article, which solves the problem by using multi-threading: https://www.depier.re/attempts_to_speed_up_gethostbyaddr/
I was wondering if it is possible to do something equivalent using asyncore? That's what I tried to do first, but I failed miserably...
Here is a template :
import socket
import random

def get_ip():
    a = str(random.randint(140, 150))
    b = str(random.randint(145, 150))
    c = str(random.randint(145, 150))
    for d in range(100):
        addr = a + "." + b + "." + c + "." + str(1 + d)
        yield addr

for addr in get_ip():
    try:
        o = socket.gethostbyaddr(addr)
        print addr + "...Ok :"
        print "---->" + str(o[0])
    except:
        print addr + "...Nothing"
You are looking for a way to convert several IPs to names (or vice versa) in parallel. Basically this is a DNS request/response operation, and gethostbyaddr does the lookup synchronously, i.e. in a blocking manner: it sends the request, waits for the response, and returns the result.
asyncio and similar libraries use so-called coroutines and cooperative scheduling. Cooperative means that coroutines are written to support the concurrency. A running coroutine explicitly returns control (using await or yield from) to a waiting scheduler, which then selects another coroutine and runs it until that one returns control, and so on. Only one coroutine can be running at a time. For a smooth run, coroutines must not execute code for long without returning control. A blocking operation in a coroutine blocks the whole program. That rules out gethostbyaddr.
A solution requires support for asynchronous DNS lookups. A coroutine sends the DNS request, sets a timeout, arranges for the DNS response to be passed back to it, and returns control. Thus multiple coroutines can send their requests one after another before they wait for all the responses.
There are third-party libraries for async DNS, but I have never used them. Looking at the aiodns examples, it seems quite easy to write the code you are looking for; asyncio.gather would probably be the core of such a function.
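A minimal sketch of that idea using aiodns and asyncio.gather, mirroring the question's address generator. This assumes a reasonably recent aiodns; the gethostbyaddr coroutine and the timeout argument to DNSResolver are features of newer releases:

import asyncio
import random
import aiodns

def get_ip():
    # same random /24 generator as in the question
    a = str(random.randint(140, 150))
    b = str(random.randint(145, 150))
    c = str(random.randint(145, 150))
    return [a + "." + b + "." + c + "." + str(1 + d) for d in range(100)]

async def reverse(resolver, addr):
    try:
        result = await resolver.gethostbyaddr(addr)
        return addr, result.name
    except aiodns.error.DNSError:
        return addr, None

async def main():
    resolver = aiodns.DNSResolver(timeout=2)
    results = await asyncio.gather(*(reverse(resolver, addr) for addr in get_ip()))
    for addr, name in results:
        if name:
            print(addr + "...Ok :")
            print("---->" + name)
        else:
            print(addr + "...Nothing")

asyncio.run(main())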

Python3: RuntimeError: can't start new thread [duplicate]

I have a site that runs with the following configuration:
Django + mod_wsgi + Apache
In one of the user requests, I send another HTTP request to another service, and I do this with Python's httplib library.
But sometimes this service doesn't answer for too long, and httplib's timeout doesn't work. So I create a thread, send the request to the service in that thread, and join it after 20 seconds (20 seconds is the timeout of the request). This is how it works:
import threading
import httplib
import base64

class HttpGetTimeOut(threading.Thread):
    def __init__(self, **kwargs):
        self.config = kwargs
        self.resp_data = None
        self.exception = None
        super(HttpGetTimeOut, self).__init__()

    def run(self):
        h = httplib.HTTPSConnection(self.config['server'])
        h.connect()
        sended_data = self.config['sended_data']
        h.putrequest("POST", self.config['path'])
        h.putheader("Content-Length", str(len(sended_data)))
        h.putheader("Content-Type", 'text/xml; charset="utf-8"')
        if 'base_auth' in self.config:
            base64string = base64.encodestring('%s:%s' % self.config['base_auth'])[:-1]
            h.putheader("Authorization", "Basic %s" % base64string)
        h.endheaders()
        try:
            h.send(sended_data)
            self.resp_data = h.getresponse()
        except httplib.HTTPException, e:
            self.exception = e
        except Exception, e:
            self.exception = e
something like this...
And use it by this function:
getting = HttpGetTimeOut(**req_config)
getting.start()
getting.join(COOPERATION_TIMEOUT)
if getting.isAlive():  # maybe need some lock here
    getting._Thread__stop()
    raise ValueError('Timeout')
else:
    if getting.resp_data:
        r = getting.resp_data
    else:
        if getting.exception:
            raise ValueError('Request Exception')
        else:
            raise ValueError('Undefined exception')
And everything works fine, but sometimes I start catching this exception:
error: can't start new thread
at the line that starts the new thread:
getting.start()
and the next and final line of the traceback is:
File "/usr/lib/python2.5/threading.py", line 440, in start
_start_new_thread(self.__bootstrap, ())
So the question is: what's happening?
Thanks to all, and sorry for my poor English. :)
The "can't start new thread" error almost certainly due to the fact that you have already have too many threads running within your python process, and due to a resource limit of some kind the request to create a new thread is refused.
You should probably look at the number of threads you're creating; the maximum number you will be able to create will be determined by your environment, but it should be in the order of hundreds at least.
It would probably be a good idea to re-think your architecture here; seeing as this is running asynchronously anyhow, perhaps you could use a pool of threads to fetch resources from another site instead of always starting up a thread for every request.
Another improvement to consider is your use of Thread.join and Thread.stop; this would probably be better accomplished by providing a timeout value to the constructor of HTTPSConnection.
You are starting more threads than your system can handle. There is a limit to the number of threads that can be active for one process.
Your application is starting threads faster than the threads are running to completion. If you need to start many threads, you need to do it in a more controlled manner; I would suggest using a thread pool, as sketched below.
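A minimal sketch of that idea in modern Python, using concurrent.futures and a per-request timeout instead of killing threads; the URL, worker count, and timeout are placeholders, not values from the question:

import concurrent.futures
import urllib.request

def fetch(url, timeout=20):
    # the timeout covers the connection attempt and each blocking read
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except Exception as exc:
        return None, exc

urls = ["https://example.com/service"] * 50  # placeholder work items
# the pool caps the number of live threads, so the process never hits the OS limit
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(fetch, urls))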
I was running into a similar situation, but my process needed a lot of threads running to take care of a lot of connections.
I counted the number of threads with the command:
ps -fLu user | wc -l
It displayed 4098.
I switched to the user and looked at the system limits:
sudo -u myuser -s /bin/bash
ulimit -u
Got 4096 as the response.
So, I edited /etc/security/limits.d/30-myuser.conf and added the lines:
myuser hard nproc 16384
myuser soft nproc 16384
Restarted the service and now it's running with 7017 threads.
P.S. I have a 32-core server and I'm handling 18k simultaneous connections with this configuration.
I think the best way in your case is to set a socket timeout instead of spawning a thread:
h = httplib.HTTPSConnection(self.config['server'],
                            timeout=self.config['timeout'])
Also, you can set a global default timeout with the socket.setdefaulttimeout() function.
Update: See the answers to the question Is there any way to kill a Thread in Python? (there are several quite informative ones) to understand why. Thread.__stop() doesn't terminate the thread, but rather sets an internal flag so that it's considered already stopped.
I completely rewrote the code, switching from httplib to pycurl.
import pycurl
import StringIO

# server, path, sended_data and the timeout constants come from the original config
c = pycurl.Curl()
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.setopt(pycurl.CONNECTTIMEOUT, CONNECTION_TIMEOUT)
c.setopt(pycurl.TIMEOUT, COOPERATION_TIMEOUT)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.POST, 1)
c.setopt(pycurl.SSL_VERIFYHOST, 0)
c.setopt(pycurl.SSL_VERIFYPEER, 0)
c.setopt(pycurl.URL, "https://" + server + path)
c.setopt(pycurl.POSTFIELDS, sended_data)
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()
something like that.
And I'm testing it now. Thanks to all of you for the help.
If you are trying to set a timeout, why don't you use urllib2?
I'm running a Python script on my machine only to copy and convert some files from one format to another, and I want to maximize the number of running threads to finish as quickly as possible.
Note: this is not a good workaround from an architecture perspective if you aren't using it for a quick script on a specific machine.
In my case, I checked the maximum number of running threads that my machine could run before I got the error; it was 150.
I added this code before starting a new thread; it checks whether the maximum number of running threads has been reached, and if so the app waits until some of the running threads finish before starting new ones:
while threading.active_count() > 150:
    time.sleep(5)
mythread.start()
If you are using a ThreadPoolExecutor, the problem may be that your max_workers is higher than the threads allowed by your OS.
It seems that the executor keeps information about the last executed threads in the process table, even if the threads are already done. This means that when your application has been running for a long time, it will eventually register in the process table as many threads as ThreadPoolExecutor.max_workers.
As far as I can tell, it's not a Python problem. Your system somehow cannot create another thread (I had the same problem and couldn't start htop in another shell via ssh).
The answer from Fernando Ulisses dos Santos is really good. I just want to add that there are other tools that limit the number of processes and memory usage "from the outside". It's pretty common for virtual servers. The starting point is the interface of your vendor, or you might have luck finding some information in files like
/proc/user_beancounters

gevent / requests hangs while making lots of head requests

I need to make 100k HEAD requests, and I'm using gevent on top of requests. My code runs for a while but eventually hangs. I'm not sure why it's hanging, or whether it's hanging inside requests or gevent. I'm using the timeout argument in both requests and gevent.
Please take a look at my code snippet below, and let me know what I should change.
import gevent
from gevent import monkey, pool
monkey.patch_all()
import requests

def get_head(url, timeout=3):
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout)
    except:
        return None
def expand_short_urls(short_urls, chunk_size=100, timeout=60*5):
    chunk_list = lambda l, n: (l[i:i+n] for i in range(0, len(l), n))
    p = pool.Pool(chunk_size)
    print 'Expanding %d short_urls' % len(short_urls)
    results = {}
    for i, _short_urls_chunked in enumerate(chunk_list(short_urls, chunk_size)):
        print '\t%d. processing %d urls # %s' % (i, chunk_size, str(datetime.datetime.now()))
        jobs = [p.spawn(get_head, _short_url) for _short_url in _short_urls_chunked]
        gevent.joinall(jobs, timeout=timeout)
        results.update({_short_url: job.get().url
                        for _short_url, job in zip(_short_urls_chunked, jobs)
                        if job.get() is not None and job.get().status_code == 200})
    return results
I've tried grequests, but it's been abandoned, and I've gone through the GitHub pull requests, but they all have issues too.
The RAM usage you are observing mainly stems from all the data that piles up while storing 100,000 response objects, and from all the underlying overhead. I have reproduced your application case and fired off HEAD requests against 15000 URLs from the top Alexa ranking. It did not really matter
whether I used a gevent Pool (i.e. one greenlet per connection) or a fixed set of greenlets, all requesting multiple URLs, or
how large I set the pool size.
In the end, the RAM usage grew over time, to considerable amounts. However, I noticed that changing from requests to urllib2 already led to a reduction in RAM usage by about a factor of two. That is, I replaced
result = requests.head(url)
with
request = urllib2.Request(url)
request.get_method = lambda : 'HEAD'
result = urllib2.urlopen(request)
Some other advice: do not use two timeout mechanisms. Gevent's timeout approach is very solid, and you can easily use it like this:
def gethead(url):
    result = None
    try:
        with Timeout(5, False):
            result = requests.head(url)
    except Exception as e:
        result = e
    return result
It might look tricky, but it either returns None (after quite precisely 5 seconds, indicating a timeout), an exception object representing a communication error, or the response. Works great!
Although this likely is not part of the issue, in such cases I recommend keeping workers alive and letting them work on multiple items each! The overhead of spawning greenlets is small, indeed. Still, this would be a very simple solution with a set of long-lived greenlets:
from gevent import spawn, joinall
from gevent.queue import Queue, Empty

def qworker(qin, qout):
    while True:
        try:
            qout.put(gethead(qin.get(block=False)))
        except Empty:
            break

qin = Queue()
qout = Queue()

for url in urls:
    qin.put(url)

workers = [spawn(qworker, qin, qout) for i in xrange(POOLSIZE)]
joinall(workers)
returnvalues = [qout.get() for _ in xrange(len(urls))]
Also, you really need to appreciate that this is a large-scale problem you are tackling, which yields non-standard issues. When I reproduced your scenario with a timeout of 20 s, 100 workers, and 15000 URLs to be requested, I easily got a large number of sockets:
# netstat -tpn | wc -l
10074
That is, the OS had more than 10000 sockets to manage, most of them in TIME_WAIT state. I also observed "Too many open files" errors and tuned the limits up via sysctl. When you request 100,000 URLs you will probably hit such limits too, and you need to come up with measures to prevent the system from starving.
Also note the way you are using requests: it automatically follows redirects from HTTP to HTTPS and automatically verifies the certificate, all of which surely costs RAM.
In my measurements, when I divided the number of requested URLs by the runtime of the program, I almost never passed 100 responses/s, which is the result of the high-latency connections to foreign servers all over the world. I guess you are also affected by such a limit. Adjust the rest of the architecture to this limit, and you will probably be able to generate a data stream from the Internet to disk (or a database) without too much RAM usage in between.
I should address your two main questions, specifically:
I think gevent/the way you are using it is not your problem. I think you are just underestimating the complexity of your task. It comes along with nasty problems, and drives your system to its limits.
Your RAM usage issue: start off by using urllib2, if you can. Then, if things still accumulate too much, you need to work against accumulation. Try to produce a steady state: you might want to start writing data off to disk and generally work towards a situation where objects can become garbage collected; a sketch of that approach follows below.
Your code "eventually hangs": probably this is due to your RAM issue. If it is not, then do not spawn so many greenlets, but reuse them as indicated. Also, further reduce concurrency, monitor the number of open sockets, increase system limits if necessary, and try to find out exactly where your software hangs.
I'm not sure if this will resolve your issue, but you are not using pool.Pool() correctly.
Try this:
def expand_short_urls(short_urls, chunk_size=100):
    # Pool() automatically limits your process to chunk_size greenlets running concurrently,
    # thus you don't need to do all that chunking business you were doing in your for loop
    p = pool.Pool(chunk_size)
    print 'Expanding %d short_urls' % len(short_urls)
    # spawn() (both gevent.spawn() and Pool.spawn()) returns a gevent.Greenlet object,
    # NOT the value your function, get_head, will return
    threads = [p.spawn(get_head, short_url) for short_url in short_urls]
    p.join()
    # to access the returned value of your function, access the Greenlet.value property
    results = {short_url: thread.value.url
               for short_url, thread in zip(short_urls, threads)
               if thread.value is not None and thread.value.status_code == 200}
    return results

Asynchronous DNS resolver testing

I want to test a large number of IPs to look for open DNS resolvers. I'm trying to find the most efficient way to parallelize this. At the moment I'm trying to accomplish this with Twisted. I want to have 10 or 20 parallel threads sending a query to avoid blocking through timeouts.
Twisted has a DNSDatagramProtocol that seems suitable but I just can't figure out how to put it together with the twisted "reactor" and "threads" facilities to make it run efficiently.
I read a lot of the twisted documentation but I'm still not sure what would be the best way to do it.
Could someone give an example how this can be accomplished?
Here's a quick example demonstrating the Twisted Names API:
from sys import argv
from itertools import cycle
from pprint import pprint
from twisted.names import client
from twisted.internet.task import react
from twisted.internet.defer import gatherResults, inlineCallbacks
def query(reactor, server, name):
# Create a new resolver that uses the given DNS server
resolver = client.Resolver(
resolv="/dev/null", servers=[(server, 53)], reactor=reactor)
# Use it to do an A request for the name
return resolver.lookupAddress(name)
#inlineCallbacks
def main(reactor, *names):
# Here's some random DNS servers to which to issue requests.
servers = ["4.2.2.1", "8.8.8.8"]
# Handy trick to cycle through those servers forever
next_server = cycle(servers).next
# Issue queries for all the names given, alternating between servers.
results = []
for n in names:
results.append(query(reactor, next_server(), n))
# Wait for all the results
results = yield gatherResults(results)
# And report them
pprint(zip(names, results))
if __name__ == '__main__':
# Run the main program with the reactor going and pass names
# from the command line arguments to be resolved
react(main, argv[1:])
Try gevent: spawn many greenlets to do the DNS resolution. gevent also has a nice DNS resolution API: http://www.gevent.org/gevent.dns.html
They even have an example:
https://github.com/gevent/gevent/blob/master/examples/dns_mass_resolve.py
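The question asks about Twisted, but along the lines of this gevent suggestion, here is a rough sketch that combines gevent's monkey patching with dnspython to probe specific IPs as resolvers. It assumes dnspython's socket-based resolver cooperates once gevent patches the socket module, and resolve() requires dnspython 2.x; the probe name and target IPs are placeholders:

from gevent import monkey, pool
monkey.patch_all()
import dns.resolver

def is_open_resolver(ip, probe_name="example.com"):
    # Ask the target IP itself to resolve a name for us; an answer suggests an open resolver.
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ip]
    r.timeout = r.lifetime = 2
    try:
        r.resolve(probe_name, "A")
        return ip, True
    except Exception:
        return ip, False

targets = ["198.51.100.1", "203.0.113.5"]  # placeholder IPs to probe
p = pool.Pool(20)  # 10-20 parallel queries, as in the question
for ip, is_open in p.imap_unordered(is_open_resolver, targets):
    print(ip, "open" if is_open else "closed/unreachable")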

Problems with Speed during web-crawling (Python)

I would love to improve the speed of this program a lot. It reads about 12000 pages in 10 minutes. I was wondering if there is something that would help the speed a lot? I hope you guys know some tips. I am supposed to read millions of pages... so that would take way too long :( Here is my code:
from eventlet.green import urllib2
import httplib
import time
import eventlet

# Create the URLs in groups of 400 (+- max for eventlet)
def web_CreateURLS():
    print str(str(time.asctime(time.localtime(time.time()))).split(" ")[3])
    for var_indexURLS in xrange(0, 2000000, 400):
        var_URLS = []
        for var_indexCRAWL in xrange(var_indexURLS, var_indexURLS + 400):
            var_URLS.append("http://www.nu.nl")
        web_ScanURLS(var_URLS)

# Return the HTML source per URL
def web_ReturnHTML(url):
    try:
        return [urllib2.urlopen(url[0]).read(), url[1]]
    except urllib2.URLError:
        time.sleep(10)
        print "UrlError"
        web_ReturnHTML(url)

# Analyse the HTML source
def web_ScanURLS(var_URLS):
    pool = eventlet.GreenPool()
    try:
        for var_HTML in pool.imap(web_ReturnHTML, var_URLS):
            # do something etc..
            pass
    except TypeError:
        pass

web_CreateURLS()
I like using greenlets, but I often benefit from using multiple processes spread over lots of systems, or just one single system, letting the OS take care of all the checks and balances of running multiple processes.
Check out ZeroMQ at http://zeromq.org/ for some good examples of how to make a dispatcher with a TON of listeners that do whatever the dispatcher says (a rough sketch follows below). Alternatively, check out execnet for a way to quickly get started with executing remote or local tasks in parallel.
I also use http://spread.org/ a lot and have LOTS of systems listening to a common Spread daemon; it's a very useful message bus where results can be pooled back and dispatched from a single thread pretty easily.
And then of course there is always redis pub/sub or sync. :)
"Share the load"
