I want to test a large number of IPs to look for open DNS resolvers. I'm trying to find the most efficient way to parallelize this. At the moment I'm trying to accomplish it with Twisted. I want to have 10 or 20 parallel threads sending queries to avoid blocking on timeouts.
Twisted has a DNSDatagramProtocol that seems suitable but I just can't figure out how to put it together with the twisted "reactor" and "threads" facilities to make it run efficiently.
I read a lot of the twisted documentation but I'm still not sure what would be the best way to do it.
Could someone give an example how this can be accomplished?
Here's a quick example demonstrating the Twisted Names API:
from sys import argv
from itertools import cycle
from pprint import pprint
from twisted.names import client
from twisted.internet.task import react
from twisted.internet.defer import gatherResults, inlineCallbacks
def query(reactor, server, name):
    # Create a new resolver that uses the given DNS server
    resolver = client.Resolver(
        resolv="/dev/null", servers=[(server, 53)], reactor=reactor)
    # Use it to do an A request for the name
    return resolver.lookupAddress(name)

@inlineCallbacks
def main(reactor, *names):
    # Here are some DNS servers to which to issue requests.
    servers = ["4.2.2.1", "8.8.8.8"]
    # Handy trick to cycle through those servers forever
    next_server = cycle(servers).next

    # Issue queries for all the names given, alternating between servers.
    results = []
    for n in names:
        results.append(query(reactor, next_server(), n))

    # Wait for all the results
    results = yield gatherResults(results)

    # And report them
    pprint(zip(names, results))

if __name__ == '__main__':
    # Run the main program with the reactor running and pass names
    # from the command line arguments to be resolved
    react(main, argv[1:])
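Assuming this is saved as something like dnsquery.py (the file name is just an illustration), you would run it as python dnsquery.py example.com example.org and get the lookup results printed for each name once all the queries have completed.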
Try gevent: spawn many greenlets to do the DNS resolution. gevent also has a nice DNS resolution API: http://www.gevent.org/gevent.dns.html
They even have an example:
https://github.com/gevent/gevent/blob/master/examples/dns_mass_resolve.py
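For illustration, here is a minimal sketch of that approach (not the linked example). It assumes a recent gevent and uses monkey patching so the standard socket module resolves names cooperatively; the list of names is a placeholder.

# Minimal sketch: resolve many names concurrently with greenlets.
import gevent
from gevent import monkey
monkey.patch_all()  # make socket operations cooperative

import socket

names = ["example.com", "example.org"]  # placeholder list of names

def resolve(name):
    try:
        return name, socket.gethostbyname(name)
    except socket.gaierror:
        return name, None

greenlets = [gevent.spawn(resolve, n) for n in names]
gevent.joinall(greenlets, timeout=10)
print([g.value for g in greenlets])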
I'm currently working on a project and I need to fetch data from several switches by sending SSH requests, as follows:
Switch 1 -> 100 requests
Switch 2 -> 500 requests
Switch 3 -> 1000 requests
…
Switch 70 -> 250 requests
So there are several requests (5500 in total) spread over 70 switches.
Today I am using a JSON file built like this:
{
    "ip_address1":
    [
        {"command": "command1"},
        {"command": "command2"},
        ...
        {"command": "command100"}
    ],
    "ip_address2":
    [
        {"command": "command1"},
        {"command": "command2"},
        ...
        {"command": "command100"}
    ],
    ...
    "ip_address70":
    [
        {"command": "command1"},
        {"command": "command2"},
        ...
        {"command": "command100"}
    ]
}
Each command is a CLI command for a switch, which I connect to over SSH.
Today I'm using Python with multithreading and 8 workers, because I only have 4 CPUs.
The whole script takes about 1 hour to run, which is too long.
Is there a way to drastically speed up this process, please?
A friend told me about Golang channels and goroutines, but I'm not sure it's worth moving from Python to Go if it makes no difference to the runtime.
Can you please give me some advice?
Thank you very much,
Python offers a pretty straightforward multiprocessing library. Especially for a straightforward task like yours, I would stick to the language I am most comfortable with.
In Python you would basically build a list out of your IP addresses and their commands.
Using an example straight from the documentation: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool
With the pool.map function from the multiprocessing module, you can pass each element of that list to a function which sends the commands to the corresponding server. You might also want to have a look at the other mapping functions the Pool class provides.
from multiprocessing import Pool
import json

def execute_ssh(address_command_mapping):
    # add your logic to pass the commands to the corresponding IP address
    return

if __name__ == '__main__':
    # assuming your ip addresses are stored in a json file
    with open("ip_addresses.json", "r") as file:
        ip_addresses = json.load(file)

    # transform the address dictionary into a list of single-entry dictionaries
    address_list = [{ip: commands} for ip, commands in ip_addresses.items()]

    # start 4 worker processes
    with Pool(processes=4) as pool:
        # pool.map passes each element to the 'execute_ssh' function
        pool.map(execute_ssh, address_list)
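The execute_ssh body is left open above. Here is one possible sketch of it using paramiko, purely as an illustration: the credentials, the {ip: [{"command": ...}]} structure, and the way results are collected are assumptions, not part of the original answer.

import paramiko

def execute_ssh(address_command_mapping):
    # assumed shape: {"ip_address": [{"command": "..."}, ...]}
    ip, commands = next(iter(address_command_mapping.items()))
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # hypothetical credentials; replace with your own authentication
    client.connect(ip, username="admin", password="secret", timeout=10)
    outputs = []
    try:
        for entry in commands:
            stdin, stdout, stderr = client.exec_command(entry["command"])
            outputs.append(stdout.read().decode("utf-8", errors="replace"))
    finally:
        client.close()
    return ip, outputs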
Thank you Leon,
Is the pool.map function working the same way as ThreadPoolExecutor from concurrent.futures?
Here is what I’m using:
from concurrent.futures import ThreadPoolExecutor
def task(n):
    # sending command
    pass

def main():
    print("Starting ThreadPoolExecutor")
    with ThreadPoolExecutor(max_workers=3) as executor:
        for element in mylist:
            executor.submit(task, element)
    print("All tasks complete")

if __name__ == '__main__':
    main()
So is it working the same way?
Thank you
I'm trying to create a discovery script, which will use multithreading to ping multiple IP addresses at once.
import ipaddress
import sh
from threading import Thread
from Queue import Queue

user_input = raw_input("")
network = ipaddress.ip_network(unicode(user_input))

def pingit(x):
    for i in x.hosts():
        try:
            sh.ping(i, "-c 1")
            print i, "is active"
        except sh.ErrorReturnCode_1:
            print "no response from", i

queue_to_work = Queue(maxsize=0)
number_of_workers = 30

for i in range(number_of_workers):
    workers = Thread(target=pingit(network), args=(queue_to_work,))
    workers.getDaemon(True)
    workers.start()
When I run this script, I get the ping responses, but it is not fast. I believe the multithreading is not kicking in.
Could someone please tell me where I'm going wrong?
Many thanks.
You are doing it completely wrong.
import ipaddress
import sh
from threading import Thread

user_input = raw_input("")
network = ipaddress.ip_network(unicode(user_input))

def pingit(x):
    for i in x.hosts():
        try:
            sh.ping(i, "-c 1")
            print i, "is active"
        except sh.ErrorReturnCode_1:
            print "no response from", i

workers = Thread(target=pingit, args=(network,))
workers.start()
This is how you start a thread. Writing pingit(network) actually runs the function and passes its result to Thread, whereas you want to pass the function itself. You should pass the function pingit and the argument network separately. Note that this creates a thread that effectively runs pingit(network).
Now, if you want to use multiple threads, you can do so in a loop. But you also have to create separate sets of data to feed into the threads. Suppose you have a list of hosts, e.g. ['A', 'B', 'C', 'D'], and you want to ping them from two threads. You have to create two threads that call pingit(['A', 'B']) and pingit(['C', 'D']).
As a side note, don't use ip_network to find the IP addresses, use ip_address. You can ping an IP address, but not a network. Of course, if you want to ping all IP addresses in the network, ip_network is fine.
You may want to split the user input into multiple IP addresses and separate the list into sublists for your threads; that is straightforward. Then you can write a for loop to create the threads and feed each sublist into the arguments of a thread, so the threads actually run with different parameters. A sketch of that idea is shown below.
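For illustration, here is a minimal sketch of that idea in the same Python 2 style as the question. The chunking scheme and the number of workers are assumptions, not part of the original answer.

# Minimal sketch: split the network's hosts into sublists and ping each
# sublist from its own thread.
import ipaddress
import sh
from threading import Thread

def pingit(hosts):
    for i in hosts:
        try:
            sh.ping(str(i), "-c 1")
            print i, "is active"
        except sh.ErrorReturnCode_1:
            print "no response from", i

network = ipaddress.ip_network(unicode(raw_input("")))
hosts = list(network.hosts())
number_of_workers = 30

# Split the host list into roughly equal sublists, one per worker.
chunks = [hosts[i::number_of_workers] for i in range(number_of_workers)]

threads = [Thread(target=pingit, args=(chunk,)) for chunk in chunks if chunk]
for t in threads:
    t.start()
for t in threads:
    t.join()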
I would like to share my thoughts on this.
Since I guess this is something you would like to run in the background, I would suggest you use a task queue instead of a plain Thread.
This offers you multiple advantages:
You can add multiple functionalities to the queue
If something goes wrong, the queue just continues and catches the error for you. You can even add some logging to it for those cases.
The queue runs as a daemon, with every item in the queue handled as its own process
Systems like RabbitMQ or Redis are built for this specific kind of task.
It is relatively easy to set up
I have created a simple script for you that you might be able to use:
import subprocess
from celery import Celery

app = Celery()

@app.task
def check_host(ip, port=80, timeout=1):
    output = subprocess.Popen(["ping", "-c", "1", ip], stdout=subprocess.PIPE).communicate()[0]
    if "1 packets received" in output.decode("utf-8"):
        return "{ip} connected successfully".format_map(vars())
    else:
        return "{ip} was unable to connect".format_map(vars())

def pingit(ip="8.8.8.8"):
    check_host.delay(ip)
What this does is the following:
You first import Celery, which lets you connect to the Celery worker that runs in the background.
You create an app, which is an instance of the Celery class.
You use this app to create a task. Inside this task you put all the actions you want to perform asynchronously.
You call the delay() method on the task.
Now the task runs in the background, and all other tasks are put in the queue to run asynchronously for you.
So you can just put everything in a loop, and the queue will handle it for you (see the sketch below).
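A minimal sketch of that loop, assuming the check_host task above and a placeholder list of IP addresses:

# Enqueue one Celery task per IP; the background workers pick them up.
ip_addresses = ["8.8.8.8", "8.8.4.4", "1.1.1.1"]  # placeholder list

for ip in ip_addresses:
    check_host.delay(ip)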
More information about Celery: http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html
And a great tutorial to get everything setup I found on YouTube: https://www.youtube.com/watch?v=68QWZU_gCDA
I hope this can help you a bit further
Is there any way to make multiple calls from an XML-RPC client to different XML-RPC servers at a time?
My server code looks like this (I'll have this code running on two machines, server1 and server2):
import time
import threading
from SimpleXMLRPCServer import SimpleXMLRPCServer

class TestMethods(object):
    def printHello(self):
        while(1):
            time.sleep(10)
            print "HELLO FROM SERVER"
            return True

class ServerThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.server = SimpleXMLRPCServer(("x.x.x.x", 8000))
        self.server.register_instance(TestMethods())

    def run(self):
        self.server.serve_forever()

server = ServerThread()
server.start()
My Client code looks like this:
import xmlrpclib
client1 = xmlrpclib.ServerProxy("http://x.x.x.x:8080") # registering with server 1
client2 = xmlrpclib.ServerProxy("http://x.x.x.x:8080") # registering with server 2
ret1 = client1.printHello()
ret2 = client2.printHello()
Now, on the 10th second I'll get a response from server1, and on the 20th second I'll get a response from server2, which is unfortunately not what I want.
I'm trying to make the calls to the two machines at the same time so that I get the responses back from both machines at once.
Please help me out. Thanks in advance.
There are a few different ways to do this.
python multiprocessing
This is the built-in Python module for running stuff in parallel. The docs are fairly clear. The easiest and most extensible way of using it is with a 'Pool' of workers, to which you can add as many tasks as you want.
from multiprocessing import Pool
import xmlrpclib

def hello_client(url):
    client = xmlrpclib.ServerProxy(url)
    return client.printHello()

p = Pool(processes=10)  # maximum number of requests at once.
servers_list = ['http://x.x.x.x:8080', 'http://x.x.x.x:8080']
# you can add as many other servers into that list as you want!
results = p.map(hello_client, servers_list)
print results
twisted python
twisted python is an amazingly clever system for writing all kinds of multithreaded / parallel / multiprocess stuff. The documentation is a bit confusing.
Tornado
Another non-blocking python framework. Also very cool. Here's an answer about XMLRPC, python, and tornado.
gevent
A 'magic' way of allowing blocking tasks in python to happen in the background. Very, very cool. And here's a question about how to use XMLRPC in python with gevent; a minimal sketch of that route follows.
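For illustration only (not from the linked question), here is a minimal sketch of the gevent route, reusing the same placeholder server URLs as above. It monkey-patches the standard library so xmlrpclib's blocking calls become cooperative, then runs them in parallel greenlets.

# Minimal sketch: gevent + xmlrpclib calling two servers concurrently.
from gevent import monkey
monkey.patch_all()  # make socket / httplib calls cooperative

import gevent
import xmlrpclib

def hello_client(url):
    client = xmlrpclib.ServerProxy(url)
    return client.printHello()

urls = ['http://x.x.x.x:8080', 'http://x.x.x.x:8080']  # placeholders
jobs = [gevent.spawn(hello_client, url) for url in urls]
gevent.joinall(jobs, timeout=30)
print [job.value for job in jobs]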
I am looking to do a large number of reverse DNS lookups in a small amount of time. I currently have implemented an asynchronous lookup using socket.gethostbyaddr and concurrent.futures thread pool, but am still not seeing the desired performance. For example, the script took about 22 minutes to complete on 2500 IP addresses.
I was wondering if there is any quicker way to do this without resorting to something like adns-python. I found this http://blog.schmichael.com/2007/09/18/a-lesson-on-python-dns-and-threads/ which provided some additional background.
Code Snippet:
import socket
import concurrent.futures

ips = [...]

def get_hostname_from_ip(ip):
    try:
        return socket.gethostbyaddr(ip)[0]
    except:
        return ""

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(get_hostname_from_ip, ips))
I think part of the issue is that many of the IP addresses are not resolving and timing out. I tried:
socket.setdefaulttimeout(2.0)
but it seems to have no effect.
I discovered my main issue was IPs failing to resolve, and thus sockets not obeying their set timeouts and failing only after 30 seconds. See Python 2.6 urllib2 timeout issue.
adns-python was a no-go because of its lack of support for IPv6 (without patches).
After searching around I found this: Reverse DNS Lookups with dnspython and implemented a similar version in my code (his code also uses an optional thread pool and implements a timeout).
In the end I used dnspython with a concurrent.futures thread pool for asynchronous reverse DNS lookups (see Python: Reverse DNS Lookup in a shared hosting and Dnspython: Setting query timeout/lifetime). With a timeout of 1 second this cut the runtime from about 22 minutes to about 16 seconds on 2500 IP addresses. The large difference can probably be attributed to the Global Interpreter Lock being held during the socket lookups, and to the 30-second timeouts.
Code Snippet:
import concurrent.futures
from dns import resolver, reversename

dns_resolver = resolver.Resolver()
dns_resolver.timeout = 1
dns_resolver.lifetime = 1

ips = [...]
results = []

def get_hostname_from_ip(ip):
    try:
        reverse_name = reversename.from_address(ip)
        return dns_resolver.query(reverse_name, "PTR")[0].to_text()[:-1]
    except:
        return ""

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(get_hostname_from_ip, ips))
Because of the Global Interpreter Lock, you should use ProcessPoolExecutor instead.
https://docs.python.org/dev/library/concurrent.futures.html#processpoolexecutor
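For example, a minimal sketch of that swap against the snippet above (purely illustrative; the worker count is just carried over):

# Same map as before, but with processes instead of threads, so each lookup
# runs in its own interpreter and is not serialized by the GIL.
with concurrent.futures.ProcessPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(get_hostname_from_ip, ips))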
Please use asynchronous DNS; everything else will give you very poor performance.
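As an illustration only (this answer gives no code), here is a minimal sketch of asynchronous reverse lookups using dnspython's asyncio resolver. It assumes dnspython 2.x; the IP list is a placeholder.

# Minimal sketch: asynchronous PTR lookups with dnspython's asyncio resolver.
import asyncio
import dns.asyncresolver
import dns.reversename

async def reverse_lookup(resolver, ip):
    try:
        answer = await resolver.resolve(dns.reversename.from_address(ip), "PTR")
        return ip, answer[0].to_text()
    except Exception:
        return ip, ""

async def main(ips):
    resolver = dns.asyncresolver.Resolver()
    resolver.timeout = 1
    resolver.lifetime = 1
    return await asyncio.gather(*(reverse_lookup(resolver, ip) for ip in ips))

ips = ["8.8.8.8", "1.1.1.1"]  # placeholder list
print(asyncio.run(main(ips)))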
I would love to make this program a lot faster. It reads about 12,000 pages in 10 minutes, and I was wondering if there is anything that would help the speed a lot. I hope you guys know some tips. I am supposed to read millions of pages, so that would take way too long :( Here is my code:
from eventlet.green import urllib2
import httplib
import time
import eventlet

# Create the URLS in groups of 400 (+- max for eventlet)
def web_CreateURLS():
    print str(str(time.asctime( time.localtime(time.time()) )).split(" ")[3])
    for var_indexURLS in xrange(0, 2000000, 400):
        var_URLS = []
        for var_indexCRAWL in xrange(var_indexURLS, var_indexURLS+400):
            var_URLS.append("http://www.nu.nl")
        web_ScanURLS(var_URLS)

# Return the HTML Source per URL
def web_ReturnHTML(url):
    try:
        return [urllib2.urlopen(url[0]).read(), url[1]]
    except urllib2.URLError:
        time.sleep(10)
        print "UrlError"
        web_ReturnHTML(url)

# Analyse the HTML Source
def web_ScanURLS(var_URLS):
    pool = eventlet.GreenPool()
    try:
        for var_HTML in pool.imap(web_ReturnHTML, var_URLS):
            pass  # do something etc..
    except TypeError: pass

web_CreateURLS()
I like using greenlets, but I often benefit from using multiple processes spread over lots of systems, or just one single system letting the OS take care of all the checks and balances of running multiple processes.
Check out ZeroMQ at http://zeromq.org/ for some good examples of how to make a dispatcher with a TON of listeners that do whatever the dispatcher says (a sketch follows at the end). Alternatively, check out execnet for a method of quickly getting started with executing remote or local tasks in parallel.
I also use http://spread.org/ a lot and have LOTS of systems listening to a common spread daemon; it's a very useful message bus where results can be pooled back to and dispatched from a single thread pretty easily.
And then of course there is always redis pub/sub or sync. :)
"Share the load"