Resolving and saving hostnames in parallel with Python

I'm trying to resolve a list of hostnames. The problem is that when I hit a nonexistent domain, it slows down the whole process. The code is a trivial for loop:
for domain in domains:
    try:
        if socket.gethostbyname(domain.split('#')[1]):
            file1.write(domain)
        else:
            file2.write(domain)
    except socket.gaierror:
        pass
I was wondering if there is a simple way to parallelize what is inside the for loop.

You could use one of the examples from gevent: dns_mass_resolve.py. There's also the useful possibility of setting a timeout for all queries.
from __future__ import with_statement
import sys
import gevent
from gevent import socket
from gevent.pool import Pool

N = 1000
# limit ourselves to max 10 simultaneous outstanding requests
pool = Pool(10)
finished = 0

def job(url):
    global finished
    try:
        try:
            ip = socket.gethostbyname(url)
            print ('%s = %s' % (url, ip))
        except socket.gaierror:
            ex = sys.exc_info()[1]
            print ('%s failed with %s' % (url, ex))
    finally:
        finished += 1

with gevent.Timeout(2, False):
    for x in xrange(10, 10 + N):
        pool.spawn(job, '%s.com' % x)
    pool.join()

print ('finished within 2 seconds: %s/%s' % (finished, N))

I don't know a simple solution. Using multiple threads/processes would be complicated and probably wouldn't help that much, because your execution speed is bound by I/O. Therefore I would have a look at an async library like Twisted. There is a resolve method in IReactorCore: http://twistedmatrix.com/documents/12.2.0/api/twisted.internet.interfaces.IReactorCore.html
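For illustration, here is a minimal sketch of that approach (not from the original answer), assuming a domains list like the one in the question; reactor.resolve returns a Deferred that fires with the IP address or errbacks if the lookup fails:
from twisted.internet import defer, reactor

def on_resolved(ip, domain):
    print('%s = %s' % (domain, ip))

def on_failed(failure, domain):
    print('%s failed: %s' % (domain, failure.getErrorMessage()))

def resolve_all(domains):
    deferreds = []
    for domain in domains:
        d = reactor.resolve(domain)  # non-blocking lookup, returns a Deferred
        d.addCallback(on_resolved, domain)
        d.addErrback(on_failed, domain)
        deferreds.append(d)
    # stop the reactor once every lookup has either succeeded or failed
    defer.DeferredList(deferreds).addBoth(lambda _: reactor.stop())

reactor.callWhenRunning(resolve_all, ['example.com', 'no-such-domain.invalid'])
reactor.run()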

import thread

def resolve_one_domain(domain):
    ...

for domain in domains:
    # start_new_thread expects the arguments as a tuple
    thread.start_new_thread(resolve_one_domain, (domain,))
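If you'd rather not manage raw threads yourself, here's a hedged sketch of the same idea with a bounded pool, using Python 3's concurrent.futures (not from the original answer); domains, file1 and file2 are as in the question, and unresolved domains are simply skipped, as in the original gaierror branch:
import socket
from concurrent.futures import ThreadPoolExecutor

def resolve_one_domain(domain):
    try:
        return domain, socket.gethostbyname(domain.split('#')[1])
    except socket.gaierror:
        return domain, None

with ThreadPoolExecutor(max_workers=20) as executor:
    for domain, ip in executor.map(resolve_one_domain, domains):
        if ip:
            file1.write(domain)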

Related

request.urlretrieve in multiprocessing Python gets stuck

I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using
...
import multiprocessing as mp

def getImages(val):
    # Download images
    try:
        url = # preprocess the url from the input val
        local = # filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
It often gets stuck halfway through the list (it prints DONE or CAN'T DOWNLOAD for about half of the list it has processed, but I don't know what is happening with the rest of them). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance.
OK, I have found an answer.
A possible culprit was that the script was getting stuck connecting to or downloading from the URL, so what I added was a socket timeout to limit the time spent connecting and downloading the image.
And now the issue no longer bothers me.
Here is my complete code:
...
import multiprocessing as mp
import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):
    # Download images
    try:
        url = # preprocess the url from the input val
        local = # filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
Hope this solution helps others who are facing the same issue
It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time.
The multiprocessing module really does launch separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all of these instances: each of them tries to lock the IO process; the one that succeeds (i.e. comes first) gets you the result, while the others (trying to lock an already-locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here: Multiprocessing useless with urllib2?
And more info about the GIL here: What is a global interpreter lock (GIL)?
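As a rough, hedged illustration of the thread-based alternative discussed in the first link (multiprocessing.dummy exposes the same Pool API backed by threads, which is often enough for I/O-bound downloads); getImages is the function defined in the question:
from multiprocessing.dummy import Pool as ThreadPool  # same Pool API, backed by threads

if __name__ == '__main__':
    lst = [l.strip() for l in open("urls.txt")]
    pool = ThreadPool(4)
    res = pool.map(getImages, lst)  # getImages as defined in the question above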

override exception handling in class function of imported module

I have a definition inside of a class which handles an exception in a manner I don't like.
The class itself is inside a module, which itself is called by a module that I import.
The error handling I don't like looks like this:
class BitSharesWebsocket(Events):
    # [snip]
    def run_forever(self):
        """ This method is used to run the websocket app continuously.
            It will execute callbacks as defined and try to stay
            connected with the provided APIs
        """
        cnt = 0
        while not self.run_event.is_set():
            cnt += 1
            self.url = next(self.urls)
            log.debug("Trying to connect to node %s" % self.url)
            try:
                # websocket.enableTrace(True)
                self.ws = websocket.WebSocketApp(
                    self.url,
                    on_message=self.on_message,
                    on_error=self.on_error,
                    on_close=self.on_close,
                    on_open=self.on_open
                )
                self.ws.run_forever()
            except websocket.WebSocketException as exc:
                if (self.num_retries >= 0 and cnt > self.num_retries):
                    raise NumRetriesReached()
                sleeptime = (cnt - 1) * 2 if cnt < 10 else 10
                if sleeptime:
                    log.warning(
                        "Lost connection to node during wsconnect(): %s (%d/%d) "
                        % (self.url, cnt, self.num_retries) +
                        "Retrying in %d seconds" % sleeptime
                    )
                    time.sleep(sleeptime)
I wish to preempt the exception here:
except websocket.WebSocketException as exc:
and handle it in my own way, namely to try a new address rather than trying the same address again and again.
I am presented with this exception when calling:
from bitshares.blockchain import Blockchain
from bitshares import BitShares

try:
    chain = Blockchain(bitshares_instance=BitShares(n))
except:
    print ('hello world')
    pass
When n is a bad/unresponsive websocket address, I never get the 'hello world' message because the module handles the exception before I do.
The module is hosted on GitHub here:
https://github.com/xeroc/python-bitshares/blob/9250544ca8eadf66de31c7f38fc37294c11f9548/bitsharesapi/websocket.py
I can do:
from bitsharesapi import websocket as ws
but I am not sure what to do with the module ws now that it is imported to preempt its exception handling, or if this is even the correct way to approach it.
I resolved my issue here:
chain = Blockchain(bitshares_instance=BitShares(n))
I was able to do:
chain = Blockchain(bitshares_instance=BitShares(n, num_retries=0))
I had previously tried this and assumed it wouldn't work:
chain = Blockchain(bitshares_instance=BitShares(n), num_retries=0)
(note the parenthesis placement)
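For completeness, a minimal sketch of the working call site (assuming the same caller-side handler as in the question): with num_retries=0 the client raises NumRetriesReached after the first failed connection attempt instead of retrying forever, so the except block finally runs.
from bitshares.blockchain import Blockchain
from bitshares import BitShares

try:
    chain = Blockchain(bitshares_instance=BitShares(n, num_retries=0))
except Exception:
    print('hello world')  # reached once the library stops retrying internally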

In python co-routine, two successive socket.recv() causes BlockingIOError

I have a piece of Python code to practice Python coroutines, as explained by A. Jesse Jiryu Davis.
Firstly, I define a coroutine named 'get' to get the content of some URL.
Then I define a Task class to iterate the coroutine to completion.
Then I create one Task object which opens one URL.
If I put two successive socket.recv() calls in the coroutine, I get the error message
'A non-blocking socket operation could not be completed immediately' on the second chunk = s.recv(1000) line.
But if I change all the yields to time.sleep(1) and call get() directly in the global context, the two successive s.recv(1000) calls cause no errors. Even more successive s.recv(1000) calls are OK.
After several days of searching and reading the Python documentation, I still have no idea why this is happening. I must have missed some Python gotchas, haven't I?
I'm using Python 3.6 to test. The code is as follows; I have deleted all the irrelevant code to keep it precise and relevant to the topic:
#! /usr/bin/python
import socket
import select
import time

selectors_read = []
selectors_write = []

class Task:
    def __init__(self, gen):
        self.gen = gen
        self.step()

    def step(self):
        try:
            next(self.gen)
        except StopIteration:
            return

def get():
    s = socket.socket()
    selectors_write.append(s.fileno())
    s.setblocking(False)
    try:
        s.connect(('www.baidu.com', 80))
    except:
        pass
    yield
    selectors_write.remove(s.fileno())
    print('[CO-ROUTINE] ', 'Send')
    selectors_read.append(s.fileno())
    s.send('GET /index.html HTTP/1.0\r\n\r\n'.encode())
    yield
    while True:
        chunk = s.recv(1000)
        chunk = s.recv(1000)
        if chunk:
            print('[CO-ROUTINE] received')
        else:
            selectors_read.remove(s.fileno())
            break
        # yield

task_temp = Task(get())

while True:
    for filenums in select.select(selectors_read, selectors_write, []):
        for fd in filenums:
            task_temp.step()
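For what it's worth, a hedged sketch of the receive loop with the commented-out yield restored: on a non-blocking socket, recv() raises the 'could not be completed immediately' error whenever no data has arrived yet, so the coroutine has to yield back to the select() loop before every recv() call rather than calling recv() twice in a row. This fragment would replace the tail of get() above:
    while True:
        chunk = s.recv(1000)
        if chunk:
            print('[CO-ROUTINE] received')
        else:
            selectors_read.remove(s.fileno())
            break
        yield  # wait until select() reports the socket readable again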

Python Multithreading Not Functioning

Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service (321cheese.com) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct result (121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115, etc., but never gives the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread

def getName(input):
    l = input.split(".com/")
    m = l[1]
    return m

def parseMessages():
    theFile = open('messages.html', 'r')
    theLines = theFile.readlines()
    theFile.close()
    theNewFile = open('new321.txt', 'w')
    for z in theLines:
        if "321cheese" in z:
            theNewFile.write(z)
    theNewFile.close()

def downloadImage(inputURL):
    urllib.urlretrieve(inputURL, "./grabNew/" + d)

parseMessages()

f = open('new321.txt', 'r')
lines = f.readlines()
f.close()

g = open('output.txt', 'w')

for x in lines:
    a = x.split("<a href=\"")
    b = a[1].split("\"")
    c = b[0]
    if ".png" in c:
        d = getName(c)
        g.write(c + "\n")
        thread.start_new_thread(downloadImage, (c,))
        ##downloadImage(c)

g.close()
There are multiple issues in your code.
The main issue is the use of the global name d in multiple threads. To fix it, pass the name explicitly as an argument to downloadImage().
The easy way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as futures) or multiprocessing.Pool:
#!/usr/bin/env python
import os
import urllib
from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit

download_dir = "grabNew"

def url2filename(url):
    return basename(unquote(urlsplit(url).path).decode('utf-8'))

def download_image(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e

def main():
    pool = Pool(processes=10)
    # get_urls() should yield the image URLs parsed from the message log
    for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
        pass # do something with the downloaded file or handle an error

if __name__ == "__main__":
    main()
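And a hedged sketch of the concurrent.futures variant mentioned above (pip install futures on Python 2); it reuses download_image() and the same URL source as the Pool example:
from concurrent.futures import ThreadPoolExecutor

def main():
    with ThreadPoolExecutor(max_workers=10) as executor:
        for (filename, headers), error in executor.map(download_image, get_urls()):
            pass # do something with the downloaded file or handle an error

if __name__ == "__main__":
    main()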
Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally... threads in Python are FAKE! Use the multiprocessing module if you want real parallelism. But since the images are probably all from the same server, if you open one hundred connections to it at the same time, its firewall will probably start dropping your connections.

Multi-threaded S3 download doesn't terminate

I'm using Python boto and threading to download many files from S3 rapidly. I use this several times in my program and it works great. However, there is one time when it doesn't work. In that step, I try to download 3,000 files on a 32-core machine (Amazon EC2 cc2.8xlarge).
The code below actually succeeds in downloading every file (except sometimes there is an httplib.IncompleteRead error that doesn't get fixed by the retries). However, only 10 or so of the 32 threads actually terminate and the program just hangs. I'm not sure why this is. All the files have been downloaded and all the threads should have exited. They do on other steps when I download fewer files. I've been reduced to downloading all these files with a single thread (which works, but is super slow). Any insights would be greatly appreciated!
from boto.ec2.connection import EC2Connection
from boto.s3.connection import S3Connection
from boto.s3.key import Key
from boto.exception import BotoClientError
from socket import error as socket_error
from httplib import IncompleteRead
import multiprocessing
from time import sleep
import os
import Queue
import threading

def download_to_dir(keys, dir):
    """
    Given a list of S3 keys and a local directory filepath,
    downloads the files corresponding to the keys to the local directory.
    Returns a list of filenames.
    """
    filenames = [None for k in keys]

    class DownloadThread(threading.Thread):
        def __init__(self, queue, dir):
            # call to the parent constructor
            threading.Thread.__init__(self)
            # create a connection to S3
            connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
            self.conn = connection
            self.dir = dir
            self.__queue = queue

        def run(self):
            while True:
                key_dict = self.__queue.get()
                print self, key_dict
                if key_dict is None:
                    print "DOWNLOAD THREAD FINISHED"
                    break
                elif key_dict == 'DONE': #last job for last worker
                    print "DOWNLOADING DONE"
                    break
                else: #still work to do!
                    index = key_dict.get('idx')
                    key = key_dict.get('key')
                    bucket_name = key.bucket.name
                    bucket = self.conn.get_bucket(bucket_name)
                    k = Key(bucket) #clone key to use new connection
                    k.key = key.key
                    filename = os.path.join(dir, k.key)
                    #make dirs if don't exist yet
                    try:
                        f_dirname = os.path.dirname(filename)
                        if not os.path.exists(f_dirname):
                            os.makedirs(f_dirname)
                    except OSError: #already written to
                        pass
                    #inspired by: http://code.google.com/p/s3funnel/source/browse/trunk/scripts/s3funnel?r=10
                    RETRIES = 5 #attempt at most 5 times
                    wait = 1
                    for i in xrange(RETRIES):
                        try:
                            k.get_contents_to_filename(filename)
                            break
                        except (IncompleteRead, socket_error, BotoClientError), e:
                            if i == RETRIES-1: #failed final attempt
                                raise Exception('FAILED TO DOWNLOAD %s, %s' % (k, e))
                                break
                            wait *= 2
                            sleep(wait)
                    #put filename in right spot!
                    filenames[index] = filename

    num_cores = multiprocessing.cpu_count()

    q = Queue.Queue(0)
    for i, k in enumerate(keys):
        q.put({'idx': i, 'key':k})
    for i in range(num_cores-1):
        q.put(None) # add end-of-queue markers
    q.put('DONE') #to signal absolute end of job

    #Spin up all the workers
    workers = [DownloadThread(q, dir) for i in range(num_cores)]
    for worker in workers:
        worker.start()

    #Block main thread until completion
    for worker in workers:
        worker.join()

    return filenames
Upgrade to AWS SDK version 1.4.4.0 or newer, or stick to exactly 2 threads. Older versions have a limit of at most 2 simultaneous connections. This means that your code will work well if you launch 2 threads; if you launch 3 or more, you are bound to see incomplete reads and exhausted timeouts.
You will see that while 2 threads can boost your throughput greatly, more than 2 does not change much because your network card is busy all the time anyway.
S3Connection uses httplib.py and that library is not thread-safe, so ensuring each thread has its own connection is critical. It looks like you are doing that.
Boto already has its own retry mechanism, but you are layering one on top of that to handle certain other errors. I wonder if it would be advisable to create a new S3Connection object inside the except block. It just seems like the underlying http connection could be in an unusual state at that point and it might be best to start with a fresh connection.
Just a thought.
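A hedged sketch of that idea, written as a self-contained helper rather than a patch to the run() method above (the credential constants, bucket name and key name follow the question; this is an illustration, not a pattern recommended by boto):
from time import sleep
from boto.s3.connection import S3Connection
from boto.s3.key import Key

def download_with_fresh_connections(bucket_name, key_name, filename, retries=5):
    wait = 1
    for attempt in range(retries):
        # rebuild the connection (and the key bound to it) before every attempt,
        # so a wedged httplib connection from a failed try is never reused
        conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
        k = Key(conn.get_bucket(bucket_name))
        k.key = key_name
        try:
            k.get_contents_to_filename(filename)
            return filename
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(wait)
            wait *= 2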
