threading.Thread getting stuck on calling httplib2.Http.request - python

The first few lines of the script explain the structure and the mechanism.
The problem I'm facing is that execution gets stuck at line 53. Once the Downloader acts on the first request it generates the API URL correctly; however, on reaching http_object.request(audioscrobbler_api) it gets stuck.
The script was coded and tested on another system, where it yielded the correct result.
I can confirm that the httplib2 package is not broken, as it functions properly when methods of that library (including request) are called from other scripts.
What is causing the script to get stuck?
Script:
#
# Album artwork downloading module for Encore Music Player application.
# Loosely based on the producer-consumer model devised by E W Djikstra.
#
# The Downloader class (implemented as a daemon thread) acts as the consumer
# in the system where it reads requests from the buffer and tries to fetch the
# artwork from ws.audioscrobbler.com (LastFM's web service portal).
#
# Requester class, the producer, is a standard thread class that places the request
# in the buffer when started.
#
# DBusRequester class provides an interface to the script and is made available on
# the session bus of the DBus daemon under the name of 'com.encore.AlbumArtDownloader'
# which enables the core music player to request downloads.
#
import threading, urllib, httplib2, md5, libxml2, os, dbus, dbus.service, signal
from collections import deque
from gi.repository import GObject
from dbus.mainloop.glib import DBusGMainLoop
requests = deque()
mutex = threading.Lock()
count = threading.Semaphore(0)
DBusGMainLoop(set_as_default = True)
class Downloader(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        while True:
            print "=> Downloader waiting for requests"
            count.acquire() # wait for new request if buffer is empty
            mutex.acquire() # enter critical section
            request = requests.popleft()
            mutex.release() # leave critical section
            (p, q) = request
            try:
                print "=> Generating api for %s by %s" % (p, q)
                params = urllib.urlencode({'method': 'album.getinfo', 'api_key': 'XXX', 'artist': p, 'album': q})
                audioscrobbler_api = "http://ws.audioscrobbler.com/2.0/?%s" % params
                print "=> Generated URL %s" % (audioscrobbler_api)
                http_object = httplib2.Http()
                print "=> Requesting response"
                resp, content = http_object.request(audioscrobbler_api)
                print "=> Received response"
                if not resp.status == 200:
                    print "Unable to fetch artwork for %s by %s" % (q, p)
                    continue # proceed to the next item in queue if request fails
                doc = libxml2.parseDoc(content)
                ctxt = doc.xpathNewContext()
                res = ctxt.xpathEval("//image[@size='medium']") # grab the element containing the link to a medium sized artwork
                if len(res) < 1:
                    continue # proceed to the next item in queue if the required image node is not found
                image_uri = res[0].content # extract uri from node
                wget_status = os.system("wget %s -q --tries 3 -O temp" % (image_uri))
                if not wget_status == 0:
                    continue # proceed to the next item in queue if download fails
                artwork_name = "%s.png" % (md5.md5("%s + %s" % (p, q)).hexdigest())
                os.system("convert temp -resize 64x64 %s" % artwork_name)
            except:
                pass # handle http request error

class Requester(threading.Thread):
    def __init__(self, request):
        self.request = request
        threading.Thread.__init__(self)

    def run(self):
        mutex.acquire() # enter critical section
        if not self.request in requests:
            requests.append(self.request)
            count.release() # signal downloader
        mutex.release() # leave critical section

class DBusRequester(dbus.service.Object):
    def __init__(self):
        bus_name = dbus.service.BusName('com.encore.AlbumArtDownloader', bus=dbus.SessionBus())
        dbus.service.Object.__init__(self, bus_name, '/com/encore/AlbumArtDownloader')

    @dbus.service.method('com.encore.AlbumArtDownloader')
    def queue_request(self, artist_name, album_name):
        request = (artist_name, album_name)
        requester = Requester(request)
        requester.start()

def sigint_handler(signum, frame):
    """Exit gracefully on receiving SIGINT."""
    loop.quit()

signal.signal(signal.SIGINT, sigint_handler)

downloader_daemon = Downloader()
downloader_daemon.daemon = True
downloader_daemon.start()

requester_service = DBusRequester()

loop = GObject.MainLoop()
loop.run()
On doing a Ctrl-C
=> Downloader waiting for requests
=> Generating api for paul van dyk by evolution
=> Generated URL http://ws.audioscrobbler.com/2.0/?album=evolution&api_key=XXXXXXXXXXXXXXXXXXXX&method=album.getinfo&artist=paul+van+dyk
=> Requesting response
^C
Thanks!!

When your script gets stuck at line 53, can you break the execution using Ctrl+C and show us the traceback Python gives?

The problem was caused by Python's Global Interpreter Lock (GIL). Calling
GObject.threads_init()
fixes the problem.
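A minimal sketch of where that call would go in the script above, assuming the rest of the module-level setup stays the same; the only point is that GObject.threads_init() runs before the worker thread and the main loop are started:
from gi.repository import GObject
from dbus.mainloop.glib import DBusGMainLoop

DBusGMainLoop(set_as_default = True)
GObject.threads_init()  # let GLib know Python threads will be running alongside the main loop

downloader_daemon = Downloader()
downloader_daemon.daemon = True
downloader_daemon.start()

requester_service = DBusRequester()

loop = GObject.MainLoop()
loop.run()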

Related

What is the recommended way of sharing a HTTP connection pool between threads with requests?

I'm working in an environment where web applications fork processes on demand and each process has its own thread pool to service web requests. The threads may need to issue HTTPS requests to outside services, and the requests library is currently used to do so. When requests usage was first added, it was used naively by creating a new requests.Session and requests.adapters.HTTPAdapter for each request, or even by simply calling requests.get or requests.post on demand. The problem that arises is that a new connection is established each time instead of potentially taking advantage of HTTP persistent connections. A potential fix would be to use a connection pool, but what is the recommended way of sharing a HTTP connection pool between threads when using the requests library? Is there one?
The first thought would be to share a single requests.Session, but that is currently not safe, as described in "Is the Session object from Python's Requests library thread safe?" and "Document threading contract for Session class". Is it safe and sufficient to have a single global requests.adapters.HTTPAdapter that is shared between requests.Session objects created on demand in each thread? According to "Our use of urllib3's ConnectionPools is not threadsafe.", even that may not be a valid use. Only needing to connect to a small number of distinct remote endpoints may make it a viable approach regardless.
I doubt there is an existing way to do this in requests, but you can modify my code to encapsulate a requests Session() instead of the standard urllib2.
This is the code I use when I want to get data from multiple sites at the same time:
# The following code I keep in a file named do.py
# It can be used to perform any number of otherwise blocking IO operations simultaneously.
# Results are available to you once all IO operations have completed.
# Completed means either the IO finished successfully or an exception was raised.
# So when all tasks are completed, you pick up the results.
# Usage example:
# >>> import do
# >>> results = do.simultaneously([
# ... (func1, func1_args, func1_kwargs),
# ... (func2, func2_args, func2_kwargs), ...])
# >>> for x in results:
# ... print x
# ...
from thread import start_new_thread as thread
from thread import allocate_lock
from collections import deque
from time import sleep
class Task:
    """A task's thread holder. Keeps the result or the exception raised.

    This could be a bit more robustly implemented using the
    threading module.
    """
    def __init__ (self, func, args, kwargs, pool):
        self.func = func
        self.args = args
        self.kwargs = kwargs
        self.result = None
        self.done = 0
        self.started = 0
        self.xraised = 0
        self.tasks = pool
        pool.append(self)
        self.allow = allocate_lock()
        self.run()

    def run (self):
        thread(self._run, ())

    def _run (self):
        self.allow.acquire() # Prevent same task from being started multiple times
        self.started = 1
        self.result = None
        self.done = 0
        self.xraised = 0
        try:
            self.result = self.func(*self.args, **self.kwargs)
        except Exception, e:
            e.task = self # Keep a reference to the task in the exception
                          # This way we can access the original task from a caught exception
            self.result = e
            self.xraised = 1
        self.done = 1
        self.allow.release()

    def wait (self):
        while not self.done:
            try: sleep(0.001)
            except: break

    def withdraw (self):
        if not self.started: self.run()
        if not self.done: self.wait()
        self.tasks.remove(self)
        return self.result

    def remove (self):
        self.tasks.remove(self)

def simultaneously (tasks, xraise=0):
    """Starts all functions within iterable <tasks>,
    then waits for all of them to finish.
    Iterable <tasks> may contain subiterables of the form:
        (function, [args[, kwargs]])
    or just functions; these will be called without arguments.
    Returns an iterator that yields the result of each called function.
    If an exception is raised within a task, the Exception instance will be
    returned unless <xraise> is 1 or True; in that case the first encountered
    exception within the results will be raised.
    Results start to yield only after all funcs() either return or raise an exception.
    """
    pool = deque()
    for x in tasks:
        func = lambda: None
        args = ()
        kwargs = {}
        if not isinstance(x, (tuple, list)):
            Task(x, args, kwargs, pool)
            continue
        l = len(x)
        if l: func = x[0]
        if l > 1:
            args = x[1]
            if not isinstance(args, (tuple, list)): args = (args,)
        if l > 2:
            if isinstance(x[2], dict):
                kwargs = x[2]
        Task(func, args, kwargs, pool)
    for t in pool: t.wait()
    while pool:
        t = pool.popleft()
        if xraise and t.xraised:
            raise t.result
        yield t.result
So, I do this using urllib2; you can do it using requests if you want.
from urllib2 import URLError, HTTPError, urlopen
import do

class AccessError(Exception):
    """Raised if the server rejects us because we bombarded the same server with multiple connections in too small time slots."""
    pass

def retrieve (url):
    try:
        u = urlopen(url)
        r = u.read()
        u.close()
        return r
    except HTTPError, e:
        msg = "HTTPError %i - %s" % (e.code, e.msg)
        t = AccessError()
        if e.code in (401, 403, 429):
            msg += " (perhaps you're making too many calls)"
            t.reason = "perhaps you are making too many calls"
        elif e.code in (502, 504):
            msg += " (service temporarily not available)"
            t.reason = "service temporarily not available"
        else: t.reason = e.msg
        t.args = (msg,)
        t.message = msg
        t.msg = e.msg; t.code = e.code
        t.orig = e
        raise t
    except URLError, e:
        msg = "URLError %s - %s (%s)" % (str(e.errno), str(e.message), str(e.reason))
        t = AccessError(msg)
        t.reason = str(e.reason)
        t.msg = str(t.message)
        t.code = e.errno
        t.orig = e
        raise t
    except: raise
urls = ["http://www.google.com", "http://www.amazon.com", "http://stackoverflow.com", "http://blah.blah.sniff-sniff"]
retrieval = []
for u in urls:
    retrieval.append((retrieve, u))
x = 0
for data in do.simultaneously(retrieval):
    url = urls[x]
    if isinstance(data, Exception):
        print url, "not retrieved successfully!\nThe error is:"
        print data
    else:
        print url, "returned", len(data), "characters!!\nFirst 100:"
        print data[:100]
    x += 1
If you need persistent HTTP, tweak the retrieve() function so it can hold a connection open. After retrieving the currently requested data, save the opened connection in a global dict() with domains as keys. When the next retrieve() is called and the domain already has an opened connection, remove the connection from the dict (to prevent any other retrieve grabbing it mid-use), then use it to send the new request if possible (i.e. if it hasn't timed out or similar); if the connection broke, just open a new one.
You will probably have to introduce some limits if you will be using multiple connections to the same server at once, such as no more than 4 connections to the same server at a time and some delay between requests.
No matter which approach to multithreading you choose (something like I propose or some other mechanism), thread safety is tricky because HTTP is a serialized protocol: you send a request, then you await the answer. Only after you receive the whole answer can you make a new request on the same connection (if HTTP/1.1 is used and the connection is being kept alive). If your thread tries to send a new request while data is still downloading, a general mess will occur.
So you design your system to open as many connections as needed, but always wait for one to be free before reusing it. That's the trick here.
As for any other parts of requests being thread-unsafe for some reason, you should check the code to see which calls exactly should be kept atomic and then use a lock. But don't put the lock around a block where major IO is occurring, or it will be as if you aren't using threads at all.
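A rough sketch of that per-domain connection cache, using httplib directly; the names (_pool, get_connection, retrieve_persistent) are illustrative only, and a real version would also need per-domain limits and timeout handling as described above:
from urlparse import urlparse
from thread import allocate_lock
import httplib

_pool = {}                    # host -> an httplib.HTTPConnection kept alive after the last request
_pool_lock = allocate_lock()

def get_connection(host):
    """Take the cached connection for <host> out of the pool, or open a new one."""
    _pool_lock.acquire()
    conn = _pool.pop(host, None)   # remove it so no other thread grabs it mid-request
    _pool_lock.release()
    return conn or httplib.HTTPConnection(host)

def put_connection(host, conn):
    """Put a still-usable connection back into the pool."""
    _pool_lock.acquire()
    _pool[host] = conn
    _pool_lock.release()

def retrieve_persistent(url):
    parsed = urlparse(url)
    path = parsed.path or "/"
    if parsed.query:
        path += "?" + parsed.query
    conn = get_connection(parsed.netloc)
    try:
        conn.request("GET", path)
        data = conn.getresponse().read()
    except (httplib.HTTPException, IOError):
        # The kept-alive connection broke or timed out: open a fresh one and retry once.
        conn = httplib.HTTPConnection(parsed.netloc)
        conn.request("GET", path)
        data = conn.getresponse().read()
    put_connection(parsed.netloc, conn)
    return data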

Random freezing / hanging in Python ZeroMQ

I am writing a broker-less, balanced, client-worker service written in python with ZeroMQ.
The clients acquire a worker's address, establish a connection (zmq.REQ / zmq.REP), send a single request, receive a single response and then disconnect.
I have chosen a broker-less architecture because the amount of a data that needs to get transferred between the clients and workers is relatively large, despite there only being a single REQ/REP pair per connection, and using a broker as a 'middle man' would create a bottleneck.
While testing the system, I noticed that the communication between the clients and workers was halting randomly, sometimes resuming only after a couple of seconds (often several minutes).
I narrowed down the issue to the .connect() / .disconnect() of clients to workers.
I have written two small python scripts that reproduce the bug.
import zmq

class Site:
    def __init__(self):
        ctx = zmq.Context()
        self.pair_socket = ctx.socket(zmq.REQ)
        self.num = 0

    def __del__(self):
        print "closed"

    def run_site(self):
        print "running..."
        while True:
            self.pair_socket.connect('tcp://127.0.0.1:5555')
            print 'connected'
            self.pair_socket.send_pyobj(self.num)
            print 'sent', self.num
            print self.pair_socket.recv_pyobj()
            self.pair_socket.disconnect('tcp://127.0.0.1:5555')
            print 'disconnected'
            self.num += 1

s = Site()
s.run_site()
and
import zmq

class Server:
    def __init__(self):
        ctx = zmq.Context()
        self.pair_socket = ctx.socket(zmq.REP)
        self.pair_socket.bind('tcp://127.0.0.1:5555')

    def __del__(self):
        print " closed"

    def run_server(self):
        print "running..."
        while True:
            x = self.pair_socket.recv_pyobj()
            print x
            self.pair_socket.send_pyobj(x)

s = Server()
s.run_server()
I don't think the issue is related to memory or gc, as I have tried disabling gc without much effect.
I have tried using zmq.LINGER as described here: Zeromq with python hangs if connecting to invalid socket
What could cause these randoms freezes?
The REP socket is synchronous by definition, so your server can only serve one request at a time; the rest will just fill up the buffer and get lost at some point.
To fix the root cause, you need to use a ROUTER socket instead.
class Server:
    def __init__(self):
        ctx = zmq.Context()
        self.pair_socket = ctx.socket(zmq.ROUTER)
        self.pair_socket.bind('tcp://127.0.0.1:5555')
        self.poller = zmq.Poller()
        self.poller.register(self.pair_socket, zmq.POLLIN)

    def __del__(self):
        print " closed"

    def run_server(self):
        print "running..."
        while True:
            try:
                items = dict(self.poller.poll())
            except KeyboardInterrupt:
                break
            if self.pair_socket in items:
                x = self.pair_socket.recv_multipart()
                print x
                self.pair_socket.send_multipart(x)
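On the client side, a receive timeout plus LINGER can turn a lost reply into an error instead of an indefinite freeze. A minimal sketch, assuming a reasonably recent pyzmq that exposes zmq.RCVTIMEO:
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.setsockopt(zmq.RCVTIMEO, 5000)   # give up waiting for a reply after 5 seconds
sock.setsockopt(zmq.LINGER, 0)        # don't block on close because of unsent messages
sock.connect('tcp://127.0.0.1:5555')

sock.send_pyobj(42)
try:
    print sock.recv_pyobj()
except zmq.ZMQError:
    # No reply arrived in time; a REQ socket is now stuck expecting a reply,
    # so close it and create a new one before retrying.
    sock.close()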

Thread polling sqs and adding it to a python queue for processing dies

I have a piece of multi-threaded code: 3 threads that poll data from SQS and add it to a python queue, and 5 threads that take the messages from the python queue, process them and send them to a back end system.
Here is the code:
python_queue = Queue.Queue()

class GetDataFromSQS(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, python_queue):
        threading.Thread.__init__(self)
        self.python_queue = python_queue

    def run(self):
        while True:
            time.sleep(0.5)  # sleep for a few secs before querying again
            try:
                msgs = sqs_queue.get_messages(10)
                if msgs == None:
                    print "sqs is empty now!"
                for msg in msgs:
                    # place each message block from sqs into python queue for processing
                    self.python_queue.put(msg)
                    print "Adding a new message to Queue. Queue size is now %d" % self.python_queue.qsize()
                    # delete from sqs
                    sqs_queue.delete_message(msg)
            except Exception as e:
                print "Exception in GetDataFromSQS :: " + str(e)

class ProcessSQSMsgs(threading.Thread):
    def __init__(self, python_queue):
        threading.Thread.__init__(self)
        self.python_queue = python_queue
        self.pool_manager = PoolManager(num_pools=6)

    def run(self):
        while True:
            # grabs the message to be parsed from the python queue
            python_queue_msg = self.python_queue.get()
            try:
                processMsgAndSendToBackend(python_queue_msg, self.pool_manager)
            except Exception as e:
                print "Error parsing:: " + str(e)
            finally:
                self.python_queue.task_done()

def processMsgAndSendToBackend(msg, pool_manager):
    if msg != "":
        ###### All the code related to processing the msg
        for individualValue in processedMsg:
            try:
                response = pool_manager.urlopen('POST', backend_endpoint, body=individualValue)
                if response == None:
                    print "Error"
                else:
                    response.release_conn()
            except Exception as e:
                print "Exception! Post data to backend: " + str(e)

def startMyPython():
    # spawn a pool of threads, and pass them the queue instance
    for i in range(3):
        sqsThread = GetDataFromSQS(python_queue)
        sqsThread.start()
    for j in range(5):
        parseThread = ProcessSQSMsgs(python_queue)
        #parseThread.setDaemon(True)
        parseThread.start()
    # wait on the queue until everything has been processed
    python_queue.join()
    # python_queue.close() -- should i do this?

startMyPython()
The problem:
3 of the python workers die randomly (monitored using top -p -H) once every few days, and everything is alright if I kill the process and start the script again. I suspect the workers that vanish are the 3 GetDataFromSQS threads. And because GetDataFromSQS dies, the other 5 workers, although still running, always sleep as there is no data in the python queue. I am not sure what I am doing wrong here, as I am pretty new to python; I followed this tutorial for creating the queuing logic and the threads - http://www.ibm.com/developerworks/aix/library/au-threadingpython/
Thanks in advance for your help. Hope I have explained my problem clearly.
The problem with the thread hanging was related to getting a handle to the SQS queue. I used IAM for managing credentials and the boto SDK for connecting to SQS.
The root cause of this issue was that the boto package was reading the auth metadata from AWS and it was failing once in a while.
The fix is to edit the boto config, increasing the number of attempts made for the auth call to AWS.
[Boto]
metadata_service_num_attempts = 5
( https://groups.google.com/forum/#!topic/boto-users/1yX24WG3g1E )
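For reference, a minimal sketch of where that setting lives; this assumes instance-role (IAM) credentials, so only the [Boto] section is needed (boto reads /etc/boto.cfg and ~/.boto):
# ~/.boto  (or /etc/boto.cfg)
[Boto]
metadata_service_num_attempts = 5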

twisted client / server communication woes

I'm trying to make a simple distributed job client/server system in Twisted. Basically the steps are:
Start up the JobServer with a few jobs and associated files
Start up JobClient instances, they connect to JobServer and ask for Jobs
Server gives JobClient job and sends serialized JSON over TCP
After perhaps a lot of computation, the JobClient sends back a result and waits for new job
Rinse and repeat
But I'm having trouble debugging my protocol on a local machine.
JobServer.py
from twisted.application import internet, service
from twisted.internet import reactor, protocol, defer
from twisted.protocols import basic
from twisted.protocols.basic import Int32StringReceiver
from twisted.web import client
import random
import json
import base64
from logger import JobLogger
class JobServerProtocol(Int32StringReceiver):
    log = JobLogger("server.log")

    def connectionMade(self):
        self.log.write("Connected to client")
        self.sendJob(None)

    def stringReceived(self, msg):
        self.log.write("Recieved job from client: %s" % msg)
        self.sendJob(msg)

    def sendJob(self, msg):
        d = self.factory.getJob(msg)
        def onError(err):
            self.transport.write("Internal server error")
        d.addErrback(onError)
        def sendString(newjob_dict):
            encoded_str = json.dumps(newjob_dict)
            self.transport.write(encoded_str)
            self.log.write("Sending job to client: %s" % encoded_str)
        d.addCallback(sendString)

    def lengthLimitExceeded(self, msg):
        self.transport.loseConnection()

class JobServerFactory(protocol.ServerFactory):
    protocol = JobServerProtocol

    def __init__(self, jobs, files):
        assert len(jobs) == len(files)
        self.jobs = jobs
        self.files = files
        self.results = []

    def getJob(self, msg):
        # on startup the client will not have a message to send
        if msg:
            # recreate pickled msg
            msg_dict = json.loads(msg)
            self.results.append((msg_dict['result'], msg_dict['jidx']))
        # if we're all done, let the client know
        if len(self.jobs) == 0:
            job = None
            jidx = -1
            encoded = ""
        else:
            # get new job for client to process
            jidx = random.randint(0, len(self.jobs) - 1)
            job = self.jobs[jidx]
            del self.jobs[jidx]
            # get file
            with open(self.files[jidx], 'r') as f:
                filecontents = f.read()
            encoded = base64.b64encode(filecontents)
        # create dict object to send to client
        response_msg = {
            "job" : job,
            "index" : jidx,
            "file" : encoded
        }
        return defer.succeed(response_msg)

# args for factory
files = ['test.txt', 'test.txt', 'test.txt']
jobs = ["4*4-5", "2**2-5", "2/9*2/3"]

application = service.Application('jobservice')
factory = JobServerFactory(jobs=jobs, files=files)
internet.TCPServer(12345, factory).setServiceParent(
    service.IServiceCollection(application))
JobClient.py
from twisted.internet import reactor, protocol
from twisted.protocols.basic import Int32StringReceiver
import json
import time
from logger import JobLogger
class JobClientProtocol(Int32StringReceiver):
    log = JobLogger("client.log")

    def stringReceived(self, msg):
        # unpack job from server
        server_msg_dict = json.loads(msg)
        job = server_msg_dict["job"]
        index = server_msg_dict["index"]
        filestring = server_msg_dict["file"]
        if index == -1:
            # we're done with all tasks
            self.transport.loseConnection()
        self.log.write("Recieved job %d from server with file '%s'" % (index, filestring))
        # do something with the file / job from the server...
        time.sleep(5)
        result = { "a" : 1, "b" : 2, "c" : 3}
        result_msg = { "result" : result, "jidx" : index }
        self.log.write("Completed job %d from server with result '%s'" % (index, result))
        # serialize and tell server
        result_str = json.dumps(result_msg)
        self.transport.write(encoded_str)

    def lengthLimitExceeded(self, msg):
        self.transport.loseConnection()

class JobClientFactory(protocol.ClientFactory):
    def buildProtocol(self, addr):
        p = JobClientProtocol()
        p.factory = self
        return p

reactor.connectTCP("127.0.0.1", 12345, JobClientFactory())
reactor.run()
logger.py
class JobLogger(object):
    def __init__(self, filename):
        self.log = open(filename, 'a')

    def write(self, string):
        self.log.write("%s\n" % string)

    def close(self):
        self.log.close()
Running, testing locally with only one client:
$ twistd -y JobServer.py -l ./jobserver.log --pidfile=./jobserver.pid
$ python JobClient.py
Problems I'm having:
The client and server .log files don't get written to reliably - sometimes not until after I kill the process.
The protocol gets stuck after the client connects and the server sends back a message. The message seemingly never gets to the client.
In general, I hope these protocols ensure that operations on either side can take any amount of time, but perhaps I didn't design that correctly.
The client and server .log files don't get written to reliably - sometimes not until after I kill the process.
If you want bytes to appear on disk in a timely manner, you may need to call flush on your file object.
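For instance, the JobLogger above could flush on every write; a small sketch (flushing each line trades a little performance for prompt output):
class JobLogger(object):
    def __init__(self, filename):
        self.log = open(filename, 'a')

    def write(self, string):
        self.log.write("%s\n" % string)
        self.log.flush()   # push the line to disk now instead of waiting for the buffer to fill

    def close(self):
        self.log.close()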
The protocol gets stuck after the client connects and the server sends back a message. The message seemingly never gets to the client.
The server doesn't send int32 strings to the client: it calls transport.write directly. The client gets confused because these end up looking like extremely long int32 strings. For example, the first four bytes of "Internal server error" decode as the integer 1702129225 so if there is an error on the server and these bytes are sent to the client, the client will wait for roughly 2GB of data before proceeding.
Use Int32StringReceiver.sendString instead.
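Applied to the server's sendJob above, that would look roughly like this (a sketch; the client's reply and the error path need the same treatment, and the inner callback is renamed only to avoid confusion with the protocol method):
    def sendJob(self, msg):
        d = self.factory.getJob(msg)
        def onError(err):
            self.sendString("Internal server error")   # length-prefixed, so the client can frame it
        d.addErrback(onError)
        def sendJobString(newjob_dict):
            encoded_str = json.dumps(newjob_dict)
            self.sendString(encoded_str)               # Int32StringReceiver.sendString adds the int32 prefix
            self.log.write("Sending job to client: %s" % encoded_str)
        d.addCallback(sendJobString)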

How to call different URLs in Python for load testing?

I have code that performs load testing against any specific URL, but I have to load test a web service that has different URLs. To do so, I need to make an array of URLs, and each thread should hit all the URLs given in that array. How can I do this? This is my code:
import httplib2
import socket
import time
from threading import Event
from threading import Thread
from threading import current_thread
from urllib import urlencode

# Modify these values to control how the testing is done
# How many threads should be running at peak load.
NUM_THREADS = 50
# How many minutes the test should run with all threads active.
TIME_AT_PEAK_QPS = 20 # minutes
# How many seconds to wait between starting threads.
# Shouldn't be set below 30 seconds.
DELAY_BETWEEN_THREAD_START = 30 # seconds

quitevent = Event()

def threadproc():
    """This function is executed by each thread."""
    print "Thread started: %s" % current_thread().getName()
    h = httplib2.Http(timeout=30)
    while not quitevent.is_set():
        try:
            # HTTP requests to exercise the server go here
            # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
            resp, content = h.request("http://www.google.com")
            if resp.status != 200:
                print "Response not OK"
            # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        except socket.timeout:
            pass
    print "Thread finished: %s" % current_thread().getName()

if __name__ == "__main__":
    runtime = (TIME_AT_PEAK_QPS * 60 + DELAY_BETWEEN_THREAD_START * NUM_THREADS)
    print "Total runtime will be: %d seconds" % runtime
    threads = []
    try:
        for i in range(NUM_THREADS):
            t = Thread(target=threadproc)
            t.start()
            threads.append(t)
            time.sleep(DELAY_BETWEEN_THREAD_START)
        print "All threads running"
        time.sleep(TIME_AT_PEAK_QPS*60)
        print "Completed full time at peak qps, shutting down threads"
    except:
        print "Exception raised, shutting down threads"
    quitevent.set()
    time.sleep(3)
    for t in threads:
        t.join(1.0)
    print "Finished"
Instead of passing a threadproc to Thread, extend the class:
class Worker(Thread):
    def __init__(self, urls):
        super(Worker, self).__init__()
        self.urls = urls

    def run(self):
        for url in self.urls:
            self.fetch(url)
That said, unless you are doing this to get a better understanding of threading and of how load testing works internally, I suggest using a mature load-testing tool like JMeter instead. Years of experience went into it that you would otherwise have to accumulate first.
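A sketch of how the pieces could fit together with the code from the question; the fetch method, the quitevent loop and the example URL list are assumptions added here, not part of the answer above:
class Worker(Thread):
    def __init__(self, urls):
        super(Worker, self).__init__()
        self.urls = urls
        self.http = httplib2.Http(timeout=30)

    def fetch(self, url):
        try:
            resp, content = self.http.request(url)
            if resp.status != 200:
                print "Response not OK for %s" % url
        except socket.timeout:
            pass

    def run(self):
        # each thread keeps cycling through every URL until the main thread signals shutdown
        while not quitevent.is_set():
            for url in self.urls:
                self.fetch(url)

# usage: every thread hits all the URLs in the list
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
threads = [Worker(urls) for i in range(NUM_THREADS)]
for t in threads:
    t.start()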
