Python multiprocess Process never terminates - python

My routine below takes a list of urllib2.Requests and spawns a new process per request and fires them off. The purpose is for asynchronous speed, so it's all fire-and-forget (no response needed). The issue is that the processes spawned in the code below never terminate. So after a few of these the box wilL OOM. Context: Django web app. Any help?
MP_CONCURRENT = int(multiprocessing.cpu_count()) * 2
if MP_CONCURRENT < 2: MP_CONCURRENT = 2
MPQ = multiprocessing.JoinableQueue(MP_CONCURRENT)
def request_manager(req_list):
try:
# put request list in the queue
for req in req_list:
MPQ.put(req)
# call processes on queue
worker = multiprocessing.Process(target=process_request, args=(MPQ,))
worker.daemon = True
worker.start()
# move on after queue is empty
MPQ.join()
except Exception, e:
logging.error(traceback.print_exc())
# prcoess requests in queue
def process_request(MPQ):
try:
while True:
req = MPQ.get()
dr = urllib2.urlopen(req)
MPQ.task_done()
except Exception, e:
logging.error(traceback.print_exc())

Maybe i am not right, but
MP_CONCURRENT = int(multiprocessing.cpu_count()) * 2
if MP_CONCURRENT < 2: MP_CONCURRENT = 2
MPQ = multiprocessing.JoinableQueue(MP_CONCURRENT)
def request_manager(req_list):
try:
# put request list in the queue
pool=[]
for req in req_list:
MPQ.put(req)
# call processes on queue
worker = multiprocessing.Process(target=process_request, args=(MPQ,))
worker.daemon = True
worker.start()
pool.append(worker)
# move on after queue is empty
MPQ.join()
# Close not needed processes
for p in pool: p.terminate()
except Exception, e:
logging.error(traceback.print_exc())
# prcoess requests in queue
def process_request(MPQ):
try:
while True:
req = MPQ.get()
dr = urllib2.urlopen(req)
MPQ.task_done()
except Exception, e:
logging.error(traceback.print_exc())

MP_CONCURRENT = int(multiprocessing.cpu_count()) * 2
if MP_CONCURRENT < 2: MP_CONCURRENT = 2
MPQ = multiprocessing.JoinableQueue(MP_CONCURRENT)
CHUNK_SIZE = 20 #number of requests sended to one process.
pool = multiprocessing.Pool(MP_CONCURRENT)
def request_manager(req_list):
try:
# put request list in the queue
responce=pool.map(process_request,req_list,CHUNK_SIZE) # function exits after all requests called and pool work ended
# OR
responce=pool.map_async(process_request,req_list,CHUNK_SIZE) #function request_manager exits after all requests passed to pool
except Exception, e:
logging.error(traceback.print_exc())
# prcoess requests in queue
def process_request(req):
dr = urllib2.urlopen(req)
This works ~5-10x faster then your code

Integrate side "brocker" to django (such as rabbitmq or something like it).

Ok after some fiddling (and a good night's sleep) I believe I've figured out the problem (and thank you Eri, you were the inspiration I needed). The main issue of the zombie processes was that I was not signaling back that the process was finished (and killing it) both of which I (naively) thought was happening automagically with multiprocess.
The code that worked:
# function that will be run through the pool
def process_request(req):
try:
dr = urllib2.urlopen(req, timeout=30)
except Exception, e:
logging.error(traceback.print_exc())
# process killer
def sig_end(r):
sys.exit()
# globals
MP_CONCURRENT = int(multiprocessing.cpu_count()) * 2
if MP_CONCURRENT < 2: MP_CONCURRENT = 2
CHUNK_SIZE = 20
POOL = multiprocessing.Pool(MP_CONCURRENT)
# pool initiator
def request_manager(req_list):
try:
resp = POOL.map_async(process_request, req_list, CHUNK_SIZE, callback=sig_end)
except Exception, e:
logging.error(traceback.print_exc())
A couple of notes:
1) The function that will be hit by "map_async" ("process_request" in this example) must be defined first (and before the global declarations).
2) There is probably a more graceful way to exit the process (suggestions welcome).
3) Using pool in this example really was best (thanks again Eri) due to the "callback" feature which allows me to throw a signal right away.

Related

Threading in python and Queue leads to memory leak or memory error

I had created simple code using multithreading in python with Queue. I have a main thread which keep on adding the Data in the Queue(Queue maxsize is 2000) and there will be 5 different threads who will take out from the Queue and publish into redis at some specific channel.
The code is working perfectly fine , but after 5 or 6 hours , the publish mechanism becomes slow.
As the threads which is used to remove the data from Queue becomes slow, and started throwing the buffer over flow error, as the Queue size reaches to maxsize. the speed of adding the data to queue is same which was in beginning.
The issue occurs differently on different configuration Linux system. How to identify what kind of error it is throwing? How to debug the problem.
As Stated - the code is very simple , where main thread is required to add the data in the Queue one by one and the 5 other threads can take the data out of the Queue one by one.
Sharing The Code
import redis
import logging
import sys
logging.basicConfig(level = logging.DEBUG)
redisPub = redis.StrictRedis(host='127.0.0.1', port=6379)
def main():
try:
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_UDP)
recv_sock.bind(("", 8000))
recv_sock.setblocking(0)
recv_sock.settimeout(5)
except Exception,e:
exc_type, exc_obj, exc_tb = sys.exc_info()
logging.error("Socket Connection Unsuccessful")
print "Program halted."
sys.exit()
sendThEvent = threading.Event()
sendThEvent.clear()
sendPktQ = Queue.Queue(maxsize=2000)
for i in range(0,5):
thSend = threading.Thread(name = 'sendThread', target = sendThread , args = (sendPktQ,sendThEvent))
thSend.setDaemon(True)
thSend.start()
packetsCounting = 0
while True:
try:
recvData, recvAddr = recv_sock.recvfrom(2048)
sendPktQ.put(recvData)
packetsCounting += 1
else:
logging.error("There is some error")
except socket.timeout:
continue
except Exception,e:
exc_type, exc_obj, exc_tb = sys.exc_info()
logging.error(e)
def sendThread(sendPktQ,listenEvent):
if sendPktQ == None:
pass
else:
while True:
instanceSentCnt += 1
if sendPktQ.qsize() < 1:
event_is_set = listenEvent.wait(0)
packetDict = sendPktQ.get()
data = redisPub.publish('chnlName', packetToSend)
logging.info("packet size reached to ============================================ %s ------------ %s"%(len(packetToSend),data))
if __name__ == "__main__":
main()
Anyone Inputs will be highly appreciated.
I see a sendPktQ.get(), but no sendPktQ.task_done(), for example:
async def _dispatch_packet(self, destination=None) -> None:
"""Send a command unless in listen_only mode."""
if not self.command_queue.empty():
cmd = self.command_queue.get()
if not (destination is None or self.config.get("listen_only")):
destination.write(bytearray(f"{cmd}\r\n".encode("ascii")))
await asyncio.sleep(0.05)
self.command_queue.task_done()
See queue library docs

Python Multi-threaded App does not terminate

This my code which basically just takes a list of 94,000+ URLs, and collects the http_status codes for them:
#!/usr/bin/python3
import threading
from queue import Queue
import urllib.request
import urllib.parse
from http.client import HTTPConnection
import socket
import http.client
#import httplib
url_input = open("urls_prod_sort.txt", "r").read()
urls = url_input[:url_input.rfind('\n')].split('\n')
#urls = urls[:100]
url_502 = []
url_logs = []
url_502_lock = threading.Lock()
print_lock = threading.Lock()
def sendRequest(url_u, http_method = 'GET', data = None):
use_proxy = "http://xxxxxxxx:8080"
proxies = {"http": use_proxy}
proxy = urllib.request.ProxyHandler(proxies)
handler = urllib.request.HTTPHandler()
url = "http://" + url_u
with print_lock:
print(url)
opener = urllib.request.build_opener(proxy,handler)
urllib.request.install_opener(opener)
request = urllib.request.Request(url,data)
request.add_header("User-agent","| MSIE |")
request.get_method = lambda: http_method
try:
response = urllib.request.urlopen(request)
response_code = response.code
except urllib.error.HTTPError as error:
response_code = error.code
except urllib.error.URLError as e2:
response_code = 701
except socket.timeout as e3:
response_code = 702
except socket.error as e4:
response_code = 703
except http.client.IncompleteRead as e:
response_code = 700
if response_code == 502:
with url_502_lock:
#url_502.append(url)
url_502_file = open("url_502_file.txt", "a")
url_502_file.write(url + "\n")
url_502_file.close()
with print_lock:
#url_logs.append(url + "," + str(response_code))
url_all_logs_file = open("url_all_logs.csv", "a")
url_all_logs_file.write(url + "," + str(response_code) + '\n')
url_all_logs_file.close()
#print (url + "," + str(response_code))
#print (response_code)
return response_code
def worker():
while True:
url = q.get()
if url == ":::::"
break
else:
sendRequest(url)
q.task_done()
#======================================
q = Queue()
for threads in range(1000):
t = threading.Thread(target = worker)
t.daemon = True
t.start()
for url in urls:
q.put(url)
q.put(":::::")
q.join()
However, the program never seems to terminate (even tho the URLs have all been iteratred through) which forces me to ctrl-c the program - and then I get the following error:
Traceback (most recent call last):
File "./url_sc_checker.py", line 120, in <module>
q.join()
File "/usr/lib/python3.2/queue.py", line 82, in join
self.all_tasks_done.wait()
File "/usr/lib/python3.2/threading.py", line 235, in wait
waiter.acquire()
KeyboardInterrupt
The reason that your program doesn't terminate is simple, your worker creates an infinite loop:
def worker():
while True:
...
You need to either throw an exception, break, or have a terminating condition in your while statement. Otherwise your program would remain trying to get the next job from the queue, without knowing that there will never be the next job.
A common way to do this is to put a sentinel value in your queue, when checking out a job from the queue, the worker checks if it is the sentinel value and breaks out the loop.
Another way is to have a global condition variable that you check in the while condition. When the job producer have pushed all items to the queue, the job producer joins the queue, and when all jobs are done, the job producer unblocks and terminates the threads our processes.
Another possible reason why your process doesn't terminate is if your sendRequest produces an unexpected exception, then the thread terminates and you'll be left with some jobs that are never marked as done.

python-requests with multithreading

I am working on creating a HTTP client which can generate hundreds of connections each second and send up to 10 requests on each of those connections. I am using threading so concurrency can be achieved.
Here is my code:
def generate_req(reqSession):
requestCounter = 0
while requestCounter < requestRate:
try:
response1 = reqSession.get('http://20.20.1.2/tempurl.html')
if response1.status_code == 200:
client_notify('r')
except(exceptions.ConnectionError, exceptions.HTTPError, exceptions.Timeout) as Err:
client_notify('F')
break
requestCounter += 1
def main():
for q in range(connectionPerSec):
s1 = requests.session()
t1 = threading.Thread(target=generate_req, args=(s1,))
t1.start()
Issues:
It is not scaling above 200 connections/sec with requestRate = 1. I ran other available HTTP clients on the same client machine and against the server, test runs fine and it is able to scale.
When requestRate = 10, connections/sec drops to 30.
Reason: Not able to create targeted number of threads every second.
For issue #2, client machine is not able to create enough request sessions and start new threads. As soon as requestRate is set to more than 1, things start to fall apart.
I am suspecting it has something to do with HTTP connection pooling which requests uses.
Please suggest what am I doing wrong here.
I wasn't able to get things to fall apart, however the following code has some new features:
1) extended logging, including specific per-thread information
2) all threads join()ed at the end to make sure the parent process doesntt leave them hanging
3) multithreaded print tends to interleave the messages, which can be unwieldy. This version uses yield so a future version can accept the messages and print them clearly.
source
import exceptions, requests, threading, time
requestRate = 1
connectionPerSec = 2
def client_notify(msg):
return time.time(), threading.current_thread().name, msg
def generate_req(reqSession):
requestCounter = 0
while requestCounter < requestRate:
try:
response1 = reqSession.get('http://127.0.0.1/')
if response1.status_code == 200:
print client_notify('r')
except (exceptions.ConnectionError, exceptions.HTTPError, exceptions.Timeout):
print client_notify('F')
break
requestCounter += 1
def main():
for cnum in range(connectionPerSec):
s1 = requests.session()
th = threading.Thread(
target=generate_req, args=(s1,),
name='thread-{:03d}'.format(cnum),
)
th.start()
for th in threading.enumerate():
if th != threading.current_thread():
th.join()
if __name__=='__main__':
main()
output
(1407275951.954147, 'thread-000', 'r')
(1407275951.95479, 'thread-001', 'r')

Python, send a stop notification to a blocking loop within a thread

I've read many answers, however I have not found a proper solution.
The problem, I'm reading mixed/replace HTTP streams that will not expire or end by default.
You can try it by yourself using curl:
curl http://agent.mtconnect.org/sample\?interval\=0
So, now I'm using Python threads and requests to read data from multiple streams.
import requests
import uuid
from threading import Thread
tasks = ['http://agent.mtconnect.org/sample?interval=5000',
'http://agent.mtconnect.org/sample?interval=10000']
thread_id = []
def http_handler(thread_id, url, flag):
print 'Starting task %s' % thread_id
try:
requests_stream = requests.get(url, stream=True, timeout=2)
for line in requests_stream.iter_lines():
if line:
print line
if flag and line.endswith('</MTConnectStreams>'):
# Wait until XML message end is reached to receive the full message
break
except requests.exceptions.RequestException as e:
print('error: ', e)
except BaseException as e:
print e
if __name__ == '__main__':
for task in tasks:
uid = str(uuid.uuid4())
thread_id.append(uid)
t = Thread(target=http_handler, args=(uid, task, False), name=uid)
t.start()
print thread_id
# Wait Time X or until user is doing something
# Send flag = to desired thread to indicate the loop should stop after reaching the end.
Any suggestions? What is the best solution? I don't want to kill the thread because I would like to read the ending to have a full XML message.
I found a solution by using threading module and threading.events. Maybe not the best solution, but it works fine currently.
import logging
import threading
import time
import uuid
import requests
logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-10s) %(message)s', )
tasks = ['http://agent.mtconnect.org/sample?interval=5000',
'http://agent.mtconnect.org/sample?interval=10000']
d = dict()
def http_handler(e, url):
logging.debug('wait_for_event starting')
message_buffer = []
filter_namespace = True
try:
requests_stream = requests.get(url, stream=True, timeout=2)
for line in requests_stream.iter_lines():
if line:
message_buffer.append(line)
if e.isSet() and line.endswith('</MTConnectStreams>'):
logging.debug(len(message_buffer))
break
except requests.exceptions.RequestException as e:
print('error: ', e)
except BaseException as e:
print e
if __name__ == '__main__':
logging.debug('Waiting before calling Event.set()')
for task in tasks:
uid = str(uuid.uuid4())
e = threading.Event()
d[uid] = {"stop_event": e}
t = threading.Event(uid)
t = threading.Thread(name=uid,
target=http_handler,
args=(e, task))
t.start()
logging.debug('Waiting 3 seconds before calling Event.set()')
for key in d:
time.sleep(3)
logging.debug(threading.enumerate())
logging.debug(d[key])
d[key]['stop_event'].set()
logging.debug('bye')

Threads not stop in python

The purpose of my program is to download files with threads. I define the unit, and using len/unit threads, the len is the length of the file which is going to be downloaded.
Using my program, the file can be downloaded, but the threads are not stopping. I can't find the reason why.
This is my code...
#! /usr/bin/python
import urllib2
import threading
import os
from time import ctime
class MyThread(threading.Thread):
def __init__(self,func,args,name=''):
threading.Thread.__init__(self);
self.func = func;
self.args = args;
self.name = name;
def run(self):
apply(self.func,self.args);
url = 'http://ubuntuone.com/1SHQeCAQWgIjUP2945hkZF';
request = urllib2.Request(url);
response = urllib2.urlopen(request);
meta = response.info();
response.close();
unit = 1000000;
flen = int(meta.getheaders('Content-Length')[0]);
print flen;
if flen%unit == 0:
bs = flen/unit;
else :
bs = flen/unit+1;
blocks = range(bs);
cnt = {};
for i in blocks:
cnt[i]=i;
def getStr(i):
try:
print 'Thread %d start.'%(i,);
fout = open('a.zip','wb');
fout.seek(i*unit,0);
if (i+1)*unit > flen:
request.add_header('Range','bytes=%d-%d'%(i*unit,flen-1));
else :
request.add_header('Range','bytes=%d-%d'%(i*unit,(i+1)*unit-1));
#opener = urllib2.build_opener();
#buf = opener.open(request).read();
resp = urllib2.urlopen(request);
buf = resp.read();
fout.write(buf);
except BaseException:
print 'Error';
finally :
#opener.close();
fout.flush();
fout.close();
del cnt[i];
# filelen = os.path.getsize('a.zip');
print 'Thread %d ended.'%(i),
print cnt;
# print 'progress : %4.2f'%(filelen*100.0/flen,),'%';
def main():
print 'download at:',ctime();
threads = [];
for i in blocks:
t = MyThread(getStr,(blocks[i],),getStr.__name__);
threads.append(t);
for i in blocks:
threads[i].start();
for i in blocks:
# print 'this is the %d thread;'%(i,);
threads[i].join();
#print 'size:',os.path.getsize('a.zip');
print 'download done at:',ctime();
if __name__=='__main__':
main();
Could someone please help me understand why the threads aren't stopping.
I can't really address your code example because it is quite messy and hard to follow, but a potential reason you are seeing the threads not end is that a request will stall out and never finish. urllib2 allows you to specify timeouts for how long you will allow the request to take.
What I would recommend for your own code is that you split your work up into a queue, start a fixed number of thread (instead of a variable number), and let the worker threads pick up work until it is done. Make the http requests have a timeout. If the timeout expires, try again or put the work back into the queue.
Here is a generic example of how to use a queue, a fixed number of workers and a sync primitive between them:
import threading
import time
from Queue import Queue
def worker(queue, results, lock):
local_results = []
while True:
val = queue.get()
if val is None:
break
# pretend to do work
time.sleep(.1)
local_results.append(val)
with lock:
results.extend(local_results)
print threading.current_thread().name, "Done!"
num_workers = 4
threads = []
queue = Queue()
lock = threading.Lock()
results = []
for i in xrange(100):
queue.put(i)
for _ in xrange(num_workers):
# Use None as a sentinel to signal the threads to end
queue.put(None)
t = threading.Thread(target=worker, args=(queue,results,lock))
t.start()
threads.append(t)
for t in threads:
t.join()
print sorted(results)
print "All done"

Categories

Resources