Fastest way to submit tasks with celery? - python

I'm trying to submit around 150 million jobs to celery using the following code:
from celery import chain
from .task_receiver import do_work,handle_results,get_url
urls = '/home/ubuntu/celery_main/urls'
if __name__ == '__main__':
    fh = open(urls,'r')
    alldat = fh.readlines()
    fh.close()
    for line in alldat:
        try:
            result = chain(get_url.s(line[:-1]),do_work.s(line[:-1])).apply_async()
        except:
            print ("failed to submit job")
        print('task submitted ' + str(line[:-1]))
Would it be faster to split the file into chunks and run multiple instances of this code? Or what can I do? I'm using memcached as the backend, rabbitmq as the broker.

import multiprocessing
from celery import chain
from .task_receiver import do_work,handle_results,get_url
urls = '/home/ubuntu/celery_main/urls'
num_workers = 200
def worker(urls,id):
    """worker function"""
    for url in urls:
        print ("%s - %s" % (id,url))
        result = chain(get_url.s(url),do_work.s(url)).apply_async()
    return
if __name__ == '__main__':
    fh = open(urls,'r')
    alldat = fh.readlines()
    fh.close()
    jobs = []
    stack = []
    id = 0
    for i in alldat:
        if (len(stack) < len(alldat) / num_workers):
            stack.append(i[:-1])
            continue
        else:
            id = id + 1
            p = multiprocessing.Process(target=worker, args=(stack,id,))
            jobs.append(p)
            p.start()
            stack = []
    for j in jobs:
        j.join()

If I understand your problem correctly:
you have a list of 150M urls
you want to run get_url() then do_work() on each of the urls
so you have two issues:
going over the 150M urls
queuing the tasks
Regarding the main for loop in your code: yes, you could make it faster with multithreading, especially on a multicore CPU. Your master thread could read the file and pass chunks of it to sub-threads that create the Celery tasks.
Check the guide and the documentation:
https://realpython.com/intro-to-python-threading/
https://docs.python.org/3/library/threading.html
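The chunk-and-dispatch idea could be sketched roughly as follows. This is only a minimal sketch: the names `chunked`, `submit_chunk`, `submit_all` and the `submit_one` callable are hypothetical, and in the question's setup `submit_one` would presumably wrap `chain(get_url.s(url), do_work.s(url)).apply_async()`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items without loading everything at once."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def submit_chunk(lines, submit_one):
    """Submit every URL in one chunk; return how many submissions succeeded."""
    ok = 0
    for line in lines:
        url = line.rstrip("\n")
        try:
            submit_one(url)
            ok += 1
        except Exception as exc:  # keep going if a single submission fails
            print("failed to submit %s: %s" % (url, exc))
    return ok


def submit_all(lines, submit_one, chunk_size=1000, max_threads=8):
    """Fan chunks out to a thread pool and return the total number submitted."""
    with ThreadPoolExecutor(max_workers=max_threads) as ex:
        futures = [ex.submit(submit_chunk, chunk, submit_one)
                   for chunk in chunked(lines, chunk_size)]
        return sum(f.result() for f in futures)
```

Passing an open file object as `lines` avoids `readlines()` on the whole file. Note that the executor queues chunks without any bound, so a real run over 150M lines would likely need some form of back-pressure on top of this.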
Now imagine you have one worker receiving these tasks. The code generates 150M chains and pushes them to the queue. Each chain runs get_url() followed by do_work(), and the worker starts the next chain only when do_work() finishes.
If get_url() takes a short time and do_work() takes a long time, the worker sees a series of quick task, slow task, and the total time is:
t_total_per_worker = (t_get_url_average + t_do_work_average) * 150M
If you have n workers:
t_total = t_total_per_worker / n
t_total = (t_get_url_average + t_do_work_average) * 150M / n
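To get a feeling for the numbers, the formula can be evaluated directly. The timings and the worker count below are made-up assumptions for illustration, not measurements from the question.

```python
def estimate_total_seconds(t_get_url, t_do_work, n_tasks, n_workers):
    """Total wall-clock time if every worker runs its chains back to back."""
    return (t_get_url + t_do_work) * n_tasks / n_workers


# Assumed timings: 0.5 s per get_url, 1.5 s per do_work, 100 workers.
seconds = estimate_total_seconds(0.5, 1.5, 150_000_000, 100)
print("%.0f s (about %.1f days)" % (seconds, seconds / 86400))
```

With these assumed values the estimate comes to 3,000,000 seconds, a bit under 35 days, which shows why the number of workers matters far more than the speed of the submission loop.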
Now if get_url() is time critical while do_work() is not, then, if you can, you should run all 150M get_url() first and when that is done run all 150M do_work(), but that may require changes to your process design.
That is what I would do. Maybe others have better ideas!?

Related

how to "poll" python multiprocess pool apply_async

I have a task function like this:
def task (s) :
    # doing some thing
    return res
The original program is:
res = []
for i in data :
    res.append(task(i))
    # using pickle to save res every 30s
I need to process a lot of data and I don't care about the output order of the results. Because of the long running time, I need to save the current progress regularly. Now I want to change it to multiprocessing:
pool = Pool(4)
status = []
res = []
for i in data :
    status.append(pool.apply_async(task, (i,)))
for i in status :
    res.append(i.get())
    # using pickle to save res every 30s
Suppose I have processes p0, p1, p2, p3 in the Pool and 10 tasks (task(0) ... task(9)), and p0 takes a very long time to finish task(0).
1. Is the main process blocked at the first "res.append(i.get())"?
2. If p1 finishes task(1) while p0 is still dealing with task(0), will p1 continue with task(4) or a later one?
3. If the answer to the first question is yes, how can I get the other results in advance and collect the result of task(0) last?
I updated my code, but the main process blocked somewhere while the other processes were still working on tasks. What's wrong? Here is the core of the code:
with concurrent.futures.ProcessPoolExecutor(4) as ex :
    for i in self.inBuffer :
        futuresList.append(ex.submit(warpper, i))
    for i in concurrent.futures.as_completed(futuresList) :
        (word, r) = i.result()
        self.resDict[word] = r
        self.logger.info("{} --> {}".format(word, r))
        cur = datetime.now()
        if (cur - self.timeStmp).total_seconds() > 30 :
            self.outputPickle()
            self.timeStmp = datetime.now()
The length of self.inBuffer is about 100000. self.logger.info writes the info to a log file. For some special inputs i, the warpper function prints auxiliary information with print. self.resDict is a dict that stores the results. self.outputPickle() writes a .pkl file using pickle.dump.
At first the code ran normally: the log file was updated and warpper printed its messages. But at some point I found that the log file had not been updated for a long time (several hours, although a single warpper call should not take more than 120s), while warpper was still printing information (it printed about 100 messages without any update of the log file before I killed the process). The time stamp of the output .pkl file didn't change either. Here is the implementation of outputPickle():
def outputPickle (self) :
    if os.path.exists(os.path.join(self.wordDir, self.outFile)) :
        if os.path.exists(os.path.join(self.wordDir, "{}_backup".format(self.outFile))):
            os.remove(os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
        shutil.copy(os.path.join(self.wordDir, self.outFile), os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
    with open(os.path.join(self.wordDir, self.outFile), 'wb') as f:
        pickle.dump(self.resDict, f)
Then I added three print calls:
print("getting res of something")
(word, r) = i.result()
print("finishing i.result")
self.resDict[word] = r
print("finished getting res of {}".format(word))
Here is the log:
getting res of something
finishing i.result
finished getting res of CNICnanotubesmolten
getting res of something
finishing i.result
finished getting res of CNN0
getting res of something
message by warpper
message by warpper
message by warpper
message by warpper
message by warpper
The line "message by warpper" is printed at most once per call to warpper.
1. Yes.
2. Yes, because the tasks are submitted asynchronously. p1 (or any other free worker) takes the next item whenever the input iterable holds more items than there are worker processes.
3. "... how to get other results in advance"
One convenient option is concurrent.futures.as_completed, which yields the futures as they complete:
import time
import concurrent.futures
def func(x):
    time.sleep(3)
    return x ** 2
if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with concurrent.futures.ProcessPoolExecutor(4) as ex:
        futures = [ex.submit(func, i) for i in data]
        # processing the earlier results: as they are completed
        for fut in concurrent.futures.as_completed(futures):
            res = fut.result()
            results.append(res)
            print(res)
Sample output:
4
1
9
16
Another option is to pass a callback to the apply_async(func[, args[, kwds[, callback[, error_callback]]]]) call; the callback accepts a single argument, the value returned by the function. In that callback you can process the result in a minimal way (bearing in mind that it only ever sees one result from one call). The general scheme looks as follows:
from multiprocessing import Pool
def res_callback(v):
    # ... processing result
    with open('test.txt', 'a') as f: # just an example
        f.write(str(v))
    print(v, flush=True)
if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with Pool(4) as pool:
        # func is the function defined in the previous example
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        # wait until all tasks have finished
        for t in tasks:
            t.wait()
But this scheme still requires waiting for the submitted tasks somehow (for example by calling wait() or get() on each result).
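The difference between submission order and completion order can be made visible with a small experiment. The sketch below uses multiprocessing.pool.ThreadPool, which offers the same apply_async interface as Pool, only to keep the example self-contained; `slow_square`, `on_done` and `run` are hypothetical names.

```python
import time
from multiprocessing.pool import ThreadPool  # same apply_async API as Pool

completed = []  # filled by the callback, in completion order


def slow_square(x):
    # task 0 is deliberately the slowest one
    time.sleep(0.5 if x == 0 else 0.05)
    return x * x


def on_done(value):
    # runs in the parent (in a helper thread) as soon as one task finishes
    completed.append(value)


def run(data):
    with ThreadPool(4) as pool:
        tasks = [pool.apply_async(slow_square, (i,), callback=on_done)
                 for i in data]
        # get() blocks per task, in submission order
        return [t.get() for t in tasks]


ordered = run(range(4))
print(ordered)    # results in submission order
print(completed)  # results in completion order, slow task last
```

The first `get()` blocks until the slow task is done, yet the callback has already received the other results by then, so periodic saving could be driven from the callback or from an `as_completed` loop instead of from the `get()` loop.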

python multiprocessing to create an excel file with multiple sheets [duplicate]

I am new to Python and I am trying to save the results of five different processes to one Excel file (each process writes to a different sheet). I have read different posts here, but still can't get it done, as I'm very confused about pool.map, queues, and locks, and I'm not sure what is required to accomplish this task.
This is my code so far:
list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]
results = pd.DataFrame()
if __name__ == '__main__':
    global list_of_days
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    nr_of_cores = multiprocessing.cpu_count()
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))
    pool.map(f, range(len(list_of_days)))
    pool.close()
    pool.join()
def init(l):
    global lock
    lock = l
def f(k):
    global results
    *** DO SOME STUFF HERE***
    results = results[ *** finished pandas dataframe *** ]
    lock.acquire()
    results.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    lock.release()
The result is that only one sheet gets created in excel (I assume it is the process finishing last). Some questions about this code:
How to avoid defining global variables?
Is it even possible to pass around dataframes?
Should I move the locking to main instead?
Really appreciate some input here, as I consider mastering multiprocessing as instrumental. Thanks
1) Why did you implement time.sleep in several places in your 2nd method?
In __main__, time.sleep(0.1) gives the started process a timeslice to start up.
In f2(fq, q), it gives the queue a timeslice to flush all buffered data to the pipe, because q.get_nowait() is used.
In w(q), it was only there for testing, to simulate a long run of writer.to_excel(...); I removed this one.
2) What is the difference between pool.map and pool = [mp.Process( . )]?
Using pool.map needs no Queue and no extra parameters, so the code is shorter.
The worker function has to return its result directly.
pool.map keeps dispatching work to the pool until all iterations are done.
The results have to be processed after that.
Using pool = [mp.Process( . )] starts n processes.
A process terminates on queue.Empty.
Can you think of a situation where you would prefer one method over the other?
Method 1: Quick setup, serialized, you are only interested in the result in order to continue.
Method 2: If you want to do all of the workload in parallel.
You can't use a global writer in the processes.
The writer instance has to belong to one process.
Usage of mp.Pool, for instance:
import multiprocessing as mp
import pandas as pd
def f1(k):
    # *** DO SOME STUFF HERE***
    results = pd.DataFrame(df_)
    return results
if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(f1, range(len(list_of_days)))
    writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')
    for k, result in enumerate(results):
        result.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()  # with pandas >= 2.0 use writer.close() instead
    pool.close()
This means that the .to_excel(...) calls run one after another in the __main__ process.
If you want .to_excel(...) to run in parallel with the workers, you have to use mp.Queue().
For instance:
The worker process:
# the mp.Queue exceptions have to be imported from the queue module
try:
    # Python 3
    import queue
except ImportError:
    # Python 2
    import Queue as queue
def f2(fq, q):
    while True:
        try:
            k = fq.get_nowait()
        except queue.Empty:
            exit(0)
        # *** DO SOME STUFF HERE***
        results = pd.DataFrame(df_)
        q.put( (list_of_days[k], results) )
        time.sleep(0.1)
The writer process:
def w(q):
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    while True:
        try:
            titel, result = q.get()
        except ValueError:
            # unpacking the 'STOP' string into two names fails, which ends the loop
            writer.save()
            exit(0)
        result.to_excel(writer, sheet_name=titel)
The __main__ process:
if __name__ == '__main__':
    w_q = mp.Queue()
    w_p = mp.Process(target=w, args=(w_q,))
    w_p.start()
    time.sleep(0.1)
    f_q = mp.Queue()
    for i in range(len(list_of_days)):
        f_q.put(i)
    pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]
    for p in pool:
        p.start()
        time.sleep(0.1)
    for p in pool:
        p.join()
    w_q.put('STOP')
    w_p.join()
Tested with Python:3.4.2 - pandas:0.19.2 - xlsxwriter:0.9.6
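The writer process above stops because unpacking the string 'STOP' into two names raises ValueError. A more explicit variant of the same single-writer pattern uses a dedicated sentinel value. The following is only a sketch under simplifying assumptions: it uses threads and a plain dict in place of processes and pd.ExcelWriter so that it runs without pandas, and the names `STOP`, `writer`, `produce` and `run` are hypothetical.

```python
import queue
import threading

STOP = None  # sentinel: tells the writer that no more results will arrive


def writer(q, sheets):
    """Single consumer: the only place that touches the output."""
    while True:
        item = q.get()
        if item is STOP:
            break
        title, result = item
        # stand-in for result.to_excel(writer, sheet_name=title)
        sheets[title] = result


def produce(day, q):
    """Worker: compute one result and hand it to the writer."""
    q.put((day, "data for " + day))


def run(days):
    q = queue.Queue()
    sheets = {}
    w = threading.Thread(target=writer, args=(q, sheets))
    w.start()
    workers = [threading.Thread(target=produce, args=(d, q)) for d in days]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    q.put(STOP)  # all producers are done, so the sentinel is last in the queue
    w.join()
    return sheets


print(run(["2017.03.20", "2017.03.21"]))
```

Because the sentinel is enqueued only after every producer has been joined, the writer is guaranteed to see all results before it stops.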

Multiprocessing hangs after several hundred jobs

I am trying to use this question for my file processing:
Python multiprocessing safely writing to a file
This is my modification of the code:
def listener(q):
    '''listens for messages on the q, writes to file. '''
    while 1:
        reads = q.get()
        if reads == 'kill':
            #f.write('killed')
            break
        for read in reads:
            out_bam.write(read)
        out_bam.flush()
    out_bam.close()
def fetch_reads(line, q):
    parts = line[:-1].split('\t')
    print(parts)
    start,end = int(parts[1])-1,int(parts[2])-1
    in_bam = pysam.AlignmentFile(args.bam, mode='rb')
    fetched = in_bam.fetch(parts[0], start, end)
    reads = [read for read in fetched if (read.cigarstring and read.pos >= start and read.pos < end and 'S' not in read.cigarstring)]
    in_bam.close()
    q.put(reads)
    return reads
#must use Manager queue here, or will not work
manager = mp.Manager()
q = manager.Queue()
if not args.threads:
    threads = 1
else:
    threads = int(args.threads)
pool = mp.Pool(threads+1)
#put listener to work first
watcher = pool.apply_async(listener, (q,))
with open(args.bed,'r') as bed:
    jobs = []
    cnt = 0
    for line in bed:
        # Fire off the read fetchings
        job = pool.apply_async(fetch_reads, (line, q))
        jobs.append(job)
        cnt += 1
        if cnt > 10000:
            break
# collect results from the workers through the pool result queue
for job in jobs:
    job.get()
    print('get')
#now we are done, kill the listener
q.put('kill')
pool.close()
The difference is that I am opening and closing the file inside the function, since otherwise I get unusual errors from bgzip.
At first, print(parts) and print('get') are printed alternately (more or less), then there are fewer and fewer prints of 'get'. Ultimately the code hangs and nothing is printed (all the parts are printed, but 'get' simply doesn't print anymore). The output file remains at zero bytes.
Can anyone lend a hand? Cheers!

Multiprocessing in Python not calling the worker functions

I'm fairly new to multiprocessing and I have written the script below, but the methods are not getting called. I don't understand what I'm missing.
What I want to do is the following:
call two different methods asynchronously.
call one method before the other.
# import all necessary modules
import Queue
import logging
import multiprocessing
import time, sys
import signal
debug = True
def init_worker():
    signal.signal(signal.SIGINT, signal.SIG_IGN)
research_name_id = {}
ids = [55, 125, 428, 429, 430, 895, 572, 126, 833, 502, 404]
# declare all the static variables
num_threads = 2 # number of parallel threads
minDelay = 3 # minimum delay
maxDelay = 7 # maximum delay
# declare an empty queue which will hold the publication ids
queue = Queue.Queue(0)
proxies = []
#print (proxies)
def split(a, n):
    """Function to split data evenly among threads"""
    k, m = len(a) / n, len(a) % n
    return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in xrange(n))
def run_worker(
        i,
        data,
        queue,
        research_name_id,
        proxies,
        debug,
        minDelay,
        maxDelay):
    """ Function to pull out all publication links from nist
    data - research ids pulled using a different script
    queue - add the publication urls to the list
    research_name_id - dictionary with research id as key and name as value
    proxies - scraped proxies
    """
    print 'getLinks', i
    for d in data:
        print d
        queue.put(d)
def fun_worker(i, queue, proxies, debug, minDelay, maxDelay):
    print 'publicationData', i
    try:
        print queue.pop()
    except:
        pass
def main():
    print "Initializing workers"
    pool = multiprocessing.Pool(num_threads, init_worker)
    distributed_ids = list(split(list(ids), num_threads))
    for i in range(num_threads):
        data_thread = distributed_ids[i]
        print data_thread
        pool.apply_async(run_worker, args=(i + 1,
                                           data_thread,
                                           queue,
                                           research_name_id,
                                           proxies,
                                           debug,
                                           minDelay,
                                           maxDelay,
                                           ))
        pool.apply_async(fun_worker,
                         args=(
                             i + 1,
                             queue,
                             proxies,
                             debug,
                             minDelay,
                             maxDelay,
                         ))
    try:
        print "Waiting 10 seconds"
        time.sleep(10)
    except KeyboardInterrupt:
        print "Caught KeyboardInterrupt, terminating workers"
        pool.terminate()
        pool.join()
    else:
        print "Quitting normally"
        pool.close()
        pool.join()
if __name__ == "__main__":
    main()
The only output that I get is
Initializing workers
[55, 125, 428, 429, 430, 895]
[572, 126, 833, 502, 404]
Waiting 10 seconds
Quitting normally
There are a couple of issues:
You're not using multiprocessing.Queue.
If you want to share a queue with a subprocess via apply_async etc., you need to use a manager (see example).
However, you should take a step back and ask yourself what you are trying to do. Is apply_async really the way to go? You have a list of items that you want to map over repeatedly, applying some long-running, compute-intensive transformations (if they were just blocking on I/O, you might as well use threads). It seems to me that imap_unordered is actually what you want:
pool = multiprocessing.Pool(num_threads, init_worker)
links = pool.imap_unordered(run_worker1, ids)
output = pool.imap_unordered(fun_worker1, links)
run_worker1 and fun_worker1 need to be modified to take a single argument. If you need to share other data, then you should pass it in the initializer instead of passing it to the subprocesses over and over again.

Redis still fills up when results_ttl=0, Why?

Question: Why is redis filling up if the results of jobs are discarded immediately?
I'm using Redis as a queue to create PDFs asynchronously and then save the result to my database. Since it's saved, I don't need to access the object at a later date, so I don't need to keep the result stored in Redis after it has been processed.
To keep the result from staying in redis I've set the TTL to 0:
parameter_dict = {
    "order": serializer.object,
    "photo": base64_image,
    "result_ttl": 0
}
django_rq.enqueue(procces_template, **parameter_dict)
The problem is although the redis worker says the job expires immediately:
15:33:35 Job OK, result = John Doe's nail order to 568 Broadway
15:33:35 Result discarded immediately.
15:33:35
15:33:35 *** Listening on high, default, low...
Redis still fills up and throws:
ResponseError: command not allowed when used memory > 'maxmemory'
Is there another parameter that I need to set in redis / django-rq to keep redis from filling up if the job result is already not stored?
Update:
Following this post I expect the memory might be filling up because of the failed jobs in redis.
Using this code snippet:
def print_redis_failed_queue():
    q = django_rq.get_failed_queue()
    while True:
        job = q.dequeue()
        if not job:
            break
        print job
Here is a pastebin with a dump of the keys in Redis:
http://pastebin.com/Bc4bRyRR
It's too long to be practical to post here. Its size seems to support my theory. But using:
def delete_redis_failed_queue():
    q = django_rq.get_failed_queue()
    count = 0
    while True:
        job = q.dequeue()
        if not job:
            print "{} Jobs deleted.".format(count)
            break
        job.delete()
        count += 1
doesn't clear Redis as I expect. How can I get a more accurate dump of the keys in Redis? Am I clearing the jobs correctly?
It turns out Redis was filling up because of orphaned jobs, i.e. jobs that were not assigned to a particular queue.
Although the cause of the orphaned jobs is unknown, the problem is solved with this snippet:
import redis
from rq.queue import Queue, get_failed_queue
from rq.job import Job
# use a separate name so the redis module is not shadowed
redis_conn = redis.Redis()
for i, key in enumerate(redis_conn.keys('rq:job:*')):
    job_number = key.split("rq:job:")[1]
    job = Job.fetch(job_number, connection=redis_conn)
    job.delete()
In my particular situation, calling this snippet (actually the delete_orphaned_jobs() method below) after the completion of each job ensured that Redis would not fill up and that orphaned jobs would be taken care of. For more details on the issue, here's a link to the conversation in the opened django-rq issue.
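On a large keyspace, and on Python 3 where redis-py returns keys as bytes unless decode_responses=True is set, the key handling could be written a little more defensively. The sketch below is an assumption-laden illustration: `iter_job_ids` and `FakeRedis` are hypothetical names, and `FakeRedis` only imitates the `scan_iter(match=...)` method of a redis-py connection so that the example runs without a server.

```python
import fnmatch


def iter_job_ids(conn, prefix="rq:job:"):
    """Yield the job id part of every key that starts with `prefix`."""
    # scan_iter walks the keyspace incrementally instead of blocking like KEYS
    for key in conn.scan_iter(match=prefix + "*"):
        if isinstance(key, bytes):  # bytes unless decode_responses=True
            key = key.decode("utf-8")
        yield key[len(prefix):]


class FakeRedis:
    """Tiny stand-in so the sketch runs without a Redis server."""

    def __init__(self, keys):
        self._keys = keys

    def scan_iter(self, match=None):
        for key in self._keys:
            text = key.decode("utf-8") if isinstance(key, bytes) else key
            if match is None or fnmatch.fnmatch(text, match):
                yield key


fake = FakeRedis([b"rq:job:abc", "rq:job:def", b"rq:queue:default"])
job_ids = list(iter_job_ids(fake))
print(job_ids)
```

With a real connection, each yielded id could then be passed to `Job.fetch(job_id, connection=conn)` as in the snippet above.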
In the process of diagnosing this issue, I also created a utility class for inspecting and deleting jobs / orphaned jobs with ease:
class RedisTools:
    '''
    A set of utility tools for interacting with a redis cache
    '''
    def __init__(self):
        self._queues = ["default", "high", "low", "failed"]
        self.get_redis_connection()
    def get_redis_connection(self):
        redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')
        self.redis = redis.from_url(redis_url)
    def get_queues(self):
        return self._queues
    def get_queue_count(self, queue):
        return Queue(name=queue, connection=self.redis).count
    def msg_print_log(self, msg):
        print msg
        logger.info(msg)
    def get_key_count(self):
        return len(self.redis.keys('rq:job:*'))
    def get_queue_job_counts(self):
        queues = self.get_queues()
        queue_counts = [self.get_queue_count(queue) for queue in queues]
        return zip(queues, queue_counts)
    def has_orphanes(self):
        job_count = sum([count[1] for count in self.get_queue_job_counts()])
        return job_count < self.get_key_count()
    def print_failed_jobs(self):
        q = django_rq.get_failed_queue()
        while True:
            job = q.dequeue()
            if not job:
                break
            print job
    def print_job_counts(self):
        for queue in self.get_queue_job_counts():
            print "{:.<20}{}".format(queue[0], queue[1])
        print "{:.<20}{}".format('Redis Keys:', self.get_key_count())
    def delete_failed_jobs(self):
        q = django_rq.get_failed_queue()
        count = 0
        while True:
            job = q.dequeue()
            if not job:
                self.msg_print_log("{} Jobs deleted.".format(count))
                break
            job.delete()
            count += 1
    def delete_orphaned_jobs(self):
        if not self.has_orphanes():
            return self.msg_print_log("No orphan jobs to delete.")
        for i, key in enumerate(self.redis.keys('rq:job:*')):
            job_number = key.split("rq:job:")[1]
            job = Job.fetch(job_number, connection=self.redis)
            job.delete()
            self.msg_print_log("[{}] Deleted job {}.".format(i, job_number))
You can use the "Black Hole" exception handler from http://python-rq.org/docs/exceptions/ with job.cancel():
def black_hole(job, *exc_info):
    # Delete the job hash on redis, otherwise it will stay on the queue forever
    job.cancel()
    return False
