Read a file and do something, multithreading - python

This source is just an exemple:
inputf = open('input', 'r')
outputf = open('output', 'a')
for x in inputf:
x = x.strip('\n')
result = urllib2.urlopen('http://test.com/'+x).getcode()
outputf.write(x+' - '+result+'\n')
I want to add threading to this to check a few URLs at the same time.
The user should everytime decide how many threads he want to use.
The order on the output is not important.
What is the best and most beautiful way for that?

I like multiprocessing.pool.ThreadPool (or multiprocessing.pool.Pool)
like:
from multiprocessing.pool import ThreadPool
n_threads = 5
pool = ThreadPool(processes=n_threads)
threads = [pool.apply_async(some_function, args=(arg1,)) for arg1 in args]
pool.close()
pool.join()
results = [result.get() for result in threads]

Related

python multiprocessing to create an excel file with multiple sheets [duplicate]

I am new to Python and I am trying to save the results of five different processes to one excel file (each process write to a different sheet). I have read different posts here, but still can't get it done as I'm very confused about pool.map, queues, and locks, and I'm not sure what is required here to fulfill this task.
This is my code so far:
list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]
results = pd.DataFrame()
if __name__ == '__main__':
global list_of_days
writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
nr_of_cores = multiprocessing.cpu_count()
l = multiprocessing.Lock()
pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))
pool.map(f, range(len(list_of_days)))
pool.close()
pool.join()
def init(l):
global lock
lock = l
def f(k):
global results
*** DO SOME STUFF HERE***
results = results[ *** finished pandas dataframe *** ]
lock.acquire()
results.to_excel(writer, sheet_name=list_of_days[k])
writer.save()
lock.release()
The result is that only one sheet gets created in excel (I assume it is the process finishing last). Some questions about this code:
How to avoid defining global variables?
Is it even possible to pass around dataframes?
Should I move the locking to main instead?
Really appreciate some input here, as I consider mastering multiprocessing as instrumental. Thanks
1) Why did you implement time.sleep in several places in your 2nd method?
In __main__, time.sleep(0.1), to give the started process a timeslice to startup.
In f2(fq, q), to give the queue a timeslice to flushed all buffered data to the pipe and
as q.get_nowait() are used.
In w(q), are only for testing simulating long run of writer.to_excel(...),
i removed this one.
2) What is the difference between pool.map and pool = [mp.Process( . )]?
Using pool.map needs no Queue, no parameter passed, shorter code.
The worker_process have to return immediately the result and terminates.
pool.map starts a new process as long as all iteration are done.
The results have to be processed after that.
Using pool = [mp.Process( . )], starts n processes.
A process terminates on queue.Empty
Can you think of a situation where you would prefer one method over the other?
Methode 1: Quick setup, serialized, only interested in the result to continue.
Methode 2: If you want to do all workload parallel.
You could't use global writer in processes.
The writer instance has to belong to one process.
Usage of mp.Pool, for instance:
def f1(k):
# *** DO SOME STUFF HERE***
results = pd.DataFrame(df_)
return results
if __name__ == '__main__':
pool = mp.Pool()
results = pool.map(f1, range(len(list_of_days)))
writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')
for k, result in enumerate(results):
result.to_excel(writer, sheet_name=list_of_days[k])
writer.save()
pool.close()
This leads to .to_excel(...) are called in sequence in the __main__ process.
If you want parallel .to_excel(...) you have to use mp.Queue().
For instance:
The worker process:
# mp.Queue exeptions have to load from
try:
# Python3
import queue
except:
# Python 2
import Queue as queue
def f2(fq, q):
while True:
try:
k = fq.get_nowait()
except queue.Empty:
exit(0)
# *** DO SOME STUFF HERE***
results = pd.DataFrame(df_)
q.put( (list_of_days[k], results) )
time.sleep(0.1)
The writer process:
def w(q):
writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
while True:
try:
titel, result = q.get()
except ValueError:
writer.save()
exit(0)
result.to_excel(writer, sheet_name=titel)
The __main__ process:
if __name__ == '__main__':
w_q = mp.Queue()
w_p = mp.Process(target=w, args=(w_q,))
w_p.start()
time.sleep(0.1)
f_q = mp.Queue()
for i in range(len(list_of_days)):
f_q.put(i)
pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]
for p in pool:
p.start()
time.sleep(0.1)
for p in pool:
p.join()
w_q.put('STOP')
w_p.join()
Tested with Python:3.4.2 - pandas:0.19.2 - xlsxwriter:0.9.6

How Do I Send Multiple Threads In Python?

I want to be able to receive a list of content when posting data from an input. The input is in a text file which will be opened in python. To speed the process up I'd like to increase the number of threads that can be sent at once. How would I be able to do that, here's a rough idea of what I'm talking about:
import requests
userdata = open("data.txt", "r")
usercodes = [x.strip() for x in userdata]
for i in range(len(usercodes)):
thread_one = requests.post(url='https://test.com/input', params=usercodes[i])
thread_two = requests.post(url='https://test.com/input', params=usercodes[i+1])
thread_three = requests.post(url='https://test.com/input', params=usercodes[i+2])
I want all the threads to run at the same time as in here the program will carry out the requests one after the next.
import requests
from multiprocessing import Pool
def make_request(usercode):
requests.post(url='https://test.com/input', params=usercode)
if __name__ == '__main__':
userdata = open("data.txt", "r")
usercodes = [x.strip() for x in userdata]
with Pool(multiprocessing.cpu_count()) as p:
print(p.map(make_request, usercodes))
p.close()
With concurrent.futures.ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import requests
userdata = open("data.txt", "r")
usercodes = (x.strip() for x in userdata) # keep as generator
with ThreadPoolExecutor() as pool:
pool.map(partial(requests.post, 'https://test.com/input'), usercodes)
userdata.close() # closing the input file
Async is definitely your friend here.
from gevent import joinall, spawn, monkey
gevent.monkey.patch_all()
import requests
userdata = open("data.txt", "r")
usercodes = [x.strip() for x in userdata]
send_url = 'https://test.com/input'
threads = []
def send(usercode):
requests.post(url=send_url, params=usercode)
for code in usercodes:
threads.append(spawn(send, code))
joinall(threads)

Fasted way to submit tasks with celery?

I'm trying to submit around 150 million jobs to celery using the following code:
from celery import chain
from .task_receiver import do_work,handle_results,get_url
urls = '/home/ubuntu/celery_main/urls'
if __name__ == '__main__':
fh = open(urls,'r')
alldat = fh.readlines()
fh.close()
for line in alldat:
try:
result = chain(get_url.s(line[:-1]),do_work.s(line[:-1])).apply_async()
except:
print ("failed to submit job")
print('task submitted ' + str(line[:-1]))
Would it be faster to split the file into chunks and run multiple instances of this code? Or what can I do? I'm using memcached as the backend, rabbitmq as the broker.
import multiprocessing
from celery import chain
from .task_receiver import do_work,handle_results,get_url
urls = '/home/ubuntu/celery_main/urls'
num_workers = 200
def worker(urls,id):
"""worker function"""
for url in urls:
print ("%s - %s" % (id,url))
result = chain(get_url.s(url),do_work.s(url)).apply_async()
return
if __name__ == '__main__':
fh = open(urls,'r')
alldat = fh.readlines()
fh.close()
jobs = []
stack = []
id = 0
for i in alldat:
if (len(stack) < len(alldat) / num_workers):
stack.append(i[:-1])
continue
else:
id = id + 1
p = multiprocessing.Process(target=worker, args=(stack,id,))
jobs.append(p)
p.start()
stack = []
for j in jobs:
j.join()
If I understand your problem correctly:
you have a list of 150M urls
you want to run get_url() then do_work() on each of the urls
so you have two issues:
going over the 150M urls
queuing the tasks
Regarding the main for loop in your code, yes you could do that faster if you use multithreading, especially if you are using multicore cpu. Your master thread could read the file and pass chunks of it to sub-threads that will be creating the celery tasks.
Check the guide and the documentation:
https://realpython.com/intro-to-python-threading/
https://docs.python.org/3/library/threading.html
And now let's imagine you have 1 worker that is receiving these tasks. The code will generate 150M new tasks that will be pushed to the queue. Each chain will be a chain of get_url(), and do_work(), the next chain will run only when do_work() finishes.
If get_url() takes a short time and do_work() takes a long time, it will be a series of quick-task, slow-task, and the total time:
t_total_per_worker = (t_get_url_average+t_do_work_average) X 150M
If you have n workers
t_total = t_total_per_worker/n
t_total = (t_get_url_average+t_do_work_average) X 150M / n
Now if get_url() is time critical while do_work() is not, then, if you can, you should run all 150M get_url() first and when that is done run all 150M do_work(), but that may require changes to your process design.
That is what I would do. Maybe others have better ideas!?

Multiprocessing hangs after several hundred jobs

I am trying to use this question for my file processing:
Python multiprocessing safely writing to a file
This is my modification of the code:
def listener(q):
'''listens for messages on the q, writes to file. '''
while 1:
reads = q.get()
if reads == 'kill':
#f.write('killed')
break
for read in reads:
out_bam.write(read)
out_bam.flush()
out_bam.close()
def fetch_reads(line, q):
parts = line[:-1].split('\t')
print(parts)
start,end = int(parts[1])-1,int(parts[2])-1
in_bam = pysam.AlignmentFile(args.bam, mode='rb')
fetched = in_bam.fetch(parts[0], start, end)
reads = [read for read in fetched if (read.cigarstring and read.pos >= start and read.pos < end and 'S' not in read.cigarstring)]
in_bam.close()
q.put(reads)
return reads
#must use Manager queue here, or will not work
manager = mp.Manager()
q = manager.Queue()
if not args.threads:
threads = 1
else:
threads = int(args.threads)
pool = mp.Pool(threads+1)
#put listener to work first
watcher = pool.apply_async(listener, (q,))
with open(args.bed,'r') as bed:
jobs = []
cnt = 0
for line in bed:
# Fire off the read fetchings
job = pool.apply_async(fetch_reads, (line, q))
jobs.append(job)
cnt += 1
if cnt > 10000:
break
# collect results from the workers through the pool result queue
for job in jobs:
job.get()
print('get')
#now we are done, kill the listener
q.put('kill')
pool.close()
The differences in that I am opening and closing the file in the function since otherwise I get unusual errors from bgzip.
At first, print(parts) and print('get') are interchangeably printed (more or less), then there are less and less prints of 'get'. Ultimately the code hangs, and nothing is printed (all the parts are printed, but 'get' simply doesn't print anymore). The output file remains zero bytes.
Can anyone lend a hand? Cheers!

Python MemoryError with Queue and threading

I'm currently writing a script that reads reddit comments from a large file (5 gigs compressed, ~30 gigs of data being read). My script reads the comments, checks for some text, parses them, and sends them off to a Queue function (running in a seperate thread). No matter what I do, I always get a MemoryError on a specific iteration (number 8162735 if it matters in the slightest). And I can't seem to handle the error, Windows just keeps shutting down python when it hits. Here's my script:
import ujson
from tqdm import tqdm
import bz2
import json
import threading
import spacy
import Queue
import time
nlp = spacy.load('en')
def iter_comments(loc):
with bz2.BZ2File(loc) as file_:
for i, line in (enumerate(file_)):
yield ujson.loads(line)['body']
objects = iter_comments('RC_2015-01.bz2')
q = Queue.Queue()
f = open("reddit_dump.bin", 'wb')
def worker():
while True:
item = q.get()
f.write(item)
q.task_done()
for i in range(0, 2):
t = threading.Thread(target=worker)
t.daemon = True
t.start()
def finish_parse(comment):
global q
try:
comment_parse = nlp(unicode(comment))
comment_bytes = comment_parse.to_bytes()
q.put(comment_bytes)
except MemoryError:
print "MemoryError with comment {0}, waiting for Queue to empty".format(comment)
time.sleep(2)
except AssertionError:
print "AssertionError with comment {0}, skipping".format(comment)
for comment in tqdm(objects):
comment = str(comment.encode('ascii', 'ignore'))
if ">" in comment:
c_parse_thread = threading.Thread(target=finish_parse, args=(comment,))
c_parse_thread.start()
q.join()
f.close()
Does anybody know what I'm doing wrong?
Looks like its not in your code but may be in the data. Have you tried to skip that iteration?
x = 0
for comment in tqdm(objects):
x += 1
if x != 8162735
comment = str(comment.encode('ascii', 'ignore'))
if ">" in comment:
c_parse_thread = threading.Thread(target=finish_parse, args=(comment,))
c_parse_thread.start()

Categories

Resources