I have a list dataframe_chunk which contains chunks of a very large pandas dataframe. I would like to write every single chunk to a different CSV file, and to do so in parallel. However, I see the files being written sequentially and I'm not sure why. Here's the code:
import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

pool = cfu.ThreadPoolExecutor(N_CORES)

futures = []
for i in range(N_CORES):
    fpath = '/path_to_files_' + str(i) + '.csv'
    futures.append(pool.submit(write_chunk_to_file(dataframe_chunk[i], fpath)))

for f in cfu.as_completed(futures):
    print("finished at", time.time())
Any clues?
One thing that is stated in the Python 2.7.x threading docs, but not in the 3.x docs, is that because of the Global Interpreter Lock Python cannot achieve true parallelism using the threading library - only one thread executes Python code at a time.
You should try concurrent.futures with the ProcessPoolExecutor, which uses a separate process for each job and can therefore achieve true parallelism on a multi-core CPU.
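For example, here is a rough sketch of your loop switched over to ProcessPoolExecutor (it assumes dataframe_chunk and N_CORES exist as in your snippet; note that each chunk gets pickled and sent to a worker process, and that process pools need the __main__ guard on Windows):

import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

if __name__ == '__main__':
    # dataframe_chunk and N_CORES are assumed to exist as in the question
    with cfu.ProcessPoolExecutor(N_CORES) as pool:
        futures = []
        for i in range(N_CORES):
            fpath = '/path_to_files_' + str(i) + '.csv'
            # pass the callable and its arguments to submit();
            # calling the function here would run it in this process instead
            futures.append(pool.submit(write_chunk_to_file, dataframe_chunk[i], fpath))
        for f in cfu.as_completed(futures):
            f.result()  # re-raise any exception from the worker
            print("finished at", time.time())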
Update
Here is your program adapted to use the multiprocessing library instead:
#!/usr/bin/env python3

from multiprocessing import Process
import os
import time

N_CORES = 8

def write_chunk_to_file(chunk, fpath):
    with open(fpath, "w") as f:
        for x in range(10000000):
            f.write(str(x))

futures = []

print("my pid:", os.getpid())
input("Hit return to start:")

start = time.time()
print("Started at:", start)

for i in range(N_CORES):
    fpath = './tmp/file-' + str(i) + '.csv'
    p = Process(target=write_chunk_to_file, args=(i, fpath))
    futures.append(p)

for p in futures:
    p.start()

print("All jobs started.")

for p in futures:
    p.join()

print("All jobs finished at", time.time())
You can monitor the jobs with this shell command in another window:
while true; do clear; pstree 12345; ls -l tmp; sleep 1; done
(Replace 12345 with the pid emitted by the script.)
Your code probably works: it starts creating the 2nd and subsequent files while the 1st chunk is still being written, and so on. It will be slightly faster than the simple synchronous version because the syscalls follow each other sooner.
But from the kernel's perspective the I/O syscalls still arrive one after another from a single Python process, so the files are written serially, albeit at a faster rate.
Related
This question already has answers here: Is there a simple process-based parallel map for python?
I wrote code like this:
def process(data):
    # create a file using data
    pass

all = ["data1", "data2", "data3"]
I want to execute the process function on every item of my all list in parallel. The files are small, so I'm not concerned about disk writes, but the processing takes a long time, so I want to use all of my cores.
How can I do this using the default modules in Python 2.7?
Assuming CPython and the GIL here.
If your task is I/O bound, in general, threading may be more efficient since the threads are simply dumping work on the operating system and idling until the I/O operation finishes. Spawning processes is a heavy way to babysit I/O.
However, most file systems aren't concurrent, so using multithreading or multiprocessing may not be any faster than synchronous writes.
Nonetheless, here's a contrived example of multiprocessing.Pool.map which may help with your CPU-bound work:
from multiprocessing import cpu_count, Pool

def process(data):
    # best to do heavy CPU-bound work here...

    # file write for demonstration
    with open("%s.txt" % data, "w") as f:
        f.write(data)

    # example of returning a result to the map
    return data.upper()

tasks = ["data1", "data2", "data3"]

pool = Pool(cpu_count() - 1)
print(pool.map(process, tasks))
A similar setup for threading can be found in concurrent.futures.ThreadPoolExecutor.
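For reference, a rough sketch of that ThreadPoolExecutor setup with the same demonstration function (on Python 2.7 this needs the futures backport from PyPI; on Python 3 it is in the standard library):

from concurrent.futures import ThreadPoolExecutor
from multiprocessing import cpu_count

def process(data):
    # I/O-bound demonstration work
    with open("%s.txt" % data, "w") as f:
        f.write(data)
    return data.upper()

tasks = ["data1", "data2", "data3"]

with ThreadPoolExecutor(max_workers=cpu_count()) as executor:
    # executor.map mirrors Pool.map and yields results in input order
    print(list(executor.map(process, tasks)))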
As an aside, all is a builtin function and isn't a great variable name choice.
Or:
from threading import Thread

def process(data):
    print("processing {}".format(data))

l = ["data1", "data2", "data3"]

for task in l:
    t = Thread(target=process, args=(task,))
    t.start()
Or (Python 3.6+ only, because of the f-string):
from threading import Thread

def process(data):
    print(f"processing {data}")

l = ["data1", "data2", "data3"]

for task in l:
    t = Thread(target=process, args=(task,))
    t.start()
Here is a template using the multiprocessing API (via multiprocessing.dummy, which is backed by threads); I hope it's helpful.
from multiprocessing.dummy import Pool as ThreadPool

def process(data):
    print("processing {}".format(data))

alldata = ["data1", "data2", "data3"]

pool = ThreadPool()
results = pool.map(process, alldata)
pool.close()
pool.join()
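Note that multiprocessing.dummy is just a thread-backed wrapper around the multiprocessing API, so if the work turns out to be CPU-bound you can point the same template at real processes by changing only the import (a sketch; the __main__ guard is needed for process pools on Windows):

from multiprocessing import Pool  # real processes instead of threads

def process(data):
    print("processing {}".format(data))

if __name__ == "__main__":
    alldata = ["data1", "data2", "data3"]
    pool = Pool()
    results = pool.map(process, alldata)
    pool.close()
    pool.join()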
I have code that works with Thread in Python, but I want to switch to Process because, if I have understood correctly, that will give me a speed-up.
Here is the code with Thread:
threads.append(Thread(target=getId, args=(my_queue, read)))
threads.append(Thread(target=getLatitude, args=(my_queue, read)))
The code works by putting the return values into the Queue; after a join on the threads list, I can retrieve the results.
After changing the code and the import statement, my code now looks like this:
threads.append(Process(target=getId, args=(my_queue, read)))
threads.append(Process(target=getLatitude, args=(my_queue, read)))
However, it does not execute anything and the Queue is empty; with Thread the Queue is not empty, so I think the problem is related to Process.
I have read answers saying that the Process class does not work on Windows. Is that true, or is there a way to make it work? (Adding freeze_support() does not help.)
If it is true, is multithreading on Windows actually executed in parallel on different cores?
ref:
Python multiprocessing example not working
Python code with multiprocessing does not work on Windows
Multiprocessing process does not join when putting complex dictionary in return queue
(which explains that fork does not exist on Windows)
EDIT:
To add some details: the code with Process actually works on CentOS.
EDIT2:
Here is a simplified version of my code with processes, tested on CentOS:
import pandas as pd
from multiprocessing import Process, freeze_support
from multiprocessing import Queue

#%% Global variables
datasets = []
latitude = []

def fun(key, job):
    global latitude
    if (key == 'LAT'):
        latitude.append(job)

def getLatitude(out_queue, skip=None):
    latDict = {'LAT': latitude}
    out_queue.put(latDict)

n = pd.read_csv("my.csv", sep=',', header=None).shape[0]
print("Number of baboon:" + str(n))

read = []
for i in range(0, n):
    threads = []
    my_queue = Queue()
    threads.append(Process(target=getLatitude, args=(my_queue, read)))

    for t in threads:
        freeze_support()  # try both with and without this line
        t.start()
    for t in threads:
        t.join()

    while not my_queue.empty():
        try:
            job = my_queue.get()
            key = list(job.keys())
            fun(key[0], job[key[0]])
        except:
            print("END")

    read.append(i)
Per the documentation, you need the following after the function definitions. When Python creates the subprocesses, they import your script, so any code at the global level will be run multiple times. Put the code you only want to run in the main process under the guard:
if __name__ == '__main__':
    n = pd.read_csv("my.csv", sep=',', header=None).shape[0]
    # etc.
Indent the rest of the code under this if.
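Applied to your simplified example, the guarded tail of the script could look roughly like this (a sketch; the imports and the fun/getLatitude definitions stay at module level so the child processes can import them):

if __name__ == '__main__':
    freeze_support()  # only matters when freezing the script into a Windows executable

    n = pd.read_csv("my.csv", sep=',', header=None).shape[0]
    print("Number of baboon:" + str(n))

    read = []
    for i in range(0, n):
        my_queue = Queue()
        p = Process(target=getLatitude, args=(my_queue, read))
        p.start()
        p.join()

        while not my_queue.empty():
            job = my_queue.get()
            key = list(job.keys())
            fun(key[0], job[key[0]])

        read.append(i)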
I'm trying to execute a function on every line of a CSV file as fast as possible. My code works, but I know it could be faster if I make better use of the multiprocessing library.
import csv
from multiprocessing import Process

processes = []

def execute_task(task_details):
    # work is done here, may take 1 second, may take 10
    # send output to another function
    pass

with open('twentyThousandLines.csv', 'rb') as file:
    r = csv.reader(file)
    for row in r:
        p = Process(target=execute_task, args=(row,))
        processes.append(p)
        p.start()

for p in processes:
    p.join()
I'm thinking I should put the tasks into a Queue and process them with a Pool, but all the examples make it seem like Queue doesn't work the way I assume, and that I can't map a Pool over an ever-expanding Queue.
I've done something similar using a Pool of workers.
import csv
from multiprocessing import Pool, cpu_count

def initializer(arg1, arg2):
    # Do something to initialize (if necessary)
    pass

def process_csv_data(data):
    # Do something with the data
    pass

pool = Pool(cpu_count(), initializer=initializer, initargs=(arg1, arg2))

with open("csv_data_file.csv", "rb") as f:
    csv_obj = csv.reader(f)
    for row in csv_obj:
        pool.apply_async(process_csv_data, (row,))
However, as pvg commented under your question, you might want to consider how to batch your data; going row by row may not be the right level of granularity.
You might also want to profile/test to figure out the bottleneck. For example, if disk access is the limiting factor, you might not benefit from parallelizing.
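To illustrate the batching idea, here is a rough sketch that feeds whole batches of rows to the pool instead of single rows (the batch size of 500 and the process_csv_batch/batches helpers are made up for the example; process_csv_data stands in for your real per-row work):

import csv
from multiprocessing import Pool, cpu_count

def process_csv_data(row):
    # placeholder for the real per-row work
    return row

def process_csv_batch(rows):
    # run the per-row work over a whole batch in one task
    return [process_csv_data(row) for row in rows]

def batches(iterable, size=500):
    # group an iterable into lists of `size` items
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

if __name__ == "__main__":
    pool = Pool(cpu_count())
    with open("csv_data_file.csv") as f:
        for result in pool.imap_unordered(process_csv_batch, batches(csv.reader(f))):
            pass  # collect or write out each batch's results here
    pool.close()
    pool.join()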
multiprocessing.Queue is a means of exchanging objects among processes, so it's not something you'd put a task into.
It looks to me like you are actually trying to speed up
def check(row):
    # do the checking
    return (row, result_of_check)

with open('twentyThousandLines.csv', 'rb') as file:
    r = csv.reader(file)
    for row, result in map(check, r):
        print(row, result)
which can be done with
#from multiprocessing import Pool  # if CPU-bound (but even then not always)
from multiprocessing.dummy import Pool  # if IO-bound

def check(row):
    # do the checking
    return (row, result_of_check)

if __name__ == "__main__":  # in case you are using processes on Windows
    with open('twentyThousandLines.csv', 'rb') as file:
        r = csv.reader(file)
        with Pool() as p:  # before Python 3.3 you should call close() and join() explicitly
            for row, result in p.imap_unordered(check, r, chunksize=10):  # just a guess - experiment a bit to find the best value
                print(row, result)
Creating processes takes some time (especially on Windows), so in most cases using threads via multiprocessing.dummy is faster (and multiprocessing is not totally trivial either - see the programming guidelines in its documentation).
I have a Python script which looks something like this:
def test_run():
    global files_dir
    for f1 in os.listdir(files_dir):
        for f2 in os.listdir(files_dir):
            os.system("run program x on f1 and f2")
What is the best way to run each of the os.system calls on a different processor? Using subprocess or a multiprocessing pool?
NOTE: each run of the program will generate an output file.
#unutbu's answer is fine, but there's a less disruptive way to do it: use a Pool to pass out tasks. Then you don't have to muck with your own queues. For example,
import os

NUM_CPUS = None  # defaults to all available

def worker(f1, f2):
    os.system("run program x on f1 and f2")

def test_run(pool):
    filelist = os.listdir(files_dir)
    for f1 in filelist:
        for f2 in filelist:
            pool.apply_async(worker, args=(f1, f2))

if __name__ == "__main__":
    import multiprocessing as mp
    pool = mp.Pool(NUM_CPUS)
    test_run(pool)
    pool.close()
    pool.join()
That "looks a lot more like" the code you started with. Not that this is necessarily a good thing ;-)
In a recent version of Python 3, Pool objects can also be used as context managers, so the tail end could be reduced to:
if __name__ == "__main__":
    import multiprocessing as mp
    with mp.Pool(NUM_CPUS) as pool:
        test_run(pool)
EDIT: using concurrent.futures instead
For very simple tasks like this, Python 3's concurrent.futures can be easier to use. Replace the code in the above, from test_run() on down, like so:
def test_run():
    import concurrent.futures as cf
    filelist = os.listdir(files_dir)
    with cf.ProcessPoolExecutor(NUM_CPUS) as pp:
        for f1 in filelist:
            for f2 in filelist:
                pp.submit(worker, f1, f2)

if __name__ == "__main__":
    test_run()
It needs to be fancier if you don't want exceptions in worker processes to vanish silently. That's a potential problem with all parallelism gimmicks. The problem is that there's usually no good way to raise exceptions in the main program, since they occur in contexts (worker processes) that may have nothing to do with what the main program is doing at the time. One way to get the exceptions (re)raised in the main program is to explicitly ask for the results; for example, change the above to:
def test_run():
    import concurrent.futures as cf
    filelist = os.listdir(files_dir)
    futures = []
    with cf.ProcessPoolExecutor(NUM_CPUS) as pp:
        for f1 in filelist:
            for f2 in filelist:
                futures.append(pp.submit(worker, f1, f2))
        for future in cf.as_completed(futures):
            future.result()
Then if an exception occurs in a worker process, the future.result() will re-raise that exception in the main program when it's applied to the Future object that represents the failing inter-process call.
Probably more than you wanted to know at this point ;-)
You could use a mixture of subprocess and multiprocessing.
Why both? If you just use subprocess naively, you would spawn as many subprocesses as there are tasks. You could easily have thousands of tasks, and spawning that many subprocesses all at once may bring your machine to its knees.
So you could instead use multiprocessing to spawn only as many worker processes as your machine has CPUs (mp.cpu_count()). Each worker process could then read tasks (pairs of filenames) from a Queue, and spawn a subprocess. The worker should then wait until the subprocess completes before processing another task from the Queue.
import os
import multiprocessing as mp
import itertools as IT
import subprocess

SENTINEL = None

def worker(queue):
    # read items from the queue and spawn subprocesses
    # The for-loop ends when queue.get() returns SENTINEL
    for f1, f2 in iter(queue.get, SENTINEL):
        proc = subprocess.Popen(['prog', f1, f2])
        proc.communicate()

def test_run(files_dir):
    # avoid globals when possible. Pass files_dir as an argument to the function
    # global files_dir
    queue = mp.Queue()

    # Setup worker processes. The workers will all read from the same queue.
    procs = [mp.Process(target=worker, args=[queue]) for i in range(mp.cpu_count())]
    for p in procs:
        p.start()

    # put items (tasks) in the queue
    files = os.listdir(files_dir)
    for f1, f2 in IT.product(files, repeat=2):
        queue.put((f1, f2))

    # Put sentinels in the queue to signal the worker processes to end
    for p in procs:
        queue.put(SENTINEL)

    for p in procs:
        p.join()
I have a Python script that runs a few external commands using the subprocess module. But one of these steps takes a very long time, so I would like to run it in parallel. I need to launch the parallel jobs, check that they have finished, and then execute the next command, which is not parallel.
My code is something like this:
nproc = 24

for i in xrange(nproc):
    #Run program in parallel

#Combine files generated by the parallel step
for i in xrange(nproc):
    handle = open('Niben_%s_structures' % (zfile_name), 'w')
    for i in xrange(nproc):
        for zline in open('Niben_%s_file%d_structures' % (zfile_name, i)):
            handle.write(zline)
    handle.close()

#Run next step
cmd = 'bowtie-build -f Niben_%s_precursors.fa bowtie-index/Niben_%s_precursors' % (zfile_name, zfile_name)
For your example, you just want to shell out in parallel - you don't need threads for that.
Use the Popen constructor in the subprocess module: http://docs.python.org/library/subprocess.htm
Collect the Popen instances for each process you spawned and then wait() for them to finish:
procs = []
for i in xrange(nproc):
    procs.append(subprocess.Popen(ARGS_GO_HERE))  # Run program in parallel

for p in procs:
    p.wait()
You can get away with this (as opposed to using the multiprocessing or threading modules), since you aren't really interested in having these interoperate - you just want the os to run them in parallel and be sure they are all finished when you go to combine the results...
Running things in parallel can also be implemented using multiple processes in Python. I wrote a blog post on this topic a while ago; you can find it here:
http://multicodecjukebox.blogspot.de/2010/11/parallelizing-multiprocessing-commands.html
Basically, the idea is to use "worker processes" which independently retrieve jobs from a queue and then complete these jobs.
Works quite well in my experience.
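For completeness, a minimal sketch of that worker-process pattern (run_job is a placeholder for whatever a single job actually does in your case):

import multiprocessing as mp

def run_job(job):
    # placeholder for the real work, e.g. launching an external command
    print("running", job)

def worker(queue):
    # keep pulling jobs until the None sentinel arrives
    for job in iter(queue.get, None):
        run_job(job)

if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()

    for i in range(24):
        queue.put("job-%d" % i)

    for _ in workers:
        queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()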
You can do it using threads. Here is a very short (and untested) example, with a very ugly if-else for what you are actually doing in the thread, but you can write your own worker classes.
import threading

class Worker(threading.Thread):
    def __init__(self, i):
        super(Worker, self).__init__()
        self._i = i

    def run(self):
        if self._i == 1:
            self.result = do_this()
        elif self._i == 2:
            self.result = do_that()

threads = []
nproc = 24

for i in xrange(nproc):
    #Run program in parallel
    w = Worker(i)
    threads.append(w)
    w.start()

for w in threads:
    w.join()
# ...now all threads are done

#Combine files generated by the parallel step
for i in xrange(nproc):
    handle = open('Niben_%s_structures' % (zfile_name), 'w')
    ...etc...