Parallelising / scheduling a Python function call on many files - python

I have a few hundred thousand CSV files that I would like to apply the same function to. Something like the following dummy function:
import pandas as pd

def process_single_file(fname):
    df = pd.read_csv(fname)
    # Pandas and non-pandas processing
    df.to_csv(f"./output/{fname}")
As looping over all files sequentially would take too long, my question is: what is the most efficient way to schedule and parallelize this execution? None of the tasks depend on each other. I started off trying to use Python's multiprocessing:
import glob
import multiprocessing

files = sorted(glob.glob("./input/*.csv"))

processes = []
for fname in files:
    p = multiprocessing.Process(target=process_single_file, args=(fname,))
    processes.append(p)
    p.start()

for process in processes:
    process.join()
My computer, however, doesn't like this approach: it quickly overloads all CPUs, leading to slow-downs and crashes. Is there a more efficient way to limit the load on the CPUs and schedule the tasks, for example using Dask, a Bash script, or changing the Python code? Thanks in advance.

It really depends on where your bottleneck is: are you spending more time reading/writing files, or doing CPU-heavy processing?
This RealPython tutorial really helped me a lot learning about all this stuff, I can only recommend a good read ;)
As explained in the tutorial, if your bottleneck is I/O, multithreading is enough (and possibly better than multiprocessing):
import concurrent.futures

def process_all_files(files):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(process_single_file, files)
And if your bottleneck is CPU, multiprocessing will let you use all your available cores:
import multiprocessing

def process_all_files(files):
    with multiprocessing.Pool() as pool:
        pool.map(process_single_file, files)
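With a few hundred thousand small files, it can also help to hand filenames to the workers in batches rather than one at a time. A minimal sketch of that idea, assuming the process_single_file function from the question is defined above (the chunksize value is an arbitrary starting point):
import glob
import multiprocessing

def process_all_files(files):
    with multiprocessing.Pool() as pool:
        # chunksize ships batches of filenames to each worker,
        # which cuts down on inter-process communication overhead
        for _ in pool.imap_unordered(process_single_file, files, chunksize=100):
            pass

if __name__ == "__main__":
    files = sorted(glob.glob("./input/*.csv"))
    process_all_files(files)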

You can try Ray; it is quite an efficient module for parallelizing tasks.
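A minimal sketch of what that could look like for the file-processing question above, assuming Ray is installed (pip install ray) and that process_single_file from the question is defined; the remote-function name is just illustrative:
import glob

import ray

ray.init()  # start Ray on the local machine

@ray.remote
def process_remote(fname):
    # illustrative wrapper around the question's per-file function
    process_single_file(fname)

if __name__ == "__main__":
    files = sorted(glob.glob("./input/*.csv"))
    futures = [process_remote.remote(f) for f in files]
    ray.get(futures)  # block until every file has been processed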

Absolutely, Pool is the way to go.
Something along the lines of the below:
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)
    print(pool.map(f, range(10)))
Check the following post:
Using multiprocessing.Process with a maximum number of simultaneous processes
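One way to keep raw Process objects while capping how many run at once, sketched against the original file-processing loop (the limit of 4 is arbitrary and process_single_file is the question's function):
import glob
import multiprocessing

def worker(semaphore, fname):
    try:
        process_single_file(fname)  # the question's per-file function
    finally:
        semaphore.release()  # free the slot even if processing fails

if __name__ == "__main__":
    files = sorted(glob.glob("./input/*.csv"))
    semaphore = multiprocessing.BoundedSemaphore(4)  # at most 4 processes at once
    processes = []
    for fname in files:
        semaphore.acquire()  # blocks until a slot is free
        p = multiprocessing.Process(target=worker, args=(semaphore, fname))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()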

Related

Python call multiple functions in parallel and combine result

I have to write a job that performs different types of analysis on a given document. I know I can do it sequentially, i.e., call each parser one by one.
A very high-level script structure is given below:
def summarize(doc):
    pass

def LengthCount(doc):
    pass

def LanguageFinder(doc):
    pass

def ProfanityFinder(doc):
    pass

if __name__ == '__main__':
    doc = "Some document"
    smry = summarize(doc)
    length = LengthCount(doc)
    lang = LanguageFinder(doc)
    profanity = ProfanityFinder(doc)
    # Save summary, length, language, profanity information in the database
But to improve performance, I think these tasks can be run in parallel. How can I do that? What are the possible ways to do this in Python, especially in 3.x? It is quite possible that one parser (module) takes more time than another, but overall, if they could run in parallel, it would improve performance. Lastly, if this is not possible in Python, any other language is also welcome.
In Python you have a few options for concurrency/parallelism. There is the threading module which allows you to execute code in multiple logical threads and the multiprocessing module which allows you to spawn multiple processes. There is also the concurrent.futures module that provides an API into both of these mechanisms.
If your process is CPU-bound (i.e. you are running at 100% of the CPU available to Python throughout - note this is not 100% CPU if you have a multi-core or hyper-threading machine) you are unlikely to see much benefit from threading, as this doesn't actually use multiple CPU threads in parallel; it just allows one thread to take over from another whilst the first is waiting for IO. Multiprocessing is likely to be more useful for you, as this allows you to run on multiple CPU cores. You can start each of your functions in its own process using the Process class:
import multiprocessing
#function defs here
p = multiprocessing.Process(target=LengthCount, args=(doc,))
p.start()
# repeat for other processes
You will need to tweak your code so that the functions write their results to a shared variable (or straight to your database) rather than return them directly, so you can access them once each process is complete.
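A minimal sketch of that idea with a process pool, where each parser runs in its own process and the parent collects the return values (the parser bodies are placeholders standing in for the question's real ones):
import multiprocessing

def summarize(doc):
    return "summary placeholder"

def LengthCount(doc):
    return len(doc)

def LanguageFinder(doc):
    return "en"  # placeholder

def ProfanityFinder(doc):
    return []    # placeholder

if __name__ == '__main__':
    doc = "Some document"
    parsers = [summarize, LengthCount, LanguageFinder, ProfanityFinder]
    with multiprocessing.Pool(processes=len(parsers)) as pool:
        # each parser is submitted to its own worker process
        async_results = [pool.apply_async(parser, (doc,)) for parser in parsers]
        smry, length, lang, profanity = [r.get() for r in async_results]
    # save summary, length, language, profanity information in the database here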

Pandas DataFrame Multithreading No Performance Gain

I have an in-memory dictionary data with ~10,000 keys, where each key represents a stock ticker and the value stores a pandas DataFrame of daily stock price time series. I am trying to calculate the pairwise Pearson correlation.
The code takes a long time (~3 hr) to iterate through all the combinations, O(n^2) ~ C(10000, 2). I tried the multiprocessing.dummy package but saw no performance gain AT ALL (it actually got slower as the number of workers increased).
from multiprocessing.dummy import Pool
from scipy.stats import pearsonr

def calculate_correlation(pair):
    t1, t2 = pair
    # pseudo code here
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

todos = []
for idx, t1 in enumerate(list(data.keys())):
    for t2 in list(data.keys())[idx:]:  # only the top triangle of the matrix
        todos.append((t1, t2))

pool = Pool(4)
results = pool.map(calculate_correlation, todos)
pool.close()
pool.join()
All the data has been loaded into memory so it should not be IO intensive. Is there any reason that why there is no performance gain at all?
When you use multiprocessing.dummy, you're using threads, not processes. For a CPU-bound application in Python, you are usually not going to get a performance boost from multi-threading. You should use multi-processing instead to parallelize your code in Python. So, if you change your code from
from multiprocessing.dummy import Pool
to
from multiprocessing import Pool
This should substantially improve your performance.
The above will fix your problem, but if you want to know why this happened, please continue reading:
Multi-threading in Python is subject to the Global Interpreter Lock (GIL), which prevents two threads in the same process from executing Python bytecode at the same time. If you had a lot of disk IO happening, multi-threading would have helped, because threads release the GIL while they wait on IO. Likewise, if your Python code spent most of its time calling into a separate application or library that releases the GIL, multi-threading would have helped. Multi-processing, on the other hand, runs separate processes, each with its own interpreter and its own GIL. In a CPU-bound Python application such as yours, using multi-processing instead of multi-threading lets your application run on several cores in parallel, which will boost its performance.
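Putting the fix together, a minimal sketch of the process-based version; it assumes scipy's pearsonr as in the question and a fork-based start method (e.g. Linux), so the worker processes can see the data dict that was built before the pool was created:
from itertools import combinations
from multiprocessing import Pool

from scipy.stats import pearsonr

data = {}  # ticker -> DataFrame, populated before the pool is created

def calculate_correlation(pair):
    t1, t2 = pair
    return pearsonr(data[t1]['Close'], data[t2]['Close'])

if __name__ == '__main__':
    todos = list(combinations(data.keys(), 2))  # top triangle of the matrix
    with Pool(4) as pool:
        # chunksize keeps the IPC overhead small for ~50M pairs
        results = pool.map(calculate_correlation, todos, chunksize=1000)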

Python Multiprocessing: Detecting available threads on multicore processors

At the risk of adding to the queue of multiprocessing questions: is there a way to detect the number of available threads per CPU core, similar to multiprocessing.cpu_count()? I have a main() function being asynchronously called from a pool that has one process per available core (the default behaviour if processes=None).
import multiprocessing

items = [2, 4, 6, 8, 10]
pool = multiprocessing.Pool(processes=args.cores)
results = [pool.apply_async(main, (item,), {"threads": 1}) for item in items]
However, I would like each of the main() calls to take advantage of all available threads by setting the threads arg explicitly.
Is there a way to do this, or would it be too platform/system specific? Perhaps it could be done with the mysterious from multiprocessing.pool import ThreadPool, by combining with the threading module, or some other way?
Any direction is appreciated, thanks!
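One possible direction, as a hedged sketch: multiprocessing.cpu_count() already counts logical CPUs (hardware threads), and the third-party psutil package can count physical cores, so a threads-per-core figure can be derived from the two (this assumes psutil is installed and can detect the physical core count on your platform):
import multiprocessing

import psutil  # third-party; assumed installed for this sketch

logical = multiprocessing.cpu_count()       # hardware threads (same as os.cpu_count())
physical = psutil.cpu_count(logical=False)  # physical cores; may be None on some platforms
threads_per_core = logical // physical if physical else 1
print(f"{physical} cores x {threads_per_core} threads = {logical} logical CPUs")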

Python ThreadPool from multiprocessing.pool cannot ultilize all CPUs

I have a string-processing job in Python, and I wish to speed it up by using a thread pool. The string-processing tasks have no dependency on each other. The results will be stored in a MongoDB database.
I wrote my code as follows:
import multiprocessing
from multiprocessing.pool import ThreadPool

def _process(s):
    # Do stuff: pure Python string manipulation.
    # Save the output to a database (PyMongo).
    pass

thread_pool_size = multiprocessing.cpu_count()
pool = ThreadPool(thread_pool_size)
for single_string in string_list:
    pool.apply_async(_process, [single_string])
pool.close()
pool.join()
I ran the code on a Linux machine with 8 CPU cores, and it turns out that the maximum CPU usage is only around 130% (read from top) after running the job for a few minutes.
Is my approach correct to use a thread pool? Is there any better way to do so?
You might try using multiple processes instead of multiple threads. Here is a good comparison of both options. In one of the comments it is stated that Python is not able to use multiple CPUs while working with multiple threads (due to the Global Interpreter Lock). So instead of using a thread pool you should use a process pool to take full advantage of your machine.
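A minimal sketch of that swap, keeping the shape of the question's code (the body of _process is a placeholder for the real string manipulation):
import multiprocessing

def _process(s):
    # placeholder for the pure-Python string manipulation;
    # the real version would also write its result to MongoDB
    return s.upper()

if __name__ == '__main__':
    string_list = ["example string"] * 10000  # placeholder input
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        results = pool.map(_process, string_list, chunksize=100)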
Perhaps _process isn't CPU bound; it might be slowed by the file system or network if you're writing to a database. You could see if the CPU usage rises if you make your process truly CPU bound, for example:
def _process(s):
    for i in range(100000000):  # xrange in Python 2
        j = i * i

how to use multiprocessing.Pool in python

I need to run the same function on the same data a lot of times.
For this, I am using multiprocessing.Pool in order to speed up the computation.
from multiprocessing import Pool
import numpy as np

x = np.array([1, 2, 3, 4, 5])

def func(x):  # this should be a function that takes 3 minutes
    m = np.mean(x)
    return m

p = Pool(100)
mapper = p.map(func, [x] * 500)
The program works well, but at the end I have 100 Python processes open and my whole system starts to slow down.
How can I solve this?
Am I using Pool in the wrong way? Should I use another function?
EDIT: using p = Pool(multiprocessing.cpu_count()), will my PC use 100% of its power?
Or is there something else I should use?
In addition to limiting yourself to
p = Pool(multiprocessing.cpu_count())
I believe you want to do the following when you're finished as well...
p.close()
This should shut down the worker processes once they have completed.
As a general rule, you don't want many more pool workers than you have CPU cores, because your computer can't parallelize the work beyond the number of cores available to actually do the processing. It doesn't matter if you've got 100 processes when your CPU can only run four things simultaneously. A common practice is to do this:
p = Pool(multiprocessing.cpu_count())
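Putting both points together, a minimal sketch of the question's job with a pool sized to the machine and shut down cleanly afterwards (names follow the question):
import multiprocessing
import numpy as np

def func(x):  # stand-in for the function that takes 3 minutes
    return np.mean(x)

if __name__ == '__main__':
    x = np.array([1, 2, 3, 4, 5])
    p = multiprocessing.Pool(multiprocessing.cpu_count())  # one worker per core
    try:
        mapper = p.map(func, [x] * 500)
    finally:
        p.close()  # stop accepting new tasks
        p.join()   # wait for the worker processes to exit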
