Fastest way to process large files in Python

Fastest way to process large files in Python - python

We have a about 500GB of images in various directories we need to process. Each image is about 4MB in size and we have a python script to process each image one at a time (it reads metadata and stores it in a database). Each directory can take 1-4 hours to process depending on size.
We have at our disposal a 2.2Ghz quad core processor and 16GB of RAM on a GNU/Linux OS. The current script is utilizing only one processor. What's the best way to take advantage of the other cores and RAM to process images faster? Will starting multiple Python processes to run the script take advantage of the other cores?
Another option is to use something like Gearman or Beanstalk to farm out the work to other machines. I've taken a look at the multiprocessing library but not sure how I can utilize it.

Will starting multiple Python processes to run the script take advantage of the other cores?
Yes, it will, if the task is CPU-bound. This is probably the easiest option. However, don't spawn a single process per file or per directory; consider using a tool such as parallel(1) and let it spawn something like two processes per core.
Another option is to use something like Gearman or Beanstalk to farm out the work to other machines.
That might work. Also, have a look at the Python binding for ZeroMQ, it makes distributed processing pretty easy.
I've taken a look at the multiprocessing library but not sure how I can utilize it.
Define a function, say process, that reads the images in a single directory, connects to the database and stores the metadata. Let it return a boolean indicating success or failure. Let directories be the list of directories to process. Then
import multiprocessing
pool = multiprocessing.Pool(multiprocessing.cpu_count())
success = all(pool.imap_unordered(process, directories))
will process all the directories in parallel. You can also do the parallelism at the file-level if you want; that needs just a bit more tinkering.
Note that this will stop at the first failure; making it fault-tolerant takes a bit more work.

Starting independent Python processes is ideal. There will be no lock contentions between the processes, and the OS will schedule them to run concurrently.
You may want to experiment to see what the ideal number of instances is - it may be more or less than the number of cores. There will be contention for the disk and cache memory, but on the other hand you may get one process to run while another is waiting for I/O.

You can use pool of multiprocessing to create processes for increasing performance. Let's say, you have a function handle_file which is for processing image. If you use iteration, it can only use at most 100% of one your core. To utilize multiple cores, Pool multiprocessing creates subprocesses for you, and it distributes your task to them. Here is an example:
import os
import multiprocessing
def handle_file(path):
print 'Do something to handle file ...', path
def run_multiprocess():
tasks = []
for filename in os.listdir('.'):
tasks.append(filename)
print 'Create task', filename
pool = multiprocessing.Pool(8)
result = all(list(pool.imap_unordered(handle_file, tasks)))
print 'Finished, result=', result
def run_one_process():
for filename in os.listdir('.'):
handle_file(filename)
if __name__ == '__main__':
run_one_process
run_multiprocess()
The run_one_process is single core way to process data, simple, but slow. On the other hand, run_multiprocess creates 8 worker processes, and distribute tasks to them. It would be about 8 times faster if you have 8 cores. I suggest you set the worker number to double of your cores or exactly the number of your cores. You can try it and see which configuration is faster.
For advanced distributed computing, you can use ZeroMQ as larsmans mentioned. It's hard to understand at first. But once you understand it, you can design a very efficient distributed system to process your data. In your case, I think one REQ with multiple REP would be good enough.
Hope this would be helpful.

See the answer to this question.
If the app can process ranges of input data, then you can launch 4
instances of the app with different ranges of input data to process
and the combine the results after they are all done.
Even though that question looks to be Windows specific, it applies to single threaded programs on all operating system.
WARNING: Beware that this process will be I/O bound and too much concurrent access to your hard drive will actually cause the processes as a group to execute slower than sequential processing because of contention for the I/O resource.

If you are reading a large number of files and saving metadata to a database you program does not need more cores.
Your process is likely IO bound not CPU bound. Using twisted with proper defereds and callbacks would likely outperform any solution that sought to enlist 4 cores.

I think in this scenario it would make perfectly sense to use Celery.

Related

Multiprocessing with Multithreading? How do I make this more efficient?

I have an interesting problem on my hands. I have access to a 128 CPU ec2 instance. I need to run a program that accepts a 10 million row csv, and sends a request to a DB for each row in that csv to augment the existing data in the csv. In order to speed this up, I use:
executor = concurrent.futures.ProcessPoolExecutor(len(chunks))
futures = [executor.submit(<func_name>, chnk) for chnk in chunks]
successes = concurrent.futures.wait(futures)
I chunk up the 10 million row csv into 128 portions and then use futures to spin up 128 processes (+1 for the main one, so total 129). Each process takes a chunk, and retrieves the records for its chunk and spits the output into a file. At the end of the process, I merge all the files together and voila.
I have a few questions about this.
is this the most efficient way to do this?
by creating 128 subprocesses, am I really using the 128 CPUs of the machine?
would multithreading be better/more efficient?
can I multithread on each CPU?
advice on what to read up on?
Thanks in advance!

Is this most efficient?
Hard to tell without profiling. There's always a bottleneck somewhere. For example if you are cpu limited, and the algorithm can't be made more efficient, that's probably a hard limit. If you're storage bandwidth limited, and you're already using efficient read/write caching (typically handled by the OS or by low level drivers), that's probably a hard limit.
Are all cores of the machine actually used?
(Assuming python is running on a single physical machine, and you mean individual cores of one cpu) Yes, python's mp.Process creates a new OS level process with a single thread which is then assigned to execute for a given amount of time on a physical core by the OS's scheduler. Scheduling algorithms are typically quite good, so if you have an equal number of busy threads as logical cores, the OS will keep all the cores busy.
Would threads be better?
Not likely. Python is not thread safe, so it must only allow a single thread per process run at a time. There are specific exceptions to this when a function is written in c or c++, and calls the python macro: Py_BEGIN_ALLOW_THREADS though this is not extremely common. If most of your time is spent in such functions, threads will actually be allowed to run concurrently, and will have less overhead compared to processes. Threads also share memory, making passing results back after completion easier (threads can simply modify some global state rather than passing results via a queue or similar).
multithreading on each CPU?
Again, I think what you probably have is a single CPU with 128 cores.. The OS scheduler decides which threads should run on each core at any given time. Unless the threads are releasing the GIL, only one thread from each process can run at a time. For example running 128 processes each with 8 threads would result in 1024 threads, but still only 128 of them could ever run at a time, so the extra threads would only add overhead.
what to read up on?
When you want to make code fast, you need to be profiling. Profiling for parallel processing is more challenging, and profiling for a remote / virtualized computer can sometimes be challenging as well. It is not always obvious what is making a particular piece of code slow, and the only way to be sure is to test it. Also look into the tools you're using. I'm specifically thinking about the database you're using, because most database software has had a great deal of work put into optimization, but you must use it in the correct way to get the most speed out of it. Batched requests come to mind rather than accessing a single row at a time.

Multiprocessing in python with an unknown number of processors

This is probably a simple question, but after reading through documentation, blogs, and googling for a couple days, I haven't found a straightforward answer.
When using the multiprocessing module (https://docs.python.org/3/library/multiprocessing.html) in python, does the module distribute the work evenly between the number of given processors/cores?
More specifically, if I am doing development work on my local machine with four processors, and I write a function that uses multiprocessing to execute six functions, do three or four of them run in parallel and then the others run after something has finished? And, when I deploy it to production with six processors, do all six of those run in parallel?
I am trying to understand how much I need to direct the multiprocessing library. I have seen no direction in code samples, so I am assuming its handled. I want to be sure I can safely use this in multiple environments.
EDIT
After some comments, I wanted to clarify. I may be misunderstanding something.
I have several different functions I want to run at the same time. I want each of those functions to run on its own core. Speed is very important. My question is: "If I have five functions, and only four cores, how is this handled?"
Thank you.

The short answer is, if you don't specify a number of processes the default will be to spawn as many processes as your machine has cores, as indicated by multiprocessing.cpu_count().
The long answer is that it depends on how you are creating the subprocesses...
If you create a Pool object and then use that with a map or starmap or similar function, that will create "cpu_count" number of processes as described above. Or you can use the processes argument to specify a different number of subprocesses to spawn. The map function will then distribute the work to those processes.
with multiprocessing.Pool(processes=N) as pool:
rets = pool.map(func, args)
How the work is distributed by the map function can be a little complicated and you're best off reading the docs in detail if you're performance driven enough that you really care about chunking etc etc.
There are also other libraries that can help manage parallel processing at a higher level and have lots of options, such as Joblib and parmap. Again, best to read the docs.
If you specifically want to launch a number of processes equal to the number of jobs you have and don't care that it might be more than the number of cpus in the machine. You can use the Process object instead of the Pool object. This interface parallels the way the threading library can be used for concurrency.
i.e.
jobs = []
for _ in range(num_jobs):
job = multiprocessing.Process(target=func, args=args)
job.start()
jobs.append(job)
# wait for them all to finish
for job in jobs:
job.join()
Consider the above example pseudocode. You won't be able to copy paste that and expect it to work. Unless you're launching multiple instances of the same function with the same arguments of course.

Why python does (not) use more CPUs? [duplicate]

I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.

The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

Write into mongodb after processing files

Say, I have 10,000 files and I'm going to use python code to extract some information from these files and write these information into mongodb. All of these operations will be done in a single local computer.
Will I be able to use python's multiprocessing package to do all the operations in parallel to speed things up? Is there any better way to do it?

Yes, multiprocessing will help distribute the workload. If you have, say, 8 cores on your machine, then roughly 8 Python processes will probably maximize throughput. More than that, and contention on disk I/O, or competition with the MongoDB server for CPU time, will become a bottleneck. The "multiprocessing.Pool.map" function is particularly elegant for dividing work among processes:
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers

Does Python support multithreading? Can it speed up execution time?

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.