How to read / process large files in parallel with Python

How to read / process large files in parallel with Python - python

I have a large file almost 20GB, more than 20 mln lines and each line represents separate serialized JSON.
Reading file line by line as a regular loop and performing manipulation on line data takes a lot of time.
Is there any state of art approach or best practices for reading large files in parallel with smaller chunks in order to make processing faster?
I'm using Python 3.6.X

Unfortunately, no. Reading in files and operating on the lines read (such as json parsing or computation) is a CPU-bound operation, so there's no clever asyncio tactics to speed it up. In theory one could utilize multiprocessing and multiple cores to read and process in parallel, but having multiple threads reading the same file is bound to cause major problems. Because your file is so large, storing it all in memory and then parallelizing the computation is also going to be difficult.
Your best bet would be to head this problem off at the pass by partitioning the data (if possible) into multiple files, which could then open up safer doors to parallelism with multiple cores. Sorry there isn't a better answer AFAIK.

There are several possibilites, but first profile your code in find the bottlenecks. Maybe your processing does some slows things which can be speed up - which would be vastly preferable to multiprocessing.
If that does not help, you could try:
Use another file format. Reading serialized json from text is not the fastest operation in the world. So you could store your data (for example in hdf5) which could speed up processing.
Implement multiple worker processes which can read portions of the file (worker1 reads lines 0 - 1million, worker2 1million - 2million etc). You can orchestrate that with joblib or celery, depending on your needs. Integrating the results is the challenge, there you have to see what your needs are (map-reduce style?). This is more difficult in python due to no real threading than in other languages, so maybe you could switch the language for that.

The main bottleneck you need to be aware of here is neither disk nor CPU, it is memory, not how much you have, but how the hardware and OS work together to pre-fetch pages from RAM into the L{1,2,3} caches.
The parallel approach will have worse performance than the serial approach if you use readline() to load one line at a time. The reason has to do with hardware+OS, not software. When the cpu requires a memory read, a certain amount of extra data is fetched in to the Lx caches of the CPU in anticipation that this data might be required later. When you employ the serial approach, this extra data is in fact used while it is still in the cache. But when parallel readlines() are happening, the extra data is preempted before there is a chance to use it. Hence it has to be fetched again later. This has a huge impact on performance.
The way to make the parallel approach beat the performance of the serial readline() approach is to have your parallel processes read more than one line at a time into memory. Use read() instead of readline(). How many bytes should you read? Approximately the size of your L1 cache, about 64K. With this, several pages of contiguous memory can be loaded into that cache.
However, if you replace serial readline() with serial read() this will outperform parallel read(). Why? Because although each core has its own L1 cache, the cores are sharing the other L caches and therefore you're running into the same problem. Process 1 will pre-fetch to populate the caches, but before it has time to process it all, Process 2 takes over and replaces the contents of the cache. Therefore Process 1 will have to fetch the same data again later.
The performance difference between serial and parallel can be easily seen:
First make sure the whole file is in the page cache (free -m: buff/cache). By doing this you are totally removing the disk/block device layer from the equation. The whole file is in RAM.
Then run a simple serial code and time it.
Then run a simple parallel code using Process() and time it. On commodity systems, your serial reads will outperform your parallel reads.
This is contrary to what you expect, right? But you have to think about the assumptions you were making. Your assumptions didn't include the fact that there are caches and memory buses being shared between multiple cores. This is precisely where the bottleneck is located. We know disk isn't the bottleneck because we have loaded the entire file into page cache in RAM. We know CPU isn't the bottleneck because we are using multiprocessing.Process() which ensures simultaneous execution, one process per core (vmstat 1, top -d 1 %1).
The only way you can have better performance from the parallel approach is to make sure that you are running hardware with separate NuMA nodes where each core has its own memory bus and its own caches.
As a side note, contrary to what other answers have claimed, Python can certainly do true computational multiprocessing, where you put a 100% utilization on each core at the same time. This is done using multiprocessing.Process().
Thread Pools will not work for this because the threads are tied to a single core and are restricted by python's GIL. But multiprocessing.Process() is not restricted by the GIL and certainly does work.
Another side note: it is bad practice to try to load your entire input file into your heap. You never know if the file is larger than can fit into RAM which will cause an oom. So don't try doing this in an attempt to optimize the performance of your code.

Related

Optimizing number of threads in Python

I am writing a program that analyses csv files in a directory, initially one file at a time. This could be several hundred files, but all of them are relatively small. My main runtime limitation was I/O, so I turned to multithreading using the threading library, which is a first for me.
I created a thread for each function call, following this guide, where each function call opens a csv in the desired directory. As a result, I have a list of threads for each file (i.e. hundreds of threads). However, my program still ran slowly, with the bulk of its time spent method 'acquire' of '_thread.lock' objects according to cProfile. I believe that this is because of the large number of threads resulting in lots of threads waiting for others to finish their tasks - is this correct?
How would you recommend I resolve this? My current idea is to split my list of files into equally sized chunks and to assign a thread to each chunk, rather than a thread to each file, and for each thread to iterate through the files in each chunk.

Python has something called the Global Interpreter Lock which seriously hurts your performance with that many threads, as each one is waiting to hold the "interpreter lock." I would recommend using Processes which if I remember are similar to Python thread objects in their use but do not suffer the same performance penalty of waiting for a lock. A thread and a process are different, but for your application, it sounds like it should not matter.
It is worth noting that the GIL can be released when performing I/O such as reading from a file, and therefore using threads might be fine - you just need to use fewer of them. In fact, with the number of threads/processes you are looking to create it might be a better idea to use a fixed pool of workers.

Multiprocessing with Multithreading? How do I make this more efficient?

I have an interesting problem on my hands. I have access to a 128 CPU ec2 instance. I need to run a program that accepts a 10 million row csv, and sends a request to a DB for each row in that csv to augment the existing data in the csv. In order to speed this up, I use:
executor = concurrent.futures.ProcessPoolExecutor(len(chunks))
futures = [executor.submit(<func_name>, chnk) for chnk in chunks]
successes = concurrent.futures.wait(futures)
I chunk up the 10 million row csv into 128 portions and then use futures to spin up 128 processes (+1 for the main one, so total 129). Each process takes a chunk, and retrieves the records for its chunk and spits the output into a file. At the end of the process, I merge all the files together and voila.
I have a few questions about this.
is this the most efficient way to do this?
by creating 128 subprocesses, am I really using the 128 CPUs of the machine?
would multithreading be better/more efficient?
can I multithread on each CPU?
advice on what to read up on?
Thanks in advance!

Is this most efficient?
Hard to tell without profiling. There's always a bottleneck somewhere. For example if you are cpu limited, and the algorithm can't be made more efficient, that's probably a hard limit. If you're storage bandwidth limited, and you're already using efficient read/write caching (typically handled by the OS or by low level drivers), that's probably a hard limit.
Are all cores of the machine actually used?
(Assuming python is running on a single physical machine, and you mean individual cores of one cpu) Yes, python's mp.Process creates a new OS level process with a single thread which is then assigned to execute for a given amount of time on a physical core by the OS's scheduler. Scheduling algorithms are typically quite good, so if you have an equal number of busy threads as logical cores, the OS will keep all the cores busy.
Would threads be better?
Not likely. Python is not thread safe, so it must only allow a single thread per process run at a time. There are specific exceptions to this when a function is written in c or c++, and calls the python macro: Py_BEGIN_ALLOW_THREADS though this is not extremely common. If most of your time is spent in such functions, threads will actually be allowed to run concurrently, and will have less overhead compared to processes. Threads also share memory, making passing results back after completion easier (threads can simply modify some global state rather than passing results via a queue or similar).
multithreading on each CPU?
Again, I think what you probably have is a single CPU with 128 cores.. The OS scheduler decides which threads should run on each core at any given time. Unless the threads are releasing the GIL, only one thread from each process can run at a time. For example running 128 processes each with 8 threads would result in 1024 threads, but still only 128 of them could ever run at a time, so the extra threads would only add overhead.
what to read up on?
When you want to make code fast, you need to be profiling. Profiling for parallel processing is more challenging, and profiling for a remote / virtualized computer can sometimes be challenging as well. It is not always obvious what is making a particular piece of code slow, and the only way to be sure is to test it. Also look into the tools you're using. I'm specifically thinking about the database you're using, because most database software has had a great deal of work put into optimization, but you must use it in the correct way to get the most speed out of it. Batched requests come to mind rather than accessing a single row at a time.

Write into mongodb after processing files

Say, I have 10,000 files and I'm going to use python code to extract some information from these files and write these information into mongodb. All of these operations will be done in a single local computer.
Will I be able to use python's multiprocessing package to do all the operations in parallel to speed things up? Is there any better way to do it?

Yes, multiprocessing will help distribute the workload. If you have, say, 8 cores on your machine, then roughly 8 Python processes will probably maximize throughput. More than that, and contention on disk I/O, or competition with the MongoDB server for CPU time, will become a bottleneck. The "multiprocessing.Pool.map" function is particularly elegant for dividing work among processes:
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers

High memory usage only when multiprocessing

I am trying to use python's multiprocessing library to hopefully gain some performance. Specifically I am using its map function. Now, for some reason when I swap it out with its single processed counterpart I don't get high memory usage. But using the multiprocessing version of map causes my memory to go through the roof. For the record I am doing something which can easily hog up loads of memory, but what would the difference be between the two to cause such a stark difference?

You realize that multiprocessing does not use threads, yes? I say this because you mention a "single threaded counterpart".
Are you sending a lot of data through multiprocessing's map? A likely cause is the serialization multiprocessing has to do in many cases. multiprocessing uses pickle, which does typically take up more memory than the data it's pickling. (In some cases, specifically on systems with fork() where new processes are created when you call the map method, it can avoid the serialization, but whenever it needs to send new data to existing process it cannot do so.)
Since with multiprocessing all of the actual work is done in separate processes, the memory of your main process should not be affected by the actual operations you perform. The total use of memory does go up by quite a bit, however, because each worker process has a copy of the data you sent across. This is sometimes copy-on-write memory (in the same cases as not serializing) on systems that have CoW, but Python's use of memory is such that this quickly becomes written to, and thus copied.

python program choice

My program is ICAPServer (similar with httpserver), it's main job is to receive data from clients and save the data to DB.
There are two main steps and two threads:
ICAPServer receives data from clients, puts the data in a queue (50kb <1ms);
another thread pops data from the queue, and writes them to DB SO, if 2nd step is too slow, the queue will fill up memory with those data.
Wondering if anyone have any suggestion...

It is hard to say for sure, but perhaps using two processes instead of threads will help in this situation. Since Python has the Global Interpreter Lock (GIL), it has the effect of only allowing any one thread to execute Python instructions at any time.
Having a system designed around processes might have the following advantages:
Higher concurrency, especially on multiprocessor machines
Greater throughput, since you can probably spawn multiple queue consumers / DB writer processes to spread out the work. Although, the impact of this might be minimal if it is really the DB that is the bottleneck and not the process writing to the DB.

One note: before going for optimizations, it is very important to get some good measurement, and profiling.
That said, I would bet the slow part in the second step is database communication; you could try to analyze the SQL statement and its execution plan. and then optimize it (it is one of the features of SQLAlchemy); if still it would be too slow, check about database optimizations.
Of course, it is possible the bottleneck would be in a completely different place; in this case, you still have chances to optimize using C code, dedicated network, or more threads - just to give three possible example of completely different kind of optimizations.
Another point: as I/O operations usually release the GIL, you could also try to improve performance just by adding another reader thread - and I think this could be a much cheaper solution.

Put an upper limit on the amount of data in the queue?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.