Correct way to process giant lines with Multiprocessing - python

I'm in a bit of a dilemma about how to process a file with a huge number of lines using multiprocessing. The plan is to send a request for each line. The file contains a giant list of subdomains, around 1-10 million lines. Although the file size is under 1 GB, I'm worried that my implementation is going to exhaust my memory.
I'm using Queue() to distribute the tasks to the processes. Here's what I do:
from multiprocessing import Queue

tasker = Queue()
with open(processor, 'r') as f:
    for line in f:
        tasker.put(line.strip())
The multiprocessing run eventually slows down, so it must be because the Queue() is growing faster than the processes can drain it. As an alternative, I used the itertools.islice() function to produce a small batch of tasks and consume it immediately.
import re
from itertools import islice
from multiprocessing import Queue

tasker = Queue()
with open('file.txt', 'r') as f:
    for line in f:
        liner = [line] + list(islice(f, 4))
        for i in liner:
            tasker.put(str(re.sub('\n', '', i.strip())))
Finally, the multiprocessing runs in reasonable time without slowing down, but it creates a list of 5 items (the current line plus 4 from islice) whose contents are then appended to the Queue() until the list is empty. So it's kind of putting items from one list into another list, which is weird, I know.
Is there a better way to handle this without putting too much stress on memory?

The two code snippets produce equivalent output; the second one is just slower, and it only fixes the problem by "stalling the CPU".
The correct way to limit memory usage is to bound the queue itself: Queue() accepts a maximum size as an argument.
from multiprocessing import Queue
max_size = 1000
tasker = Queue(max_size)
If the main process tries to put more tasks into a full queue, it will block until the workers finish some of the work. This is fast and efficient in terms of CPU and memory usage: while the main process is blocked, its CPU core is available to the workers, so more work gets done without consuming more memory.
If you don't want the main process to block because it has other things to do, you can have a thread in the main process put tasks into the queue, using the threading module, as sketched below.
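A minimal sketch of that feeder-thread idea (feeder is a made-up name, 'file.txt' is a placeholder, and tasker is the bounded queue from above):

from threading import Thread

def feeder(path, queue):
    with open(path, 'r') as f:
        for line in f:
            queue.put(line.strip())   # blocks whenever the queue is full

# daemon=True so the feeder never keeps the program alive on its own
Thread(target=feeder, args=('file.txt', tasker), daemon=True).start()
# the main process is now free to do other work while the queue fills up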
If you want to reduce the overhead of locking the queue, you should submit work in chunks.
from itertools import islice
from multiprocessing import Queue

max_size = 200  # note: 200 * 5 = 1000
tasker = Queue(max_size)

with open('file.txt', 'r') as f:
    for line in f:
        liner = [line] + list(islice(f, 4))
        liner = [x.strip() for x in liner]
        tasker.put(liner)  # put the list of 5 items in the queue
The consumer is then expected to loop over the list it gets from the queue and perform each task in the received list independently.
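A possible consumer sketch (handle() stands in for the real per-subdomain work, and the None sentinel is just one optional way to tell workers to stop):

def worker(queue):
    while True:
        chunk = queue.get()            # a list of up to 5 stripped lines
        if chunk is None:              # optional sentinel: stop this worker
            break
        for task in chunk:
            handle(task)               # e.g. send the request for this subdomain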

Related

What is the safest way to queue multiple threads originating in a loop?

My script loops through each line of an input file and performs some actions using the string in each line. Since the tasks performed on each line are independent of each other, I decided to separate the task into threads so that the script doesn't have to wait for the task to complete to continue with the loop. The code is given below.
import threading

def myFunction(line, param):
    # Doing something with line and param
    # Sends multiple HTTP requests, parses the responses and produces output
    # Returns nothing
    pass

param = arg[1]
threads = []
with open(targets, "r") as listfile:
    for line in listfile:
        print("Starting a thread for: ", line)
        t = threading.Thread(target=myFunction, args=(line, param,))
        threads.append(t)
        t.start()
I realized that this is a bad idea as the number of lines in the input file grew large. With this code, there would be as many threads as the number of lines. Researched a bit and figured that queues would be the way.
I want to understand the optimal way of using queues for this scenario and if there are any alternatives which I can use.
To get around this problem, you can use the concept of a thread pool, where you define a fixed number of threads/workers, for example 5 workers, and whenever a thread finishes executing, another submitted future takes its place automatically.
Example:
import concurrent.futures

def myFunction(line, param):
    print("Done with :", line, param)

param = "param_example"
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = []
    with open("targets", "r") as listfile:
        for line in listfile:
            print("Starting a thread for: ", line)
            futures.append(executor.submit(myFunction, line=line, param=param))
    # waiting for the threads to finish and maybe print a result:
    for future in concurrent.futures.as_completed(futures):
        print(future.result())  # an Exception should be handled here!
Queues are one way to do it. The way to use them is to put function parameters on a queue, and use threads to get them and do the processing.
The queue size doesn't matter too much in this case because reading the next line is fast. In other cases, a more optimized solution would be to set the queue size to at least twice the number of threads; that way, if all the threads finish processing an item from the queue at the same time, they will all have the next item ready to be processed.
To avoid complicating the code, the threads can be set as daemonic so that they don't stop the program from finishing after the processing is done. They will be terminated when the main process finishes.
The alternative is to put a special sentinel item on the queue (like None) for each thread, make the threads exit after getting it from the queue, and then join the threads (a sketch of this variant follows the first example below).
For the examples below, the number of worker threads is set using the workers variable.
Here is an example of a solution using a queue.
from queue import Queue
from threading import Thread

queue = Queue(workers * 2)

def work():
    while True:
        myFunction(*queue.get())
        queue.task_done()

for _ in range(workers):
    Thread(target=work, daemon=True).start()

with open(targets, 'r') as listfile:
    for line in listfile:
        queue.put((line, param))
queue.join()
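The sentinel alternative mentioned above might look roughly like this (a sketch with the same assumptions: myFunction, param, targets and workers defined elsewhere):

from queue import Queue
from threading import Thread

queue = Queue(workers * 2)

def work():
    while True:
        item = queue.get()
        if item is None:              # sentinel: this worker is done
            break
        myFunction(*item)

threads = [Thread(target=work) for _ in range(workers)]
for t in threads:
    t.start()

with open(targets, 'r') as listfile:
    for line in listfile:
        queue.put((line, param))

for _ in threads:
    queue.put(None)                   # one sentinel per worker
for t in threads:
    t.join()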
A simpler solution might be using ThreadPoolExecutor. It is especially simple in this case because the function being called doesn't return anything that needs to be used in the main thread.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=workers) as executor:
    with open(targets, 'r') as listfile:
        for line in listfile:
            executor.submit(myFunction, line, param)
Also, if it's not a problem to have all lines stored in memory, there is a solution which doesn't use anything other than threads. The work is split in such a way that the threads read some lines from a list and ignore other lines. A simple example with two threads is where one thread reads odd lines and the other reads even lines.
from threading import Thread

with open(targets, 'r') as listfile:
    lines = listfile.readlines()

def work_split(n):
    for line in lines[n::workers]:
        myFunction(line, param)

threads = []
for n in range(workers):
    t = Thread(target=work_split, args=(n,))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
I have done a quick benchmark and the Queue is slightly faster than the ThreadPoolExecutor, but the solution with the split work is faster than both.
From the code you have posted, using threads makes no sense.
That's because there are no I/O operations, so the threads execute essentially one after another rather than concurrently. The GIL (Global Interpreter Lock) is never released by a thread in this case, so the application only appears to be multithreaded; in reality the interpreter uses a single CPU core and runs one thread at a time.
This way you gain no advantage from using threads; on the contrary, you can see a performance degradation in this scenario due to context switching and the thread initialization overhead when each thread starts.
The only way to get better performance in this scenario, if applicable, is a multiprocess program. But pay attention to the number of processes you start, and remember that every process has its own interpreter.
It was a good answer by GitFront. This answer just adds one more option using the multiprocessing package.
Using concurrent.futures or multiprocessing depends on particular requirements. Multiprocessing has a lot more options comparatively but for the given question the results should be near identical in the simplest case.
from multiprocessing import cpu_count, Pool

PROCESSES = cpu_count()  # Warning: uses all cores

def pool_method(listfile, param):
    p = Pool(processes=PROCESSES)
    checker = [p.apply_async(myFunction, (line, param)) for line in listfile]
    ...
There are various other methods besides apply_async, but this should work well for your needs.
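The elided part would presumably collect the results and shut the pool down; a hedged sketch of what that could look like inside pool_method:

    # inside pool_method, after building `checker` (a sketch, not the author's code):
    p.close()                              # no more tasks will be submitted
    p.join()                               # wait for the workers to finish
    return [c.get() for c in checker]      # .get() re-raises any exception from myFunction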

Why is reading multiple files at the same time slower than reading sequentially?

I am trying to parse many files found in a directory, however using multiprocessing slows my program.
# Calling my parsing function from Client.
L = getParsedFiles('/home/tony/Lab/slicedFiles')  # <-- 1000 .txt files found here, ~100MB combined
Following this example from python documentation:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
I've written this piece of code:
from multiprocessing import Pool
from api.ttypes import *
import gc
import os

def _parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y = [int(v) for v in s]
            obj = CoresetPoint(x, y)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return Points(myList)

def getParsedFiles(pathToFile):
    myList = []
    p = Pool(2)
    for filename in os.listdir(pathToFile):
        if filename.endswith(".txt"):
            myList.append(filename)
    return p.map(_parse, myList)
I followed the example: I put the names of all the files ending in .txt into a list, created a Pool, and mapped it over my function. I want to return a list of objects, where each object holds the parsed data of one file. However, it amazes me that I got the following results:
#Pool 32 ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2 ---> ~130(s)
Machine specification:
62.8 GiB RAM
Intel® Core™ i7-6850K CPU @ 3.60GHz × 12
What am I missing here?
Thanks in advance!
Looks like you're I/O bound:
In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.
You probably need to have your main thread do the reading and hand the data to the pool as a subprocess becomes available. This is different from using map.
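One way to sketch that idea (an illustration, not part of the answer: it reuses the question's CoresetPoint and Points types and a hypothetical _parse_text helper): the main process reads each file and only the CPU-bound parsing is handed to the pool.

from multiprocessing import Pool
import os

def _parse_text(text):
    # must live at module level so the pool can pickle a reference to it
    return Points([CoresetPoint(*map(int, line.split())) for line in text.splitlines()])

def getParsedFilesAsync(directory):
    pool = Pool(2)
    results = []
    for name in os.listdir(directory):
        if name.endswith(".txt"):
            with open(os.path.join(directory, name)) as f:
                data = f.read()                                      # all disk I/O stays in the main process
            results.append(pool.apply_async(_parse_text, (data,)))   # only CPU work goes to the pool
    pool.close()
    pool.join()
    return [r.get() for r in results]

Note that shipping the raw text to the workers has its own pickling cost, so this only pays off if the parsing itself is the expensive part.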
As you are processing a line at a time, and the inputs are split, you can use fileinput to iterate over lines of multiple files, and map to a function processing lines instead of files:
Passing one line at a time might be too slow, so we can ask map to pass chunks, and can adjust until we find a sweet-spot. Our function parses chunks of lines:
def _parse_coreset_points(lines):
    return Points([_parse_coreset_point(line) for line in lines])

def _parse_coreset_point(line):
    s = line.split()
    x, y = [int(v) for v in s]
    return CoresetPoint(x, y)
And our main function:
import fileinput

def getParsedFiles(directory):
    pool = Pool(2)
    txts = [filename for filename in os.listdir(directory)
            if filename.endswith(".txt")]
    return pool.imap(_parse_coreset_points, fileinput.input(txts), chunksize=100)
In general it is never a good idea to read from the same physical (spinning) hard disk from different threads simultaneously, because every switch causes an extra delay of around 10ms to position the read head of the hard disk (would be different on SSD).
As #peter-wood already said, it is better to have one thread reading in the data, and have other threads processing that data.
Also, to really test the difference, I think you should do the test with some bigger files. For example: current hard disks should be able to read around 100MB/sec. So reading the data of a 100kB file in one go would take 1ms, while positioning the read head to the beginning of that file would take 10ms.
On the other hand, looking at your numbers (assuming those are for a single loop) it is hard to believe that being I/O bound is the only problem here. Total data is 100MB, which should take 1 second to read from disk plus some overhead, but your program takes 130 seconds. I don't know if that number is with the files cold on disk, or an average of multiple tests where the data is already cached by the OS (with 62 GB of RAM all that data should be cached the second time) - it would be interesting to see both numbers.
So there has to be something else. Let's take a closer look at your loop:
for line in f:
    s = line.split()
    x, y = [int(v) for v in s]
    obj = CoresetPoint(x, y)
    gc.disable()
    myList.append(obj)
    gc.enable()
While I don't know Python, my guess would be that the gc calls are the problem here. They are called for every line read from disk. I don't know how expensive those calls are (or whether gc.enable() triggers a garbage collection, for example), or why they would be needed around append(obj) only, but there might be other problems because this is multithreading:
Assuming the gc object is global (i.e. not thread local) you could have something like this:
thread 1 : gc.disable()
# switch to thread 2
thread 2 : gc.disable()
thread 2 : myList.append(obj)
thread 2 : gc.enable()
# gc now enabled!
# switch back to thread 1 (or one of the other threads)
thread 1 : myList.append(obj)
thread 1 : gc.enable()
And if the number of threads <= number of cores, there wouldn't even be any switching, they would all be calling this at the same time.
Also, if the gc object is thread-safe (it would be worse if it isn't), it would have to do some locking in order to safely alter its internal state, which would force all other threads to wait.
For example, gc.disable() would look something like this:
def disable():
    lock()  # all other threads are blocked for gc calls now
    alter internal data
    unlock()
And because gc.disable() and gc.enable() are called in a tight loop, this will hurt performance when using multiple threads.
So it would be better to remove those calls, or place them at the beginning and end of your program if they are really needed (or only disable gc at the beginning, no need to do gc right before quitting the program).
Depending on the way Python copies or moves objects, it might also be slightly better to use myList.append(CoresetPoint(x, y)).
So it would be interesting to test the same on one 100MB file with one thread and without the gc calls.
If the processing takes longer than the reading (i.e. not I/O bound), use one thread to read the data in a buffer (should take 1 or 2 seconds on one 100MB file if not already cached), and multiple threads to process the data (but still without those gc calls in that tight loop).
You don't have to split the data into multiple files in order to be able to use threads. Just let them process different parts of the same file (even with the 14GB file).
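For example, a rough sketch of splitting one file by byte ranges among threads (FILENAME and WORKERS are placeholders; each thread handles only the lines that start inside its own range):

import os
from threading import Thread

FILENAME = "bigfile.txt"    # placeholder
WORKERS = 4

def process_range(start, end):
    with open(FILENAME, "rb") as f:
        if start:
            f.seek(start - 1)
            f.readline()              # advance to the next full line boundary
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            # process `line` here; a line straddling `end` belongs to this worker

size = os.path.getsize(FILENAME)
chunk = size // WORKERS
threads = []
for i in range(WORKERS):
    start = i * chunk
    end = size if i == WORKERS - 1 else (i + 1) * chunk
    t = Thread(target=process_range, args=(start, end))
    t.start()
    threads.append(t)
for t in threads:
    t.join()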
A copy-paste snippet, for people who come from Google and don't like reading
Example is for json reading, just replace __single_json_loader with another file type to work with that.
from multiprocessing import Pool
from typing import Callable, Any, Iterable
import os
import json

def parallel_file_read(existing_file_paths: Iterable[str], map_lambda: Callable[[str], Any]):
    result = {p: None for p in existing_file_paths}
    pool = Pool()
    for i, (temp_result, path) in enumerate(zip(pool.imap(map_lambda, existing_file_paths), result.keys())):
        result[path] = temp_result
    pool.close()
    pool.join()
    return result

def __single_json_loader(f_path: str):
    with open(f_path, "r") as f:
        return json.load(f)

def parallel_json_read(existing_file_paths: Iterable[str]):
    combined_result = parallel_file_read(existing_file_paths, __single_json_loader)
    return combined_result
And usage
if __name__ == "__main__":
    def main():
        directory_path = r"/path/to/my/file/directory"
        assert os.path.isdir(directory_path)
        d: os.DirEntry
        all_files_names = [f for f in os.listdir(directory_path)]
        all_files_paths = [os.path.join(directory_path, f_name) for f_name in all_files_names]
        assert(all(os.path.isfile(p) for p in all_files_paths))
        combined_result = parallel_json_read(all_files_paths)
    main()
It's very straightforward to replace the json reader with any other reader, and you're done.
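For instance, a plain-text reader could be swapped in like this (just an illustrative sketch):

def __single_text_loader(f_path: str):
    with open(f_path, "r") as f:
        return f.read()

def parallel_text_read(existing_file_paths: Iterable[str]):
    return parallel_file_read(existing_file_paths, __single_text_loader)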

How can I optimize processes with a Pool or Queue in large batch processing?

I'm trying to execute a function on every line of a CSV file as fast as possible. My code works, but I know it could be faster if I make better use of the multiprocessing library.
import csv
from multiprocessing import Process

processes = []

def execute_task(task_details):
    # work is done here, may take 1 second, may take 10
    # send output to another function
    ...

with open('twentyThousandLines.csv', 'rb') as file:
    r = csv.reader(file)
    for row in r:
        p = Process(target=execute_task, args=(row,))
        processes.append(p)
        p.start()

for p in processes:
    p.join()
I'm thinking I should put the tasks into a Queue and process them with a Pool but all the examples make it seem like Queue doesn't work the way I assume, and that I can't map a Pool to an ever expanding Queue.
I've done something similar using a Pool of workers.
import csv
from multiprocessing import Pool, cpu_count

def initializer(arg1, arg2):
    # Do something to initialize (if necessary)
    pass

def process_csv_data(data):
    # Do something with the data
    pass

pool = Pool(cpu_count(), initializer=initializer, initargs=(arg1, arg2))
with open("csv_data_file.csv", "rb") as f:
    csv_obj = csv.reader(f)
    for row in csv_obj:
        pool.apply_async(process_csv_data, (row,))
However, as pvg commented under your question, you might want to consider how to batch your data. Going row by row may not be the right level of granularity.
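For example, a hedged sketch of batching with the same setup as above (process_csv_batch and BATCH_SIZE are made-up names; the right batch size needs profiling): submit chunks of rows instead of single rows to reduce per-task overhead.

from itertools import islice

BATCH_SIZE = 500    # made-up value, tune by profiling

def process_csv_batch(rows):
    for row in rows:
        process_csv_data(row)

with open("csv_data_file.csv", "rb") as f:
    csv_obj = csv.reader(f)
    while True:
        batch = list(islice(csv_obj, BATCH_SIZE))
        if not batch:
            break
        pool.apply_async(process_csv_batch, (batch,))
pool.close()
pool.join()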
You might also want to profile/test to figure out the bottle-neck. For example, if disk access is limiting you, you might not benefit from parallelizing.
multiprocessing.Queue is a means of exchanging objects among processes, so it's not something you'd put a task into.
It looks to me like you are actually trying to speed up
def check(row):
    # do the checking
    return (row, result_of_check)

with open('twentyThousandLines.csv', 'rb') as file:
    r = csv.reader(file)
    for row, result in map(check, r):
        print(row, result)
which can be done with
#from multiprocessing import Pool  # if CPU-bound (but even then not always)
from multiprocessing.dummy import Pool  # if IO-bound

def check(row):
    # do the checking
    return (row, result_of_check)

if __name__ == "__main__":  # in case you are using processes on Windows
    with open('twentyThousandLines.csv', 'rb') as file:
        r = csv.reader(file)
        with Pool() as p:  # before python 3.3 you should do close() and join() explicitly
            for row, result in p.imap_unordered(check, r, chunksize=10):  # just a guess - experiment a bit to find the best value
                print(row, result)
Creating processes takes some time (especially on Windows), so in most cases using threads via multiprocessing.dummy is faster (and multiprocessing is not totally trivial anyway - see the programming guidelines in the docs).

Python's multiprocessing is not creating tasks in parallel

I am learning about multithreading in Python using the multiprocessing library. For that purpose, I tried to create a program that divides a big file into several smaller chunks. First I read all the data from that file, then I create worker tasks that each take a segment of the data and write that segment to a file. I expect to have as many parallel tasks running as there are segments, but that does not happen: I see at most two tasks, and the program terminates after that. What mistake am I making? The code is given below.
import multiprocessing

def worker(segment, x):
    fname = getFileName(x)
    writeToFile(segment, fname)

if __name__ == '__main__':
    with open(fname) as f:
        lines = f.readlines()
    jobs = []
    for x in range(0, numberOfSegments):
        segment = getSegment(x, lines)
        jobs.append(multiprocessing.Process(target=worker, args=(segment, x)))
        jobs[len(jobs)-1].start()
    for p in jobs:
        p.join
Process gives you one additional process (which, with your main process, gives you two). The call to join at the end of each loop will wait for that process to finish before starting the next loop. If you insist on using Process, you'll need to store the returned processes (probably in a list), and join every process in a loop after your current loop.
You want the Pool class from multiprocessing (https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool)
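A minimal Pool-based sketch, assuming the question's worker, getSegment, fname and numberOfSegments, and Python 3.3+ (for the pool context manager and starmap):

from multiprocessing import Pool

if __name__ == '__main__':
    with open(fname) as f:
        lines = f.readlines()
    with Pool() as pool:    # defaults to one worker per CPU core
        pool.starmap(worker, [(getSegment(x, lines), x) for x in range(numberOfSegments)])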

How to parse a large file taking advantage of threading in Python?

I have a huge file and need to read and process it.
with open(source_filename) as source, open(target_filename) as target:
    for line in source:
        target.write(do_something(line))
    do_something_else()
Can this be accelerated with threads? If I spawn a thread per line, will this have a huge overhead cost?
edit: To make this question not a discussion, how should the code look?
with open(source_filename) as source, open(target_filename) as target:
    ?
@Nicoretti: In each iteration I need to read a line of several KB of data.
update 2: the file may be a bz2, so Python may have to wait for unpacking:
$ bzip2 -d country.osm.bz2 | ./my_script.py
You could use three threads: for reading, processing and writing. The possible advantage is that the processing can take place while waiting for I/O, but you need to take some timings yourself to see if there is an actual benefit in your situation.
import threading
import Queue

QUEUE_SIZE = 1000
sentinel = object()

def read_file(name, queue):
    with open(name) as f:
        for line in f:
            queue.put(line)
    queue.put(sentinel)

def process(inqueue, outqueue):
    for line in iter(inqueue.get, sentinel):
        outqueue.put(do_something(line))
    outqueue.put(sentinel)

def write_file(name, queue):
    with open(name, "w") as f:
        for line in iter(queue.get, sentinel):
            f.write(line)

inq = Queue.Queue(maxsize=QUEUE_SIZE)
outq = Queue.Queue(maxsize=QUEUE_SIZE)

threading.Thread(target=read_file, args=(source_filename, inq)).start()
threading.Thread(target=process, args=(inq, outq)).start()
write_file(target_filename, outq)
It is a good idea to set a maxsize for the queues to prevent ever-increasing memory consumption. The value of 1000 is an arbitrary choice on my part.
Does the processing stage take a relatively long time, i.e., is it CPU-intensive? If not, then no, you don't win much by threading or multiprocessing it. If your processing is expensive, then yes. So you need to profile to know for sure.
If you spend relatively more time reading the file (i.e. it is big) than processing it, then you can't win in performance by using threads; the bottleneck is just the I/O, which threads don't improve.
This is the exact sort of thing which you should not try to analyse a priori, but instead should profile.
Bear in mind that threading will only help if the per-line processing is heavy. An alternative strategy would be to slurp the whole file into memory, and process it in memory, which may well obviate threading.
Whether you have a thread per line is, once again, something for fine-tuning, but my guess is that unless parsing the lines is pretty heavy, you may want to use a fixed number of worker threads.
There is another alternative: spawn sub-processes, and have them do the reading, and the processing. Given your description of the problem, I would expect this to give you the greatest speed-up. You could even use some sort of in-memory caching system to speed up the reading, such as memcached (or any of the similar-ish systems out there, or even a relational database).
In CPython, threading is limited by the global interpreter lock — only one thread at a time can actually be executing Python code. So threading only benefits you if either:
you are doing processing that doesn't require the global interpreter lock; or
you are spending time blocked on I/O.
Examples of (1) include applying a filter to an image in the Python Imaging Library, or finding the eigenvalues of a matrix in numpy. Examples of (2) include waiting for user input, or waiting for a network connection to finish sending data.
So whether your code can be accelerated using threads in CPython depends on what exactly you are doing in the do_something call. (If you are parsing the line in pure Python, it is very unlikely that you can speed this up by launching threads.) You should also note that if you do start launching threads, you will face a synchronization problem when you write the results to the target file: there is no guarantee that the threads will complete in the same order they were started, so you will have to take care to ensure that the output comes out in the right order.
Here's a maximally threaded implementation that has threads for reading the input, writing the output, and one thread for processing each line. Only testing will tell you if this is faster or slower than the single-threaded version (or Janne's version with only three threads).
from threading import Thread
from Queue import Queue

def process_file(f, source_filename, target_filename):
    """
    Apply the function `f` to each line of `source_filename` and write
    the results to `target_filename`. Each call to `f` is evaluated in
    a separate thread.
    """
    worker_queue = Queue()
    finished = object()

    def process(queue, line):
        "Process `line` and put the result on `queue`."
        queue.put(f(line))

    def read():
        """
        Read `source_filename`, create an output queue and a worker
        thread for every line, and put that worker's output queue onto
        `worker_queue`.
        """
        with open(source_filename) as source:
            for line in source:
                queue = Queue()
                Thread(target=process, args=(queue, line)).start()
                worker_queue.put(queue)
        worker_queue.put(finished)

    Thread(target=read).start()
    with open(target_filename, 'w') as target:
        for output in iter(worker_queue.get, finished):
            target.write(output.get())
