I have a script that uses multiprocessing to open and perform calculations on ~200k .csv files. Here's the workflow:
1) Consider a folder with ~200k .csv files. Each .csv file contains the following:
.csv file example:
0, 1
2, 3
4, 5
...
~500 rows
2) The script stores the paths of all .csv files in a list().
3) The script divides the list of ~200k .csv files into 8 sublists, since I have 8 processors available.
4) The script calls do_something_with_csv() in 8 processes, which perform the calculations in parallel.
When run sequentially, execution takes around 4 minutes.
Whether run in parallel or sequentially, the first execution of the script takes much longer; the second, third, etc. executions take around 1 minute. It seems like Python is caching the I/O operations somehow. I notice it because I have a progress bar: if, for example, I run until the progress bar reaches 5k/200k and terminate the program, the next execution goes through the first 5k files very quickly and then slows down.
Python version: 3.6.1
Pseudo Python code:
import csv
from multiprocessing import Manager, Process


def multiproc_dispatch():
    lst_of_all_csv_files = get_list_of_files('/path_to_csv_files')
    divided_lst_of_all_csv_files = split_list_chunks(lst_of_all_csv_files, 8)

    manager = Manager()
    shared_dict = manager.dict()

    jobs = []
    for lst_of_csv_files in divided_lst_of_all_csv_files:
        p = Process(target=do_something_with_csv, args=(shared_dict, lst_of_csv_files))
        jobs.append(p)
        p.start()

    # Wait for the workers to finish
    for job in jobs:
        job.join()


def read_csv_file(csv_file):
    lst_a = []
    lst_b = []
    with open(csv_file, 'r') as f_read:
        csv_reader = csv.reader(f_read, delimiter=',')
        for row in csv_reader:
            lst_a.append(float(row[0]))
            lst_b.append(float(row[1]))
    return lst_a, lst_b


def do_something_with_csv(shared_dict, lst_of_csv_files):
    # Build the results in a local dict, then push them to the shared dict once.
    temp_dict = {}
    for csv_file in lst_of_csv_files:
        lst_a, lst_b = read_csv_file(csv_file)
        temp_dict[csv_file] = (lst_a, lst_b)
    shared_dict.update(temp_dict)


if __name__ == '__main__':
    multiproc_dispatch()
This is without a doubt the OS's RAM cache coming into play: loading your files is faster the second time because the data is already in RAM and is not coming from disk. (I'm struggling to find good references here; any help welcome.)
This has nothing to do with multiprocessing, or even with Python itself.
Irrelevant since the question was edited: I think the extra time taken by your code when run in parallel comes from your shared_dict variable, which is accessed from within each subprocess (see e.g. here). Creating and sending data between processes in Python is slow and should be reduced to a minimum (here you could return one dict per job and then merge them).
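A minimal sketch of that last idea (reusing the read_csv_file, get_list_of_files and split_list_chunks helpers from the question): each job builds and returns an ordinary dict, and the parent merges the results, so nothing is routed through a Manager proxy while the jobs run.

from multiprocessing import Pool


def do_something_with_csv(lst_of_csv_files):
    # Plain local dict; no shared state while the job runs.
    local_dict = {}
    for csv_file in lst_of_csv_files:
        lst_a, lst_b = read_csv_file(csv_file)
        local_dict[csv_file] = (lst_a, lst_b)
    return local_dict


def multiproc_dispatch():
    lst_of_all_csv_files = get_list_of_files('/path_to_csv_files')
    chunks = split_list_chunks(lst_of_all_csv_files, 8)
    with Pool(processes=8) as pool:
        per_job_dicts = pool.map(do_something_with_csv, chunks)
    # Merge the per-job results in the parent process.
    merged = {}
    for d in per_job_dicts:
        merged.update(d)
    return merged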
Related
I have a very large Python codebase. At its core is a function that takes a row of a DataFrame, applies some formulas, and saves the object it creates to disk with joblib. (The function below captures the essence of the script.)
import multiprocessing as multi

import joblib


def somefunct(DataFrame_row, some_parameter1, some_parameter2, sema):
    python_object = My_object(DataFrame_row['Column1'], DataFrame_row['Column2'])
    # For example, calculate an integral over My_object's data;
    # takes approx. 50-60 seconds per row.
    python_object.some_complicate_method(some_parameter1, some_parameter2)
    # Before trying a function that saves the object to disk, I tried a
    # function that saved the object back into the DataFrame.
    joblib.dump(python_object, path_save)
    sema.release()


def apply_all_data_frame(df, n_processes):
    sema = multi.Semaphore(n_processes)
    procesos_list = []
    for index, row in df.iterrows():
        sema.acquire()
        p = multi.Process(target=somefunct,
                          args=(row, some_parameter1, some_parameter2, sema))
        procesos_list.append(p)
        p.start()
    for proceso in procesos_list:
        proceso.join()
So, the DataFrame contains 5000 rows, and it may contain more in the future. I tested the script on a dataset with 100 rows, on a computer with 16 cores and 32 logical processors. I chose 30 processes; with 100 rows it uses all 30 processes (100% CPU) and finishes quickly. But when I try again with the full data, the computer only uses 3 or 4 processes (11% CPU), each using 2.0 GB of RAM, and it takes too long.
My first attempt was to use Pool and Pool.map, but that had the same problem: it filled the RAM and broke everything, despite using fewer processes (16, I think).
As noted in the comment in the script, my first version saved the object back into the DataFrame, but when I saw RAM usage hit 100% I decided to save the object to disk instead. In that case I also tried Pool and everything froze, because it created Python processes doing 0% work on the CPU.
I also tried the function without the Semaphore.
I apologize for my English and for the explanation; this is my first question online.
(Screenshot of how the computer's processes behave.)
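For reference, a minimal pool-based sketch of the dispatch above (My_object, some_complicate_method, joblib and the parameters are taken from the question; the per-row output path is a made-up illustration): a fixed-size Pool bounds how many worker processes exist at once, and each worker receives only the two column values it needs rather than the whole row.

import multiprocessing as multi

import joblib


def somefunct(task):
    col1, col2, some_parameter1, some_parameter2, path_save = task
    python_object = My_object(col1, col2)
    python_object.some_complicate_method(some_parameter1, some_parameter2)
    joblib.dump(python_object, path_save)


def apply_all_data_frame(df, n_processes, some_parameter1, some_parameter2):
    # One task per row, carrying only the values the worker needs.
    tasks = [(row['Column1'], row['Column2'], some_parameter1, some_parameter2,
              'object_{}.pkl'.format(index))    # hypothetical per-row output path
             for index, row in df.iterrows()]
    pool = multi.Pool(processes=n_processes)
    # imap_unordered keeps at most n_processes rows being worked on at a time.
    for _ in pool.imap_unordered(somefunct, tasks, chunksize=1):
        pass
    pool.close()
    pool.join()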
I have edited the code and it is currently working fine, but I think it is not executing in parallel or dynamically. Can anyone please check it?
Code:
import csv
import time
from functools import partial
from multiprocessing import Pool, freeze_support


def folderStatistic(t):
    j, dir_name = t
    row = []
    for content in dir_name.split(","):
        row.append(content)
    print(row)


def get_directories():
    with open('CONFIG.csv', 'r') as file:
        reader = csv.reader(file, delimiter='\t')
        return [col for row in reader for col in row]


def folderstatsMain():
    freeze_support()
    start = time.time()
    pool = Pool()
    worker = partial(folderStatistic)
    pool.map(worker, enumerate(get_directories()))


def datatobechecked():
    try:
        folderstatsMain()
    except Exception as e:
        # pass
        print(e)


if __name__ == '__main__':
    datatobechecked()
Config.CSV
C:\USERS, .CSV
C:\WINDOWS , .PDF
etc.
There may be around 200 folder paths in config.csv
Welcome to StackOverflow and the Python programming world!
Moving on to the question.
Inside the get_directories() function you open the file in a with context and get the reader object, but the file is closed the moment you leave the context, so by the time you use the reader object the file is already closed.
I don't want to discourage you, but if you are very new to programming, do not dive into parallel programming yet. The difficulty of handling multiple threads simultaneously grows exponentially with every thread you add (pools greatly simplify this, though). Processes are even worse, as they don't share memory and can't communicate with each other easily.
My advice: try to write it as a single-threaded program first. Once you have it working and still need to parallelize it, isolate a single function that takes an input file path as a parameter and does all the work, and then use a thread/process pool on that function, as in the sketch below.
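A minimal sketch of that pattern (the process_one_file name, the directory path, and the .txt filter are illustrative assumptions, not from your code):

from multiprocessing import Pool
import os


def process_one_file(file_path):
    # All the per-file work lives here; easy to test single-threaded first.
    with open(file_path, 'r') as f:
        return file_path, len(f.readlines())   # placeholder "work"


if __name__ == '__main__':
    directory = '/path/to/files'
    paths = [os.path.join(directory, name)
             for name in os.listdir(directory)
             if name.endswith('.txt')]

    # Single-threaded first:
    results = [process_one_file(p) for p in paths]

    # Then, if still needed, the same function through a pool:
    with Pool() as pool:
        results = pool.map(process_one_file, paths)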
EDIT:
From what I can understand from your code, you get directory names from the CSV file and then, for each "cell" in the file, you run folderStatistic in parallel. This part seems correct. The problem may lie in dir_name.split(","); notice that you pass individual "cells" to folderStatistic, not rows. What makes you think it's not running in parallel?
There is a certain amount of overhead in creating a multiprocessing pool because creating processes is, unlike creating threads, a fairly costly operation. Then those submitted tasks, represented by each element of the iterable being passed to the map method, are gathered up in "chunks" and written to a multiprocessing queue of tasks that are read by the pool processes. This data has to move from one address space to another and that has a cost associated with it. Finally when your worker function, folderStatistic, returns its result (which is None in this case), that data has to be moved from one process's address space back to the main process's address space and that too has a cost associated with it.
All of those added costs become worthwhile when your worker function is sufficiently CPU-intensive that these additional costs are small compared to the savings gained by having the tasks run in parallel. But your worker function's CPU requirements are too small to reap any benefit from multiprocessing.
Here is a demo comparing single-processing time vs. multiprocessing time for invoking a worker function, fn, twice: the first time it only performs its internal loop 10 times (low CPU requirements), while the second time it performs its internal loop 1,000,000 times (higher CPU requirements). You can see that in the first case the multiprocessing version runs considerably slower (you can't even measure the time for the single-processing run). But when we make fn more CPU-intensive, multiprocessing achieves gains over the single-processing case.
from multiprocessing import Pool
from functools import partial
import time


def fn(iterations, x):
    the_sum = x
    for _ in range(iterations):
        the_sum += x
    return the_sum


# required for Windows:
if __name__ == '__main__':
    for n_iterations in (10, 1_000_000):
        # single processing time:
        t1 = time.time()
        for x in range(1, 20):
            fn(n_iterations, x)
        t2 = time.time()

        # multiprocessing time:
        worker = partial(fn, n_iterations)
        t3 = time.time()
        with Pool() as p:
            results = p.map(worker, range(1, 20))
        t4 = time.time()

        print(f'#iterations = {n_iterations}, single processing time = {t2 - t1}, multiprocessing time = {t4 - t3}')
Prints:
#iterations = 10, single processing time = 0.0, multiprocessing time = 0.35399389266967773
#iterations = 1000000, single processing time = 1.182999849319458, multiprocessing time = 0.5530076026916504
But even with a pool size of 8, the running time is not reduced by a factor of 8 (it's more like a factor of 2) due to the fixed multiprocessing overhead. When I change the number of iterations for the second case to be 100,000,000 (even more CPU-intensive), we get ...
#iterations = 100000000, single processing time = 109.3077495098114, multiprocessing time = 27.202054023742676
... which is a reduction in running time by a factor of 4 (I have many other processes running on my computer, so there is competition for the CPU).
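As a side note on that fixed overhead: Pool.map (and imap) accept a chunksize argument that controls how many tasks are bundled into each transfer on the task queue, and for very cheap tasks a larger chunksize can trim the per-task IPC cost. A small self-contained sketch (the chunksize values are arbitrary and the timings will vary by machine):

from multiprocessing import Pool
from functools import partial
import time


def fn(iterations, x):
    the_sum = x
    for _ in range(iterations):
        the_sum += x
    return the_sum


if __name__ == '__main__':
    worker = partial(fn, 10)            # deliberately cheap tasks
    n_tasks = 100_000
    for chunksize in (1, 100):          # arbitrary values to compare
        t1 = time.time()
        with Pool() as p:
            p.map(worker, range(n_tasks), chunksize=chunksize)
        t2 = time.time()
        print(f'chunksize = {chunksize}, time = {t2 - t1}')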
I am new here but I wanted to ask something regarding multiprocessing.
So I have some huge raster tiles that I process to extract information, and I found that writing tons of pickle files is faster than appending to a DataFrame. The point is that I loop over each of my tiles for processing, and I create pools inside a for loop.
# This creates a directory for my pickle files
if not os.path.exists('pkl_tmp'):
    os.mkdir('pkl_tmp')
Here I start looping over each of my tiles and create a pool for the grid cells that I want to process; then I use the map function to apply all my nasty processing to each cell of my grid.
for GHSL_tile in ROI.iloc[4:].itertuples():
    ct += 1
    L18_cells = GHSL_query(GHSL_tile, L18_grid)
    vector_tile = poligonize_tile(GHSL_tile)
    print(datetime.today())
    subdir = './pkl_tmp/{}/'.format(ct)
    if not os.path.exists(subdir):
        os.mkdir(subdir)
    if vector_tile is not None:
        # assign how many cores will be used
        num_processes = int(multiprocessing.cpu_count() - 15)
        chunk_size = 1  # chunk size set to 1 to return cell-like outputs
        # break the dataframe into a list of chunks
        chunks = [L18_cells.iloc[i:i + chunk_size, :]
                  for i in range(0, L18_cells.shape[0], chunk_size)]
        pool = multiprocessing.Pool(processes=num_processes)
        result = pool.map(process_cell, chunks)
        del result
    else:
        print('Tile # {} skipped'.format(ct))

print('GHSL database created')
In fact this has no errors; it takes around 2 days to execute due to the size of my data, and sometimes I had many idle cores (especially towards the end of a tile).
My question is:
I tried using map_async instead of map and it was creating files really fast, sometimes even processing multiple tiles at the same time, which is wonderful. The problem is that once it creates the directory for my last tile, the code exits the for loop and many tasks end up never being executed. What am I doing wrong? How can I make map_async work better, or how can I avoid idle cores (the slowdown) when I use the map function?
Thank you in advance
PC resources are definitely not a problem.
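For reference, map_async returns immediately with an AsyncResult, so the loop (and the script) can move on before the submitted tasks have run unless that result is waited on. A toy sketch of the pattern, with process_cell standing in for the real per-cell work and the per-tile chunk lists reduced to ranges:

import multiprocessing


def process_cell(cell):
    # stand-in for the real per-cell processing
    return cell * cell


if __name__ == '__main__':
    tiles = [range(0, 10), range(10, 20)]     # stand-ins for per-tile chunk lists
    pending = []
    for chunks in tiles:
        pool = multiprocessing.Pool(processes=4)
        async_result = pool.map_async(process_cell, chunks)
        pool.close()                          # no more tasks for this pool
        pending.append((pool, async_result))  # keep a handle instead of dropping it

    # Without this final wait, the loop can finish (and the script can exit)
    # before the submitted tasks have actually been executed.
    for pool, async_result in pending:
        async_result.wait()
        pool.join()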
I am trying to parse many files found in a directory; however, using multiprocessing slows my program down.
# Calling my parsing function from Client.
L = getParsedFiles('/home/tony/Lab/slicedFiles')  # <--- 1000 .txt files found here, ~100MB combined
Following this example from the Python documentation:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
I've written this piece of code:
from multiprocessing import Pool
from api.ttypes import *
import gc
import os


def _parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y = [int(v) for v in s]
            obj = CoresetPoint(x, y)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return Points(myList)


def getParsedFiles(pathToFile):
    myList = []
    p = Pool(2)
    for filename in os.listdir(pathToFile):
        if filename.endswith(".txt"):
            myList.append(filename)
    return p.map(_parse, myList)
I followed the example: I put all the names of the files that end with .txt in a list, then created the Pool and mapped it to my function. I want to return a list of objects, where each object holds the parsed data of a file. However, it amazes me that I got the following results:
#Pool 32 ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2 ---> ~130(s)
Machine specification:
62.8 GiB RAM
Intel® Core™ i7-6850K CPU @ 3.60GHz × 12
What am I missing here?
Thanks in advance!
Looks like you're I/O bound:
In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.
You probably need to have your main thread do the reading and hand the data to the pool as a subprocess becomes available. This will be different from using map.
As you are processing a line at a time, and the inputs are split, you can use fileinput to iterate over lines of multiple files, and map to a function processing lines instead of files:
Passing one line at a time might be too slow, so we can batch the lines into chunks ourselves and adjust the chunk size until we find a sweet spot (note that imap's chunksize argument only batches the transport; the mapped function is still called once per item). Our function parses chunks of lines:
def _parse_coreset_points(lines):
    return Points([_parse_coreset_point(line) for line in lines])


def _parse_coreset_point(line):
    s = line.split()
    x, y = [int(v) for v in s]
    return CoresetPoint(x, y)
And our main function:
import fileinput
import os
from itertools import islice
from multiprocessing import Pool


def getParsedFiles(directory):
    pool = Pool(2)
    txts = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith(".txt")]
    lines = fileinput.input(txts)
    chunks = iter(lambda: list(islice(lines, 100)), [])  # lists of 100 lines each
    return pool.imap(_parse_coreset_points, chunks)
In general it is never a good idea to read from the same physical (spinning) hard disk from different threads simultaneously, because every switch causes an extra delay of around 10ms to position the read head of the hard disk (would be different on SSD).
As @peter-wood already said, it is better to have one thread reading in the data, and have other threads processing that data.
Also, to really test the difference, I think you should do the test with some bigger files. For example: current hard disks should be able to read around 100MB/sec. So reading the data of a 100kB file in one go would take 1ms, while positioning the read head to the beginning of that file would take 10ms.
On the other hand, looking at your numbers (assuming those are for a single loop), it is hard to believe that being I/O bound is the only problem here. The total data is 100MB, which should take about 1 second to read from disk plus some overhead, yet your program takes 130 seconds. I don't know if that number is with the files cold on disk, or an average of multiple tests where the data is already cached by the OS (with 62 GB of RAM, all that data should be cached the second time) - it would be interesting to see both numbers.
So there has to be something else. Let's take a closer look at your loop:
for line in f:
    s = line.split()
    x, y = [int(v) for v in s]
    obj = CoresetPoint(x, y)
    gc.disable()
    myList.append(obj)
    gc.enable()
While I don't know Python well, my guess would be that the gc calls are the problem here. They are called for every line read from disk. I don't know how expensive those calls are (or what happens if gc.enable() triggers a garbage collection, for example), or why they would be needed around append(obj) only, but there might be other problems because this is multithreading:
Assuming the gc object is global (i.e. not thread local) you could have something like this:
thread 1 : gc.disable()
# switch to thread 2
thread 2 : gc.disable()
thread 2 : myList.append(obj)
thread 2 : gc.enable()
# gc now enabled!
# switch back to thread 1 (or one of the other threads)
thread 1 : myList.append(obj)
thread 1 : gc.enable()
And if the number of threads <= number of cores, there wouldn't even be any switching, they would all be calling this at the same time.
Also, if the gc object is thread-safe (it would be worse if it isn't), it has to do some locking in order to safely alter its internal state, which would force all other threads to wait.
For example, gc.disable() would look something like this:
def disable():
    lock()  # all other threads are blocked for gc calls now
    alter internal data
    unlock()
And because gc.disable() and gc.enable() are called in a tight loop, this will hurt performance when using multiple threads.
So it would be better to remove those calls, or place them at the beginning and end of your program if they are really needed (or only disable gc at the beginning, no need to do gc right before quitting the program).
Depending on the way Python copies or moves objects, it might also be slightly better to use myList.append(CoresetPoint(x, y)).
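A minimal sketch of the parsing function with the gc calls hoisted out of the tight loop (CoresetPoint and Points as in the question; whether disabling gc helps at all here is an assumption worth measuring):

import gc

from api.ttypes import CoresetPoint, Points  # as in the question


def _parse(pathToFile):
    myList = []
    gc.disable()                      # once, before the tight loop
    try:
        with open(pathToFile) as f:
            for line in f:
                x, y = [int(v) for v in line.split()]
                myList.append(CoresetPoint(x, y))
    finally:
        gc.enable()                   # once, after the loop
    return Points(myList)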
So it would be interesting to test the same on one 100MB file with one thread and without the gc calls.
If the processing takes longer than the reading (i.e. not I/O bound), use one thread to read the data in a buffer (should take 1 or 2 seconds on one 100MB file if not already cached), and multiple threads to process the data (but still without those gc calls in that tight loop).
You don't have to split the data into multiple files in order to be able to use threads. Just let them process different parts of the same file (even with the 14GB file).
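A minimal sketch of that idea, using a process pool (rather than threads) to stay with multiprocessing: a single reader in the parent walks the file sequentially and hands batches of lines to the workers, so no two workers touch the disk at the same time (CoresetPoint and Points as in the question; the batch size and big_file.txt are placeholders):

from itertools import islice
from multiprocessing import Pool

from api.ttypes import CoresetPoint, Points  # as in the question


def parse_chunk(lines):
    return Points([CoresetPoint(*map(int, line.split())) for line in lines])


def read_in_chunks(path, chunk_lines=100000):
    # Single reader: yields lists of lines so the disk is read sequentially.
    with open(path) as f:
        while True:
            chunk = list(islice(f, chunk_lines))
            if not chunk:
                return
            yield chunk


if __name__ == '__main__':
    with Pool() as pool:
        results = list(pool.imap(parse_chunk, read_in_chunks('big_file.txt')))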
A copy-paste snippet, for people who come from Google and don't like reading
The example is for JSON reading; just replace __single_json_loader with a loader for another file type to work with that.
from multiprocessing import Pool
from typing import Callable, Any, Iterable
import os
import json


def parallel_file_read(existing_file_paths: Iterable[str], map_lambda: Callable[[str], Any]):
    result = {p: None for p in existing_file_paths}
    pool = Pool()
    for i, (temp_result, path) in enumerate(zip(pool.imap(map_lambda, existing_file_paths), result.keys())):
        result[path] = temp_result
    pool.close()
    pool.join()
    return result


def __single_json_loader(f_path: str):
    with open(f_path, "r") as f:
        return json.load(f)


def parallel_json_read(existing_file_paths: Iterable[str]):
    combined_result = parallel_file_read(existing_file_paths, __single_json_loader)
    return combined_result
And usage
if __name__ == "__main__":
    def main():
        directory_path = r"/path/to/my/file/directory"
        assert os.path.isdir(directory_path)

        all_files_names = [f for f in os.listdir(directory_path)]
        all_files_paths = [os.path.join(directory_path, f_name) for f_name in all_files_names]
        assert all(os.path.isfile(p) for p in all_files_paths)

        combined_result = parallel_json_read(all_files_paths)

    main()
It is very straightforward to replace the JSON reader with any other reader, and you're done.
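For example, a hypothetical CSV loader dropped into the same parallel_file_read helper (the __single_csv_loader and parallel_csv_read names are illustrations, not part of the snippet above):

import csv


def __single_csv_loader(f_path: str):
    # Return the rows of one CSV file as a list of lists.
    with open(f_path, "r", newline="") as f:
        return list(csv.reader(f))


def parallel_csv_read(existing_file_paths):
    return parallel_file_read(existing_file_paths, __single_csv_loader)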
I am learning about multithreading in Python using the multiprocessing library. To that end, I tried to create a program that divides a big file into several smaller chunks. So, first I read all the data from that file, and then I create worker tasks that each take a segment of the data from the input file and write that segment into a file. I expect to have as many parallel threads running as the number of segments, but that does not happen. I see a maximum of two tasks, and the program terminates after that. What mistake am I making? The code is given below.
import multiprocessing


def worker(segment, x):
    fname = getFileName(x)
    writeToFile(segment, fname)


if __name__ == '__main__':
    with open(fname) as f:
        lines = f.readlines()
    jobs = []
    for x in range(0, numberOfSegments):
        segment = getSegment(x, lines)
        jobs.append(multiprocessing.Process(target=worker, args=(segment, x)))
        jobs[len(jobs)-1].start()
        for p in jobs:
            p.join()
Process gives you one additional process (which, with your main process, gives you two). The call to join at the end of each loop iteration waits for that process to finish before the next iteration starts. If you insist on using Process, you'll need to store the started processes (probably in a list, as you do) and join every process in a loop after your current loop.
You want the Pool class from multiprocessing (https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool)
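A minimal sketch of the Pool version of the loop above (keeping the getSegment, getFileName, writeToFile and numberOfSegments names from the question, with input_fname as a placeholder for the big input file's path):

import multiprocessing


def worker(task):
    segment, x = task
    writeToFile(segment, getFileName(x))


if __name__ == '__main__':
    with open(input_fname) as f:
        lines = f.readlines()
    tasks = [(getSegment(x, lines), x) for x in range(numberOfSegments)]
    pool = multiprocessing.Pool()
    pool.map(worker, tasks)   # all segments are written in parallel
    pool.close()
    pool.join()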