Python - multiprocessing and text file processing

BACKGROUND:
I have a huge .txt file which I have to process. It is a data mining project.
So I've split it into many .txt files, each one 100MB in size, saved them all in the same directory, and managed to run them this way:
import os
from multiprocessing.dummy import Pool

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        process(filename)
    else:
        continue
In process(), I parse the file into a list of objects, then I apply another function. This is SLOWER than running the whole file as is. But for big enough files I won't be able to run everything at once and I will have to slice. So I want to use threads, since I don't have to wait for each process(filename) call to finish.
How can I apply it? I've checked this, but I didn't understand how to apply it to my code...
Any help would be appreciated.
I looked here to see how to do this. What I've tried:
futures = []
pool = Pool(6)
for x in range(6):
    futures.append(pool.apply_async(process, (filename,)))
Unfortunately I realized it will only process the first 6 text files, or will it? How can I make it work so that as soon as a worker is done, it is assigned another text file and starts running?
EDIT:
for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        for x in range(6):
            pool.apply_async(process(filename))
    else:
        continue

First, using multiprocessing.dummy will only give you a speedup if your problem is I/O bound (i.e. reading the files is the main bottleneck). For CPU-intensive tasks (where processing the file is the bottleneck) it won't help, and in that case you should use "real" multiprocessing.
The problem you describe seems more fit for the use of one of the map functions of Pool:
import os
from multiprocessing import Pool

files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]
pool = Pool(6)
results = pool.map(process, files)
pool.close()
This will use 6 worker processes to process the list of files and return a list of the return values of the process() function after all files have been processed. Your current example would submit the same file 6 times.
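If you also want each worker to pick up the next file as soon as it finishes one, rather than waiting for the whole batch result, a minimal sketch (assuming the same process() function and pathToFile directory from the question) could use imap_unordered, which yields results in completion order:

import os
from multiprocessing import Pool

def process(filename):
    # placeholder for the question's parse-and-process logic
    return filename

if __name__ == "__main__":
    pathToFile = "."  # assumption: directory holding the split .txt files
    files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]
    with Pool(6) as pool:
        # each worker is handed a new file as soon as it is free
        for result in pool.imap_unordered(process, files):
            print(result)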

Related

Multiprocessing vs Threading in Python

I am learning multiprocessing and threading in Python to process and create a large number of files; the diagram is shown here.
Each output file depends on the analysis of all input files.
A single-process run of the program takes quite a long time, so I tried the following code:
(a) multiprocessing
import time
from multiprocessing import Pool, cpu_count

start = time.time()
process_count = cpu_count()
p = Pool(process_count)
for i in range(process_count):
    p.apply_async(my_read_process_and_write_func, args=(i, w))
p.close()
p.join()
end = time.time()
(b) threading
import threading
import time
from multiprocessing import cpu_count

start = time.time()
thread_count = cpu_count()
thread_list = []
for i in range(0, thread_count):
    t = threading.Thread(target=my_read_process_and_write_func, args=(i,))
    thread_list.append(t)
for t in thread_list:
    t.start()
for t in thread_list:
    t.join()
end = time.time()
I am running this code using Python 3.6 on a Windows PC with 8 cores. However, the multiprocessing method takes about the same time as the single-process method, and the threading method takes about 75% of the single-process time.
My questions are:
Is my code correct?
Is there any better way/code to improve the efficiency?
Thanks!
Your processing is I/O bound, not CPU bound. As a result, the fact that you have multiple processes helps little. Each Python process in multiprocessing is stuck waiting for input or output while the CPU does nothing. Increasing the Pool size in multiprocessing should improve performance.
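As a rough illustration of that suggestion, here is a minimal sketch (the worker body and the pool size are assumptions, not code from the question) that oversubscribes the pool relative to the core count, since I/O-bound workers spend most of their time waiting:

import time
from multiprocessing import Pool, cpu_count

def my_read_process_and_write_func(i):
    # placeholder for the question's read/process/write work
    time.sleep(0.1)  # stands in for waiting on disk I/O

if __name__ == "__main__":
    # more workers than cores can help when each worker mostly waits on I/O
    pool_size = cpu_count() * 2  # assumption: tune this empirically
    with Pool(pool_size) as p:
        p.map(my_read_process_and_write_func, range(pool_size))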
Following Tarik's answer, since my processing is I/O bound, I made several copies of the input files, so each process reads and processes a different copy of these files.
Now my code runs 8 times faster.
Now my processing diagram looks like this.
My input files include one index file (about 400MB) and 100 other files (each 330MB; they can be considered a file pool).
In order to generate one output file, the index file and all files in the file pool need to be read. (e.g. if the first line of the index file is 15, then line 15 of each file in the pool needs to be read to generate output file 1.)
Previously I tried multiprocessing and threading without making copies, and the code was very slow. Then I optimized it by making a copy of only the index file for each process, so each process reads its own copy of the index file individually and then reads the file pool to generate the output files.
Currently, with 8 CPU cores, multiprocessing with a pool size of 8 takes the least time.
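A minimal sketch of that per-process copy idea (the file names, worker body, and copy location are illustrative assumptions, not the original code):

import shutil
from multiprocessing import Pool, cpu_count

INDEX_FILE = "index.dat"  # assumption: the ~400MB index file

def worker(i):
    # each process gets its own copy of the index file, so the processes
    # do not all contend for the same file
    local_index = "index_copy_{}.dat".format(i)
    shutil.copyfile(INDEX_FILE, local_index)
    with open(local_index) as idx:
        for line in idx:
            pass  # placeholder: read the given line from each file in the pool

if __name__ == "__main__":
    with Pool(cpu_count()) as p:
        p.map(worker, range(cpu_count()))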

Python multiprocessing doesn't finish all tasks

I have a lot of files that need to be processed by some software. They don't need to be processed in order.
Let's say I have 12 files; I divided them into three lists and then tried to send these lists to different processes to be executed:
import glob
import subprocess
from multiprocessing import Pool

# import all files
files = glob.glob(src_path + "*.fits")
files_list = [files[0::3], files[1::3], files[2::3]]

num_processors = 3                     # create a pool of processors
pool = Pool(processes=num_processors)  # get them to work in parallel
output = pool.map(run2, [f for f in files_list])

def run2(files, *args):
    for ffit in files:
        terminal_astrometry(command)   # command is built in the trimmed part of run2

def terminal_astrometry(command):
    result = subprocess.run(command, stdout=subprocess.PIPE)
The problem is that sometimes, the program doesn't process all of these files, i.e. 11 files do get processed but one does not. Or other time, 9 finished but 3 were skipped. Sometimes it does finish all tasks(process all of the files).
Essentially, in the run2() function I am calling the particular software that I want to run in parallel (Astrometry.net) on every file that run2() received.
EDIT2: I trimmed the run2() function because it contains a lot of calculation (statistics) not relevant to the problem here (at least I think so).
Your symptoms sound like a race condition; however, pool.map blocks the main process until all tasks have finished, so the code will not progress past that line until they are all done. Therefore, I think the problem may be within the run2 function - could you post its code?
Edit: I previously had the following text in the answer too, the question has now been edited:
You are calling run2 twice for each file - once asynchronously with the pool, and once in the main process. Depending on the logic within this function, this could be the cause of the odd behaviour you're seeing.
The software that I'm calling inside the run2() function is causing the problem. It tries to write its stdout to the same file, which causes it to not complete all the tasks.
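One way to avoid that kind of collision is to give each invocation its own output file. A minimal sketch (the log naming is an assumption; command stands for whatever Astrometry.net invocation the question builds):

import subprocess

def terminal_astrometry(command, tag):
    # write each run's stdout to its own log file instead of a shared one
    with open("astrometry_{}.log".format(tag), "wb") as log:
        return subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)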

line 105, in spawn_main exitcode = _main(fd) while multiprocessing in a for loop

I have looked at many published issues without finding any insight into my current issue.
I am dealing with multiprocessing runs of an external code. This external code consumes input files. The file names are collected in a list that enables me to launch a pool task for each file. A path is also needed.
for i in range(len(file2run)):
    pool.apply_async(runcase, args=(file2run[i], filepath))
The runcase function launches one process for a given input file and analyses and saves the results in some folder.
It works fine whatever the length of file2run is. The external code runs on several processes (as many as maxCPU, defined in the pool with pool = multiprocessing.Pool(processes=maxCPU)).
My issue is that I'd like to go a step further and integrate this in a for loop. In each loop iteration, several input files are created, and once all of the runs are finished a new set of input files is created and a pool is created again.
It works fine for two loop iterations, but then I hit the error xxx line 105, in spawn_main exitcode = _main(fd), preceded by a bunch of messages about a missing required module. The messages are the same whether there are 2 or 1000 input files per iteration...
So I guess it's about the pool creation, but is there a way of clearing the variables between runs? I have tried creating the pool (with the number of CPUs) at the very beginning of the main function, but the same issue arises... I have tried to make a sort of equivalent of MATLAB's clear all, but I always get the same issue... And why does it work for two loop iterations and not for the third one? Why does the second one work?
Thanks in advance for any help (or for pointing me to the relevant already-published issue).
Xavfa
Here is an attempt at an example that actually... works!
I copy/pasted my original script and made it much easier to share, for the sake of understanding the paradigm of my original attempt (the original one deals with objects of several kinds to build the input file and uses an embedded method of one of the objects to launch the external code with subprocess.check_all).
The example keeps the overall paradigm of making input files in one folder and simulation results in another, using the multiprocessing package.
The original still doesn't work; it still fails at the third round of the loop (the if __name__ == '__main__': block of multiproc_test.py).
Here is one script (multiproc_test.py):
import os
import Simlauncher

def RunProcess(MainPath):
    file2run = Simlauncher.initiateprocess(MainPath)
    Simlauncher.RunMultiProc(file2run, MainPath, multi=True, maxcpu=0.7)

def LaunchProcess(nbcase):
    # example that builds the input files
    MainPath = os.getcwd()
    SimDir = os.path.join(os.getcwd(), 'SimFiles\\')
    if not os.path.exists(SimDir):
        os.mkdir(SimDir)
    for i in range(100):
        with open(SimDir + 'inputfile' + str(i) + '.mptest', 'w') as file:
            file.write('Hello World')
    RunProcess(MainPath)

if __name__ == '__main__':
    for i in range(1, 10):
        LaunchProcess(i)
        os.rename(os.path.join(os.getcwd(), 'SimFiles'), os.path.join(os.getcwd(), 'SimFiles' + str(i)))
Here is the other one (Simlauncher.py):
import multiprocessing as mp
import os

def initiateprocess(MainPath):
    filepath = MainPath + '\\SimFiles\\'
    listOfFiles = os.listdir(filepath)
    file2run = []
    for file in listOfFiles:
        if '.mptest' in file:
            file2run.append(file)
    return file2run

def runtestcase(file, filepath):
    filepath = filepath + '\\SimFiles'
    ResSimpath = filepath + '\\SimRes\\'
    if not os.path.exists(ResSimpath):
        os.mkdir(ResSimpath)
    with open(ResSimpath + 'Res_' + file, 'w') as res:
        res.write('I am done')
    print(file + ' is finished')

def RunMultiProc(file2run, filepath, multi, maxcpu):
    print('Launching cases :')
    nbcpu = mp.cpu_count()
    pool = mp.Pool(processes=int(nbcpu * maxcpu))
    for i in range(len(file2run)):
        pool.apply_async(runtestcase, args=(file2run[i], filepath))
    pool.close()
    pool.join()
    print('Done with this one !')
Any help is still needed...
BTW, the external code is EnergyPlus (for building energy simulation).
Xavier

Python's multiprocessing is not creating tasks in parallel

I am learning about multithreading in Python using the multiprocessing library. For that purpose, I tried to create a program to divide a big file into several smaller chunks. So, first I read all the data from that file, and then create worker tasks that each take a segment of the data from the input file and write that segment into a file. I expect to have as many parallel threads running as the number of segments, but that does not happen. I see at most two tasks, and the program terminates after that. What mistake am I making? The code is given below.
import multiprocessing

def worker(segment, x):
    fname = getFileName(x)
    writeToFile(segment, fname)

if __name__ == '__main__':
    with open(fname) as f:
        lines = f.readlines()
    jobs = []
    for x in range(0, numberOfSegments):
        segment = getSegment(x, lines)
        jobs.append(multiprocessing.Process(target=worker, args=(segment, x)))
        jobs[len(jobs)-1].start()
    for p in jobs:
        p.join
Process gives you one additional process (which, with your main process, gives you two processes). The call to join at the end of each loop will wait for that process to finish before starting the next loop iteration. If you insist on using Process, you'll need to store the returned processes (probably in a list), and join every process in a loop after your current loop.
You want the Pool class from multiprocessing (https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool)
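A minimal sketch of the Pool-based approach recommended above (getSegment, getFileName, writeToFile and numberOfSegments are the question's own placeholders and are assumed to exist; inputFileName is an assumed name for the big input file):

import multiprocessing

def worker(segment, x):
    fname = getFileName(x)       # placeholder from the question
    writeToFile(segment, fname)  # placeholder from the question

if __name__ == '__main__':
    with open(inputFileName) as f:
        lines = f.readlines()
    tasks = [(getSegment(x, lines), x) for x in range(numberOfSegments)]
    # each worker process handles one (segment, x) pair
    with multiprocessing.Pool() as pool:
        pool.starmap(worker, tasks)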

python -> multiprocessing module

Here's what I am trying to accomplish -
I have about a million files which I need to parse & append the parsed content to a single file.
Since a single process takes ages, this option is out.
I'm not using threads in Python, as it essentially comes down to running a single process (due to the GIL).
Hence the multiprocessing module, i.e. spawning 4 sub-processes to utilize all that raw core power :)
So far so good. Now I need a shared object which all the sub-processes have access to, so I am using Queues from the multiprocessing module. Also, all the sub-processes need to write their output to a single file, which is a potential place to use locks, I guess. With this setup, when I run it, I do not get any error (so the parent process seems fine); it just stalls. When I press Ctrl-C I see a traceback (one for each sub-process). Also, no output is written to the output file. Here's the code (note that everything runs fine without multiple processes):
import os
import glob
from multiprocessing import Process, Queue, Pool

data_file = open('out.txt', 'w+')

def worker(task_queue):
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))
        if data:
            data_file.write(repr(data) + '\n')
    return

def main():
    task_queue = Queue()
    for file in glob.glob('*.csv'):
        task_queue.put(file)
    task_queue.put('STOP')  # so that worker processes know when to stop

    # this is the block of code that needs correction.
    if multi_process:
        # One way to spawn 4 processes
        # pool = Pool(processes=4)  # Start worker processes
        # res = pool.apply_async(worker, [task_queue, data_file])
        # But I chose to do it like this for now.
        for i in range(4):
            proc = Process(target=worker, args=[task_queue])
            proc.start()
    else:  # single process mode is working fine!
        worker(task_queue)
    data_file.close()
    return
What am I doing wrong? I also tried passing the open file object to each of the processes at spawn time, e.g. Process(target=worker, args=[task_queue, data_file]), but this did not change anything. I feel the subprocesses are not able to write to the file for some reason. Either the file object is not getting replicated at spawn time, or there is some other quirk... Anybody got an idea?
EXTRA: Is there also any way to keep a persistent MySQL connection open and pass it to the sub-processes? That is, I open a MySQL connection in my parent process, and the open connection should be accessible to all my sub-processes. Basically this is the equivalent of shared memory in Python. Any ideas here?
Although the discussion with Eric was fruitful, later on I found a better way of doing this. Within the multiprocessing module there is a class called Pool which is perfect for my needs.
It optimizes itself to the number of cores my system has, i.e. only as many processes are spawned as there are cores. Of course this is customizable. So here's the code; it might help someone later:
import os
import glob
from multiprocessing import Pool

def main():
    po = Pool()
    for file in glob.glob('*.csv'):
        filepath = os.path.join(DATA_DIR, file)
        po.apply_async(mine_page, (filepath,), callback=save_data)
    po.close()
    po.join()
    file_ptr.close()

def mine_page(filepath):
    # do whatever it is that you want to do in a separate process.
    return data

def save_data(data):
    # data is an object. Store it in a file, mysql or...
    return
I'm still going through this huge module. I'm not sure if save_data() is executed by the parent process or if this function is used by the spawned child processes. If it's the child that does the saving, it might lead to concurrency issues in some situations. If anyone has more experience using this module, I'd appreciate more knowledge here...
The docs for multiprocessing indicate several methods of sharing state between processes:
http://docs.python.org/dev/library/multiprocessing.html#sharing-state-between-processes
I'm sure each process gets a fresh interpreter and then the target (function) and args are loaded into it. In that case, the global namespace from your script would have been bound to your worker function, so the data_file would be there. However, I am not sure what happens to the file descriptor as it is copied across. Have you tried passing the file object as one of the args?
An alternative is to pass another Queue that will hold the results from the workers. The workers put the results and the main code gets the results and writes it to the file.
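A minimal sketch of that results-queue idea, assuming the mine_imdb_page function and DATA_DIR from the question exist; only the main process touches the output file:

import os
import glob
from multiprocessing import Process, Queue

def worker(task_queue, result_queue):
    # workers only parse; they never write to the output file
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))  # from the question
        if data:
            result_queue.put(repr(data))
    result_queue.put('DONE')

def main():
    task_queue, result_queue = Queue(), Queue()
    for file in glob.glob('*.csv'):
        task_queue.put(file)
    workers = []
    for _ in range(4):
        task_queue.put('STOP')  # one sentinel per worker
        p = Process(target=worker, args=(task_queue, result_queue))
        p.start()
        workers.append(p)
    finished = 0
    with open('out.txt', 'w') as data_file:
        # the main process does all the writing
        while finished < len(workers):
            item = result_queue.get()
            if item == 'DONE':
                finished += 1
            else:
                data_file.write(item + '\n')
    for p in workers:
        p.join()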
