Parallel processing of large XML files in Python

I have several large xml files that I am parsing (extracting some subset of the data and writing to file), but there are lots of files and lots of records per file, so I'm attempting to parallelize.
To start, I have a generator that pulls records from the file (this works fine):
def reader(files):
    n = 0
    for fname in files:
        chunk = ''
        with gzip.open(fname, "r") as f:
            for line in f:
                line = line.strip()
                if '<REC' in line or ('<REC' in chunk and line != '</REC>'):
                    chunk += line
                    continue
                if line == '</REC>':
                    chunk += line
                    n += 1
                    yield chunk
                    chunk = ''
A process function (details not relevant here, but also works fine):
def process(chunk, fields='all'):
    paper = etree.fromstring(chunk)
    #
    # extract some info from the xml
    #
    return result  # result is a string
Now of course the naive, non-parallel way to do this would be as simple as:
records = reader(files)
with open(output_filepath, 'w') as fout:
    for record in records:
        result = process(record)
        fout.write(result + '\n')
But now I want to parallelize this. I first considered doing a simple map-based approach, with each process handling one of the files, but the files are of radically different sizes (and some are really big), so this would be a pretty inefficient use of parallelization, I think. This is my current approach:
import multiprocessing as mp

def feed(queue, records):
    for rec in records:
        queue.put(rec)
    queue.put(None)

def calc(queueIn, queueOut):
    while True:
        rec = queueIn.get(block=True)
        if rec is None:
            queueOut.put('__DONE__')
            break
        result = process(rec)
        queueOut.put(result)

def write(queue, file_handle):
    records_logged = 0
    while True:
        result = queue.get()
        if result == '__DONE__':
            logger.info("{} --> ALL records logged ({})".format(file_handle.name, records_logged))
            break
        elif result is not None:
            file_handle.write(result + '\n')
            file_handle.flush()
            records_logged += 1
            if records_logged % 1000 == 0:
                logger.info("{} --> {} records complete".format(file_handle.name, records_logged))
nThreads = N
records = reader(filelist)

workerQueue = mp.Queue()
writerQueue = mp.Queue()

feedProc = mp.Process(target=feed, args=(workerQueue, records))
calcProc = [mp.Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nThreads)]
writProc = mp.Process(target=write, args=(writerQueue, handle))

feedProc.start()
for p in calcProc:
    p.start()
writProc.start()

feedProc.join()
for p in calcProc:
    p.join()
writProc.join()

feedProc.terminate()
writProc.terminate()
for p in calcProc:
    p.terminate()

workerQueue.close()
writerQueue.close()
Now, this works in the sense that everything gets written to file, but then it just hangs when trying to join the processes at the end, and I'm not sure why. So, my main question is, what am I doing wrong here such that my worker processes aren't correctly terminating, or signaling that they're done?
I think I could solve this problem the "easy" way by adding timeouts to the calls to join but this (a) seems like a rather inelegant solution here, as there are clear completion conditions to the task (i.e. once every record in the file has been processed, we're done), and (b) I'm worried that this could introduce some problem (e.g. if I make the timeout too short, couldn't it terminate things before everything has been processed? And of course making it too long is just wasting time...).
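For reference, here is a sketch (not the posted code) of one completion-signalling pattern that matches those conditions: the feeder puts one sentinel per worker, and the writer counts the '__DONE__' markers before exiting, so every process reaches its natural end and join() returns without timeouts. The calc function can stay as it is; the extra nworkers argument is the only signature change.

def feed(queue, records, nworkers):
    for rec in records:
        queue.put(rec)
    for _ in range(nworkers):
        # one stop signal for each calc process, not just for the first one
        queue.put(None)

def write(queue, file_handle, nworkers):
    done = 0
    records_logged = 0
    while done < nworkers:
        result = queue.get()
        if result == '__DONE__':
            done += 1
        elif result is not None:
            file_handle.write(result + '\n')
            records_logged += 1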
I'm also willing to consider a totally different approach to parallelizing this if anyone has ideas (the queue just seemed like a good choice since the files are big and just reading and generating the raw records takes time).
Bonus question: I'm aware that this approach in no way guarantees that the output I'm writing to file will be in the same order as the original data. This is not a huge deal (sorting the reduced/processed data won't be too unwieldy), but maintaining order would be nice. So extra gratitude for a solution that preserves the original order.
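One possibility worth sketching here (again, not from the original post): multiprocessing.Pool.imap consumes the reader() generator lazily, fans records out to worker processes, and yields results in the original input order, which also removes the need for hand-rolled sentinels. The names files, N and output_filepath are the ones used above; the chunksize value is only a guess.

import multiprocessing as mp

if __name__ == '__main__':
    with mp.Pool(processes=N) as pool, open(output_filepath, 'w') as fout:
        # imap preserves input order; chunksize batches records to reduce IPC overhead
        for result in pool.imap(process, reader(files), chunksize=50):
            fout.write(result + '\n')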

Related

I need to extract data from a text file that contains >8 million records. What is the best-suited language to do this using multithreading?

Currently I am using the multiprocessing feature of Python. Though this works fine for text files of up to 2 million records, it fails for the file with 8 million records with "Can't access lock."
Moreover, it takes about 30 minutes to process the file with 2 million records, and it fails after about an hour or so for the big file.
I am doing this:
def try_multiple_operations(item):
    aab_type = item[15:17]
    aab_amount = item[35:46]
    aab_name = item[82:100]
    aab_reference = item[64:82]
    if aab_type not in '99' or 'Z5':
        aab_record = f'{aab_name} {aab_amount} {aab_reference}'
    else:
        aab_record = 'ignore'
    return aab_record
Calling the try_multiple_operations in the __main__:
if __name__ == '__main__':
    # some other code
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(try_multiple_operations, item) for item in aab]
    concurrent.futures.wait(futures)
    aab_list = [x.result() for x in futures]
    aab_list.sort()
    # some other code for further processing
I have used pandas/dataframes too, and I am able to do a bit of the processing with them. However, I want to retain the original format of the file after processing, which dataframes make a bit tricky, as they return data in either ndarray or CSV format.
I would like to understand if there is a faster way of doing this, maybe using some other programming language.
As advised by @wwii, I updated the code to get rid of the multiprocessing. This made the code a lot faster.
def try_multiple_operations(items):
    data_aab = []
    value = ['Z5', 'RA', 'Z4', '99', 99]
    for item in items:
        aab_type = item[15:17]
        aab_amount = item[35:46]
        aab_name = item[82:100]
        aab_reference = item[64:82]
        if aab_type not in value:
            # aab_record = f'{aab_name} {aab_amount} {aab_reference}'
            data_aab.append(f'{aab_name} {aab_amount} {aab_reference}')
    return data_aab
Called this as:
if __name__ == '__main__':
    # some code
    aab_list = try_multiple_operations(aab)
    # some extra code
This, while a bit surprising to me, is a lot faster than multiprocessing.
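The likely reason is that each submit() call shipped a single tiny record to a worker process, so inter-process overhead dominated the actual work. If multiprocessing is still wanted, a sketch along the following lines batches many records per task so that overhead is amortised (classify is a hypothetical helper, aab is the record list used above, and the chunksize value is only a guess):

import concurrent.futures

def classify(item):
    # same field slicing as above; returns None for the record types to be ignored
    if item[15:17] in ('Z5', 'RA', 'Z4', '99'):
        return None
    return f'{item[82:100]} {item[35:46]} {item[64:82]}'

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # a large chunksize sends thousands of records per task instead of one
        results = executor.map(classify, aab, chunksize=10000)
        aab_list = sorted(r for r in results if r is not None)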

Handle multiple results in Python multiprocessing

I'm writing a piece of Python code to parse a lot of ASCII files using multiprocessing functionality.
For each file I have to perform the operations of this function:
def parse_file(file_name):
    record = False
    path_include = []
    buffer_include = []
    include_file_filters = {}
    include_keylines = {}
    grids_lines = []
    mat_name_lines = []
    pids_name_lines = []
    pids_shell_lines = []
    pids_weld_lines = []
    shells_lines = []
    welds_lines = []
    with open(file_name, 'rb') as in_file:
        for lineID, line in enumerate(in_file):
            if record:
                path_include += line
            if record and re.search(r'[\'|\"]$', line.strip()):
                buffer_include.append(re_path_include.search(
                    path_include).group(1).replace('\n', ''))
                record = False
            if 'INCLUDE' in line and '$' not in line:
                if re_path_include.search(line):
                    buffer_include.append(
                        re_path_include.search(line).group(1))
                else:
                    path_include = line
                    record = True
            if line.startswith('GRID'):
                grids_lines += [lineID]
            if line.startswith('$HMNAME MAT'):
                mat_name_lines += [lineID]
            if line.startswith('$HMNAME PROP'):
                pids_name_lines += [lineID]
            if line.startswith('PSHELL'):
                pids_shell_lines += [lineID]
            if line.startswith('PWELD'):
                pids_weld_lines += [lineID]
            if line.startswith(('CTRIA3', 'CQUAD4')):
                shells_lines += [lineID]
            if line.startswith('CWELD'):
                welds_lines += [lineID]
    include_keylines = {'grid': grids_lines, 'mat_name': mat_name_lines, 'pid_name': pids_name_lines,
                        'pid_shell': pids_shell_lines, 'pid_weld': pids_weld_lines, 'shell': shells_lines, 'weld': welds_lines}
    include_file_filters = {file_name: include_keylines}
    return buffer_include, include_file_filters
This function is used in a loop over the list of files, in this way (each process parses one entire file):
import multiprocessing as mp

p = mp.Pool(mp.cpu_count())
buffer_include = []
include_file_filters = {}
for include in grouper([list_of_file_path]):
    current = mp.current_process()
    print 'Running: ', current.name, current._identity
    results = p.map(parse_file, include)
    buffer_include += results[0]
    include_file_filters.update(results[1])
p.close()
The grouper function used above is defined as
def grouper(iterable, padvalue=None):
    return itertools.izip_longest(*[iter(iterable)] * mp.cpu_count(), fillvalue=padvalue)
I'm using Python 2.7.15 on a CPU with 4 cores (Intel Core i3-6006U).
When I run my code, I see all the CPUs engaged at 100% and the output in the Python console reads Running: MainProcess (), but nothing else happens. It seems that my code is blocked at the instruction results = p.map(parse_file, include) and can't go ahead (the code works well when I parse the files one at a time without parallelization).
What is wrong?
How can I deal with the results returned by the parse_file function during parallel execution? Is my approach correct?
Thanks in advance for your support.
EDIT
Thanks darc for your reply. I've tried your suggestion, but the issue is the same. The problem seems to be overcome if I put the code under an if statement like so:
if __name__ == '__main__':
Maybe this is due to the manner in which Python IDLE handles the processes. I'm using the IDLE environment for development and debugging.
According to the Python docs:
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
Since it is blocking, your process waits until parse_file is done.
Since map already chunks the iterable, you can try to send all of the includes together as one large iterable:
import multiprocessing as mp

p = mp.Pool(mp.cpu_count())
buffer_include = []
include_file_filters = {}
results = p.map(parse_file, list_of_file_path, 1)
# each element of results is the (buffer_include, include_file_filters) pair for one file
for buf, filters in results:
    buffer_include += buf
    include_file_filters.update(filters)
p.close()
If you want to keep the original loop, use apply_async; or, if you are using Python 3, you can use the ProcessPoolExecutor submit() function and read the results.
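For illustration only, a minimal sketch of the Python 3 submit() variant mentioned above, assuming the parse_file and list_of_file_path names from the question and that parse_file still returns a (buffer, filters) pair:

from concurrent.futures import ProcessPoolExecutor, as_completed

if __name__ == '__main__':
    buffer_include = []
    include_file_filters = {}
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(parse_file, path) for path in list_of_file_path]
        for future in as_completed(futures):
            buf, filters = future.result()  # unpack the pair returned by parse_file
            buffer_include += buf
            include_file_filters.update(filters)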

Processing a huge amount of files in python

I have a huge number of report files (about 650 files), which take up about 320 MB of disk space, and I want to process them. There are a lot of entries in each file; I should count and log them based on their content. Some of them are related to each other and I should find, log and count them too; matches may be in different files. I have written a simple script to do the job. I used the Python profiler and it took only about 0.3 seconds to run the script for a single file with 2000 lines, of which we need half for processing. But for the whole directory it took an hour and a half to finish. This is what my script looks like:
# imports

class Parser(object):
    def __init__(self):
        # load some configurations
        # open some log files
        # set some initial values for some variables

    def parse_packet(self, tags):
        # extract some values from line

    def found_matched(self, packet):
        # search in the related list to find matched line

    def save_packet(self, packet):
        # write the line in the appropriate files and increase or decrease some counters

    def parse(self, file_addr):
        lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
        for line in lines:
            packet = parse_packet(line)
            if found_matched(packet):
                # count
                self.save_packet(packet)

    def process_files(self):
        if not os.path.isdir(self.src_dir):
            self.log('No such file or directory: ' + str(self.src_dir))
            sys.exit(1)
        input_dirs = os.walk(self.src_dir)
        for dname in input_dirs:
            file_list = dname[2]
            for fname in file_list:
                self.parse(os.path.join(dname[0], fname))
        self.finalize_process()

    def finalize_process(self):
        # closing files
I want to decrease the time to at most 10% of the current execution time. Maybe multiprocessing can help me, or maybe some enhancement to the current script will do the job. Either way, could you please help me with this?
Edit 1:
I have changed my code according to @Reut Sharabani's answer:
def parse(self, file_addr):
    lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
    for line in lines:
        packet = parse_packet(line)
        if found_matched(packet):
            # count
            self.save_packet(packet)

def process_files(self):
    if not os.path.isdir(self.src_dir):
        self.log('No such file or directory: ' + str(self.src_dir))
        sys.exit(1)
    input_dirs = os.walk(self.src_dir)
    for dname in input_dirs:
        process_pool = multiprocessing.Pool(10)
        for fname in file_list:
            file_list = [os.path.join(dname[0], fname) for fname in dname[2]]
            process_pool.map(self.parse, file_list)
    self.finalize_process()
I also added the lines below before my class definition to avoid PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed:
import copy_reg
import types

def _pickle_method(m):
    if m.im_self is None:
        return getattr, (m.im_class, m.im_func.func_name)
    else:
        return getattr, (m.im_self, m.im_func.func_name)

copy_reg.pickle(types.MethodType, _pickle_method)
Another thing I did in my code was to not keep the log files open during file processing; I open and close them for writing each entry, just to avoid ValueError: I/O operation on closed file.
Now the problem is that I have some files which are being processed multiple times. I also got wrong counts for my packets. What did I do wrong? Should I put process_pool = multiprocessing.Pool(10) before the for loop? Consider that I have just one directory right now and it doesn't seem to be the problem.
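For what it's worth, a sketch of how process_files might be restructured so that each file is submitted exactly once, with a single pool created up front (an editorial sketch, not code from any answer below):

def process_files(self):
    if not os.path.isdir(self.src_dir):
        self.log('No such file or directory: ' + str(self.src_dir))
        sys.exit(1)
    process_pool = multiprocessing.Pool(10)          # one pool for the whole walk
    for dname in os.walk(self.src_dir):
        file_list = [os.path.join(dname[0], fname) for fname in dname[2]]
        process_pool.map(self.parse, file_list)      # each file submitted exactly once
    process_pool.close()
    process_pool.join()
    self.finalize_process()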
EDIT 2:
I also tried using ThreadPoolExecutor this way:
with ThreadPoolExecutor(max_workers=10) as executor:
    for fname in file_list:
        executor.submit(self.parse, fname)
Results were correct, but it took an hour and a half to be completed.
First of all, "about 650 files which takes about 320 M" is not a lot. Given that modern hard disks easily read and write 100 MB/s, the I/O performance of your system probably is not your bottleneck (also supported by "it just took about 0.3 seconds to run the script for one single file with 2000 lines", which clearly indicates CPU-limitation). However, the exact way you are reading files from within Python may not be efficient.
Furthermore, a simple multiprocessing-based architecture, run on a common multi core system, will allow you to perform your analysis much faster (no need to involve celery here, no need to cross machine boundaries).
multiprocessing architecture
Just have a look at multiprocessing, your architecture likely will involve one manager process (the parent), which defines a task Queue, and a Pool of worker processes. The manager (or feeder) puts tasks (e.g. file names) into the queue, and the workers consume these. After finishing with a task, a worker lets the manager know, and proceeds consuming the next one.
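A bare-bones sketch of that architecture (parse_one_file and all_files are placeholders, not names from the question):

import multiprocessing as mp

def worker(task_queue, result_queue):
    # consume file names until the None sentinel arrives
    for file_name in iter(task_queue.get, None):
        result_queue.put(parse_one_file(file_name))  # placeholder per-file analysis

if __name__ == '__main__':
    tasks, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(tasks, results))
               for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    for file_name in all_files:     # placeholder list of report files
        tasks.put(file_name)
    for _ in workers:
        tasks.put(None)             # one sentinel per worker
    collected = [results.get() for _ in all_files]
    for w in workers:
        w.join()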
file processing method
This is quite inefficient:
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
    ...
readlines() reads the entire file before the list comprehension is evaluated, and only after that do you iterate through all the lines again. Hence, you iterate through your data three times. Combine everything into a single loop, so that you iterate over the lines only once.
You should be using threads here. If you're blocked by the CPU later, you can use processes.
To explain, I first created a thousand files (0.txt ... 999.txt), each with a line count equivalent to its name (+1), using this command:
for i in `seq 0 999`; do for j in `seq 0 $i`; do echo $i >> $i.txt; done ; done
Next, I've created a python script using a ThreadPool with 10 threads to count the lines of all files that have an even value:
#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
import time
import sys

print "creating %s threads" % sys.argv[1]
thread_pool = ThreadPool(int(sys.argv[1]))

files = ["%d.txt" % i for i in range(1000)]

def count_even_value_lines(filename):
    with open(filename, 'r') as f:
        # do some processing
        line_count = 0
        for line in f.readlines():
            if int(line.strip()) % 2 == 0:
                line_count += 1
        print "finished file %s" % filename
        return line_count

start = time.time()
print sum(thread_pool.map(count_even_value_lines, files))
total = time.time() - start
print total
As you can see this takes no time, and the results are correct. 10 files are processed in parallel and the CPU is fast enough to handle the results. If you want even more, you may consider using threads and processes together to utilize all CPUs as well as not letting I/O block you.
Edit:
As the comments suggest, I was wrong and this is not I/O-bound, so you can speed it up using multiprocessing (it is CPU-bound). Because I used a ThreadPool, which has the same interface as Pool, you can make minimal edits and have the same code running:
#!/usr/bin/env python
import multiprocessing
import time
import sys

files = ["%d.txt" % i for i in range(2000)]

# function has to be defined before pool is opened and workers are forked
def count_even_value_lines(filename):
    with open(filename, 'r') as f:
        # do some processing
        line_count = 0
        for line in f:
            if int(line.strip()) % 2 == 0:
                line_count += 1
    return line_count

print "creating %s processes" % sys.argv[1]
process_pool = multiprocessing.Pool(int(sys.argv[1]))

start = time.time()
print sum(process_pool.map(count_even_value_lines, files))
total = time.time() - start
print total
Results:
me@EliteBook-8470p:~/Desktop/tp$ python tp.py 1
creating 1 processes
25000000
21.2642059326
me@EliteBook-8470p:~/Desktop/tp$ python tp.py 10
creating 10 processes
25000000
12.4360249043
Aside from using parallel processing, your parse method is rather inefficient, as @Jan-PhilipGehrcke already pointed out. To expand on his recommendation, the classical variant:
def parse(self, file_addr):
    with open(file_addr, 'r') as f:
        line_no = 0                     # 0-based, to match the enumerate() filter above
        for line in f:
            if line_no % 2 != 0:
                packet = parse_packet(line)
                if found_matched(packet):
                    # count
                    self.save_packet(packet)
            line_no += 1
Or using your style (assuming you use python 3):
def parse(self, file_addr):
    with open(file_addr, 'r') as f:
        filtered = (l for index, l in enumerate(f) if index % 2 != 0)
        for line in filtered:
            # and so on
The thing to notice here is the use of iterators: all operations to build the filtered sequence (which is not actually a list!) operate on and return iterators, which means that at no point is the entire file loaded into a list.

Python multiprocessing Pool.map not faster than calling the function once

I have a very large list of strings (originally from a text file) that I need to process using python. Eventually I am trying to go for a map-reduce style of parallel processing.
I have written a "mapper" function and fed it to multiprocessing.Pool.map(), but it takes the same amount of time as simply calling the mapper function with the full set of data. I must be doing something wrong.
I have tried multiple approaches, all with similar results.
def initial_map(lines):
    results = []
    for line in lines:
        processed = # process line (O(1) operation)
        results.append(processed)
    return results

def chunks(l, n):
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    partitions = chunks(lines, len(lines)/8)
    results = pool.map(initial_map, partitions, 1)
So the chunks function makes a list of sublists of the original set of lines to give to the pool.map(), then it should hand these 8 sublists to 8 different processes and run them through the mapper function. When I run this I can see all 8 of my cores peak at 100%. Yet it takes 22-24 seconds.
When I simply run this (single process/thread):
lines = list(open("../../log.txt", 'r'))
results = initial_map(lines)
It takes about the same amount of time. ~24 seconds. I only see one process getting to 100% CPU.
I have also tried letting the pool split up the lines itself and have the mapper function only handle one line at a time, with similar results.
def initial_map(line):
    processed = # process line (O(1) operation)
    return processed

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    pool.map(initial_map, lines)
~22 seconds.
Why is this happening? Parallelizing this should result in faster results, shouldn't it?
If the amount of work done in one iteration is very small, you're spending a big proportion of the time just communicating with your subprocesses, which is expensive. Instead, try to pass bigger slices of your data to the processing function. Something like the following:
slices = (data[i:i+100] for i in range(0, len(data), 100))

def process_slice(data):
    return [initial_map(x) for x in data]

pool.map(process_slice, slices)
# and then itertools.chain the output to flatten it
(don't have my comp. so can't give you a full working solution nor verify what I said)
Edit: or see the 3rd comment on your question by @ubomb.
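To make the idea above concrete, here is a minimal, self-contained sketch (process_line is a stand-in for the real per-line operation, and the slice size is only a guess):

import itertools
from multiprocessing import Pool

def process_line(line):
    return line.strip().upper()  # placeholder for the real O(1) work

def process_slice(slice_of_lines):
    # each task processes a whole slice, so the per-task overhead is amortised
    return [process_line(x) for x in slice_of_lines]

if __name__ == "__main__":
    with open("../../log.txt", 'r') as f:
        data = f.readlines()
    slices = [data[i:i + 10000] for i in range(0, len(data), 10000)]
    pool = Pool(processes=8)
    nested = pool.map(process_slice, slices)
    pool.close()
    pool.join()
    results = list(itertools.chain.from_iterable(nested))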

Chunking data from a large file for multiprocessing?

I'm trying to parallelize an application using multiprocessing which takes in a very large csv file (64 MB to 500 MB), does some work line by line, and then outputs a small, fixed-size file.
Currently I do a list(file_obj), which unfortunately is loaded entirely into memory (I think), and then I break that list up into n parts, n being the number of processes I want to run. I then do a pool.map() on the broken-up lists.
This seems to have a really, really bad runtime in comparison to a single-threaded, just-open-the-file-and-iterate-over-it methodology. Can someone suggest a better solution?
Additionally, I need to process the rows of the file in groups which preserve the value of a certain column. These groups of rows can themselves be split up, but no group should contain more than one value for this column.
list(file_obj) can require a lot of memory when file_obj is large. We can reduce that memory requirement by using itertools to pull out chunks of lines as we need them.
In particular, we can use
reader = csv.reader(f)
chunks = itertools.groupby(reader, keyfunc)
to split the file into processable chunks, and
groups = [list(chunk) for key, chunk in itertools.islice(chunks, num_chunks)]
result = pool.map(worker, groups)
to have the multiprocessing pool work on num_chunks chunks at a time.
By doing so, we need roughly only enough memory to hold a few (num_chunks) chunks in memory, instead of the whole file.
import multiprocessing as mp
import itertools
import time
import csv

def worker(chunk):
    # `chunk` will be a list of CSV rows all with the same name column
    # replace this with your real computation
    # print(chunk)
    return len(chunk)

def keyfunc(row):
    # `row` is one row of the CSV file.
    # replace this with the name column.
    return row[0]

def main():
    pool = mp.Pool()
    largefile = 'test.dat'
    num_chunks = 10
    results = []
    with open(largefile) as f:
        reader = csv.reader(f)
        chunks = itertools.groupby(reader, keyfunc)
        while True:
            # make a list of num_chunks chunks
            groups = [list(chunk) for key, chunk in
                      itertools.islice(chunks, num_chunks)]
            if groups:
                result = pool.map(worker, groups)
                results.extend(result)
            else:
                break
    pool.close()
    pool.join()
    print(results)

if __name__ == '__main__':
    main()
I would keep it simple. Have a single program open the file and read it line by line. You can choose how many files to split it into, open that many output files, and write every line to the next file in turn. This will split the file into n roughly equal parts. You can then run a Python program against each of the files in parallel.
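A small sketch of that splitting step (file names and the part count are illustrative only):

n_parts = 8
out_files = [open('part_%d.csv' % i, 'w') for i in range(n_parts)]
with open('big_input.csv') as f:
    for line_no, line in enumerate(f):
        # deal lines out round-robin across the part files
        out_files[line_no % n_parts].write(line)
for handle in out_files:
    handle.close()

If the grouping constraint from the question matters, the destination index could instead be derived from the key column (for example hash(key) % n_parts) so that all rows sharing a key land in the same part.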
