Chunking data from a large file for multiprocessing? - python

I'm trying to parallelize an application using multiprocessing which takes in
a very large csv file (64MB to 500MB), does some work line by line, and then outputs a small, fixed size
file.
Currently I do a list(file_obj), which unfortunately loads the whole file
into memory (I think), and then I break that list up into n parts, n being the
number of processes I want to run. I then do a pool.map() on the broken up
lists.
This seems to have a really, really bad runtime in comparison to a single
threaded, just-open-the-file-and-iterate-over-it methodology. Can someone
suggest a better solution?
Additionally, I need to process the rows of the file in groups which preserve
the value of a certain column. These groups of rows can themselves be split up,
but no group should contain more than one value for this column.

list(file_obj) can require a lot of memory when file_obj is large. We can reduce that memory requirement by using itertools to pull out chunks of lines as we need them.
In particular, we can use
reader = csv.reader(f)
chunks = itertools.groupby(reader, keyfunc)
to split the file into processable chunks, and
groups = [list(chunk) for key, chunk in itertools.islice(chunks, num_chunks)]
result = pool.map(worker, groups)
to have the multiprocessing pool work on num_chunks chunks at a time.
By doing so, we need roughly only enough memory to hold a few (num_chunks) chunks in memory, instead of the whole file.
import multiprocessing as mp
import itertools
import time
import csv

def worker(chunk):
    # `chunk` will be a list of CSV rows all with the same name column
    # replace this with your real computation
    # print(chunk)
    return len(chunk)

def keyfunc(row):
    # `row` is one row of the CSV file.
    # replace this with the name column.
    return row[0]

def main():
    pool = mp.Pool()
    largefile = 'test.dat'
    num_chunks = 10
    results = []
    with open(largefile) as f:
        reader = csv.reader(f)
        chunks = itertools.groupby(reader, keyfunc)
        while True:
            # make a list of num_chunks chunks
            groups = [list(chunk) for key, chunk in
                      itertools.islice(chunks, num_chunks)]
            if groups:
                result = pool.map(worker, groups)
                results.extend(result)
            else:
                break
    pool.close()
    pool.join()
    print(results)

if __name__ == '__main__':
    main()
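One caveat with the code above: itertools.groupby only groups consecutive rows, and a single very large group still becomes one large list. Since the question says groups may be split up, here is a minimal sketch of capping the size of each piece. The names bounded_groups and max_rows are my own, not part of the answer above.

```python
import itertools

def bounded_groups(rows, keyfunc, max_rows):
    """Yield lists of consecutive rows sharing one key, each at most max_rows long."""
    for _key, group in itertools.groupby(rows, keyfunc):
        while True:
            # Pull the next piece from the current group only, so a piece
            # never mixes two different key values.
            piece = list(itertools.islice(group, max_rows))
            if not piece:
                break
            yield piece
```

In main() above you could then create pieces = bounded_groups(reader, keyfunc, max_rows) once before the loop and build each batch with groups = list(itertools.islice(pieces, num_chunks)).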

I would keep it simple. Have a single program open the file and read it line by line. Choose how many files to split it into, open that many output files, and write each line to the next file in turn. This splits the file into n roughly equal parts. You can then run a Python program against each of the files in parallel.
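A minimal sketch of that round-robin split, assuming plain lines with no header row and no quoted multi-line CSV fields. The function name split_round_robin and its parameters are made up for illustration. Note that this does not keep rows with the same column value together, so it only fits if that requirement is handled separately.

```python
import contextlib
import itertools

def split_round_robin(src_path, out_paths):
    """Write line i of src_path to out_paths[i % len(out_paths)]."""
    with contextlib.ExitStack() as stack:
        # newline="" keeps the original line endings untouched.
        src = stack.enter_context(open(src_path, newline=""))
        outs = [stack.enter_context(open(p, "w", newline="")) for p in out_paths]
        # zip stops as soon as the source file runs out of lines.
        for line, out in zip(src, itertools.cycle(outs)):
            out.write(line)
```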

Related

Python Multiprocessing read and write to large file

I have a large file of 120GB consisting of strings line by line. I would like to loop over the file line by line, replacing every German character ß with the character s. I have working code, but it is very slow, and in the future I will need to replace more German characters. So I have been trying to cut the file into 6 pieces (for my 6-core CPU) and use multicore processing to speed the code up, but with no luck.
As lines are not ordered, I do not care where the lines in the new file will end up. Can somebody please help me?
My working slow code:
import re
# raw strings keep the backslashes in the Windows paths literal
with open(r'C:\Projects\orders.txt', 'r') as f, open(r'C:\Projects\orders_new.txt', 'w') as nf:
    for l in f:
        l = re.sub("ß", "s", l)
        nf.write(l)
For a multiprocessing solution to be more performant than the equivalent single-processing one, the worker function must be sufficiently CPU-intensive such that running the function in parallel saves enough time to compensate for the additional overhead that multiprocessing incurs.
To make the worker function sufficiently CPU-intensive, I would batch up the lines to be translated into chunks so that each invocation of the worker function involves more CPU. You can play around with the CHUNK_SIZE value (read the comments that precede its definition). If you have sufficient memory, the larger the better.
from multiprocessing import Pool, cpu_count

def get_chunks():
    # If you have N processors,
    # then we need memory to hold 2 * (N - 1) chunks (one processor
    # is reserved for the main process).
    # The size of a chunk is CHUNK_SIZE * average-line-length.
    # If the average line length were 100, then a chunk would require
    # approximately 100_000 bytes of memory.
    # So if you had, for example, a 16GB machine with 8 processors,
    # you would have more
    # than enough memory for this CHUNK_SIZE.
    CHUNK_SIZE = 1_000
    with open(r'C:\Projects\orders.txt', 'r', encoding='utf-8') as f:
        chunk = []
        while True:
            line = f.readline()
            if line == '':  # end of file
                break
            chunk.append(line)
            if len(chunk) == CHUNK_SIZE:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

def worker(chunk):
    # This function must be sufficiently CPU-intensive
    # to justify multiprocessing.
    for idx in range(len(chunk)):
        chunk[idx] = chunk[idx].replace("ß", "s")
    return chunk

def main():
    with Pool(cpu_count() - 1) as pool, \
         open(r'C:\Projects\orders_new.txt', 'w', encoding='utf-8') as nf:
        for chunk in pool.imap_unordered(worker, get_chunks()):
            nf.write(''.join(chunk))
            # Or to be more memory efficient, but slower:
            #     for line in chunk:
            #         nf.write(line)

if __name__ == '__main__':
    main()
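Since the question mentions replacing more characters later, one option is to swap the single replace() call for str.translate with a table built once. This is only a sketch: the ß to s entry comes from the question, the other entries in REPLACEMENTS are hypothetical placeholders to show the shape of the table.

```python
# Built once at import time, so every worker process reuses the same table.
# Only the "ß" -> "s" entry is from the question; the rest are made-up examples.
REPLACEMENTS = str.maketrans({"ß": "s", "ä": "a", "ö": "o", "ü": "u"})

def worker(chunk):
    # One pass over each line handles every character in the table.
    return [line.translate(REPLACEMENTS) for line in chunk]
```

The values in the table may also be longer strings (for example "ss"), so the same worker covers one-to-many replacements.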

Count number of chunks

I'm reading in a large csv file using chunksize (pandas DataFrame), like so
reader = pd.read_csv('log_file.csv', low_memory = False, chunksize = 4e7)
I know I could just calculate the number of chunks with which it reads in the file but I would like to do it automatically and save the number of chunks into a variable, like so (in pseudo code)
number_of_chuncks = countChuncks(reader)
Any ideas?
You can use a generator expression to iterate through reader (a TextFileReader returned by read_csv when we define chunksize) and sum 1 for each iteration. Note that this consumes reader, so it has to be created again before the chunks can be processed:
number_of_chunks = sum(1 for chunk in reader)
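Alternatively, you can use a generator expression to count the number of lines in your file (similar logic to the first option, but iterating through the lines of the file), then divide this number by the chunksize and round up the result (with math.ceil). Here chunksize is the value passed to read_csv, and one line is subtracted on the assumption that the file has a single header row:
import math
chunksize = 40_000_000  # the value passed to read_csv (4e7)
with open('log_file.csv', 'r') as f:
    number_of_rows = sum(1 for row in f) - 1  # assumes one header line
number_of_chunks = math.ceil(number_of_rows / chunksize)
or
import math
with open('log_file.csv', 'r') as f:
    number_of_chunks = math.ceil((sum(1 for row in f) - 1) / chunksize)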
In my tests, the second solution showed a better performance than the first.

Parallel processing of large xml files in python

I have several large xml files that I am parsing (extracting some subset of the data and writing to file), but there are lots of files and lots of records per file, so I'm attempting to parallelize.
To start, I have a generator that pulls records from the file (this works fine):
def reader(files):
    n = 0
    for fname in files:
        chunk = ''
        with gzip.open(fname, "r") as f:
            for line in f:
                line = line.strip()
                if '<REC' in line or ('<REC' in chunk and line != '</REC>'):
                    chunk += line
                    continue
                if line == '</REC>':
                    chunk += line
                    n += 1
                    yield chunk
                    chunk = ''
A process function (details not relevant here, but also works fine):
def process(chunk, fields='all'):
    paper = etree.fromstring(chunk)
    #
    # extract some info from the xml
    #
    return result  # result is a string
Now of course the naive, non-parallel way to do this would be as simple as:
records = reader(files)
with open(output_filepath, 'w') as fout:
    for record in records:
        result = process(record)
        fout.write(result+'\n')
But now I want to parallelize this. I first considered doing a simple map-based approach, with each process handling one of the files, but the files are of radically different sizes (and some are really big), so this would be a pretty inefficient use of parallelization, I think. This is my current approach:
import multiprocessing as mp

def feed(queue, records):
    for rec in records:
        queue.put(rec)
    queue.put(None)

def calc(queueIn, queueOut):
    while True:
        rec = queueIn.get(block=True)
        if rec is None:
            queueOut.put('__DONE__')
            break
        result = process(rec)
        queueOut.put(result)

def write(queue, file_handle):
    records_logged = 0
    while True:
        result = queue.get()
        if result == '__DONE__':
            logger.info("{} --> ALL records logged ({})".format(file_handle.name,records_logged))
            break
        elif result is not None:
            file_handle.write(result+'\n')
            file_handle.flush()
            records_logged +=1
            if records_logged % 1000 == 0:
                logger.info("{} --> {} records complete".format(file_handle.name,records_logged))

nThreads = N
records = reader(filelist)
workerQueue = mp.Queue()
writerQueue = mp.Queue()
feedProc = mp.Process(target = feed , args = (workerQueue, records))
calcProc = [mp.Process(target = calc, args = (workerQueue, writerQueue)) for i in range(nThreads)]
writProc = mp.Process(target = write, args = (writerQueue, handle))
feedProc.start()
for p in calcProc:
    p.start()
writProc.start()
feedProc.join()
for p in calcProc:
    p.join()
writProc.join()
feedProc.terminate()
writProc.terminate()
for p in calcProc:
    p.terminate()
workerQueue.close()
writerQueue.close()
Now, this works in the sense that everything gets written to file, but then it just hangs when trying to join the processes at the end, and I'm not sure why. So, my main question is, what am I doing wrong here such that my worker processes aren't correctly terminating, or signaling that they're done?
I think I could solve this problem the "easy" way by adding timeouts to the calls to join but this (a) seems like a rather inelegant solution here, as there are clear completion conditions to the task (i.e. once every record in the file has been processed, we're done), and (b) I'm worried that this could introduce some problem (e.g. if I make the timeout too short, couldn't it terminate things before everything has been processed? And of course making it too long is just wasting time...).
I'm also willing to consider a totally different approach to parallelizing this if anyone has ideas (the queue just seemed like a good choice since the files are big and just reading and generating the raw records takes time).
Bonus question: I'm aware that this approach in no way guarantees that the output I'm writing to file will be in the same order as the original data. This is not a huge deal (sorting the reduced/processed data won't be too unwieldy), but maintaining order would be nice. So extra gratitude if anyone has a solution that preserves the original order.
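One likely cause of the hang, judging from the code above: feed() puts a single None on the queue, so only one of the N calc processes ever receives it. The others block in get() forever, and the writer stops at the first '__DONE__'. A minimal sketch of the usual fix is to send one sentinel per worker and have the writer wait for one marker per worker. The names run, n_workers and DONE are my own, and as a deliberate change from the question the feeder runs as a thread and the writer runs in the parent process.

```python
import multiprocessing as mp
import threading

DONE = "__DONE__"  # marker each worker sends to the writer when it exits

def feed(queue, records, n_workers):
    """Queue every record, then one None sentinel per worker."""
    for rec in records:
        queue.put(rec)
    for _ in range(n_workers):
        queue.put(None)

def calc(queue_in, queue_out, process):
    """Process records until a sentinel arrives, then notify the writer."""
    while True:
        rec = queue_in.get()
        if rec is None:
            queue_out.put(DONE)
            break
        queue_out.put(process(rec))

def write(queue, file_handle, n_workers):
    """Write results until every worker has reported DONE; return the count."""
    finished = 0
    written = 0
    while finished < n_workers:
        result = queue.get()
        if result == DONE:
            finished += 1
        else:
            file_handle.write(result + "\n")
            written += 1
    return written

def run(records, process, out_path, n_workers=4):
    """Wire the stages together. `process` must be a module-level function.

    No error handling: if process() raises in a worker, the writer waits forever.
    """
    work_q, result_q = mp.Queue(), mp.Queue()
    feeder = threading.Thread(target=feed, args=(work_q, records, n_workers))
    workers = [mp.Process(target=calc, args=(work_q, result_q, process))
               for _ in range(n_workers)]
    for p in workers:
        p.start()
    feeder.start()
    with open(out_path, "w") as fout:
        # Drain the result queue completely before joining the workers.
        written = write(result_q, fout, n_workers)
    feeder.join()
    for p in workers:
        p.join()
    return written
```

You would call run(reader(filelist), process, output_filepath) from under an if __name__ == '__main__': guard. On the bonus question: this queue version still does not preserve order. If order matters, Pool.imap yields results in the order of its input, so looping over pool.imap(process, records) and writing each result is a simpler alternative worth trying.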

Python multiprocessing Pool.map not faster than calling the function once

I have a very large list of strings (originally from a text file) that I need to process using python. Eventually I am trying to go for a map-reduce style of parallel processing.
I have written a "mapper" function and fed it to multiprocessing.Pool.map(), but it takes the same amount of time as simply calling the mapper function with the full set of data. I must be doing something wrong.
I have tried multiple approaches, all with similar results.
from multiprocessing import Pool

def initial_map(lines):
    results = []
    for line in lines:
        processed = line  # placeholder: process line (O(1) operation)
        results.append(processed)
    return results

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    partitions = chunks(lines, len(lines) // 8)
    results = pool.map(initial_map, partitions, 1)
So the chunks function makes a list of sublists of the original set of lines to give to the pool.map(), then it should hand these 8 sublists to 8 different processes and run them through the mapper function. When I run this I can see all 8 of my cores peak at 100%. Yet it takes 22-24 seconds.
When I simply run this (single process/thread):
lines = list(open("../../log.txt", 'r'))
results = initial_map(lines)
It takes about the same amount of time. ~24 seconds. I only see one process getting to 100% CPU.
I have also tried letting the pool split up the lines itself and have the mapper function only handle one line at a time, with similar results.
def initial_map(line):
    processed = line  # placeholder: process line (O(1) operation)
    return processed

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    pool.map(initial_map, lines)
~22 seconds.
Why is this happening? Parallelizing this should result in faster results, shouldn't it?
If the amount of work done in one iteration is very small, you're spending a big proportion of the time just communicating with your subprocesses, which is expensive. Instead, try to pass bigger slices of your data to the processing function. Something like the following:
slices = (data[i:i+100] for i in range(0, len(data), 100))
def process_slice(data):
    # initial_map here is the one-line-at-a-time version from the question
    return [initial_map(x) for x in data]
pool.map(process_slice, slices)
# and then itertools.chain the output to flatten it
(don't have my comp. so can't give you a full working solution nor verify what I said)
Edit: or see the 3rd comment on your question by #ubomb.
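For completeness, here is a self-contained sketch of the slicing idea with the flattening step spelled out. The body of initial_map is a stand-in (the question does not show the real per-line work), and make_slices and parallel_map are names I made up.

```python
import itertools
from multiprocessing import Pool

def initial_map(line):
    # Stand-in for the real per-line work from the question.
    return line.strip().upper()

def process_slice(lines):
    # One task handles a whole slice, so the pickling cost is paid per slice.
    return [initial_map(line) for line in lines]

def make_slices(data, size):
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_map(data, size=100, processes=8):
    with Pool(processes) as pool:
        nested = pool.map(process_slice, make_slices(data, size))
    # pool.map returns one list per slice; chain them back into one flat list.
    return list(itertools.chain.from_iterable(nested))
```

Call parallel_map(lines) from under an if __name__ == "__main__": guard. Whether this beats the single-process loop still depends on how expensive the per-line work is compared to sending the lines to the workers and back.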

What is the optimal way to process a very large (over 30GB) text file and also show progress

[newbie question]
Hi,
I'm working on a huge text file which is well over 30GB.
I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop using "for" my computer crashes and displays blue screen after about 10% of processing data.
I'm currently using this:
f = open(file_path,'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()
Also, how can I show overall progress of how much data has been crunched so far?
Thank you all very much.
File handles are iterable, and you should probably use a context manager. Try this:
with open(file_path, 'r') as fh:
    for line in fh:
        process(line)
That might be enough.
I use a function like this for a similar problem. You can wrap any iterable with it.
Change this:
for one_line in f.readlines():
to this:
# don't use readlines, it creates a big list of all data in memory rather than
# iterating one line at a time.
for one_line in progress_meter(f, 10000):
You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.
import time

def progress_meter(iterable, chunksize):
    """ Prints progress through iterable at chunksize intervals."""
    scan_start = time.time()
    since_last = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            print(idx)
            print('avg rate', idx / (time.time() - scan_start))
            print('inst rate', chunksize / (time.time() - since_last))
            since_last = time.time()
            print()
        yield val
Using readline requires finding the end of each line in your file. If some lines are very long, it might cause your interpreter to crash (not enough memory to buffer the full line).
In order to show progress you can check the file size, for example using:
import os
f = open(file_path, 'r')
fsize = os.fstat(f.fileno()).st_size
The progress of your task can then be the number of bytes processed divided by the file size times 100 to have a percentage.
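A minimal sketch of that percentage calculation, assuming the file is UTF-8 text. It opens the file in binary mode so the length of each line is its size in bytes, which matches the units of the file size. The name iter_with_progress is made up for illustration.

```python
import os

def iter_with_progress(path):
    """Yield (line, percent_done) pairs; percent_done is based on bytes read."""
    total = os.path.getsize(path)
    done = 0
    with open(path, "rb") as fh:
        for raw in fh:
            done += len(raw)  # bytes consumed so far, including line endings
            # Assumption: the file is UTF-8; change the codec if it is not.
            yield raw.decode("utf-8"), 100.0 * done / total
```

In a real loop you would probably print the percentage only every few thousand lines, as progress_meter does above, rather than on every line.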
