I have a large file (120 GB) consisting of one string per line. I would like to loop over the file line by line, replacing every German character ß with s. I have working code, but it is very slow, and in the future I will need to replace more German characters as well. So I have been trying to cut the file into 6 pieces (for my 6-core CPU) and add multiprocessing to speed the code up, but with no luck.
As the lines are not ordered, I do not care where the lines end up in the new file. Can somebody please help me?
My working slow code:
import re

with open(r'C:\Projects\orders.txt', 'r') as f, open(r'C:\Projects\orders_new.txt', 'w') as nf:
    for l in f:
        l = re.sub("ß", "s", l)
        nf.write(l)
For a multiprocessing solution to be more performant than the equivalent single-process one, the worker function must be sufficiently CPU-intensive that running it in parallel saves enough time to compensate for the additional overhead that multiprocessing incurs.
To make the worker function sufficiently CPU-intensive, I would batch up the lines to be translated into chunks so that each invocation of the worker function involves more CPU. You can play around with the CHUNK_SIZE value (read the comments that precede its definition). If you have sufficient memory, the larger the better.
from multiprocessing import Pool, cpu_count


def get_chunks():
    # If you have N processors, then we need memory to hold
    # 2 * (N - 1) chunks (one processor is reserved for the main process).
    # The size of a chunk is CHUNK_SIZE * average-line-length.
    # If the average line length were 100 bytes, then a chunk would require
    # approximately 100_000 bytes of memory.
    # So even a machine with only 16MB of free memory and 8 processors
    # would have more than enough memory for this CHUNK_SIZE.
    CHUNK_SIZE = 1_000

    with open(r'C:\Projects\orders.txt', 'r', encoding='utf-8') as f:
        chunk = []
        while True:
            line = f.readline()
            if line == '':  # end of file
                break
            chunk.append(line)
            if len(chunk) == CHUNK_SIZE:
                yield chunk
                chunk = []
        if chunk:
            yield chunk


def worker(chunk):
    # This function must be sufficiently CPU-intensive
    # to justify multiprocessing.
    for idx in range(len(chunk)):
        chunk[idx] = chunk[idx].replace("ß", "s")
    return chunk


def main():
    with Pool(cpu_count() - 1) as pool, \
            open(r'C:\Projects\orders_new.txt', 'w', encoding='utf-8') as nf:
        for chunk in pool.imap_unordered(worker, get_chunks()):
            nf.write(''.join(chunk))
            # Or, to be more memory efficient but slower:
            # for line in chunk:
            #     nf.write(line)


if __name__ == '__main__':
    main()
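Since the question mentions that more German characters will need replacing later, here is a hedged sketch (not part of the answer above) of how worker() could do several replacements in one pass with str.translate; the extra mappings are examples only, not taken from the question:
# Build the translation table once; extend the mapping as needed.
GERMAN_MAP = str.maketrans({'ß': 's', 'ä': 'ae', 'ö': 'oe', 'ü': 'ue'})

def worker(chunk):
    # Apply all replacements in a single pass over each line.
    return [line.translate(GERMAN_MAP) for line in chunk]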
Related
I have several large xml files that I am parsing (extracting some subset of the data and writing to file), but there are lots of files and lots of records per file, so I'm attempting to parallelize.
To start, I have a generator that pulls records from the file (this works fine):
import gzip

def reader(files):
    n = 0
    for fname in files:
        chunk = ''
        with gzip.open(fname, "r") as f:
            for line in f:
                line = line.strip()
                if '<REC' in line or ('<REC' in chunk and line != '</REC>'):
                    chunk += line
                    continue
                if line == '</REC>':
                    chunk += line
                    n += 1
                    yield chunk
                    chunk = ''
A process function (details not relevant here, but also works fine):
def process(chunk, fields='all'):
    paper = etree.fromstring(chunk)
    #
    # extract some info from the xml
    #
    return result  # result is a string
Now of course the naive, non-parallel way to do this would be as simple as:
records = reader(files)

with open(output_filepath, 'w') as fout:
    for record in records:
        result = process(record)
        fout.write(result + '\n')
But now I want to parallelize this. I first considered doing a simple map-based approach, with each process handling one of the files, but the files are of radically different sizes (and some are really big), so this would be a pretty inefficient use of parallelization, I think. This is my current approach:
import multiprocessing as mp

def feed(queue, records):
    for rec in records:
        queue.put(rec)
    queue.put(None)

def calc(queueIn, queueOut):
    while True:
        rec = queueIn.get(block=True)
        if rec is None:
            queueOut.put('__DONE__')
            break
        result = process(rec)
        queueOut.put(result)

def write(queue, file_handle):
    records_logged = 0
    while True:
        result = queue.get()
        if result == '__DONE__':
            logger.info("{} --> ALL records logged ({})".format(file_handle.name, records_logged))
            break
        elif result is not None:
            file_handle.write(result + '\n')
            file_handle.flush()
            records_logged += 1
            if records_logged % 1000 == 0:
                logger.info("{} --> {} records complete".format(file_handle.name, records_logged))

nThreads = N
records = reader(filelist)

workerQueue = mp.Queue()
writerQueue = mp.Queue()

feedProc = mp.Process(target=feed, args=(workerQueue, records))
calcProc = [mp.Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nThreads)]
writProc = mp.Process(target=write, args=(writerQueue, handle))

feedProc.start()
for p in calcProc:
    p.start()
writProc.start()

feedProc.join()
for p in calcProc:
    p.join()
writProc.join()

feedProc.terminate()
writProc.terminate()
for p in calcProc:
    p.terminate()

workerQueue.close()
writerQueue.close()
Now, this works in the sense that everything gets written to file, but then it just hangs when trying to join the processes at the end, and I'm not sure why. So, my main question is, what am I doing wrong here such that my worker processes aren't correctly terminating, or signaling that they're done?
I think I could solve this problem the "easy" way by adding timeouts to the calls to join, but this (a) seems like a rather inelegant solution, since the task has clear completion conditions (i.e. once every record in the file has been processed, we're done), and (b) I'm worried that it could introduce problems (e.g. if I make the timeout too short, couldn't it terminate things before everything has been processed? And of course making it too long just wastes time...).
I'm also willing to consider a totally different approach to parallelizing this if anyone has ideas (the queue just seemed like a good choice since the files are big and just reading and generating the raw records takes time).
Bonus question: I'm aware that this approach in no way guarantees that the output I'm writing to file will be in the same order as the original data. This is not a huge deal (sorting the reduced/processed data won't be too unwieldy), but maintaining order would be nice. So extra gratitude to anyone who has a solution that ensures the original order is preserved.
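As a hedged aside, pipelines like the one above are commonly shut down by sending one sentinel per worker and having the writer count one "done" marker per worker. A generic sketch of that pattern (it reuses the question's process() and is not a drop-in fix for the exact code above):
import multiprocessing as mp

def feed(queue, records, n_workers):
    for rec in records:
        queue.put(rec)
    for _ in range(n_workers):      # one stop signal per worker
        queue.put(None)

def calc(queue_in, queue_out):
    while True:
        rec = queue_in.get()
        if rec is None:
            queue_out.put(None)     # tell the writer this worker is done
            break
        queue_out.put(process(rec)) # process() as defined in the question

def write(queue_out, path, n_workers):
    done = 0
    with open(path, 'w') as fh:
        while done < n_workers:     # wait for every worker's sentinel
            result = queue_out.get()
            if result is None:
                done += 1
            else:
                fh.write(result + '\n')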
I am working on a tool that generates random data for testing purposes. See below the part of my code that is giving me grief. This works perfectly, and faster than conventional solutions (around 20 seconds), when the file is around 400 MB; however, once it reaches around 500 MB I get an out-of-memory error. How can I generate the contents and write them to the file progressively, holding no more than 10 MB in memory at a single time?
import os

def createfile(filename, size_kb):
    tbl = bytearray(range(256))
    numrand = os.urandom(size_kb * 1024)
    with open(filename, "wb") as fh:
        fh.write(numrand.translate(tbl))

createfile("file1.txt", 500 * 1024)
Any help will be greatly appreciated
You can write out chunks of 10 MB at a time, rather than generating the entire file in one go. As pointed out by @mhawke, the translate call is redundant and can be removed:
def createfile(filename, size_kb):
    chunks = size_kb / (1024 * 10)
    with open(filename, "wb") as fh:
        for iter in range(chunks):
            numrand = os.urandom(size_kb * 1024 / chunks)
            fh.write(numrand)
        numrand = os.urandom(size_kb * 1024 % chunks)
        fh.write(numrand)

createfile("c:/file1.txt", 500 * 1024)
Combining Jaco's and mhawke's answers and handling some float conversions, here is code that can generate GBs of data in less than 10 seconds:
import math
import os

def createfile(filename, size_kb):
    chunksize = 1024
    chunks = math.ceil(size_kb / chunksize)
    with open(filename, "wb") as fh:
        for iter in range(chunks):
            numrand = os.urandom(int(size_kb * 1024 / chunks))
            fh.write(numrand)
        numrand = os.urandom(int(size_kb * 1024 % chunks))
        fh.write(numrand)
Creates a 1 GB file in less than 8 seconds.
I'm trying to parallelize an application using multiprocessing. It takes in a very large csv file (64 MB to 500 MB), does some work line by line, and then outputs a small, fixed-size file.
Currently I do a list(file_obj), which unfortunately loads the whole file into memory (I think), and then I break that list up into n parts, n being the number of processes I want to run. I then do a pool.map() on the broken-up lists.
This seems to have a really, really bad runtime in comparison to a single-threaded, just-open-the-file-and-iterate-over-it approach. Can someone suggest a better solution?
Additionally, I need to process the rows of the file in groups which preserve the value of a certain column. These groups of rows can themselves be split up, but no group should contain more than one value for this column.
list(file_obj) can require a lot of memory when file_obj is large. We can reduce that memory requirement by using itertools to pull out chunks of lines as we need them.
In particular, we can use
reader = csv.reader(f)
chunks = itertools.groupby(reader, keyfunc)
to split the file into processable chunks, and
groups = [list(chunk) for key, chunk in itertools.islice(chunks, num_chunks)]
result = pool.map(worker, groups)
to have the multiprocessing pool work on num_chunks chunks at a time.
By doing so, we only need roughly enough memory to hold a few chunks (num_chunks of them) at a time, instead of the whole file.
import multiprocessing as mp
import itertools
import time
import csv

def worker(chunk):
    # `chunk` will be a list of CSV rows all with the same name column
    # replace this with your real computation
    # print(chunk)
    return len(chunk)

def keyfunc(row):
    # `row` is one row of the CSV file.
    # replace this with the name column.
    return row[0]

def main():
    pool = mp.Pool()
    largefile = 'test.dat'
    num_chunks = 10
    results = []
    with open(largefile) as f:
        reader = csv.reader(f)
        chunks = itertools.groupby(reader, keyfunc)
        while True:
            # make a list of num_chunks chunks
            groups = [list(chunk) for key, chunk in
                      itertools.islice(chunks, num_chunks)]
            if groups:
                result = pool.map(worker, groups)
                results.extend(result)
            else:
                break
    pool.close()
    pool.join()
    print(results)

if __name__ == '__main__':
    main()
I would keep it simple. Have a single program open the file and read it line by line. You can choose how many files to split it into, open that many output files, and write every line to the next file in turn. This will split the file into n roughly equal parts. You can then run a Python program against each of the files in parallel, along the lines of the sketch below.
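A rough sketch of the splitting step described above (the helper name and output file names are illustrative, not from the answer):
def split_file(input_path, n):
    # Open n output files and write each input line to the next one
    # in round-robin order.
    outputs = [open('part_{}.txt'.format(i), 'w') for i in range(n)]
    try:
        with open(input_path) as f:
            for i, line in enumerate(f):
                outputs[i % n].write(line)
    finally:
        for out in outputs:
            out.close()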
[newbie question]
Hi,
I'm working on a huge text file which is well over 30GB.
I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop over it using "for", my computer crashes and displays a blue screen after about 10% of the data has been processed.
I'm currently using this:
f = open(file_path, 'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()
Also, how can I show the overall progress of how much data has been crunched so far?
Thank you all very much.
File handles are iterable, and you should probably use a context manager. Try this:
with open(file_path, 'r') as fh:
    for line in fh:
        process(line)
That might be enough.
I use a function like this for a similar problem. You can wrap up any iterable with it.
Change this:
for one_line in f.readlines():
to this:
# don't use readlines; it creates a big list of all the data in memory
# rather than iterating one line at a time.
for one_line in progress_meter(f, 10000):
You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.
import time

def progress_meter(iterable, chunksize):
    """ Prints progress through iterable at chunksize intervals."""
    scan_start = time.time()
    since_last = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            print idx
            print 'avg rate', idx / (time.time() - scan_start)
            print 'inst rate', chunksize / (time.time() - since_last)
            since_last = time.time()
            print
        yield val
Using readline requires Python to find the end of each line in your file. If some lines are very long, this can lead your interpreter to crash (it may not have enough memory to buffer the full line).
In order to show progress, you can check the file size, for example using:
import os

f = open(file_path, 'r')
fsize = os.fstat(f.fileno()).st_size
The progress of your task can then be the number of bytes processed divided by the file size, times 100, to get a percentage.
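For example, a minimal sketch of that percentage calculation, reusing the question's file_path and do_some_processing() and counting each line's length as an approximation of the bytes read so far:
import os

f = open(file_path, 'r')
fsize = os.fstat(f.fileno()).st_size
bytes_read = 0
for i, line in enumerate(f):
    do_some_processing(line)    # assumed from the question
    bytes_read += len(line)     # approximate bytes consumed so far
    if i % 100000 == 0:
        print('%.1f%% done' % (100.0 * bytes_read / fsize))
f.close()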
Trying to load a file into Python. It's a very big file (1.5 GB), but I have the memory available and I just want to do this once (hence the use of Python; I just need to sort the file one time, so Python was an easy choice).
My issue is that loading this file results in way too much memory usage. When I've loaded about 10% of the lines into memory, Python is already using 700 MB, which is clearly too much. At around 50% the script hangs, using 3.03 GB of real memory (and slowly rising).
I know this isn't the most efficient method of sorting a file (memory-wise) but I just want it to work so I can move on to more important problems :D So, what is wrong with the following python code that's causing the massive memory usage:
print 'Loading file into memory'
input_file = open(input_file_name, 'r')
input_file.readline()  # Toss out the header
lines = []
totalLines = 31164015.0
currentLine = 0.0
printEvery100000 = 0

for line in input_file:
    currentLine += 1.0
    lined = line.split('\t')
    printEvery100000 += 1
    if printEvery100000 == 100000:
        print str(currentLine / totalLines)
        printEvery100000 = 0
    lines.append((lined[timestamp_pos].strip(), lined[personID_pos].strip(),
                  lined[x_pos].strip(), lined[y_pos].strip()))

input_file.close()
print 'Done loading file into memory'
EDIT: In case anyone is unsure, the general consensus seems to be that each variable allocated eats up more and more memory. I "fixed" it in this case by 1) calling readlines(), which still loads all the data, but only has one string-object overhead per line; this loads the entire file using about 1.7 GB. Then, when I call lines.sort(), I pass a key function that splits on tabs and returns the right column value, converted to an int (a sketch of this appears below). This is slow computationally, and memory-intensive overall, but it works. Learned a ton about variable allocation overhead today :D
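A minimal sketch of the key-function sort described in the EDIT, assuming timestamp_pos is the index of the column to sort on (as in the original code):
with open(input_file_name, 'r') as input_file:
    input_file.readline()           # toss out the header
    lines = input_file.readlines()  # one string per line, as in the EDIT

# sort on the tab-separated column at timestamp_pos, converted to an int
lines.sort(key=lambda line: int(line.split('\t')[timestamp_pos]))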
Here is a rough estimate of the memory needed, based on the constants derived from your example. At a minimum you have to figure the Python internal object overhead for each split line, plus the overhead for each string.
It estimates 9.1 GB to store the file in memory, assuming the following constants, which are off by a bit, since you're only using part of each line:
1.5 GB file size
31,164,015 total lines
each line split into a list with 4 pieces
Code:
import sys

def sizeof(lst):
    return sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)

GIG = 1024**3
file_size = 1.5 * GIG
lines = 31164015
num_cols = 4
avg_line_len = int(file_size / float(lines))

val = 'a' * (avg_line_len / num_cols)
lst = [val] * num_cols

line_size = sizeof(lst)

print 'avg line size: %d bytes' % line_size
print 'approx. memory needed: %.1f GB' % ((line_size * lines) / float(GIG))
Returns:
avg line size: 312 bytes
approx. memory needed: 9.1 GB
I don't know about the analysis of the memory usage, but you might try this to get it to work without running out of memory. You'll sort into a new file which is accessed using a memory mapping (I've been led to believe this will work efficiently [in terms of memory]). Mmap has some OS specific workings, I tested this on Linux (very small scale).
This is the basic code; to make it run with decent time efficiency you'd probably want to do a binary search on the sorted file to find where to insert each line, otherwise it will probably take a long time.
You can find a file-seeking binary search algorithm in this question; a rough sketch is also included after the code below.
Hopefully a memory efficient way of sorting a massive file by line:
import os
from mmap import mmap

input_file = open('unsorted.txt', 'r')
output_file = open('sorted.txt', 'w+')

# need to provide something in order to be able to mmap the file
# so we'll just copy the first line over
output_file.write(input_file.readline())
output_file.flush()

mm = mmap(output_file.fileno(), os.stat(output_file.name).st_size)
cur_size = mm.size()

for line in input_file:
    mm.seek(0)
    tup = line.split("\t")
    while True:
        cur_loc = mm.tell()
        o_line = mm.readline()
        o_tup = o_line.split("\t")
        if o_line == '' or tup[0] < o_tup[0]:  # EOF or we found our spot
            mm.resize(cur_size + len(line))
            mm[cur_loc + len(line):] = mm[cur_loc:cur_size]
            mm[cur_loc:cur_loc + len(line)] = line
            cur_size += len(line)
            break
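As mentioned above, a rough sketch of a file-seeking binary search over the sorted region of the mmap (written with Python 3 bytes semantics, assuming newline-terminated lines and a tab-separated sort key, so it would need minor adaptation to slot into the code above):
def find_insert_offset(mm, key, region_size):
    # Binary-search the sorted, newline-terminated region mm[0:region_size]
    # for the byte offset at which a line whose first tab-separated field
    # is `key` should be inserted.
    lo, hi = 0, region_size
    while lo < hi:
        mid = (lo + hi) // 2
        # back up to the start of the line containing `mid`
        nl = mm.rfind(b'\n', lo, mid)
        start = nl + 1 if nl != -1 else lo
        end = mm.find(b'\n', start, hi)
        if end == -1:
            end = hi
        line_key = mm[start:end].split(b'\t')[0]
        if line_key < key:
            lo = min(end + 1, hi)   # insertion point lies after this line
        else:
            hi = start              # insertion point is at or before this line
    return lo
The main loop could then mm.seek() to the returned offset instead of scanning every line from the start.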