Optimize python program to parse two large files at the same time

I am trying to parse two large files with Python3 at the same time. As shown here:
dict = {}
row = {}
with open(file1, "r") as f1, open(file2, "r") as f2:
    zipped = zip(f1, f2)
    for line_f1, line_f2 in zipped:
        # parse the lines and save the line information in a dictionary
        row = {"ID_1":line_f1[0], "ID_2":line_f2[0], ...}
        # This process takes roughly 0.0005s each time
        # it parses each pair of lines at once and returns an output
        # it doesn't depend on previous lines or lines after
        output = process(row)
        # output is a string, add it to dict
        if output in dict:
            dict[output] += 1
        else:
            dict[output] = 1
return dict
When I tested the above code with two smaller text files (30,000 lines each, file size = 13M), it took roughly 150s to finish the loop.
When I tested with two large text files (9,000,000 lines each, file size = 3.8G) without the process step in the loop, it took roughly 670s.
When I tested with the same two large text files with the process step, I measured roughly 60s for every 10,000 items. The time per batch didn't grow as the number of iterations got large.
However, when I submit this job to a shared cluster it takes more than 36 hours for one pair of large files to finish processing. I am trying to figure out if there is any other way to process the files so it can be faster. Any suggestions would be appreciated.
Thanks in advance!

This is just a hypothesis, but your process could be wasting its allocated CPU slot every time it triggers an I/O to get a pair of lines. You could try reading groups of lines at a time and processing in chunks so you can make the most of each CPU time slot you get on the shared cluster.
from collections import deque
chunkSize = 1000000 # number of characters in each chunk (you will need to adjust this)
chunk1 = deque([""]) # buffered lines from 1st file
chunk2 = deque([""]) # buffered lines from 2nd file
with open(file1, "r") as f1, open(file2, "r") as f2:
    while chunk1 and chunk2:
        line_f1 = chunk1.popleft()
        if not chunk1:
            # last queued line may be incomplete: complete it with the next chunk
            line_f1,*more = (line_f1+f1.read(chunkSize)).split("\n")
            chunk1.extend(more)
        line_f2 = chunk2.popleft()
        if not chunk2:
            line_f2,*more = (line_f2+f2.read(chunkSize)).split("\n")
            chunk2.extend(more)
        # process line_f1, line_f2
        ....
The way this works is by reading a chunk of characters (which must be larger than your longest line) and breaking it down into lines. The lines are placed in a queue for processing.
Because the chunksize is expressed in number of characters, the last line in the queue may be incomplete.
To ensure that lines are complete before being processed, another chunk is read when we get to the last line in the queue. The additional characters are added to the end of the incomplete line and the line splitting is performed on the combined string. Because we concatenated the last (incomplete) line, the .split("\n") function always applies to a chunk of text that begins at a line boundary.
The process continues with the (now completed) last line and the rest of the lines are added to the queue.
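The same idea can also be packaged as a generator, so that the main loop keeps its original zip() shape. This is only a sketch under assumptions: iter_lines is a hypothetical helper name, the in-memory io.StringIO objects stand in for the two real files, and the tiny chunk sizes are chosen only to exercise the incomplete-line handling.

```python
import io

def iter_lines(f, chunk_size=1_000_000):
    """Yield lines (without the trailing newline) from a text file object,
    reading chunk_size characters at a time."""
    tail = ""  # possibly incomplete last line of the previous chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        # Everything before the final "\n" is complete; the rest is carried over.
        *lines, tail = (tail + chunk).split("\n")
        yield from lines
    if tail:  # the file did not end with a newline
        yield tail

# In-memory stand-ins for file1 and file2 (assumption for the demo)
f1 = io.StringIO("a1\na2\na3\n")
f2 = io.StringIO("b1\nb2\nb3\n")
pairs = list(zip(iter_lines(f1, chunk_size=4), iter_lines(f2, chunk_size=5)))
print(pairs)
```

With real files, the loop would read `for line_f1, line_f2 in zip(iter_lines(f1), iter_lines(f2)):`. Whether this actually helps on a shared cluster depends on how I/O is scheduled there, so it is worth timing both variants.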

Related

Most efficient way to convert large .txt files (size >30GB) .txt into .csv after pre-processing using Python

I have data in a .txt file that looks like this (let's name it "myfile.txt"):
28807644'~'0'~'Maun FCU'~'US#####28855353'~'0'~'WNB Holdings LLC'~'US#####29212330'~'0'~'Idaho First Bank'~'US#####29278777'~'0'~'Republic Bank of Arizona'~'US#####29633181'~'0'~'Friendly Hills Bank'~'US#####29760145'~'0'~'The Freedom Bank of Virginia'~'US#####100504846'~'0'~'Community First Fund Federal Credit Union'~'US#####
I have tried a couple of ways to convert this .txt into a .csv. One of them was using the csv library, but since I like pandas a lot, I used the following:
import pandas as pd
import time
#time at the start of program is noted
start = time.time()
# We set the path where our file is located and read it
path = r'myfile.txt'
f = open(path, 'r')
content = f.read()
# We replace undesired strings and introduce a breakline.
content_filtered = content.replace("#####", "\n").replace("'", "")
# We read everything in columns with the separator "~"
df = pd.DataFrame([x.split('~') for x in content_filtered.split('\n')], columns = ['a', 'b', 'c', 'd'])
# We print the dataframe into a csv
df.to_csv(path.replace('.txt', '.csv'), index = None)
end = time.time()
#total time taken to print the file
print("Execution time in seconds: ",(end - start))
This takes about 35 seconds to process a 300MB file, and I can accept that kind of performance. However, I'm trying to do the same for a much larger file (35GB), and it produces a MemoryError.
I tried using the csv library, but the results were similar. I attempted putting everything into a list and afterwards writing it to a CSV:
import csv
# We write to CSV
with open(path.replace('.txt', '.csv'), "w") as outfile:
    write = csv.writer(outfile)
    write.writerows(split_content)
Results were similar, not a huge improvement. Is there a way or methodology I can use to convert VERY large .txt files into .csv? Likely above 35GB?
I'd be happy to read any suggestions you may have, thanks in advance!

I took your sample string, and made a sample file by multiplying that string by 100 million (something like your_string*1e8...) to get a test file that is 31GB.
Following @Grismar's suggestion of chunking, I made the following, which processes that 31GB file in ~2 minutes, with a peak RAM usage depending on the chunk size.
The complicated part is keeping track of the field and record separators, which are multiple characters, and will certainly span across a chunk, and thus be truncated.
My solution is to inspect the end of each chunk and see if it has a partial separator. If it does, that partial is removed from the end of the current chunk, the current chunk is written-out, and the partial becomes the beginning of (and should be completed by) the next chunk:
CHUNK_SZ = 1024 * 1024
FS = "'~'"
RS = '#####'
# With chars repeated in the separators, check most specific (least ambiguous)
# to least specific (most ambiguous) to definitively catch a partial with the
# fewest number of checks
PARTIAL_RSES = ['####', '###', '##', '#']
PARTIAL_FSES = ["'~", "'"]
ALL_PARTIALS = PARTIAL_FSES + PARTIAL_RSES
f_out = open('out.csv', 'w')
f_out.write('a,b,c,d\n')
f_in = open('my_file.txt')
line = ''
while True:
    # Read chunks till no more, then break out
    chunk = f_in.read(CHUNK_SZ)
    if not chunk:
        break
    # Any previous partial separator, plus new chunk
    line += chunk
    # Check end-of-line for a partial FS or RS; only when separators are more than one char
    final_partial = ''
    if line.endswith(FS) or line.endswith(RS):
        pass # Write-out will replace complete FS or RS
    else:
        for partial in ALL_PARTIALS:
            if line.endswith(partial):
                final_partial = partial
                line = line[:-len(partial)]
                break
    # Process/write chunk
    f_out.write(line
                .replace(FS, ',')
                .replace(RS, '\n'))
    # Add partial back, to be completed next chunk
    line = final_partial
# Clean up
f_in.close()
f_out.close()

Since your code just does straight up replacement, you could just read through all the data sequentially and detect parts that need replacing as you go:
def process(fn_in, fn_out, columns):
    new_line = b'#####'
    with open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns)+'\n').encode())
        i = 0
        with open(fn_in, "rb") as f_in:
            while (b := f_in.read(1)):
                if ord(b) == new_line[i]:
                    # keep matching the newline block
                    i += 1
                    if i == len(new_line):
                        # if matched entirely, write just a newline
                        f_out.write(b'\n')
                        i = 0
                    # write nothing while matching
                    continue
                elif i > 0:
                    # if you reach this, it was a partial match, write it
                    f_out.write(new_line[:i])
                    i = 0
                if b == b"'":
                    pass
                elif b == b"~":
                    f_out.write(b',')
                else:
                    # write the byte if no match
                    f_out.write(b)
process('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])
That does it pretty quickly. You may be able to improve performance by reading in chunks, but this is pretty quick all the same.
This approach has the advantage over yours that it holds almost nothing in memory, but it does very little to optimise reading the file fast.
Edit: there was a big mistake in an edge case, which I realised after re-reading, fixed now.

Just to share an alternative way, based on convtools.
This solution is faster than the OP's, but ~7 times slower than Zach's (Zach works with str chunks, while this one works with row tuples, reading via csv.reader).
Still, this approach may be useful as it lets you tap into stream processing and work with columns, rearrange them, add new ones, etc.
from convtools import conversion as c
from convtools.contrib.fs import split_buffer
from convtools.contrib.tables import Table
def get_rows(filename):
    with open(filename, "r") as f:
        for row in split_buffer(f, "#####"):
            yield row.replace("'", "")
Table.from_csv(
    get_rows("tmp.csv"), dialect=Table.csv_dialect(delimiter="~")
).into_csv("tmp_out.csv", include_header=False)

Replacing string with id using dictionary in python

I have a dictionary file that contains a word in each line.
titles-sorted.txt
a&a
a&b
a&c_bus
a&e
a&f
a&m
....
For each word, its line number is the word's id.
Then I have another file that contains a set of words separated by tab in each line.
a.txt
a_15 a_15_highway_(sri_lanka) a_15_motorway a_15_motorway_(germany) a_15_road_(sri_lanka)
I'd like to replace all of the words by id if it exists in the dictionary, so that the output looks like,
3454 2345 123 5436 322 ....
So I wrote such python code to do this:
f = open("titles-sorted.txt")
lines = f.readlines()
titlemap = {}
nr = 1
for l in lines:
    l = l.replace("\n", "")
    titlemap[l.lower()] = nr
    nr+=1
fw = open("a.index", "w")
f = open("a.txt")
lines = f.readlines()
for l in lines:
    tokens = l.split("\t")
    if tokens[0] in titlemap.keys():
        fw.write(str(titlemap[tokens[0]]) + "\t")
    for t in tokens[1:]:
        if t in titlemap.keys():
            fw.write(str(titlemap[t]) + "\t")
    fw.write("\n")
fw.close()
f.close()
But this code is ridiculously slow, so it makes me suspicious if I have done everything right.
Is this an efficient way to do this?

The write loop contains a lot of calls to write, which are usually inefficient. You can probably speed things up by writing only once per line (or once per file if the file is small enough):
tokens = l.split("\t")
fw.write('\t'.join(str(titlemap[t]) for t in tokens if t in titlemap))
fw.write("\n")
or even:
lines = []
for l in f:
    lines.append('\t'.join(str(titlemap[t]) for t in l.split('\t') if t in titlemap))
fw.write('\n'.join(lines))
Also, if your tokens are used more than once, you can save time by converting the ids to strings when you read them:
titlemap = {l.strip().lower(): str(index) for index, l in enumerate(f, start=1)}

So, I suspect this differs based on the operating system you're running on and the specific Python implementation (someone wiser than I may be able to provide some clarity here), but I have a suspicion about what is going on:
Every time you call write, some amount of your desired write request gets written to a buffer, and then once the buffer is full, this information is written to file. The file needs to be fetched from your hard disk (as it doesn't exist in main memory). So your computer pauses while it waits the several milliseconds that it takes to fetch the block from the hard disk and write to it. On the other hand, you can do the parsing of the string and the lookup in your hashmap in a couple of nanoseconds, so you spend a lot of time waiting for the write request to finish!
Instead of writing immediately, what if you kept a list of the lines that you wanted to write and then only wrote them at the end, all in a row? Or, if you're handling a huge file that will exceed the capacity of your main memory, you could write it out once you have parsed a certain number of lines.
This allows the writing to disk to be optimized, as you can write multiple blocks at a time (again, this depends on how Python and the operating system handle the write call).
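A minimal sketch of that batching idea follows. The helper name write_batched and the batch size are assumptions made for illustration, and an in-memory io.StringIO stands in for the real output file. Python file objects are already buffered, so the gain may turn out to be modest; it is worth measuring.

```python
import io

def write_batched(lines, out, batch_size=10_000):
    """Write an iterable of strings to `out`, using one write call per batch."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) >= batch_size:
            out.write("".join(batch))
            batch.clear()
    if batch:  # write whatever is left after the loop
        out.write("".join(batch))

# Demo with an in-memory file and a deliberately small batch size
out = io.StringIO()
write_batched((f"{n}\n" for n in range(5)), out, batch_size=2)
result = out.getvalue()
print(result, end="")
```

Memory use is bounded by batch_size lines, which covers the "huge file" case mentioned above.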

If we apply the suggestions so far and clean up your code some more (e.g. remove unnecessary .keys() calls), is the following still too slow for your needs?
title_map = {}
token_file = open("titles-sorted.txt")
for number, line in enumerate(token_file):
    title_map[line.rstrip().lower()] = str(number + 1)
token_file.close()
input_file = open("a.txt")
output_file = open("a.index", "w")
for line in input_file:
    tokens = line.split("\t")
    if tokens[0] in title_map:
        output_list = [title_map[tokens[0]]]
        output_list.extend(title_map[token] for token in tokens[1:] if token in title_map)
        output_file.write("\t".join(output_list) + "\n")
output_file.close()
input_file.close()
If it's still too slow, give us slightly more data to work with including an estimate of the number of lines in each of your two input files.

Splitting a CSV file into equal parts?

I have a large CSV file that I would like to split into a number that is equal to the number of CPU cores in the system. I want to then use multiprocess to have all the cores work on the file together. However, I am having trouble even splitting the file into parts. I've looked all over google and I found some sample code that appears to do what I want. Here is what I have so far:
def split(infilename, num_cpus=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    total_file_size = os.path.getsize(infilename)
    print total_file_size
    files = list()
    with open(infilename, 'rb') as infile:
        for i in xrange(num_cpus):
            files.append(tempfile.TemporaryFile())
            this_file_size = 0
            while this_file_size < 1.0 * total_file_size / num_cpus:
                files[-1].write(infile.read(READ_BUFFER))
                this_file_size += READ_BUFFER
            files[-1].write(infile.readline()) # get the possible remainder
            files[-1].seek(0, 0)
    return files
files = split("sample_simple.csv")
print len(files)
for ifile in files:
    reader = csv.reader(ifile)
    for row in reader:
        print row
The two prints show the correct file size and that it was split into 4 pieces (my system has 4 CPU cores).
However, the last section of the code that prints all the rows in each of the pieces gives the error:
for row in reader:
_csv.Error: line contains NULL byte
I tried printing the rows without running the split function and it prints all the values correctly. I suspect the split function has added some NULL bytes to the resulting 4 file pieces but I'm not sure why.
Does anyone know if this a correct and fast method to split the file? I just want resulting pieces that can be read successfully by csv.reader.

As I said in a comment, csv files would need to be split on row (or line) boundaries. Your code doesn't do this and potentially breaks them up somewhere in the middle of one — which I suspect is the cause of your _csv.Error.
The following avoids doing that by processing the input file as a series of lines. I've tested it and it seems to work standalone, in the sense that it divides the sample file into chunks of approximately equal size (only approximately, because it's unlikely that a whole number of rows will fit exactly into a chunk).
Update
This is a substantially faster version of the code than I originally posted. The improvement is because it now uses the temp file's own tell() method to determine the constantly changing length of the file as it's being written instead of calling os.path.getsize(), which eliminated the need to flush() the file and call os.fsync() on it after each row is written.
import csv
import multiprocessing
import os
import tempfile
def split(infilename, num_chunks=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'in_file_size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'target chunk_size:', chunk_size
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    temp_file.write(infile.next())
                except StopIteration: # end of infile
                    break
            temp_file.seek(0) # rewind
            files.append(temp_file)
    return files
files = split("sample_simple.csv", num_chunks=4)
print 'number of files created: {}'.format(len(files))
for i, ifile in enumerate(files, start=1):
    print 'size of temp file {}: {}'.format(i, os.path.getsize(ifile.name))
    print 'contents of file {}:'.format(i)
    reader = csv.reader(ifile)
    for row in reader:
        print row
    print ''
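The code above targets Python 2. As a rough Python 3 sketch of the same principle (split only at line boundaries), the hypothetical helper below computes (start, length) byte ranges instead of copying data into temp files. The function name and the demo file are assumptions for illustration. Like the code above, it assumes no quoted field contains an embedded newline, and it may return more or fewer ranges than requested.

```python
import os
import tempfile

def line_aligned_ranges(path, num_chunks):
    """Return (start, length) byte ranges that each end at a line boundary."""
    size = os.path.getsize(path)
    target = max(1, size // num_chunks)  # approximate bytes per chunk
    ranges = []
    with open(path, "rb") as f:
        start = 0
        while start < size:
            # Jump to the last byte of the target range, then finish that line.
            f.seek(min(start + target - 1, size))
            f.readline()
            end = f.tell()
            ranges.append((start, end - start))
            start = end
    return ranges

# Demo on a small throwaway file
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n3,4\n5,6\n")
    demo_path = tmp.name
ranges = line_aligned_ranges(demo_path, num_chunks=2)
with open(demo_path, "rb") as f:
    pieces = [f.read(length) for _, length in ranges]
os.remove(demo_path)
print(ranges, pieces)
```

Each worker process could then open the file itself, seek to its start offset, and read its length in bytes, which avoids writing the data a second time.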

Python: Seeking to EOL in file not working

I have this method:
def get_chunksize(path):
    """
    Breaks a file into chunks and yields the chunk sizes.
    Number of chunks equals the number of available cores.
    Ensures that each chunk ends at an EOL.
    """
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size/cores # gives truncated integer
    f = open(path)
    while 1:
        start = f.tell()
        f.seek(chunksize, 1) # Go to the next chunk
        s = f.readline() # Ensure the chunk ends at the end of a line
        yield start, f.tell()-start
        if not s:
            break
It is supposed to break a file into chunks and return the start of the chunk (in bytes) and the chunk size.
Crucially, the end of a chunk should correspond to the end of a line (which is why the f.readline() behaviour is there), but I am finding that my chunks are not seeking to an EOL at all.
The purpose of the method is to then read chunks which can be passed to a csv.reader instance (via StringIO) for further processing.
I've been unable to spot anything obviously wrong with the function...any ideas why it is not moving to the EOL?
I came up with this rather clunky alternative:
def line_chunker(path):
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size/cores # gives truncated integer
    f = open(path)
    while True:
        part = f.readlines(chunksize)
        yield csv.reader(StringIO("".join(part)))
        if not part:
            break
This will split the file into chunks with a csv reader for each chunk, but the last chunk is always empty (??) and having to join the list of strings back together is rather clunky.

if not s:
    break
Instead of looking at s to see if you're at the end of the file, you should check whether you've reached the end of the file by using:
if size == f.tell(): break
This should fix it. I wouldn't depend on a CSV file having a single record per line though. I've worked with several CSV files that have strings with newlines:
first,last,message
sue,ee,hello
bob,builder,"hello,
this is some text
that I entered"
jim,bob,I'm not so creative...
Notice the 2nd record (bob) spans 3 lines. csv.reader can handle this. If the idea is to do some CPU-intensive work on a CSV, I'd create an array of threads, each with a buffer of n records, and have the csv.reader pass a record to each thread round-robin, skipping a thread if its buffer is full.
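To make the multi-line point concrete, here is a small sketch that feeds the sample above to csv.reader from an in-memory buffer (the io.StringIO stand-in is an assumption for the demo; with a real file you would open it with newline=""). Six physical lines come back as four records, which is why splitting on raw line boundaries can cut a record in half.

```python
import csv
import io

text = (
    "first,last,message\n"
    "sue,ee,hello\n"
    'bob,builder,"hello,\n'
    "this is some text\n"
    'that I entered"\n'
    "jim,bob,I'm not so creative...\n"
)

# newline="" leaves line endings untouched so the csv module can handle them
rows = list(csv.reader(io.StringIO(text, newline="")))
for row in rows:
    print(row)
```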
Hope this helps - enjoy.

How can I read large text files line by line, without loading them into memory? [duplicate]

I want to read a large file (>5GB), line by line, without loading its entire contents into memory. I cannot use readlines() since it creates a very large list in memory.

Use a for loop on a file object to read it line-by-line. Use with open(...) to let a context manager ensure that the file is closed after reading:
with open("log.txt") as infile:
    for line in infile:
        print(line)

All you need to do is use the file object as an iterator.
for line in open("log.txt"):
    do_something_with(line)
Even better is using context manager in recent Python versions.
with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)
This will automatically close the file as well.

Please try this:
with open('filename','r',buffering=100000) as f:
    for line in f:
        print(line)

An old school approach:
fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()

You are better off using an iterator instead.
Relevant: fileinput — Iterate over lines from multiple input streams.
From the docs:
import fileinput
for line in fileinput.input("filename", encoding="utf-8"):
    process(line)
This will avoid copying the whole file into memory at once.

Here's what you do if you don't have newlines in the file:
with open('large_text.txt') as f:
    while True:
        c = f.read(1024)
        if not c:
            break
        print(c,end='')

I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So, I recreated the cp command using line by line reading and writing. It's CRAZY FAST.
#!/usr/bin/env python3.6
import sys
with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)

The blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.
dask.dataframe takes care of chunking internally, supports many parallelisable operations and allows you to export slices back to pandas easily for in-memory operations.
import dask.dataframe as dd
df = dd.read_csv('filename.csv')
df.head(10) # return first 10 rows
df.tail(10) # return last 10 rows
# iterate rows
for idx, row in df.iterrows():
    ...
# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()
# slice by column
df[df.my_field=='XYZ'].compute()

Here's the code for loading text files of any size without causing memory issues.
It supports gigabyte-sized files.
https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d
Download the file data_loading_utils.py and import it into your code.
Usage:
import data_loading_utils
file_name = 'file_name.ext'
CHUNK_SIZE = 1000000
def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data, data is one single line of the file
        pass
    else:
        # end of file reached
        pass
data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)
The process_lines function is the callback. It will be called for all the lines, with the parameter data representing one single line of the file at a time.
You can configure the variable CHUNK_SIZE depending on your machine's hardware configuration.

How about this?
Divide your file into chunks rather than reading it line by line: when you read a file, your operating system caches the data that follows, and if you read the file one line at a time you are not making efficient use of that cached information.
Instead, divide the file into chunks, load a whole chunk into memory, and then do your processing.
import os
def chunks(fh, size=1024):
    while 1:
        startat = fh.tell()
        print(startat) # file object's current position from the start
        fh.seek(size, 1) # offset from current position --> 1
        data = fh.readline()
        yield startat, fh.tell() - startat # doesn't store whole list in memory
        if not data:
            break
if os.path.isfile(fname):
    try:
        fh = open(fname, 'rb')
    except IOError as e: # file --> permission denied
        print("I/O error({0}): {1}".format(e.errno, e.strerror))
    except Exception as e1: # handle other exceptions such as attribute errors
        print("Unexpected error: {0}".format(e1))
    for ele in chunks(fh):
        fh.seek(ele[0]) # startat
        data = fh.read(ele[1]) # endat
        print(data)

Thank you! I have recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b'", which I guess means it was in binary format. Using decode('utf-8') changed it to ASCII.
Then I had to remove a "=\n" in the middle of each line.
Then I split the lines at the new line.
b_data=(fh.read(ele[1]))#endat This is one chunk of ascii data in binary format
a_data=((binascii.b2a_qp(b_data)).decode('utf-8')) #Data chunk in 'split' ascii format
data_chunk = (a_data.replace('=\n','').strip()) #Splitting characters removed
data_list = data_chunk.split('\n') #List containing lines in chunk
#print(data_list,'\n')
#time.sleep(1)
for j in range(len(data_list)): #iterate through data_list to get each item
    i += 1
    line_of_data = data_list[j]
    print(line_of_data)
Here is the code starting just above "print data" in Arohi's code.

The best solution I found regarding this, and I tried it on a 330 MB file:
lineno = 500
line_length = 8
with open('catfour.txt', 'r') as file:
    file.seek(lineno * (line_length + 2))
    print(file.readline(), end='')
Here line_length is the number of characters in a single line. For example, "abcd" has line length 4.
I added 2 to the line length to skip past the line terminator and land on the first character of the next line. This only works if every line has exactly the same length, and the 2 assumes a two-character terminator such as '\r\n'; with a bare '\n', add 1 instead.
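A sketch of the same fixed-width trick in binary mode, where seek offsets are plain byte counts. The function name, the newline_length parameter and the generated demo file are assumptions made for illustration; the approach only applies when every line really has the same number of bytes.

```python
import os
import tempfile

def read_fixed_width_line(path, lineno, line_length, newline_length=1):
    """Return line `lineno` (0-based) from a file in which every line holds
    exactly `line_length` bytes followed by a fixed-size terminator."""
    with open(path, "rb") as f:
        f.seek(lineno * (line_length + newline_length))
        return f.readline().rstrip(b"\r\n")

# Demo: 1000 lines of 8 digits each, terminated by a single "\n"
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"".join(b"%08d\n" % n for n in range(1000)))
    demo_path = tmp.name
line = read_fixed_width_line(demo_path, lineno=500, line_length=8)
os.remove(demo_path)
print(line)
```

For files written with '\r\n' terminators, pass newline_length=2.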

I realise this has been answered quite some time ago, but here is a way of doing it in parallel without killing your memory overhead (which would be the case if you tried to fire each line into the pool). Obviously swap the readJSON_line2 function out for something sensible; it's just to illustrate the point here!
Speedup will depend on file size and what you are doing with each line, but in the worst case (a small file that is just read with the JSON reader) I'm seeing similar performance to the single-threaded (ST) version with the settings below.
Hopefully useful to someone out there:
def readJSON_line2(linesIn):
    #Function for reading a chunk of json lines
    '''
    Note, this function is nonsensical. A user would never use the approach suggested
    for reading in a JSON file,
    its role is to evaluate the MT approach for full line by line processing to both
    increase speed and reduce memory overhead
    '''
    import json
    linesRtn = []
    for lineIn in linesIn:
        if lineIn.strip(): #Skip blank lines, which json.loads cannot parse
            lineRtn = json.loads(lineIn)
        else:
            lineRtn = ""
        linesRtn.append(lineRtn)
    return linesRtn
# -------------------------------------------------------------------
if __name__ == "__main__":
    import multiprocessing as mp
    path1 = "C:\\user\\Documents\\"
    file1 = "someBigJson.json"
    nCPUs = mp.cpu_count() # Number of worker processes (was undefined in the original snippet)
    pool = mp.Pool(nCPUs) # Worker pool (was undefined in the original snippet)
    nBuffer = 20*nCPUs # How many chunks are queued up (so cpus aren't waiting on processes spawning)
    nChunk = 1000 # How many lines are in each chunk
    #Both of the above will require balancing speed against memory overhead
    iJob = 0 #Tracker for SMP jobs submitted into pool
    iiJob = 0 #Tracker for SMP jobs extracted back out of pool
    jobs = [] #SMP job holder
    MTres3 = [] #Final result holder
    chunk = []
    iBuffer = 0 # Buffer line count
    with open(path1+file1) as f:
        for line in f:
            #Send to the chunk
            if len(chunk) < nChunk:
                chunk.append(line)
            else:
                #Chunk full
                #Don't forget to add the current line to chunk
                chunk.append(line)
                #Then add the chunk to the buffer (submit to SMP pool)
                jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
                iJob +=1
                iBuffer +=1
                #Clear the chunk for the next batch of entries
                chunk = []
            #Buffer is full, any more chunks submitted would cause undue memory overhead
            #(Partially) empty the buffer
            if iBuffer >= nBuffer:
                temp1 = jobs[iiJob].get()
                for rtnLine1 in temp1:
                    MTres3.append(rtnLine1)
                iBuffer -=1
                iiJob+=1
    #Submit the last chunk if it exists (as it would not have been submitted to SMP buffer)
    if chunk:
        jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
        iJob +=1
        iBuffer +=1
    #And gather up the last of the buffer, including the final chunk
    while iiJob < iJob:
        temp1 = jobs[iiJob].get()
        for rtnLine1 in temp1:
            MTres3.append(rtnLine1)
        iiJob+=1
    #Cleanup
    del chunk, jobs, temp1
    pool.close()

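This might be useful when you want to work in parallel and read only chunks of data, while keeping each chunk aligned to line boundaries.
def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # extend the chunk one character at a time until it ends with a newline
        while data[-1:] != '\n':
            extra = fileObj.read(1)
            if not extra: # end of file without a trailing newline
                break
            data += extra
        yield data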
