I am currently trying to remove the majority of lines from a large text file and rewrite the chosen information into another. I have to read the original file line-by-line as the order in which the lines appear is relevant. So far, the best approach I could think of pulled only the relevant lines and rewrote them using something like:
import re

with open('input.txt', 'r') as input_file:
    with open('output.txt', 'w') as output_file:
        # We only have to loop through the large file once
        for line in input_file:
            # Looping through my data many times is OK as it only contains ~100 elements
            for stuff in data:
                # Search the line
                line_data = re.search(r"(match group a)|(match group b)", line)
                # Verify there is indeed a match to avoid raising an exception.
                # I found using try/except was negligibly slower here
                if line_data:
                    if line_data.group(1):
                        output_file.write('\n')
                    elif line_data.group(2) == stuff:
                        output_file.write('stuff')
However, this program still takes ~8 hours to run with a ~1 GB file and ~120,000 matched lines. I believe the bottleneck involves either the regex or the output, as the time taken to complete this script scales linearly with the number of line matches.
I have tried storing the output data first in memory before writing it to the new text file but a quick test showed that it was storing data at roughly the same speed as it was writing it before.
If it helps, I have a Ryzen 5 1500 and 8 GB of 2133 MHz RAM. However, my RAM usage never seems to cap out.
You could move your inner loop to only run when needed. Right now, you're looping over data for every line in the large file, but only using the stuff variable when you match. So just move the for stuff in data: loop to inside the if block that actually uses it.
for line in input_file:
    # Search the line
    line_data = re.search(r"(match group a)|(match group b)", line)
    # Verify there is indeed a match to avoid raising an exception.
    # I found using try/except was negligibly slower here
    if line_data:
        for stuff in data:
            if line_data.group(1):
                output_file.write('\n')
            elif line_data.group(2) == stuff:
                output_file.write('stuff')
You're also re-parsing the regex for every line, which consumes a lot of CPU; compile the pattern once before the loop with re.compile() instead, which will save some cycles.
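A rough sketch of both suggestions combined, keeping the placeholder pattern, the data list and the write calls from the question as they are:

import re

# Compile the pattern once, outside the loop, instead of re-parsing it per line.
pattern = re.compile(r"(match group a)|(match group b)")

with open('input.txt', 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        line_data = pattern.search(line)
        if line_data:
            # The inner loop now only runs for the ~120,000 matching lines.
            for stuff in data:
                if line_data.group(1):
                    output_file.write('\n')
                elif line_data.group(2) == stuff:
                    output_file.write('stuff')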
I'm running SageMath 9.0 on Windows 10.
I've read several similar questions (and answers) on this site. Mainly this one on reading from the 7th line, and this one on optimizing. But I have some specific issues: I need to understand how to optimally read from a specific (possibly very far away) line, and whether I should read line by line, or if reading by block could be "more optimal" in my case.
I have a 12 GB text file, made of around 1 billion small lines, all made of ASCII printable characters. Each line has a constant number of characters. Here are the actual first 5 lines:
J??????????
J???????C??
J???????E??
J??????_A??
J???????F??
...
For context, this file is a list of all non-isomorphic graphs on 11 vertices, encoded in graph6 format. The file has been computed and made available by Brendan McKay on his webpage here.
I need to check every graph for some properties. I could use the generator for G in graphs(11), but this can be very long (a few days at least on my laptop). I want to use the complete database in the file, so that I'm able to stop and start again from a certain point.
My current code reads the file line by line from the start, and does some computation after reading each line:
with open(filename, 'r') as file:
    while True:
        # Get next line from file
        line = file.readline()
        # if line is empty, end of file is reached
        if not line:
            print("End of Database Reached")
            break
        G = Graph()
        from_graph6(G, line.strip())
        run_some_code(G)
In order to be able to stop the code, or save the progress in case of a crash, I was thinking of:
Every million lines read (or so), saving the progress in a specific file
When restarting the code, reading the last saved value, and instead of using line = file.readline(), using the itertools option for line in islice(file, start_line, None)
so that my new code is
from itertools import islice

start_line = load('foo')
count = start_line
save_every_n_lines = 1000000

with open(filename, 'r') as file:
    for line in islice(file, start_line, None):
        G = Graph()
        from_graph6(G, line.strip())
        run_some_code(G)
        count += 1
        if (count % save_every_n_lines) == 0:
            save(count, 'foo')
The code does work, but I would like to understand if I can optimise it. I'm not a big fan of my if statement in my for loop.
Is itertools.islice() the right option here? The documentation states "If start is non-zero, then elements from the iterable are skipped until start is reached". As start could be quite large, and given that I'm working on simple text files, could there be a faster option in order to "jump" directly to the start line?
Knowing that the text file is fixed, could it be more optimal to split the actual file into 100 or 1000 smaller files and read them one by one? This would get rid of the if statement in my for loop.
I also have the option to read blocks of lines in one go instead of line by line, and then work on a list of graphs. Could that be a good option?
Each line has a constant number of characters. So "jumping" might be feasible.
Assuming each line is the same size, you can use a memory-mapped file and read it by index without mucking about with seek and tell. The memory-mapped file emulates a bytearray and you can take record-sized slices from the array for the data you want. If you want to pause processing, you only have to save the current record index in the array and you can start up again with that index later.
This example is on Linux (setting up mmap on Windows is a bit different), but after it's set up, access should be the same.
import os
import mmap

# I think this is the record plus newline
LINE_SZ = 12
RECORD_SZ = LINE_SZ - 1

# generate test file
testdata = "testdata.txt"
with open(testdata, 'wb') as f:
    for i in range(100):
        f.write("R{: 10}\n".format(i).encode('ascii'))

f = open(testdata, 'rb')
data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# the i-th record is
i = 20
record = data[i*LINE_SZ:i*LINE_SZ+RECORD_SZ]
print("record 20", record)

# you can stick it in a function. this is a bit slower, but encapsulated
def get_record(mmapped_file, index):
    return mmapped_file[index*LINE_SZ:index*LINE_SZ+RECORD_SZ]

print("get record 20", get_record(data, 20))

# to enumerate
def enum_records(mmapped_file, start, stop=None, step=1):
    if stop is None:
        stop = mmapped_file.size() // LINE_SZ
    for pos in range(start*LINE_SZ, stop*LINE_SZ, step*LINE_SZ):
        yield mmapped_file[pos:pos+RECORD_SZ]

print("enum 6 to 8", [record for record in enum_records(data, 6, 9)])

del data
f.close()
If the length of the line is constant (in this case it's 12: 11 characters plus the newline character), you might do
def get_line(k, line_len):
    with open('file') as f:
        f.seek(k * line_len)
        return next(f)
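As a hedged sketch, that same seek trick could be plugged into the save/restart logic from the question. Graph, from_graph6, run_some_code, save and load are the Sage names used above; the 12-byte line width is an assumption from the file format, and the binary-mode open is my choice so that byte offsets stay exact:

LINE_LEN = 12  # 11 graph6 characters plus the newline, assumed constant

def process_from(filename, start_line, save_every_n_lines=1000000):
    # Jump straight to the saved position instead of skipping lines one by one.
    count = start_line
    with open(filename, 'rb') as f:
        f.seek(start_line * LINE_LEN)
        for raw in f:
            line = raw.decode('ascii').strip()
            G = Graph()
            from_graph6(G, line)
            run_some_code(G)
            count += 1
            if count % save_every_n_lines == 0:
                save(count, 'foo')  # Sage's save(), as in the question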
Alright,
I have a large (8 GB+) txt file containing legacy data, most likely from a mainframe because it's all fixed fields that must be parsed line by line and character by character. Reading the file line by line works fine on a small sample, but doesn't scale beyond a few hundred MB.
Essentially, I want to read the txt file in batches, say five million lines per batch, and then process each batch line by line.
That's what I wrote in Python, but for some reason the code below ends up in an infinite loop when tested on a smaller file. I am a bit baffled that the break never actually gets triggered and the snapshot gets overwritten all the time. Any idea how to fix that?
# Python 3.x
import pandas as pd

def convert_txt_to_csv(path_to_txt, path_to_save_csv, column_names):
    df = pd.DataFrame(columns=column_names)
    chunksize = 5000  # 5000000 - 5 million batches for the big file
    print("Add rows...")
    with open(path_to_txt, 'r', encoding="ISO-8859-1") as file:
        lines = True
        cnt = 0
        mil = 1
        while lines:
            lines = file.readlines(chunksize)  # This guy should become False if there are no more lines...
            if not lines:
                break  # Double safety, if there are no more lines, escape the loop...
            for line in lines:
                process_line(line.replace('\n', ''), df, cnt)
                cnt += 1
            # save snapshot after each batch
            df.to_csv(path_to_snapshot_csv)
            print("Saved Snapshot: ", mil)
            mil += 1
    print("Process")
    df = process(df)
    print("Safe")
    df.to_csv(path_to_save_csv)
    print("Nr. of data: ", len(df.index))
SOLUTION:
The above code actually works, but the actual bug was that the snapshot line was incorrectly indented and got called after each line instead of after each batch, thus creating the impression that the loop was stuck by re-creating the snapshot forever. Here are a few optimizations I applied in the meantime:
1) For reasonably sized files without batching:
for line in file:  # Don't use readline...
    process_line(line)
2) For speeding up reading files:
Create a ramdisk and copy the input file there.
3) For batching, the chunk parameter in readlines() is a byte-size hint, so, for example, 1500000 translates to 2995 lines read in a row.
With a ramdisk & batching, the processing is actually quite fast now. Thanks for all the valuable input & questions.
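For reference, a small sketch of that batching behaviour: readlines(n) keeps reading whole lines until roughly n bytes have been consumed, so the number of lines per batch varies with the line length. The file name and the hint value here are just examples:

with open("legacy_data.txt", "r", encoding="ISO-8859-1") as f:
    batch_no = 0
    while True:
        lines = f.readlines(1500000)  # ~1.5 MB of complete lines per call
        if not lines:
            break
        batch_no += 1
        print("batch", batch_no, "->", len(lines), "lines")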
I have got an issue.
I have a Python application that will be deployed in various places. So Mr Nasty will very likely tinker with the app.
So the problem is security related. The app will receive a plain-text file from a remote source. The device has a very limited amount of RAM (a Raspberry Pi).
It is very much possible to feed extremely large input to the script, which would be big trouble.
I want to avoid reading each line of the file "as is" and instead read just the first part of the line, limited to e.g. 44 bytes, and ignore the rest.
So just for the sake of the example, a very crude sample:
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        lines.append(line)
This works, but in case a line is longer than 44 chars, the next read will return the rest of the line, or even multiple 44-byte-long parts of the same line.
To demonstrate:
print(lines)
['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'aaaaaaaaaaaaaaaaaaaaaaaaa \n',
'11111111111111111111111111111111111111111111',
'111111111111111111111111111111111111111\n',
'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb',
'bbbbbbbbbbbbbbb\n',
'22222222222222222222222222222222222222222\n',
'cccccccccccccccccccccccccccccccccccccccccccc',
'cccccccccccccccccccccccccccccccccccccccccccc',
'cccc\n',
'333333333333\n',
'dddddddddddddddddddd\n']
This wouldn't save me from reading the whole content into a variable and potentially causing a neat DoS.
I've thought that maybe using file.next() would jump to the next line.
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        if line != "":
            lines.append(line.strip())
        fh.next()
But this throws an error:
Traceback (most recent call last):
  File "./test.py", line 7, in <module>
    line = fh.readline(44)
ValueError: Mixing iteration and read methods would lose data
...which I can't do much about.
I've read up on file.seek(), but that really doesn't have any capability as such whatsoever (going by the docs).
While I was writing this post, I actually figured it out myself. It's so simple it's almost embarrassing. But I thought I would finish the post and leave it for others who may have the same issue.
So my solution:
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        lines.append(line)
        if '\n' not in line:
            fh.readline()
So the output now looks like this:
print(lines)
['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'11111111111111111111111111111111111111111111',
'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb',
'22222222222222222222222222222222222222222\n',
'cccccccccccccccccccccccccccccccccccccccccccc',
'333333333333\n',
'dddddddddddddddddddd\n']
Which is close enough.
I don't dare to say it's the best or a good solution, but it seems to do the job, and I'm not storing the redundant part of the lines in a variable at all.
But just for the sake of curiosity, I actually have a question.
As above:
fh.readline()
When you call such a method without redirecting its output to a variable or anything else, where does it store the input, and what's its lifetime (I mean, when is it going to be destroyed, if it's being stored at all)?
Thank you all for the inputs. I've learned a couple of useful things.
I don't really like the way file.read(n) works, even though most of the solutions rely on it.
Thanks to you guys I've come up with an improved solution of my original one using only file.readline(n):
limit = 10
lineList = []
with open("linesfortest.txt", "rb") as fh:
    while True:
        line = fh.readline(limit)
        if not line:
            break
        if line.strip() != "":
            lineList.append(line.strip())
        while '\n' not in line:
            line = fh.readline(limit)

print(lineList)
If my thinking is correct, the inner while loop will keep reading same-sized chunks of the line until it reads the EOL character, and meanwhile it will only reuse a single size-limited variable over and over.
And that provides an output:
['"Alright,"',
'"You\'re re',
'"Tell us!"',
'"Alright,"',
'Question .',
'"The Answe',
'"Yes ...!"',
'"Of Life,',
'"Yes ...!"',
'"Yes ...!"',
'"Is ..."',
'"Yes ...!!',
'"Forty-two']
From the content of
"Alright," said the computer and settled into silence again. The two men fidgeted. The tension was unbearable.
"You're really not going to like it," observed Deep Thought.
"Tell us!"
"Alright," said Deep Thought.
Question ..."
"The Answer to the Great
"Yes ...!"
"Of Life, the Universe and Everything ..." said Deep Thought
"Yes ...!" "Is ..." said Deep Thought, and paused.
"Yes ...!"
"Is ..."
"Yes ...!!!...?"
"Forty-two," said Deep Thought, with infinite majesty and calm.
When you just do:
f.readline()
a line is read from the file, and a string is allocated, returned, then discarded.
If you have very large lines, you could run out of memory (in the allocation/reallocation phase) just by calling f.readline() (it happens when some files are corrupt) even if you don't store the value.
Limiting the size of the line works, but if you call f.readline() again, you get the remainder of the line. The trick is to skip the remaining chars until a line-termination char is found. A simple standalone example of how I'd do it:
max_size = 20

with open("test.txt") as f:
    while True:
        l = f.readline(max_size)
        if not l:
            break  # we reached the end of the file
        if l[-1] != '\n':
            # skip the rest of the line
            while True:
                c = f.read(1)
                if not c or c == "\n":  # end of file or end of line
                    break
        print(l.rstrip())
That example reads the start of a line, and if the line has been truncated (that is, when it doesn't end with a line terminator), I read the rest of the line one character at a time, discarding it. Even if the line is very long, it doesn't consume memory. It's just dead slow.
About combining next() and readline(): those are competing mechanisms (manual iteration vs. the classical line read) and they mustn't be mixed, because the buffering of one method may be ignored by the other. But you can mix read() and readline(), or a for loop and next().
Try like this:
'''
$ cat test.txt
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
'''
from time import sleep  # trust me on this one

lines = []
with open("test.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        print(line.strip())
        if not line:
            # sleep(0.05)
            break
        lines.append(line.strip())
        if not line.endswith("\n"):
            while fh.readline(1) != "\n":
                pass

print(lines)
Quite simple: it reads 44 characters, and if they don't end in a newline it reads 1 character at a time until it reaches one, to avoid pulling large chunks into memory; only then will it go on to process the next 44 characters and append them to the list.
Don't forget to use line.strip() to avoid getting \n as part of the string when it's shorter than 44 characters.
I'm going to assume you're asking your original question here, and not your side question about temporary values (which Jean-François Fabre has already answered nicely).
Your existing solution doesn't actually solve your problem.
Let's say your attacker creates a line that's 100 million characters long. So:
You do a fh.readline(44), which reads the first 44 characters.
Then you do a fh.readline() to discard the rest of the line. This has to read the rest of the line into a string to discard it, so it uses up 100MB.
You could handle this by reading one character at a time in a loop until '\n', but there's a better solution: just fh.readline(44) in a loop until '\n'. Or maybe fh.readline(8192) or something—temporarily wasting 8KB (it's effectively the same 8KB being used over and over) isn't going to help your attacker.
For example:
while True:
    line = fh.readline(20)
    if not line:
        break
    lines.append(line.strip())
    while line and not line.endswith('\n'):
        line = fh.readline(8192)
In practice, this isn't going to be that much more efficient. A Python 2.x file object wraps a C stdio FILE, which already has a buffer, and with the default arguments to open, it's a buffer chosen by your platform. Let's say your platform uses 16KB.
So, whether you read(1) or readline(8192), it's actually reading 16KB at a time off disk into some hidden buffer, and just copying 1 or 8192 characters out of that buffer into a Python string.
And, while it obviously takes more time to loop 16384 times and build 16384 tiny strings than to loop twice and build two 8K strings, that time is still probably smaller than the disk I/O time.
So, if you understand the read(1) code better and can debug and maintain it more easily, just do that.
However, there might be a better solution here. If you're on a 64-bit platform, or your largest possible file is under 2GB (or it's acceptable for a file >2GB to raise an error before you even process it), you can mmap the file, then search it as if it were a giant string in memory:
from contextlib import closing
import mmap

lines = []
with open('ready.py') as f:
    with closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
        start = 0
        while True:
            end = m.find('\n', start)
            if end == -1:
                lines.append(m[start:start+44])
                break
            lines.append(m[start:min(start+44, end)])
            start = end + 1
This maps the whole file into virtual memory, but most of that virtual memory is not mapped to physical memory. Your OS will automatically take care of paging it in and out as needed to fit well within your resources. (And if you're worried about "swap hell": swapping out an unmodified page that's already backed by a disk file is essentially instantaneous, so that's not an issue.)
For example, let's say you've got a 1GB file. On a laptop with 16GB of RAM, it'll probably end up with the whole file mapped into 1GB of contiguous memory by the time you reach the end, but that's also probably fine. On a resource-constrained system with 128MB of RAM, it'll start throwing out the least recently used pages, and it'll end up with just the last few pages of the file mapped into memory, which is also fine. The only difference is that, if you then tried to print m[0:100], the laptop would be able to do it instantaneously, while the embedded box would have to reload the first page into memory. Since you're not doing that kind of random access through the file, that doesn't come up.
This is an issue of trying to reach the line to start from and proceed from there in the shortest time possible.
I have a huge text file that I'm reading and performing operations on line after line. I am currently keeping track of the line number that I have parsed, so that in case of any system crash I know how far I've got.
How do I restart reading the file from that point if I don't want to start over from the beginning again?
count = 0
all_parsed = os.listdir("urltextdir/")

with open(filename, "r") as readfile:
    for eachurl in readfile:
        if str(count)+".txt" not in all_parsed:
            urltext = getURLText(eachurl)
            with open("urltextdir/"+str(count)+".txt", "w") as writefile:
                writefile.write(urltext)
            result = processUrlText(urltext)
            saveinDB(result)
        count += 1
This is what I'm currently doing, but when it crashes at a million lines, I have to go through all these lines in the file again just to reach the point I want to start from. My other alternative is to use readlines and load the entire file into memory.
Is there an alternative that I can consider?
Unfortunately the line number isn't really a basic position for file objects, and the special seeking/telling functions are ruined by next, which is called in your loop. You can't jump to a line, but you can jump to a byte position. So one way would be:
line = readfile.readline()
while line:
    line = readfile.readline()  # Must use readline!
    lastell = readfile.tell()
    print(lastell)  # This is the location of the imaginary cursor in the file after reading the line
    print(line)     # Do with line what you would normally do
print(line)  # Last line skipped by loop
Now you can easily jump back with
readfile.seek(lastell)  # You need to keep the last lastell
You would need to keep saving lastell to a file or printing it, so that on restart you know which byte you're starting at.
Unfortunately you can't use the written file for this, as any modification to the character amount will ruin a count based on this.
Here is one full implementation. Create a file called tell and put 0 inside of it, and then you can run:
with open('tell', 'r+') as tfd:
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline()))  # Get last position
        line = fd.readline()  # Init loop
        while line:
            print(line.strip(), fd.tell())  # Action on line
            tfd.seek(0)  # Clear and
            tfd.write(str(fd.tell()))  # write new position only if successful
            line = fd.readline()  # Advance loop
        print(line)  # Last line will be skipped by loop
You can check if such a file exists and create it in the program of course.
As @Edwin pointed out in the comments, you may want to call fd.flush() and os.fsync(fd.fileno()) (import os if that isn't clear) to make sure that after every write your file contents are actually on disk. This would apply to both write operations you are doing, the tell file being the quicker of the two, of course. This may slow things down considerably for you, so if you are satisfied with the synchronicity as is, do not use it, or only flush the tfd. You can also specify the buffer size when calling open, so Python automatically flushes faster, as detailed in https://stackoverflow.com/a/3168436/6881240.
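A minimal sketch of that flush/fsync idea applied only to the tell file, mirroring the loop above; the truncate() call is an extra safeguard I'm assuming you want, so a shorter offset doesn't leave stale digits behind:

import os

with open('tell', 'r+') as tfd:
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline()))  # resume from the saved byte offset
        line = fd.readline()
        while line:
            # ... do the actual work on line here ...
            tfd.seek(0)
            tfd.write(str(fd.tell()))
            tfd.truncate()             # drop leftover digits from a longer previous value
            tfd.flush()                # push Python's buffer to the OS
            os.fsync(tfd.fileno())     # ask the OS to put it on disk
            line = fd.readline()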
If I got it right, you could make a simple log file to store the count in.
But I would still recommend using many files, or storing every line or paragraph in a database like SQL or MongoDB.
I guess it depends on what system your script is running on, and what resources (such as memory) you have available.
But with the popular saying "memory is cheap", you can simply read the file into memory.
As a test, I created a file with 2 million lines, each line 1024 characters long with the following code:
ms = 'a' * 1024
with open('c:\\test\\2G.txt', 'w') as out:
    for _ in range(0, 2000000):
        out.write(ms + '\n')
This resulted in a 2 GB file on disk.
I then read the file into a list in memory, like so:
my_file_as_list = [a for a in open('c:\\test\\2G.txt', 'r').readlines()]
I checked the Python process, and it used a little over 2 GB of memory (on a 32 GB system).
Access to the data was very fast, and can be done by list slicing methods.
You need to keep track of the index of the list, so that when your system crashes, you can start from that index again.
But more important... if your system is "crashing" then you need to find out why it is crashing... surely a couple of million lines of data is not a reason to crash anymore these days...
I'm trying to find out the best way to read/process lines for super large file.
Here I just try
for line in f:
Part of my script is as below:
import gzip

o = gzip.open(file2, 'w')
LIST = []
f = gzip.open(file1, 'r')

for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if (ave1 < 84):
            del LIST[-4:]

output1 = o.writelines(LIST)
My file1 is around 10 GB, and when I run the script, the memory usage just keeps increasing to something like 15 GB without any output. That means the computer is still trying to read the whole file into memory first, right? This really makes it no different from using readlines().
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry about large files... I'm really confused.
thx
edit:
Every 4 lines forms a kind of group in my data.
The purpose is to do some calculations on every 4th line and, based on that calculation, decide if we need to append those 4 lines. So writing lines is my purpose.
The reason the memory keeps increasing even though you use enumerate is that you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously it's all sitting in memory. You need to find a way to not accumulate lines like this. Read, process and move on to the next.
One more thing you could do is read your file in chunks (in fact, reading 1 line at a time qualifies under this criterion: 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that this is the best way to read files in Python, large or small.
with open(...) as f:
    for line in f:
        <do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
Read the lines you need into memory (the first 3 lines).
On the 4th line, append the line & perform your calculation.
If your calculation is what you're looking for, flush the values in your collection to the file.
Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
import gzip

o = gzip.open(file2, 'w')
f = gzip.open(file1, 'r')
LIST = []

for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        # If we've found what we want, save them to the file
        if (ave1 >= 84):
            o.writelines(LIST)
        # Release the values in the list by starting a clean list to work with
        LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.
Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST will become longer and longer. All those lines that you store in LIST take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.
Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see, it seems you can easily write the output incrementally, i.e. you are currently using a list to store both valid lines to write to the output and temporary lines you may delete at some point. To be efficient with memory, you want to write the lines from your temporary list as soon as you know they are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)
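A minimal sketch of that pattern, reusing the names and the Python 2-era gzip calls from the question (file1 and file2 are the question's own placeholders); the buffer never holds more than the current group of four lines:

import gzip

buffer = []
with gzip.open(file1, 'r') as f, gzip.open(file2, 'w') as o:
    for i, line in enumerate(f):
        buffer.append(line)
        if i % 4 == 3:                   # the 4th line closes a group
            b1 = [ord(x) for x in line]
            ave1 = (sum(b1) - 10) / float(len(line) - 1)
            if ave1 >= 84:
                o.writelines(buffer)     # keep the group: write it out immediately
            buffer = []                  # either way, release it from memory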
If you do not use the with statement, you must close the file handles:
o.close()
f.close()