Large File handling in Python

I am doing some file operations in Python (version 3.5.2).
I have a large file of 4GB, and I am reading it in chunks of, say, 2KB.
I have a doubt:
If any 2KB chunk happens to end in the middle of a line (between 2 newlines), will that line be truncated, or will the half-read line's contents be returned?
-Regards,

Yes, this is a problem. You can see that with a much smaller test:
import io

s = io.BytesIO(b'line\nanother line\nanother\n')
while True:
    buf = s.read(10)
    if not buf:
        break
    print('*** new buffer')
    for line in buf.splitlines():
        print(line.decode())
The output is:
*** new buffer
line
anoth
*** new buffer
er line
an
*** new buffer
other
As you can see, the first buffer has a truncated partial line that finishes in the next buffer, exactly what you were worried about. In fact, this will happen not just occasionally, but _most of the time_.
The solution is to keep around the overflow (after the last line) from the old buffer, and use it as part of the new buffer. You should try to code this up for yourself, to make sure you understand it (remember to print out the leftover overflow at the end of the file).
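If you get stuck, here is one possible minimal sketch of that leftover-carrying approach, reusing the toy buffer from above (the 10-byte chunk size is just for illustration):
import io

s = io.BytesIO(b'line\nanother line\nanother\n')
leftover = b''
while True:
    buf = s.read(10)
    if not buf:
        break
    buf = leftover + buf
    lines = buf.split(b'\n')
    leftover = lines.pop()  # possibly incomplete last line; carry it into the next buffer
    for line in lines:
        print(line.decode())
if leftover:
    print(leftover.decode())  # whatever was left over at end of file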
But the good news is that you rarely need to do this, because Python file objects do it for you:
import io

s = io.BytesIO(b'line\nanother line\nanother\n')
for line in s:
    print(line.decode(), end='')
That's it. You can test this with a real file from open(path, 'rb') in place of BytesIO, it works just as well. Python will read in about a page at a time, and generate lines one by one, automatically handling all the tricky stuff for you. If "about a page" isn't good enough, you can use something more explicit, e.g., passing buffering=2048 to the open function.
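For instance, a minimal sketch with a real file in place of the BytesIO (the file name and buffer size here are just placeholders):
with open('large.txt', 'rb', buffering=2048) as f:  # hypothetical file name
    for line in f:
        print(line.decode(), end='')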
In fact, you can do even better. Open the file in text mode, and Python will still read in about a page at a time, split it into lines, and decode them for you on the fly—and probably a lot more efficiently than anything you would have come up with:
for line in open(path):
    print(line, end='')

Python stops printing to output file [duplicate]

I'm running a test, and found that the file doesn't actually get written until I control-C to abort the program. Can anyone explain why that would happen?
I expected it to write at the same time, so I could read the file in the middle of the process.
import os
from time import sleep

f = open("log.txt", "a+")
i = 0
while True:
    f.write(str(i))
    f.write("\n")
    i += 1
    sleep(0.1)
Writing to disk is slow, so many programs store up writes into large chunks which they write all-at-once. This is called buffering, and Python does it automatically when you open a file.
When you write to the file, you're actually writing to a "buffer" in memory. When it fills up, Python will automatically write it to disk. You can tell it "write everything in the buffer to disk now" with
f.flush()
This isn't quite the whole story, because the operating system will probably buffer writes as well. You can tell it to write the buffer of the file with
os.fsync(f.fileno())
Finally, you can tell Python not to buffer a particular file by passing buffering=0 when you open it (binary mode only in Python 3), or to flush after every line (line buffering) with open("log.txt", "w", 1). Naturally, this will slow down all operations on that file, because writes are slow.
You need to f.close() to flush the file write buffer out to the file. Or, in your case, you might just want to do f.flush(); os.fsync(f.fileno()); so you can keep looping with the opened file handle.
Don't forget to import os.
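Putting those pieces together, here is a minimal sketch of the questioner's loop with an explicit flush and sync on every iteration (slow, but the file on disk is always up to date):
import os
from time import sleep

f = open("log.txt", "a+")
i = 0
while True:
    f.write(str(i))
    f.write("\n")
    f.flush()             # push Python's buffer out to the OS
    os.fsync(f.fileno())  # ask the OS to push its buffer out to disk
    i += 1
    sleep(0.1)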
You have to force the write, so I use the following lines to make sure a file is written:
# Two commands together force the OS to store the file buffer to disc
f.flush()
os.fsync(f.fileno())
You will want to check out file.flush() - although take note that this might not write the data to disk, to quote:
Note:
flush() does not necessarily write the file’s data to disk. Use flush() followed by os.fsync() to ensure this behavior.
Closing the file (file.close()) will also ensure that the data is written - using with will do this implicitly, and is generally a better choice for readability and clarity - not to mention solving other potential problems.
This is a Windows-ism. If you add an explicit .close() when you're done with the file, it'll appear in Explorer at that time. Even just flushing it might be enough (I don't have a Windows box handy to test). But basically f.write does not actually write, it just appends to the write buffer - until the buffer gets flushed you won't see it.
On unix the files will typically show up as a 0-byte file in this situation.
The file handle needs to be flushed:
f.flush()
The file does not get written because the output buffer is not flushed until garbage collection takes effect and flushes the I/O buffer (most likely by calling f.close()).
Alternatively, in your loop, you can call f.flush() followed by os.fsync(f.fileno()), as documented here.
f.flush()
os.fsync(f.fileno())
All that being said, if you ever plan on sharing the data in that file with other portions of your code, I would highly recommend using a StringIO object.
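For reference, a minimal sketch of that in-memory alternative: in Python 3 this is io.StringIO, which behaves like a file object but lives entirely in memory, so there is nothing to flush to disk.
import io

buf = io.StringIO()
buf.write("hello\n")
buf.write("world\n")
print(buf.getvalue())  # the data is available immediately, no flush or fsync needed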

Ignore the rest of the line read after using file.readline(size)

I have got an issue.
I have a Python application that will be deployed in various places, so Mr Nasty will very likely tinker with the app.
So the problem is security related. The app will receive a plain-text file from a remote source. The device has a very limited amount of RAM (Raspberry Pi).
It is entirely possible to feed extremely large input to the script, which would be big trouble.
I want to avoid reading each line of the file "as is" and instead read just the first part of the line, limited to e.g. 44 bytes, and ignore the rest.
So, just for the sake of the case, a very crude sample:
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        lines.append(line)
This works, but if a line is longer than 44 chars, the next read will return the rest of the line, or even multiple 44-byte-long parts of the same line.
To demonstrate:
print(lines)
['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'aaaaaaaaaaaaaaaaaaaaaaaaa \n',
'11111111111111111111111111111111111111111111',
'111111111111111111111111111111111111111\n',
'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb',
'bbbbbbbbbbbbbbb\n',
'22222222222222222222222222222222222222222\n',
'cccccccccccccccccccccccccccccccccccccccccccc',
'cccccccccccccccccccccccccccccccccccccccccccc',
'cccc\n',
'333333333333\n',
'dddddddddddddddddddd\n']
This wouldn't save me from reading the whole content into a variable and potentially causing a neat DoS.
I've thought that maybe using file.next() would jump to the next line.
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        if line != "":
            lines.append(line.strip())
            fh.next()
But this throws an error:
Traceback (most recent call last):
File "./test.py", line 7, in <module>
line = fh.readline(44)
ValueError: Mixing iteration and read methods would lose data
...of which I can't do much about.
I've read up on file.seek(), but that doesn't really have any such capability whatsoever (going by the docs).
While I was writing this post, I actually figured it out myself. It's so simple it's almost embarrassing. But I thought I would finish the post and leave it for others who may have the same issue.
So my solution:
lines = []
with open("path/to/file.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        if not line:
            break
        lines.append(line)
        if '\n' not in line:
            fh.readline()
So the output now looks like this:
print(lines)
['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'11111111111111111111111111111111111111111111',
'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb',
'22222222222222222222222222222222222222222\n',
'cccccccccccccccccccccccccccccccccccccccccccc',
'333333333333\n',
'dddddddddddddddddddd\n']
Which is close enough.
I don't dare to say it's the best or a good solution, but it seems to do the job, and I'm not storing the redundant part of the lines in a variable at all.
But just for the sake of curiosity, I actually have a question.
As above:
fh.readline()
When you call such a method without redirecting its output to a variable or else, where does this store the input, and what's its lifetime (I mean when is it going to be destroyed if it's being stored at all)?
Thank you all for the inputs. I've learned a couple of useful things.
I don't really like the way file.read(n) works, even though most of the solutions rely on it.
Thanks to you guys I've come up with an improved solution of my original one using only file.readline(n):
limit = 10
lineList = []
with open("linesfortest.txt", "rb") as fh:
    while True:
        line = fh.readline(limit)
        if not line:
            break
        if line.strip() != "":
            lineList.append(line.strip())
        while '\n' not in line:
            line = fh.readline(limit)
print(lineList)
If my thinking is correct, the inner while loop will keep reading limited-size chunks of the line until it reads the EOL char, and meanwhile it only ever holds one small chunk in the variable at a time.
And that provides an output:
['"Alright,"',
'"You\'re re',
'"Tell us!"',
'"Alright,"',
'Question .',
'"The Answe',
'"Yes ...!"',
'"Of Life,',
'"Yes ...!"',
'"Yes ...!"',
'"Is ..."',
'"Yes ...!!',
'"Forty-two']
From the content of
"Alright," said the computer and settled into silence again. The two men fidgeted. The tension was unbearable.
"You're really not going to like it," observed Deep Thought.
"Tell us!"
"Alright," said Deep Thought.
Question ..."
"The Answer to the Great
"Yes ...!"
"Of Life, the Universe and Everything ..." said Deep Thought
"Yes ...!" "Is ..." said Deep Thought, and paused.
"Yes ...!"
"Is ..."
"Yes ...!!!...?"
"Forty-two," said Deep Thought, with infinite majesty and calm.
When you just do:
f.readline()
a line is read from the file, and a string is allocated, returned, then discarded.
If you have very large lines, you could run out of memory (in the allocation/reallocation phase) just by calling f.readline() (it happens when some files are corrupt) even if you don't store the value.
Limiting the size of the line works, but if you call f.readline() again, you get the remainder of the line. The trick would be to skip the remaining chars until a line termination char is found. A simple standalone example of how I'd do it:
max_size = 20
with open("test.txt") as f:
    while True:
        l = f.readline(max_size)
        if not l:
            break  # we reached the end of the file
        if l[-1] != '\n':
            # skip the rest of the line
            while True:
                c = f.read(1)
                if not c or c == "\n":  # end of file or end of line
                    break
        print(l.rstrip())
That example reads the start of a line, and if the line has been truncated (when it doesn't end by a line termination, that is), I read the rest of the line, discarding it. Even if the line is very long, it doesn't consume memory. It's just dead slow.
About combining next() and readline(): those are competing mechanisms (manual iteration vs. classical line read) and they mustn't be mixed, because the buffering of one method may be ignored by the other one. But you can mix read() and readline(), or a for loop and next().
Try like this:
'''
$cat test.txt
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
'''
from time import sleep # trust me on this one
lines = []
with open("test.txt", "r") as fh:
    while True:
        line = fh.readline(44)
        print(line.strip())
        if not line:
            #sleep(0.05)
            break
        lines.append(line.strip())
        if not line.endswith("\n"):
            while fh.readline(1) not in ("\n", ""):  # skip to end of line (or EOF)
                pass
print(lines)
Quite simply, it will read 44 characters, and if they don't end in a newline it will read 1 character at a time until it reaches one, to avoid pulling large chunks into memory; only then will it go on to process the next 44 characters and append them to the list.
Don't forget to use line.strip() to avoid getting \n as part of the string when the line is shorter than 44 characters.
I'm going to assume you're asking your original question here, and not your side question about temporary values (which Jean-François Fabre has already answered nicely).
Your existing solution doesn't actually solve your problem.
Let's say your attacker creates a line that's 100 million characters long. So:
You do a fh.readline(44), which reads the first 44 characters.
Then you do a fh.readline() to discard the rest of the line. This has to read the rest of the line into a string to discard it, so it uses up 100MB.
You could handle this by reading one character at a time in a loop until '\n', but there's a better solution: just fh.readline(44) in a loop until '\n'. Or maybe fh.readline(8192) or something—temporarily wasting 8KB (it's effectively the same 8KB being used over and over) isn't going to help your attacker.
For example:
while True:
    line = fh.readline(20)
    if not line:
        break
    lines.append(line.strip())
    while line and not line.endswith('\n'):
        line = fh.readline(8192)
In practice, this isn't going to be that much more efficient. A Python 2.x file object wraps a C stdio FILE, which already has a buffer, and with the default arguments to open, it's a buffer chosen by your platform. Let's say your platform uses 16KB.
So, whether you read(1) or readline(8192), it's actually reading 16KB at a time off disk into some hidden buffer, and just copying 1 or 8192 characters out of that buffer into a Python string.
And, while it obviously takes more time to loop 16384 times and build 16384 tiny strings than to loop twice and build two 8K strings, that time is still probably smaller than the disk I/O time.
So, if you understand the read(1) code better and can debug and maintain it more easily, just do that.
However, there might be a better solution here. If you're on a 64-bit platform, or your largest possible file is under 2GB (or it's acceptable for a file >2GB to raise an error before you even process it), you can mmap the file, then search it as if it were a giant string in memory:
from contextlib import closing
import mmap

lines = []
with open('ready.py') as f:
    with closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
        start = 0
        while True:
            end = m.find('\n', start)
            if end == -1:
                lines.append(m[start:start+44])
                break
            lines.append(m[start:min(start+44, end)])
            start = end + 1
This maps the whole file into virtual memory, but most of that virtual memory is not mapped to physical memory. Your OS will automatically take care of paging it in and out as needed to fit well within your resources. (And if you're worried about "swap hell": swapping out an unmodified page that's already backed by a disk file is essentially instantaneous, so that's not an issue.)
For example, let's say you've got a 1GB file. On a laptop with 16GB of RAM, it'll probably end up with the whole file mapped into 1GB of contiguous memory by the time you reach the end, but that's also probably fine. On a resource-constrained system with 128MB of RAM, it'll start throwing out the least recently used pages, and it'll end up with just the last few pages of the file mapped into memory, which is also fine. The only difference is that, if you then tried to print m[0:100], the laptop would be able to do it instantaneously, while the embedded box would have to reload the first page into memory. Since you're not doing that kind of random access through the file, that doesn't come up.

How to start reading a file from a particular line in the case of a huge text file as I cannot iterate from line one

This is an issue of trying to reach the line to start from and proceed from there in the shortest time possible.
I have a huge text file that I'm reading and performing operations on line after line. I am currently keeping track of the line number that I have parsed, so that in case of any system crash I know how much I'm done with.
How do I restart reading the file from that point if I don't want to start over from the beginning again?
count = 0
all_parsed = os.listdir("urltextdir/")
with open(filename, "r") as readfile:
    for eachurl in readfile:
        if str(count)+".txt" not in all_parsed:
            urltext = getURLText(eachurl)
            with open("urltextdir/"+str(count)+".txt", "w") as writefile:
                writefile.write(urltext)
            result = processUrlText(urltext)
            saveinDB(result)
This is what I'm currently doing, but when it crashes at a million lines, I have to go through all those lines in the file to reach the point I want to start from. My other alternative is to use readlines and load the entire file into memory.
Is there an alternative that I can consider?
Unfortunately line number isn't really a basic position for file objects, and the special seeking/telling functions are ruined by next, which is called in your loop. You can't jump to a line, but you can jump to a byte position. So one way would be:
line = readfile.readline()
while line:
    line = readfile.readline()  # Must use `readline`!
    lastell = readfile.tell()
    print(lastell)  # This is the location of the imaginary cursor in the file after reading the line
    print(line)  # Do with line what you would normally do
print(line)  # Last line skipped by loop
Now you can easily jump back with
readfile.seek(lastell)  # You need to keep the last lastell
You would need to keep saving lastell to a file or printing it so on restart you know which byte you're starting at.
Unfortunately you can't use the written file for this, as any modification to the character count will throw off a position based on it.
Here is one full implementation. Create a file called tell and put 0 inside of it, and then you can run:
with open('tell', 'r+') as tfd:
    with open('abcdefg') as fd:
        fd.seek(int(tfd.readline()))  # Get last position
        line = fd.readline()  # Init loop
        while line:
            print(line.strip(), fd.tell())  # Action on line
            tfd.seek(0)  # Clear and
            tfd.write(str(fd.tell()))  # write new position only if successful
            line = fd.readline()  # Advance loop
        print(line)  # Last line will be skipped by loop
You can check if such a file exists and create it in the program of course.
As @Edwin pointed out in the comments, you may want to fd.flush() and os.fsync(fd.fileno()) (import os if that isn't clear) to make sure that after every write your file contents are actually on disk - this would apply to both write operations you are doing, the tell file being the quicker of the two of course. This may slow things down considerably for you, so if you are satisfied with the synchronicity as is, do not use that, or only flush the tfd. You can also specify the buffer size when calling open so Python automatically flushes faster, as detailed in https://stackoverflow.com/a/3168436/6881240.
If I got it right, you could make a simple log file to store the count in.
But I would still recommend using many files, or storing every line or paragraph in a database like SQL or MongoDB.
I guess it depends on what system your script is running on, and what resources (such as memory) you have available.
But with the popular saying "memory is cheap", you can simply read the file into memory.
As a test, I created a file with 2 million lines, each line 1024 characters long with the following code:
ms = 'a' * 1024
with open('c:\\test\\2G.txt', 'w') as out:
    for _ in range(0, 2000000):
        out.write(ms+'\n')
This resulted in a 2 GB file on disk.
I then read the file into a list in memory, like so:
my_file_as_list = [a for a in open('c:\\test\\2G.txt', 'r').readlines()]
I checked the Python process, and it used a little over 2 GB of memory (on a 32 GB system).
Access to the data was very fast, and can be done with list slicing.
You need to keep track of the index into the list; when your system crashes, you can start from that index again.
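A minimal sketch of that idea; the checkpoint file name and the process_line function are placeholders, not part of the original answer:
my_file_as_list = open('c:\\test\\2G.txt', 'r').readlines()

# Load the last processed index, if a checkpoint exists (hypothetical file name).
try:
    with open('c:\\test\\index.txt') as cp:
        start = int(cp.read())
except (IOError, ValueError):
    start = 0

for i in range(start, len(my_file_as_list)):
    process_line(my_file_as_list[i])  # placeholder for your per-line processing
    with open('c:\\test\\index.txt', 'w') as cp:
        cp.write(str(i + 1))          # remember how far we got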
But more importantly: if your system is "crashing", then you need to find out why it is crashing... surely a couple of million lines of data is not a reason to crash these days...

How does readline() work behind the scenes when reading a text file?

I would like to understand how readline() takes in a single line from a text file. The specific details I would like to know about, with respect to how the compiler interprets the Python language and how this is handled by the CPU, are:
How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
I am a "beginner" (I have about 4 years of "simpler" programming experience), so I wouldn't be able to understand technical details, but feel free to expand if it could help others understand!
Example using the file file.txt:
fake file
with some text
in a few lines
Question 1: How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
When you open a file in python, it creates a file object. File objects act as file descriptors, which means at any one point in time, they point to a specific place in the file. When you first open the file, that pointer is at the beginning of the file. When you call readline(), it moves the pointer forward to the character just after the next newline it reads.
Calling the tell() function of a file object returns the location the file descriptor is currently pointing to.
with open('file.txt', 'r') as fd:
    print fd.tell()
    fd.readline()
    print fd.tell()
# output:
0
10
# Or 11, depending on the line separators in the file
Question 2: Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?
First off, reading a file doesn't really have anything to do with the CPU. It has to do with the operating system and the file system. Both of those determine how files can be read and written to. Barebones explanation of files
For random access in files, you can use the mmap module of python. The Python Module of the Week site has a great tutorial.
Example, jumping to the 2nd line in the example file and reading until the end:
import mmap
import contextlib

with open('file.txt', 'r') as fd:
    with contextlib.closing(mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)) as mm:
        print mm[10:]
# output:
with some text
in a few lines
This is a very broad question and it's unlikely that all details about what the CPU does would fit in an answer. But a high-level answer is possible:
readline reads each line in order. It starts by reading chunks of the file from the beginning. When it encounters a line break, it returns that line. Each successive invocation of readline returns the next line until the last line has been read. Then it returns an empty string.
with open("myfile.txt") as f:
    while True:
        line = f.readline()
        if not line:
            break
        # do something with the line
Readline uses operating system calls under the hood. The file object corresponds to a file descriptor in the OS, and it has a pointer that keeps track of where in the file we are at the moment. The next read will return the next chunk of data from the file from that point on.
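To make that concrete, here is a minimal sketch using the low-level os interface directly, just to show the descriptor and its moving position (normally you let the file object handle this for you):
import os

fd = os.open("myfile.txt", os.O_RDONLY)  # a raw OS file descriptor (just an integer)
chunk = os.read(fd, 64)                  # the OS returns the next 64 bytes of the file...
print(os.lseek(fd, 0, os.SEEK_CUR))      # ...and the position has advanced accordingly
chunk = os.read(fd, 64)                  # the next read continues from that position
os.close(fd)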
You would have to scan through the file first in order to know how many lines there are, and then use some way of starting from the "middle" line. If you mean some arbitrary line other than the first and last lines, you would have to scan the file from the beginning, identifying lines (for example, you could repeatedly call readline, throwing away the result), until you have reached the line you want. There is a ready-made module for this: linecache.
import linecache
linecache.getline("myfile.txt", 5) # we already know we want line 5

writing output for python not functioning

I am attempting to output a new txt file, but it comes up blank. I am doing this:
my_file = open("something.txt","w")
#and then
my_file.write("hello")
Right after this line it just says 5, and then no text shows up in the file.
What am I doing wrong?
The write isn't flushed until you close the file. If I open an interpreter and then enter:
my_file = open('something.txt', 'w')
my_file.write('hello')
and then open the file in a text program, there is no text.
If I then issue:
my_file.close()
Voila! Text!
If you just want to flush once and keep writing, you can do that too:
my_file.flush()
my_file.write('\nhello again') # file still says 'hello'
my_file.flush() # now it says 'hello again' on the next line
By the way, if you happen to read the beautiful, wonderful documentation for file.write, which is only 2 lines long, you would have your answer (emphasis mine):
Write a string to the file. There is no return value. Due to buffering, the string may not actually show up in the file until the flush() or close() method is called.
If you don't want to care about closing file, use with:
with open("something.txt", "w") as f:
    f.write('hello')
Then python will take care of closing the file for you automatically.
As Two-Bit Alchemist pointed out, the file has to be closed. The python file writer uses a buffer (BufferedIOBase I think), meaning it collects a certain number of bytes before writing them to disk in bulk. This is done to save overhead when a lot of write operations are performed on a single file.
Also: When working with files, try using a with-environment to make sure your file is closed after you are done writing/reading:
with open("somefile.txt", "w") as myfile:
    myfile.write("42")

# when you reach this point, i.e. leave the with-environment,
# the file is closed automatically.
The python file writer uses a buffer (BufferedIOBase I think), meaning
it collects a certain number of bytes before writing them to disk in
bulk. This is done to save overhead when a lot of write operations are
performed on a single file. Ref #m00am
Your code is also OK. Just add a statement to close the file, and then it works correctly:
my_file = open("fin.txt","w")
#and then
my_file.write("hello")
my_file.close()
