Read file in chunks - RAM-usage, reading strings from binary files - python

I'd like to understand the difference in RAM usage between these methods when reading a large file in Python.
Version 1, found here on stackoverflow:
def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open(file, 'rb')
for piece in read_in_chunks(f):
    process_data(piece)
f.close()
Version 2, I used this before I found the code above:
f = open(file, 'rb')
while True:
    piece = f.read(1024)
    process_data(piece)
f.close()
The file is read piece by piece in both versions, and each piece can be processed as it is read. In the second example, piece gets new content on every cycle, so I thought this would do the job without loading the complete file into memory.
But I don't really understand what yield does, and I'm pretty sure I got something wrong here. Could anyone explain that to me?
There is something else that puzzles me, besides the method used:
The content of the piece I read is defined by the chunk-size, 1KB in the examples above. But... what if I need to look for strings in the file? Something like "ThisIsTheStringILikeToFind"?
Depending on where in the file the string occurs, it could be that one piece contains the part "ThisIsTheStr" - and the next piece would contain "ingILikeToFind". Using such a method it's not possible to detect the whole string in any piece.
Is there a way to read a file in chunks - but somehow care about such strings?

yield is the keyword in Python used to create generators. It means that the next time the function is called (or iterated on), execution starts back up at the exact point it left off the last time you called it. The two snippets behave essentially the same; the only difference is that the first one uses a tiny bit more call stack space than the second. However, the first one is far more reusable, so from a program design standpoint, the first one is actually better.
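To make the resumption behaviour concrete, here is a tiny self-contained sketch (the function and values are made up purely for illustration):
def count_up_to(n):
    # Execution pauses at each yield and resumes right after it on the next call
    i = 0
    while i < n:
        yield i
        i += 1

gen = count_up_to(3)
print(next(gen))   # 0
print(next(gen))   # 1 - resumed right after the yield
print(list(gen))   # [2] - whatever is left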
EDIT: Also, one other difference is that the first one will stop reading once all the data has been read, the way it should, but the second one will only stop once either f.read() or process_data() throws an exception. In order to have the second one work properly, you need to modify it like so:
f = open(file, 'rb')
while True:
    piece = f.read(1024)
    if not piece:
        break
    process_data(piece)
f.close()

Starting from Python 3.8 you might also use an assignment expression (the walrus operator):
with open('file.name', 'rb') as file:
    while chunk := file.read(1024):
        process_data(chunk)
The last chunk may be smaller than the chunk size (1024 bytes here).
Since read() returns b"" once the file has been fully read, the while loop terminates.
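Regarding the second part of the question (a search string that may straddle a chunk boundary): one common approach, sketched here rather than taken from any particular library, is to carry over the last len(needle) - 1 bytes of the previous chunk, so a match split across two reads is still found. It reuses the walrus pattern from above, so it also needs Python 3.8+:
needle = b"ThisIsTheStringILikeToFind"
overlap = len(needle) - 1

found = False
with open('file.name', 'rb') as file:
    tail = b""                        # carry-over from the previous chunk
    while chunk := file.read(1024):
        window = tail + chunk         # search across the chunk boundary
        if needle in window:
            found = True
            break
        tail = window[-overlap:]      # keep just enough bytes to catch a split match
print(found)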

I think probably the best and most idiomatic way to do this would be to use the built-in iter() function along with its optional sentinel argument to create and use an iterable, as shown below. Note that the last chunk might be less than the requested chunk size if the file size isn't an exact multiple of it.
from functools import partial

CHUNK_SIZE = 1024
filename = 'testfile.dat'

with open(filename, 'rb') as file:
    for chunk in iter(partial(file.read, CHUNK_SIZE), b''):
        process_data(chunk)
Update: Don't know when it was added, but almost exactly what's above is now shown as an example in the official documentation of the iter() function.

Related

Read N number of bytes from stdin of python and output to a temp file for further processing

I would like to read a fixed number of bytes from stdin of a python script and output it to one temporary file batch by batch for further processing. Therefore, when the first N number of bytes are passed to the temp file, I want it to execute the subsequent scripts and then read the next N bytes from stdin. I am not sure what to iterate over in the top loop before While true. This is an example of what I tried.
import sys

While True:
    data = sys.stdin.read(2330049) # Number of bytes I would like to read in one iteration
    if data == "":
        break
    file1 = open('temp.fil', 'wb') # temp file
    file1.write(data)
    file1.close()
    further_processing on temp.fil (I think this can only be done after file1 is closed)
Two quick suggestions:
You should pretty much never do While True
Use Python 3
Are you trying to read from a file? or from actual standard in? (Like the output of a script piped to this?)
Here is an answer that I think will work for you if you are reading from a file, pieced together from some other answers listed at the bottom:
with open("in-file", "rb") as in_file, open("out-file", "wb") as out_file:
data = in_file.read(2330049)
while byte != "":
out_file.write(data)
If you want to read from actual standard in, I would read all of it in, then split it up by bytes. The only way this won't work is if you are trying to deal with constant streaming data...which I would most definitely not use standard in for.
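For the batch-by-batch reading the question describes, a minimal Python 3 sketch (sys.stdin.buffer gives the binary stream; further_processing stands in for whatever the question runs on the temp file) could look like this:
import sys

N = 2330049                            # bytes per batch, as in the question
while True:
    data = sys.stdin.buffer.read(N)    # binary-safe read in Python 3
    if not data:                       # b"" means stdin is exhausted
        break
    with open('temp.fil', 'wb') as batch_file:
        batch_file.write(data)
    further_processing('temp.fil')     # hypothetical: runs after the file is closed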
The .encode('UTF-8') and .decode('hex') methods might be of use to you also.
Sources: https://stackoverflow.com/a/1035360/957648 & Python, how to read bytes from file and save it?

Python conditional statement based on text file string

Noob question here. I'm scheduling a cron job for a Python script for every 2 hours, but I want the script to stop running after 48 hours, which is not a feature of cron. To work around this, I'm recording the number of executions at the end of the script in a text file using a tally mark x and opening the text file at the beginning of the script to only run if the count is less than n.
However, my script seems to always run regardless of the conditions. Here's an example of what I've tried:
with open("curl-output.txt", "a+") as myfile:
data = myfile.read()
finalrun = "xxxxx"
if data != finalrun:
[CURL CODE]
with open("curl-output.txt", "a") as text_file:
text_file.write("x")
text_file.close()
I think I'm missing something simple here. Please advise if there is a better way of achieving this. Thanks in advance.
The problem with your original code is that you're opening the file in a+ mode, which seems to set the seek position to the end of the file (try print(data) right after you read the file). If you use r instead, it works. (I'm not sure that's how it's supposed to be. This answer states it should write at the end, but read from the beginning. The documentation isn't terribly clear).
Some suggestions: Instead of comparing against the "xxxxx" string, you could just check the length of the data (if len(data) < 5). Or alternatively, as was suggested, use pickle to store a number, which might look like this:
import pickle

try:
    with open("curl-output.txt", "rb") as myfile:
        num = pickle.load(myfile)
except FileNotFoundError:
    num = 0

if num < 5:
    do_curl_stuff()
    num += 1
    with open("curl-output.txt", "wb") as myfile:
        pickle.dump(num, myfile)
Two more things concerning your original code: You're making the first with block bigger than it needs to be. Once you've read the string into data, you don't need the file object anymore, so you can remove one level of indentation from everything except data = myfile.read().
Also, you don't need to close text_file manually. with will do that for you (that's the point).
This sounds more like a job for scheduling with the at command.
See http://www.ibm.com/developerworks/library/l-job-scheduling/ for different job scheduling mechanisms.
The first bug that is immediately obvious to me is that you are appending to the file even if data == finalrun. So when data == finalrun, you don't run curl but you do append another 'x' to the file. On the next run, data will be not equal to finalrun again so it will continue to execute the curl code.
The solution is of course to nest the code that appends to the file under the if statement.
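Putting both fixes together (read from the start of the file, and only append an x when the curl code actually ran), a sketch of the corrected flow might look like this; [CURL CODE] is kept as a placeholder from the question:
try:
    with open("curl-output.txt", "r") as myfile:
        data = myfile.read()
except FileNotFoundError:              # first run: no tally file yet
    data = ""

if data != "xxxxx":
    # [CURL CODE] goes here
    with open("curl-output.txt", "a") as text_file:
        text_file.write("x")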
Well, there is probably a trailing newline (\n) character, which means your file will contain something like xx\n rather than simply xx. That is probably why your condition does not work :)
EDIT
If you type the following at the Python command line:
open('filename.txt', 'r').read() # where filename is the name of your file
you will be able to see whether there is a \n or not.
Try using this condition along with the if clause instead:
if data.count('x') == 24
The data string may contain extraneous data like newline characters. Check repr(data) to see whether it is actually 24 x's.

Index out of range for simple `readlines` operation in Python

Basically I am trying to read the last few lines of a file. I am baffled as to what leads to this error:
nrows = 10
with open(filename, "r") as f:
    for ii in xrange(-nrows, 0, 1):
        last_line = f.readlines()[ii].strip().split(",")
When I tried it in the IDE, it has no problem when I use mere numbers, like this:
with open(filename, "r") as f:
    last_line = f.readlines()[-10].strip().split(",")
Somehow it looks like the xrange generated indices cannot be "used". Is there a way around this?
Thank you very much for your attention and time!
The reason it works in your second example is because you call readlines() only once.
You can read the file only once: after the first readlines() call has consumed all the lines, the file pointer is at EOF, so subsequent calls return an empty list, which makes the [ii] indexing fail.
To read the file more than once you would have to call f.seek(0) after each pass to move the pointer back to the start, but it is much more efficient to read the lines into a variable once.
You may also be interested in file-like objects, and you may consider a slice instead of the xrange loop.
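A minimal sketch of that suggestion, reading the lines once and slicing off the last nrows (the comma splitting follows the question; the filename is a stand-in):
filename = 'testfile.csv'   # stand-in name; use the file from the question
nrows = 10

with open(filename, "r") as f:
    lines = f.readlines()   # read the whole file once

# the last nrows lines (or fewer, if the file is shorter), parsed the same way
last_rows = [line.strip().split(",") for line in lines[-nrows:]]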

how to get tell() to work

I'm trying to open a file and read from the last point read. My files are rather big (20 MB to ~1 GB). After doing some research, it seems that tell() and seek() would be one of the most efficient ways to do this. I've tried the following code:
import csv

opened = open(filename, "rU")
f1 = csv.reader(opened)
k = []
for line in f1:
    k.append(opened.tell())
When I do this, every value in the list is 8272. Does that mean that I cannot use this implementation? Is there something I'm missing? Thanks for your help!
I'm running python 2.7 in Windows 7
Update
After piecing together everything learned here and trial and error I get the following code
import csv
import StringIO

def line_offsets(filename):  # excerpt wrapped in a function so the return makes sense
    opened = open(filename, "rU")
    k = [0]
    where = 1
    for switch in opened:
        where += len(switch) + 1
        f = StringIO.StringIO(switch)
        interesting = csv.reader(f, delimiter=',')
        good_values = interesting.next()
        k.append(where)
    return k
This allows the user to know exactly where in the file to go while still being able to parse it according to its format. I'm not completely sure why the offsets need to be constantly added to (it seems that the newline is not accurately accounted for in len()).
It looks like the csv.reader is reading the file in chunks of 8272 bytes; that's why you see this number returned from opened.tell() many times, until, I guess, you have read all the lines of your file in the 0-8272 byte range. After that you will see 8272*2 a few times; the exact count will depend on the length of the lines in the buffer that was read.
So, basically, in your program tell() doesn't give you the offsets of new CSV lines, as you seem to assume. It only tells you the offset of the end of the file region that has currently been read into the internal buffer used to implement Python's IO functions.
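If the goal is a byte offset for the start of every CSV row, one workaround (a sketch, not the poster's code; written for Python 3, simple encoding assumed, filename is a stand-in) is to open the file in binary mode and sum the raw line lengths yourself, parsing each line separately:
import csv

filename = 'data.csv'        # stand-in for the poster's file
offsets = []                 # byte offset of the start of each row
rows = []
position = 0

with open(filename, 'rb') as raw:      # binary mode: lengths are exact byte counts
    for raw_line in raw:
        offsets.append(position)
        position += len(raw_line)      # includes the newline byte(s)
        decoded = raw_line.decode().rstrip('\r\n')
        rows.append(next(csv.reader([decoded])))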

"for line in file object" method to read files

I'm trying to find out the best way to read/process lines of a super large file.
Here I just try
for line in f:
Part of my script is as below:
import gzip

o = gzip.open(file2, 'w')
LIST = []
f = gzip.open(file1, 'r')
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if (ave1 < 84):
            del LIST[-4:]
output1 = o.writelines(LIST)
My file1 is around 10 GB, and when I run the script the memory usage just keeps increasing to something like 15 GB without any output. That means the computer is still trying to read the whole file into memory first, right? This really makes it no different from using readlines().
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry about large files... I'm really confused.
Thanks
edit:
Every 4 lines form a kind of group in my data.
The purpose is to do some calculations on every 4th line and, based on that calculation, decide whether we need to keep those 4 lines. So writing lines is my purpose.
The reason the memory keeps increasing even after you use enumerate is because you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously it's all sitting in memory. You need to find a way to not accumulate lines like this: read, process, and move on to the next.
One more thing you could do is read your file in chunks (in fact, reading 1 line at a time qualifies under this criterion, with 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that this is the best way to read files in Python, large or small.
with open(...) as f:
    for line in f:
        <do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
Read the lines you need into memory (the first 3 lines).
On the 4th line, append the line & perform your calculation.
If your calculation is what you're looking for, flush the values in your collection to the file.
Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
import gzip

o = gzip.open(file2, 'w')
f = gzip.open(file1, 'r')
LIST = []
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        # If we've found what we want, save them to the file
        if (ave1 >= 84):
            o.writelines(LIST)
        # Release the values in the list by starting a clean list to work with
        LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.
Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST becomes longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.
Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see, it seems you can easily write the output incrementally, i.e. you are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory, you want to write the lines from your temporary list as soon as you know they are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)
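A compact sketch of that idea, grouping the input into blocks of 4 lines so only one block is ever held in memory (Python 3; the file names are stand-ins, the threshold of 84 and the quality formula are taken from the question, and the line count is assumed to be a multiple of 4):
import gzip
from itertools import zip_longest

file1 = 'reads_in.gz'      # stand-in input name
file2 = 'reads_out.gz'     # stand-in output name

def groups_of_four(iterable):
    # Yield successive 4-line blocks from the input (itertools "grouper" recipe)
    args = [iter(iterable)] * 4
    return zip_longest(*args)

with gzip.open(file1, 'rt') as f, gzip.open(file2, 'wt') as o:
    for block in groups_of_four(f):
        fourth = block[3]
        ave1 = (sum(ord(x) for x in fourth) - 10) / float(len(fourth) - 1)
        if ave1 >= 84:                 # keep only blocks that pass the check
            o.writelines(block)        # write immediately; nothing accumulates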
If you do not use the with statement, you must close the file handles:
o.close()
f.close()
