How to avoid file corruption in Python?

How to avoid file corruption in Python? - python

I have a pretty basic question, perhaps I didn't know the right keywords as I couldn't find a previous answer. I use Python scripts control and gather information for a smarthome environment. I mostly use text files to store and update information within and between the scripts. However, I frequently run into this one issue whenever the server crashes or loses power: The file contents tend to corrupt or vanish while the crash happens.
To write file content, I usually use a structure like this:
try:
with open(savefile, "r") as file:
lines = file.readlines()
except:
lines = []
pass
lines.append(str(time.time()) + ";" + str(value) + "\n")
if len(lines) > MAX_READINGS:
lines = lines[-MAX_READINGS:]
with open(savefile, "w") as file:
file.writelines(lines)
In case of partial corruption such as blank lines between the data points, I often use a line-by-line loop that only qualifies lines with the correct structure (such as a timestamp in the beginning in the example-like cases). However, sometimes a file gets corrupted to the point it only contains spaces or is empty, getting useless for the scrips depending on the data.
The filesystem's integrity remains intact in crashes, so it's probably not a lower level problem. But what's the suggested workaround to minimize the corruption risk?
Should I use the "a" mode to append a new line and have another way to deal with the file lengths (the MAX_READINGS), or should I make a temporary copy which I'd then use to overwrite the original after the writing is done? Or might there be an external library providing the right functionality?

Related

Approaches to loading JSON from file to dict using Python?

Using Python I'm loading JSON from a text file and converting it into a dictionary. I thought of two approaches and wanted to know which would be better.
Originally I open the text file, load the JSON, and then close text file.
import json
// Open file. Load as JSON.
data_file = open(file="fighter_data.txt", mode="r")
fighter_match_data = json.load(data_file)
data_file.close()
Could I instead do the following instead?
import json
// Open file. Load as JSON.
fighter_match_data = json.load(open(file="fighter_data.txt", mode="r"))
Would I still need to close the file? If so, how? If not, does Python close the file automatically?

Personally wouldn't do either. Best practice for opening files generally is to use with.
with open(file="fighter_data.txt", mode="r") as data_file:
fighter_match_data = json.load(data_file)
That way it automatically closes when you're out of the with statement. It's shorter than the first, and if it throws an error (say, there's an error parsing the json), it'll still close it.
Regarding your actual question, on needing to close the file in your one liner.
From what I understand about file handling and garbage collection, if you're using CPython, since the file isn't referenced anymore it "should" be closed straight away by the garbage collector. However, relying on garbage collection to do your work for you is never the nicest way of writing code. (See the answers to open read and close a file in 1 line of code for information as to why).

Your code as under is valid:
fighter_match_data = json.load(open(file="fighter_data.txt", mode="r"))
Consider this part:
open(file="fighter_data.txt", mode="r") . #1
v/s
data_file = open(file="fighter_data.txt", mode="r") . #2
In case of #2, in case you do not explicitly close the file, the file will automatically be closed when the variable ceases to exist[In better words, no reference exists to that variable] (when you move out of the function).
In case of #1, since you never create a variable, the lifespan of that implicit variable created for opening that file ceases to exist on that line itself. And python automatically closes the file after opening it.

What happens if I read a file without closing it afterwards?

I used to read files like this:
f = [i.strip("\n") for i in open("filename.txt")]
which works just fine. I prefer this way because it is cleaner and shorter than traditional file reading code samples available on the web (e.g. f = open(...) , for line in f.readlines() , f.close()).
However, I wonder if there can be any drawback for reading files like this, e.g. since I don't close the file, does Python interpreter handles this itself? Is there anything I should be careful of using this approach?

This is the recommended way:
with open("filename.txt") as f:
lines = [line.strip("\n") for line in f]
The other way may not close the input file for a long time. This may not matter for your application.
The with statement takes care of closing the file for you. In CPython, just letting the file handle object be garbage-collected should close the file for you, but in other flavors of Python (Jython, IronPython, PyPy) you definitely can't count on this. Also, the with statement makes your intentions very clear, and conforms with common practice.

From the docs:
When you’re done with a file, call f.close() to close it and free up any system resources taken up by the open file.
You should always close a file after working with it. Python will not automatically do it for you. If you want a cleaner and shorter way, use a with statement:
with open("filename.txt") as myfile:
lines = [i.strip("\n") for i in myfile]
This has two advantages:
It automatically closes the file after the with block
If an exception is raised, the file is closed regardless.

It might be fine in a limited number of cases, e.g. a temporary test.
Python will only close the file handle after it finishes the execution.
Therefore this approach is a no-go for a proper application.

When we write onto a file using any of the write functions. Python holds everything to write in the file in a buffer and pushes it onto the actual file on the storage device either at the end of the python file or if it encounters a close() function.
So if the file terminates in between then the data is not stored in the file. So I would suggest two options:
use with because as soon as you get out of the block or encounter any exception it closes the file,
with open(filename , file_mode) as file_object:
do the file manipulations........
or you can use the flush() function if you want to force python to write contents of buffer onto storage without closing the file.
file_object.flush()
For Reference: https://lerner.co.il/2015/01/18/dont-use-python-close-files-answer-depends/

How to store many file handles so that we can close it at a later stage

I have written a small program in python where I need to open many files and close it at a later stage, I have stored all the file handles in a list so that I can refer to it later for closing.
In my program I am storing all the file handles (fout) in the list foutList[]
for cnt in range(count):
fileName = "file" + `cnt` + ".txt"
fullFileName = path + fileName
print "opening file " + fullFileName
try:
fout = open(fullFileName,"r")
foutList.append(fout)
except IOError as e:
print "Cannot open file: %s" % e.strerror
break
Some people suggested me that do no store it in a List, but did not give me the reason why. Can anyone explain why it is not recommended to store it in a List and what is the other possible way to do the same ?

I can't think of any reasons why this is really evil, but possible objections to doing this might include:
It's hard to guarantee that every single file handle will be closed when you're done. Using the file handle with a context manager (see the with open(filename) as file_handle: syntax) always guarantees the file handle is closed, even if something goes wrong.
Keeping lots of files open at the same time may be impolite if you're going to have them open for a long time, and another program is trying to access the files.
This said - why do you want to keep a whole bunch of files open for writing? If you're writing intermittently to a bunch of files, a better way to do this is to open the file, write to it, and then close it until you're ready to write again.
All you have to do is open the file in append mode - open(filename,'a'). This lets you write to the end of an existing file without erasing what's already there (like the 'w' mode.)
Edit(1) I slightly misread your question - I thought you wanted to open these files for writing, not reading. Keeping a bunch of files open for reading isn't too bad.
If you have the files open because you want to monitor the files for changes, try using your platform's equivalent of Linux's inotify, which will tell you when a file has changed (without you having to look at it repeatedly.)

If you don't store them at all, they will eventually be garbage collected, which will close them.
If you really want to close them manually, use weak references to hold them, which will not prevent garbage collection: http://docs.python.org/library/weakref.html

Simple way to add text at the beginning of a script (file) in Python

I am a Python beginner and my next project is a program in which you enter the details of your program and then select the file (I'm using Tkinter), and then the program will format the details and write them to the start of the file.
I know that you'd have to 'rewrite' it and that a tmp file is probably in hand. I just want to know simple ways that one could achieve adding text to the beginning of a file.
Thanks.

To add text to the beginning of a file, you can (1) open the file for reading, (2) read the file, (3) open the file for writing and overwrite it with (your text + the original file text).
formatted_text_to_add = 'Sample text'
with open('userfile', 'rb') as filename:
filetext = filename.read()
newfiletext = formatted_text_to_add + '/n' + filetext
with open('userfile', 'wb') as filename:
filename.write(newfiletext)
This requires two I/O operations and I'm tempted to look for a way to do it in one pass. However, prior answers to similar questions suggest that trying to write to the beginning or middle of a file in Python gets complicated quite quickly unless you bite the bullet and overwrite the original file with the new text.

If I understand what you're asking, I believe you're looking for what's called a project skeleton. This link handles it pretty well.

This probably won't solve your exact problem, as you will need to know in advance the exact number of bytes you'll be adding to the beginning of the file.
# Put some text in the file
f = open("tmp.txt", "w")
print >>f, "123456789"
f.close()
# Open the file in read/write mode
f = open("tmp.txt", "r+")
f.seek(0) # reposition the file pointer to the beginning of the file
f.write('abc') # use write to avoid writing new lines
f.close()
When you reposition the file pointer using seek, you can overwrite the bytes that are already stored at that position. You can't, however, "insert" text, pushing existing bytes ahead to make room for new data. When I said you would need to know the exact number of bytes,
I meant you would have to "leave room" for the text at the beginning of the file. Something like:
f = open("tmp.txt", "w")
f.write("\0\0\0456789")
f.close()
# Some time later...
f = open("tmp.txt", "r+")
f.seek(0)
f.write('123')
f.close()
For text files, this can work if you leave a "blank" line of, say, 50 spaces at the beginning of the file. Later, you can go back and overwrite up to 50 bytes (the newline being byte 51)
without overwriting following lines. Of course, you can leave multiple lines at the beginning. The point is that you can't grow or shrink your reserved block of lines to be overwritten. There's nothing special about the newline in a file, other than that it is treated specially by file methods like read and readline for splitting blocks of data into separate strings.
To add one of more lines of text to the beginning of a file, without overwriting what's already present, you'll have to use the "read the old file, write to a new file" solution outlined in other answers.

"for line in file object" method to read files

I'm trying to find out the best way to read/process lines for super large file.
Here I just try
for line in f:
Part of my script is as below:
o=gzip.open(file2,'w')
LIST=[]
f=gzip.open(file1,'r'):
for i,line in enumerate(f):
if i%4!=3:
LIST.append(line)
else:
LIST.append(line)
b1=[ord(x) for x in line]
ave1=(sum(b1)-10)/float(len(line)-1)
if (ave1 < 84):
del LIST[-4:]
output1=o.writelines(LIST)
My file1 is around 10GB; and when I run the script, the memory usage just keeps increasing to be like 15GB without any output. That means the computer is still trying to read the whole file into memory first, right? This really makes no different than using readlines()
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry large files..I'm really confused.
thx
edit:
Every 4 lines is kind of group in my data.
THe purpose is to do some calculations on every 4th line; and based on that calculation, decide if we need to append those 4 lines.So writing lines is my purpose.

The reason the memory keeps inc. even after you use enumerator is because you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously its all sitting in-memory. You need to find a way to not accumulate lines like this. Read, process & move on to next.
One more way you could do is read your file in chunks (in fact reading 1 line at a time can qualify in this criteria, 1chunk == 1line), i.e. read a small part of the file process it then read next chunk etc. I still maintain that this is best way to read files in python large or small.
with open(...) as f:
for line in f:
<do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.

It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
Read the lines you need into memory (the first 3 lines).
On the 4th line, append the line & perform your calculation.
If your calculation is what you're looking for, flush the values in your collection to the file.
Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
o=gzip.open(file2,'w')
f=gzip.open(file1,'r'):
LIST=[]
for i,line in enumerate(f):
if i % 4 != 3:
LIST.append(line)
else:
LIST.append(line)
b1 = [ord(x) for x in line]
ave1 = (sum(b1) - 10) / float(len(line) - 1
# If we've found what we want, save them to the file
if (ave1 >= 84):
o.writelines(LIST)
# Release the values in the list by starting a clean list to work with
LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.

Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST we become longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.

Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see it seems you can easily write the output incrementally. ie. You are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory you want to write the lines from your temporary list as soon as you know these are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)

If you do not use the with statement , you must close the file's handlers:
o.close()
f.close()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to avoid file corruption in Python? - python

Related

Approaches to loading JSON from file to dict using Python?

What happens if I read a file without closing it afterwards?

How to store many file handles so that we can close it at a later stage

Simple way to add text at the beginning of a script (file) in Python

"for line in file object" method to read files

Categories

Resources