I'm using Python 2.7's dumbdbm, but this question also applies to Python 3's dbm.dumb.
The documentation says:
dumbdbm.sync()
Synchronize the on-disk directory and data files. This method is called by the sync() method of Shelve objects.
I've got three questions:
If I don't call sync(), will the disk file get updated?
And does this function only write data to disk, or does it also do the inverse (read data back from disk)?
What if I call close?
One — perhaps the best if not only — way to answer questions like this that aren't specifically addressed in the documentation is to read the source code (when it's available, as it is here).
The dumbdbm.py file should be in your /Python/Lib directory and can also be viewed online in your browser through the Mercurial source code revision control system at:
https://hg.python.org/cpython/file/2.7/Lib/dumbdbm.py
The first thing to notice is the longish comment at the beginning of the private _Database class — which is what a dumbdbm database really is — because it deals with what seems to be the overall theme of your questions:
class _Database(UserDict.DictMixin):
# The on-disk directory and data files can remain in mutually
# inconsistent states for an arbitrarily long time (see comments
# at the end of __setitem__). This is only repaired when _commit()
# gets called. One place _commit() gets called is from __del__(),
# and if that occurs at program shutdown time, module globals may
# already have gotten rebound to None. Since it's crucial that
# _commit() finish successfully, we can't ignore shutdown races
# here, and _commit() must not reference any globals.
In-depth information about specific methods can be found by reading the source code for them. Given that, here's what I think the answers to your questions would be for version 2.7 of Python:
If I don't call sync(), will the disk file get updated?
From the preceding comment, it sounds like it will as long as your program shuts down gracefully.
Beyond that it depends on the methods that have been called. Some may, but only partially. For instance, it looks like __setitem__() does, depending on whether the item is for an entirely new key or an existing one. For the latter case there's a comment at the end of the part that deals with it which says (keeping in mind that _commit() is just another name for sync()):
Note that _index may be out of synch with the directory file now: _setval() and _addval() don't update the directory file. This also means that the on-disk directory and data files are in a mutually inconsistent state, and they'll remain that way until _commit() is called. Note that this is a disaster (for the database) if the program crashes (so that _commit() never gets called).
And does this function only write data to disk, or does it also do the inverse (read data back from disk)?
sync() / _commit() does not appear to load any data back into memory from the disk.
What if I call close?
close() just calls _commit() and then sets all internal data structures to None, preventing any further database operations.
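For reference, here's a minimal usage sketch tying those answers together (the database name example_db is made up; on Python 2.7 you would import dumbdbm and call dumbdbm.open() instead of dbm.dumb.open()):
import dbm.dumb   # Python 2.7: import dumbdbm

db = dbm.dumb.open('example_db', 'c')   # creates example_db.dat / .dir / .bak
db['key'] = 'first value'               # new key: data and directory files both appended
db['key'] = 'a much longer value'       # overwriting can leave the .dir file out of sync
db.sync()                               # _commit(): rewrites the directory file so it matches
db.close()                              # calls _commit() once more, then forbids further use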
In conclusion, for a somewhat humorous take on the meta-subject here, I suggest you read Learn to Read the Source, Luke.
Related
I have first file (data.py):
database = {
'school': 2,
'class': 3
}
My second Python file (app.py):
import data
del data.database['school']
print(data.database)
>>>{'class': 3}
But nothing changed in data.py. Why?
And how can I change it from my app.py?
del data.database['school'] modifies the data in memory, but does not modify the source code.
Modifying source code to manage the persistence of your data is not good practice, IMHO.
You could use a database, a csv file, a json file ...
To elaborate on Gelineau's answer: at runtime, your source code is turned into a machine-usable representation (known as "bytecode") which is loaded into the process memory, then executed. When the del data.database['school'] statement (in its bytecode form) is executed, it only modifies the in-memory data.database object, not (hopefully!) the source code itself. Actually, your source code is not "the program", it's a blueprint for the runtime process.
What you're looking for is known as data persistence (data that "remembers" its last known state between executions of the program). There are many solutions to this problem, ranging from the simple "write it to a text or binary file somewhere and re-read it at startup" to full-blown multi-server database systems. Which solution is appropriate for you depends on your program's needs and constraints, whether you need to handle concurrent access (multiple users / processes editing the data at the same time), etc., so there's really no one-size-fits-all answer. For the simplest use cases (single user, small datasets, etc.), json or csv files written to disk, or a simple binary key:value file format like anydbm or shelve (both in Python's stdlib), can be enough. As soon as things get a bit more complex, SQL databases are most often your best bet (no wonder they are still the industry standard and will remain so for long years).
In all cases, data persistence is not "automagic": you will have to write quite some code to make sure your changes are saved in a timely manner.
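As a concrete illustration of the simplest option above, here's a sketch of the "json file on disk" approach applied to this example (data.json is just an illustrative filename):
import json
import os

DB_PATH = 'data.json'   # made-up location for the persisted data

def load_database():
    if os.path.exists(DB_PATH):
        with open(DB_PATH) as f:
            return json.load(f)
    return {'school': 2, 'class': 3}   # defaults for the very first run

def save_database(database):
    with open(DB_PATH, 'w') as f:
        json.dump(database, f)

database = load_database()
del database['school']
save_database(database)   # the deletion survives the next run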
What you are trying to achieve is basically a file operation.
When you import data, Python just loads the file's contents into memory and gives your other file (app.py) a reference to that in-memory object. So if you modify it in app.py, you are only modifying the instance in RAM, not the actual file stored on your hard drive.
If you want to change the source code of another file (which is not good practice), you can use file operations.
I am running Python with MPI on a supercomputing cluster. I am getting strange nondeterministic behavior that I think is a result of I/O complications that are not present on the single machines I'm used to working with.
One of the things my code does is to create directories using os.makedirs somewhat frequently. I know also that I generally should not write small amounts of data to the filesystem-- this can end up with the data getting stuck in some buffer and not written for a long time. I suspect this may be happening with my directory creation calls, and then later code tries to write to files inside the directory before it exists. Two questions:
is creating a new directory effectively the same thing as writing a small amount of data?
When forcing data to be written, I use flush and os.fsync. These require a file object. Is there an equivalent to make sure the directory has been created?
Creating a new directory is effectively the same as writing a small amount of data. It adds an inode.
The only way mkdir (or os.makedirs) should fail is if the directory already exists - otherwise the directory will always be created. In terms of the data being buffered - it's unlikely that this would happen - even journaled filesystems will sync out pretty regularly.
If you're having non-deterministic behavior, you could wrap your directory creation / writing a file into that directory inside a try / except / finally that makes a few attempts (as sketched below)? But really - the need for such code hints at something much more sinister and is likely a bigger issue.
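Purely to illustrate that suggestion, a sketch of such a wrapper (write_with_retries is a made-up helper, and the attempt count and delay are arbitrary):
import os
import time

def write_with_retries(dirpath, filename, data, attempts=3, delay=0.5):
    for attempt in range(attempts):
        try:
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath)
            with open(os.path.join(dirpath, filename), 'w') as f:
                f.write(data)
            return
        except OSError:
            if attempt == attempts - 1:
                raise                  # give up after the last attempt
            time.sleep(delay)          # give the (possibly networked) filesystem a moment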
I'm writing a Python script that needs to tail -f a logfile.
The operating system is RHEL, running Linux 2.6.18.
The normal approach I believe is to use an infinite loop with sleep, to continually poll the file.
However, since we're on Linux, I'm thinking I can also use something like pyinotify (https://github.com/seb-m/pyinotify) or Watchdog (https://github.com/gorakhargosh/watchdog) instead?
What are the pros/cons of this?
I've heard that with sleep() you can miss events if the file is growing quickly - is that possible? I thought GNU tail uses sleep as well anyhow?
The cleanest solution would be inotify in many ways - this is more or less exactly what it's intended for, after all. If the log file was changing extremely rapidly then you could potentially risk being woken up almost constantly, which wouldn't necessarily be particularly efficient - however, you could always mitigate this by adding a short delay of your own after the inotify filehandle returns an event. In practice I doubt this would be an issue on most systems, but I thought it worth mentioning in case your system is very tight on CPU resources.
I can't see how the sleep() approach would miss file updates except in cases where the file is truncated or rotated (i.e. renamed and another file of the same name created). These are tricky cases to handle however you do things, and you can use tricks like periodically re-opening the file by name to check for rotation. Read the tail man page because it handles many such cases, and they're going to be quite common for log files in particular (log rotation being widely considered to be good practice).
The downside of sleep() is of course that you'd end up batching up your reads with delays in between, and also that you have the overhead of constantly waking up and polling the file even when it's not changing. If you did this, say, once per second, however, the overhead probably isn't noticeable on most systems.
I'd say inotify is the best choice unless you need to stay portable to systems without it, in which case the simple sleep()-based fallback is still quite reasonable.
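For reference, the basic shape of a pyinotify-based tail looks roughly like this (a sketch only; the log path is made up, and the rotation/truncation handling discussed below is left out):
import pyinotify

LOGFILE = '/var/log/myapp.log'   # made-up path

class TailHandler(pyinotify.ProcessEvent):
    def my_init(self):
        self.fd = open(LOGFILE)
        self.fd.seek(0, 2)                  # start at the end, like tail -f

    def process_IN_MODIFY(self, event):
        for line in self.fd.readlines():    # drain whatever was appended
            print(line.rstrip())

wm = pyinotify.WatchManager()
wm.add_watch(LOGFILE, pyinotify.IN_MODIFY)
notifier = pyinotify.Notifier(wm, TailHandler())
notifier.loop()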
EDIT:
I just realised I forgot to mention - an easy way to check for a file being renamed is to perform an os.fstat(fd.fileno()) on your open filehandle and an os.stat() on the filename you opened, and compare the results. If the os.stat() fails then the error will tell you if the file's been deleted, and if not then comparing the st_ino (the inode number) fields will tell you if the file's been deleted and then replaced with a new one of the same name.
Detecting truncation is harder - effectively your read pointer remains at the same offset in the file and reading will return nothing until the file content size gets back to where you were - then the file will read from that point as normal. If you call os.stat() frequently you could check for the file size going backwards - alternatively you could use fd.tell() to record your current position in the file and then perform an explicit seek to the end of the file and call fd.tell() again. If the value is lower, then the file's been truncated under you. This is a safe operation as long as you keep the original file position around because you can always seek back to it after the check.
Alternatively if you're using inotify anyway, you could just watch the parent directory for changes.
Note that files can be truncated to non-zero sizes, but I doubt that's likely to happen to a log file - the common cases will be being deleted and replaced, or truncated to zero. Also, I don't know how you'd detect the case that the file was truncated and then immediately filled back up to beyond your current position, except by remembering the most recent N characters and comparing them, but that's a pretty grotty thing to do. I think inotify will just tell you the file has been modified in that case.
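And for completeness, a rough sketch of the sleep()-based fallback with the st_ino and size checks from above baked in (hand-rolled, not battle-tested; the path is made up):
import os
import time

def tail_f(path, interval=1.0):
    f = open(path)
    f.seek(0, 2)                                   # start at the end, like tail -f
    while True:
        line = f.readline()
        if line:
            yield line
            continue
        time.sleep(interval)
        try:
            st = os.stat(path)
        except OSError:                            # file deleted; wait for it to reappear
            continue
        if st.st_ino != os.fstat(f.fileno()).st_ino:
            f.close()                              # rotated: a new file has this name now
            f = open(path)
        elif st.st_size < f.tell():                # truncated underneath us
            f.seek(0)

for line in tail_f('/var/log/myapp.log'):
    print(line.rstrip())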
The title could probably have been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. The reason is, I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: a dhcp leases file):
If you're on a POSIX system, you can write your data to a temporary file and rename the temporary file onto your target. On POSIX-compliant systems this rename is guaranteed to be an atomic operation; it shouldn't even matter whether the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. The key concept here is that the temporary file protects your data: if the system reboots, the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of a dhcp.leases file this isn't a big performance hit; larger files might prove to be more cumbersome.
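A sketch of that temp-file-plus-rename pattern on a POSIX system (atomic_write is a made-up helper; for full durability you'd also fsync the containing directory, which is omitted here):
import os
import tempfile

def atomic_write(path, data):
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)   # temp file on the same filesystem
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # data is on disk before the rename makes it visible
        os.rename(tmp_path, path)     # atomic on POSIX: readers see the old file or the new one
    except Exception:
        os.unlink(tmp_path)
        raise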
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia, though. But if it's an embedded system and it dies, 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID-compliant database storage written (mostly) in Python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written (which may mean you need to throw away some not-fully-written data). ZODB, for example, implements this by ending a record by writing the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
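As a toy illustration of that record idea (not ZODB's actual format, and using a leading length rather than a trailing one): append each record prefixed with its size, and when reading, discard everything from the first incomplete record onwards. Files are assumed to be opened in binary mode.
import struct

def append_record(f, payload):
    f.write(struct.pack('>I', len(payload)) + payload)   # 4-byte length, then the data
    f.flush()

def read_records(f):
    records = []
    while True:
        header = f.read(4)
        if len(header) < 4:          # missing or partial header: stop here
            break
        (size,) = struct.unpack('>I', header)
        payload = f.read(size)
        if len(payload) < size:      # partially written record: throw it away
            break
        records.append(payload)
    return records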
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to be considered when doing this:
Ensure that your data is written to disk when you close a file. Even if you close it, some of the data may be in OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed with os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
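For instance, a minimal sqlite3 sketch where each with-block is one atomic transaction (the table and file name are made up):
import sqlite3

conn = sqlite3.connect('state.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

with conn:   # commits on success, rolls back if an exception escapes the block
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('school', '2'))
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('class', '3'))

conn.close()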
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".
I'm using the pywin32 extensions to access the win32 API under Python. I'm new at doing Windows programming in Python -- I'm a POSIX guy -- so I may be doing things in a bone-headed manner.
I'm trying to use the win32file.ReadFile function properly, and I'm having some trouble interpreting the possible result codes.
I'm calling the function like this:
result, data = win32file.ReadFile(child_stdout_r, 4096, None)
I'm reading the output from a child process that I launch. I get good data, but I'm concerned that there may be more data in the pipe than 4096 characters. (And I'd rather do this right, instead of just picking an arbitrarily large buffer size.)
In the case where there's more than 4096 characters to read, I would need to run win32file.ReadFile multiple times until I exhaust the pipe. To find out whether I need to run ReadFile multiple times, I need to interpret the result code.
The ActiveState docs say that:
The result is a tuple of (hr, string/PyOVERLAPPEDReadBuffer), where hr may be 0, ERROR_MORE_DATA or ERROR_IO_PENDING.
Since I'm setting the overlapped value to None in the function call, I think I don't need to worry about any PyOVERLAPPEDReadBuffer stuff. (And since I'm getting valid data, I think I'm right.)
I have two problems with the hr result variable:
I can't find the values of the constants ERROR_MORE_DATA or ERROR_IO_PENDING anywhere.
The ActiveState docs seem to imply that 0 is success and the constants (whatever they are) indicate failure. The Microsoft docs state that 0 indicates failure, non-zero indicates success, and you need to run GetLastError to find out more.
What's the correct way to do this?
EDITED TO ADD: I'm not using subprocess because I need to add the child process to a job object I create. The goal is to have all child processes die immediately if the parent process dies. By adding the child process to the job object, the child process will be terminated when the last handle to the job object is closed. The handle, held by the parent, will be closed when the parent exits. All of this, as far as I can tell, precludes me from using subprocess.
For error codes, try winerror.ERROR_MORE_DATA and winerror.ERROR_IO_PENDING
My interpretation of the ActiveState docs is the same as yours. It sounds like the wrapper works slightly differently than the native API.
I haven't actually tried this.
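An untested sketch of how that could look when draining the pipe: keep calling ReadFile until the child closes its end, which pywin32 surfaces as an ERROR_BROKEN_PIPE exception (child_stdout_r is the handle from the question):
import win32file
import winerror
import pywintypes

chunks = []
while True:
    try:
        # hr is 0 on a plain successful read; ERROR_MORE_DATA only shows up for
        # message-mode pipes whose message didn't fit into the buffer
        hr, data = win32file.ReadFile(child_stdout_r, 4096, None)
    except pywintypes.error as e:
        if e.args[0] == winerror.ERROR_BROKEN_PIPE:   # child closed its end of the pipe
            break
        raise
    chunks.append(data)
output = ''.join(chunks)   # b''.join(chunks) on Python 3, where ReadFile returns bytes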
Consider using subprocess to launch the process. It will give you a set of file-like objects that you can use to talk with the other app.
The .terminate() method of the Popen object will allow you to terminate the process if you are running 2.6+.
Note that ReadFile is defined as:
(int, string) = ReadFile(hFile, buffer/bufSize, overlapped)
where:
hFile = PyHANDLE - any Windows handle (can be a file, process, thread...)
buffer/bufSize = PyOVERLAPPEDReadBuffer - which, according to the documentation, automatically allocates the contents of hFile regardless of whether it overlaps or not.
overlapped = None [=PyOVERLAPPED] - you can allocate an additional object to take any extra data beyond the overlapped buffer/bufSize if you wish, but by default this is NULL.
So - you can basically call ReadFile like:
ReadFile(child_stdout_r, 0, None)
and the object you assign it to will contain the full contents of the file handle.