Can problems happen when multiple Python processes create a .pyc file? - python

Suppose I have a large number of Python processes launching at the same time from a common directory.
If a Python source file has been recently modified the interpreter will compile a .pyc file.
If there are multiple processes simultaneously trying to build a .pyc for the same Python source file, can this create a race condition or other issues?
Will Python (or cpython specifically) guarantee concurrency protection when compiling?
I'm aware of methods that could be used to avoid this, I'm only interested in understanding if this use case can be problematic.

Generally no, when CPython write bytecode cache file, it first write to a temporary file, then move to the desired location with os.replace. os.replace uses rename(2) system call underlying, rename() is atomic if the OS/filesystem does not crash in the middle. As a result bytecode file write is atomic.
IMHO what you should worry about is bytecode cache file stale checking. Python check cache file freshness with source file stat.mtime (and file size). The caveat is, resolution of mtime being used by python is one second, thus if one process modified source file while another process is writing cache file in the same second, left the bytecode cache file inconsistent with source and will not get refreshed later. Good news is pep-0552 get accepted, changed to hashed based cache file, which will take care of this case.

Related

Do functions in the "os" module wait to be finished?

With:
import os
for file in files:
os.remove(file)
Will it wait for every file to be removed like a synchronous function in each call to os.remove(), or will it iterate through while calling os.remove()?
Generally, the os module provides wrappers around system calls of the operating system. For example, on Linux the os.remove/os.unlink functions correspond to the unlink system call. These functions wait until the system call has finished.
Whether this means that the high level operation intended by the program has finished depends on the use-case.
For example, unlink merely removes the path pointing to the file content; if there are other paths for the same file (i.e. hardlinks) or processes with a file handle on it, the file content remains. Only when all references are gone is the file content eligible for removal from the filesystem (similar to reference counting). The filesystem itself may arbitrarily delay removal of the content, and distributed filesystems may have additional consistency and synchronisation constraints.
As a rule of thumb, if there are no special requirements then it is fine to consider the os call to be prompt and synchronous. If there are specific requirements, such as file content being completely destroyed, read up on the specific behaviour of the involved components.

Does Python's "append" file write mode only write new bytes, or does it re-write the entire file as well?

Though I would imagine that append mode is "smart" enough to only insert the new bytes being appended, I want to make absolutely sure that Python doesn't handle it by re-writing the entire file along with the new bytes.
I am attempting to keep a running backup of a program log, and it could reach several thousand records in a CSV format.
Python file operations are convenience wrappers over operating system file operations. The operating system either implements this file system operations internally, forwards them to a loadable module (plugin) or an external server (NFS,SMB). Most of the operating systems since very 1971 are capable to perform appending data to the existing file. At least all the ones that claim to be even remotely POSIX compliant.
The POSIX append mode simply opens the file for writing and moves the file pointer to the end of the file. This means that all the write operations will just write past the end of the file.
There might be a few exceptions to that, for example some routine might use low level system calls to move the file pointer backwards. Or the underlying file system might be not POSIX compliant and use some form of object transactional storage like AWS S3. But for any standard scenario I wouldn't worry about such cases.
However since you mentioned backup as your use case you need to be extra careful. Backups are not as easy as they seem on the surface. Things to worry about, various caches that might hold data in memory before if it is written to disk. What will happen if the power goes out just right after you appended new records. Also, what will happen if somebody starts several copies of your program?
And the last thing. Unless you are running on a 1980s 8bit computer a few thousand CSV lines is nothing to the modern hardware. Even if the files are loaded and written back you wouldn't notice any difference

Pythonic way to handle if the file being written is deleted externally

A python process is writing to a file and the file has been deleted/moved by an external process (cron job in my case).
The python process will continue to execute without any errors (expected as it is being written to the buffer rather than the file and will be flushed after f.close()). Yet there won't be any new file created in this case and buffer will be silently discarded(correct me if I'm wrong).
Is there any pythonic way to handle this instead of checking if file exists and create one if not, before every write operation.
There is no "pythonic" way to do this because the question isn't about a specific language. It's an operating system question. So the answer is going to be different for MS Windows than it is for a UNIX like OS such as Linux or macOS. To do this efficiently requires using a facility such as the Linux inotify API. A simpler approach that will work on any UNIX like OS is to open the file then call os.fstat() and remember the st_ino member of the returned object. Then periodically call os.stat() on the path name and compare its st_ino value to the one you saved earlier. If it changes, or the os.stat() call fails, then you know the file name you are writing to is no longer the same file.

What are the dangers of not closing files in Python? [duplicate]

In Python, if you either open a file without calling close(), or close the file but not using try-finally or the "with" statement, is this a problem? Or does it suffice as a coding practice to rely on the Python garbage-collection to close all files? For example, if one does this:
for line in open("filename"):
# ... do stuff ...
... is this a problem because the file can never be closed and an exception could occur that prevents it from being closed? Or will it definitely be closed at the conclusion of the for statement because the file goes out of scope?
In your example the file isn't guaranteed to be closed before the interpreter exits. In current versions of CPython the file will be closed at the end of the for loop because CPython uses reference counting as its primary garbage collection mechanism but that's an implementation detail, not a feature of the language. Other implementations of Python aren't guaranteed to work this way. For example IronPython, PyPy, and Jython don't use reference counting and therefore won't close the file at the end of the loop.
It's bad practice to rely on CPython's garbage collection implementation because it makes your code less portable. You might not have resource leaks if you use CPython, but if you ever switch to a Python implementation which doesn't use reference counting you'll need to go through all your code and make sure all your files are closed properly.
For your example use:
with open("filename") as f:
for line in f:
# ... do stuff ...
Some Pythons will close files automatically when they are no longer referenced, while others will not and it's up to the O/S to close files when the Python interpreter exits.
Even for the Pythons that will close files for you, the timing is not guaranteed: it could be immediately, or it could be seconds/minutes/hours/days later.
So, while you may not experience problems with the Python you are using, it is definitely not good practice to leave your files open. In fact, in cpython 3 you will now get warnings that the system had to close files for you if you didn't do it.
Moral: Clean up after yourself. :)
Although it is quite safe to use such construct in this particular case, there are some caveats for generalising such practice:
run can potentially run out of file descriptors, although unlikely, imagine hunting a bug like that
you may not be able to delete said file on some systems, e.g. win32
if you run anything other than CPython, you don't know when file is closed for you
if you open the file in write or read-write mode, you don't know when data is flushed
The file does get garbage collected, and hence closed. The GC determines when it gets closed, not you. Obviously, this is not a recommended practice because you might hit open file handle limit if you do not close files as soon as you finish using them. What if within that for loop of yours, you open more files and leave them lingering?
Hi It is very important to close your file descriptor in situation when you are going to use it's content in the same python script. I today itself realize after so long hecting debugging. The reason is content will be edited/removed/saved only after you close you file descriptor and changes are affected to file!
So suppose you have situation that you write content to a new file and then without closing fd you are using that file(not fd) in another shell command which reads its content. In this situation you will not get you contents for shell command as expected and if you try to debug you can't find the bug easily. you can also read more in my blog entry http://magnificentzps.blogspot.in/2014/04/importance-of-closing-file-descriptor.html
During the I/O process, data is buffered: this means that it is held in a temporary location before being written to the file.
Python doesn't flush the buffer—that is, write data to the file—until it's sure you're done writing. One way to do this is to close the file.
If you write to a file without closing, the data won't make it to the target file.
Python uses close() method to close the opened file. Once the file is closed, you cannot read/write data in that file again.
If you will try to access the same file again, it will raise ValueError since the file is already closed.
Python automatically closes the file, if the reference object has been assigned to some another file. Closing the file is a standard practice as it reduces the risk of being unwarrantedly modified.
One another way to solve this issue is.... with statement
If you open a file using with statement, a temporary variable gets reserved for use to access the file and it can only be accessed with the indented block. With statement itself calls the close() method after execution of indented code.
Syntax:
with open('file_name.text') as file:
#some code here

Force directory to be created in Python

I am running Python with MPI on a supercomputing cluster. I am getting strange nondeterministic behavior that I think is a result of I/O complications that are not present on the single machines I'm used to working with.
One of the things my code does is to create directories using os.makedirs somewhat frequently. I know also that I generally should not write small amounts of data to the filesystem-- this can end up with the data getting stuck in some buffer and not written for a long time. I suspect this may be happening with my directory creation calls, and then later code tries to write to files inside the directory before it exists. Two questions:
is creating a new directory effectively the same thing as writing a small amount of data?
When forcing data to be written, I use flush and os.fsync. These require a file object. Is there an equivalent to make sure the directory has been created?
Creating a new directory is effectively the same as writing small amount of data. It adds an inode.
The only way mkdir (or os.mkdirs) should fail is if the directory exists - otherwise the directory will always be created. In terms of the data being buffered - it's unlikely that this would happen - even journaled filesystems will sync out pretty regularly.
If you're having non-deterministic behavior, just wrap your directory creation / writing a file into that directory inside a try / except / finally that makes a few efforts? But really - the need for such code hints at something much more sinister and is likely a bigger issue.

Categories

Resources