Use case for low-level os.open, os.fdopen, and friends? - python

In Python 3.2 (and other versions), the documentation for os.open states:
This function is intended for low-level I/O. For normal usage, use the built-in function open(), which returns a file object with read() and write() methods (and many more). To wrap a file descriptor in a file object, use fdopen().
And for fdopen():
Return an open file object connected to the file descriptor fd. This is an alias of open() and accepts the same arguments. The only difference is that the first argument of fdopen() must always be an integer.
This comment in a question on the difference between io.open and os.open (the difference itself is entirely clear to me; I always use io.open, never os.open) asks why anyone would choose Python for low-level I/O, but doesn't really get an answer.
My question is very similar to that comment: in Python, what is the use case of low-level I/O through os.open, os.fdopen, os.close, os.read, etc.? I used to think it was needed to daemonise a process, but I'm not so sure anymore. Is there any task that can only be performed using low-level I/O, and not with the higher-level wrappers?

I use it when I need O_CREAT | O_EXCL to atomically create a file, failing if the file already exists. You can't check for the file's existence and then create it if the check found it missing, because that creates a race condition: the file could be created in the interval between your check and your creation.
Briefly looking at the link you provided, I do believe pidfile creation has a race condition.
In Python 3.3, a new 'x' mode was added to open() that seems to do this. I haven't tried it, though.
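For illustration, a minimal sketch of the atomic create-or-fail pattern (the file names here are just placeholders):

import os

# Low-level: os.open with O_CREAT | O_EXCL fails (EEXIST / FileExistsError)
# if the file already exists, so check-and-create becomes one atomic step.
fd = os.open("app.pid", os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
with os.fdopen(fd, "w") as f:
    f.write(str(os.getpid()))

# Python 3.3+: the 'x' mode of the built-in open() gives the same guarantee.
with open("run.lock", "x") as f:
    f.write(str(os.getpid()))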

Major Differences:
Low-level access to files is unbuffered
Low-level access is not portable
Low-level access allows more fine-grained control, e.g. whether or not to block on read
Use cases for low-level I/O:
The file is a block device
The file is a socket
The file is a tty
...
In all these cases you might wish to have that finer-grained control (over buffering and blocking behavior).
You will probably never need the low-level functions for regular files. Most of the time the use case is device-driver-style work, which would be better done in C, but I can see the use case for Python as well, e.g. for fast prototyping of device drivers.
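As a hedged illustration of that fine-grained control: a non-blocking read from a FIFO on a POSIX system might look roughly like this (the FIFO path is a placeholder and must already exist):

import errno
import os

fd = os.open("/tmp/my_fifo", os.O_RDONLY | os.O_NONBLOCK)
try:
    data = os.read(fd, 4096)     # returns immediately instead of blocking
except OSError as e:
    if e.errno != errno.EAGAIN:  # EAGAIN just means "no data available yet"
        raise
    data = b""
finally:
    os.close(fd)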

Related

What is the Python equivalent of FreeRTOS Streambuffer?

I need a structure equivalent to FreeRTOS's Streambuffer, but in Python. Ideally I could write something like:
myBuf=StreamBuffer(maxSize=2048)
and then be able to call:
read_bytes=myBuf.read(amount_to_read, timeout=300)
myBuf.write(b"fooo", timeout=20)
such that the operations block whenever I specify a non-zero timeout, until there are enough bytes to read or enough space to write, and return immediately with an error code or raise an exception if the timeout is zero (i.e. non-blocking mode) or exceeded.
I tried experimenting with io.BytesIO, but I don't know how to use it for this. There does not seem to be a way to specify a limit on its size or to make it blocking.
Does anything standard exist, or does anyone know of any useful packages that provide this functionality in a simple/fast way?
Update:
From Here it seems I can use io.BufferedRWPair, but there are no examples of things such as setting timeouts or configuring blocking/non-blocking behavior. I also don't understand the warning that implies one should use io.BufferedRandom instead.
Update2:
I realize Python provides queue.Queue, which is pretty much analogous to FreeRTOS queues, but is there an equivalent for streams, so that one can just keep pushing bytes in from one end and pop them off from the other?
Update 3: It seems os.pipe or multiprocessing.Pipe is closest to what I'm looking for.
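For what it's worth, a rough sketch of such a buffer (not a standard API; the StreamBuffer name and maxSize parameter just mirror the question) can be built on threading.Condition:

import threading

class StreamBuffer:
    def __init__(self, maxSize=2048):
        self._buf = bytearray()
        self._max = maxSize
        self._cond = threading.Condition()

    def write(self, data, timeout=None):
        with self._cond:
            # Block until there is room for all of `data`, or give up.
            ok = self._cond.wait_for(
                lambda: len(self._buf) + len(data) <= self._max, timeout)
            if not ok:
                raise TimeoutError("buffer full")
            self._buf += data
            self._cond.notify_all()

    def read(self, amount, timeout=None):
        with self._cond:
            # Block until enough bytes are available, or give up.
            ok = self._cond.wait_for(lambda: len(self._buf) >= amount, timeout)
            if not ok:
                raise TimeoutError("not enough data")
            data = bytes(self._buf[:amount])
            del self._buf[:amount]
            self._cond.notify_all()
            return data

Passing timeout=0 makes the wait return essentially immediately, which gives the non-blocking behavior described above.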

Python open() vs. .close()

Regarding syntax in Python
Why do we use open("file") to open but not "file".close() to close it?
Why isn't it "file".open() or inversely close("file")?
It's because open() is a function, and .close() is an object method. "file".open() doesn't make sense, because you're implying that the open() function is actually a class or instance method of the string "file". Not all strings are valid files or devices to be opened, so how the interpreter is supposed to handle "not a file-like device".open() would be ambiguous. We don't use "file".close() for the same reason.
close("file") would require a lookup of the file name, then another lookup to see if there are file handles, owned by the current process, attached to that file. That would be very inefficient and probably has hidden pitfalls that would make it unreliable (for example, what if it's not a file, but a TTY device instead?). It's much faster and simpler to just keep a reference to the opened file or device and close the file through that reference (also called a handle).
Many languages take this approach:
f = open("file") # open a file and return a file object or handle
# stuff...
close(f) # close the file, using the file handle or object as a reference
This looks similar to your close("file") construct, but don't be fooled: it's closing the file through a direct reference to it, not the file name as stored in a string.
The Python developers have chosen to do the same thing, but it looks different because they have implemented it with an object-oriented approach instead. Part of the reason for this is that Python file objects have a lot of methods available to them, such as read(), flush(), seek(), etc. If we used close(f), then we would have to either change all of the rest of the file object methods to functions, or let it be one random function that behaves differently from the rest for no good reason.
TL;DR
The design of open() and file.close() is consistent with OOP principles and good file-reference practices. open() is a factory-like function that creates objects referencing files or other devices. Once the object is created, all other operations on that object are done through class or instance methods.
Normally you shouldn't call .close() explicitly but use open(file) as a context manager, so the file handle is also closed if an exception happens. End of problem :-)
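A quick illustration (the filename is just a placeholder):

with open("data.txt") as f:
    contents = f.read()
# f is closed here, even if read() raised an exception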
But to actually answer your question: I assume the reason is that open() supports many options and the returned class differs depending on those options (see also the io module). So it would simply be much more complicated for the end user to remember which class they want and then call the right class's own open(). Note that you can also pass an "integer file descriptor of the file to be wrapped" to open(); that would mean that, besides a str.open() method, you would also get an int.open(). This would be really bad OO design and confusing as well. I don't care to guess what kind of questions would be asked on StackOverflow about that ("door".open(), (1).open())...
However I must admit that there is a pathlib.Path.open function. But if you have a Path it isn't ambiguous anymore.
As to a close() function: Each instance will have a close() method already and there are no differences between the different classes, so why create an additional function? There is simply no advantage.
Only slightly less new than you, but I'll give this one a go. Opening and closing are pretty different actions in a language like Python. When you open the file, what you are really doing is creating an object to work with in your application that represents the file: the open() function informs the OS that the file has been opened and creates an object that Python can use to read from and write to it. When it comes time to close the file, your app needs to tell the OS that it is done with the file and dispose of the object that represented it in memory, and the easiest way to do that is with a method on the object itself. Also note that a syntax like "file".open would require the string type to include methods for opening files, which would be a very strange design and require a lot of extensions to the string type for anything else you wanted to implement with that syntax. close(file) would make a bit more sense, but it would still be a clunky way of releasing the object and letting the OS know the file is no longer open, and you would be passing a variable file representing the object created when you opened the file rather than a string pointing to the file's path.
In addition to what has been said, I quote the Python changelog entry that removed the built-in file type. It (briefly) explains why the class-constructor approach using the file type (available in Python 2) was removed in Python 3:
Removed the file type. Use open(). There are now several different kinds of streams that open can return in the io module.
Basically, while file("filename") would create an instance of file, open("filename") can return instances of different stream classes, depending on the mode.
https://docs.python.org/3.4/whatsnew/3.0.html#builtins
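A quick way to see this (the file name is a placeholder and must exist) is to check the type returned for different modes:

f1 = open("data.bin", "r")                # io.TextIOWrapper
f2 = open("data.bin", "rb")               # io.BufferedReader
f3 = open("data.bin", "rb", buffering=0)  # io.FileIO (raw, unbuffered)
print(type(f1), type(f2), type(f3))
f1.close(); f2.close(); f3.close()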

Is there an easy way to use a python tempfile in a shelve (and make sure it cleans itself up)?

Basically, I want an infinite-size (more accurately, hard-drive-bound rather than memory-bound) dict in a Python program I'm writing. It seems like the tempfile and shelve modules are naturally suited for this; however, I can't see how to use them together in a safe manner. I want the tempfile to be deleted when the shelve is GCed (or at least guaranteed to be deleted once the shelve is out of use, regardless of when), but the only solution I can come up with involves using tempfile.TemporaryFile() to open a file handle, getting the filename from the handle, using this filename to open a shelve, keeping the reference to the file handle to prevent it from being GCed (and the file deleted), and then putting a wrapper on the shelve that stores this reference. Anyone have a better solution than this convoluted mess?
Restrictions: Can only use the standard python library and must be fully cross platform.
I would rather inherit from shelve.Shelf, and override the close method (*) to unlink the files. Notice that, depending on the specific dbm module being used, you may have more than one file that contains the shelf. One solution could be to create a temporary directory, rather than a temporary file, and remove anything in the directory when done. The other solution would be to bind to a specific dbm module (say, bsddb, or dumbdbm), and remove specifically those files that these libraries create.
(*) Notice that the close method of a shelf is also called when the shelf is garbage collected. The only case where you could end up with leftover files is when the interpreter crashes or gets killed.
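A minimal sketch of the temporary-directory variant (not production code; the TemporaryShelf name is made up here):

import os
import shelve
import shutil
import tempfile

class TemporaryShelf:
    def __init__(self):
        self._dir = tempfile.mkdtemp()
        # The dbm backend may create one or several files under this directory.
        self._shelf = shelve.open(os.path.join(self._dir, "data"))

    def __getitem__(self, key):
        return self._shelf[key]

    def __setitem__(self, key, value):
        self._shelf[key] = value

    def close(self):
        self._shelf.close()
        shutil.rmtree(self._dir, ignore_errors=True)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

Used as a context manager (with TemporaryShelf() as db: db["answer"] = 42), the temporary files are removed when the block exits; only a crash or kill of the interpreter would leave them behind.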

Lightweight crash recovery for Python

What would be the best way to handle lightweight crash recovery for my program?
I have a Python program that runs a number of test cases and the results are stored in a dictionary which serves as a cache. If I could save (and then restore) each item that is added to the dictionary, I could simply run the program again and the caching would provide suitable crash recovery.
You may assume that the keys and values in the dictionary are easily convertible to strings, i.e. using either str or the pickle module.
I want this to be completely cross platform - well at least as cross platform as Python is
I don't want to simply write out each value to a file and load it back in, because my program might crash while I am writing the file
UPDATE: This is intended to be a lightweight module so a DBMS is out of the question.
UPDATE: Alex is correct in that I don't actually need to protect against crashes while writing out, but there are circumstances where I would like to be able to manually terminate it in a recoverable state.
UPDATE: Added a highly limited solution using standard input below
There's no good way to guard against "your program crashing while writing a checkpoint to a file", but why should you worry so much about that?! What ELSE is your program doing at that time BESIDES "saving checkpoint to a file", that could easily cause it to crash?!
It's hard to beat pickle (or cPickle) for portability of serialization in Python, but that's just about "turning your keys and values into strings". For saving key-value pairs (once stringified), few approaches are safer than just appending to a file (don't pickle to files if your crashes are far, far more frequent than normal, as you suggest they are).
If your environment is incredibly crash-prone for whatever reason (very cheap HW?-), just make sure you close the file (and flush if the OS is also crash-prone;-), then reopen it for append. That way, the worst that can happen is that the very latest append will be incomplete (due to a crash in the middle of things) -- then you just catch the exception raised by unpickling that incomplete record and redo only the things that weren't saved (because they weren't completed due to a crash, OR because they were completed but not fully saved due to a crash; it comes to much the same thing in the end).
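A rough sketch of that append-then-recover pattern (assuming keys and values are picklable; the file name is a placeholder):

import pickle

def append_checkpoint(path, key, value):
    with open(path, "ab") as f:
        pickle.dump((key, value), f)   # one self-contained record per result
        f.flush()

def load_checkpoints(path):
    cache = {}
    try:
        with open(path, "rb") as f:
            while True:
                try:
                    key, value = pickle.load(f)
                except EOFError:
                    break              # clean end of file
                except Exception:
                    break              # truncated last record from a crash
                cache[key] = value
    except FileNotFoundError:
        pass                           # first run, nothing to recover
    return cache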
If you have the option of checkpointing to a database engine (instead of just doing so to files), consider it seriously! The DB engine will keep transaction logs and ensure ACID properties, making your application-side programming much easier IF you can count on that!-)
The pickle module supports serializing objects to a file (and loading from file):
http://docs.python.org/library/pickle.html
One possibility would be to create a number of smaller files ... each representing a subset of the state that you're trying to preserve and each with a checksum or tag indicating that it's complete as the last line/datum of the file (just before the file is closed).
If the checksum/tag is good then the rest of the data can be considered valid ... though the program would then have to find all of these files, open and read all of them, and use the metadata you've provided (in their headers or their names?) to determine which ones constitute the most recent cohesive state representation (or checkpoint) from which you can continue processing.
Without knowing more about the nature of the data that you're working with it's impossible to be more specific.
You can use files, of course, or you could use a DBMS system just about as easily. Any decent DBMS (PostgreSQL, MySQL if you're using the proper storage back-ends) can give you ACID guarantees and transactional support. So the data you read back should always be consistent with the constraints that you put in your schema and/or with the transactions (BEGIN, COMMIT, ROLLBACK) that you processed.
A possible advantage of posting your serialized data to a DBMS is that you can host the DBMS on a separate system (which is unlikely to suffer the same instabilities as your test host at the same times).
Pickle/cPickle have problems.
I use the JSON module to serialize objects out. I like it because not only does it work on any OS, but it will work fine in other programming languages, too; many other languages and platforms have readily-accessible JSON deserialization support, which makes it easy to use the same objects in different programs.
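For example, the same append-only checkpointing idea with JSON lines (a sketch, assuming keys and values are JSON-serializable; the file name is a placeholder):

import json

def append_result(path, key, value):
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")

def load_results(path):
    cache = {}
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except ValueError:
                    break              # truncated final line from a crash
                cache[record["key"]] = record["value"]
    except FileNotFoundError:
        pass
    return cache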
Solution with severe restrictions
If I don't worry about it crashing while writing out and I only want to allow manual termination, I can use standard input to control this. Unfortunately, this can only terminate the program when a control point is reached. This could be solved by creating a new thread to read standard input. This thread could use a global lock to check whether the main thread is inside a critical section (writing to a file) and terminate the program if it is not (a sketch follows after the list below).
Downsides:
This is reasonably complex
It adds an extra thread
It stops me using standard input for anything else
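A minimal sketch of that watcher thread (the write_lock name and the "quit" command are just illustrative):

import os
import sys
import threading

write_lock = threading.Lock()

def watch_stdin():
    for line in sys.stdin:
        if line.strip() == "quit":
            with write_lock:   # wait until no checkpoint write is in progress
                os._exit(0)    # terminate while in a recoverable state

threading.Thread(target=watch_stdin, daemon=True).start()

# The main loop would wrap its checkpoint writes in the same lock:
# with write_lock:
#     append_checkpoint("results.pkl", key, value)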

How do I dump an entire Python process for later debugging inspection?

I have a Python application in a strange state. I don't want to do live debugging of the process. Can I dump it to a file and examine its state later? I know I've restored corefiles of C programs in gdb later, but I don't know how to examine a Python application in a useful way from gdb.
(This is a variation on my question about debugging memleaks in a production system.)
There is no built-in way other than aborting (with os.abort(), causing the core dump if resource limits allow it) -- although you can certainly build your own 'dump' function that dumps relevant information about the data you care about. There are no ready-made tools for it.
As for handling the corefile of a Python process, the Python source has a gdbinit file that contains useful macros. It's still a lot more painful than somehow getting into the process itself (with pdb or the interactive interpreter) but it makes life a little easier.
If you only care about storing the traceback object (which is all you need to start a debugging session), you can use debuglater (a fork of pydump). It works with recent versions of Python and has an IPython/Jupyter integration.
If you want to store the entire session, look at dill. It has dump_session and load_session functions.
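For example (a sketch assuming the third-party dill package is installed; the file name is a placeholder):

import dill

results = {"answer": 42}          # ...whatever state you want to inspect later
dill.dump_session("session.pkl")  # write the whole interpreter session to disk

# Later, in a fresh process:
# import dill
# dill.load_session("session.pkl")
# print(results)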
Here are two other relevant projects:
python-checkpointing2
pycrunch-trace
If you're looking for a language-agnostic solution, you want to create a core dump file. Here's an example with Python.
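A rough sketch of forcing one from within the process on a POSIX system (assuming core files are enabled for the process):

import os
import resource

# Raise the soft core-file size limit to the hard limit so the kernel is
# allowed to write the dump.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

os.abort()  # raises SIGABRT; the kernel writes a core file for later gdb use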
Someone above said that there is no built-in way to perform this, but that's not entirely true. For an example, you could take a look at the Pylons debugging tools. When there is an exception, the exception handler saves the stack trace and prints a URL on the console that can be used to retrieve the debugging session over HTTP.
While they're probably keeping these sessions in memory, they're just Python objects, so there's nothing to stop you from pickling a stack dump and restoring it later for inspection. It would mean some changes to the app, but it should be possible...
After some research, it turns out the relevant code is actually coming from Paste's EvalException module. You should be able to look there to figure out what you need.
It's also possible to write something that would dump all the data from the process, e.g.
A Pickler that ignores the objects it can't pickle, replacing them with something else (e.g. Python: Pickling a dict with some unpicklable items); a minimal sketch follows after this list.
A method that recursively converts everything into serializable values (e.g. this, except it would need a check for infinitely recursing objects and a way to handle them; it could also try dir() and getattr() to process some of the unknown objects, e.g. extension classes).
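A minimal sketch of the first idea, assuming a best-effort snapshot of a namespace is enough: values that refuse to pickle are replaced with their repr() so the rest of the dump still loads.

import pickle

def dump_picklable(namespace, path):
    safe = {}
    for name, value in namespace.items():
        try:
            pickle.dumps(value)        # probe: can this value be pickled?
            safe[name] = value
        except Exception:
            try:
                safe[name] = "<unpicklable> " + repr(value)
            except Exception:
                safe[name] = "<unpicklable>"
    with open(path, "wb") as f:
        pickle.dump(safe, f)

# e.g. dump_picklable(locals(), "snapshot.pkl")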
But leaving the process running and getting into it with manhole or Pylons or something like that certainly seems more convenient when possible.
(also, I wonder if something more convenient was written since this question was first asked)
This answer suggests making your program core dump and then continuing execution on another sufficiently similar box.
