I have a method that detects what file it should open based on input, opens the file, then returns the file object.
def find_and_open(search_term):
    # ... logic to find file
    return open(filename, 'r')
I like this approach because it hides the most implementation detail from the caller. You give it your criteria, and it spits out the file object. Why bother with string paths if you're just going to open the file anyway?
However, in other Python projects I tend to see such a method return a string of the filepath, instead of the fileobject itself. The file is then opened at the very last minute, read/edited, and closed.
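For illustration, here is a minimal sketch of that second pattern (find_file, the lookup logic, and the search term are placeholders):

def find_file(search_term):
    # ... logic to find file
    return filename  # just the path string; nothing is opened yet

path = find_file("report")
with open(path, "r") as f:  # opened at the last minute, closed automatically
    data = f.read()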
My questions are:
from a performance standpoint, does passing around file objects carry more overhead? I suppose a reference is a reference no matter what it points to, but perhaps there's something going on in the interpreter that makes a String reference faster to pass around than a file reference?
From a purely "Pythonic" standpoint, does it make more sense to return the file object, or the String path (and then open the file as late as possible)?
Performance is unlikely to be an issue. Reading and writing to disk are orders of magnitude slower than reading from RAM, so passing a pointer around is very unlikely to be the performance bottleneck.
From the python docs:
It is good practice to use the with keyword when dealing with file objects. This has the advantage that the file is properly closed after its suite finishes, even if an exception is raised on the way. It is also much shorter than writing equivalent try-finally blocks ...
Note that you can both use with to open a file and pass the file object around to other functions, either by nesting functions or by using yield, though I would consider this less Pythonic than passing a path string around in most cases.
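A small sketch of the nesting approach, for instance (process and foo.txt are stand-ins):

def process(f):
    # works on any open file object it is handed
    return f.read().split()

with open("foo.txt", "r") as f:  # assumes foo.txt exists
    words = process(f)           # f is guaranteed open inside the with block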
Simple is better than complex.
Flat is better than nested.
You might also be interested in pathlib.
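In case pathlib is unfamiliar, a quick taste (the paths are hypothetical):

from pathlib import Path

p = Path("data") / "foo.txt"  # paths compose with the / operator
print(p.suffix)               # '.txt'
with p.open("r") as f:        # Path objects know how to open themselves
    data = f.read()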
Related
I have a binary file consisting of an N-byte header, followed by a data block. I have input describing the locations of various bits of data, as offsets from the start of the data block; i.e. not the start of the file itself.
I want to open the file in such a way that seek() calls (and anything similar) seek within the data block, not the whole file.
Some options I've considered:
mmap. Close, but its offset argument is limited to multiples of pagesize.
Subclass file objects and override seek() to add the header length. Google suggests this may be annoying to do right.
Slurp the whole file minus the header, then wrap it in an io.BytesIO. Problematic if the file is huge.
Patch seek() on the opened file object. I think this might work but monkeypatching makes me nervous.
I think what I really want is something that looks and acts like a file object but only "sees" a portion of the underlying file. Is there a good way to do that?
Rather than inheritance and monkey-patching, you forgot a simpler option: implement a file-like class which seeks the way you want. Then you'll have full control and non-leaky encapsulation.
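A minimal read-only sketch of such a class (the name OffsetFile and the choice of methods are illustrative; a production version would flesh out more of the io interface):

import io

class OffsetFile(io.RawIOBase):
    # Wraps an open binary file, exposing only the bytes after header_size.
    def __init__(self, fileobj, header_size):
        self._f = fileobj
        self._offset = header_size
        self._f.seek(header_size)

    def seek(self, pos, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos += self._offset  # translate data-block offsets to file offsets
        return self._f.seek(pos, whence) - self._offset

    def tell(self):
        return self._f.tell() - self._offset

    def read(self, size=-1):
        return self._f.read(size)

With this, seek(0) lands on the first byte of the data block and tell() reports positions relative to it.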
mmap the entire file, and then use a memoryview of the data block you are interested in. Then offsets into the memory view will be relative to the data block.
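Roughly like this, assuming a read-only mapping and a known header length (the file name and sizes are made up):

import mmap

HEADER_SIZE = 128  # example header length

with open("data.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    data = memoryview(mm)[HEADER_SIZE:]  # index 0 is the start of the data block
    first_record = bytes(data[:16])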
Regarding syntax in Python
Why do we use open("file") to open but not "file".close() to close it?
Why isn't it "file".open() or inversely close("file")?
It's because open() is a function, and .close() is an object method. "file".open() doesn't make sense, because you're implying that the open() function is actually a class or instance method of the string "file". Not all strings are valid files or devices to be opened, so how the interpreter is supposed to handle "not a file-like device".open() would be ambiguous. We don't use "file".close() for the same reason.
close("file") would require a lookup of the file name, then another lookup to see if there are file handles, owned by the current process, attached to that file. That would be very inefficient and probably has hidden pitfalls that would make it unreliable (for example, what if it's not a file, but a TTY device instead?). It's much faster and simpler to just keep a reference to the opened file or device and close the file through that reference (also called a handle).
Many languages take this approach:
f = open("file") # open a file and return a file object or handle
# stuff...
close(f) # close the file, using the file handle or object as a reference
This looks similar to your close("file") construct, but don't be fooled: it's closing the file through a direct reference to it, not the file name as stored in a string.
The Python developers have chosen to do the same thing, but it looks different because they have implemented it with an object-oriented approach instead. Part of the reason for this is that Python file objects have a lot of methods available to them, such as read(), flush(), seek(), etc. If we used close(f), then we would have to either change all of the rest of the file object methods to functions, or let it be one random function that behaves differently from the rest for no good reason.
TL;DR
The design of open() and file.close() is consistent with OOP principles and with good file-reference practices. open() is a factory-like function that creates objects which reference files or other devices. Once the object is created, all other operations on it are done through class or instance methods.
Normally you shouldn't call close() explicitly at all, but use open(file) as a context manager, so the file handle is also closed if an exception happens. End of problem :-)
But to actually answer your question: I assume the reason is that open() supports many options, and the returned class differs depending on those options (see also the io module). So it would simply be much more complicated for the end user to remember which class they want and then call .open() on the right class themselves. Note that you can also pass an "integer file descriptor of the file to be wrapped" to open(), which would mean that besides a str.open() method you would also need an int.open(). That would be really bad OO design, and confusing too. I wouldn't want to guess what kind of questions would be asked on StackOverflow about that ("door".open(), (1).open())...
However, I must admit that there is a pathlib.Path.open() method. But if you have a Path, it isn't ambiguous anymore.
As to a close() function: each instance already has a close() method, and there are no differences between the different classes, so why create an additional function? There is simply no advantage.
Only slightly less new than you, but I'll give this one a go. Basically, opening and closing are pretty different actions in a language like Python. When you open a file, what you are really doing is creating an object in your application that represents the file: the function informs the OS that the file has been opened and builds an object that Python can use to read and write to it. When it comes time to close the file, your app needs to tell the OS that it is done with the file and dispose of the object that represented it from memory, and the easiest way to do that is with a method on the object itself.

Also note that a syntax like "file".open() would require the string type to include methods for opening files, which would be a very strange design and would require a lot of extensions on the string type for anything else you wanted to implement with that syntax. close(file) would make a bit more sense, but it would still be a clunky way of releasing the object and letting the OS know the file is no longer open, and you would be passing a variable file that represents the object created when you opened the file, rather than a string pointing to the file's path.
In addition to what has been said, I'll quote the Python changelog on the removal of the built-in file type. It (briefly) explains why the class-constructor approach using the file type (available in Python 2) was removed in Python 3:
Removed the file type. Use open(). There are now several different kinds of streams that open can return in the io module.
Basically, while file("filename") would create an instance of file, open("filename") can return instances of different stream classes, depending on the mode.
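You can see this for yourself in Python 3 (assuming some foo.txt exists):

with open("foo.txt", "r") as t, open("foo.txt", "rb") as b:
    print(type(t))  # <class '_io.TextIOWrapper'>
    print(type(b))  # <class '_io.BufferedReader'>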
https://docs.python.org/3.4/whatsnew/3.0.html#builtins
This question is purely out of curiosity. Brought up in a recent discussion for a question here, I've often wondered why the context manager (with) does not throw an error when people explicitly close the file anyway through misunderstanding... and then I found that you can call close() on a file multiple times with no error even without using with.
The only thing we can find relating to this is here and it just blandly says (emphasis mine):
close()
Close the file. A closed file cannot be read or written any more. Any operation which requires that the file be open will raise a ValueError after the file has been closed. Calling close() more than once is allowed.
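That behaviour is easy to demonstrate (assuming some foo.txt exists):

f = open("foo.txt")
f.close()
f.close()  # no error: closing twice is explicitly allowed
f.read()   # ValueError: I/O operation on closed file.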
It would appear that this is intentional by design but, if you can't perform any operations on the closed file without an exception, we can't work out why closing the file multiple times is permitted. Is there a use case?
Thinking about with is just wrong. This behaviour has been in Python forever and thus it's worth keeping it for backward compatibility.
Because it would serve no purpose to raise an exception. If you have an actual bug in your code where you might close the file before finishing using it, you'll get an exception when using the read or write operations anyway, and thus you'll never reach the second call to close.
Allowing it can also make code easier to write, avoiding lots of if not the_file.closed: the_file.close() guards.
The BDFL designed the file objects in that way and we are stuck with that behaviour since there's no strong reason to change it.
Resource management is a big deal. It's part of the reason why we have context managers in the first place.
I'm guessing that the core development team thought that it is better to make it "safe" to call close more than once to encourage people to close their files. Otherwise, you could get yourself into a situation where you're asking "Was this closed before?". If .close() couldn't be called multiple times, the only option would be to put your file.close() call in a try/except clause. That makes the code uglier and (frankly), lots of people probably would just delete the call to file.close() rather than handle it properly. In this case, it's just convenient to be able to call file.close() without worrying about any consequences since it's almost certain to succeed and leave you with a file that you know is closed afterward.
The function call fulfilled its promise: after calling it, the file is closed. It's not that something failed; you simply did not have to do anything at all to ensure the condition you were asking for.
In Python 3.2 (and other versions), the documentation for os.open states:
This function is intended for low-level I/O. For normal usage, use the built-in function open(), which returns a file object with read() and write() methods (and many more). To wrap a file descriptor in a file object, use fdopen().
And for fdopen():
Return an open file object connected to the file descriptor fd. This is an alias of open() and accepts the same arguments. The only difference is that the first argument of fdopen() must always be an integer.
This comment in a question on the difference between io.open and os.open (a difference that is entirely clear to me; I always use io.open, never os.open) asks why anyone would choose Python for low-level I/O, but doesn't really get an answer.
My question is very similar to the comment-question: In Python, what is the use case of low-level I/O through os.open, os.fdopen, os.close, os.read, etc.? I used to think it was needed to daemonise a process, but I'm not so sure anymore. Is there any task that can only be performed using low-level I/O, and not with the higher-level wrappers?
I use it when I need to use O_CREAT | O_EXCL to atomically create a file, failing if the file exists. You can't check for file existence then create the file if your test found that it does not exist, because that will create a race condition where the file could be created in the interim period between your check and creation.
Briefly looking at the link you provided, I do believe pidfile creation has a race condition.
In Python 3.3, there is a new 'x' mode added to open() that seems to do this. I haven't tried it, though.
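As a sketch, here is the atomic create-or-fail idiom with the low-level API, plus the 'x' mode equivalent (pidfile.lock is a made-up name):

import os

# fails with FileExistsError if pidfile.lock already exists
fd = os.open("pidfile.lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
with os.fdopen(fd, "w") as f:
    f.write(str(os.getpid()))

# Python 3.3+ equivalent: 'x' mode performs the same atomic check,
# so running this right after the above raises FileExistsError
with open("pidfile.lock", "x") as f:
    f.write(str(os.getpid()))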
Major Differences:
Low level access to files is unbuffered
Low level access is not portable
Low level allows more fine grained control, e.g. whether to block or not to block upon read
Use cases for low level io:
The file is a block device
The file is a socket
The file is a tty
...
In all these cases you might wish to have that more fine grained control (over buffering and blocking behavior).
You probably never will need the low level functions for regular files. I think most of the time the use case will be some device driver stuff. However, this would better be done in C. But I can see the use case for python as well, e.g. for fast prototyping of device drivers.
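For instance, a sketch of a non-blocking read from a device (the device path is hypothetical, and O_NONBLOCK has no effect on regular files):

import os

fd = os.open("/dev/ttyUSB0", os.O_RDONLY | os.O_NONBLOCK)
try:
    try:
        chunk = os.read(fd, 512)  # returns immediately rather than blocking
    except BlockingIOError:
        chunk = b""               # no data available right now
finally:
    os.close(fd)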
I've got a few questions about best practices in Python. Not too long ago I would do something like this with my code:
...
junk_block = "".join(open("foo.txt","rb").read().split())
...
I don't do this anymore because I can see that it makes code harder to read, but would the code run slower if I split the statements up like so:
f_obj = open("foo.txt", "rb")
f_data = f_obj.read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
I also noticed that there's nothing keeping you from doing an 'import' within a function block, is there any reason why I should do that?
As long as you're inside a function (not at module top level), assigning intermediate results to local bare names has an essentially negligible cost. At module top level, assigning to a "local" bare name implies churning on a dict (the module's __dict__) and is measurably costlier than it would be within a function; the remedy is never to have "substantial" code at module top level, but always to stash substantial code within a function!
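You can see the difference with the dis module: a function-local assignment compiles to STORE_FAST (an indexed slot in the frame), while the same statement at module level compiles to STORE_NAME (a dict operation on the module's __dict__):

import dis

def f():
    x = 1

dis.dis(f)                                     # STORE_FAST
dis.dis(compile("x = 1", "<module>", "exec"))  # STORE_NAME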
Python's general philosophy includes "flat is better than nested" -- and that includes highly "nested" expressions. Looking at your original example...:
junk_block = "".join(open("foo.txt","rb").read().split())
presents another important issue: when is that file getting closed? In CPython today, you need not worry -- reference counting in practice does ensure timely closure. But most other Python implementations (Jython on the JVM, IronPython on .NET, PyPy on all sorts of backends, pynie on Parrot, Unladen Swallow on LLVM if and when it matures per its published roadmap, ...) do not guarantee the use of reference counting -- many garbage collection strategies may be involved, with all sorts of other advantages.
Without any guarantee of reference counting (and even in CPython it's always been deemed an implementation artifact, not part of the language semantics!), you might be exhausting resources, by executing such "open but no close" code in a tight loop -- garbage collection is triggered by scarcity of memory, and does not consider other limited resources such as file descriptors. Since 2.6 (and 2.5, with an "import from the future"), Python has a great solution via the RAII ("resource acquisition is initialization") approach supported by the with statement:
with open("foo.txt","rb") as f:
    junk_block = "".join(f.read().split())
is the least-nested way that will ensure timely closure of the file across all compliant versions of Python. The stronger semantics make it preferable.
Beyond ensuring the correct, and prudent;-), semantics, there's not that much to choose between nested and flattened versions of an expression such as this. Given the task "remove all runs of whitespace from the file's contents", I would be tempted to benchmark alternative approaches based on re and on the .translate method of strings (the latter, esp. in Python 2.*, is often the fastest way to delete all characters from a certain set!), before settling on the "split and rejoin" approach if it proves to be faster -- but that's really a rather different issue;-).
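In Python 3, such a benchmark might look roughly like this (foo.txt and the whitespace set are assumptions):

import timeit

setup = 'import re; data = open("foo.txt").read()'
for stmt in ['"".join(data.split())',
             're.sub(r"\\s+", "", data)',
             'data.translate(str.maketrans("", "", " \\t\\n\\r"))']:
    print(stmt, timeit.timeit(stmt, setup=setup, number=1000))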
First of all, there's not really a reason you shouldn't use the first example - it's quite readable in that it's concise about what it does. No reason to break it up, since it's just a linear combination of calls.
Second, import within a function block is useful if there's a particular library function that you only need within that function - since the scope of an imported symbol is only the block within which it is imported, if you only ever use something once, you can just import it where you need it and not have to worry about name conflicts in other functions. This is especially handy with from X import Y statements, since Y won't be qualified by its containing module name and thus might conflict with a similarly named function in a different module being used elsewhere.
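A small illustration (the function and the imported module are arbitrary choices):

def parse_timestamp(raw):
    # datetime is only needed by this function, so import it locally
    from datetime import datetime
    return datetime.strptime(raw, "%Y-%m-%d")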
from PEP 8 (which is worth reading anyway)
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants
That line has almost the same result as this:
junk_block = open("foo.txt","rb").read().replace(' ', '')
In your example you are splitting the text into a list of words and then joining them back together with no spaces. The example above instead uses the str.replace() method. (Note the difference: split() with no arguments splits on any run of whitespace, including tabs and newlines, whereas replace(' ', '') removes only space characters.)
The differences:
Yours builds a file object into memory, builds a string into memory by reading it, builds a list into memory by splitting the string, builds a new string by joining the list.
Mine builds a file object into memory, builds a string into memory by reading it, builds a new string into memory by replacing spaces.
You can see a bit less RAM is used in the new variation but more processor is used. RAM is more valuable in some cases and so memory waste is frowned upon when it can be avoided.
Most of the memory will be garbage collected immediately but multiple users at the same time will hog RAM.
If you want to know if your second code fragment is slower, the quick way to find out would be to just use timeit. I wouldn't expect there to be that much difference though, since they seem pretty equivalent.
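For instance (assuming some foo.txt exists; the file is read in binary mode, so the join uses a bytes separator):

import timeit

setup = 'data = open("foo.txt", "rb").read()'
print(timeit.timeit('b"".join(data.split())', setup=setup, number=1000))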
You should also ask if a performance difference actually matters in the code in question. Often readability is of more value than performance.
I can't think of any good reasons for importing a module in a function, but sometimes you just don't know you'll need to do something until you see the problem. I'll have to leave it to others to point out a constructive example of that, if it exists.
I think both pieces of code are readable. I (and this is just a question of personal style) would probably use the first, adding a comment line, something like: "Open the file and convert the data inside into a list".
Also, there are times when I use the second, maybe not quite so separated, but something like:
f_data = open("foo.txt", "rb").read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
But then I'm giving more weight to each operation, which could be important in the flow of the code. I think it's important that you are comfortable with the code and won't find it difficult to understand in the future.
Definitely, the code will not be (at least, much) slower, as the only "overhead" you're adding is assigning the results to names.