What's the difference between file and open in Python? When should I use which one? (Say I'm in 2.5)
You should always use open().
As the documentation states:
When opening a file, it's preferable
to use open() instead of invoking this
constructor directly. file is more
suited to type testing (for example,
writing "isinstance(f, file)").
Also, file() was removed in Python 3.0.
Two reasons: the Python philosophy of "There should be one obvious way to do it", and the fact that file is going away.
file is the actual type (using e.g. file('myfile.txt') is calling its constructor). open is a factory function that will return a file object.
In Python 3.0, file is going to move from being a built-in to being implemented by multiple classes in the io library (somewhat similar to Java with its buffered readers, etc.)
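For context, this is exactly how it turned out in Python 3: open() is the factory, and the concrete io class you get back depends on the mode and buffering you ask for. A quick sketch (the temporary file is only there to give open() a path):

```python
import io
import tempfile

# Create a throwaway file so we have something to open
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

f_text = open(path, 'r')               # text mode -> io.TextIOWrapper
f_bin = open(path, 'rb')               # buffered binary -> io.BufferedReader
f_raw = open(path, 'rb', buffering=0)  # unbuffered binary -> io.FileIO

print(type(f_text).__name__)  # TextIOWrapper
print(type(f_bin).__name__)   # BufferedReader
print(type(f_raw).__name__)   # FileIO

for f in (f_text, f_bin, f_raw):
    f.close()
```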
file() is a type, like an int or a list. open() is a function for opening files, and will return a file object.
This is an example of when you should use open:
f = open(filename, 'r')
for line in f:
    process(line)
f.close()
This is an example of when you should use file:
import sys

class LoggingFile(file):
    def write(self, data):
        sys.stderr.write("Wrote %d bytes\n" % len(data))
        super(LoggingFile, self).write(data)
As you can see, there's a good reason for both to exist, and a clear use-case for both.
Functionally, the two are the same; open will call file anyway, so currently the difference is a matter of style. The Python docs recommend using open.
When opening a file, it's preferable to use open() instead of invoking the file constructor directly.
The reason is that in future versions the two are not guaranteed to be the same (open will become a factory function, which returns objects of different types depending on the path it's opening).
Only ever use open() for opening files. file() is actually being removed in 3.0, and it's deprecated at the moment. They've had a sort of strange relationship, but file() is going now, so there's no need to worry anymore.
The following is from the Python 2.6 docs. [bracket stuff] added by me.
When opening a file, it’s preferable to use open() instead of invoking this [file()] constructor directly. file is more suited to type testing (for example, writing isinstance(f, file)).
According to Guido van Rossum, although open() is currently an alias for file(), you should use open() because this might change in the future.
Related
I have a working file [below], but I would like to know if there is a better solution to the first three lines.
I have several files in a folder, and a script that processes them based on a particular and conserved <string> in each file's name. However, I was told I should not use __contains__ (I am not a CS major, and don't fully understand why). Is there a better option? I could not find any other concise solution.
Thanks.
files = os.listdir(work_folder)
for i in files:
    if i.__contains__('FOO'):
        for i in range(number_of_files):
            old_file = 'C:/path/to/file'
            with open(merged_file, 'a+') as outfile:
                with open(old_file) as infile:
                    for line in infile:
                        outfile.write(line)
Generally in Python, double-underscore methods should not be called directly; you should use the global functions or operators that correspond to them. In this case, you would do if 'FOO' in i.
It would be more usual to write
if 'FOO' in i:
instead of
if i.__contains__('FOO'):
However, I would go one further than that and suggest your use case is more suited to glob
import glob
foo_files = glob.glob(os.path.join(work_folder, '*FOO*'))
As Daniel Roseman explains, the double-underscore methods aren't there to be called by you, they're there to be called by the Python interpreter or standard library.
So, that's the main reason you shouldn't call them: It's not idiomatic, so it will confuse readers.
But all you know is that there must be some operation that you are intended to use, which Python will implement by calling the __contains__ method. You have no idea what that operation is. How do you find it?
Well, you could just go to Stack Overflow, and someone helpful like Daniel Roseman will tell you, of course. But you can also search for __contains__ in the Python documentation. What you'll find is this:
object.__contains__(self, item)
Called to implement membership test operators. Should return true if
item is in self, false otherwise.
So, self.__contains__(item) is there for Python to implement item in self.
And now you know what to write: 'FOO' in i.
And if you read on in those linked docs, you'll see that it isn't actually quite true that i.__contains__('FOO') does the same thing as 'FOO' in i. That's true for the most common cases (including where i is a string, as it is here), but if i doesn't have a __contains__ method, but is an iterable, or an old-style sequence, in will use those instead.
So, that's another reason not to directly call __contains__. If you later add some abstraction on top of strings, maybe a virtual iterable of grapheme clusters or something, it may not implement __contains__, but in will still work.
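A tiny sketch of that fallback (GraphemeStream is a made-up class, invented purely for illustration):

```python
class GraphemeStream:
    """Hypothetical wrapper that is iterable but defines no __contains__."""
    def __init__(self, text):
        self.text = text

    def __iter__(self):
        return iter(self.text)

s = GraphemeStream("FOOBAR")
print('F' in s)                    # True: `in` falls back to iterating
print(hasattr(s, '__contains__'))  # False: there is no method to call directly
```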
I have an application which is working with projects. These projects are currently stored as pickles, generated with
cPickle.dump(project, open(filename, 'wb'), HIGHEST_PROTOCOL)
These project files need to be diffable because they are used in a version control environment.
The problem is, if I serialize the exact same object, the pickle turns out different every time. Protocol 0 works, but I need the files to be smaller (they are around 12 MB with protocol 0).
I found the solution, I'm gonna post it here in case anyone has the same question in the future.
The solution is to do a deepcopy of the object directly before it's pickled.
That way the reference count which apparently causes the differences gets reset and the files turn out the same when using HIGHEST_PROTOCOL.
So instead of
cPickle.dump(instance, open(filename, 'wb'), HIGHEST_PROTOCOL)
you need to do this:
from copy import deepcopy
cpy = deepcopy(instance)
cPickle.dump(cpy, open(filename, 'wb'), HIGHEST_PROTOCOL)
cpy = None
That way the file size is reduced significantly while still maintaining comparability.
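Here is the pattern sketched end to end, with a plain dict (containing a shared sub-object) standing in for the project instance, and Python 3's pickle playing the role of cPickle:

```python
import pickle
from copy import deepcopy

# Stand-in for the project object; note the shared internal reference
shared = [1, 2, 3]
instance = {'a': shared, 'b': shared}

# Pickle a fresh deep copy each time, so internal bookkeeping starts clean:
dump_a = pickle.dumps(deepcopy(instance), pickle.HIGHEST_PROTOCOL)
dump_b = pickle.dumps(deepcopy(instance), pickle.HIGHEST_PROTOCOL)
print(dump_a == dump_b)  # True: byte-identical output, hence diffable

# The shared reference also survives the round trip:
restored = pickle.loads(dump_a)
print(restored['a'] is restored['b'])  # True
```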
I know that in static languages it's always better to receive a file object rather than a string representing a path (from a software design standpoint). However, in a dynamic language like python where you can't see the type of the variable, what's the "correct" way to pass a file?
Isn't it problematic passing the function object since you need to remember to close it afterwards (which you probably won't since you can't see the type)?
Ideally you would be using the with statement whenever you open a file, so closing will be handled by that.
with open('filepath', 'r') as f:
    myfunc(f)
otherstuff()  # f is now closed
From the documentation:
It is good practice to use the with keyword when dealing with file
objects. This has the advantage that the file is properly closed after
its suite finishes, even if an exception is raised on the way.
Pass the file object just like any other value.
def f(myfile):
    myfile.write('asdf')

ofs = open(filepath, 'w')  # ofs is a file object
f(ofs)                     # passes the file object
ofs.close()
You can also have a function create and return the file object:
def f():
    return open(filepath, 'w')  # returns a file object

ofs = f()
ofs.write('something')
ofs.close()
However, in a dynamic language like python where you can't see the
type of the variable, what's the "correct" way to pass a file?
The short answer is - you don't.
In most object oriented languages, there is an object contract which guarantees that if the object has a method quack, it knows how to quack. Some languages are very strict in enforcing this contract (Java, for example) and others not so much.
In the end it comes down to one of Python's principles EAFP:
Easier to ask forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.
LBYL = Look Before You Leap
What this means is that if your method is expecting a "file" (and you state this in your documentation), assume you are being passed a "file-like object". Try to execute a file operation on the object (like read() or close()) and catch the exception if it's raised.
One of the main points of the EAFP approach is that you may be passed an object that works like a file; in other words, the caller knows what they are doing. So if you spend time checking for exact types, you'll have code that doesn't work when it should. Now the burden is on the caller to meet your "object contract"; but what if they are working not with files but with an in-memory buffer (which has the same methods as a file)? Or a request object (which, again, has the same file-like methods)? You can't possibly check for all these variations in your code.
This is the preferred approach - instead of the LBYL approach, which would be type checking first.
So, if your method's documentation states that it's expecting a file object, it should work with any object that is "file-like"; but when someone passes it a string file path, your method should raise an appropriate exception.
Also, and more importantly - you should avoid closing the object in your method; because it may not be a "file" like explained earlier. However if you absolutely must, make sure the documentation for your method states this very clearly.
Here is an example:
def my_method(fobj):
    ''' Writes to fobj, which is any file-like object,
    and returns the object '''
    try:
        fobj.write('The answer is: {}\n'.format(42))
    except (AttributeError, TypeError):
        raise TypeError('Expected file-like object')
    return fobj
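To see the duck typing pay off, the same function can be driven with an in-memory io.StringIO buffer; the sketch below repeats the method so it runs standalone:

```python
import io

def my_method(fobj):
    '''Writes to fobj, which is any file-like object, and returns the object.'''
    try:
        fobj.write('The answer is: {}\n'.format(42))
    except (AttributeError, TypeError):
        raise TypeError('Expected file-like object')
    return fobj

# An in-memory buffer quacks like a file, so it just works:
buf = my_method(io.StringIO())
print(buf.getvalue())  # The answer is: 42

# A bare path string has no write(), and triggers the documented error:
try:
    my_method('/tmp/answer.txt')
except TypeError as exc:
    print(exc)  # Expected file-like object
```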
You can use file objects in Python. When they are (automatically) garbage collected, the file will be closed for you.
File objects are implemented using C’s stdio package and can be created with the built-in open() function.
Is there any way of checking if a file has been created by pickle? I could just catch exceptions thrown by pickle.load but there is no specific "not a pickle file" exception.
Pickle files don't have a header, so there's no standard way of identifying them short of trying to unpickle one and seeing if any exceptions are raised while doing so.
You could define your own enhanced protocol that included some kind of header by subclassing the Pickler() and Unpickler() classes in the pickle module. However this can't be done with the much faster cPickle module because, in it, they're factory functions, which can't be subclassed [1].
A more flexible approach would be to define your own independent classes that use corresponding Pickler() and Unpickler() instances from either one of these modules in their implementation.
Update
The last byte of all pickle files should be the pickle.STOP opcode, so while there isn't a header, there is effectively a very minimal trailer which would be a relatively simple thing to check.
Depending on your exact usage, you might be able to get away with supplementing that with something more elaborate (and longer than one byte), since any data past the STOP opcode in a pickled object's representation is ignored [2].
[1] Footnote [2] in the Python 2 documentation.
[2] Documentation for pickle.loads(), which also applies to pickle.load() since it's currently implemented in terms of the former.
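A sketch of that trailer check (looks_like_pickle is a name I made up; it is only a heuristic, since arbitrary data can also happen to end with that byte):

```python
import pickle

def looks_like_pickle(data):
    """Heuristic: a complete pickle stream ends with the STOP opcode (b'.')."""
    return len(data) > 0 and data[-1:] == pickle.STOP

print(looks_like_pickle(pickle.dumps({'a': 1})))      # True
print(looks_like_pickle(b'definitely not a pickle'))  # False
```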
There is no sure way other than to try to unpickle it, and catch exceptions.
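That approach might look like this (try_unpickle is my own name for it; note that unpickling can execute arbitrary code, so only do this with files you trust):

```python
import pickle

def try_unpickle(data):
    """Return (True, obj) if data unpickles cleanly, else (False, None).

    Warning: only use on trusted input; unpickling can run arbitrary code.
    """
    try:
        return True, pickle.loads(data)
    except Exception:  # unpickling can fail with several exception types
        return False, None

print(try_unpickle(pickle.dumps([1, 2, 3])))  # (True, [1, 2, 3])
print(try_unpickle(b'garbage'))               # (False, None)
```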
I was running into this issue and found a fairly decent way of doing it. You can use the built-in pickletools module to deconstruct a pickle file and get the pickle operations. With pickle protocol v2 and higher, the first opcode will be PROTO, and the last one, as #martineau mentioned, is STOP. The following code displays these two opcodes. Note that the output of genops can be iterated, but the opcodes cannot be indexed directly, hence the for loop.
import pickletools

with open("file.pickle", "rb") as f:
    data = f.read()  # avoid calling this "pickle": that would shadow the module

opcodes = []
for opcode in pickletools.genops(data):
    opcodes.append(opcode[0])

print(opcodes[0].name)
print(opcodes[-1].name)
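The same check works on an in-memory pickle, which makes the claim easy to verify without a file on disk:

```python
import pickle
import pickletools

data = pickle.dumps({'a': 1}, protocol=2)

# genops yields (opcode, arg, position) tuples for each operation in the stream
opcodes = [op for op, arg, pos in pickletools.genops(data)]
print(opcodes[0].name)   # PROTO (first opcode for protocol 2 and up)
print(opcodes[-1].name)  # STOP  (always the final opcode)
```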
In C++, you can do this to easily read data into a class:
istream& operator >> (istream& instream, SomeClass& someclass) {
    ...
}
In python, the only way I can find to read from the console is the "raw_input" function, which isn't very adaptable to this sort of thing. Is there a pythonic way to go about this?
You are essentially looking for deserialization. Python has a myriad of options for this, depending on the library used. The default is Python pickling. There are lots of other options; you can have a look here.
No, there's no widespread Pythonic convention for "read the next instance of class X from this open input text file". I believe this applies to most languages, including e.g. Java; C++ is kind of the outlier there (and many C++ shops forbid the operator>> use in their local style guides). Serialization (to/from JSON or XML, if you need allegedly human-readable text files), suggested by another answer, is one possible approach, but not too hot (there's no standardized way to serialize completely general class instances to either XML or JSON).
Rather than use raw_input, you can read from sys.stdin (a file-like object):
import sys
input_line = sys.stdin.readline()
# do something with input_line
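If you want something shaped like C++'s operator>>, one common idiom is a classmethod that parses an instance from any file-like stream; sys.stdin works, and so does an io.StringIO for testing. SomeClass and its two-line input format below are invented for the sketch:

```python
import io

class SomeClass:
    def __init__(self, name, value):
        self.name = name
        self.value = value

    @classmethod
    def read_from(cls, stream):
        # Rough analogue of operator>>: consume one instance from a text stream
        name = stream.readline().strip()
        value = int(stream.readline())
        return cls(name, value)

# Any file-like object works: sys.stdin, an open file, or an in-memory buffer
obj = SomeClass.read_from(io.StringIO('answer\n42\n'))
print(obj.name, obj.value)  # answer 42
```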