"an integer is required" when open()'ing a file as utf-8? - python

I have a file I'm trying to open up in python with the following line:
f = open("C:/data/lastfm-dataset-360k/test_data.tsv", "r", "utf-8")
Calling this gives me the error
TypeError: an integer is required
I deleted all other code besides that one line and am still getting the error. What have I done wrong and how can I open this correctly?

From the documentation for open():
open(name[, mode[, buffering]])
[...]
The optional buffering argument specifies the file’s desired buffer
size: 0 means unbuffered, 1 means line buffered, any other positive
value means use a buffer of (approximately) that size. A negative
buffering means to use the system default, which is usually line
buffered for tty devices and fully buffered for other files. If
omitted, the system default is used.
You appear to be trying to pass open() a string describing the file encoding as the third argument instead. Don't do that.

You are using the wrong open.
>>> help(open)
Help on built-in function open in module __builtin__:
open(...)
open(name[, mode[, buffering]]) -> file object
Open a file using the file() type, returns a file object. This is the
preferred way to open a file. See file.__doc__ for further information.
As you can see, it expects the buffering parameter, which is an integer.
What you probably want is codecs.open:
open(filename, mode='rb', encoding=None, errors='strict', buffering=1)

From the help docs:
open(...)
open(file, mode='r', buffering=-1, encoding=None,
errors=None, newline=None, closefd=True) -> file object
You need encoding='utf-8'; Python thinks you are passing in an argument for buffering.
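As a minimal sketch of the point above, assuming Python 3 (where the built-in open() accepts an encoding keyword) and a hypothetical scratch file standing in for test_data.tsv:

```python
import os
import tempfile

# Hypothetical scratch file standing in for the asker's test_data.tsv.
path = os.path.join(tempfile.mkdtemp(), "test_data.tsv")
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9\tartist\n")

# Passed positionally, "utf-8" lands in the buffering slot, which must be an int.
try:
    open(path, "r", "utf-8")
    positional_error = None
except TypeError as exc:
    positional_error = exc

# Passed by keyword, it selects the text encoding as intended.
with open(path, "r", encoding="utf-8") as f:
    first_line = f.readline()

print(type(positional_error).__name__)
print(repr(first_line))
```

On Python 2, io.open accepts the same keyword arguments and can be used in place of the built-in open().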

The last parameter to open is the size of the buffer, not the encoding of the file.
File streams are more or less encoding-agnostic (with the exception of newline translation on files not opened in binary mode). You should handle encoding elsewhere: for example, when you get the data with a read() call, you can interpret it as UTF-8 using its decode method.
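The decode-after-reading approach might look like the sketch below. It assumes Python 3, where binary mode is what returns undecoded bytes; the file name sample.txt is made up for illustration.

```python
import os
import tempfile

# Hypothetical scratch file containing UTF-8 encoded bytes.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("na\u00efve\n".encode("utf-8"))

# Read raw bytes; the stream itself does no decoding in binary mode.
with open(path, "rb") as f:
    raw = f.read()

# Interpret the bytes as UTF-8 after the fact.
text = raw.decode("utf-8")
print(repr(text))
```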

This resolved my issue, i.e. providing an encoding (utf-8) when opening the file:
with open('tomorrow.txt', mode='w', encoding='UTF-8', errors='strict', buffering=1) as file:
    file.write(result)

Related

Why is calling file write() method not working?

I have some Python3 code that opens a file in write mode, writes something to it, and then closes the file. The filename is an int. For some reason, the code doesn't work as expected. When I run the f.write() statement, a 6 is printed to the screen. And when I run the f.close() statement, the string that was supposed to be written is printed to the screen.
>>> f = open(2, 'w')
>>> f.write('FooBar')
6
>>> f.close()
FooBar>>>
>>>
I checked the directory that I ran this in and the file (named 2) was not created. Can anyone explain what is going on? I suspect it has to do with the filename being an int, but I'm not sure.
You're passing in a file descriptor number (2 for stderr).
See the documentation for open(), emphasis mine:
file is a path-like object giving the pathname (absolute or relative to the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped.
As to why nothing happens before .close() (or .flush()): your stream is line buffered, and you're not writing a newline. Use
f = open(2, 'wb', buffering=0)
to disable buffering. Unbuffered mode is only allowed in binary mode, so you then have to write bytes.
If you wish to write to a file called '2', you must pass a string.
f = open('2', 'w')
As an alternative to a file name (type str), open also accepts a file descriptor (type int). 2 is the file descriptor of stderr on Linux, so you are opening the standard error stream for writing, and everything you write to that file object will appear in your terminal/console. It appears only after you call file.close() because, by default, written content isn't immediately sent to the file. It is kept in a buffer, which gets written out only when a newline \n is encountered in the written content and, of course, when the file is closed. You can force a write to the file by calling file.flush().
The reason for the 6 you get on screen is that the return value of file.write is always the number of characters that have been written.
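The descriptor wrapping, the return value of write(), and the buffering can all be observed without touching stderr. The sketch below assumes CPython 3 and uses a descriptor for a hypothetical scratch file (demo.txt) instead of descriptor 2:

```python
import os
import tempfile

# Hypothetical scratch file; os.open returns an integer file descriptor.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
fd = os.open(path, os.O_WRONLY | os.O_CREAT)

# open() wraps an existing descriptor when given an int instead of a path.
f = open(fd, "w")
written = f.write("FooBar")              # number of characters written

# The text is still sitting in Python's buffer, not in the file.
size_before_flush = os.path.getsize(path)
f.flush()
size_after_flush = os.path.getsize(path)
f.close()                                # also closes fd (closefd=True by default)

print(written, size_before_flush, size_after_flush)
```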
In case you wanted to create a file with the name 2 in the current working directory, you need to wrap the 2 in quotes:
>>> f = open("2", 'w')
>>> f.write('FooBar')
6
>>> f.close()
>>>

Does opening a file in Python return a stream?

I have a simple question about opening a file in Python.
Doing something like this:
x = open('test.txt', 'rt')
print(x)
I obtain this output:
<_io.TextIOWrapper name='test.txt' mode='rt' encoding='cp1252'>
that is the Python object representing the opened file. Can it be considered a stream or not? What exactly is a stream in Python?
According to the official Python 3 documentation, it is a stream.
The easiest way to create a text stream is with open(), optionally specifying an encoding:
f = open("myfile.txt", "r", encoding="utf-8")
According to the Text I/O section of the docs,
The easiest way to create a text stream is with open(), optionally specifying an encoding
This seems to answer the question of whether it can be considered a stream in the affirmative. Whether the usage of the term ‘stream’ here is consistent with that in other languages is unclear.
It also bears noting that the object returned, and consequently its characteristics, depends on the mode used:
The type of file object returned by the open() function depends on the mode. When open() is used to open a file in a text mode ('w', 'r', 'wt', 'rt', etc.), it returns a subclass of io.TextIOBase (specifically io.TextIOWrapper). When used to open a file in a binary mode with buffering, the returned class is a subclass of io.BufferedIOBase. The exact class varies: in read binary mode, it returns an io.BufferedReader; in write binary and append binary modes, it returns an io.BufferedWriter, and in read/write mode, it returns an io.BufferedRandom. When buffering is disabled, the raw stream, a subclass of io.RawIOBase, io.FileIO, is returned.
io.BufferedIOBase, io.RawIOBase, and io.TextIOBase explicitly state in their documentation that they are base classes for streams.
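The mode-to-class mapping quoted above can be checked directly. This is a small sketch assuming Python 3; the helper stream_type and the scratch file name are made up for illustration.

```python
import io
import os
import tempfile

# Hypothetical scratch file so every mode below has something to open.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w") as f:
    f.write("hello\n")

def stream_type(mode, **kwargs):
    """Open the scratch file in the given mode and report the object's class."""
    with open(path, mode, **kwargs) as f:
        return type(f)

results = {
    "rt": stream_type("rt"),                       # text mode
    "rb": stream_type("rb"),                       # buffered binary read
    "r+b": stream_type("r+b"),                     # buffered binary read/write
    "ab": stream_type("ab"),                       # buffered binary append
    "rb, unbuffered": stream_type("rb", buffering=0),  # raw stream
}

for mode, cls in results.items():
    print(mode, "->", cls.__name__)
```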

Flush data written to numeric file handle?

How can I flush the content written to a file opened as a numeric file handle?
For illustration, one can do the following in Python:
f = open(fn, 'w')
f.write('Something')
f.flush()
By contrast, I cannot find an equivalent method when doing the following:
import os
fd = os.open(fn, os.O_WRONLY | os.O_CREAT)  # os.open() requires a flags argument
os.pwrite(fd, buffer, offset)
# How do I flush fd here?
Use os.fsync(fd). See docs for fsync.
Be careful if you do fsync on a file descriptor obtained from a python file object. In that case you need to flush the python file object first.
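Both cases can be sketched as follows, assuming Python 3 and a hypothetical scratch file. os.write is used here in place of os.pwrite only because os.pwrite is not available on every platform.

```python
import os
import tempfile

# Hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Case 1: a Python file object. Flush Python's buffer first, then fsync
# the underlying descriptor so the OS pushes the data to disk.
with open(path, "w") as f:
    f.write("Something")
    f.flush()
    os.fsync(f.fileno())

# Case 2: a raw descriptor. There is no Python-level buffer, so fsync alone suffices.
fd = os.open(path, os.O_WRONLY | os.O_APPEND)
try:
    os.write(fd, b" more")
    os.fsync(fd)
finally:
    os.close(fd)

with open(path) as f:
    content = f.read()
print(content)
```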

what is the difference between "r" and .read() in Python?

I want to open a text file and use split
Here is the code I wrote at first:
with open("test.txt","r") as file:
    file.split("\n")
Here is a second version I wrote because the first one didn't work:
txt=open("test.txt")
file=txt.read()
file.split("\n")
what is the difference between "r" and .read()?
The .read() method reads data from a file, so the file must be opened in read mode, and 'r' is the read mode you asked about.
So 'r' is the mode for the file, and .read() is a method for reading data.
open() is the actual function that opens any "path-like object," returning a "file-like object" (this is due to the principle of duck typing). You can optionally pass it a mode parameter, a short string such as 'r' or 'rb', indicating how to open the path-like object. Look at the signature for open():
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
You can see that the default mode is 'r', thus, if you do not specify a mode, it will default to 'r' anyways, so including 'r' as you did is generally redundant.
The documentation is here
Think of 'r' as the purpose for which the file is opened. If you open a file with 'r', you can't write through that handle. You should get an error such as the following (Python 2 shown; Python 3 raises io.UnsupportedOperation: not writable):
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
IOError: File not open for writing
read() is just one way to get your data from the file handle; readline() is also available.
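To make the distinction concrete, here is a small sketch assuming Python 3 and a hypothetical scratch file: the mode is a property of how the file was opened, while read() is the call that actually returns the text that split() can work on.

```python
import io
import os
import tempfile

# Hypothetical scratch file with three lines.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w") as f:
    f.write("one\ntwo\nthree")

with open(path) as f:                  # no mode given, so it defaults to 'r'
    mode = f.mode
    lines = f.read().split("\n")       # read() returns a str; the file object has no split()
    try:
        f.write("x")                   # not allowed on a read-only handle
        write_error = None
    except io.UnsupportedOperation as exc:
        write_error = exc

print(mode, lines, type(write_error).__name__)
```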

Don't convert newline when reading a file

I'm reading a text file:
f = open('data.txt')
data = f.read()
However newline in data variable is normalized to LF ('\n') while the file contains CRLF ('\r\n').
How can I instruct Python to read the file as is?
In Python 2.x:
f = open('data.txt', 'rb')
As the docs say:
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.)
In Python 3.x, there are three alternatives:
f1 = open('data.txt', 'rb')
This will leave newlines untransformed, but will also return bytes instead of str, which you will have to explicitly decode to Unicode yourself. (Of course the 2.x version also returned bytes that had to be decoded manually if you wanted Unicode, but in 2.x that's what a str object is; in 3.x str is Unicode.)
f2 = open('data.txt', 'r', newline='')
This will return str, and leave newlines untranslated. Unlike the 2.x equivalent, however, readline and friends will treat '\r\n' as a newline, instead of a regular character followed by a newline. Usually this won't matter, but if it does, keep it in mind.
f3 = open('data.txt', 'rb', encoding=locale.getpreferredencoding(False))
This treats newlines exactly the same way as the 2.x code, and returns str using the same encoding you'd get if you just used all of the defaults… but it's no longer valid in current 3.x.
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
The reason you need to specify an explicit encoding for f3 is that opening a file in binary mode means the default changes from "decode with locale.getpreferredencoding(False)" to "don't decode, and return raw bytes instead of str". Again, from the docs:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
However:
'encoding' … should only be used in text mode.
And, at least as of 3.3, this is enforced; if you try it with binary mode, you get ValueError: binary mode doesn't take an encoding argument.
So, if you want to write code that works on both 2.x and 3.x, what do you use? If you want to deal in bytes, obviously f and f1 are the same. But if you want to deal in str, as appropriate for each version, the simplest answer is to write different code for each, probably f and f2, respectively. If this comes up a lot, consider writing a wrapper function:
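import sys

if sys.version_info >= (3, 0):
    def crlf_open(path, mode):
        # 3.x: text mode with newline translation turned off
        return open(path, mode, newline='')
else:
    def crlf_open(path, mode):
        # 2.x: binary mode leaves newlines untouched
        return open(path, mode+'b')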
Another thing to watch out for in writing multi-version code is that, if you're not writing locale-aware code, locale.getpreferredencoding(False) almost always returns something reasonable in 3.x, but it will usually just return 'US-ASCII' in 2.x. Using locale.getpreferredencoding(True) is technically incorrect, but may be more likely to be what you actually want if you don't want to think about encodings. (Try calling it both ways in your 2.x and 3.x interpreters to see why—or read the docs.)
Of course if you actually know the file's encoding, that's always better than guessing anyway.
In either case, the 'r' means "read-only". If you don't specify a mode, the default is 'r', so the binary-mode equivalent to the default is 'rb'.
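The three Python 3 behaviours described above can be compared side by side. This sketch assumes Python 3 and a hypothetical scratch file containing CRLF line endings; the ASCII encoding is chosen only to keep the example platform-independent.

```python
import os
import tempfile

# Hypothetical scratch file written in binary mode so the CRLFs are exact.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as f:
    f.write(b"a\r\nb\r\n")

# Default text mode: universal newlines, '\r\n' is translated to '\n'.
with open(path, "r", encoding="ascii") as f:
    translated = f.read()

# newline='': still str, but line endings are returned untranslated.
with open(path, "r", encoding="ascii", newline="") as f:
    untranslated = f.read()

# With newline='', readlines() still treats '\r\n' as a single line ending.
with open(path, "r", encoding="ascii", newline="") as f:
    lines = f.readlines()

# Binary mode: raw bytes, nothing is decoded or translated.
with open(path, "rb") as f:
    raw = f.read()

print(repr(translated), repr(untranslated), lines, raw)
```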
You need to open the file in the binary mode:
f = open('data.txt', 'rb')
data = f.read()
('r' for "read", 'b' for "binary")
Then everything is returned as is; nothing is normalized.
You can use the codecs module to write 'version-agnostic' code:
Underlying encoded files are always opened in binary mode. No automatic conversion of '\n' is done on reading and writing. The mode argument may be any binary mode acceptable to the built-in open() function; the 'b' is automatically added.
import codecs
with codecs.open('foo', mode='r', encoding='utf8') as f:
    # python2: u'foo\r\n'
    # python3: 'foo\r\n'
    f.readline()
Just request "read binary" in the open:
f = open('data.txt', 'rb')
data = f.read()
Open the file using open('data.txt', 'rb'). See the doc.
