How to deal with invalid UTF-8 in fileinput? - python

I have basically the following code:
def main():
    for filename in fileinput.input():
        filename = filename.strip()
        process_file(filename)
The script takes a newline-separated list of file names as its input. However, some of the file names contain invalid UTF-8, which causes fileinput.input() to implode. I've read about the surrogateescape error handler, which I think is what I want, but I don't know how to set the error handler for fileinput.
In short: how do I get fileinput to deal with invalid Unicode?

Filenames on POSIX may be arbitrary sequences of bytes (except b'\0' and b'/'), i.e., no character encoding can decode them in the general case (that is why os.fsdecode() exists, which uses the surrogateescape error handler).
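For illustration, a minimal sketch of that round trip, assuming a POSIX system with a UTF-8 locale (the 0xff byte is a made-up example of invalid UTF-8):
import os

# a hypothetical filename containing 0xff, which is invalid UTF-8
raw = b'caf\xff.txt'
# os.fsdecode() applies the surrogateescape handler, so this does not raise
name = os.fsdecode(raw)
# os.fsencode() reverses the escaping and recovers the original bytes exactly
assert os.fsencode(name) == raw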
You could use a binary mode to read the filenames then either skip undecodable filenames if the input shouldn't contain them or pass them as is (or os.fsdecode()) to functions that expect filenames:
for filename in fileinput.input(mode='rb'):
    process_file(os.fsdecode(filename).strip())
Beware: there have been several known Python bugs related to using binary mode with fileinput, e.g.:
fileinput should use stdin.buffer for "rb" mode
fileinput.FileInput.readline() always returns str object at the end even if in 'rb' mode
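If those bugs affect your Python version, one workaround (a sketch, assuming the file list arrives on stdin rather than as command-line arguments) is to bypass fileinput and read the byte stream directly:
import os
import sys

def main():
    # read raw bytes from stdin, sidestepping fileinput's binary-mode quirks
    for raw in sys.stdin.buffer:
        filename = os.fsdecode(raw).strip()
        if filename:
            process_file(filename)  # process_file as in the question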

Following the documentation, use an opening hook:
def main():
    for filename in fileinput.input(openhook=fileinput.hook_encoded("utf-8")):
        filename = filename.strip()
        process_file(filename)
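Note that hook_encoded("utf-8") by itself will still raise on invalid UTF-8. On Python 3.6+ the hook also accepts an errors argument, so you can combine both approaches; a sketch:
import fileinput

def main():
    # errors= was added to fileinput.hook_encoded in Python 3.6
    hook = fileinput.hook_encoded("utf-8", errors="surrogateescape")
    for filename in fileinput.input(openhook=hook):
        process_file(filename.strip())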

Related

Python 3 loop through subprocess output to search for filename

I am running a terminal command to list a directory. I would like to loop through each line returned and search for a particular filename. I have this so far...
import subprocess
for line in subprocess.check_output(['ls', '-l']):
    if "(myfile.txt)" in line:
        print("File Found")
But this just outputs the list and doesn't seem to be searching for the file. Anyone have an example they can point me at?
Calling ls from within subprocess returns a bytes object. So, first, you might want to convert the returned value to a string, then split the string with newline ("\n") as the delimiter. Afterwards, you can iterate and search for your needle in the list values.
import subprocess

# subprocess.check_output(['ls', '-l']) returns a bytes object,
# so we decode the bytes into a string first,
# and then split at the newline boundary to convert it to a list
for line in bytes.decode(subprocess.check_output(['ls', '-l'])).split(sep="\n"):
    # now we can check if the desired file is in the line
    if "(myfile.txt)" in line:
        print("File Found")
You can try to pass in the encoding utf-8 and split it by \n.
for line in subprocess.check_output(['ls', '-l'], encoding="utf-8").split("\n"):
    # print(line)
    if "myfile.txt" in line:
        print("File Found")
Originally, check_output was returning bytes, so we pass in encoding here. And since you want to search line by line, we split on \n. (Tested on Python 3.)
subprocess.check_output: ... By default, this function will return the data as encoded bytes. The actual encoding of the output data may depend on the command being invoked, so the decoding to text will often need to be handled at the application level. This behaviour may be overridden by setting universal_newlines to True as described above in Frequently Used Arguments. -- cited from https://docs.python.org/3/library/subprocess.html#subprocess.check_output
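Given that, a sketch of the universal_newlines variant, which makes check_output return text directly:
import subprocess

# universal_newlines=True decodes the output with the locale's preferred encoding
output = subprocess.check_output(['ls', '-l'], universal_newlines=True)
for line in output.splitlines():
    if "myfile.txt" in line:
        print("File Found")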
Why not use something more reliable, such as os.listdir or glob:
import glob
if glob.glob('myfile.txt'):
    print('File found')
else:
    print('File not found')
The glob.glob function returns a list of files that match the wildcard. In this case, you will have ['myfile.txt'] if the file exists, or [] if not.
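If all you need is an existence check, os.path.exists is even more direct:
import os.path

if os.path.exists('myfile.txt'):
    print('File found')
else:
    print('File not found')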
import os

def find(name):
    # walk the whole drive looking for an exact filename match
    for root, dirs, files in os.walk('C:\\'):
        if name in files:
            print(root, name)
    print("FINISH")
    input()

try:
    s = input("name: ")
    find(s)
except Exception:
    pass
To output the contents of a directory, I would recommend the os module.
import os
content = os.listdir(os.getcwd())
Then you have a searchable list.
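For example, an exact-name membership test against that list:
import os

content = os.listdir(os.getcwd())
# an exact filename match rather than a substring search
if 'myfile.txt' in content:
    print('File Found')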
But are you sure your file is named (myfile.txt)?

Pipe git commit hash to file in Python

I'm trying to write the commit hash to a file via Python. So I did:
f = open('git.txt', 'w')
f.write(str(subprocess.check_output(['C:/Program Files/Git/bin/git', 'rev-parse', 'HEAD'])))
f.close()
But this wrote the following to file:
b'714548ca074bd6e7c40973375e32413e63a67027\n'
I would like just:
714548ca074bd6e7c40973375e32413e63a67027
How may I do that?
That's just a byte string. All you need to do is decode it before writing it:
r = subprocess.check_output(['C:/Program Files/Git/bin/git', 'rev-parse', 'HEAD'])
f.write(r.strip().decode())
r.strip() was called to remove the trailing '\n'; you can alternatively do r[:-1].decode() if you prefer that.
Also, as @torek notes, it is best to open files using the with statement, which automatically closes the file for you.
So:
# add .strip().decode() at the end if you want a single line statement.
res = subprocess.check_output(['C:/Program Files/Git/bin/git', 'rev-parse', 'HEAD'])
with open('git.txt', 'w') as f:
    f.write(res.strip().decode())
In Python 3, subprocess.check_output returns bytes objects, not str strings:
By default, this function will return the data as encoded bytes. The actual encoding of the output data may depend on the command being invoked, so the decoding to text will often need to be handled at the application level.
However, if you're confident you'll be getting data in your platform's default encoding (safe enough, here), you can set the parameter universal_newlines to True:
If universal_newlines is True, these file objects will be opened as text streams in universal newlines mode using the encoding returned by locale.getpreferredencoding(False).
This will also handle common whitespace annoyances like end-of-line characters (as the name implies).
Here's a function that returns Git's output as a string, using universal_newlines:
def git_hash(commit_name='HEAD'):
    git_command = 'C:/Program Files/Git/bin/git'
    hash_string = subprocess.check_output(
        [git_command, 'rev-parse', commit_name],
        universal_newlines=True
    )
    return hash_string
And here is an example of writing that string to a file:
fname = 'C:/temp/git_hash.txt'
with open(fname, 'w') as f:
    f.write(git_hash())
This uses the with open(...): syntax that was suggested in comments, and also in The Python Tutorial. It's (unfortunately) well-hidden, appearing at the end of section 7.2.1. Methods of File Objects.
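As a side note, on Python 3.7+ you can write the same thing with text=True, which is a more readable alias for universal_newlines=True; a sketch using the paths from the question:
import subprocess

# text=True (Python 3.7+) is an alias for universal_newlines=True
commit = subprocess.check_output(
    ['C:/Program Files/Git/bin/git', 'rev-parse', 'HEAD'],
    text=True
).strip()
with open('C:/temp/git_hash.txt', 'w') as f:
    f.write(commit)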

Cannot open filename that has umlauts in Python 2.7 & Django 1.9

I am trying to do a thing that goes through every file in a directory, but it crashes every time it meets a file that has an umlaut in the name, like ä.txt.
the shortened code:
import codecs
import os
for filename in os.listdir(WATCH_DIRECTORY):
    with codecs.open(filename, 'rb', 'utf-8') as rawdata:
        data = rawdata.readline()
# ...
And then I get this:
IOError: [Errno 2] No such file or directory: '\xc3\xa4.txt'
I've tried to encode/decode the filename variable with .encode('utf-8'), .decode('utf-8') and both combined. This usually leads to "ascii cannot decode blah blah"
I also tried unicode(filename) with and without encode/decode.
Soooo, kinda stuck here :)
You are opening a relative path; you need to make it absolute.
This has nothing really to do with encodings; both Unicode strings and byte strings will work, especially when sourced from os.listdir().
However, os.listdir() produces just the base filename, not a path, so add that back in:
for filename in os.listdir(WATCH_DIRECTORY):
    fullpath = os.path.join(WATCH_DIRECTORY, filename)
    with codecs.open(fullpath, 'rb', 'utf-8') as rawdata:
By the way, I recommend you use the io.open() function rather than codecs.open(). The io module is the new Python 3 I/O framework, and is a lot more robust than the older codecs module.
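For example, the same loop with io.open (a sketch; note that io.open decodes in text mode, so unlike codecs.open it will not accept 'b' in the mode together with an encoding):
import io
import os

for filename in os.listdir(WATCH_DIRECTORY):
    fullpath = os.path.join(WATCH_DIRECTORY, filename)
    with io.open(fullpath, 'r', encoding='utf-8') as rawdata:
        data = rawdata.readline()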

Don't convert newline when reading a file

I'm reading a text file:
f = open('data.txt')
data = f.read()
However, newlines in the data variable are normalized to LF ('\n'), while the file contains CRLF ('\r\n').
How can I instruct Python to read the file as is?
In Python 2.x:
f = open('data.txt', 'rb')
As the docs say:
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.)
In Python 3.x, there are three alternatives:
f1 = open('data.txt', 'rb')
This will leave newlines untransformed, but will also return bytes instead of str, which you will have to explicitly decode to Unicode yourself. (Of course the 2.x version also returned bytes that had to be decoded manually if you wanted Unicode, but in 2.x that's what a str object is; in 3.x str is Unicode.)
f2 = open('data.txt', 'r', newline='')
This will return str, and leave newlines untranslated. Unlike the 2.x equivalent, however, readline and friends will treat '\r\n' as a newline, instead of a regular character followed by a newline. Usually this won't matter, but if it does, keep it in mind.
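A quick sketch of the difference, assuming data.txt starts with a hypothetical line 'first line' terminated by CRLF:
with open('data.txt', 'r') as f:             # newline=None: universal newlines
    print(repr(f.readline()))                # 'first line\n'
with open('data.txt', 'r', newline='') as f:
    print(repr(f.readline()))                # 'first line\r\n'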
f3 = open('data.txt', 'rb', encoding=locale.getpreferredencoding(False))
This treats newlines exactly the same way as the 2.x code, and returns str using the same encoding you'd get if you just used all of the defaults… but it's no longer valid in current 3.x.
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
The reason you need to specify an explicit encoding for f3 is that opening a file in binary mode means the default changes from "decode with locale.getpreferredencoding(False)" to "don't decode, and return raw bytes instead of str". Again, from the docs:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
However:
'encoding' … should only be used in text mode.
And, at least as of 3.3, this is enforced; if you try it with binary mode, you get ValueError: binary mode doesn't take an encoding argument.
So, if you want to write code that works on both 2.x and 3.x, what do you use? If you want to deal in bytes, obviously f and f1 are the same. But if you want to deal in str, as appropriate for each version, the simplest answer is to write different code for each, probably f and f2, respectively. If this comes up a lot, consider writing a wrapper function:
if sys.version_info >= (3, 0):
    def crlf_open(path, mode):
        return open(path, mode, newline='')
else:
    def crlf_open(path, mode):
        return open(path, mode + 'b')
Another thing to watch out for in writing multi-version code is that, if you're not writing locale-aware code, locale.getpreferredencoding(False) almost always returns something reasonable in 3.x, but it will usually just return 'US-ASCII' in 2.x. Using locale.getpreferredencoding(True) is technically incorrect, but may be more likely to be what you actually want if you don't want to think about encodings. (Try calling it both ways in your 2.x and 3.x interpreters to see why—or read the docs.)
Of course if you actually know the file's encoding, that's always better than guessing anyway.
In either case, the 'r' means "read-only". If you don't specify a mode, the default is 'r', so the binary-mode equivalent to the default is 'rb'.
You need to open the file in binary mode:
f = open('data.txt', 'rb')
data = f.read()
('r' for "read", 'b' for "binary")
Then everything is returned as is; nothing is normalized.
You can use the codecs module to write 'version-agnostic' code:
Underlying encoded files are always opened in binary mode. No automatic conversion of '\n' is done on reading and writing. The mode argument may be any binary mode acceptable to the built-in open() function; the 'b' is automatically added.
import codecs
with codecs.open('foo', mode='r', encoding='utf8') as f:
    # python2: u'foo\r\n'
    # python3: 'foo\r\n'
    f.readline()
Just request "read binary" in the open:
f = open('data.txt', 'rb')
data = f.read()
Open the file using open('data.txt', 'rb'). See the doc.

"an integer is required" when open()'ing a file as utf-8?

I have a file I'm trying to open up in python with the following line:
f = open("C:/data/lastfm-dataset-360k/test_data.tsv", "r", "utf-8")
Calling this gives me the error
TypeError: an integer is required
I deleted all other code besides that one line and am still getting the error. What have I done wrong and how can I open this correctly?
From the documentation for open():
open(name[, mode[, buffering]])
[...]
The optional buffering argument specifies the file's desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used.
You appear to be trying to pass open() a string describing the file encoding as the third argument instead. Don't do that.
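On Python 2, where that signature applies, the usual fix is io.open, which does take an encoding argument; a minimal sketch using the path from the question:
import io

# io.open on Python 2 behaves like Python 3's built-in open
with io.open("C:/data/lastfm-dataset-360k/test_data.tsv", "r", encoding="utf-8") as f:
    data = f.read()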
You are using the wrong open.
>>> help(open)
Help on built-in function open in module __builtin__:

open(...)
    open(name[, mode[, buffering]]) -> file object

    Open a file using the file() type, returns a file object. This is the
    preferred way to open a file. See file.__doc__ for further information.
As you can see, it expects the buffering parameter, which is an integer.
What you probably want is codecs.open:
open(filename, mode='rb', encoding=None, errors='strict', buffering=1)
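For instance, a sketch with the path from the question:
import codecs

# with codecs.open, the third positional argument is the encoding, not buffering
f = codecs.open("C:/data/lastfm-dataset-360k/test_data.tsv", "r", "utf-8")
try:
    data = f.read()
finally:
    f.close()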
From the help docs:
open(...)
    open(file, mode='r', buffering=-1, encoding=None,
         errors=None, newline=None, closefd=True) -> file object
You need encoding='utf-8'; Python thinks you are passing in an argument for buffering.
The last parameter to open is the size of the buffer, not the encoding of the file.
File streams are more or less encoding-agnostic (with the exception of newline translation on files not opened in binary mode), so you should handle encoding elsewhere (e.g., when you get the data with a read() call, you can interpret it as UTF-8 using its decode method).
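In other words, a sketch of the decode-after-read approach:
# open in binary mode, then interpret the bytes yourself
f = open("C:/data/lastfm-dataset-360k/test_data.tsv", "rb")
data = f.read().decode("utf-8")
f.close()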
This resolved my issue, i.e., providing an encoding (utf-8) while opening the file:
with open('tomorrow.txt', mode='w', encoding='UTF-8', errors='strict', buffering=1) as file:
    file.write(result)
