I'm trying to write to file the commit hash via Python. So I did:
f = open('git.txt', 'w')
f.write(str(subprocess.check_output(['C:/Program Files/Git/bin/git', 'rev-parse', 'HEAD'])))
f.close()
But this wrote the following to file:
b'714548ca074bd6e7c40973375e32413e63a67027\n'
I would like just:
714548ca074bd6e7c40973375e32413e63a67027
How may I do that?
That's just a byte string. All you need to do is decode it before writing it:
r = subprocess.check_output(['C:/Program Files/Git/bin/git', 'rev-parse', 'HEAD'])
f.write(r.strip().decode())
r.strip() removes the trailing '\n'; alternatively, you can use r[:-1].decode() if you prefer.
Also, as @torek notes, it is best to open files using the with statement, which automatically closes the file for you.
So:
# add .strip().decode() at the end if you want a single line statement.
res = subprocess.check_output(['C:/Program Files/Git/bin/git', 'rev-parse', 'HEAD'])
with open('git.txt', 'w') as f:
    f.write(res.strip().decode())
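To see why the b'...' wrapper ended up in the file, here is a minimal sketch comparing str() with decode(). The byte string is hard-coded as a stand-in for what check_output returns (an assumption, so the snippet runs without Git):

```python
# Sample bytes standing in for the output of `git rev-parse HEAD`
# (hard-coded here as an assumption so the snippet runs without Git).
raw = b'714548ca074bd6e7c40973375e32413e63a67027\n'

# str() on a bytes object gives its repr, including the b'...' wrapper
# and the escaped newline: this is what ended up in git.txt.
as_repr = str(raw)

# decode() converts the bytes to text; strip() drops the trailing newline.
as_text = raw.decode().strip()

print(as_repr)
print(as_text)
```

The first print shows the unwanted form, the second shows the bare hash.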
In Python 3, subprocess.check_output returns bytes objects, not str strings:
By default, this function will return the data as encoded bytes. The actual encoding of the output data may depend on the command being invoked, so the decoding to text will often need to be handled at the application level.
However, if you're confident you'll be getting data in your platform's default encoding (safe enough, here), you can set the parameter universal_newlines to True:
If universal_newlines is True, these file objects will be opened as text streams in universal newlines mode using the encoding returned by locale.getpreferredencoding(False).
This also normalizes end-of-line characters to '\n' (as the name implies), although it does not remove the trailing newline; call .strip() on the result if you need that.
Here's a function that returns Git's output as a string, using universal_newlines:
def git_hash(commit_name='HEAD'):
    git_command = 'C:/Program Files/Git/bin/git'
    hash_string = subprocess.check_output(
        [git_command, 'rev-parse', commit_name],
        universal_newlines=True
    )
    # strip() removes the trailing newline that git prints after the hash
    return hash_string.strip()
And here is an example of writing that string to a file:
fname = 'C:/temp/git_hash.txt'
with open(fname, 'w') as f:
    f.write(git_hash())
This uses the with open(...): syntax that was suggested in comments, and also in The Python Tutorial. It's (unfortunately) well-hidden, appearing at the end of section 7.2.1. Methods of File Objects.
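As a quick check of what universal_newlines=True returns, here is a small sketch that runs the current Python interpreter instead of git. That substitution is an assumption made so the example works on machines without Git; the printed value 'abc123' is made up.

```python
import subprocess
import sys

# Run the current Python interpreter as a stand-in for git (an assumption
# made so the example works on machines without Git installed).
out = subprocess.check_output(
    [sys.executable, '-c', 'print("abc123")'],
    universal_newlines=True,
)

# The result is a str rather than bytes, and it still ends with a newline.
is_text = isinstance(out, str)
cleaned = out.strip()
print(repr(out), repr(cleaned))
```

The output is already text, so no decode() call is needed, but the trailing newline has to be stripped separately.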
Working in Python 3.7.
I'm currently pulling data from an API (Qualys's API, fetching a report) to be specific. It returns a string with all the report data in a CSV format with each new line designated with a '\r\n' escape.
(i.e. 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n')
The problem I'm having is writing this string properly to a CSV file. Every iteration of code I've tried writes the data cell by cell when viewed in Excel, all on one row, with the \r\n left wherever it was in the string rather than starting a new line.
(i.e |foo|bar|stuff\r\n|more stuff|data|report\r\n|etc|etc|etc\r\n|)
I'm just making the switch from 2 to 3, so I'm almost positive it's a syntactical error or an error in my understanding of how Python 3 handles newline delimiters, or something along those lines. But even after reviewing the documentation, posts here, and blog posts, I either can't get my head around it or I'm consistently missing something.
current code:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res)) #returns string
    #input('pause')
    f_csv = open(title,'w', newline='\r\n')
    f_csv.write(res)
    f_csv.close()
but i've also tried:
with open(title, 'w', newline='\r\n') as f:
    writer = csv.writer(f,<tried encoding here, no luck>)
    writer.writerows(res)
    #anyone else looking at this, this didn't work because of the difference
    #between writerow() and writerows()
and I've also tried various ways to declare newline, such as:
newline=''
newline='\n'
etc...
and various other iterations along these lines. Any suggestions or guidance or... anything at this point would be awesome.
edit:
Ok, I've continued to work on it, and this kinda works:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res)) #returns string
    reader = csv.reader(res.split(r'\r\n'), delimiter=',')
    with open(title, 'w') as outfile:
        writer = csv.writer(outfile, delimiter= '\n')
        writer.writerow(reader)
But it's ugly, and it does create errors in the output CSV (some rows, fewer than 1%, don't parse as a CSV row, probably because of a formatting error somewhere). More concerning is that it behaves erratically when a "\" is present in the data.
I would really be interested in a solution that works... better? More Pythonic? More consistently would be nice...
Any ideas?
Based on your comments, the data you're being served doesn't actually include carriage returns or newlines, it includes the text representing the escapes for carriage returns and newlines (so it really has a backslash, r, backslash, n in the data). It's otherwise already in the form you want, so you don't need to involve the csv module at all, just interpret the escapes to their correct value, then write the data directly.
This is relatively simple using the unicode-escape codec (which also handles ASCII escapes):
import codecs # Needed for text->text decoding
# ... retrieve data here, store to res ...
# Converts backslash followed by r to carriage return, by n to newline,
# and so on for other escapes
decoded = codecs.decode(res, 'unicode-escape')
# newline='' means don't perform line ending conversions, so you keep \r\n
# on all systems, no adding, no removing characters
# You may want to explicitly specify an encoding like UTF-8, rather than
# relying on the system default, so your code is portable across locales
with open(title, 'w', newline='') as f:
    f.write(decoded)
If the strings you receive are actually wrapped in quotes (so print(repr(s)) includes quotes on either end), it's possible they're intended to be interpreted as JSON strings. In that case, just replace the import and creation of decoded with:
import json
decoded = json.loads(res)
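Here is a self-contained sketch of the unicode-escape approach, run against a made-up sample string and a temporary file (both are assumptions for illustration). The raw string literal gives the sample literal backslashes, as described for the API data. Non-ASCII characters in the input may not survive this codec unchanged, so treat it as suited to ASCII data.

```python
import codecs
import os
import tempfile

# Hypothetical sample: a raw string, so it contains literal backslashes
# followed by 'r' and 'n' rather than real CR/LF characters.
res = r'foo,bar,stuff\r\nmore stuff,data,report\r\n'

# Interpret the escape sequences as the characters they stand for.
decoded = codecs.decode(res, 'unicode-escape')

# Write without newline translation, using an explicit encoding.
path = os.path.join(tempfile.mkdtemp(), 'report.csv')
with open(path, 'w', newline='', encoding='utf-8') as f:
    f.write(decoded)

# Read back in binary mode to inspect the exact bytes on disk.
with open(path, 'rb') as f:
    on_disk = f.read()
print(on_disk)
```

Reading the file back in binary mode confirms that each row ends in a real CR LF pair.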
If I understand your question correctly, can't you just replace the string?
with open(title, 'w') as f: f.write(res.replace("\r\n","\n"))
Check out this answer:
Python csv string to array
According to CSVReader's documentation, it expects \r\n as the line delimiter by default. Your string should work fine with it. If you load the string into the CSVReader object, then you should be able to check for the standard way to export it.
Python strings use the single \n newline character. Normally, \r\n is converted to \n when a file is read, and on write \n is converted to \n or \r\n depending on your system default and the newline= parameter.
In your case, \r wasn't removed when you read the data from the web interface. When you opened the file with newline='\r\n', Python expanded the \n as it was supposed to, but the \r passed through, so your newline is now \r\r\n. You can see that by rereading the text file in binary mode:
>>> res = 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
>>> open('test', 'w', newline='\r\n').write(res)
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\r\n,more stuff,data,report\r\r\n,etc,etc,etc\r\r\n'
Since you already have the line endings you want, just write in binary mode and skip the conversions:
>>> open('test', 'wb').write(res.encode())
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
Notice I used the system default encoding, but you likely want to standardize on an encoding.
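For a version with a fixed encoding, the sketch below writes the same sample string twice, once in binary mode with an explicit UTF-8 encode and once in text mode with newline='', then compares the bytes on disk. The temporary path and the choice of UTF-8 are assumptions for illustration.

```python
import os
import tempfile

res = 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
path = os.path.join(tempfile.mkdtemp(), 'test.csv')

# Binary mode with an explicit encoding: bytes go to disk untouched.
with open(path, 'wb') as f:
    f.write(res.encode('utf-8'))
with open(path, 'rb') as f:
    via_binary = f.read()

# Text mode with newline='' also disables newline translation on write.
with open(path, 'w', newline='', encoding='utf-8') as f:
    f.write(res)
with open(path, 'rb') as f:
    via_text = f.read()

print(via_binary == via_text)
```

Both routes leave the existing \r\n endings alone, so neither produces the doubled \r.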
I have basically the following code:
def main():
    for filename in fileinput.input():
        filename = filename.strip()
        process_file(filename)
The script takes a newline-separated list of file names as its input. However, some of the file names contain invalid utf8, which causes fileinput.input() to implode. I've read about the surrogateescape error handler, which I think is what I want, but I don't know how to set the error handler for fileinput.
In short: how do I get fileinput to deal with invalid Unicode?
Filenames on POSIX may be arbitrary sequences of bytes (except b'\0' and b'/'), i.e., no character encoding can decode them in the general case. That is why os.fsdecode() exists; it uses the surrogateescape error handler.
You could use a binary mode to read the filenames then either skip undecodable filenames if the input shouldn't contain them or pass them as is (or os.fsdecode()) to functions that expect filenames:
for filename in fileinput.input(mode='rb'):
    process_file(os.fsdecode(filename).strip())
Beware, there were several known Python bugs related to using a binary mode and fileinput e.g.:
fileinput should use stdin.buffer for "rb" mode
fileinput.FileInput.readline() always returns str object at the end even if in 'rb' mode
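To illustrate what the surrogateescape handler does with undecodable bytes, here is a minimal sketch using a made-up filename. It calls decode() and encode() with explicit arguments rather than os.fsdecode(), so the result does not depend on the platform's filesystem encoding.

```python
# A byte sequence that is not valid UTF-8 (0xE9 is 'é' in Latin-1);
# the name is a made-up example.
raw_name = b'caf\xe9.txt'

# A strict decode fails on the invalid byte...
try:
    raw_name.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...while surrogateescape maps the bad byte to a lone surrogate,
name = raw_name.decode('utf-8', 'surrogateescape')

# and encoding with the same handler restores the original bytes.
round_trip = name.encode('utf-8', 'surrogateescape')

print(strict_ok, ascii(name), round_trip)
```

Because the round trip is lossless, the decoded name can be handled as a str and still be turned back into the exact bytes of the original filename.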
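Following the documentation, use an opening hook. To tolerate invalid UTF-8, also pass an error handler (the errors parameter of hook_encoded requires Python 3.6 or later):
def main():
    for filename in fileinput.input(
            openhook=fileinput.hook_encoded("utf-8", errors="surrogateescape")):
        filename = filename.strip()
        process_file(filename)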
I have a piece of python code that reads from a txt file properly, but my colleague gave me another set of files that appears to be of type txt file as well. But when I ran the same python code, each line is read incorrectly.
For the new files, if the line is 240,022414114120,-500,Bauer_HS5,0
It would be read as str:2[]4[]0 []0[]2[]2[]4..... All those little rectangles between each character and the leading question mark characters are all invalid characters.
And it will further get converted to something like this:
[['\xff\xfe2\x004\x000\x00', '\x000\x002\x002\x004\x001\x004\x001\x001\x004\x001\x002\x000\x00', '\x00-\x005\x000\x000\x00',......
However, if I manually create a normal text file and copy/paste the content from the input file, the parser is able to read each line correctly. So I think the input files are of a different type than a normal text file. But the files' suffix is indeed 'txt'.
The files come from a device that regularly sends files to our server. This parser works fine for another device that does the same thing. And the files from both devices are all of type 'txt'.
Each line is read with: for line in self._infile.xreadlines():
I am very confused why it would behave this way.
My python code is following.
def __init__(self, infile=sys.stdin, outfile=sys.stdout):
    if isinstance(infile, basestring):
        infile = open(infile)
    if isinstance(outfile, basestring):
        outfile = open(outfile, "w")
    self._infile = infile
    self._outfile = outfile
def sort(self):
    lines = []
    last_second = None
    for line in self._infile.xreadlines():
        line = line.replace('\r\n', '')
        fields = line.split(',')
        if len(fields) < 2:
            continue
        second = fields[1]
        if last_second and second != last_second:
            lines = sorted(lines, self._sort_lines)
            self._outfile.write("".join([','.join(x) for x in lines]))
            #self._outfile.write("\r\n")
            lines = []
        last_second = second
        lines.append(fields)
    if lines:
        lines = sorted(lines, self._sort_lines)
        self._outfile.write("".join([','.join(x) for x in lines]))
        #self._outfile.write("\r\n")
    self._infile.close()
    self._outfile.close()
The start of the file you described as coming from your colleague is "\xff\xfe". These two characters make up a "byte order mark" that indicates that the file is encoded with the "UTF-16-LE" encoding (that is, 16-bit Unicode with the lower byte first). Your Python script is reading with an 8-bit encoding (probably whatever your system's default encoding is), so you're seeing lots of extra null characters (the high bytes of the 16-bit characters).
I can't speak to how the file got a different encoding. Windows text editors (like notepad.exe) are somewhat notorious for silently reencoding files in unhelpful ways if you're not careful with them, so it may be that your colleague previewed the file in an editor and then saved it before forwarding it on to you.
Anyway, the simplest fix is probably to reencode the file. There are various utilities to do this on various OSs, or you could write your own easily enough. Here's a quick and dirty function to reencode a file in Python (which will hopefully raise an exception if the encoding parameters are wrong, but perhaps not always):
def reencode_file(filename, from_encoding="UTF-16-LE", to_encoding="ascii"):
    with open(filename, "rb") as f:
        in_bytes = f.read() # read bytes
    text = in_bytes.decode(from_encoding) # decode to unicode
    out_bytes = text.encode(to_encoding) # reencode to new encoding
    with open(filename, "wb") as f:
        f.write(out_bytes) # write back to the file
If the file you get is going to always be encoded in UTF-16, you could change your regular script to decode it automatically. In Python 2.7, I'd suggest using the io module's open function for this (it is the same code that the regular open uses in Python 3). Note however that the file object returned won't support the xreadlines method which has been deprecated for a long time (just iterate over the file directly instead).
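A minimal sketch of that io.open approach follows. It first creates a small UTF-16 file to stand in for the device output (the path and the sample line are assumptions for illustration), then reads it back by iterating over the file object.

```python
import io
import os
import tempfile

# Create a small UTF-16 file to stand in for the device's output
# (the path and contents are assumptions for illustration).
path = os.path.join(tempfile.mkdtemp(), 'device.txt')
with io.open(path, 'w', encoding='utf-16', newline='') as f:
    f.write(u'240,022414114120,-500,Bauer_HS5,0\r\n')

# The 'utf-16' codec reads the byte order mark and picks the byte order.
rows = []
with io.open(path, encoding='utf-16') as infile:
    for line in infile:  # iterate directly; no xreadlines()
        rows.append(line.rstrip(u'\r\n').split(u','))
print(rows)
```

Using 'utf-16' rather than 'utf-16-le' lets the codec consume the byte order mark, so it does not show up as a stray character at the start of the first field.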
I have a python program that requires the user to paste texts into it to process them to the various tasks. Like this:
line=(input("Paste text here: ")).lower()
The pasted text comes from a .txt file. To avoid any issues with the code (since the text contains multiple quotation marks), the user has to do the following: type 3 quotation marks, paste the text, and type 3 quotation marks again.
Can all of the above be avoided by having python read the .txt? and if so, how?
Please let me know if the question makes sense.
In Python2, just use raw_input to receive input as a string. No extra quotation marks on the part of the user are necessary.
line=(raw_input("Paste text here: ")).lower()
Note that input is equivalent to
eval(raw_input(prompt))
and applying eval to user input is dangerous, since it allows the user to evaluate arbitrary Python expressions. A malicious user could delete files or even run arbitrary functions so never use input in Python2!
In Python3, input behaves like raw_input, so there your code would have been fine.
If instead you'd like the user to type the name of the file, then
filename = raw_input("Text filename: ")
with open(filename, 'r') as f:
    line = f.read()
Troubleshooting:
Ah, you are using Python3 I see. When you open a file in r mode, Python tries to decode the bytes in the file into a str. If no encoding is specified, it uses locale.getpreferredencoding(False) as the default encoding. Apparently that is not the right encoding for your file. If you know what encoding your file is using, it is best to supply it with the encoding parameter:
open(filename, 'r', encoding=...)
Alternatively, a hackish approach which is not nearly as satisfying is to ignore decoding errors:
open(filename, 'r', errors='ignore')
A third option would be to read the file as bytes:
open(filename, 'rb')
Of course, this has the obvious drawback that you'd then be dealing with bytes like \x9d rather than characters like ·.
Finally, if you'd like some help guessing the right encoding for your file, run
with open(filename, 'rb') as f:
    contents = f.read()
print(repr(contents))
and post the output.
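The three options above can be compared side by side. This sketch writes a few sample bytes that are valid Latin-1 but not valid UTF-8 (the data and the temporary path are assumptions for illustration) and reads them back each way:

```python
import os
import tempfile

# Write bytes that are valid Latin-1 but not valid UTF-8
# (sample data, assumed for illustration).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write(b'caf\xe9\n')

# Correct encoding: every byte maps to the intended character.
with open(path, 'r', encoding='latin-1') as f:
    right = f.read()

# Wrong encoding with errors='ignore': the undecodable byte is silently dropped.
with open(path, 'r', encoding='utf-8', errors='ignore') as f:
    ignored = f.read()

# Binary mode: no decoding at all, so the raw bytes come back.
with open(path, 'rb') as f:
    raw = f.read()

print(repr(right), repr(ignored), repr(raw))
```

Only the read with the right encoding keeps the accented character, which is why supplying the correct encoding is preferable when you know it.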
You can use the following:
with open("file.txt") as fl:
    file_contents = [x.rstrip() for x in fl]
This will result in the variable file_contents being a list, where each element of the list is a line of your file with the newline character stripped off the end.
If you want to iterate over each line of the file, you can do this:
with open("file.txt") as fl:
    for line in fl:
        pass  # Do something with line
The rstrip() method gets rid of whitespace at the end of a string, and it is useful for getting rid of the newline character.
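One detail worth knowing is that rstrip() with no argument removes all trailing whitespace, not just the newline. A short sketch with a made-up line shows the difference when an explicit character set is passed:

```python
line = 'value with trailing spaces   \r\n'

# No argument: strips spaces, '\r' and '\n' from the end.
stripped_all = line.rstrip()

# Explicit characters: strips only the line-ending characters.
stripped_eol = line.rstrip('\r\n')

print(repr(stripped_all))
print(repr(stripped_eol))
```

Passing '\r\n' is the safer choice if trailing spaces in the data are meaningful.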
I'm reading a text file:
f = open('data.txt')
data = f.read()
However, the newlines in the data variable are normalized to LF ('\n'), while the file contains CRLF ('\r\n').
How can I instruct Python to read the file as is?
In Python 2.x:
f = open('data.txt', 'rb')
As the docs say:
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.)
In Python 3.x, there are three alternatives:
f1 = open('data.txt', 'rb')
This will leave newlines untransformed, but will also return bytes instead of str, which you will have to explicitly decode to Unicode yourself. (Of course the 2.x version also returned bytes that had to be decoded manually if you wanted Unicode, but in 2.x that's what a str object is; in 3.x str is Unicode.)
f2 = open('data.txt', 'r', newline='')
This will return str, and leave newlines untranslated. Unlike the 2.x equivalent, however, readline and friends will treat '\r\n' as a newline, instead of a regular character followed by a newline. Usually this won't matter, but if it does, keep it in mind.
f3 = open('data.txt', 'rb', encoding=locale.getpreferredencoding(False))
This treats newlines exactly the same way as the 2.x code, and returns str using the same encoding you'd get if you just used all of the defaults… but it's no longer valid in current 3.x.
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
The reason you need to specify an explicit encoding for f3 is that opening a file in binary mode means the default changes from "decode with locale.getpreferredencoding(False)" to "don't decode, and return raw bytes instead of str". Again, from the docs:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
However:
'encoding' … should only be used in text mode.
And, at least as of 3.3, this is enforced; if you try it with binary mode, you get ValueError: binary mode doesn't take an encoding argument.
So, if you want to write code that works on both 2.x and 3.x, what do you use? If you want to deal in bytes, obviously f and f1 are the same. But if you want to deal in str, as appropriate for each version, the simplest answer is to write different code for each, probably f and f2, respectively. If this comes up a lot, consider writing a wrapper function:
if sys.version_info >= (3, 0):
    def crlf_open(path, mode):
        return open(path, mode, newline='')
else:
    def crlf_open(path, mode):
        return open(path, mode+'b')
Another thing to watch out for in writing multi-version code is that, if you're not writing locale-aware code, locale.getpreferredencoding(False) almost always returns something reasonable in 3.x, but it will usually just return 'US-ASCII' in 2.x. Using locale.getpreferredencoding(True) is technically incorrect, but may be more likely to be what you actually want if you don't want to think about encodings. (Try calling it both ways in your 2.x and 3.x interpreters to see why—or read the docs.)
Of course if you actually know the file's encoding, that's always better than guessing anyway.
In either case, the 'r' means "read-only". If you don't specify a mode, the default is 'r', so the binary-mode equivalent to the default is 'rb'.
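The Python 3 behaviours described above can be checked with a short sketch. It writes a CRLF file to a temporary path (an assumption for illustration) and reads it back with the default newline handling, with newline='', and in binary mode:

```python
import os
import tempfile

# Create a file with CRLF line endings at a temporary path.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'wb') as f:
    f.write(b'one\r\ntwo\r\n')

# Default (newline=None): universal newlines, '\r\n' becomes '\n'.
with open(path, 'r', encoding='ascii') as f:
    translated = f.read()

# newline='': text mode, but line endings are returned untranslated.
with open(path, 'r', encoding='ascii', newline='') as f:
    untranslated = f.read()

# Binary mode: raw bytes, nothing decoded or translated.
with open(path, 'rb') as f:
    raw = f.read()

print(repr(translated), repr(untranslated), repr(raw))
```

Only the default mode changes the line endings; newline='' keeps them while still returning str.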
You need to open the file in the binary mode:
f = open('data.txt', 'rb')
data = f.read()
('r' for "read", 'b' for "binary")
Then everything is returned as is; nothing is normalized.
You can use the codecs module to write 'version-agnostic' code:
Underlying encoded files are always opened in binary mode. No automatic conversion of '\n' is done on reading and writing. The mode argument may be any binary mode acceptable to the built-in open() function; the 'b' is automatically added.
import codecs
with codecs.open('foo', mode='r', encoding='utf8') as f:
    # python2: u'foo\r\n'
    # python3: 'foo\r\n'
    f.readline()
Just request "read binary" in the open:
f = open('data.txt', 'rb')
data = f.read()
Open the file using open('data.txt', 'rb'). See the doc.