Using a txt file as input for Python

I have a Python program that requires the user to paste text into it for various processing tasks, like this:
line = input("Paste text here: ").lower()
The pasted text comes from a .txt file. To avoid any issues with the code (since the text contains multiple quotation marks), the user has to type three quotation marks, paste the text, and type three quotation marks again.
Can all of this be avoided by having Python read the .txt file directly? If so, how?
Please let me know if the question makes sense.

In Python 2, just use raw_input to receive input as a string. No extra quotation marks are needed on the user's part.
line = raw_input("Paste text here: ").lower()
Note that input is equivalent to
eval(raw_input(prompt))
and applying eval to user input is dangerous, since it lets the user evaluate arbitrary Python expressions. A malicious user could delete files or even run arbitrary functions, so never use input in Python 2!
In Python 3, input behaves like raw_input, so there your code would have been fine.
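To make the contrast concrete: the triple-quote workaround was needed exactly because input evaluates whatever is typed, so bare pasted text (with its quotation marks) is a syntax error unless it is wrapped into a string literal, while raw_input returns it verbatim. A hypothetical Python 2 session:
>>> raw_input("Paste text here: ")
Paste text here: He said "hello" and left.
'He said "hello" and left.'
>>> input("Paste text here: ")
Paste text here: He said "hello" and left.
Traceback (most recent call last):
  ...
SyntaxError: invalid syntax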
If instead you'd like the user to type the name of the file, then
filename = raw_input("Text filename: ")
with open(filename, 'r') as f:
    line = f.read()
Troubleshooting:
Ah, you are using Python3 I see. When you open a file in r mode, Python tries to decode the bytes in the file into a str. If no encoding is specified, it uses locale.getpreferredencoding(False) as the default encoding. Apparently that is not the right encoding for your file. If you know what encoding your file is using, it is best to supply it with the encoding parameter:
open(filename, 'r', encoding=...)
Alternatively, a hackish approach which is not nearly as satisfying is to ignore decoding errors:
open(filename, 'r', errors='ignore')
A third option would be to read the file as bytes:
open(filename, 'rb')
Of course, this has the obvious drawback that you'd then be dealing with bytes like \x9d rather than characters like ·.
Finally, if you'd like some help guessing the right encoding for your file, run
with open(filename, 'rb') as f:
    contents = f.read()
print(repr(contents))
and post the output.
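Alternatively, a crude way to narrow the encoding down yourself is to try a few likely candidates and see which ones decode without errors. This is only a sketch (and note that latin-1 will decode any byte sequence, so a clean decode is necessary but not sufficient evidence):
with open(filename, 'rb') as f:
    contents = f.read()

for enc in ('utf-8', 'utf-16', 'cp1252', 'latin-1'):
    try:
        contents.decode(enc)
        print(enc, 'decodes cleanly')
    except UnicodeDecodeError:
        print(enc, 'fails')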

You can use the following:
with open("file.txt") as fl:
    file_contents = [x.rstrip() for x in fl]
This will result in the variable file_contents being a list, where each element of the list is a line of your file with the newline character stripped off the end.
If you want to iterate over each line of the file, you can do this:
with open("file.txt") as fl:
    for line in fl:
        pass  # Do something with line here
The rstrip() method gets rid of whitespace at the end of a string, and it is useful for getting rid of the newline character.
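For example:
>>> 'foo bar \r\n'.rstrip()
'foo bar'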

Related

Writing a string to CSV using line escapes in python 3

Working in Python 3.7.
I'm currently pulling data from an API (Qualys's API, fetching a report) to be specific. It returns a string with all the report data in a CSV format with each new line designated with a '\r\n' escape.
(i.e. 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n')
The problem I'm having is writing this string properly to a CSV file. Every version of the code I've tried writes the data cell by cell (when viewed in Excel), with the \r\n appended wherever it was in the string, all on one row rather than on new lines.
(i.e |foo|bar|stuff\r\n|more stuff|data|report\r\n|etc|etc|etc\r\n|)
I'm just making the switch from Python 2 to 3, so I'm almost positive it's a syntax error or a gap in my understanding of how Python 3 handles newline delimiters, but even after reviewing the documentation, Stack Overflow, and blog posts, I either can't get my head around it or I'm consistently missing something.
current code:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res))  # returns string
    #input('pause')
    f_csv = open(title, 'w', newline='\r\n')
    f_csv.write(res)
    f_csv.close()
but i've also tried:
with open(title, 'w', newline='\r\n') as f:
    writer = csv.writer(f, <tried encoding here, no luck>)
    writer.writerows(res)
    # anyone else looking at this: this didn't work because of the difference
    # between writerow() and writerows()
and I've also tried various ways to declare newline, such as:
newline=''
newline='\n'
etc...
and various other iterations along these lines. Any suggestions or guidance or... anything at this point would be awesome.
edit:
Ok, I've continued to work on it, and this kinda works:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res))  # returns string
    reader = csv.reader(res.split(r'\r\n'), delimiter=',')
    with open(title, 'w') as outfile:
        writer = csv.writer(outfile, delimiter='\n')
        writer.writerow(reader)
But it's ugly, and it creates errors in the output CSV (some rows, fewer than 1%, don't parse as CSV rows, probably a formatting error somewhere), and more concerning is that it behaves erratically when a "\" appears in the data.
I would really be interested in a solution that works better, is more Pythonic, or at least works more consistently.
Any ideas?
Based on your comments, the data you're being served doesn't actually include carriage returns or newlines, it includes the text representing the escapes for carriage returns and newlines (so it really has a backslash, r, backslash, n in the data). It's otherwise already in the form you want, so you don't need to involve the csv module at all, just interpret the escapes to their correct value, then write the data directly.
This is relatively simple using the unicode-escape codec (which also handles ASCII escapes):
import codecs # Needed for text->text decoding
# ... retrieve data here, store to res ...
# Converts backslash followed by r to carriage return, by n to newline,
# and so on for other escapes
decoded = codecs.decode(res, 'unicode-escape')
# newline='' means don't perform line ending conversions, so you keep \r\n
# on all systems, no adding, no removing characters
# You may want to explicitly specify an encoding like UTF-8, rather than
# relying on the system default, so your code is portable across locales
with open(title, 'w', newline='') as f:
    f.write(decoded)
If the strings you receive are actually wrapped in quotes (so print(repr(s)) includes quotes on either end), it's possible they're intended to be interpreted as JSON strings. In that case, just replace the import and creation of decoded with:
import json
decoded = json.loads(res)
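A quick illustration of the decoding step, using a stand-in string shaped like the one in the question (the doubled backslashes mean the data really contains backslash characters):
import codecs

res = 'foo,bar,stuff\\r\\n,more stuff,data,report\\r\\n'  # literal \r\n escapes
decoded = codecs.decode(res, 'unicode-escape')

print(repr(res))      # backslash-r, backslash-n: two characters per escape
print(repr(decoded))  # real carriage return and newline characters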
If I understand your question correctly, can't you just replace the string?
with open(title, 'w') as f:
    f.write(res.replace("\r\n", "\n"))
Check out this answer:
Python csv string to array
According to the csv module's documentation, csv.reader expects \r\n as the line terminator by default, so your string should work fine with it. If you load the string into a csv.reader object, you should then be able to export it in the standard way.
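For what it's worth, a sketch of that round trip (assuming res really contains carriage return and newline characters rather than literal escapes, with res and title as in the question):
import csv

rows = list(csv.reader(res.splitlines()))  # parse the CSV text into rows
with open(title, 'w', newline='') as f:
    csv.writer(f).writerows(rows)          # re-export as standard CSV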
Python strings use the single \n newline character. Normally, a \r\n is converted to \n when a file is read, and on write the newline is converted to \n or \r\n depending on your system default and the newline= parameter.
In your case, the \r wasn't removed when you read the data from the web interface. When you opened the file with newline='\r\n', Python expanded each \n as it was supposed to, but the \r passed through untouched, so your newline became \r\r\n. You can see that by rereading the text file in binary mode:
>>> res = 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
>>> open('test', 'w', newline='\r\n').write(res)
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\r\n,more stuff,data,report\r\r\n,etc,etc,etc\r\r\n'
Since you already have the line endings you want, just write in binary mode and skip the conversions:
>>> open('test', 'wb').write(res.encode())
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
Notice I used the system default encoding, but you likely want to standardize on an encoding.
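For example, encoding explicitly:
>>> open('test', 'wb').write(res.encode('utf-8'))
54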

Pipe git commit hash to file in Python

I'm trying to write to file the commit hash via Python. So I did:
import subprocess

f = open('git.txt', 'w')
f.write(str(subprocess.check_output(['C:/Program Files/Git/bin/git', 'rev-parse', 'HEAD'])))
f.close()
But this wrote the following to file:
b'714548ca074bd6e7c40973375e32413e63a67027\n'
I would like just:
714548ca074bd6e7c40973375e32413e63a67027
How may I do that?
That's just a byte string. All you need to do is decode it before writing it:
r = subprocess.check_output(['C:/Program Files/Git/bin/git', 'rev-parse', 'HEAD'])
f.write(r.strip().decode())
r.strip() is called to remove the trailing '\n'; alternatively, you can do r[:-1].decode() if you prefer.
Also, as #torek notes, it is best to open files using the with statement which automatically closes it for you.
So:
import subprocess

# add .strip().decode() at the end if you want a single-line statement.
res = subprocess.check_output(['C:/Program Files/Git/bin/git', 'rev-parse', 'HEAD'])
with open('git.txt', 'w') as f:
    f.write(res.strip().decode())
In Python 3, subprocess.check_output returns bytes objects, not str strings:
By default, this function will return the data as encoded bytes. The actual encoding of the output data may depend on the command being invoked, so the decoding to text will often need to be handled at the application level.
However, if you're confident you'll be getting data in your platform's default encoding (safe enough, here), you can set the parameter universal_newlines to True:
If universal_newlines is True, these file objects will be opened as text streams in universal newlines mode using the encoding returned by locale.getpreferredencoding(False).
This also normalizes end-of-line characters (as the name implies).
Here's a function that returns Git's output as a string, using universal_newlines:
import subprocess

def git_hash(commit_name='HEAD'):
    git_command = 'C:/Program Files/Git/bin/git'
    hash_string = subprocess.check_output(
        [git_command, 'rev-parse', commit_name],
        universal_newlines=True
    )
    return hash_string
And here is an example of writing that string to a file:
fname = 'C:/temp/git_hash.txt'
with open(fname, 'w') as f:
    f.write(git_hash())
This uses the with open(...): syntax that was suggested in comments, and also in The Python Tutorial. It's (unfortunately) well-hidden, appearing at the end of section 7.2.1. Methods of File Objects.

readline() Produces Unexpected String

I was getting some practice with dictionaries and file I/O today when a file gave me an unexpected output that I'm curious about. I wrote the following simple function that just takes the first line of a text file, breaks it into individual words, and puts each word into a dictionary:
def create_dict(file):
    dict = {}
    for i, item in enumerate(file.readline().split(' ')):
        dict[i] = item
    file.seek(0)
    return dict

print "Enter a file name:"
f = open(raw_input('-> '))
dict1 = create_dict(f)
print dict1
Simple enough, in every case it produces exactly the expected output. Every case except for one. I have one text file that was created by piping the output of another python script to a text file via the following shell command:
C:\> python script.py > textFile.txt
When I use textFile.txt with my dictionary script, I get an output that looks like:
{0: '\xff\xfeN\x00Y\x00', 1: '\x00S\x00t\x00a\x00t\x00e\x00', 2: '\x00h\x00a\x00s\x00:\x00', 3: '\x00', 4: '\x00N\x00e\x00w\x00', 5: '\x00Y\x00o\x00r\x00k\x00\r\x00\n'}
What is this output called? Why does piping the script's output to a text file via the command line produce a different kind of string than any other text file? Why are there no visible differences when I open the file in my text editor? I searched and searched, but I don't even know what this would be called, as I'm still pretty new.
Your file is UTF-16 encoded. The first two bytes, \xff and \xfe, are a byte order mark (BOM). Also, you will notice that each character appears to take two bytes, one of which is \x00.
You can use the codecs module to decode for you:
import codecs
f = codecs.open(raw_input('-> '), 'r', encoding='utf-16')
Or, if you are using Python 3 you can supply the encoding argument to open().
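In Python 3 the same program would look something like this sketch (raw_input is also renamed to input there):
# Python 3: open() accepts the encoding directly
f = open(input('-> '), encoding='utf-16')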
I guess the problem you've met is a character-encoding problem.
In Python 2 the default encoding is ASCII, so when you use the open() function to read the file, its contents are interpreted as ASCII. If the file was written in some other encoding, you need to decode the bytes to make them look 'normal'.
Commonly, text is written as UTF-8, so you can try decode(item, 'utf-8').
You can search for more information about character encodings (ASCII, UTF-8, Unicode) and how to convert between them.
Hope this helps.
>>> import codecs
>>> codecs.BOM_UTF16_LE
'\xff\xfe'
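That constant can be used to sniff the encoding up front. A sketch, reusing the question's filename prompt:
import codecs

filename = raw_input('-> ')
with open(filename, 'rb') as f:
    has_bom = f.read(2) == codecs.BOM_UTF16_LE

# 'utf-16' consumes the BOM automatically; fall back to a plain 8-bit read otherwise
f = codecs.open(filename, 'r', encoding='utf-16' if has_bom else 'ascii')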
To read utf-16 encoded file you could use io module:
import io
with io.open(filename, encoding='utf-16') as file:
    words = [word for line in file for word in line.split()]
The advantage compared to codecs.open() is that it supports the universal newline mode like the builtin open(), and io.open() is the builtin open() in Python 3.

Python failed to parse txt file but the file is confirmed to be 'txt' file

I have a piece of Python code that reads from a txt file properly. My colleague gave me another set of files that appear to be txt files as well, but when I run the same Python code on them, each line is read incorrectly.
For the new files, if the line is 240,022414114120,-500,Bauer_HS5,0
It is read as the str 2[]4[]0 []0[]2[]2[]4...; all those little rectangles between the characters, and the leading question-mark characters, are invalid characters.
And it will further get converted to something like this:
[['\xff\xfe2\x004\x000\x00', '\x000\x002\x002\x004\x001\x004\x001\x001\x004\x001\x002\x000\x00', '\x00-\x005\x000\x000\x00',......
However, if I manually create a normal text file and copy/paste the content from the input file, the parser reads each line correctly. So I suspect the input files are a different type from a normal text file, even though their extension is indeed 'txt'.
The files come from a device that regularly sends files to our server. This parser works fine for another device that does the same thing. And the files from both devices are all of type 'txt'.
Each line is read with for line in self._infile.xreadlines():
I am very confused why it would behave this way.
My python code is following.
def __init__(self, infile=sys.stdin, outfile=sys.stdout):
    if isinstance(infile, basestring):
        infile = open(infile)
    if isinstance(outfile, basestring):
        outfile = open(outfile, "w")
    self._infile = infile
    self._outfile = outfile

def sort(self):
    lines = []
    last_second = None
    for line in self._infile.xreadlines():
        line = line.replace('\r\n', '')
        fields = line.split(',')
        if len(fields) < 2:
            continue
        second = fields[1]
        if last_second and second != last_second:
            lines = sorted(lines, self._sort_lines)
            self._outfile.write("".join([','.join(x) for x in lines]))
            #self._outfile.write("\r\n")
            lines = []
        last_second = second
        lines.append(fields)
    if lines:
        lines = sorted(lines, self._sort_lines)
        self._outfile.write("".join([','.join(x) for x in lines]))
        #self._outfile.write("\r\n")
    self._infile.close()
    self._outfile.close()
The start of the file you described as coming from your colleague is "\xff\xfe". These two characters make up a "byte order mark" that indicates that the file is encoded with the "UTF-16-LE" encoding (that is, 16-bit Unicode with the lower byte first). Your Python script is reading with an 8-bit encoding (probably whatever your system's default encoding is), so you're seeing lots of extra null characters (the high bytes of the 16-bit characters).
I can't speak to how the file got a different encoding. Windows text editors (like notepad.exe) are somewhat notorious for silently reencoding files in unhelpful ways if you're not careful with them, so it may be that your colleague previewed the file in an editor and then saved it before forwarding it on to you.
Anyway, the simplest fix is probably to reencode the file. There are various utilities to do this on various OSs, or you could write your own easily enough. Here's a quick and dirty function to reencode a file in Python (which will hopefully raise an exception if the encoding parameters are wrong, but perhaps not always):
def reencode_file(filename, from_encoding="UTF-16-LE", to_encoding="ascii"):
    with open(filename, "rb") as f:
        in_bytes = f.read()                # read raw bytes
    text = in_bytes.decode(from_encoding)  # decode to unicode
    out_bytes = text.encode(to_encoding)   # reencode to the new encoding
    with open(filename, "wb") as f:
        f.write(out_bytes)                 # write back to the file
If the file you get is going to always be encoded in UTF-16, you could change your regular script to decode it automatically. In Python 2.7, I'd suggest using the io module's open function for this (it is the same code that the regular open uses in Python 3). Note however that the file object returned won't support the xreadlines method which has been deprecated for a long time (just iterate over the file directly instead).
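A minimal sketch of that change, assuming the device's files are always UTF-16 with a BOM (infile here stands for the filename passed to __init__):
import io

# io.open decodes transparently; iterate the file object directly,
# since io file objects have no xreadlines() method
with io.open(infile, encoding='utf-16') as f:
    for line in f:
        fields = line.rstrip('\r\n').split(',')
        # ... existing sorting logic ...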

Reading input files with ASCII 215 as delimiter

I am trying to read from a file which has word pairs delimited by ASCII value 215. When I run the following code:
f = open('file.i', 'r')
for line in f.read().split('×'):
    print line
I get a string that looks like garbage. Here is a sample of my input:
abashedness×N
abashment×N
abash×t
abasia×N
abasic×A
abasing×t
Abas×N
abatable×A
abatage×N
abated×V
abatement×N
abater×N
Abate×N
abate×Vti
abating×V
abatis×N
abatjours×p
abatjour×N
abator×N
abattage×N
abattoir×N
abaxial×A
and here is my output after the code above is run:
z?Nlner?N?NANus?A?hion?hk?hhn?he?hanoconiosis?N
My goal is to eventually read this into either a list of tuples or something of that nature, but I'm having trouble just getting the data to print.
Thanks for all help.
Well, two things:
Your source could be Unicode! Use an escape and be safe.
Read in binary mode.
with open("file.i", "rb") as f:
    for line in f.read().split(b"\xd7"):
        print(line)
The character is delimiting the word and the part of speech, but each word is still on its own line:
with open('file.i', 'rb') as handle:
    for line in handle:
        word, pos = line.strip().split('×')
        print word, pos
Your code was splitting the whole file, so you were ending up with words like N\nabatable, N\nAbate, Vti\nabating.
To interpret bytes from a file as text, you need to know its character encoding. There Ain't No Such Thing As Plain Text. You could use the codecs module to read the text:
import codecs
with codecs.open('file.i', 'r', encoding='utf-8') as file:
    for line in file:
        word, sep, suffix = line.partition(u'\u00d7')
        if sep:
            print word
Put the actual character encoding of the file in place of the utf-8 placeholder, e.g., cp1252.
Non-ASCII characters in string literals would require a source character encoding declaration at the top of the script, so I've used the Unicode escape u'\u00d7' instead.
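For comparison, the variant with a source encoding declaration would look something like this sketch (the declaration must appear on the first or second line of the script):
# -*- coding: utf-8 -*-
line = u'abash\xd7t'
word, sep, suffix = line.partition(u'×')  # the literal multiplication sign is now legal
print word  # prints: abash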
Thanks to both of your answers, I was able to hack together this bit of code that returns a list of lists holding what I'm looking for.
with open("mobyposi.i", "rb") as f:
    content = f.readlines()

content = content[0].split()
for item in content:
    item.split("\xd7")
It was indeed in unicode! However, the implementation described above discarded the text after the unicode value and before the newline.
EDIT: Able to reduce to:
with open("mobyposi.i", "rb") as f:
    for item in f.read().split():
        item.split("\xd7")
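For the stated goal of a list of tuples, the pieces above combine like this (a sketch, Python 2, assuming every whitespace-separated token is one word/part-of-speech pair):
pairs = []
with open("mobyposi.i", "rb") as f:
    for item in f.read().split():
        word, sep, pos = item.partition("\xd7")
        if sep:  # skip any token that lacks the delimiter
            pairs.append((word, pos))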
