I was getting some practice with dictionaries and file I/O today when a file gave me unexpected output that I'm curious about. I wrote the following simple function that takes the first line of a text file, breaks it into individual words, and puts each word into a dictionary:
def create_dict(file):
    d = {}  # a plain dict; avoids shadowing the built-in name 'dict'
    for i, item in enumerate(file.readline().split(' ')):
        d[i] = item
    file.seek(0)
    return d
print "Enter a file name:"
f = open(raw_input('-> '))
dict1 = create_dict(f)
print dict1
Simple enough, in every case it produces exactly the expected output. Every case except for one. I have one text file that was created by piping the output of another python script to a text file via the following shell command:
C:\> python script.py > textFile.txt
When I use textFile.txt with my dictionary script, I get an output that looks like:
{0: '\xff\xfeN\x00Y\x00', 1: '\x00S\x00t\x00a\x00t\x00e\x00', 2: '\x00h\x00a\x00s\x00:\x00', 3: '\x00', 4: '\x00N\x00e\x00w\x00', 5: '\x00Y\x00o\x00r\x00k\x00\r\x00\n'}
What is this output called? Why does piping the script's output to a text file on the command line produce a different kind of string than any other text file? And why are there no visible differences when I open the file in my text editor? I searched and searched, but since I'm still pretty new I don't even know what this would be called.
Your file is UTF-16 encoded. The first two bytes, \xff\xfe, are a byte order mark (BOM). You will also notice that each character appears to take two bytes, one of which is \x00.
You can use the codecs module to decode for you:
import codecs
f = codecs.open(raw_input('-> '), 'r', encoding='utf-16')
Or, if you are using Python 3 you can supply the encoding argument to open().
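For instance, a Python 3 round trip looks like this (a sketch; the file name and contents are illustrative):

```python
import os
import tempfile

# Sketch: write a UTF-16 file, then read it back with the matching encoding.
# The path and text are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), 'textFile.txt')

with open(path, 'w', encoding='utf-16') as f:
    f.write('NY State has: New York\n')

# The utf-16 codec consumes the BOM and decodes the 2-byte code units,
# so you get ordinary text instead of '\xff\xfeN\x00Y\x00...'.
with open(path, encoding='utf-16') as f:
    words = dict(enumerate(f.readline().split(' ')))

print(words)  # {0: 'NY', 1: 'State', 2: 'has:', 3: 'New', 4: 'York\n'}
```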
I guess the problem you have met is a character-encoding problem.
In Python 2 the default encoding is ASCII, so when you use the open() function to read the file, the bytes are interpreted as ASCII.
But the output was written in a different encoding, so you need to decode the bytes to see them display normally.
Since UTF-8 is a common system encoding, you can try decode(item, 'utf-8').
You can search for more information about character encodings (ASCII, UTF-8, Unicode) and how to convert between them.
Hope this helps.
>>> import codecs
>>> codecs.BOM_UTF16_LE
'\xff\xfe'
To read utf-16 encoded file you could use io module:
import io
with io.open(filename, encoding='utf-16') as file:
    words = [word for line in file for word in line.split()]
The advantage compared to codecs.open() is that it supports the universal newline mode like the builtin open(), and io.open() is the builtin open() in Python 3.
Related
A friend of mine has written simple poetry using C's fprintf function. It was written using the 'wb' option so the generated file is in binary. I'd like to use Python to show the poetry in plain text.
What I'm currently getting are lots of strings like this: ��������
The code I am using:
with open("read-me-if-you-can.bin", "rb") as f:
    print f.read()
f.close()
The thing is, when dealing with text written to a file, you have to know (or correctly guess) the character encoding used when writing said file. If the program reading the file is assuming the wrong encoding here, you will end up with strange characters in the text if you're lucky and with utter garbage if you're unlucky.
Don't try to guess, try to know: you need to ask your friend in what character encoding he or she wrote the poetry text to the file. You then have to open the file in Python specifying that character encoding. Let's say his/her answer is "UTF-16-LE" (for sake of example), you then write:
with open("poetry.bin", encoding="utf-16-le") as f:
    print(f.read())
It seems you're on Python 2 still though, so there you write:
import io
with io.open("poetry.bin", encoding="utf-16-le") as f:
    print f.read()
You could start by trying UTF-8 first though, that is an often used encoding.
I am trying to write text to an output file in a way that explicitly shows all of the newline characters (\n, \r, \r\n). I am using Python 3 and Windows 7. My thought was to do this by converting the strings that I am writing into bytes.
My code looks like this:
file_object = open(r'C:\Users\me\output.txt', 'wb')
for line in lines:
    line = bytes(line, 'UTF-8')
    print('Line: ', line)  # for debugging
    file_object.write(line)
file_object.close()
The print() statement to standard output (my Windows terminal) is as I want it to be. For example, one line looks like so, with the \n character visible.
Line: b'<p class="byline">Foo C. Bar</p>\n'
However, the write() method does not explicitly print any of the newline characters in my output.txt file. Why does write() not explicitly show the newline characters in my output text file, even though I'm writing in bytes mode, but print does explicitly show the newline characters in the windows terminal?
What Python 3 does when writing strings or bytes to text or binary files:
- str to a text file: written directly.
- bytes to a text file: print() writes the repr (calling write() with bytes raises a TypeError).
- str to a binary file: raises a TypeError.
- bytes to a binary file: written directly.
You say that you get what you’re looking for when you write a bytes to standard out (a text file). That, with the pseudo-table above, suggests you might look into using repr. Specifically, if you’re looking for the output b'<p class="byline">Foo C. Bar</p>\n', you’re looking for the repr of a bytes object. If line was a str to start with and you don’t actually need that b at the beginning, you might instead be looking for the repr of the string, '<p class="byline">Foo C. Bar</p>\n'. If so, you could write it like this:
with open(r'C:\Users\me\output.txt', 'w') as file_object:
    for line in lines:
        file_object.write(repr(line) + '\n')
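A quick sketch of the effect (the path is made up and lines is sample data): repr() turns each newline into the two visible characters, a backslash and an n, in the output file.

```python
import os
import tempfile

# Sketch: writing repr(line) makes escape sequences visible in the file.
lines = ['<p class="byline">Foo C. Bar</p>\n', 'second line\r\n']
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

with open(path, 'w') as file_object:
    for line in lines:
        file_object.write(repr(line) + '\n')

# Reading the file back shows the literal backslash sequences.
with open(path) as f:
    contents = f.read()

print(contents)
```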
I have a python program that requires the user to paste texts into it to process them to the various tasks. Like this:
line=(input("Paste text here: ")).lower()
The pasted text comes from a .txt file. To avoid any issues with the code (since the text contains multiple quotation marks), the user has to do the following: type three quotation marks, paste the text, and type three quotation marks again.
Can all of the above be avoided by having python read the .txt? and if so, how?
Please let me know if the question makes sense.
In Python2, just use raw_input to receive input as a string. No extra quotation marks on the part of the user are necessary.
line=(raw_input("Paste text here: ")).lower()
Note that input is equivalent to
eval(raw_input(prompt))
and applying eval to user input is dangerous, since it allows the user to evaluate arbitrary Python expressions. A malicious user could delete files or even run arbitrary functions so never use input in Python2!
In Python3, input behaves like raw_input, so there your code would have been fine.
If instead you'd like the user to type the name of the file, then
filename = raw_input("Text filename: ")
with open(filename, 'r') as f:
    line = f.read()
Troubleshooting:
Ah, you are using Python3 I see. When you open a file in r mode, Python tries to decode the bytes in the file into a str. If no encoding is specified, it uses locale.getpreferredencoding(False) as the default encoding. Apparently that is not the right encoding for your file. If you know what encoding your file is using, it is best to supply it with the encoding parameter:
open(filename, 'r', encoding=...)
Alternatively, a hackish approach which is not nearly as satisfying is to ignore decoding errors:
open(filename, 'r', errors='ignore')
A third option would be to read the file as bytes:
open(filename, 'rb')
Of course, this has the obvious drawback that you'd then be dealing with bytes like \x9d rather than characters like ·.
Finally, if you'd like some help guessing the right encoding for your file, run
with open(filename, 'rb') as f:
    contents = f.read()
print(repr(contents))
and post the output.
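If you want to automate that first look, one simple heuristic (a sketch, not a general-purpose detector; it only recognizes files that begin with a BOM) is to compare the leading bytes against the BOM constants in the codecs module:

```python
import codecs

def sniff_bom(raw, default='utf-8'):
    """Guess an encoding from a leading byte order mark, if any."""
    # Check the 3-byte UTF-8 BOM first so it can't be mistaken for another.
    candidates = [
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ]
    for bom, name in candidates:
        if raw.startswith(bom):
            return name
    return default  # BOM-less files fall through to the default guess

print(sniff_bom(b'\xff\xfeN\x00Y\x00'))  # utf-16
print(sniff_bom(b'plain ascii text'))    # utf-8
```

A BOM-less UTF-8 or legacy 8-bit file still needs human judgment (or a library such as chardet), since nothing in the leading bytes identifies it.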
You can use the following:
with open("file.txt") as fl:
    file_contents = [x.rstrip() for x in fl]
This will result in the variable file_contents being a list, where each element of the list is a line of your file with the newline character stripped off the end.
If you want to iterate over each line of the file, you can do this:
with open("file.txt") as fl:
    for line in fl:
        # Do something
The rstrip() method gets rid of whitespace at the end of a string, and it is useful for getting rid of the newline character.
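One detail worth knowing: with no argument, rstrip() removes all trailing whitespace, not just the newline; pass '\n' explicitly if trailing spaces or tabs should survive. A tiny illustration:

```python
# rstrip() with no argument strips every kind of trailing whitespace;
# rstrip('\n') strips only the newline.
line = 'trailing spaces   \n'
print(repr(line.rstrip()))      # 'trailing spaces'
print(repr(line.rstrip('\n')))  # 'trailing spaces   '
```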
I am trying to make a simple translator that from a dictionary in a shelve module I can type words in English and the program translates the input word by word and then puts the results into a .txt file. This is pretty much what I have so far.
import shelve

s = shelve.open("THAI.dat")
entry = input("English word")
define = input("Thai word")
s[entry] = define

text_file = open("THAI.txt", "w+")

trys = input("Input english word")
if trys in s:
    print(s[trys])
    part = s[trys]
    text_file.write(part)
This is where the error appears. I think the problem is that part is a list, and it should be a string to be written to a .txt file. What should I do? I am just a beginner, so I am probably missing something basic. This is the error:
Traceback (most recent call last):
File "C:\Users\Austen\Desktop\phython fun\thai translator.py", line 29, in <module>
text_file.write(part)
TypeError: must be str, not list
>>>
at the end I would like to be able to do this
text_file.readlines()
and then be able to even go into the text file and see the translation.
From your comments, besides replacing s[entry] = [define] with s[entry] = define, I think you need to read and write the Thai file using the right codec.
Assuming the file thai.dat was written with UTF-8 (an assumption) you now need to compare the strings using the same codec and the write your data file with the same codec.
As a start, try this line from your command shell:
python -c 'import sys; print sys.getdefaultencoding()'
If it prints ascii then you may need to set your default encoding to UTF-8 or the string comparisons will not work properly.
Also, you need open the output file in UTF-8 mode like so:
>>> import codecs
>>> f = codecs.open("THAI.txt", "w+", "utf-8")
Then write to this file as usual.
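In Python 3 (which the question appears to be using, given input() and print()), the built-in open() accepts an encoding argument directly, so codecs.open() isn't needed. A sketch with illustrative data (the path and the Thai word are made up):

```python
import os
import tempfile

# Sketch: write and read Thai text with an explicit UTF-8 encoding.
path = os.path.join(tempfile.mkdtemp(), 'THAI.txt')
thai = 'สวัสดี'  # a Thai greeting -- illustrative data only

with open(path, 'w', encoding='utf-8') as f:
    f.write(thai + '\n')

# Reading back with the same encoding round-trips the text intact.
with open(path, encoding='utf-8') as f:
    lines = f.readlines()

print(lines)
```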
I have a huge gzipped text file which I need to read, line by line. I go with the following:
import codecs
import gzip

for i, line in enumerate(codecs.getreader('utf-8')(gzip.open('file.gz'))):
    print i, line
At some point late in the file, the python output diverges from the file. This is because lines are getting broken due to weird special characters that python thinks are newlines. When I open the file in 'vim', they are correct, but the suspect characters are formatted weirdly. Is there something I can do to fix this?
I've tried other codecs including utf-16, latin-1. I've also tried with no codec.
I looked at the file using 'od'. Sure enough, there are \n characters where they shouldn't be. But the "wrong" ones are preceded by a weird character. I think there's some encoding here with some characters being 2 bytes, the trailing byte being \n if not viewed properly.
According to 'od -h file' the offending character is '1d1c'.
If I replace:
gzip.open('file.gz')
With:
os.popen('zcat file.gz')
It works fine (and actually, quite faster). But, I'd like to know where I'm going wrong.
Try again with no codec. The following reproduces your problem when using codec, and the absence of the problem without it:
import gzip
import os
import codecs
data = gzip.open("file.gz", "wb")
data.write('foo\x1d\x1cbar\nbaz')
data.close()
print list(codecs.getreader('utf-8')(gzip.open('file.gz')))
print list(os.popen('zcat file.gz'))
print list(gzip.open('file.gz'))
Outputs:
[u'foo\x1d', u'\x1c', u'bar\n', u'baz']
['foo\x1d\x1cbar\n', 'baz']
['foo\x1d\x1cbar\n', 'baz']
I asked (in a comment) """Show us the output from print repr(weird_special_characters). When you open the file in vim, WHAT are correct? Please be more precise than "formatted weirdly".""" But nothing :-(
What file are you looking at with od? file.gz?? If you can see anything recognisable in there, it's not a gzip file! You're not seeing newlines, you're seeing binary bytes that contain 0x0A.
If the original file was utf-8 encoded, what was the point of trying it with other codecs?
Does "works OK with zcat" mean that you got recognisable data without a utf8 decode step??
I suggest that you simplify your code, and do it a step at a time ... see for example the accepted answer to this question. Try it again and please show the exact code that you ran, and use repr() when describing the results.
Update It looks like DS has guessed what you were trying to explain about the \x1c and \x1d.
Here are some notes on WHY it happens like that:
For byte strings, only \r and \n are treated as line breaks:
>>> import pprint
>>> text = ''.join('A' + chr(i) for i in range(32)) + 'BBB'
>>> print repr(text)
'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
['A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
'A\x0bA\x0cA\r', # line break
'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB']
>>>
However for Unicode strings, the characters \x1c (FILE SEPARATOR), \x1d (GROUP SEPARATOR), and \x1e (RECORD SEPARATOR) also qualify as line endings:
>>> text = u''.join('A' + unichr(i) for i in range(32)) + u'BBB'
>>> print repr(text)
u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
[u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
u'A\x0bA\x0cA\r', # line break
u'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1c', # line break
u'A\x1d', # line break
u'A\x1e', # line break
u'A\x1fBBB']
>>>
This will happen whatever codec you use. You still need to work out what (if any) codec you need to use. You also need to work out whether the original file was really a text file and not a binary file. If it's a text file, you need to consider the meaning of the \x1c and \x1d in the file.
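The same distinction exists in Python 3, where it falls between bytes and str rather than between str and unicode; a minimal check, reusing the sample data from above:

```python
# bytes.splitlines() breaks only on \r and \n, while str.splitlines()
# also treats \x1c, \x1d, \x1e (and a few others, e.g. \x85, \u2028)
# as line boundaries.
raw = b'foo\x1d\x1cbar\nbaz'
print(raw.splitlines())                  # [b'foo\x1d\x1cbar', b'baz']
print(raw.decode('ascii').splitlines())  # ['foo', '', 'bar', 'baz']
```

Note that iterating over a Python 3 text-mode file splits only on real newlines (the io layer recognizes just \n, \r, and \r\n), whereas the old codecs stream reader used splitlines() semantics internally, which is exactly the mismatch this question ran into.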