I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type in Capit\xc3\xa1n into my favorite editor, in file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1l\n\n'
In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1l\n\n'
Note, however, that this can cause problems when mixing read() and readline().
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1. Those are 8 bytes and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]
Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding()
returns), but any text encoding supported by Python can be used.
See the codecs module for the list of supported encodings.
So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as utf8 (which is also now the default encoding of everything done in Python.)
So, I've found a solution for what I'm looking for, which is:
print open('f2').read().decode('string-escape').decode("utf-8")
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
This works for reading a file with UTF-8 encoding in Python 3.2:
import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
print(line)
# -*- encoding: utf-8 -*-
# converting a unknown formatting file in utf-8
import codecs
import commands
file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')
for l in file_stream:
file_output.write(l)
file_stream.close()
file_output.close()
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
fout.write(text)
with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
text2 = fin.read()
assert text == text2
To read in an Unicode string and then send to HTML, I did this:
fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')
Useful for python powered http servers.
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3, etc. in your file.
If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
<?xml encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:
0000000: 4361 7069 745c 7863 335c 7861 316e Capit\xc3\xa1n
codecs.open('f2','rb', 'utf-8'), for example, reads them all in a separate chars (expected) Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().
Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "à" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.
How you actually enter in UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here's how you do it in Windows. For OS X to enter a with an acute accent you can just hit option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
I was trying to parse iCal using Python 2.7.9:
from icalendar import Calendar
But I was getting:
Traceback (most recent call last):
File "ical.py", line 92, in parse
print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
and it was fixed with just:
print "{}".format(e[attr].encode("utf-8"))
(Now it can print liké á böss.)
I found the most simple approach by changing the default encoding of the whole script to be 'UTF-8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
any open, print or other statement will just use utf8.
Works at least for Python 2.7.9.
Thx goes to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).
So I have a 9000 line xml database, saved as a txt, which I want to load in python, so I can do some formatting and remove unnecessary tags (I only need some of the tags, but there is a lot of unnecessary information) to make it readable. However, I am getting a UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 608814: character maps to <undefined>, which I assume means that the program ran into a non-Unicode character. I am quite positive that these characters are not important to the program (the data I am looking for is all plain text, with no special symbols), so how can I remove all of these from the txt file, when I can't read the file without getting the UnicodeDecodeError?
One crude workaround is to decode the bytes from the file yourself and specify the error handling. EG:
for line in somefile:
uline = line.decode('ascii', errors='ignore')
That will turn the line into a Unicode object in which any non-ascii bytes have been dropped. This is not a generally recommended approach - ideally you'd want to process XML with a proper parser, or at least know your file's encoding and open it appropriately (the exact details depend on your Python version). But if you're entirely certain you only care about ascii characters this is a simple fallback.
The error suggests that you're using open() function without specifying an explicit character encoding. locale.getpreferredencoding(False) is used in this case (e.g., cp1252). The error says that it is not an appropriate encoding for the input.
An xml document may contain a declaration at the very begining that specifies the encoding used explicitly. Otherwise the encoding is defined by BOM or it is utf-8. If your copy-pasting and saving the file hasn't messed up the encoding and you don't see a line such as <?xml version="1.0" encoding="iso-8859-1" ?> then open the file using utf-8:
with open('input-xml-like.txt', encoding='utf-8', errors='ignore') as file:
...
If the input is an actual XML then just pass it to an XML parser instead:
import xml.etree.ElementTree as etree
tree = etree.parse('input.xml')
I use django and in my view i need to send a request as XML with some unicode character that received from html page with post method. I tried these (Note that i save that input in fname variable) :
xml = r"""my XML code with unicode {0} """.format(fname)
And
fname = u"%s".encode('utf8') % (fname)
xml = r"""my XML code with unicode {0} """.format(fname)
And
fname = fname.encode('ascii', 'ignore').decode('ascii')
xml = r"""my XML code with unicode {0} """.format(fname)
And every time i got this error:
'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
You could reproduce the error with this code:
>>> "{0}".format(u"\U0001F384"*4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
To fix this particular error, just use Unicode format string:
>>> u"{0}".format(u"\U0001F384"*4)
u'\U0001f384\U0001f384\U0001f384\U0001f384'
You could use xml.etree.ElementTree module to build your xml document instead of string formatting. xml is a complex format; it is easy to get it wrong. ElementTree will also serialize your Unicode string into bytes correctly making sure that the character encoding in the xml declaration is consistent with the actual encoding that is used in the document.
xml = r"""my XML code with unicode {0} """.format(fname)
The .format method always produces the same output string type as the input format string. In the case your format string is a byte string r"""...""" so if fname is a Unicode string Python tries to force it into being a byte string. If frame contains characters that do not exist in the default encoding (ASCII) then bang.
Note that this differs from the old string formatting operator %, which tries to promote to Unicode string when either the format string or any of the arguments used are Unicode, which would work in this case as long as the my XML code was ASCII-compatible. This is a common problem when you convert code that uses % to .format().
This should work fine:
xml = ur"""my XML code with unicode {0} """.format(fname)
However the output will be a Unicode string so whatever you do next needs to cope with that (for example if you are writing it to a byte stream/file, you would probably want to .encode('utf-8') the whole thing). Alternatively encode it in place to get a byte string:
xml = r"""my XML code with unicode {0} """.format(fname.encode('utf-8'))
Note that this above:
fname = u"%s".encode('utf8') % (fname)
does not work because you are encoding the format string to bytes, not the fname argument. This is identical to saying just fname = '%s' % fname, which is effectively fname = fname.
I Solved that with this code:
fname = fname.encode('ascii', 'xmlcharrefreplace')
This smells bad. For input hello ☃, you are now generating hello ☃ instead of the normal output hello ☃.
If both ☃ and ☃ look the same to you in the output then probably you are doing something like this:
xml = '<element>{0}</element>'.format(some_text)
which is broken for XML-special characters like & and <. When you are generating XML you should take care to escape special characters (&<>"', to &, < etc), otherwise at best your output will break for these characters; at worst, when some_text includes user input you have an XML-injection vulnerability which may break the logic of your system in a security-senesitive way.
As J F Sebastian said (+1), it's a good idea to use existing known-good XML serialisation libraries like etree instead of trying to roll your own.
You could do something like:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
but its an ugly hack-ish that only works in python 2.7.
or
fname.encode('GB18030').decode('utf-8')
it will bypass the error but may still look messy. If you're posting to an html file then make the charset utf-8
I Solved that with this code:
fname = fname.encode('ascii', 'xmlcharrefreplace')
xml = r"""my XML code with unicode {0} """.format(fname)
Thank you for your help.
Update :
And you can remove or replace special characters like > & < with this (Thanks to #bobince for notice this) :
fname = fname.replace("<", "")
fname = fname.replace(">", "")
fname = fname.replace("&", "")
I was simply trying to import a Chinese txt file and print out the content.
Here is the content of my txt file which i copy from the web,which is in simplified chinese :http://stock.hexun.com/2013-06-01/154742801.html
At first, i tried this out:
userinput = raw_input('Enter the name of a file')
f=open(userinput,'r')
print f.read()
f.close()
It can open the file and print but what is show is garbled.
Then i tried the following one with encoding:
#coding=UTF-8
userinput = raw_input('Enter the name of a file')
import codecs
f= codecs.open(userinput,"r","UTF-8")
str1=f.read()
print str1
f.close()
However, it show me an error message.
UnicodeEncodeError: 'cp950 codec cant't encode character u'\u76d8' in position 50:illegal mutibyte sequence.
Why is that error happened? How to solve it?
I have tried other unicode like Big5,cp950... but it still not work.
It is the terminal system you are using to display the character. Using IDLE on Windows 7 and it works fine:
>>> val = u'\u76d8'
>>> print val
盘
but if I use cmd.exe then I get your error.
Use a terminal display method that supports unicode encoding.
Python (at least before Python 3.0) knows two kinds of string: ① a byte array and ② a character array.
Characters as in ② are Unicode, the type of these kind of strings is also called unicode.
The bytes in ① (type named str in Python) can be a printable string or something else (binary data). If it's a printable string, it also can be an encoded version (e. g. UTF-8, latin-1 or similar) of a string of Unicode characters. Then several bytes can represent a single character.
In your usecase I'd propose to read the file as a list of bytes:
with open('filename.txt') as inputFile:
bytes = inputFile.read()
Then convert that byte array to a decent Unicode string by decoding it from the encoding used in the file (you will have to find that out!):
unicodeText = bytes.decode('utf-8')
Then print it:
print unicodeText
The last step depends on the capabilities of your output device (xterm, …). It may be capable of displaying Unicode characters, then everything is fine and the characters get displayed properly. But it might be incapable of Unicode, or, more likely, Python is just not well-informed about the capabilities, then you will get an error message. This also will happen if you redirect your output into a file or pipe it into a second process.
To prevent this trouble, you can convert the Unicode string into a byte-array again, choosing an encoding of your choice:
print unicodeText.encode('utf-8')
This way you will only print bytes which every terminal, output file and second process (when piping) can handle.
If input and output encoding are the same, then of course, you won't have to decode and encode anything. But since you have some trouble, most likely the encodings differ, so you will have to do these two steps.
Code page 936 is the only one that has character 0x76D8 (which encodes to 0xC5CC). You need to use gbk or cp936
with open('chinese.txt','r+b') as inputFile:
bytes = inputFile.read()
print(bytes.decode('utf8'))
JUst TRY:
f=open(userinput,'r')
print f.read().decode('gb18030').encode('u8')
I'm reading in a CSV file that has UTF8 encoding:
ifile = open(fname, "r")
for row in csv.reader(ifile):
name = row[0]
print repr(row[0])
This works fine, and prints out what I expect it to print out; a UTF8 encoded str:
> '\xc3\x81lvaro Salazar'
> '\xc3\x89lodie Yung'
...
Furthermore when I simply print the str (as opposed to repr()) the output displays ok (which I don't understand eitherway - shouldn't this cause an error?):
> Álvaro Salazar
> Élodie Yung
but when I try to convert my UTF8 encoded strs to unicode:
ifile = open(fname, "r")
for row in csv.reader(ifile):
name = row[0]
print unicode(name, 'utf-8') # or name.decode('utf-8')
I get the infamous:
Traceback (most recent call last):
File "scripts/script.py", line 33, in <module>
print unicode(fullname, 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)
So I looked at the unicode strings that are created:
ifile = open(fname, "r")
for row in csv.reader(ifile):
name = row[0]
unicode_name = unicode(name, 'utf-8')
print repr(unicode_name)
and the output is
> u'\xc1lvaro Salazar'
> u'\xc9lodie Yung'
So now I'm totally confused as these seem to be mangled hex values. I've read this question:
Reading a UTF8 CSV file with Python
and it appears I am doing everything correctly, leading me to believe that my file is not actually UTF8, but when I initially print out the repr values of the cells, they appear to to correct UTF8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down (as I'm starting to get lost in the jungle of encodings)
As an aside, I believe I could use codecs to open the file and read it directly into unicode objects, but the csv module doesn't support unicode natively so I can use this approach.
Your default encoding is ASCII. When you try to print a unicode object, the interpreter therefore tries to encode it using the ASCII codec, which fails because your text includes characters that don't exist in ASCII.
The reason that printing the UTF-8 encoded bytestring doesn't produce an error (which seems to confuse you, although it shouldn't) is that this simply sends the bytes to your terminal. It will never produce a Python error, although it may produce ugly output if your terminal doesn't know what to do with the bytes.
To print a unicode, use print some_unicode.encode('utf-8'). (Or whatever encoding your terminal is actually using).
As for the u'\xc1lvaro Salazar', nothing here is mangled. The character Á is at the unicode codepoint C1 (which has nothing to do with it's UTF-8 representation, but happens to be the same value as in Latin-1), and Python uses \x hex escapes instead of \u unicode codepoint notation for codepoints that would have 00 as the most significant byte to save space (it could also have displayed this as \u00c1.)
To get a good overview of how Unicode works in Python, I suggest http://nedbatchelder.com/text/unipain.html