I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
>>> ss = u'Capit\xe1n'
>>> ss8 = ss.encode('utf8')
>>> repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
>>> print ss, ss8
>>> print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type Capit\xc3\xa1n into my favorite editor, in file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
What I'm truly failing to grok here is what the point of the UTF-8 representation is if you can't actually get Python to recognize it when it comes from outside. Maybe I should just JSON-dump the string and use that instead, since that has an ASCII-able representation! More to the point: is there an ASCII representation of this Unicode object that Python will recognize and decode when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read() returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1n\n'
In Python 3.x, io.open is an alias for the built-in open function; the built-in open supports the encoding argument in 3.x (it does not in 2.x).
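For completeness, here is a minimal sketch of the matching write side with io.open (the file name test and the string from the question are assumptions for illustration):
import io
# Write a unicode string; io.open encodes it to UTF-8 on the way out.
with io.open("test", mode="w", encoding="utf-8") as f:
    f.write(u'Capit\xe1n\n')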
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1n\n'
Note, however, that this can cause problems when mixing read() and readline().
In the notation u'Capit\xe1n\n' (which would be just 'Capit\xe1n\n' in 3.x, and must be written without the u prefix in 3.0 through 3.2), the \xe1 represents just one character. \x is an escape sequence indicating that e1 is a hexadecimal character code.
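A quick interpreter check makes this concrete (assuming a console that accepts the literal á):
>>> len(u'Capit\xe1n')   # \xe1 is one character, not four
7
>>> u'\xe1' == u'á'
True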
Writing Capit\xc3\xa1n into the file in a text editor means that the file actually contains the literal characters \xc3\xa1 (backslash, x, c, 3, and so on). Those are 8 bytes, and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
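For example, a minimal Python 2 sketch of that two-step decode (the literal below mirrors what f2 actually contains):
# Python 2.x
raw = 'Capit\\xc3\\xa1n\n'                # literal backslash escapes, as read from f2
utf8_bytes = raw.decode('string_escape')  # now 'Capit\xc3\xa1n\n' (real UTF-8 bytes)
text = utf8_bytes.decode('utf-8')         # now u'Capit\xe1n\n'
print text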
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Now all you need in Python 3 is open(filename, 'r', encoding='utf-8').
[Edit on 2016-02-10 for requested clarification]
Python 3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding()
returns), but any text encoding supported by Python can be used.
See the codecs module for the list of supported encodings.
So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8. (UTF-8 is also the default source encoding in Python 3, but as the documentation quoted above notes, the default encoding for open() itself remains platform dependent.)
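A minimal sketch of that in Python 3 (the file name f1 is taken from the question):
# Python 3.x
with open('f1', 'w', encoding='utf-8') as f:
    f.write('Capitán\n')
with open('f1', encoding='utf-8') as f:
    print(f.read())   # Capitán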
So, I've found a solution for what I'm looking for, which is:
print open('f2').read().decode('string-escape').decode("utf-8")
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
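Here is a sketch of that round trip in one place (Python 2; the file name f4 is made up for the example):
# Python 2.x
ss = u'Capit\xe1n'
# Write an ASCII-only, escaped representation of the UTF-8 bytes...
with open('f4', 'w') as f:
    f.write(ss.encode('utf-8').encode('string-escape') + '\n')
# ...then read it back, undo the escapes, and decode the UTF-8 bytes.
with open('f4') as f:
    restored = f.read().decode('string-escape').decode('utf-8')
assert restored == ss + u'\n'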
This works for reading a file with UTF-8 encoding in Python 3.2:
import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)
# -*- encoding: utf-8 -*-
# converting a file of unknown encoding to UTF-8
import codecs
import commands
file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')
for l in file_stream:
    file_output.write(l)
file_stream.close()
file_output.close()
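Note that the commands module is Python 2 only; on Python 3, a rough equivalent (assuming the external file utility is available, e.g. on Linux or macOS) might look like this:
# Python 3 sketch using subprocess instead of the removed commands module
import subprocess
file_location = "jumper.sub"
file_encoding = subprocess.check_output(
    ['file', '-b', '--mime-encoding', file_location]).decode().strip()
with open(file_location, 'r', encoding=file_encoding) as src, \
     open(file_location + "b", 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)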
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)
with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()
assert text == text2
To read in a Unicode string and then send it to HTML, I did this:
fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')
Useful for Python-powered HTTP servers.
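For instance (Python 2; the byte string below is just the UTF-8 data from the question):
# xmlcharrefreplace turns non-ASCII characters into numeric character
# references that any browser can render.
fileline = 'Capit\xc3\xa1n'
print fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')   # Capit&#225;n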
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be byte escape sequences; it interprets them as literal text. That's why you get the double backslashes in the last line -- the file now really contains a backslash followed by xc3, and so on.
If you want to read and write encoded files in Python, it is best to use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
CapitÃ¡n
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
<?xml version="1.0" encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
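A minimal codecs.open sketch along those lines (Python 2; f1 is assumed to hold the UTF-8 bytes from the question):
import codecs
f = codecs.open('f1', 'r', encoding='utf-8')
text = f.read()            # a unicode object, e.g. u'Capit\xe1n\n'
f.close()
f = codecs.open('f1_copy', 'w', encoding='utf-8')
f.write(text)              # encoded back to UTF-8 bytes on the way out
f.close()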
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here. The file f2 contains, in hex:
0000000: 4361 7069 745c 7863 335c 7861 316e Capit\xc3\xa1n
codecs.open('f2', 'rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
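To make the escape notation concrete (Python 2 interpreter; the output assumes a terminal that can display the character):
>>> u'\xe1' == u'\u00e1'    # \x.. and \u.... name the same code point
True
>>> print u'Capit\u00e1n'
Capitán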
Your solution using decode('string-escape') does work, but you must be aware of how much memory you use: three times the amount of codecs.open().
Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no way to know, you must tell it by specifying the encoding that was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.
How you actually enter UTF-8-encoded non-ASCII text depends on your OS and/or your editor. Here's how you do it in Windows. For OS X, to enter a with an acute accent you can just hit Option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is that you don't need to change any old code. It's transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
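A usage sketch, assuming the f1 file from the question is UTF-8:
# After the replacement, existing calls keep working, but they now return
# unicode objects decoded from UTF-8.
f = open('f1', 'r')     # really codecs.open('f1', 'r', encoding='utf-8')
print f.read()          # f.read() returns u'Capit\xe1n\n', already decoded
f.close()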
I was trying to parse iCal using Python 2.7.9:
from icalendar import Calendar
But I was getting:
Traceback (most recent call last):
File "ical.py", line 92, in parse
print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
and it was fixed with just:
print "{}".format(e[attr].encode("utf-8"))
(Now it can print liké á böss.)
I found the simplest approach was to change the default encoding of the whole script to 'UTF-8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Any open, print, or other statement will then just use UTF-8.
Works at least for Python 2.7.9.
Thanks go to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).
So I have a large CSV of French verbs that I am using to make a program. In the CSV, verbs with accented characters contain codes instead of the actual accents:
être shows up as Ãªtre, for example (at least when I open the file in Excel)
Here is the csv:
https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv
In Chrome and Firefox at least, the codes are converted to the correct accents. I was wondering, once the string is imported in Python into a given variable, i.e.
...
for row in reader:
    inf_lst.append(row[0])
verb = inf_lst[2338]
#(verb = être)
whether there is a straightforward/built-in method for printing it out with the correct Unicode to give "être"?
I am aware that you could do this by replacing the Ãª's with ê's in each string, but since this would have to be done for each different possible accent, I was wondering if there was an easier way.
Thanks,
You can make a string a Unicode string by prefixing it with u.
>>> foo = u'être'
>>> print foo
être
It all comes down to the character encoding of the data. It's possible that it is UTF-8 encoded and you are viewing it in a Windows tool that is using your local code page, which gives a different display for the stream. How to read/write files is covered in the csv doc examples.
You've given us a zipped, UTF-8-encoded web page, and the requests module is good at handling that sort of thing. So, you could read the csv with:
>>> import requests
>>> import csv
>>> resp=requests.get("https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv",
... stream=True)
>>> try:
... inf_lst = list(csv.reader(resp.iter_lines(decode_unicode=True)))
... finally:
... del resp
...
>>> len(inf_lst)
5362
You have a UTF-8-encoded file. Excel likes that encoding to start with a byte order mark character (U+FEFF) or it assumes the default ANSI encoding for your version of Windows instead. To get UTF-8 with BOM, use a tool like Notepad++. Open the file in Notepad++. On the Encoding menu, select "Encode in UTF-8-BOM" and save. Now it will display correctly in Excel.
To write a file that Excel can open, use the encoding utf-8-sig and write Unicode strings:
import io
with io.open('out.csv','w',encoding='utf-8-sig') as f:
    f.write(u'être')
I got some strings from the database which look like '\xe7\x8e\xa9'.
I think it's utf-8. I can print them out by using:
print '\xe7\x8e\xa9'
玩
The thing is, I need to write them into another file as Chinese characters (e.g. 玩) together with other alphanumeric data.
I tried encode and decode, but I didn't get the results I was hoping for.
Here are my attempts:
f = open('a','w')
name = u.name #.encode('utf8') # I commented it to get raw
f.write('\t$$%r$$many_other_data' % name)
f.close()
When I open the output file with Vim 7.4, I see:
`$$u'\u7aef\u5e84\u7684\u9a6c\u6b47\u5c14$$many_other_data'`
Here is a code sample that works for me:
with open('foo', 'w+') as f:
    f.write('\xe7\x8e\xa9')
and in the foo file I have:
玩
But I've opened foo with UTF-8 encoding, so it displays the Chinese character instead of the Unicode value.
I've tested it with both Vim and gedit, and it works just fine.
Perhaps you should provide the type of your output file, so we can be more specific.
EDIT
I see the problem now. You used the %r flag when writing your string. You should use %s (and enable the encoding again).
Here is a working example:
>>> a = u'\u7aef\u5e84\u7684\u9a6c\u6b47\u5c14'
>>> f = open('tmp', 'w')
>>> a = a.encode('utf-8')
>>> f.write('\t$$%r$$other_data\n'%a)
>>> f.write('\t$$%s$$other_data\n'%a)
>>> f.close()
results being:
$$'\xe7\xab\xaf\xe5\xba\x84\xe7\x9a\x84\xe9\xa9\xac\xe6\xad\x87\xe5\xb0\x94'$$other_data
$$端庄的马歇尔$$other_data
Please read this answer for reference on the difference between %r and %s.
Hope that helped.
Files are bytes. You can't store characters in them.
A particularly common encoding is ASCII. It's an encoding just like all those different unicode ones.
The bytes are meaningless (as text) on their own without an associated encoding to give them meaning.
You'll need to view the file with an editor or viewer that is using the same encoding that you used to write the file.
Since you have bytes, you need to know their encoding. There are multiple ways to turn bytes into unicode (str.decode), and which one is right depends on what encoding the bytes are in.
You can't get this from the bytes themselves, someone has to tell you the encoding.
Although, sometimes you can make an educated guess:
>>> import chardet
>>> s = '\xe7\x8e\xa9'
>>> chardet.detect(s)
{'confidence': 0.505, 'encoding': 'utf-8'}
>>> s.decode(chardet.detect(s)['encoding'])
u'\u73a9'
>>> print _
玩
Now, you should convert any strings from the database to unicode as soon as they enter your Python program, so that your code works entirely in unicode, not bytes.
Then, you can write your file like this:
import io
with io.open('/tmp/myfile.txt', 'w', encoding='utf-8') as f:
    f.write(u'\u73a9')
    f.write(u'\n')
    f.write(u'random other data 12345...')
I was simply trying to import a Chinese txt file and print out the content.
Here is the content of my txt file, which I copied from the web and which is in Simplified Chinese: http://stock.hexun.com/2013-06-01/154742801.html
At first, I tried this:
userinput = raw_input('Enter the name of a file')
f=open(userinput,'r')
print f.read()
f.close()
It can open the file and print it, but what is shown is garbled.
Then I tried the following, with encoding:
#coding=UTF-8
userinput = raw_input('Enter the name of a file')
import codecs
f= codecs.open(userinput,"r","UTF-8")
str1=f.read()
print str1
f.close()
However, it shows me an error message:
UnicodeEncodeError: 'cp950' codec can't encode character u'\u76d8' in position 50: illegal multibyte sequence
Why did that error happen? How do I solve it?
I have tried other encodings like Big5, cp950, ... but it still does not work.
It is the terminal you are using to display the character that matters. Using IDLE on Windows 7, it works fine:
>>> val = u'\u76d8'
>>> print val
盘
but if I use cmd.exe then I get your error.
Use a terminal display method that supports unicode encoding.
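One workaround sketch if you must stay in cmd.exe (assuming you can tolerate replacement characters for anything the console encoding lacks):
# Python 2.x: fall back to an explicit encode with 'replace' so that
# printing never raises UnicodeEncodeError on a limited console.
import sys
val = u'\u76d8'
encoding = sys.stdout.encoding or 'ascii'
print val.encode(encoding, 'replace')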
Python (at least before Python 3.0) knows two kinds of string: ① a byte array and ② a character array.
Characters as in ② are Unicode; the type of this kind of string is also called unicode.
The bytes in ① (type named str in Python) can be a printable string or something else (binary data). If it's a printable string, it also can be an encoded version (e. g. UTF-8, latin-1 or similar) of a string of Unicode characters. Then several bytes can represent a single character.
In your use case I'd propose reading the file as a byte string:
with open('filename.txt') as inputFile:
    bytes = inputFile.read()
Then convert that byte array to a decent Unicode string by decoding it from the encoding used in the file (you will have to find that out!):
unicodeText = bytes.decode('utf-8')
Then print it:
print unicodeText
The last step depends on the capabilities of your output device (xterm, …). It may be capable of displaying Unicode characters, then everything is fine and the characters get displayed properly. But it might be incapable of Unicode, or, more likely, Python is just not well-informed about the capabilities, then you will get an error message. This also will happen if you redirect your output into a file or pipe it into a second process.
To prevent this trouble, you can convert the Unicode string into a byte-array again, choosing an encoding of your choice:
print unicodeText.encode('utf-8')
This way you will only print bytes which every terminal, output file and second process (when piping) can handle.
If input and output encoding are the same, then of course, you won't have to decode and encode anything. But since you have some trouble, most likely the encodings differ, so you will have to do these two steps.
Code page 936 is the only one that has character 0x76D8 (which encodes to 0xC5CC). You need to use gbk or cp936:
with open('chinese.txt','r+b') as inputFile:
    bytes = inputFile.read()
    print(bytes.decode('cp936'))
Just try:
f=open(userinput,'r')
print f.read().decode('gb18030').encode('u8')
Probably I completely don't understand it, so can you take a look at the code examples and tell me what I should do to be sure it will work?
I tried it in Eclipse with PyDev. I use Python 2.6.6 (because of some library that does not support Python 2.7).
First, without using the codecs module:
# -*- coding: utf-8 -*-
file1 = open("samoloty1.txt", "w")
file2 = open("samoloty2.txt", "w")
file3 = open("samoloty3.txt", "w")
file4 = open("samoloty4.txt", "w")
file5 = open("samoloty5.txt", "w")
file6 = open("samoloty6.txt", "w")
# I know that this is weird, but it shows that whatever I do, it doesn't ruin anything...
print u"ą✈✈"
file1.write(u"ą✈✈")
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈".decode("utf-8")
file3.write("ą✈✈".decode("utf-8"))
print "ą✈✈".encode("utf-8")
file4.write("ą✈✈".encode("utf-8"))
print u"ą✈✈".decode("utf-8")
file5.write(u"ą✈✈".decode("utf-8"))
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
file1.close()
file2.close()
file3.close()
file4.close()
file5.close()
file6.close()
file1 = open("samoloty1.txt", "r")
file2 = open("samoloty2.txt", "r")
file3 = open("samoloty3.txt", "r")
file4 = open("samoloty4.txt", "r")
file5 = open("samoloty5.txt", "r")
file6 = open("samoloty6.txt", "r")
print file1.read()
print file2.read()
print file3.read()
print file4.read()
print file5.read()
print file6.read()
Each of those prints works correctly, and I don't get any funny characters.
Also I tried this: I deleted all the files made in the previous test and changed only lines like this one:
file1 = open("samoloty1.txt", "w")
to this:
file1 = codecs.open("samoloty1.txt", "w", encoding='utf-8')
and again everything works...
Can anyone give some examples of what works and what doesn't?
Should this be a separate question?
I am downloading web pages through this:
content = urllib.urlopen(some_url).read()
ucontent = unicode(content, encoding) # i get encoding from headers
Is this correct and enough? What should I do next with it to store it in a UTF-8 file? (I ask because whatever I did before just worked...)
** UPDATE **
Probably everything worked OK because the PyDev (or just Eclipse) terminal is encoded in UTF-8. So for tests I used cmd on Windows 7 and I got some errors. Now everything was crashing as expected. :D Here I am showing what I changed to get it working again (and all of those changes are reasonable for me; they agree with what I learned from the answers and from the Python documentation).
print u"ą✈✈".encode("utf-8") # added encode
file1.write(u"ą✈✈".encode("utf-8")) # added encode
print "ą✈✈"
file2.write("ą✈✈")
print "ą✈✈" # removed .decode("utf-8")
file3.write("ą✈✈") # removed .decode("utf-8"))
print "ą✈✈" # removed .encode("utf-8")
file4.write("ą✈✈") # removed .encode("utf-8"))
print u"ą✈✈".encode("utf-8") # changed from .decode("utf-8")
file5.write(u"ą✈✈".encode("utf-8")) # changed from .decode("utf-8")
print u"ą✈✈".encode("utf-8")
file6.write(u"ą✈✈".encode("utf-8"))
And like someone said, when I use codecs, I don't need to use encode() every time before writing to the file. :)
The question is, which answer should be marked as correct?
You are just lucky that the encoding of your console is utf-8 by default.
If you pass a unicode object to the write method of a file object (sys.stdout), the object is implicitly encoded using the file's encoding attribute.
Those who work on Windows are not so lucky: How to workaround Python "WindowsError messages are not properly encoded" problem?
All those write exercises in the code snippet actually boil down to two situations:
when you write string to the file
when you try to write unicode string to the file
Let's call the byte string s and the unicode string u.
Then fileN.write(s) makes sense, and fileN.write(u) doesn't. I don't know about your setup (maybe you have made some changes to your site's Python), but the following breaks here, as expected:
# -*- coding: utf-8 -*-
ff = open("ff.txt", "w")
ff.write(u"ą✈✈")
ff.close()
with:
Traceback (most recent call last):
File "ex.py", line 5, in <module>
ff.write(u"ą✈✈")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
It means that a unicode string should be converted to a byte string before writing it to a file. Your file6 example shows how to do it:
u"ą✈✈".encode("utf-8")
The magic comment -*- coding: utf-8 -*- is what enables you to write unicode string literals in a WYSIWYG way: u"ą✈✈". It doesn't determine your encoding in any other situation.
Thus, do not give the .write() method in Python 2.6 any unicode string. Good practice is to work with unicode strings in your code, but convert to/from a concrete encoding at the input/output borders.
The codecs example is good, as well as urllib.
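Putting that border rule into a tiny sketch (Python 2.6, reusing the string from the question):
# -*- coding: utf-8 -*-
# Unicode inside the program, bytes only at the file border.
import codecs
text = u"ą✈✈"
f = codecs.open("out.txt", "w", encoding="utf-8")
f.write(text)                 # encoded to UTF-8 only at the output border
f.close()
f = codecs.open("out.txt", "r", encoding="utf-8")
assert f.read() == text       # decoded back at the input border
f.close()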
What you are doing is correct. See this Python unicode howto for more info.
The general principles are:
When binary data comes into your application (e.g., from open() or urllib.urlopen()), use the decode() method to get a unicode string.
If the byte string is invalid for the supplied encoding, you may get UnicodeDecodeError. In this case do one of the following:
Use the second argument to decode to either replace or ignore bad characters
Try harder to find out what the real encoding is.
Fix the input if it really is mangled.
For files, you can use the codecs.open wrappers to do this transparently for you.
Network data you must generally decode by hand, but sometimes the payload declares its own encoding (e.g., HTML, XML), and sometimes it doesn't match the header!
For database data, usually the database driver will have some method of doing encoding/decoding transparently for you and always give you unicode strings. Otherwise you will need to encode/decode by hand.
Use unicode strings in your application.
Right before the binary data leaves your application, use encode() on the string to encode to your desired encoding.
If your target encoding cannot represent some of your unicode characters, you may get UnicodeEncodeError. In this case do one of the following:
Use the second argument to encode() to ignore or replace characters that can't be represented in the target encoding;
Don't generate these characters in your application.
Find an alternate way of representing them. E.g., in XML, you can use a numeric character entity.
For files, you may use the codecs.open wrapper to do encoding for you transparently.
For database connections, the driver will often have an option to accept unicode strings and encode for you.
For network connections, you must generally encode by hand. Sometimes the payload will be generated by a library that will encode properly for you (e.g., writing XML).
Because you are correctly using the magic "coding comment," everything works as expected.