I want to make sure all strings in my code are Unicode, so I use unicode_literals. Then I need to write a string to a file:
from __future__ import unicode_literals
with open('/tmp/test', 'wb') as f:
    f.write("中文")  # UnicodeEncodeError
so I need to do this:
from __future__ import unicode_literals
with open('/tmp/test', 'wb') as f:
    f.write("中文".encode("utf-8"))
    f.write("中文".encode("utf-8"))
    f.write("中文".encode("utf-8"))
    f.write("中文".encode("utf-8"))
but calling .encode() every time gets tedious, so I switched to codecs:
from __future__ import unicode_literals
from codecs import open
import locale, codecs
lang, encoding = locale.getdefaultlocale()
with open('/tmp/test', 'wb', encoding) as f:
    f.write("中文")
Still, I think this is too much just to write to a file. Is there an easier method?
You don't need to call .encode() and you don't need to call locale.getdefaultlocale() explicitly:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io
with io.open('/tmp/test', 'w') as file:
    file.write(u"中文" * 4)
It uses the locale.getpreferredencoding(False) character encoding to save the Unicode text to the file.
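If the file must be readable on a machine with a different locale, you can pin the encoding explicitly instead of relying on the default; a minimal sketch:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io

# Passing encoding= makes the file UTF-8 regardless of the current locale.
with io.open('/tmp/test', 'w', encoding='utf-8') as file:
    file.write(u"中文" * 4)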
On Python 3:
you don't need the explicit encoding declaration (# -*- coding: utf-8 -*-) to use literal non-ASCII characters in your Python source code; UTF-8 is the default.
you don't need import io: the builtin open() is io.open() there
you don't need the u'' prefix: '' literals are Unicode by default. (On Python 2, if you want to omit u'', put back from __future__ import unicode_literals as in the code in the question.)
i.e., the complete Python 3 code is:
#!/usr/bin/env python3
with open('/tmp/test', 'w') as file:
    file.write("中文" * 4)
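As a quick sanity check (assuming you read it back on the same machine, so the locale default matches what was used for writing), you can verify the round trip:
#!/usr/bin/env python3
# Reading uses the same locale-based default encoding that writing did.
with open('/tmp/test') as file:
    assert file.read() == "中文" * 4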
What about this solution?
Write to UTF-8 file in Python
Only three lines of code.
I want to use Python to read a .csv file.
At first I searched for an answer and added the following lines
#!/usr/bin/python
#-*-coding:utf-8 -*-
to avoid the encoding problem, but it still fails with a syntax error:
SyntaxError: Non-ASCII character '\xe6' in file csv1.py on line 2, but no encoding declared
My code:
#!/usr/bin/python
# -*-coding:utf-8 -*-
import csv
with open('wtr1.csv', 'rb') as f:
    for row in csv.reader(f):
        print row
You've got two different errors here. This answer relates to the with statement issue; the other error is the ASCII encoding error.
You appear to be using a very old version of Python (2.5). The with statement is not enabled by default in Python 2.5; instead you have to declare at the top of the file that you wish to use it. Your file should now look like:
#!/usr/bin/python
# -*-coding:utf-8 -*-
from __future__ import with_statement
import csv
with open('wtr1.csv', 'rb') as f:
    for row in csv.reader(f):
        print row
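As for the encoding side: Python 2's csv module works on byte strings, so each cell comes back undecoded. A minimal sketch (assuming the CSV content is UTF-8) decodes the cells after reading:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import with_statement
import csv

with open('wtr1.csv', 'rb') as f:
    for row in csv.reader(f):
        # csv.reader yields byte strings in Python 2; decode each cell to unicode.
        print [cell.decode('utf-8') for cell in row]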
This code should write some text to a file.
When I try to write my text to the console, everything works. But when I try to write the text into a file, I get a UnicodeEncodeError. I know that this is a common problem which can be solved with the proper decode or encode, but I tried that and I still get the same UnicodeEncodeError. What am I doing wrong?
I've attached an example.
print "(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".decode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2])
(StarBuy s.r.o.,Inzertujte s foto, auto-moto, oblečenie, reality, prácu, zvieratá, starožitnosti, dovolenky, nábytok, všetko pre deti, obuv, stroj....
with open("test.txt","wb") as f:
f.write("(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".decode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u010d' in position 50: ordinal not in range(128)
Where could the problem be?
To write Unicode text to a file, you could use io.open() function:
#!/usr/bin/env python
from io import open
with open('utf8.txt', 'w', encoding='utf-8') as file:
    file.write(u'\u010d')
It is the default on Python 3.
Note: you should not use the binary file mode ('b') if you want to write text.
The # coding: utf8 comment defines the source code encoding; it has nothing to do with this error.
If you see sys.setdefaultencoding() outside of site.py or Python tests, assume the code is broken.
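To make the earlier note about binary mode concrete: a text-mode file opened with an explicit encoding accepts unicode and encodes it for you, while a binary-mode file only accepts bytes. A minimal sketch:
#!/usr/bin/env python
from io import open

# Text mode: pass unicode; the file object does the encoding.
with open('utf8.txt', 'w', encoding='utf-8') as file:
    file.write(u'\u010d')

# Binary mode: you must encode to bytes yourself.
with open('utf8.bin', 'wb') as file:
    file.write(u'\u010d'.encode('utf-8'))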
@ned-batchelder is right. You have to declare that the system default encoding is "utf-8". The coding comment # -*- coding: utf-8 -*- doesn't do this.
To declare the system default encoding, you have to import the module sys and call sys.setdefaultencoding('utf-8'). However, sys is imported at interpreter startup and its setdefaultencoding method is removed by site.py, so you have to reload the module before you can call the method.
So, you will need to add the following code at the beginning:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
You may need to explicitly declare that Python should use the UTF-8 encoding.
The answer to this SO question explains how to do that: Declaring Encoding in Python
For Python 2:
Declare document encoding on top of the file (if not done yet):
# -*- coding: utf-8 -*-
Replace .decode with .encode:
with open("test.txt","wb") as f:
f.write("(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".encode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2]))
When I use open() to open a file, I am not able to write unicode strings. I have learned that I need to use codecs and open the file with Unicode encoding (see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data).
Now I need to create some temporary files. I tried to use the tempfile library, but it doesn't have any encoding option. When I try to write any unicode string in a temporary file with tempfile, it fails:
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
    fh.write(u"Hello World: ä")
    fh.seek(0)
    for line in fh:
        print line
How can I create a temporary file with Unicode encoding in Python?
Edit:
I am using Linux and the error message that I get for this code is:
Traceback (most recent call last):
  File "tmp_file.py", line 5, in <module>
    fh.write(u"Hello World: ä")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 13: ordinal not in range(128)
This is just an example. In practice I am trying to write a string that some API returned.
Everyone else's answers are correct; I just want to clarify what's going on:
The difference between the literal 'foo' and the literal u'foo' is that the former is a string of bytes and the latter is a Unicode object.
First, understand that Unicode is the character set and UTF-8 is an encoding. The Unicode object is about the former: it's a Unicode string, not necessarily a UTF-8 one. In your case, the encoding used for a byte string literal will be UTF-8, because you specified it in the first lines of the file.
To get a byte string from a Unicode string, you call the .encode() method:
>>>> u"ひらがな".encode("utf-8") == "ひらがな"
True
Similarly, you could call .encode() on your string in the write call and achieve the same effect as just removing the u prefix.
If you hadn't specified the encoding at the top, say if you were reading the Unicode data from another file, you would have to specify what encoding it was in before it reached a Python string. That would determine how it is represented in bytes (i.e., the str type).
The error you're getting, then, is only because the tempfile module is expecting a str object. This doesn't mean it can't handle unicode, just that it expects you to pass in a byte string rather than a Unicode object: without you specifying an encoding, it wouldn't know how to write it to the temp file.
tempfile.TemporaryFile has an encoding option in Python 3:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile(mode='w+', encoding='utf-8') as fh:
    fh.write("Hello World: ä")
    fh.seek(0)
    for line in fh:
        print(line)
Note that now you need to specify mode='w+' instead of the default binary mode. Also note that string literals are implicitly Unicode in Python 3, so the u prefix isn't needed.
If you're stuck with Python 2.6, temporary files are always binary, and you need to encode the Unicode string before writing it to the file:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
    fh.write(u"Hello World: ä".encode('utf-8'))
    fh.seek(0)
    for line in fh:
        print line.decode('utf-8')
Unicode specifies the character set, not the encoding, so in either case you need a way to specify how to encode the Unicode characters!
Since I am working on a Python program with TemporaryFile objects that should run on both Python 2 and Python 3, I don't find it satisfactory to manually encode every written string as UTF-8 as the other answers suggest.
Instead, I have written the following small polyfill (because I could not find something like it in six) to wrap a binary file-like object into a UTF-8 file-like object:
from __future__ import unicode_literals
import sys
import codecs
if sys.hexversion < 0x03000000:
def uwriter(fp):
return codecs.getwriter('utf-8')(fp)
else:
def uwriter(fp):
return fp
It is used in the following way:
# encoding: utf-8
from tempfile import NamedTemporaryFile
with uwriter(NamedTemporaryFile(suffix='.txt', mode='w')) as fp:
    fp.write('Hællo wörld!\n')
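The same trick works in the other direction: codecs.getreader('utf-8') wraps a binary file-like object so that reads return unicode. A sketch along the same lines, with a hypothetical ureader counterpart:
import codecs
import sys

if sys.hexversion < 0x03000000:
    def ureader(fp):
        # Wrap a binary file so reads return unicode decoded from UTF-8.
        return codecs.getreader('utf-8')(fp)
else:
    def ureader(fp):
        return fp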
I have figured out one solution: create a temporary file with tempfile that is not automatically deleted, close it, and open it again using codecs:
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile
f = tempfile.NamedTemporaryFile(delete=False)
filename = f.name
f.close()
with codecs.open(filename, 'w+b', encoding='utf-8') as fh:
    fh.write(u"Hello World: ä")
    fh.seek(0)
    for line in fh:
        print line
os.unlink(filename)
You are trying to write a unicode object (u"...") to the temporary file, where you should use an encoded byte string ("..."). You don't have to explicitly pass an encoding parameter, because you've already stated the encoding on line two (# -*- coding: utf-8 -*-). Just use fh.write("ä") instead of fh.write(u"ä") and you should be fine.
Dropping the u made your code work for me:
fh.write("Hello World: ä")
I guess that works because the byte string literal is already UTF-8 encoded, thanks to the coding declaration at the top of the file.
Setting the system default encoding to UTF-8 will fix the encoding issue:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # set to utf-8 by default; this will solve the errors
import tempfile

with tempfile.TemporaryFile() as fh:
    fh.write(u"Hello World: ä")
    fh.seek(0)
    for line in fh:
        print line
To summarize: how do I print Unicode system-independently to produce playing card symbols?
What am I doing wrong? I consider myself quite fluent in Python, yet I seem unable to print correctly!
# coding: utf-8
from __future__ import print_function
from __future__ import unicode_literals
import sys
symbols = ('♥', '♦', '♠', '♣')
# red suits to stderr for IDLE
print(' '.join(symbols[:2]), file=sys.stderr)
print(' '.join(symbols[2:]))
sys.stdout.write(' '.join(symbols) + '\n')  # also correct in IDLE
print(' '.join(symbols))
Printing to the console, which is the main concern for a console application, fails miserably though:
J:\test>chcp
Aktiivinen koodisivu: 850
J:\test>symbol2
Traceback (most recent call last):
  File "J:\test\symbol2.py", line 9, in <module>
    print(''.join(symbols))
  File "J:\Python26\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
J:\test>chcp 437
Aktiivinen koodisivu: 437
J:\test>d:\Python27\python.exe symbol2.py
Traceback (most recent call last):
  File "symbol2.py", line 6, in <module>
    print(' '.join(symbols))
  File "d:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2660' in position 0: character maps to <undefined>
J:\test>
So, summa summarum, I have a console application which works as long as you are not using the console, only IDLE.
I can of course generate the symbols myself with chr:
# correct symbols for cp850
print(''.join(chr(n) for n in range(3,3+4)))
But this looks like a very stupid way to do it, and I don't want programs that only run on Windows or have many special cases (like conditional compilation). I want readable code.
I do not mind which characters it outputs, as long as it looks correct whether it is a Nokia phone, Windows, or Linux. Unicode should do it, but it does not print correctly to the console.
Whenever I need to output utf-8 characters, I use the following approach:
import sys
import codecs

out = codecs.getwriter('utf-8')(sys.stdout)
s = u'♠'  # renamed from str to avoid shadowing the builtin
out.write("%s\n" % s)
This saves me an encode('utf-8') every time something needs to be sent to stdout/stderr.
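On Python 3 the same idea is spelled differently, since sys.stdout is already a text stream: wrap the underlying binary buffer instead. A sketch:
import io
import sys

# Re-wrap stdout's binary buffer with an explicit UTF-8 text layer.
out = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
out.write(u'♠\n')
out.flush()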
Use Unicode strings and the codecs module:
Either:
# coding: utf-8
from __future__ import print_function
import sys
import codecs
symbols = (u'♠',u'♥',u'♦',u'♣')
print(u' '.join(symbols))
print(*symbols)
with codecs.open('test.txt','w','utf-8') as testfile:
    print(*symbols, file=testfile)
or:
# coding: utf-8
from __future__ import print_function
from __future__ import unicode_literals
import sys
import codecs
symbols = ('♠','♥','♦','♣')
print(' '.join(symbols))
print(*symbols)
with codecs.open('test.txt','w','utf-8') as testfile:
    print(*symbols, file=testfile)
No need to re-implement print.
In response to the updated question
Since all you want to do is print UTF-8 characters in the CMD console, you're out of luck: CMD does not support UTF-8.
Is there a Windows command shell that will display Unicode characters?
Old Answer
It's not totally clear what you're trying to do here; my best bet is that you want to write the encoded UTF-8 to a file.
Your problems are:
symbols = ('♠','♥','♦','♣'): while your file encoding may be UTF-8, unless you're using Python 3 your string literals won't be unicode by default; you need to prefix them with a small u:
symbols = (u'♠', u'♥', u'♦', u'♣')
Your str(arg) converts the unicode string back into a byte string; just leave it out, or use unicode(arg) to convert to a unicode string.
The naming of .decode() may be confusing: it decodes bytes into unicode, but what you need here is to encode unicode into bytes, so use .encode().
You're not writing to the file in binary mode: instead of open('test.txt', 'w') you need to use open('test.txt', 'wb') (notice the wb). This opens the file in binary mode, which is important on Windows.
If we put all of this together we get:
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
symbols = (u'♠',u'♥', u'♦',u'♣')
print(' '.join(symbols))
print('Failure!')
def print(*args, **kwargs):
    end = kwargs['end'] if 'end' in kwargs else '\n'
    sep = kwargs['sep'] if 'sep' in kwargs else ' '
    stdout = sys.stdout if 'file' not in kwargs else kwargs['file']
    stdout.write(sep.join(unicode(arg).encode('utf-8') for arg in args))
    stdout.write(end)
print(*symbols)
print('Success!')
with open('test.txt', 'wb') as testfile:
    print(*symbols, file=testfile)
That happily writes the UTF-8-encoded bytes to the file (at least on my Ubuntu box here).
UTF-8 in the Windows console is a long and painful story.
You can read issue 1602 and issue 6058 and have something that works, more or less, but it's fragile.
Let me summarise:
add 'cp65001' as an alias for 'utf8' in Lib/encodings/aliases.py
select Lucida Console or Consolas as your console font
run chcp 65001
run python
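Alternatively, you can force the encoding of Python's stdout/stderr with the PYTHONIOENCODING environment variable instead of patching the alias table; whether the console font can actually display the characters remains a separate problem. For example:
J:\test>chcp 65001
J:\test>set PYTHONIOENCODING=utf-8
J:\test>python symbol2.py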
I am trying to learn Python and couldn't figure out how to translate the following Perl script to Python:
#!/usr/bin/perl -w
use open qw(:std :utf8);
while(<>) {
s/\x{00E4}/ae/;
s/\x{00F6}/oe/;
s/\x{00FC}/ue/;
print;
}
The script just changes Unicode umlauts to an alternative ASCII representation, so the complete output is in ASCII. I would be grateful for any hints. Thanks!
For converting to ASCII you might want to try ASCII, Dammit or this recipe, which boils down to:
>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
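Wrapped up as a small helper (the function name is mine), assuming it is acceptable to silently drop characters that have no decomposition:
import unicodedata

def asciify(text):
    # Decompose accented characters, then drop anything that is not ASCII.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')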
Use the fileinput module to loop over standard input or a list of files,
decode the lines you read from UTF-8 into unicode objects,
then map any unicode characters you desire with the translate method.
translit.py would look like this:
#!/usr/bin/env python2.6
# -*- coding: utf-8 -*-
import fileinput
table = {
    0xe4: u'ae',
    ord(u'ö'): u'oe',
    ord(u'ü'): u'ue',
    ord(u'ß'): None,
}
for line in fileinput.input():
    s = line.decode('utf8')
    print s.translate(table),
And you could use it like this:
$ cat utf8.txt
sömé täßt
sömé täßt
sömé täßt
$ ./translit.py utf8.txt
soemé taet
soemé taet
soemé taet
Update:
In case you are using Python 3, strings are Unicode by default and you don't need to encode them even if they contain non-ASCII or non-Latin characters. So the solution will look as follows:
>>> line = 'Verhältnismäßigkeit, Möglichkeit'
>>> table = {
...     ord('ä'): 'ae',
...     ord('ö'): 'oe',
...     ord('ü'): 'ue',
...     ord('ß'): 'ss',
... }
>>> line.translate(table)
'Verhaeltnismaessigkeit, Moeglichkeit'
You could try unidecode to convert Unicode into ascii instead of writing manual regular expressions. It is a Python port of Text::Unidecode Perl module:
#!/usr/bin/env python
import fileinput
import locale
from contextlib import closing
from unidecode import unidecode # $ pip install unidecode
def toascii(files=None, encoding=None, bufsize=-1):
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    with closing(fileinput.FileInput(files=files, bufsize=bufsize)) as file:
        for line in file:
            print unidecode(line.decode(encoding)),

if __name__ == "__main__":
    import sys
    toascii(encoding=sys.argv.pop(1) if len(sys.argv) > 1 else None)
It uses the FileInput class to avoid global state.
Example:
$ echo 'äöüß' | python toascii.py utf-8
aouss
I use translitcodec
>>> import translitcodec
>>> print '\xe4'.decode('latin-1')
ä
>>> print '\xe4'.decode('latin-1').encode('translit/long').encode('ascii')
ae
>>> print '\xe4'.decode('latin-1').encode('translit/short').encode('ascii')
a
You can change the decode codec to whatever you need. You may want a simple function to avoid repeating the chain:
def fancy2ascii(s):
    return s.decode('latin-1').encode('translit/long').encode('ascii')
Quick and dirty (Python 2):
def make_ascii(string):
    return string.decode('utf-8').replace(u'ü', 'ue').replace(u'ö', 'oe').replace(u'ä', 'ae').replace(u'ß', 'ss').encode('ascii', 'ignore')