How to create a temporary file with Unicode encoding? - python

When I use open() to open a file, I am not able to write unicode strings. I have learned that I need to use codecs and open the file with Unicode encoding (see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data).
Now I need to create some temporary files. I tried to use the tempfile library, but it doesn't have any encoding option. When I try to write any unicode string in a temporary file with tempfile, it fails:
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
    fh.write(u"Hello World: ä")
    fh.seek(0)
    for line in fh:
        print line
How can I create a temporary file with Unicode encoding in Python?
Edit:
I am using Linux and the error message that I get for this code is:
Traceback (most recent call last):
  File "tmp_file.py", line 5, in <module>
    fh.write(u"Hello World: ä")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 13: ordinal not in range(128)
This is just an example. In practice I am trying to write a string that some API returned.

Everyone else's answers are correct; I just want to clarify what's going on.
The difference between the literal 'foo' and the literal u'foo' is that the former is a string of bytes and the latter is a Unicode object.
First, understand that Unicode is the character set and UTF-8 is an encoding. A Unicode object is about the former: it's a Unicode string, not necessarily a UTF-8 one. In your case, the encoding used for byte-string literals will be UTF-8, because you declared it in the first lines of the file.
To get a byte string from a Unicode string, you call the .encode() method:
>>> u"ひらがな".encode("utf-8") == "ひらがな"
True
Similarly, you could call .encode('utf-8') on your string inside the write call and achieve the same effect as just removing the u.
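For the opposite direction, .decode() turns a byte string back into a Unicode object, given the encoding the bytes are in (a quick sketch at the Python 2 prompt, assuming a UTF-8 source/terminal):
>>> "ひらがな".decode("utf-8") == u"ひらがな"
True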
If you didn't specify the encoding at the top (say, if you were reading the Unicode data from another file), you would specify what encoding it was in before it reached a Python byte string; that encoding determines how the text is represented in bytes (i.e., as the str type).
The error you're getting, then, is only because the tempfile module expects a str object. This doesn't mean it can't handle Unicode text, just that it expects you to pass in a byte string rather than a Unicode object: without you specifying an encoding, it wouldn't know how to write the text to the temp file.

tempfile.TemporaryFile has an encoding option in Python 3:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile(mode='w+', encoding='utf-8') as fh:
    fh.write("Hello World: ä")
    fh.seek(0)
    for line in fh:
        print(line)
Note that you now need to specify mode='w+' instead of the default binary mode. Also note that string literals are implicitly Unicode in Python 3, so there is no u prefix.
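If you need a named temporary file, tempfile.NamedTemporaryFile accepts the same text-mode arguments in Python 3; a minimal sketch:
#!/usr/bin/python3
import tempfile

with tempfile.NamedTemporaryFile(mode='w+', encoding='utf-8', suffix='.txt') as fh:
    fh.write("Hello World: ä")  # written through the UTF-8 text layer
    fh.seek(0)
    print(fh.read())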
If you're stuck with Python 2.6, temporary files are always binary, and you need to encode the Unicode string before writing it to the file:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
    fh.write(u"Hello World: ä".encode('utf-8'))
    fh.seek(0)
    for line in fh:
        print line.decode('utf-8')
Unicode specifies the character set, not the encoding, so in either case you need a way to specify how to encode the Unicode characters!

Since I am working on a Python program with TemporaryFile objects that should run in both Python 2 and Python 3, I don't find it satisfactory to manually encode every written string as UTF-8 like the other answers suggest.
Instead, I have written the following small polyfill (because I could not find something like it in six) to wrap a binary file-like object into a UTF-8 file-like object:
from __future__ import unicode_literals
import sys
import codecs
if sys.hexversion < 0x03000000:
    def uwriter(fp):
        return codecs.getwriter('utf-8')(fp)
else:
    def uwriter(fp):
        return fp
It is used in the following way:
# encoding: utf-8
from tempfile import NamedTemporaryFile
with uwriter(NamedTemporaryFile(suffix='.txt', mode='w')) as fp:
    fp.write('Hællo wörld!\n')
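If you also need to read UTF-8 back through the same kind of wrapper, a matching reader can be built with codecs.getreader; a sketch along the same lines (ureader is a hypothetical counterpart, reusing the imports from the polyfill above):
if sys.hexversion < 0x03000000:
    def ureader(fp):
        # decode bytes from fp as UTF-8 on the fly
        return codecs.getreader('utf-8')(fp)
else:
    def ureader(fp):
        return fp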

I have figured out one solution: create a temporary file that is not automatically deleted with tempfile, close it, and reopen it using codecs:
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile
f = tempfile.NamedTemporaryFile(delete=False)
filename = f.name
f.close()
with codecs.open(filename, 'w+b', encoding='utf-8') as fh:
    fh.write(u"Hello World: ä")
    fh.seek(0)
    for line in fh:
        print line
os.unlink(filename)
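If an exception between creating the file and unlinking it is a concern, the same idea can be wrapped in try/finally so the file is always removed; a sketch using the same imports as above:
f = tempfile.NamedTemporaryFile(delete=False)
filename = f.name
f.close()
try:
    with codecs.open(filename, 'w+b', encoding='utf-8') as fh:
        fh.write(u"Hello World: ä")
finally:
    os.unlink(filename)  # runs even if the write fails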

You are trying to write a unicode object (u"...") to the temporary file where you should write an encoded byte string ("..."). You don't have to encode explicitly, because the coding declaration in line two ("# -*- coding: utf-8 -*-") already makes byte-string literals UTF-8. Just use fh.write("ä") instead of fh.write(u"ä") and you should be fine.

Dropping the u made your code work for me:
fh.write("Hello World: ä")
That works because the literal is then a UTF-8-encoded byte string (thanks to the coding declaration), which the binary temporary file accepts as-is.

Setting the default encoding to UTF-8 will work around the encoding issue (though note that relying on reload(sys) and setdefaultencoding is widely discouraged):
import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # make 'utf-8' the default encoding

import tempfile
with tempfile.TemporaryFile() as fh:
    fh.write(u"Hello World: ä")
    fh.seek(0)
    for line in fh:
        print line

Related

python write unicode to file easily?

I want to make sure all strings are Unicode in my code, so I use unicode_literals; then I need to write strings to a file:
from __future__ import unicode_literals
with open('/tmp/test', 'wb') as f:
    f.write("中文") # UnicodeEncodeError
so I need to do this:
from __future__ import unicode_literals
with open('/tmp/test', 'wb') as f:
    f.write("中文".encode("utf-8"))
    f.write("中文".encode("utf-8"))
    f.write("中文".encode("utf-8"))
    f.write("中文".encode("utf-8"))
but having to encode every single write is tedious, so I switched to codecs:
from __future__ import unicode_literals
from codecs import open
import locale, codecs
lang, encoding = locale.getdefaultlocale()
with open('/tmp/test', 'wb', encoding) as f:
    f.write("中文")
Still, this seems like too much if I just want to write to a file. Is there an easier way?
You don't need to call .encode() and you don't need to call locale.getdefaultlocale() explicitly:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io
with io.open('/tmp/test', 'w') as file:
    file.write(u"中文" * 4)
It uses the locale.getpreferredencoding(False) character encoding to save Unicode text to the file.
On Python 3:
you don't need the explicit encoding declaration (# -*- coding: utf-8 -*-) to use literal non-ASCII characters in your Python source code: UTF-8 is the default.
you don't need import io: the builtin open() is io.open() there.
you don't need the u'' prefix: '' literals are Unicode by default. If you want to omit u'' on Python 2 as well, put back from __future__ import unicode_literals as in the code in your question.
i.e., the complete Python 3 code is:
#!/usr/bin/env python3
with open('/tmp/test', 'w') as file:
    file.write("中文" * 4)
What about this solution?
Write to UTF-8 file in Python
Only three lines of code.

Python JSON Unicode Error OrderedDict

I want to be able to view the file in an editor and automatically see a ü.
# -*- coding: utf-8 -*-
import json
from collections import OrderedDict
fdata = OrderedDict()
fdata[u"Züge"] = 0
fdata[u"Bahnhöfe"] = 0
with open("Desktop/test.json", "w") as outfile:
json.dump(fdata, outfile, indent=2, ensure_ascii=False)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
It has something to do with OrderedDict; with a normal dict it works.
You don't specify an encoding when opening the file, so outfile.encoding is probably None.
file.encoding
The encoding that this file uses. When Unicode strings are written to a file, they will be converted to byte strings using this encoding. In addition, when the file is connected to a terminal, the attribute gives the encoding that the terminal is likely to use (that information might be incorrect if the user has misconfigured the terminal). The attribute is read-only and may not be present on all file-like objects. It may also be None, in which case the file uses the system default encoding for converting Unicode strings.
And your system default encoding is apparently ascii.
Instead, open your file with the desired encoding:
import codecs
with codecs.open("test.json", "w", encoding='utf-8') as outfile:
I had a similar issue once; adding this line at the top of my .py file worked for me.
# coding=utf-8

Python: Can't write to file - UnicodeEncodeError

This code should write some text to file.
When I'm trying to write my text to the console, everything works. But when I try to write the text into the file, I get a UnicodeEncodeError. I know that this is a common problem which can be solved using a proper decode or encode, but I tried it and am still getting the same UnicodeEncodeError. What am I doing wrong?
I've attached an example.
print "(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".decode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2])
(StarBuy s.r.o.,Inzertujte s foto, auto-moto, oblečenie, reality, prácu, zvieratá, starožitnosti, dovolenky, nábytok, všetko pre deti, obuv, stroj....
with open("test.txt","wb") as f:
f.write("(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".decode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u010d' in position 50: ordinal not in range(128)
Where could be the problem?
To write Unicode text to a file, you could use the io.open() function:
#!/usr/bin/env python
from io import open
with open('utf8.txt', 'w', encoding='utf-8') as file:
    file.write(u'\u010d')
It is the default on Python 3.
Note: you should not use the binary file mode ('b') if you want to write text.
The # coding: utf8 comment that defines the source code encoding has nothing to do with it.
If you see sys.setdefaultencoding() outside of site.py or Python tests; assume the code is broken.
@Ned Batchelder is right. You have to declare that the system default encoding is "utf-8". The coding comment # -*- coding: utf-8 -*- doesn't do this.
To declare the system default encoding, you have to import the module sys and call sys.setdefaultencoding('utf-8'). However, sys was previously imported by the interpreter and its setdefaultencoding method was removed at startup, so you have to reload it before you call the method.
So, you will need to add the following lines at the beginning:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
You may need to explicitly declare that Python should use the UTF-8 encoding.
The answer to this SO question explains how to do that: Declaring Encoding in Python
For Python 2:
Declare document encoding on top of the file (if not done yet):
# -*- coding: utf-8 -*-
Encode the formatted result (rather than decoding the format string), so that what reaches the binary file is a UTF-8 byte string:
with open("test.txt","wb") as f:
    f.write(("(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)" % (
        dict.get('name'), dict.get('description'), dict.get('ico'), dict.get('city'),
        dict.get('ulCislo'), dict.get('psc'), dict.get('weby'), dict.get('telefony'),
        dict.get('mobily'), dict.get('faxy'), dict.get('emaily'), dict.get('dic'),
        dict.get('ic_dph'), dict.get('kategorie')[0], dict.get('kategorie')[1],
        dict.get('kategorie')[2])).encode("utf-8"))

Unicode not printing correctly to cp850 (cp437), play card suits

To summarize: how do I print Unicode system-independently to produce the playing card symbols?
What am I doing wrong? I consider myself quite fluent in Python, yet I seem unable to print correctly!
# coding: utf-8
from __future__ import print_function
from __future__ import unicode_literals
import sys
symbols = ('♥','♦','♠','♣')
# red suits to stderr for IDLE
print(' '.join(symbols[:2]), file=sys.stderr)
print(' '.join(symbols[2:]))
sys.stdout.write(' '.join(symbols) + '\n') # also correct in IDLE
print(' '.join(symbols))
Printing to the console, which is the main concern for a console application, fails miserably though:
J:\test>chcp
Aktiivinen koodisivu: 850
J:\test>symbol2
Traceback (most recent call last):
File "J:\test\symbol2.py", line 9, in <module>
print(''.join(symbols))
File "J:\Python26\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
J:\test>chcp 437
Aktiivinen koodisivu: 437
J:\test>d:\Python27\python.exe symbol2.py
Traceback (most recent call last):
File "symbol2.py", line 6, in <module>
print(' '.join(symbols))
File "d:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2660' in position 0: character maps to <undefined>
J:\test>
So, summa summarum, I have a console application that works as long as you are not using the console, but IDLE.
I can of course generate the symbols myself by producing them with chr:
# correct symbols for cp850
print(''.join(chr(n) for n in range(3,3+4)))
But this looks like a very stupid way to do it, and I do not want to make programs that only run on Windows or have many special cases (like conditional compilation). I want readable code.
I do not mind which letters it outputs, as long as it looks correct whether it is a Nokia phone, Windows, or Linux. Unicode should do it, but it does not print correctly to the console.
Whenever I need to output utf-8 characters, I use the following approach:
import sys
import codecs

out = codecs.getwriter('utf-8')(sys.stdout)
s = u'♠'
out.write("%s\n" % s)
This saves me an encode('utf-8') every time something needs to be sent to stdout/stderr.
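On Python 3.7 and later (not available when this question was asked), the standard streams can instead be switched to UTF-8 in place; a minimal sketch:
import sys

sys.stdout.reconfigure(encoding='utf-8')  # io.TextIOWrapper.reconfigure, Python 3.7+
print(u'♠')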
Use Unicode strings and the codecs module:
Either:
# coding: utf-8
from __future__ import print_function
import sys
import codecs
symbols = (u'♠',u'♥',u'♦',u'♣')
print(u' '.join(symbols))
print(*symbols)
with codecs.open('test.txt','w','utf-8') as testfile:
    print(*symbols, file=testfile)
or:
# coding: utf-8
from __future__ import print_function
from __future__ import unicode_literals
import sys
import codecs
symbols = ('♠','♥','♦','♣')
print(' '.join(symbols))
print(*symbols)
with codecs.open('test.txt','w','utf-8') as testfile:
    print(*symbols, file=testfile)
No need to re-implement print.
In response to the updated question
Since all you want to do is to print out UTF-8 characters on the CMD, you're out of luck: CMD does not support UTF-8.
Is there a Windows command shell that will display Unicode characters?
Old Answer
It's not totally clear what you're trying to do here, my best bet is that you want to write the encoded UTF-8 to a file.
Your problems are:
symbols = ('♠','♥', '♦','♣'): while your file encoding may be UTF-8, unless you're using Python 3 your string literals won't be Unicode by default; you need to prefix them with a small u:
symbols = (u'♠', u'♥', u'♦', u'♣')
Your str(arg) converts the Unicode string back into a normal byte string; just leave it out, or use unicode(arg) to convert to a Unicode string.
The naming of .decode() may be confusing: it decodes bytes into Unicode, but what you need to do here is encode Unicode into bytes, so use .encode().
You're not writing to the file in binary mode: instead of open('test.txt', 'w') you need to use open('test.txt', 'wb') (notice the wb); this opens the file in binary mode, which is important on Windows.
If we put all of this together we get:
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys

symbols = (u'♠', u'♥', u'♦', u'♣')
print(' '.join(symbols))
print('Failure!')

def print(*args, **kwargs):
    end = kwargs['end'] if 'end' in kwargs else '\n'
    sep = kwargs['sep'] if 'sep' in kwargs else ' '
    stdout = sys.stdout if 'file' not in kwargs else kwargs['file']
    stdout.write(sep.join(unicode(arg).encode('utf-8') for arg in args))
    stdout.write(end)

print(*symbols)
print('Success!')

with open('test.txt', 'wb') as testfile:
    print(*symbols, file=testfile)
That happily writes the byte encoded UTF-8 to the file (at least on my Ubuntu box here).
UTF-8 in the Windows console is a long and painful story.
You can read issue 1602 and issue 6058 and have something that works, more or less, but it's fragile.
Let me summarise:
add 'cp65001' as an alias for 'utf8' in Lib/encodings/aliases.py
select Lucida Console or Consolas as your console font
run chcp 65001
run python
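Alternatively, setting the PYTHONIOENCODING environment variable before starting Python forces the encoding of the standard streams, which sometimes helps (the console font must still contain the glyphs):
J:\test>set PYTHONIOENCODING=utf-8
J:\test>chcp 65001
J:\test>python symbol2.py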

Write to UTF-8 file in Python

I'm really confused with the codecs.open function. When I do:
file = codecs.open("temp", "w", "utf-8")
file.write(codecs.BOM_UTF8)
file.close()
It gives me the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
If I do:
file = open("temp", "w")
file.write(codecs.BOM_UTF8)
file.close()
It works fine.
The question is: why does the first method fail, and how do I insert the BOM?
If the second method is the correct way of doing it, what's the point of using codecs.open(filename, "w", "utf-8")?
I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. The UTF-8 writer expects Unicode text, so Python 2 first tries to decode your byte string with the default ASCII codec, which fails on the BOM's 0xEF byte.
Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:
import codecs
file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()
(That seems to give the right answer - a file with bytes EF BB BF.)
EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is a better one than explicitly writing the BOM yourself, but I'll leave this answer here as it explains what was going wrong before.
Read the following: http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig
Do this
with codecs.open("test_output", "w", "utf-8-sig") as temp:
temp.write("hi mom\n")
temp.write(u"This has ♭")
The resulting file is UTF-8 with the expected BOM.
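To confirm the BOM round-trips, the file can be read back with the same codec, which consumes the BOM on input; a quick check:
import codecs

with codecs.open("test_output", "r", "utf-8-sig") as temp:
    print(temp.read())  # the utf-8-sig decoder strips the leading BOM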
It is very simple: just use this (Python 3, no extra library needed).
with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
@S.Lott gives the right procedure, but expanding on the Unicode issues, the Python interpreter can provide more insights.
Jon Skeet is right (unusual) about the codecs module - it contains byte strings:
>>> import codecs
>>> codecs.BOM
'\xff\xfe'
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>>
Picking another nit, the BOM has a standard Unicode name, and it can be entered as:
>>> bom= u"\N{ZERO WIDTH NO-BREAK SPACE}"
>>> bom
u'\ufeff'
It is also accessible via unicodedata:
>>> import unicodedata
>>> unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
u'\ufeff'
>>>
I use the *nix file command to convert a file of unknown charset into a UTF-8 file:
# -*- encoding: utf-8 -*-
# convert a file of unknown encoding to UTF-8
import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location + "b", 'w', 'utf-8')
for l in file_stream:
    file_output.write(l)
file_stream.close()
file_output.close()
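The commands module is Python 2 only; on Python 3 the same idea can be written with subprocess (a sketch, assuming the file command is available on the system):
import subprocess

file_location = "jumper.sub"
file_encoding = subprocess.check_output(
    ['file', '-b', '--mime-encoding', file_location]).decode().strip()

# re-encode the file as UTF-8 under a new name
with open(file_location, 'r', encoding=file_encoding) as src, \
     open(file_location + 'b', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)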
Python >= 3.5, using pathlib:
import pathlib
pathlib.Path("text.txt").write_text(text, encoding='utf-8')  # or 'utf-8-sig' for a BOM
If you are using Pandas I/O methods like DataFrame.to_excel(), add an encoding parameter, e.g.
df.to_excel("somefile.xlsx", sheet_name="export", encoding='utf-8')
This works for most international characters I believe.
