Python JSON Unicode Error OrderedDict - python

I want to be able to view the file in an editor and see the character ü directly (not an escape sequence).
# -*- coding: utf-8 -*-
import json
from collections import OrderedDict
fdata = OrderedDict()
fdata[u"Züge"] = 0
fdata[u"Bahnhöfe"] = 0
with open("Desktop/test.json", "w") as outfile:
json.dump(fdata, outfile, indent=2, ensure_ascii=False)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in
position 2: ordinal not in range(128)
It seems to have something to do with OrderedDict; with a normal dict it works.

You don't specify an encoding when opening the file, so outfile.encoding is probably None.
file.encoding
The encoding that this file uses. When Unicode strings
are written to a file, they will be converted to byte strings using
this encoding. In addition, when the file is connected to a terminal,
the attribute gives the encoding that the terminal is likely to use
(that information might be incorrect if the user has misconfigured the
terminal). The attribute is read-only and may not be present on all
file-like objects. It may also be None, in which case the file uses
the system default encoding for converting Unicode strings.
And your system default encoding is apparently ascii.
Instead, open your file with the desired encoding:
import codecs
with codecs.open("test.json", "w", encoding='utf-8') as outfile:
    json.dump(fdata, outfile, indent=2, ensure_ascii=False)

I had a similar issue once, I've added this line at the top of my .py file and it worked.
# coding=utf-8

Related

Not able to read file due to unicode error in python

I'm trying to read a file and when I'm reading it, I'm getting a unicode error.
def reading_File(self, text):
    url_text = "Text1.txt"
    with open(url_text) as f:
        content = f.read()
Error:
content = f.read()  # Read the whole file
File "/home/soft/anaconda/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 404:
ordinal not in range(128)
Why is this happening? I'm trying to run the same on Linux system, but on Windows it runs properly.
According to the question,
i'm trying to run the same on Linux system, but on Windows it runs properly.
Since we know from the question and some of the other answers that the file's contents are neither ASCII nor UTF-8, it's a reasonable guess that the file is encoded with one of the 8-bit encodings common on Windows.
As it happens, 0x92 maps to the character 'RIGHT SINGLE QUOTATION MARK' in the cp125* encodings, used in US and Western European locales.
So the file should probably be opened like this:
# Python 3
with open(url_text, encoding='cp1252') as f:
    content = f.read()

# Python 2
import codecs
with codecs.open(url_text, encoding='cp1252') as f:
    content = f.read()
There can be two reasons for that to happen:
The file contains text encoded with an encoding different from 'ascii' and, according to your comments on other answers, 'utf-8'.
The file doesn't contain text at all, it is binary data.
In case 1 you need to figure out how the text was encoded and use that encoding to open the file:
open(url_text, encoding=your_encoding)
In case 2 you need to open the file in binary mode:
open(url_text, 'rb')
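If you don't know the encoding at all, a third-party library such as chardet can guess it from the raw bytes. An illustrative sketch (chardet must be installed separately; the fallback to utf-8 is an assumption):
import chardet  # third-party: pip install chardet

with open(url_text, 'rb') as f:  # read the raw bytes first
    raw = f.read()
guess = chardet.detect(raw)      # e.g. {'encoding': 'Windows-1252', 'confidence': 0.7, ...}
content = raw.decode(guess['encoding'] or 'utf-8', errors='replace')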
It looks like the default encoding here is ascii, while in Python 3 it's utf-8. The syntax below can be used to open the file with an explicit encoding:
open(file, encoding='utf-8')
Check your system default encoding,
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
If it's not UTF-8, reset the encoding of your system.
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
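From inside Python you can also inspect both codecs involved; a small sketch using only the standard library:
import sys
import locale

print(sys.getdefaultencoding())       # codec for implicit str/unicode conversions ('ascii' on Python 2)
print(locale.getpreferredencoding())  # codec open() falls back to when no encoding= is given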
You can use codecs.open to fix this issue with the correct encoding:
import codecs
with codecs.open(filename, 'r', 'utf8') as ff:
    content = ff.read()

Error when non-English characters present [duplicate]

I'm pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a Wordpress page).
It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?
Currently I'm converting everything to Unicode on the way in, joining it all together in a Python string, then doing:
import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))
There is an encoding error on the last line:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position
12286: ordinal not in range(128)
Partial solution:
This Python runs without an error:
row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))
But then if I open the actual text file, I see lots of symbols like:
Qur’an
Maybe I need to write to something other than a text file?
Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.
If your string is actually a unicode object, you'll need to encode it to a byte string (here UTF-8) before writing it to a file:
foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()
When you read that file again, you'll get a UTF-8-encoded byte string that you can decode back into a unicode object:
f = file('test', 'r')
print f.read().decode('utf8')
In Python 2.6+, you could use io.open() that is default (builtin open()) on Python 3:
import io
with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)
It might be more convenient if you need to write the text incrementally (you don't need to call unicode_text.encode(character_encoding) multiple times). Unlike the codecs module, the io module has proper universal newlines support.
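For example, a small sketch of incremental writing (the file name and chunks are made up for illustration):
import io

with io.open('out.txt', 'w', encoding='utf-8') as f:
    for chunk in (u'Δ ', u'叶 ', u'말\n'):
        f.write(chunk)  # io encodes each chunk for you; no chunk.encode(...) needed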
Unicode string handling is already standardized in Python 3.
Characters are already stored as Unicode (32-bit) in memory.
You only need to open the file in utf-8.
(The 32-bit Unicode to variable-byte-length utf-8 conversion is performed automatically on the way from memory to file.)
out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
fobj = open("t1.txt", "w", encoding="utf-8")
fobj.write(out1)
fobj.close()
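Reading it back works the same way; decoding from utf-8 to in-memory Unicode happens automatically (a small sketch):
with open("t1.txt", "r", encoding="utf-8") as fobj:
    print(fobj.read())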
Preface: will your viewer work?
Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example, Notepad.
Writing Unicode text to a text file?
In Python 2, use open from the io module (this is the same as the builtin open in Python 3):
import io
Best practice, in general, use UTF-8 for writing to files (we don't even have to worry about byte-order with utf-8).
encoding = 'utf-8'
utf-8 is the most modern and universally usable encoding - it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.
On Windows, you might try utf-16le if you're limited to viewing output in Notepad (or another limited viewer).
encoding = 'utf-16le' # sorry, Windows users... :(
And just open it with the context manager and write your unicode characters out:
with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)
Example using many Unicode characters
Here's an example that attempts to map every possible character up to three bytes wide (4 is the max, but that would be going a bit far) from its integer representation to an encoded printable output, along with its name, if possible (put this into a file called uni.py):
from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter
try:  # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError:  # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3):
        try:
            char = unicode_chr(x)
        except ValueError:
            continue  # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue  # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError:  # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1} {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1} {2} {3}\n'.format(x, cat, char, uname))

# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))
This should run in the order of about a minute, and you can view the data file, and if your file viewer can display unicode, you'll see it. Information about the categories can be found here. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.
$ python uni.py
It will display the hexadecimal code, the category, the symbol itself (unless the name lookup fails, in which case it's probably a control character and the symbol is omitted), and the name of the symbol.
I recommend less on Unix or Cygwin (don't print/cat the entire file to your output):
$ less unidata
E.g., it will display lines similar to the following, which I sampled from it using Python 2 (Unicode 5.2):
0 Cc NUL
20 Zs SPACE
21 Po ! EXCLAMATION MARK
b6 So ¶ PILCROW SIGN
d0 Lu Ð LATIN CAPITAL LETTER ETH
e59 Nd ๙ THAI DIGIT NINE
2887 So ⢇ BRAILLE PATTERN DOTS-1238
bc13 Lo 밓 HANGUL SYLLABLE MIH
ffeb Sm → HALFWIDTH RIGHTWARDS ARROW
My Python 3.5 from Anaconda has Unicode 8.0; I would presume most Python 3 versions would.
The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn't unicode; you take unicode and encode it in iso-8859-1 yourself. That's what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type.)
You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
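A minimal sketch of both options (the all_html content is made up for illustration):
import codecs

all_html = u'Qur\u2019an<br/>'  # hypothetical unicode content

# Option 1: plain open(), encode the unicode yourself
f = open('out.txt', 'w')
f.write(all_html.encode('utf-8'))
f.close()

# Option 2: codecs.open() with an encoding, write the unicode object as-is
f = codecs.open('out.txt', mode='w', encoding='utf-8')
f.write(all_html)
f.close()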
How to print unicode characters into a file:
Save this to file: foo.py:
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')
Run it and pipe output to file:
python foo.py > tmp.txt
Open tmp.txt and look inside, you see this:
el#apollo:~$ cat tmp.txt
e with obfuscation: é
Thus you have saved a unicode e with an obfuscation mark on it to a file.
That error arises when you try to encode a non-unicode string: it tries to decode it, assuming it's in plain ASCII. There are two possibilities:
You're encoding it to a bytestring, but because you've used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try: f.write(all_html) instead.
all_html is not, in fact, a unicode object. When you do .encode(...), it first tries to decode it.
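A minimal Python 2 sketch of that implicit decode (the byte string is made up for illustration):
s = '\xc3\xa9'           # UTF-8 bytes for é: already an encoded str, not a unicode object
s.encode('iso-8859-1')   # Python 2 first tries s.decode('ascii'), which raises UnicodeDecodeError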
In case of writing in Python 3:
>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data
'batsà'
In case of writing in Python 2:
>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
To avoid this error you would have to encode it to bytes using codecs "utf-8" like this:
>>> f.write(a.encode("utf-8"))
>>> f.close()
and decode the data while reading using the codecs "utf-8":
>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'
Also, if you print the unicode string, it is automatically encoded for your terminal, like this:
>>> print a
batsà

How to create a temporary file with Unicode encoding?

When I use open() to open a file, I am not able to write unicode strings. I have learned that I need to use codecs and open the file with Unicode encoding (see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data).
Now I need to create some temporary files. I tried to use the tempfile library, but it doesn't have any encoding option. When I try to write any unicode string in a temporary file with tempfile, it fails:
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
    fh.write(u"Hello World: ä")
    fh.seek(0)
    for line in fh:
        print line
How can I create a temporary file with Unicode encoding in Python?
Edit:
I am using Linux and the error message that I get for this code is:
Traceback (most recent call last):
File "tmp_file.py", line 5, in <module>
fh.write(u"Hello World: ä")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 13: ordinal not in range(128)
This is just an example. In practice I am trying to write a string that some API returned.
Everyone else's answers are correct, I just want to clarify what's going on:
The difference between the literal 'foo' and the literal u'foo' is that the former is a string of bytes and the latter is a Unicode object.
First, understand that Unicode is the character set and UTF-8 is an encoding. The Unicode object is about the former: it's a Unicode string, not necessarily a UTF-8 one. In your case, the encoding used for a plain string literal will be UTF-8, because you specified it in the first lines of the file.
To get a byte string from a Unicode string, you call the .encode() method; since this source file is UTF-8, the encoded result equals the plain byte-string literal:
>>> u"ひらがな".encode("utf-8") == "ひらがな"
True
Similarly, you could call .encode() on your string inside the write call and achieve the same effect as just removing the u.
If you didn't specify the encoding in the top, say if you were reading the Unicode data from another file, you would specify what encoding it was in before it reached a Python string. This would determine how it would be represented in bytes (i.e., the str type).
The error you're getting, then, is only because the tempfile module is expecting a str object. This doesn't mean it can't handle unicode, just that it expects you to pass in a byte string rather than a Unicode object—because without you specifying an encoding, it wouldn't know how to write it to the temp file.
tempfile.TemporaryFile has encoding option in Python 3:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile(mode='w+', encoding='utf-8') as fh:
    fh.write("Hello World: ä")
    fh.seek(0)
    for line in fh:
        print(line)
Note that now you need to specify mode='w+' instead of the default binary mode. Also note that string literals are implicitly Unicode in Python 3, there's no u modifier.
If you're stuck with Python 2.6, temporary files are always binary, and you need to encode the Unicode string before writing it to the file:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
    fh.write(u"Hello World: ä".encode('utf-8'))
    fh.seek(0)
    for line in fh:
        print line.decode('utf-8')
Unicode specifies the character set, not the encoding, so in either case you need a way to specify how to encode the Unicode characters!
Since I am working on a Python program with TemporaryFile objects that should run in both Python 2 and Python 3, I don't find it satisfactory to manually encode all strings written as UTF-8 like the other answers suggest.
Instead, I have written the following small polyfill (because I could not find something like it in six) to wrap a binary file-like object into a UTF-8 file-like object:
from __future__ import unicode_literals
import sys
import codecs
if sys.hexversion < 0x03000000:
    def uwriter(fp):
        return codecs.getwriter('utf-8')(fp)
else:
    def uwriter(fp):
        return fp
It is used in the following way:
# encoding: utf-8
from tempfile import NamedTemporaryFile
with uwriter(NamedTemporaryFile(suffix='.txt', mode='w')) as fp:
    fp.write('Hællo wörld!\n')
I have figured out one solution: create a temporary file that is not automatically deleted with tempfile, close it and open it again using codecs:
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile
f = tempfile.NamedTemporaryFile(delete=False)
filename = f.name
f.close()
with codecs.open(filename, 'w+b', encoding='utf-8') as fh:
    fh.write(u"Hello World: ä")
    fh.seek(0)
    for line in fh:
        print line
os.unlink(filename)
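On Python 3 this workaround isn't needed, because NamedTemporaryFile itself accepts a text mode and an encoding. A small sketch:
import tempfile

with tempfile.NamedTemporaryFile(mode='w+', encoding='utf-8', suffix='.txt') as fh:
    fh.write(u"Hello World: ä")
    fh.seek(0)
    print(fh.read())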
You are trying to write a unicode object (u"...") to the temporary file, where you should use an encoded byte string ("..."). You don't have to call .encode() explicitly, because the encoding is already stated in line two ("# -*- coding: utf-8 -*-"), so the plain literal is already UTF-8 bytes. Just use fh.write("Hello World: ä") instead of fh.write(u"Hello World: ä") and you should be fine.
Dropping the u made your code work for me:
fh.write("Hello World: ä")
I guess it's because the plain literal is already an encoded byte string rather than a unicode object.
Setting the default encoding to UTF-8 via sys.setdefaultencoding will fix the encoding issue:
import sys
reload(sys)
sys.setdefaultencoding('utf-8') #set to utf-8 by default this will solve the errors
import tempfile
with tempfile.TemporaryFile() as fh:
    fh.write(u"Hello World: ä")
    fh.seek(0)
    for line in fh:
        print line

Write to UTF-8 file in Python

I'm really confused with the codecs.open function. When I do:
file = codecs.open("temp", "w", "utf-8")
file.write(codecs.BOM_UTF8)
file.close()
It gives me the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
0: ordinal not in range(128)
If I do:
file = open("temp", "w")
file.write(codecs.BOM_UTF8)
file.close()
It works fine.
The question is: why does the first method fail, and how do I insert the BOM?
If the second method is the correct way of doing it, what's the point of using codecs.open(filename, "w", "utf-8")?
I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on "I'm meant to be writing Unicode as UTF-8-encoded text, but you've given me a byte string!"
Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:
import codecs
file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()
(That seems to give the right answer - a file with bytes EF BB BF.)
EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is a better one than explicitly writing the BOM yourself, but I'll leave this answer here as it explains what was going wrong before.
Read the following: http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig
Do this
with codecs.open("test_output", "w", "utf-8-sig") as temp:
temp.write("hi mom\n")
temp.write(u"This has ♭")
The resulting file is UTF-8 with the expected BOM.
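Reading it back with the same codec strips the BOM again (a small sketch):
with codecs.open("test_output", "r", "utf-8-sig") as temp:
    print(temp.read())  # the BOM is consumed by the 'utf-8-sig' decoder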
It is very simple, no extra library needed (Python 3):
with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
@S-Lott gives the right procedure, but expanding on the Unicode issues, the Python interpreter can provide more insight.
Jon Skeet is right (unusually) about the codecs module - it contains byte strings:
>>> import codecs
>>> codecs.BOM
'\xff\xfe'
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>>
Picking another nit, the BOM has a standard Unicode name, and it can be entered as:
>>> bom= u"\N{ZERO WIDTH NO-BREAK SPACE}"
>>> bom
u'\ufeff'
It is also accessible via unicodedata:
>>> import unicodedata
>>> unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
u'\ufeff'
>>>
I use the *nix file command to convert a file of unknown charset into a utf-8 file:
# -*- encoding: utf-8 -*-
# converting a file of unknown encoding to utf-8
import codecs
import commands
file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')
for l in file_stream:
    file_output.write(l)
file_stream.close()
file_output.close()
Python 3.5 and later, using pathlib:
import pathlib
pathlib.Path("text.txt").write_text(text, encoding='utf-8') #or utf-8-sig for BOM
If you are using Pandas I/O methods like DataFrame.to_excel(), add an encoding parameter, e.g.
df.to_excel("somefile.xlsx", sheet_name="export", encoding='utf-8')
This works for most international characters I believe.
