With Python 2.7 I am reading text as Unicode and writing it as UTF-16-LE. Most characters are interpreted correctly, but some are not, for example u'\u810a', also known as unichr(33034). The following code does not write correctly:
import codecs
with open('temp.txt', 'w') as temp:
    temp.write(codecs.BOM_UTF16_LE)
    text = unichr(33034)  # text = u'\u810a'
    temp.write(text.encode('utf-16-le'))
But either of the following changes, when substituted above, makes the code work:
unichr(33033) and unichr(33035) work correctly.
'utf-8' encoding (without BOM, byte-order mark).
How can I recognize characters that won't write correctly, and how can I write a 'utf-16-le' encoded file with BOM that either prints these characters or some replacement?
You are opening the file in text mode, which means that line-break characters/bytes will be translated to the local convention. Unfortunately the character you are trying to write includes a byte, 0A, that is interpreted as a line break and does not make it to the file correctly.
Open the file in binary mode instead:
open('temp.txt','wb')
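Applied to the snippet from the question, the only change needed is the mode (a minimal sketch, assuming Python 2.7):

import codecs

with open('temp.txt', 'wb') as temp:
    # Binary mode: the 0A byte inside the UTF-16-LE encoding of u'\u810a'
    # is written as-is instead of being treated as a line break.
    temp.write(codecs.BOM_UTF16_LE)
    text = unichr(33034)  # u'\u810a'
    temp.write(text.encode('utf-16-le'))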
Joni's answer identifies the root of the problem, but if you use codecs.open instead, it always opens in binary mode, even if binary mode is not specified. Using the utf16 codec also automatically writes the BOM using native endianness:
import codecs
with codecs.open('temp.txt', 'w', 'utf16') as temp:
    temp.write(u'\u810a')
Hex dump of temp.txt:
FF FE 0A 81
Reference: codecs.open
You're already using the codecs library. When working with that file, swap open() for codecs.open() to handle the encoding transparently.
import codecs
with codecs.open('temp.txt', 'w', encoding='utf-16-le') as temp:
    temp.write(unichr(33033))
    temp.write(unichr(33034))
    temp.write(unichr(33035))
If you have a problem after that, you might have an issue with your viewer, not your Python script.
Related
I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type in Capit\xc3\xa1n into my favorite editor, in file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
What I'm truly failing to grok here is what the point of the UTF-8 representation is if you can't actually get Python to recognize it when it comes from outside. Maybe I should just JSON-dump the string and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode when it comes in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1l\n\n'
In 3.x, io.open is an alias for the built-in open function; the built-in open supports the encoding argument in 3.x but not in 2.x.
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1l\n\n'
Note, however, that this can cause problems when mixing read() and readline().
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.
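You can check that each \xNN escape stands for a single character by looking at the length (a quick illustration in a Python 2 interpreter):

>>> s = u'Capit\xe1n'
>>> len(s)  # 7 characters: C, a, p, i, t, á, n
7
>>> s[5]
u'\xe1'
>>> print s[5]  # assuming your terminal can display it
á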
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1. Those are 8 bytes and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
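For example, chaining the two decodes in 2.x recovers the unicode object (a small sketch):

# Python 2.x
>>> 'Capit\\xc3\\xa1n'.decode('string_escape').decode('utf-8')
u'Capit\xe1n'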
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Now all you need in Python 3 is open(filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]
Python 3 added the encoding parameter to its open function. The following information about the open function is gathered from the documentation: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding()
returns), but any text encoding supported by Python can be used.
See the codecs module for the list of supported encodings.
So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8 (UTF-8 is also the default source encoding in Python 3).
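For example, a round trip in Python 3 might look like this (a minimal sketch; the file name is arbitrary):

# Python 3.x
with open('capitan.txt', 'w', encoding='utf-8') as f:
    f.write('Capitán\n')

with open('capitan.txt', encoding='utf-8') as f:
    print(f.read())  # Capitán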
So, I've found a solution for what I'm looking for, which is:
print open('f2').read().decode('string-escape').decode("utf-8")
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
This works for reading a file with UTF-8 encoding in Python 3.2:
import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)
# -*- encoding: utf-8 -*-
# Converting a file of unknown encoding to UTF-8 (Python 2; relies on the Unix `file` command)
import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location + "b", 'w', 'utf-8')

for l in file_stream:
    file_output.write(l)

file_stream.close()
file_output.close()
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)

with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()

assert text == text2
To read in a Unicode string and then send it to HTML, I did this:
fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')
Useful for Python-powered HTTP servers.
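For example (a small illustration in Python 2; the numeric character reference &#225; is what the browser renders back as á):

# Python 2.x
>>> u'Capit\xe1n'.encode('ascii', 'xmlcharrefreplace')
'Capit&#225;n'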
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3, etc. in your file.
If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
<?xml encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here. The file f2 contains, in hex:
0000000: 4361 7069 745c 7863 335c 7861 316e Capit\xc3\xa1n
codecs.open('f2', 'rb', 'utf-8'), for example, reads them all in as separate characters (expected). Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
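A small sketch of those two options in Python 2 (unicode_escape is used here only as one way to produce the escaped, ASCII-safe form; it is not the only choice):

# Python 2.x
>>> u'Capit\xe1n'.encode('unicode_escape')  # ASCII-safe escape form, as in f2
'Capit\\xe1n'
>>> u'Capit\xe1n'.encode('utf-8')           # 8-bit UTF-8 form, needs an 8-bit safe stream
'Capit\xc3\xa1n'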
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().
Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no means to know that, you must tell it by specifying the encoding which was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.
How you actually enter UTF-8-encoded non-ASCII text depends on your OS and/or your editor. Here's how you do it in Windows. On OS X, to enter a with an acute accent you can just hit Option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
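After that assignment, existing call sites keep working unchanged but now return unicode (a quick illustration, assuming 'f1' is the UTF-8 file from earlier):

# Python 2.x, with open already replaced as above
>>> f = open('f1')   # really codecs.open('f1', encoding='utf-8')
>>> f.read()
u'Capit\xe1n\n'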
I was trying to parse iCal using Python 2.7.9:
from icalendar import Calendar
But I was getting:
Traceback (most recent call last):
  File "ical.py", line 92, in parse
    print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
and it was fixed with just:
print "{}".format(e[attr].encode("utf-8"))
(Now it can print liké á böss.)
I found the simplest approach was to change the default encoding of the whole script to 'UTF-8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Any open, print or other statement will then just use UTF-8.
It works at least for Python 2.7.9.
Thanks go to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).
I have a very simple piece of code that's converting a CSV. Also note that I reference Notepad++ a few times, but my standard IDE is VS Code.
with codecs.open(filePath, "r", encoding = "UTF-8") as sourcefile:
lines = sourcefile.read()
with codecs.open(filePath, 'w', encoding = 'cp1252') as targetfile:
targetfile.write(lines)
Now the job I'm doing requires a specific file to be encoded as windows-1252, and from what I understand cp1252 = windows-1252. This conversion works fine when I do it using the UI features in Notepad++, but when I try using Python codecs to encode this file, it fails with:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 561488: character maps to <undefined>
When I saw this failure I was confused, so I double-checked the output from when I manually convert the file using Notepad++, and the converted file is encoded in windows-1252. So what gives? Why is a UI feature in Notepad++ able to do the job when codecs seems not to be? Does Notepad++ just ignore errors?
Looks like your input text has the character "�" (the actual "replacement character" placeholder, not some other undefined character), which cannot be mapped to cp1252, because cp1252 has no equivalent for it.
Depending on what you need, you can:
Filter it out (or replace it, or otherwise handle it) in Python before writing out lines to the output file.
Pass errors=... to the second codecs.open, choosing one of the other error-handling modes; the default is 'strict', but you can also use 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' or 'namereplace' (see the sketch after this list).
Check the input file and see why it's got the "�" character; is it corrupted?
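For example, option 2 might look like this (a minimal sketch; with errors='replace', anything cp1252 cannot represent, including U+FFFD, is written as '?'):

import codecs

with codecs.open(filePath, "r", encoding="UTF-8") as sourcefile:
    lines = sourcefile.read()

with codecs.open(filePath, 'w', encoding='cp1252', errors='replace') as targetfile:
    targetfile.write(lines)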
Probably Python is simply more explicit in its error handling. If Notepad++ managed to represent every character correctly in CP-1252 then there is a bug in the Python codec where it should not fail where it currently does; but I'm guessing Notepad++ is silently replacing some characters with some other characters, and falsely claiming success.
Maybe try converting the result back to UTF-8 and compare the files byte by byte if the data is not easy to inspect manually.
Unicode U+FFFD is a reserved character which serves as a placeholder for a character which could not be represented; often, it's an indication of an earlier conversion problem, when presumably this data was imperfectly input or converted at an earlier point in time.
(And yes, Windows-1252 is another name for Windows code page 1252.)
Why notepad++ "succeeds"
Notepad++ does not offer to convert your file to cp1252, but to reinterpret it using this encoding. What led to your confusion is that the program's Encoding menu actually uses the wrong term for this.
When "Encode with cp1252" is selected, Notepad decodes the file using cp1252 and shows you the result. If you save the character '\ufffd' to a file using utf8:
with open('f.txt', 'w', encoding='utf8') as f:
f.write('\ufffd')`
and use "Encode with cp1252" you'd see three characters:
That means that Notepad++ does not read the character in utf8 and then write it in cp1252, because then you'd see exactly one character. You could achieve similar results to Notepad++ by reading the file using cp1252:

with open('f.txt', 'r', encoding='cp1252') as f:
    print(f.read())  # Prints ï¿½
Notepad++ lets you actually convert to only five encodings, as its Encoding menu shows.
What should you do
This character does not exist in the cp1252 encoding, which means you can't convert this file without losing information. Common solutions are to skip such characters or to replace them with other, similar characters that exist in your encoding (see encoding error handlers).
You are dealing with the "utf-8-sig" encoding -- please specify this one as the encoding argument instead of "utf-8".
There is information on it in the docs (search the page for "utf-8-sig").
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. [...]
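A small sketch of the difference, assuming a file that starts with the UTF-8 BOM (Python 3; the file name is arbitrary):

with open('bom.txt', 'wb') as f:
    f.write(b'\xef\xbb\xbfhello')  # UTF-8 BOM followed by text

with open('bom.txt', encoding='utf-8') as f:
    print(repr(f.read()))  # '\ufeffhello' -- the BOM is left in the data

with open('bom.txt', encoding='utf-8-sig') as f:
    print(repr(f.read()))  # 'hello' -- the BOM is stripped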
I have a file with Unicode Japanese writing in it and I want to convert it to Shift-JIS and print it out to Shift-JIS encoded file. I do this:
with open("unikanji.txt", 'rb') as unikanjif:
unikanji = unikanjif.read()
sjskanji = unikanji.decode().encode('shift-jis')
with open("kanji.txt", 'wb') as sjskanjif:
sjskanjif.write(sjskanji)
It works except that when I open kanji.txt it always opens as an ANSI file, not Shift-JIS, and I see miscellaneous characters instead of Japanese. If I manually change the file encoding to Shift-JIS, then those characters turn into the right Japanese characters. How do I make my program create the file as Shift-JIS to begin with?
"ANSI" is Microsoft's term for the default, localized encoding, which varies according to the localized version of Windows used. A Microsoft program like Notepad assumes "ANSI" for the encoding of a text file unless it starts with a byte order mark. Microsoft Notepad recogizes UTF-8, UTF-16LE and UTF-16BE BOMs.
Shift-JIS is a localized encoding, so you have to use an editor such as Notepad++ and manually configure it to Shift-JIS, as you have discovered. The file as you have written it is Shift-JIS-encoded, but unless the editor you use has some heuristic to detect the encoding it will have to be manually configured. You could also use Japanese Windows or change your localization default in your current Windows version and Shift-JIS might be the ANSI default.
By the way, converting encodings can be a little more straightforward. The code below assumes the original file is UTF-8 and the target file will be Shift-JIS. utf-8-sig automatically handles and removes a byte order mark, if present.
with open('unikanji.txt', encoding='utf-8-sig') as f:
    text = f.read()

with open('kanji.txt', 'w', encoding='shift-jis') as f:
    f.write(text)
I thought this was a caveat of a Unicode world: you cannot correctly process a byte stream as text without knowing what the encoding is. If you assume an encoding, then you might get valid-looking but incorrect characters showing up.
Here's a test - a file with the text:
hi1
hi2
stored on disk with a 2-byte Unicode encoding (UTF-16-LE with a byte order mark).
Windows newline characters are \r\n, stored in this encoding as the four-byte sequence 0D 00 0A 00. Open it in Python 2 with default settings - I think it's expecting ASCII, one byte per character (or just a stream of bytes) - and it reads:
>>> open('d:/t/hi2.txt').readlines()
['\xff\xfeh\x00i\x001\x00\r\x00\n',
'\x00h\x00i\x002\x00']
It's not decoding two bytes into one character, yet the four byte line ending sequence has been detected as two characters, and the file has been correctly split into two lines.
Presumably, then, Windows opened the file in 'text mode', as described here: Difference between files written in binary and text mode
and fed the lines to Python. But how did Windows know the file was multibyte encoded, and to look for four-bytes of newlines, without being told, as per the caveat at the top of the question?
Does Windows guess, with a heuristic - and therefore can be wrong?
Is there more cleverness in the design of Unicode, something which makes Windows newline patterns unambiguous across encodings?
Is my understanding wrong, and there is a correct way to process any text file without being told the encoding beforehand?
The result in this case has nothing to do with Windows or the standard I/O implementation of Microsoft's C runtime. You'll see the same result if you test this in Python 2 on a Linux system. It's just how file.readlines (2.7.12 source link) works in Python 2. See line 1717, p = (char *)memchr(buffer+nfilled, '\n', nread) and then line 1749, line = PyString_FromStringAndSize(q, p-q). It naively consumes up to a \n character, which is why the actual UTF-16LE \n\x00 sequence gets split up.
If you had opened the file using Python 2's universal newlines mode, e.g. open('d:/t/hi2.txt', 'U'), the \r\x00 sequences would naively be translated to \n\x00. The result of readlines would instead be ['\xff\xfeh\x00i\x001\x00\n, \x00\n', '\x00h\x00i\x002\x00'].
Thus your initial supposition is correct. You need to know the encoding, or at least know to look for a Unicode BOM (byte order mark) at the start of the file, such as \xff\xfe, which indicates UTF-16LE (little endian). To that end I recommend using the io module in Python 2.7, since it properly handles newline translation. codecs.open, on the other hand, requires binary mode on the wrapped file and ignores universal newline mode:
>>> codecs.open('test.txt', 'U', encoding='utf-16').readlines()
[u'hi1\r\n', u'hi2']
io.open returns a TextIOWrapper that has built-in support for universal newlines:
>>> io.open('test.txt', encoding='utf-16').readlines()
[u'hi1\n', u'hi2']
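If you do need to pick the encoding yourself, a minimal BOM check might look like the sketch below (a hypothetical open_text helper, not a full detector; files without a BOM stay ambiguous and fall back to a default):

import codecs
import io

def open_text(path, default='utf-8'):
    # Peek at the first bytes and choose an encoding based on a BOM, if any.
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        encoding = 'utf-16'     # the utf-16 codec consumes the BOM itself
    elif head.startswith(codecs.BOM_UTF8):
        encoding = 'utf-8-sig'  # strips the UTF-8 BOM on read
    else:
        encoding = default
    return io.open(path, encoding=encoding)

# open_text('test.txt').readlines() would then give [u'hi1\n', u'hi2']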
Regarding Microsoft's CRT, it defaults to ANSI text mode. Microsoft's ANSI codepages are supersets of ASCII, so the CRT's newline translation will work for files encoded with an ASCII compatible encoding such as UTF-8. On the other hand, ANSI text mode doesn't work for a UTF-16 encoded file, i.e. it doesn't remove the UTF-16LE BOM (\xff\xfe) and doesn't translate newlines:
>>> open('test.txt').read()
'\xff\xfeh\x00i\x001\x00\r\x00\n\x00h\x00i\x002\x00'
Thus using standard I/O text mode for a UTF-16 encoded file requires the non-standard ccs flag, e.g. fopen("d:/t/hi2.txt", "rt, ccs=UNICODE"). Python doesn't support this Microsoft extension to the open mode, but it does make the CRT's low I/O (POSIX) _open and _read functions available in the os module. While it might surprise POSIX programmers, Microsoft's low I/O API also supports text mode, including Unicode. For example:
>>> O_WTEXT = 0x10000
>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 100)
'h\x00i\x001\x00\n\x00h\x00i\x002\x00'
>>> os.close(fd)
The O_WTEXT constant isn't made directly available in Windows Python because it's not safe to open a file descriptor with this mode as a Python file using os.fdopen. The CRT expects all wide-character buffers to be a multiple of the size of a wchar_t, i.e. a multiple of 2. Otherwise it invokes the invalid parameter handler that kills the process. For example (using the cdb debugger):
>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 7)
ntdll!NtTerminateProcess+0x14:
00007ff8`d9cd5664 c3 ret
0:000> k8
Child-SP RetAddr Call Site
00000000`005ef338 00007ff8`d646e219 ntdll!NtTerminateProcess+0x14
00000000`005ef340 00000000`62db5200 KERNELBASE!TerminateProcess+0x29
00000000`005ef370 00000000`62db52d4 MSVCR90!_invoke_watson+0x11c
00000000`005ef960 00000000`62db0cff MSVCR90!_invalid_parameter+0x70
00000000`005ef9a0 00000000`62db0e29 MSVCR90!_read_nolock+0x76b
00000000`005efa40 00000000`1e056e8a MSVCR90!_read+0x10d
00000000`005efaa0 00000000`1e0c3d49 python27!Py_Main+0x12a8a
00000000`005efae0 00000000`1e1146d4 python27!PyCFunction_Call+0x69
The same applies to _O_UTF8 and _O_UTF16.
First things first: open your file as text, indicating the correct encoding, and in explicit text mode.
If you are still using Python 2.7, use codecs.open instead of open. In Python 3.x, just use open:
import codecs
myfile = codecs.open('d:/t/hi2.txt', 'rt', encoding='utf-16')
And you should be able to work on it.
Second, here is what is likely going on there: since you did not specify that you were opening the file in binary mode, Windows opens it in "text" mode - Windows does know about the encoding, and thus can find the \r\n sequences in the lines - it reads the lines separately, performing the end-of-line translation - using UTF-16 - and passes those UTF-16 bytes to Python.
On the Python side, you could use these values, just decoding them to text:
[line.decode("utf-16" for line in open('d:/t/hi2.txt')]
instead of
open('d:/t/hi2.txt').readlines()
I've noticed recently that Python behaves in a non-obvious way when appending to a file using the utf-8-sig encoding. See below:
>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
The following text ends up in the file:
<BOM>123
<BOM>123
Isn't that a bug? This is so not logical.
Could anyone explain to me why it was done so?
Why don't they prepend the BOM only when the file doesn't exist and needs to be created?
No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file for example. The file would not be new, but it would not contain a BOM either.
Then there are other use-cases where the codec is used on a stream or bytestring (e.g. not with codecs.open()) where there is no file at all to test, or where the developer wants to enforce a BOM at the start of the output, always.
Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
If you are working directly with files, you can test for the start yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:
import io
with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file
        outfh.write(u'\ufeff')
I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and is more robust than codecs for handling encoded files, in my experience.
Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one Byte Order Mark. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed.
The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (e.g. not one of the legacy code pages).