Unicode Byte Order Mark (BOM) as a python constant?

It's not a real problem in practice, since I can just write BOM = "\uFEFF"; but it bugs me that I have to hard-code a magic constant for such a basic thing. [Edit: And it's error-prone! I had accidentally written the BOM as \uFFFE in this question, and nobody noticed. It even led to an incorrect proposed solution.] Surely Python defines it in a handy form somewhere?
Searching turned up a series of constants in the codecs module: codecs.BOM, codecs.BOM_UTF8, and so on. But these are bytes objects, not strings. Where is the real BOM?
This is for Python 3, but I would be interested in the Python 2 situation for completeness.

There isn't one. The bytes constants in codecs are what you should be using.
This is because you should not normally see a BOM in decoded text (i.e., you shouldn't encounter a string that actually contains the code point U+FEFF). Rather, the BOM exists as a byte pattern at the start of a stream, and when you decode those bytes with a BOM-aware codec (utf-8-sig, or utf-16/utf-32 without an explicit byte order), the U+FEFF isn't included in the output string. Note that the plain utf-8 codec does not strip it. Similarly, the encoding process should handle adding any necessary BOM to the output bytes; it shouldn't be in the input string.
The only time a BOM matters is when converting to or from bytes.
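A minimal Python 3 sketch of how the BOM behaves at the bytes/str boundary. The variable names are made up for illustration; the codec names and codecs.BOM_UTF8 are standard library:

```python
import codecs

# Bytes as they might arrive from a file that starts with a UTF-8 BOM.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain utf-8 codec keeps the BOM as a leading U+FEFF character...
plain = data.decode("utf-8")
# ...while utf-8-sig strips it on decode and adds it on encode.
stripped = data.decode("utf-8-sig")
with_bom = "hello".encode("utf-8-sig")

# If a str constant is really wanted, it can be derived from the bytes
# constant instead of being typed by hand.
BOM = codecs.BOM_UTF8.decode("utf-8")
```

Deriving the str from codecs.BOM_UTF8 avoids the \uFFFE/\uFEFF typo mentioned in the question, at the cost of one decode call.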

I suppose you could use:
import unicodedata
unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
but it's not as clean as what you already have.

Related

What endianness does Python use to write into files?

When using file.write() with the 'wb' flag, does Python use big or little endian, or the sys.byteorder value? How can I be sure that the endianness is not random? I am asking because I am mixing ASCII and binary data in the same file. For the binary data I use struct.pack() and force it to little endian, but I am not sure what happens to the ASCII data!
Edit 1: since the downvote, I'll explain my question further!
I am writing a file with ASCII and binary data on an x86 PC. The file will be sent over the network to another computer which is not x86 but a PowerPC, which is big-endian. How can I be sure that the data will be the same when parsed on the PowerPC?
Edit 2: still using Python 2.7

For multibyte data, struct.pack() follows the architecture of the machine by default (native byte order when the format has no prefix); file.write() itself just writes the bytes you give it, in order. If you need it to work cross-platform, then you'll want to force it.
ASCII uses a single byte per character, and UTF-8 is defined as a sequence of single bytes, so are they affected by the byte ordering? No.
Here is how to pack little < or big > endian:
>>> import struct
>>> struct.pack('<L', 1234)
'\xd2\x04\x00\x00'
>>> struct.pack('>L', 1234)
'\x00\x00\x04\xd2'
You can also encode strings as big or little endian this way if you are using UTF-16, as an example:
s.encode('utf-16LE')
s.encode('utf-16BE')
UTF-8 and ASCII do not have endianness, since their code unit is a single byte.
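The same points as a small Python 3 sketch (the answer above uses Python 2 reprs; the variable names here are only for illustration):

```python
import struct

value = 1234

# An explicit prefix fixes the byte order regardless of the host CPU.
little = struct.pack("<L", value)   # least significant byte first
big = struct.pack(">L", value)      # most significant byte first

# Unpacking with the same prefix recovers the value on any machine.
(from_little,) = struct.unpack("<L", little)
(from_big,) = struct.unpack(">L", big)

# UTF-16 has two byte orders; UTF-8 and ASCII have only one form.
utf16_le = "A".encode("utf-16-le")
utf16_be = "A".encode("utf-16-be")
utf8 = "A".encode("utf-8")
ascii_bytes = "A".encode("ascii")
```

As long as the writer and the reader agree on the prefix, it should not matter that one side is x86 and the other PowerPC.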

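file.write() does no byte swapping itself; anything you pack in native order uses sys.byteorder. So just check it:
import sys
if sys.byteorder == 'little':
    print('little-endian host')
else:
    print('big-endian host')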

Note: I assume Python 3.
Endianness is not a concern when writing ASCII or byte strings. The order of the bytes is already set by the order in which those bytes occur in the ASCII/byte string. Endianness is a property of encodings that map some value (e.g. a 16-bit integer or a Unicode code point) to several bytes. By the time you have a byte string, the endianness has already been decided and applied (by the source of the byte string).
If you were to write unicode strings to a file not opened in b mode, the result depends on how those strings are encoded (they are necessarily encoded, because the file system only accepts bytes). The encoding in turn depends on the file object, and possibly on the locale or environment variables (e.g. for the default sys.stdout). When this causes problems, the problems extend beyond just endianness. However, your file is binary, so you can't write unicode directly anyway; you have to explicitly encode and decode. Do this with any fixed encoding and there won't be endianness issues, as an encoding's endianness is fixed and part of the definition of the encoding (for UTF-16, that means choosing utf-16-le or utf-16-be rather than plain utf-16, which writes a BOM in the machine's native order).
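A rough Python 3 sketch of a file layout that mixes ASCII text and binary numbers with every byte order spelled out. The record format (2-byte length, ASCII label, 4-byte value) and the function names write_record and read_record are invented for this example:

```python
import io
import struct

def write_record(stream, label, value):
    """Write a length-prefixed ASCII label followed by a 32-bit value."""
    encoded = label.encode("ascii")              # one byte per character
    stream.write(struct.pack("<H", len(encoded)))  # little-endian length
    stream.write(encoded)
    stream.write(struct.pack("<L", value))       # little-endian payload

def read_record(stream):
    """Read back one record written by write_record, on any architecture."""
    (length,) = struct.unpack("<H", stream.read(2))
    label = stream.read(length).decode("ascii")
    (value,) = struct.unpack("<L", stream.read(4))
    return label, value

# io.BytesIO stands in for a file opened with 'wb' / 'rb'.
buffer = io.BytesIO()
write_record(buffer, "temp", 1234)
raw = buffer.getvalue()
buffer.seek(0)
record = read_record(buffer)
```

The ASCII label goes through unchanged; only the struct-packed fields need the explicit "<" prefix.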

Bulletproof work with encoding in Python

A question about unicode in Python 2.
As far as I know, I should always decode everything that I read from outside (files, network). decode converts outer bytes to internal Python strings using the charset specified in its parameters. So decode("utf8") means that outside bytes are unicode string and they will be decoded to python strings.
Also, I should always encode everything that I write to the outside. I specify the encoding in the parameters of the encode function, and it converts the string to that encoding for writing.
These statements are right, ain't they?
But sometimes when I parse HTML documents I get decode errors. As I understand it, the document is in another encoding (for example cp1252), and the error happens when I try to decode it as utf8. So the question is: how do I write a bulletproof application?
I found that chardet is a good library for guessing the encoding, and that this is the only way to write bulletproof applications. Right?

... decode("utf8") means that outside bytes are unicode string and they will be decoded to python strings.
...
These statements are right, ain't they?
No, outside bytes are binary data, they are not a unicode string. So <str>.decode("utf8") will produce a Python unicode object by interpreting the bytes in <str> as UTF-8; it may raise an error if the bytes cannot be decoded as UTF-8.
Determining the encoding of any given document is not necessarily a simple task. You either need to have some external source of information that tells you the encoding, or you need to know something about what is in the document. For example, if you know that it is an HTML document with its encoding specified internally, then you can parse the document using an algorithm like the one outlined in the HTML Standard to find the encoding and then use that encoding to parse the document (it's a two-pass operation). However, just because an HTML document specifies an encoding does not mean that it can be decoded with that encoding. You may still get errors if the data is corrupt or if the document was not encoded properly in the first place.
There are libraries such as chardet (I see you mentioned it already) that will try to guess the encoding of a document for you (it's only a guess, not necessarily correct). But they can have their own issues such as performance, and they may not recognize the encoding of your document.

Try wrapping your decode calls in try/except blocks.
Try decoding as utf-8:
Catch the exception if it is not utf-8:
If an exception is raised, try the next encoding:
etc, etc...
Make it a function that returns the decoded string when (and if) it finds an encoding that doesn't raise, and returns None or an empty string when it exhausts its list of encodings and the last exception is raised.
Like the others said, the encoding should be recorded somewhere, so check that first.
Not efficient, and frankly due to my skill level, may be way off, but to my newbie mind, it may alleviate some of the problems when dealing with unknown or undocumented encoding.
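The steps above could look roughly like this in Python 3. The function name decode_with_fallback and the default list of encodings are assumptions made for this sketch, not a recommendation:

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes cleanly.

    Returns (None, None) if every candidate raises UnicodeDecodeError.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    return None, None

# 'fähre' stored as cp1252 is not valid UTF-8, so the second codec wins.
text, used = decode_with_fallback("f\xe4hre".encode("cp1252"))

# Byte 0x81 is invalid in UTF-8 and undefined in Python's cp1252 codec.
failed = decode_with_fallback(b"\x81")
```

Keep in mind that a decode which does not raise is not proof of the right encoding: single-byte code pages accept almost any input, so the order of the list matters and strict codecs such as UTF-8 should come first.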

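Decode to unicode from cp437. Every byte value is defined in that code page, so the decode cannot fail and you can get from your bytes to unicode and back unchanged (although the text will only be meaningful if the data really is cp437).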

Do UTF-8 characters cover all encodings of ISO8859-xx and windows-12xx?

I am trying to write a generic document indexer from a bunch of documents with different encodings in python. I would like to know if it is possible to read all of my documents (that are encoded with utf-8,ISO8859-xx and windows-12xx) with utf-8 without character loss?
The reading part is as follows:
import codecs
fin = codecs.open(doc_name, "r", "utf-8")
doc_content = fin.read()

I'm going to rephrase your question slightly. I believe you are asking, "can I open a document and read it as if it were UTF-8, provided that it is actually intended to be ISO8859-xx or Windows-12xx, without loss?". This is what the Python code you've posted attempts to do.
The answer to that question is no. The Python code you posted will mangle the documents if they contain any characters above ordinal 127. This is because the "codepages" use the numbers from 128 to 255 to represent one character each, where UTF-8 uses that number range to build multibyte characters. So, if you incorrectly parse the file as UTF-8, each character in your document which is not in ASCII will either be rejected as an invalid sequence or be combined with the succeeding byte(s) to form a single UTF-8 codepoint.
As a concrete example, say your document is in Windows-1252. It contains the byte sequence 0xC3 0xAE, or "Ã®" (A-tilde, registered trademark sign). In UTF-8, that same byte sequence represents one character, "î" (small 'i' with circumflex). In Windows-874, that same sequence would be "รฎ". These are rather different strings - a moral insult could become an invitation to play chess, or vice versa. Meaning is lost.
Now, for a slightly different question - "can I losslessly convert my files from their current encoding to UTF-8?" or, "can I represent all the data from the current files as a UTF-8 bytestream?". The answer to these questions is (modulo a few fuzzy bits) yes. Unicode is designed to have a codepoint for every character in any previously existing codepage, and by and large has succeeded in this goal. There are a few rough edges, but you will likely be well-served by using Unicode as your common interchange format (and UTF-8 is a good choice for a representation thereof).
However, to effect the conversion, you must already know and state the format in which the files exist as they are being read. Otherwise Python will incorrectly deal with non-ASCII characters and you will badly damage your text (irreparably, in fact, if you discard either the invalid-in-UTF8 sequences or the origin of a particular wrongly-converted byte range).
In the event that the text is all, 100% ASCII, you can open it as UTF-8 without a problem, as the first 128 codepoints (0 to 127) are shared between the two representations.
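The concrete example above can be checked in a few lines of Python 3. The helper name to_utf8 is made up for this sketch; the codec names are the standard ones:

```python
raw = b"\xc3\xae"  # the two bytes from the example

# The same bytes read under three different assumptions.
as_cp1252 = raw.decode("cp1252")   # two characters
as_utf8 = raw.decode("utf-8")      # one character
as_cp874 = raw.decode("cp874")     # two Thai characters

def to_utf8(data, source_encoding):
    """Transcode bytes to UTF-8, given the encoding they were written in."""
    return data.decode(source_encoding).encode("utf-8")

# Lossless when the source encoding is stated correctly.
converted = to_utf8(b"f\xe4hre", "iso-8859-1")

# Reading the same ISO-8859-1 bytes as UTF-8 fails outright.
try:
    b"f\xe4hre".decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False
```

The conversion only needs one extra piece of information, the source encoding, but nothing in the bytes themselves supplies it.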

UTF-8 covers everything in Unicode. I don't know for sure whether ISO-8859-xx and Windows-12xx are entirely covered by Unicode, but I strongly suspect they are.
I believe there are some encodings which include characters which aren't in Unicode, but I would be fairly surprised if you came across those characters. Covering the whole of Unicode is "good enough" for almost everything - that's the purpose of Unicode, after all. It's meant to cover everything we could possibly need (which is why it's grown :)
EDIT: As noted, you have to know the encoding of the file yourself, and state it - you can't just expect files to magically be read correctly. But once you do know the encoding, you could convert everything to UTF-8.

You'll need to have some way of determining which character set the document uses. You can't just open each one as "utf-8" and expect it to get magically converted. Open it with the proper character set, then convert.
The best way to be sure would be to convert a large set of documents, then convert them back and do a comparison.
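A possible shape for that round-trip comparison, as a Python 3 sketch. The function name survives_round_trip and the sample data are assumptions for illustration:

```python
def survives_round_trip(data, source_encoding):
    """True if data decodes, passes through UTF-8, and encodes back unchanged."""
    text = data.decode(source_encoding)
    utf8_bytes = text.encode("utf-8")
    restored = utf8_bytes.decode("utf-8").encode(source_encoding)
    return restored == data

# Printable ASCII plus the upper half of ISO-8859-1.
sample = bytes(range(0x20, 0x7F)) + bytes(range(0xA0, 0x100))
ok = survives_round_trip(sample, "iso-8859-1")
```

A passing check shows that nothing was lost for that encoding; it does not show that the encoding you named was the right one for the document.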

"Broken" unicode strings encoded in UTF-8?

I have been studying unicode and its Python implementation now for two days, and I think I'm getting a glimpse of what it is about. Just to get confident, I'm asking if my assumptions for my current problems are correct.
In Django, forms give me unicode strings which I suspect to be "broken". Unicode strings in Python should be encoded in UTF-8, is that right? After entering the string "fähre" into a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked via Wireshark). When I retrieve the value via form.cleaned_data, I'm getting the string u'f\xe4hre' (note it is a unicode string), though. As far as I understand, that is an ISO-8859-1-encoded unicode string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8-encoded unicode string. Is that a Django bug or is there something wrong with my understanding of it?
To fix the issue, I wrote a function to apply it to any text input from Django forms:
def fix_broken_unicode(s):
    return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which does
>>> fix_broken_unicode(u'f\xe4hre')
u'f\xc3\xa4hre'
That doesn't seem very elegant to me, but setting Django's settings.DEFAULT_CHARSET to 'utf-8' didn't help, nor did anything else. I am trying to work with unicode throughout the whole application so I won't get any weird errors later on, but it obviously does not suffice to mark all strings with u'...'.
Edit: Considering the answers from Dirk and sth, I will now save the strings to the database as they are. The real problem was that I was trying to urlencode these kinds of strings to use them as input for the Twitter API etc. In GET or POST requests, though, UTF-8 encoding is obviously expected, which the standard urllib.urlencode() function does not process correctly (it throws exceptions). Take a look at my solution in the pastebin and feel free to comment on it also.

u'f\xe4hre' is a unicode string, not encoded as anything. The unicode codepoint 0xe4 is the character ä. It is not really important that ä would also be encoded as byte 0xe4 in ISO-8859-1.
The unicode string can contain any unicode characters without encoding them in some way. For example 轮渡 would be represented as u'\u8f6e\u6e21', which are simply two unicode codepoints. The UTF-8 encoding would be the much longer '\xe8\xbd\xae\xe6\xb8\xa1'.
So there is no need to fix the encoding, you are just seeing the internal representation of the unicode string.

Not exactly: after having been decoded, the unicode string is unicode, which means it may contain characters with codes beyond 255. How the interpreter represents these depends on the platform, but usually nowadays it uses character elements with a width of at least 16 bits. ISO-8859-1 is a proper subset of unicode. Thus, the string u'f\xe4hre' is actually proper; the \xe4 is a rendering artifact, since Python doesn't know if (and when) it is safe to include characters with codes beyond a certain range on the console.
UTF-8 is a transport encoding, that is, a special way to write unicode data such that it can be stored in "channels" with an element width of 8 bits per character/byte. In order to compute the proper "external" (or transport) encoding of a unicode string, you'd use the encode method, passing the desired representation. It returns a properly encoded byte string (as opposed to a unicode character string).
The reverse transformation is decode which takes a byte string and an encoding name and yields a unicode character string.
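The encode/decode relationship in a short sketch. This uses Python 3, where str is the unicode type and urllib.parse replaces the Python 2 urllib functions mentioned in the question; the variable names are illustrative only:

```python
from urllib.parse import quote, unquote, urlencode

text = "f\xe4hre"              # five code points, no encoding involved
wire = text.encode("utf-8")    # six bytes for transport
round_trip = wire.decode("utf-8")

# Percent-encoding operates on the UTF-8 bytes of the string.
escaped = quote(text)
unescaped = unquote("f%c3%a4hre")
query = urlencode({"q": text})

# Reinterpreting the UTF-8 bytes as ISO-8859-1, as the "fix" in the
# question does, yields a six-character mojibake string instead.
mangled = text.encode("utf-8").decode("iso-8859-1")
```

Encoding is only needed at the boundary (the URL, the socket, the file); inside the program the five-character string is already the correct value.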
