String change from Latin to ASCII - python

I have tried to change the format of strings from latin1 to ascii, and most of the strings were changed well except for some characters, æ ø. Æ, and Ø.
I have checked the characters were changed correctly when using R package (stringi::stri_trans_general(loc1, "latin-ascii) but Python's unicodedata package did not work well.
Is there any way to convert them correctly in Python? I guess it may need an additional dictionary.
For information, I have applied the following function to change the format:
unicodedata.normalize('NFKD', "Latin strings...").encode('latin1', 'ignore').decode('ascii')

It's important to understand a) what encodings and decodings are; b) how text works; and c) what unicode normalization does.
Strings do not have a "format" in the sense that you describe, so talking about converting from latin1 to ascii format does not make sense. The string has representations (what it looks like when you print it out; or what the code looks like when you create it directly in your code; etc.), and it can be encoded. latin1, ascii etc. are encodings - that means, rules that explain how to store your string as a raw sequence of bytes.
So if you have a string, it is not "in latin1 format" just because the source data was in latin1 encoding - it is not in any format, because that concept doesn't apply. It's just a string.
Similarly, we cannot ask for a string "in ascii format" that we convert to. We can ask for an ascii encoding of the string - which is a sequence of bytes, and not text. (That "not" is one of the most important "not"s in all of computer science, because many people, tools and programs will lie to you about this.)
Of course, the problem here is that ascii cannot represent all possible text. There are over a million "code points" that can theoretically be used as elements of a string (this includes a lot of really weird things like emoji). The latin-1 and ascii encodings both use a single byte per code point in the string. Obviously, this means they can't represent everything. Latin-1 represents only the first 256 possible code points, and ascii represents only the first 128. So if we have data that comes from a latin-1 source, we can get a string with those characters like Æ in it, which cause a problem in our encoding step.
The 'ignore' option for .encode makes the encoder skip things that can't be handled by the encoding. So if you have the string 'barentsøya', since the ø cannot be represented in ascii, it gets skipped and you get the bytes b'barentsya' (using the unfortunately misleading way that Python displays bytes objects back to you).
When you normalize a string, you convert the code points into some plain format that's easier to work with, and treats distinct ways of writing a character - or distinct ways of writing very similar characters - the same way. There are a few different normalization schemes. The NFKD chooses decomposed representations for accented characters - that is, instead of using a single symbol to represent a letter with an accent, it will use two symbols, one that represents the plain letter, and one representing the "combining" version of the accent. That might seem useful - for example, it would turn an accented A into a plain A and an accent character. You might think that you can then just encode this as ascii, let the accent characters be ignored, and get the result you want. However, it turns out that this is not enough, because of how the normalization works.
Unfortunately, I think the best you can do is to either use a third-party library (and please note that recommendations are off-topic for Stack Overflow) or build the look-up table yourself and just translate each character. (Have a look at the built-in string methods translate and maketrans for help with this.)

Related

Is json.dumps and json.loads safe to run on a list of any string?

Is there any danger in losing information when JSON serialising/deserialising lists of text in Python?
Given a list of strings lst:
lst = ['str1', 'str2', 'str3', ...]
If I run
lst2 = json.loads(json.dumps(lst))
Will lst always be exactly the same as lst2 (i.e. will lst == lst2 always result to True)? Or are there some special, unusual characters that would break either of these methods?
I'm curious because I'll be dealing with a lot of different and unusual characters from various Unicode ranges and I would like to be absolutely certain that this process is 100% robust.
Depends on what you mean by "exactly the same". We can identify three separate issues:
Semantic identity. What you read in is equivalent in meaning to what you write back out, as long as it's well-defined in the first place. Python (depending on version) might reorder dictionary keys, and will commonly prefer Unicode escapes over literal characters for some code points, and vice versa.
>>> json.loads(json.dumps("\u0050\U0001fea5\U0001f4a9"))
'P\U0001fea5💩'
Lexical identity. Nope. As shown above, the JSON representation of Unicode code points can get normalized in different ways, so that \u0050 gets turned into a literal P, and printable emoji may or may not similarly be turned into Unicode escapes, or vice versa.
(This is distinct from proper Unicode normalization, which makes sure that homoglyphs get turned into the same precise code point.)
Garbage in, same garbage out. Nope. If you have invalid input, Python will often tend to crash rather than pass it through, though you can modify some of this by catching errors and/or passing in flags to request less strict behavior.
>>> json.loads(r'"\u123"')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-4: truncated \uXXXX escape
>>> print(json.loads(r'"\udcff"'))
?
>>> #!? should probably crash or moan instead!
You seem to be asking about the first case, but the third can bite your behind badly if you don't make sure you have decided what to do with invalid Unicode data.
The second case would make a difference if you care about the JSON on disk being equivalent between versions; it doesn't seem to matter to you, but future visitors of this question might well care.
To some degree, yes, it should be safe. Note however that JSON is not defined in terms of byte strings, but rather it's defined in terms of Unicode text. That means before you do it json.parse, you need to decode that string first from whatever text encoding you're using. This Unicode encoding/decoding step may introduce inconsistencies.
The other implicit question you might have may be, will this process round trip. The answer to that is, it usually will, but it depends on the encoding/decoding process. Depending on the processing step, you may be normalising different characters that are considered equivalent in Unicode but composed using different code points. For example, accented characters like å may be encoded as composite characters using letter a and combining characters for the circle, or it may be encoded as the canonical code point of that character.
There's also the issue of JSON escape characters, which looks like "\u1234". Once decoded, Python doesn't preserve whether the characters is originally encoded using JSON escape or as Unicode character, so you'll lose that information as well and the text may not round trip fully in that case.
Apart from those issues in the deep corners of Unicode nerdery regarding equivalent characters and normalisation, encoding and decoding from/to JSON itself is pretty safe.

Do UTF-8 characters cover all encodings of ISO8859-xx and windows-12xx?

I am trying to write a generic document indexer from a bunch of documents with different encodings in python. I would like to know if it is possible to read all of my documents (that are encoded with utf-8,ISO8859-xx and windows-12xx) with utf-8 without character loss?
The reading part is as follows:
fin=codecs.open(doc_name, "r","utf-8");
doc_content=fin.read()
I'm going to rephrase your question slightly. I believe you are asking, "can I open a document and read it as if it were UTF-8, provided that it is actually intended to be ISO8869-xx or Windows-12xx, without loss?". This is what the Python code you've posted attempts to do.
The answer to that question is no. The Python code you posted will mangle the documents if they contain any characters above ordinal 127. This is because the "codepages" use the numbers from 128 to 255 to represent one character each, where UTF-8 uses that number range to proxy multibyte characters. So, each character in your document which is not in ASCII will be either interpreted as an invalid string or will be combined with the succeeding byte(s) to form a single UTF-8 codepoint, if you incorrectly parse the file as UTF-8.
As a concrete example, say your document is in Windows-1252. It contains the byte sequence 0xC3 0xAE, or "î" (A-tilde, registered trademark sign). In UTF-8, that same byte sequence represents one character, "ï" (small 'i' with diaresis). In Windows-874, that same sequence would be "รฎ". These are rather different strings - a moral insult could become an invitation to play chess, or vice versa. Meaning is lost.
Now, for a slightly different question - "can I losslessly convert my files from their current encoding to UTF-8?" or, "can I represent all the data from the current files as a UTF-8 bytestream?". The answer to these questions is (modulo a few fuzzy bits) yes. Unicode is designed to have a codepoint for every ideoglyph in any previously existing codepage, and by and large has succeeded in this goal. There are a few rough edges, but you will likely be well-served by using Unicode as your common interchange format (and UTF-8 is a good choice for a representation thereof).
However, to effect the conversion, you must already know and state the format in which the files exist as they are being read. Otherwise Python will incorrectly deal with non-ASCII characters and you will badly damage your text (irreparably, in fact, if you discard either the invalid-in-UTF8 sequences or the origin of a particular wrongly-converted byte range).
In the event that the text is all, 100% ASCII, you can open it as UTF-8 without a problem, as the first 127 codepoints are shared between the two representations.
UTF-8 covers everything in Unicode. I don't know for sure whether ISO-8859-xx and Windows-12xx are entirely covered by Unicode, but I strongly suspect they are.
I believe there are some encodings which include characters which aren't in Unicode, but I would be fairly surprised if you came across those characters. Covering the whole of Unicode is "good enough" for almost everything - that's the purpose of Unicode, after all. It's meant to cover everything we could possibly need (which is why it's grown :)
EDIT: As noted, you have to know the encoding of the file yourself, and state it - you can't just expect files to magically be read correctly. But once you do know the encoding, you could convert everything to UTF-8.
You'll need to have some way of determining which character set the document uses. You can't just open each one as "utf-8" and expect it to get magically converted. Open it with the proper character set, then convert.
The best way to be sure would be to convert a large set of documents, then convert them back and do a comparison.

Python String Comparison--Problems With Special/Unicode Characters

I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's a ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database spits out me Unicode strings and one gives ASCII. So that's probably the problems. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode(latin_1).encode(utf_8))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining since I'm trying to decode from ASCII. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. They could be encoded to 8-bit ASCII strings, or it could be encoded to 32-bit Chinese characters. But until it's time to convert to an encoded variable, it's just a Unicode string of characters.
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1')
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (ie, characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
print "same"
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().

Python convert mixed ASCII code to String

I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into plain string using python function?
Thanks heaps for your help
It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequence to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
If you just need to trim the '\x04\x08"\n', and it's always the same (you haven't put your question very clearly, I'm not certain if that's what it is or what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
string = string[len(to_trim):]

"Broken" unicode strings encoded in UTF-8?

I have been studying unicode and its Python implementation now for two days, and I think I'm getting a glimpse of what it is about. Just to get confident, I'm asking if my assumptions for my current problems are correct.
In Django, forms give me unicode strings which I suspect to be "broken". Unicode strings in Python should be encoded in UTF-8, is that right? After entering the string "fähre" into a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked via wireshark). When I retrieve the value via form.cleaned_data, I'm getting the string u'f\xa4hre' (note it is a unicode string), though. As far as I understand that, that is ISO-8859-1-encoded unicode string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8-encoded unicode string. Is that a Django bug or is there something wrong with my understanding of it?
To fix the issue, I wrote a function to apply it to any text input from Django forms:
def fix_broken_unicode(s):
return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which does
>>> fix_broken_unicode(u'f\xa4hre')
u'f\xc3\xa4hre'
That doesn't seem very elegant to me, but setting Django's settings.DEFAULT_CHARSET to 'utf-8' didn't help, nor did anything else. I am trying to work with unicode throughout the whole application so I won't get any weird errors later on, but it obviously does not suffice to mark all strings with u'...'.
Edit: Considering the answers from Dirk and sth, I will now save the strings to the database as they are. The real problem was that I was trying to urlencode these kinds of strings to use them as input for the Twitter API etc. In GET or POST requests, though, UTF-8 encoding is obviously expected which the standard urllib.urlencode() function does not process correctly (throws exceptions). Take a look at my solution in the pastebin and feel free to comment on it also.
u'f\xa4hre'is a unicode string, not encoded as anything. The unicode codepoint 0xa4 is the character ä. It is not really important that ä would also be encoded as byte 0xa4 in ISO-8859-1.
The unicode string can contain any unicode characters without encoding them in some way. For example 轮渡 would be represented as u'\u8f6e\u6e21', which are simply two unicode codepoints. The UTF-8 encoding would be the much longer '\xe8\xbd\xae\xe6\xb8\xa1'.
So there is no need to fix the encoding, you are just seeing the internal representation of the unicode string.
Not exactly: after having been decoded, the unicode string is unicode which means, it may contain characters with codes beyond 255. How the interpreter represents these depends on the platform, but usually nowadays it uses character elements with a width of at least 16 bits. ISO-8859-1 is a proper subset of unicode. Thus, the string u'f\xa4hre' is actually proper -- the \xa4 is a rendering artifact, since Python doesn't know if (and when) it is safe to include characters with codes beyond a certain range on the console.
UTF-8 is a transport encoding that is, a special way to write unicode data such, that it can be stored in "channels" with an element width of 8 bits per character/byte. In order to compute the proper "external" (or transport) encoding of a unicode string, you'd use the encode method, passing the desired representation. It returns a properly encoded byte string (as opposed to a unicode character string).
The reverse transformation is decode which takes a byte string and an encoding name and yields a unicode character string.

Categories

Resources