Same character, different length and bytes [duplicate] - python

This question already has answers here:
python different length for same unicode
(2 answers)
Closed 5 years ago.
When downloading files from Korean websites, the filenames are often wrongly encoded/decoded and end up jumbled. I found out that by encoding with 'iso-8859-1' and decoding with 'euc-kr', I can fix this problem. However, I now have a new problem where a character that looks the same is in fact different. Check out the Python shell session below:
>>> first_string = 'â'
>>> second_string = 'â'
>>> len(first_string)
1
>>> len(second_string)
2
>>> list(first_string)
['â']
>>> list(second_string)
['a', '̂']
>>>
Encoding the first string with 'iso-8859-1' is possible. The latter is not. So the questions:
What is the difference between these two strings?
Why would downloads from the same website have the same character in varying format? (If that's what the difference is.)
And how can I fix this? (e.g. convert second_string to the likeness of first_string)
Thank you.

An easy way to find out exactly what a character is is to ask vim. Put the cursor over a character and type ga to get info on it.
The first one is:
<â> 226, Hex 00e2, Octal 342
And the second:
<a> 97, Hex 61, Octal 141 < ̂> 770, Hex 0302, Octal 1402
In other words, the first is a complete "a with circumflex" character, and the second is a regular a followed by a circumflex combining character.
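If vim isn't handy, the same information can be pulled from Python's standard unicodedata module. A small sketch, with the string rebuilt here from explicit escapes:
import unicodedata
second_string = 'a\u0302'   # the decomposed form from the question
for ch in second_string:
    print(hex(ord(ch)), unicodedata.name(ch))
# 0x61 LATIN SMALL LETTER A
# 0x302 COMBINING CIRCUMFLEX ACCENT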
Ask the website operators. How would we know?!
You need something which turns combining characters into regular characters. A Google search yielded this question, for example.
As you pointed out in your comment, and as clemens pointed out in another answer, in Python you can use unicodedata.normalize with 'NFC' as the form.
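For example, a minimal sketch of that fix in Python 3 (second_string is rebuilt here from explicit escapes so the two visually identical forms are distinguishable):
import unicodedata
second_string = 'a\u0302'                      # 'a' + combining circumflex, len() == 2
fixed = unicodedata.normalize('NFC', second_string)
print(len(fixed))                              # 1
print(fixed == '\u00e2')                       # True: the precomposed "â" (U+00E2)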

There are different representations for accents and diaereses in Unicode. There is the single precomposed character at code point U+00E2, and there is the sequence of 'a' followed by the COMBINING CIRCUMFLEX ACCENT (U+0302), which can be written as u'a\u0302' in Python 2.7. The latter consists of two characters: the 'a' and the circumflex.
A possible reason for the different representations is that the creator of the website copied the texts from different sources. For example, PDF documents often represent umlauts and accented letters as a base character plus a combining character, while typing these characters on a keyboard generally produces the single-character representation.
You may use unicodedata.normalize to convert such combining sequences into single characters, e.g.:
from unicodedata import normalize
s = u'a\u0302'
print s, len(s), len(normalize("NFC", s))
will output â 2 1.

Related

Regarding a problem with python 3 in hex-string conversion

I have found a problem with Python 3.6.7 when I tried to stringify a hexadecimal value. The original hexadecimal number in the string is wrongly converted into the letter Ë. Is there any way to solve this?
>>> '\xcb\x85\x04\x08'
'Ë\x85\x04\x08'
You are using characters outside of the ASCII range. If you are trying to use Unicode code points, use \u____ escapes.
print("\xCB\x85\x04\x08")
print("\uCB89\u0408")
Output:
Ë
쮉Ј
You can find an ASCII table at asciitable.com. Characters outside the range 00-7F vary across regions, because many countries historically used those positions to store extra characters useful in their own language, such as Russian characters in Russia.
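As a small sketch of the distinction (assuming the goal was to keep the raw byte values rather than code points), compare a str literal with a bytes literal:
# In a str literal, \xcb names the code point U+00CB, which displays as Ë.
print('\xcb\x85\x04\x08')       # Ë followed by three (invisible) control characters
# In a bytes literal, \xcb stays a raw byte and is echoed back unchanged.
print(b'\xcb\x85\x04\x08')      # b'\xcb\x85\x04\x08'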

how can I extract only emoji from utf-8 with regex in python? [duplicate]

This question already has an answer here:
Find emojis in a tweet as whole clusters and not as individual chars
(1 answer)
Closed 11 months ago.
env python3.6
There's a utf-8 encoded text like this
text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"
And I want to match only the elements in which three digits or letters follow b'\xf0\x9f\x98\' - this actually indicates the facial expression emojis.
I tried this
if re.search(b'\xf0\x9f\x98\[a-zA-Z0-9]{3}$', text_utf8)
but it doesn't work, and when I print the pattern it comes out like b'\xf0\x9f\x98\\[a-zA-Z1-9]{3}' - an extra \ automatically gets into it.
Any way out? thanks.
I can see two problems with your search:
you are trying to search the textual representation of the utf8 string (the \xXX represents a byte in hexadecimal). What you actually should be doing is matching against its content (the actual bytes).
you are including the "end-of-string" marker ($) in your search, where you're probably interested in its occurrence anywhere in the string.
Something like the following should work, though brittle (see below for a more robust solution):
re.search(b'\xf0\x9f\x98.', text_utf8)
This will give you the first occurrence of a 4-byte UTF-8 sequence prefixed by \xf0\x9f\x98.
Assuming you're dealing only with UTF-8, this should, to the best of my knowledge, have unambiguous matches (i.e. you don't have to worry about this prefix appearing in the middle of a longer sequence).
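For example, a minimal sketch of that byte-level approach applied to the sample data from the question (re.findall is used here to collect every match rather than only the first):
import re
text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"
# Find every 4-byte sequence that starts with the \xf0\x9f\x98 prefix.
for match in re.findall(b'\xf0\x9f\x98.', text_utf8):
    print(match.decode('utf8'))    # the emoji itself, e.g. 😀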
A more robust solution, if you have the option of third-party modules, would be installing the regex module and using the following:
regex.search(r'\p{Emoji=Yes}', text_utf8.decode('utf8'))
This has the advantages of being more readable and explicit, while probably being also more future-proof. (See here for more unicode properties that might help in your use-case)
Note that in this case you can also deal with text_utf8 as an actual unicode (str in py3) string, without converting it to a byte-string, which might have other advantages, depending on the rest of your code.

Python. Replace "\\uxxxx" to "\uxxxx" [duplicate]

This question already has an answer here:
Python-encoding and decoding using codecs,unicode_escape()
(1 answer)
Closed 4 years ago.
I'm scraping a web page, and I get Unicode characters in raw escaped form.
Instead of getting the "ó" character, I get the literal text \u00f3.
It is the same as writing:
>>>print("\\u00f3")
I want to convert "\\u00f3" into "\u00f3" for all Unicode characters. That is:
"\\uxxxx" -> "\uxxxx"
But if I try to replace \\ with \, the characters that follow are interpreted as escape characters.
How can I do it?
With the following code, I can convert some of the characters:
import re

def raw_to_utf8(matcher):
    # Convert the matched "\uxxxx" text into the corresponding character.
    string2convert = matcher.group(0)
    return chr(int(string2convert[2:], base=16))

def decode_utf8(text_raw):
    text_raw_re = re.compile(r"\\u[0-9a-ce-z]\w{0,3}")
    return text_raw_re.sub(raw_to_utf8, text_raw)

text_fixed = decode_utf8(text_raw)
As you can see in the regular expression pattern, I have skipped the 'd' character. That is because \udxxx sequences can't be converted to UTF-8 by this method or any other. They aren't important characters for me, so it is not a problem.
Thanks for your help.
************************** Solved ********************************
The best solution was solved previously:
Python-encoding and decoding using codecs,unicode_escape()
Thanks for your help.
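For reference, a minimal sketch of that unicode_escape approach in Python 3 (the sample value is just the one from the question):
import codecs
raw = '\\u00f3'                                        # the literal backslash-u text as scraped
fixed = codecs.decode(raw.encode('latin-1'), 'unicode_escape')
print(fixed)                                           # ó
# Caveat: unicode_escape treats the bytes as latin-1, so input that already
# contains non-ASCII characters can come out mangled by this round trip.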
First: maybe you are not decoding the webpage with the correct charset. If the web server does not supply the charset you might have to find it in the meta tags or make an educated guess. Maybe try a couple of usual charsets and compare the results.
Second: I played around with strings and decoding for a while and it's really frustrating, but I found a possible solution in format():
s = "\\u00f3"
print('{:c}'.format(int(s[2:], 16)))
Formatting the extracted hex value as a Unicode character seems to work.
You cannot simply replace '\\' with '\' because '\' on its own is not a valid string literal.
Convert the hexadecimal expression into a number, then find the corresponding character:
original = '\\u00f3'
char = chr(int(original[2:], base=16))
You can check that this gives the desired result:
assert char == '\u00f3'

How many displayable characters in a unicode string (Japanese / Chinese)

I need to know how many displayable characters there are in a Unicode string containing Japanese / Chinese characters.
Sample code to make the question very obvious:
# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)
12
print str
睡眠時間 <<<
note that four characters are displayed
How can I know, from the string, that 4 characters are going to be displayed?
This string
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
is an encoded representation of Unicode code points. It contains bytes, and len(str) returns the number of bytes.
You want to know how many Unicode code points the string contains. For that, you need to know which encoding was used to encode those code points. The most popular encoding is UTF-8. In UTF-8, one code point can take from 1 to 4 bytes. But you don't have to remember that; just decode the string:
>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'
Here you can see 4 Unicode code points.
Print it to see the printable version:
>>> print str.decode('utf8')
睡眠時間
And get the number of code points:
>>> len(str.decode('utf8'))
4
UPDATE: See also abarnert's answer, which covers more possible cases.
If you actually want "displayable characters", you have to do two things.
First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:
s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')
Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function gives you the general category of any code point. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.
For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode doesn't really have a property corresponding to "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:
import unicodedata

def displayable(c):
    # 'C*' categories are the control, format, private-use, and unassigned characters.
    return not unicodedata.category(c).startswith('C')

p = u''.join(c for c in u if displayable(c))
Or, if you've decided that Mn and Me are also not "displayable" but Mc is:
def displayable(c):
    return unicodedata.category(c) not in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
But even this may not be what you want. For example, does a nonspacing combining mark followed by a letter count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is also that same character in a single code point). Often, just normalizing your string appropriately (which usually means NFKC or NFKD) is enough to solve that, once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
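As a quick illustration (Python 2 syntax, to match the rest of this answer), normalization merges such a combining sequence into a single code point:
import unicodedata
u2 = u'C\u0327'                                   # 'C' plus COMBINING CEDILLA: two code points
print len(u2)                                     # 2
print len(unicodedata.normalize('NFC', u2))       # 1, the precomposed U+00C7 (Ç)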
If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, they do, you just need to use a newer version of Python. In 3.x, you can just write:
p = ''.join(c for c in u if c.isprintable())
But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not—for example, they consider all separators except ' ' non-printable. Obviously they can't include definitions for any distinction anyone might want to make.

Python get ASCII characters [duplicate]

This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 8 years ago.
I'm retrieving data from the internet and I want to convert it to ASCII. But I can't get it fixed. (Python 2.7)
When I use decode('utf-8') on the strings, I get for example Yalçınkaya. However, I want this converted to Yalcinkaya. The raw data was Yalçınkaya.
Anyone who can help me?
Thanks.
Edit: I've tried the suggestion made by the user who marked this question as a duplicate (What is the best way to remove accents in a Python unicode string?) but that did not solve my problem.
That post mainly talks about removing the special characters, which did not solve my problem of replacing the Turkish characters in Yalçınkaya with their ASCII counterparts (Yalcinkaya).
# Printing the raw string in Python results in "Yalçınkaya".
# Decoding the UTF-8 bytes with unicode(name, 'utf8') changes the string to 'Yalçınkaya'.
# HTMLParser is used to revert HTML entities such as escaped commas.
# NFKD normalization is applied, which converts the string to a decomposed 'Yalçınkaya'.
# Encoding to ASCII then results in 'Yalcnkaya', dropping the original Turkish 'ı', which is not what I wanted.
import unicodedata
import HTMLParser

name = unicodedata.normalize('NFKD', unicode(name, 'utf8'))
name = HTMLParser.HTMLParser().unescape(name)
name = unicodedata.normalize('NFKD', u'%s' % name).encode('ascii', 'ignore')
Let's check - first, one really needs to understand what character encodings and Unicode are. That is dead serious. I'd suggest you read http://www.joelonsoftware.com/articles/Unicode.html before continuing any further in your project. (By the way, "converting to ASCII" is not a generally useful solution - it is more like breaking the data. Think about trying to parse numbers, but since you don't understand the digit "9", you decide just to skip it.)
That said - you can tell Python to "decode" a string and just replace any byte that is invalid in the chosen encoding with a proper "unknown" character (u"\ufffd") - you can then replace that character before re-encoding the data for your preferred output: raw_data.decode("ascii", "replace"). If you choose to break your data even further, you can use "ignore" instead of "replace": the unknown characters will simply be suppressed. Remember you get a "unicode" object after decoding - you have to apply the "encode" method to it before outputting that data anywhere (printing, writing to a file, etc.) - please read the article linked above.
Now - checking your specific data - the particular Yalçınkaya is exactly what raw UTF-8 text looks like when it is mistakenly shown as latin-1. Just decode it from UTF-8 as usual, and then use the recipe above to strip the accents - but be advised that this only works for Latin letters with diacritics, and "world text" from the Internet may contain all kinds of characters - you should not rely on stuff being convertible to ASCII. I have to say again: read that article, and rethink your practices.
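A minimal sketch of that sequence (Python 2, with a made-up variable holding the raw bytes from the question):
import unicodedata
data = 'Yal\xc3\xa7\xc4\xb1nkaya'                # raw UTF-8 bytes; misread as latin-1 they display as "Yalçınkaya"
text = data.decode('utf-8')                      # u'Yal\xe7\u0131nkaya', i.e. Yalçınkaya
print unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
# 'Yalcnkaya' - the accent is stripped from ç, but the dotless ı is dropped entirely, as warned above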
