Checking if string contains unicode using standard Python

Checking if string contains unicode using standard Python - python

I have some strings of roughly 100 characters and I need to detect if each string contains an unicode character. The final purpose is to check if some particular emojis are present, but initially I just want a filter that catches all emojis (as well as potentially other special characters). This method should be fast.
I've seen Python regex matching Unicode properties, but I cannot use any custom packages. I'm using Python 2.7. Thanks!

There is no point is testing 'if a string contains Unicode characters', because all characters in a string are Unicode characters. The Unicode standard encompasses all codepoints that Python supports, including the ASCII range (Unicode codepoints U+0000 through to U+007F).
If you want to test for Emoji code, test for specific ranges, as outlined by the Unicode Emoji class specification:
re.compile(
u'[\u231A-\u231B\u2328\u23CF\23E9-\u23F3...\U0001F9C0]',
flags=re.UNICODE)
where you'll have to pick and choose what codepoints you consider to be Emoji. I personally would not include U+0023 NUMBER SIGN in that category for example, but apparently the Unicode standard does.
Note: To be explicit, the above expression is not complete. There are 209 separate entries in the Emoji category and I didn't feel like writing them all out.
Another note: the above uses a \Uhhhhhhhh wide Unicode escape sequence; its use is only supported in a regex pattern in Python 3.3 and up, or in a wide (UCS-4) build for earlier versions of Python. For a narrow Python build, you'll have to match on surrogate pairs for codepoints over U+FFFF.

Related

Printing non ASCII glyphs in python

In HTML-CSS, we put non-ascii glyphs like middle dots, copyright symbol etc, by using their numeric conventions. In order to use non-ASCII characters, Python requires explicit encoding and decoding of strings into Unicode.
I have tried using unidecode lib (from reference here), but I am having trouble printing these characters.
I have tried different conventions for the symbol:
U+25CF , ● [&#9679 with ';'] , and so on... (depending on variation of these glyphs)n For example sake, help me print same dot(above) in python
I want to know how to print these in python, for I have such requirement in a GUI project, made with kivy/kivymd in python.

print(chr(9679))
Just use the decimal part of the entity.
Using a special escape sequence you can embed the character in a string using the hex.
print('\u25cf')

String change from Latin to ASCII

I have tried to change the format of strings from latin1 to ascii, and most of the strings were changed well except for some characters, æ ø. Æ, and Ø.
I have checked the characters were changed correctly when using R package (stringi::stri_trans_general(loc1, "latin-ascii) but Python's unicodedata package did not work well.
Is there any way to convert them correctly in Python? I guess it may need an additional dictionary.
For information, I have applied the following function to change the format:
unicodedata.normalize('NFKD', "Latin strings...").encode('latin1', 'ignore').decode('ascii')

It's important to understand a) what encodings and decodings are; b) how text works; and c) what unicode normalization does.
Strings do not have a "format" in the sense that you describe, so talking about converting from latin1 to ascii format does not make sense. The string has representations (what it looks like when you print it out; or what the code looks like when you create it directly in your code; etc.), and it can be encoded. latin1, ascii etc. are encodings - that means, rules that explain how to store your string as a raw sequence of bytes.
So if you have a string, it is not "in latin1 format" just because the source data was in latin1 encoding - it is not in any format, because that concept doesn't apply. It's just a string.
Similarly, we cannot ask for a string "in ascii format" that we convert to. We can ask for an ascii encoding of the string - which is a sequence of bytes, and not text. (That "not" is one of the most important "not"s in all of computer science, because many people, tools and programs will lie to you about this.)
Of course, the problem here is that ascii cannot represent all possible text. There are over a million "code points" that can theoretically be used as elements of a string (this includes a lot of really weird things like emoji). The latin-1 and ascii encodings both use a single byte per code point in the string. Obviously, this means they can't represent everything. Latin-1 represents only the first 256 possible code points, and ascii represents only the first 128. So if we have data that comes from a latin-1 source, we can get a string with those characters like Æ in it, which cause a problem in our encoding step.
The 'ignore' option for .encode makes the encoder skip things that can't be handled by the encoding. So if you have the string 'barentsøya', since the ø cannot be represented in ascii, it gets skipped and you get the bytes b'barentsya' (using the unfortunately misleading way that Python displays bytes objects back to you).
When you normalize a string, you convert the code points into some plain format that's easier to work with, and treats distinct ways of writing a character - or distinct ways of writing very similar characters - the same way. There are a few different normalization schemes. The NFKD chooses decomposed representations for accented characters - that is, instead of using a single symbol to represent a letter with an accent, it will use two symbols, one that represents the plain letter, and one representing the "combining" version of the accent. That might seem useful - for example, it would turn an accented A into a plain A and an accent character. You might think that you can then just encode this as ascii, let the accent characters be ignored, and get the result you want. However, it turns out that this is not enough, because of how the normalization works.
Unfortunately, I think the best you can do is to either use a third-party library (and please note that recommendations are off-topic for Stack Overflow) or build the look-up table yourself and just translate each character. (Have a look at the built-in string methods translate and maketrans for help with this.)

Why doesn't 'encode("utf-8", 'ignore').decode("utf-8")' strip non-UTF8 chars in Python 3?

I'm using Python 3.7 and Django 2.0. I want to strip out non-UTF-8 characters from a string, that I'm obtaining by reading this CSV file. I tried this ...
web_site = row['website'].strip().encode("utf-8", 'ignore').decode("utf-8")
but this doesn't seem to be doing the job, since I have a resulting string that looks like ...
web_site: "wbez.org<200e>"
Whatever this "<200e>" thing is, is evidently non-UTF-8 string, because when I try and insert this into a MySQL database (deployed as a docker image), I get the following error ...
web_1 | django.db.utils.OperationalError: Problem installing fixture '/app/maps/fixtures/seed_data.yaml': Could not load maps.Coop(pk=191): (1366, "Incorrect string value: '\\xE2\\x80\\x8E' for column 'web_site' at row 1")

Your row['website'] is already a Unicode string. UTF-8 can support all valid Unicode code points, so .encode('utf8','ignore') doesn't typically ignore anything and encodes the entire string in UTF-8, and .decode('utf8') changes it back to a Unicode string again.
If you simply want to strip non-ASCII characters, use the following to filter only ASCII characters and ignore the rest.
row['website'].encode('ascii','ignore').decode('ascii')

I think you are confusing the encodings.
Python has a standard character set: Unicode
UTF-8 is just and encoding of Unicode. All characters in Unicode can be encoded in UTF-8, and all valid UTF-8 codes can be interpreted as unicode characters.
So you are just encoding and decoding Unicode strings, so the code should do nothing. (There is really some exceptional cases: Python strings really are a superset of Unicode, so your code would just remove non Unicode characters, see surrogateescape, for such extremely seldom case, usually you will enconter only by reading sys.argv or os.environ).
In any case, I think you are doing thing wrong. Search in this site for the general question (e.g. "remove non-ascii characters"). It is often better to decompose (with K, compatibility), and then remove accent, and then remove non-ascii characters, so that you will get more characters translated. There are various function to create slug, which do a better job, or there is also a library which translate more characters in "nearly equivalent" ascii characters (Unicode has various representation of LETTER A, and you may want to translate also Alpha and Aleph and ... into A (better then discarding, especially if you have a foreign language, which possibly you will discard everything).

how can I extract only emoji from utf-8 with regex in python? [duplicate]

This question already has an answer here:
Find emojis in a tweet as whole clusters and not as individual chars
(1 answer)
Closed 11 months ago.
env python3.6
There's a utf-8 encoded text like this
text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"
And I want to search only elements which three numbers or alphabets follow b'\xf0\x9f\x98\' - this actually indicates the facial expression emojis.
I tried this
if re.search(b'\xf0\x9f\x98\[a-zA-Z0-9]{3}$', text_utf8)
but it doesn't work and when I print it off it comes like this b'\xf0\x9f\x98\\[a-zA-Z1-9]{3}' and \ automatically gets in it.
Any way out? thanks.

I can see two problems with your search:
you are trying to search the textual representation of the utf8 string (the \xXX represents a byte in hexadecimal). What you actually should be doing is matching against its content (the actual bytes).
you are including the "end-of-string" marker ($) in your search, where you're probably interested in its occurrence anywhere in the string.
Something like the following should work, though brittle (see below for a more robust solution):
re.search(b'\xf0\x9f\x98.', text_utf8)
This will give you the first occurrence of a 4-byte unicode sequences prefixed by \xf0\x9f\x98.
Assuming you're dealing only with UTF-8, this should TTBOMK have unambiguous matches (i.e.: you don't have to worry about this prefix appearing in the middle of a longer sequence).
A more robust solution, if you have the option of third-party modules, would be installing the regex module and using the following:
regex.search('\p{Emoji=Yes}', text_utf8.decode('utf8'))
This has the advantages of being more readable and explicit, while probably being also more future-proof. (See here for more unicode properties that might help in your use-case)
Note that in this case you can also deal with text_utf8 as an actual unicode (str in py3) string, without converting it to a byte-string, which might have other advantages, depending on the rest of your code.

Get warning for python string literals not prefixed with 'u'

To follow best practices for Unicode in python, you should prefix all string literals of characters with 'u'. Is there any tool available (preferably PyDev compatible) that warns if you forget it?

you should prefix all string literals with 'u'
No, not really.
You should prefix literals for strings of characters with u. But not all strings are strings of characters. When you are talking to components that are byte based, like network services, or binary files, you need to be using byte strings.
eg. Want to try to write a Unicode string into a PNG file? Not sensible. Want to base64-decode the string Y2Fm6Q==? You can't reasonably use a Unicode string here, base64 is explicitly bytes.
Sure, Python will often let you get away with passing a unicode string where a byte string is expected, but only by automatically encoding to ASCII. If the string contains non-ASCII characters you going to get UnicodeError just as surely as if you'd used bytes where unicode was expected. “Unicode is right, bytes are wrong” is a damaging myth. Manipulation for both kinds of strings are required.
If you are concerned about the transition to Python 3, you should certainly mark up your character strings as u'', but you should then also mark up your explicitly-bytes strings as b''. Strings where it doesn't matter you can leave as '' and let them get converted from byte strings to unicode strings on Python 3. There are lots of cases where Python 2 used to use bytes and Python 3 uses Unicode where it is appropriate to do this. But there are still plenty of cases where you do really need to be talking bytes, and having that converted to Python 3 as unicode will cause problems.
(The only problem with this is that b'' syntax requires Python 2.6 or later, so using it will make you incompatible with earlier versions.)

You might want to write a such a warnging-generator tool by parsing Python source code using the parser or the dis built-in modules. You may also consider adding such a feature to pylint.

KennyTM's comment should be posted as an answer:
from __future__ import unicode_literals
This future declaration can be used in Python 2.6 and 2.7 and enables Python 3's string syntax so that unprefixed string literals are Unicode strings and byte arrays require a b prefix.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.