This question already has answers here:
Python and regular expression with Unicode
(2 answers)
Closed 5 years ago.
I am currently trying to figure out how to use Unicode in a regex in Python.
The regex I want to get to work is the following:
r"([A-ZÜÖÄß]+\s)+"
This should match all occurrences of multiple capitalized words that may or may not have umlauts in them. Funnily enough, it does nearly what I want, but it still ignores the umlauts.
For example, in FUßBALL AND MORE only BALL AND MORE is detected.
I already tried simply using the Unicode escapes (Ü becomes \u00DC etc.), as advised in another thread, but that does not work either. I might try the "regex" library instead of "re", but I'd kind of like to know what I am doing wrong right now.
If you are able to enlighten me, please feel free to do so.
Use Unicode strings. Make sure your source is saved in the declared encoding:
#coding:utf8
import re
for s in re.finditer(ur"[A-ZÜÖÄß]+",u"FUßBALL AND MORE"):
    print s.group()
Output:
FUßBALL
AND
MORE
Without Unicode strings, your byte strings are in the encoding of your source file; if that is UTF-8, non-ASCII characters are multi-byte. You will still have problems with Unicode strings on a narrow Python build, but only if you use codepoints above U+FFFF (such as emoji), because they are stored as UTF-16 surrogate pairs (two codepoints). In that case, switch to the latest Python 3.x, where the problem was solved and every Unicode codepoint has a length of 1.
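For comparison, a minimal Python 3 sketch of the same search - no prefixes or coding declarations needed, since Python 3 source defaults to UTF-8 and all strings are Unicode:
import re

# Python 3: str is Unicode, so the umlaut character class just works.
for m in re.finditer(r"[A-ZÜÖÄß]+", "FUßBALL AND MORE"):
    print(m.group())
# FUßBALL
# AND
# MORE

# Astral codepoints (e.g. emoji) also count as a single character:
print(len("\U0001F600"))   # 1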
This question already has an answer here:
Find emojis in a tweet as whole clusters and not as individual chars
(1 answer)
Closed 11 months ago.
Environment: Python 3.6
There's UTF-8-encoded text like this:
text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"
And I want to search only for elements in which three numbers or letters follow b'\xf0\x9f\x98\' - this actually indicates the facial-expression emojis.
I tried this
if re.search(b'\xf0\x9f\x98\[a-zA-Z0-9]{3}$', text_utf8)
but it doesn't work, and when I print the pattern it comes out like this: b'\xf0\x9f\x98\\[a-zA-Z1-9]{3}' - a \ automatically gets into it.
Any way out? Thanks.
I can see two problems with your search:
you are trying to search the textual representation of the utf8 string (the \xXX represents a byte in hexadecimal). What you actually should be doing is matching against its content (the actual bytes).
you are including the "end-of-string" marker ($) in your search, whereas you're probably interested in occurrences anywhere in the string.
Something like the following should work, though brittle (see below for a more robust solution):
re.search(b'\xf0\x9f\x98.', text_utf8)
This will give you the first occurrence of a 4-byte UTF-8 sequence prefixed by \xf0\x9f\x98.
Assuming you're dealing only with UTF-8, this should, to the best of my knowledge, give unambiguous matches (i.e. you don't have to worry about this prefix appearing in the middle of a longer sequence).
A more robust solution, if you have the option of third-party modules, would be installing the regex module and using the following:
regex.search(r'\p{Emoji=Yes}', text_utf8.decode('utf8'))
This has the advantages of being more readable and explicit, while probably also being more future-proof. (See here for more Unicode properties that might help in your use case.)
Note that in this case you can also deal with text_utf8 as an actual unicode (str in py3) string, without converting it to a byte-string, which might have other advantages, depending on the rest of your code.
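Putting the two variants together on the sample bytes from the question - just a sketch, and it assumes the third-party regex module is installed (pip install regex):
import re
import regex   # third-party module: pip install regex

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# Byte-level search: match the raw prefix plus one more byte, anywhere in the string.
m = re.search(b'\xf0\x9f\x98.', text_utf8)
print(m.group())    # b'\xf0\x9f\x98\x80'

# Property-based search on the decoded text.
m = regex.search(r'\p{Emoji=Yes}', text_utf8.decode('utf8'))
print(m.group())    # prints the grinning-face emoji (U+1F600)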
This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 8 years ago.
I'm retrieving data from the internet and I want to convert it to ASCII, but I can't get it to work. (Python 2.7)
When I use decode('utf-8') on the strings, I get for example Yalçınkaya. I want this converted to Yalcinkaya. The raw data was Yalçınkaya.
Can anyone help me?
Thanks.
Edit: I've tried the suggestion made by the user who marked this question as a duplicate (What is the best way to remove accents in a Python unicode string?), but that did not solve my problem.
That post mainly talks about removing the special characters; it did not solve my problem of replacing the Turkish characters (Yalçınkaya) with their ASCII counterparts (Yalcinkaya).
# Printing the raw string in Python results in "Yalçınkaya".
# When applying unicode() with utf8, the string changes to 'Yalçınkaya'.
# HTMLParser is used to revert special characters such as commas.
# NFKD normalization is used, which converts the string to 'Yalçınkaya'.
# Applying ASCII encoding results in 'Yalcnkaya', missing the original Turkish 'ı', which is not what I wanted.
import unicodedata
import HTMLParser

name = unicodedata.normalize('NFKD', unicode(name, 'utf8'))
name = HTMLParser.HTMLParser().unescape(name)
name = unicodedata.normalize('NFKD', u'%s' % name).encode('ascii', 'ignore')
Let's check - first, one really needs to understand what character encodings and Unicode are. That is dead serious. I'd suggest you read http://www.joelonsoftware.com/articles/Unicode.html before continuing any further in your project. (By the way, "converting to ASCII" is not a generally useful solution - it's more like a breakage. Think about trying to parse numbers, but, since you don't understand the digit "9", you decide just to skip it.)
That said - you can tell Python to "decode" a string and replace the characters that are unknown in the chosen encoding with a proper "unknown" character (u"\ufffd") - you can then replace that character before re-encoding to your preferred output: raw_data.decode("ASCII", "replace"). If you choose to break your parsing even further, you can use "ignore" instead of "replace": the unknown characters will simply be suppressed. Remember that you get a "unicode" object after decoding - you have to apply the "encode" method to it before outputting the data anywhere (printing, writing to a file, etc.) - please read the article linked above.
Now - checking your specific data - the particular Yalçınkaya is exactly what raw UTF-8 text looks like when it is (mis)read as latin-1. Just decode it from UTF-8 as usual, and then use the recipe above to strip the accents - but be advised that this only works for Latin letters with diacritics, and "world text" from the internet may contain all kinds of characters - you should not rely on stuff being convertible to ASCII. I have to say it again: read that article, and rethink your practices.
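Here is a minimal Python 2 sketch of that recipe, using the raw bytes from the question. Note that the Turkish dotless ı has no decomposition, so the ASCII step drops it instead of turning it into i - exactly the limitation mentioned above:
# -*- coding: utf-8 -*-
import unicodedata

raw = 'Yal\xc3\xa7\xc4\xb1nkaya'      # the raw UTF-8 bytes from the question
text = raw.decode('utf8')             # u'Yal\xe7\u0131nkaya', i.e. u'Yalçınkaya'

# Decompose, then throw away everything that is not ASCII.
stripped = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print stripped                        # 'Yalcnkaya' -- the ı is dropped, not converted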
This question already has answers here:
How to determine the encoding of text
(16 answers)
Closed 5 years ago.
I'm trying to get rid of diacritics in my text file. I converted a PDF to text with a tool I did not write myself, and I wasn't able to find out which encoding it uses. The text is written in Nahuatl, orthographically similar to Spanish.
I transformed the text into a list of strings. Now I'm trying to do the following:
import string

# check whether there is a non-ASCII character in the item
def is_ascii(word):
    check = string.ascii_letters + "."
    if word not in check:
        return False
    return True

# if there is a non-ASCII character, encode the string
def to_ascii(word):
    if is_ascii(word) == False:
        newWord = word.encode("utf8")
        return newWord
    return word
What I want to get is a Unicode version of my string. It doesn't work so far, and I have tried several encodings like latin1, cp1252 and iso-8859-1. Can anybody tell me what I did wrong?
How can I find out the right encoding?
Thank you!
EDIT:
I wrote to the people who developed the converter (pdf-to-txt) and they said they were already using Unicode. So John Machin was right with (1) in his answer.
As I wrote in a comment, that wasn't clear to me, because in the Eclipse debugger the list itself showed some characters as Unicode escapes and others not, and if I looked at the items separately they were all decoded in some way, so that I actually saw Unicode.
Thank you for your help!
Edit your question to show the version of Python you are using. Guessing the version from your code is not possible, and whether you are using Python 3.x or 2.x matters a lot. The following remarks assume Python 2.x.
You already seem to have determined that you have UTF-8 encoded text. Try the_text.decode('utf8'). Note decode, NOT encode.
If decoding with UTF-8 does not raise UnicodeDecodeError and your text is not trivially short, then it is very close to certain that UTF-8 is the correct encoding.
If the above does not work, show us the result of print repr(the_text).
Note that it is counter-productive to try to check whether the file is encoded in ASCII -- ASCII is a subset of UTF-8. Leaving some data as str objects and others as unicode is messy in Python 2.x and won't work in Python 3.x.
In any case, your first function doesn't do what you think it does; it returns False for any input string whose length is 2 or more. Please consider unit-testing functions as you write them; it makes debugging much faster later on.
Note that latin1 and iso-8859-1 are the same encoding. Because latin1 encodes the first 256 Unicode codepoints in the same order, it is impossible to get a UnicodeDecodeError from text.decode('latin1'). "No error" in this case has exactly zero diagnostic value.
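Two quick Python 2 illustrations of those last two points (the values are only illustrative):
import string

# The posted is_ascii() does a substring test, not a per-character test,
# so a perfectly ASCII word like 'cat' is reported as non-ASCII:
print 'cat' in (string.ascii_letters + ".")        # False

# Every byte value is a valid latin-1 character, so decoding with latin1
# can never raise -- "no error" proves nothing about the real encoding:
print repr('\xf0\x9f\x98\x80'.decode('latin1'))    # u'\xf0\x9f\x98\x80'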
Update in response to this comment from OP:
I use Python 2.7. If I use text.decode("utf8") it raises the following
error: UnicodeEncodeError: 'latin-1' codec can't encode character
u'\u2014' in position 0: ordinal not in range(256).
That can happen two ways:
(1) In a single statement like foo = text.decode('utf8'), text is already a unicode object so Python 2.X tries to encode it using the default encoding (latin-1 ???).
(2) Possibly in two different statements, first foo = text.decode('utf8') where text is an str object encoded in UTF-8, and this statement doesn't raise an error, followed by something like print foo and your sys.stdout.encoding is latin-1 (???).
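A minimal Python 2 sketch of case (1); note that a stock interpreter reports the default codec as ascii, not latin-1, which is part of why your traceback is puzzling:
# Python 2: .decode() on something that is ALREADY unicode first encodes it
# with the default codec, which is what produces an *Encode* error.
text = u'\u2014'            # an EM DASH; already a unicode object
try:
    text.decode('utf8')     # an implicit text.encode(default) happens first
except UnicodeEncodeError as e:
    print e                 # 'ascii' codec can't encode character u'\u2014' ...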
I can't imagine why you have "ticked" my answer as correct. Nobody knows what the question is yet!
Please edit your question to show your code (insert print repr(text) just before the text.decode("utf8") line), and the result of running it. Show the repr() result and the full traceback (so that we can determine which line is causing the error).
I ask again: can you make your file available for analysis?
By the way, u'\u2014' is an "EM DASH" and is a valid character in cp1252 (but not in latin-1, as you have seen from the error message). What version of what operating system are you using?
And to answer your last question, NO, you must NOT attempt to decode your text using every codec in the known universe. You are ALREADY getting plausible Unicode; something (your code?) is decoding something somehow -- the presence of u'\u2014' is enough evidence of that. Just show us your code and its results.
If you have read some bytes and want to interpret them as a Unicode string, then you have to use .decode() rather than .encode().
Like @delnan said in the comment, I hope you know the encoding. If not, the guesswork should get easier once you fix the function you are using.
BTW, even if there are only ASCII characters in that word, why not .decode() it too? You'd have the same data type (unicode) everywhere, which will make your program simpler.
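A minimal sketch of that, assuming the file really is UTF-8 and using made-up example words:
# Python 2: bytes in, unicode out -- decode everything, even pure ASCII.
line = 'calli'                     # ASCII-only bytes
word = '\xc4\x81tl'                # UTF-8 bytes for u'\u0101tl' (illustrative)

uline = line.decode('utf8')        # ASCII is a subset of UTF-8, so this is fine
uword = word.decode('utf8')        # u'\u0101tl'
print type(uline), type(uword)     # <type 'unicode'> <type 'unicode'>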
This question already has answers here:
Unescaping Characters in a String with Python
Closed 11 years ago.
I have a string of unicode HTML in Python which begins with: \u003ctable>\u003ctr
I need to convert this to ascii so I can then parse it with BeautifulSoup. However, Python's encode and decode functions seem to have no effect; I get the original string no matter what I try. I'm new to Python and unicode in general, so help would be much appreciated.
Use
s.decode("unicode-escape")
to decode the HTML data first (no idea where you got this character crap from).
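For what it's worth, a quick Python 2 check of what that does, assuming the string really contains literal backslash-u escapes:
raw = r'\u003ctable>\u003ctr'          # literal backslash-u escapes, as posted
print raw.decode('unicode-escape')     # <table><tr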
I have no clue what you're talking about. I suspect that I'm not the only one.
>>> s = BeautifulSoup.BeautifulSoup(u'<html><body>\u003ctable>\u003ctr</body></html>')
>>> s
<html><body><table><tr></tr></table></body></html>
To follow best practices for Unicode in Python, you should prefix all string literals of characters with 'u'. Is there any tool available (preferably PyDev-compatible) that warns you if you forget it?
you should prefix all string literals with 'u'
No, not really.
You should prefix literals for strings of characters with u. But not all strings are strings of characters. When you are talking to components that are byte based, like network services, or binary files, you need to be using byte strings.
e.g. Want to write a Unicode string into a PNG file? Not sensible. Want to base64-decode the string Y2Fm6Q==? You can't reasonably use a Unicode string here; base64 is explicitly bytes.
Sure, Python will often let you get away with passing a unicode string where a byte string is expected, but only by automatically encoding it to ASCII. If the string contains non-ASCII characters, you are going to get a UnicodeError just as surely as if you'd used bytes where unicode was expected. "Unicode is right, bytes are wrong" is a damaging myth. Manipulation of both kinds of strings is required.
If you are concerned about the transition to Python 3, you should certainly mark up your character strings as u'', but you should then also mark up your explicitly-bytes strings as b''. Strings where it doesn't matter you can leave as '' and let them get converted from byte strings to unicode strings on Python 3. There are lots of cases where Python 2 used to use bytes and Python 3 uses Unicode where it is appropriate to do this. But there are still plenty of cases where you do really need to be talking bytes, and having that converted to Python 3 as unicode will cause problems.
(The only problem with this is that b'' syntax requires Python 2.6 or later, so using it will make you incompatible with earlier versions.)
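A small Python 2.6+ sketch of the distinction, reusing the base64 example above (the café value is just illustrative):
# -*- coding: utf-8 -*-
# Text gets a u prefix, bytes get a b prefix.
import base64

text = u'café'                        # character data -> unicode literal
blob = b'Y2Fm6Q=='                    # base64 input is bytes, not characters
raw = base64.b64decode(blob)          # 'caf\xe9' -- still bytes
print repr(raw.decode('latin-1'))     # u'caf\xe9', i.e. the text u'café' again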
You might want to write such a warning-generator tool yourself by parsing Python source code with the parser or dis built-in modules. You may also consider adding such a feature to pylint.
KennyTM's comment should be posted as an answer:
from __future__ import unicode_literals
This future declaration can be used in Python 2.6 and 2.7; it enables Python 3's string syntax, so that unprefixed string literals are Unicode strings and byte strings require a b prefix.
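A tiny sketch of its effect (Python 2.6+, illustrative values):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

s = 'Üben'         # unprefixed, but now a unicode string
b = b'\x89PNG'     # byte data still needs an explicit b prefix
print type(s), type(b)    # <type 'unicode'> <type 'str'>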