Hi, I want to use a regular expression to match Unicode (UTF-8) text in the following string:
</td><td>عـــــــــــادي</td><td> 40.00</td>
I want to pick "عـــــــــــادي" out. How can I do this?
My code for this is:
state = re.findall(r'td>...</td',s)
Thanks
I ran across something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change some things. I'm just sharing this, hoping it will be useful to someone else.
>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"
Make your string unicode by placing a u before the quotation marks.
>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)
Set the flag to unicode, so that it will match unicode strings as well (see docs).
(Alternatively, you can use your local language to set a range. For Russian this would be [а-яА-Я], so:
pattern = re.compile(ur'>([а-яА-Я\s]+)<')
In that case, you don't have to set a flag anymore, since you're not using a special sequence.)
>>> match = pattern.findall(string)
>>> for i in match:
...     print i
...
Я люблю мороженое
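(To tie this back to the original question: the same local-range idea works for the Arabic string using the Arabic block, U+0600 to U+06FF. A minimal sketch, assuming a UTF-8 source file:
# -*- coding: utf-8 -*-
import re
s = u"</td><td>عـــــــــــادي</td><td> 40.00</td>"
# \u escapes are still processed inside Python 2 ur'' strings
print re.findall(ur'>([\u0600-\u06FF]+)<', s)[0]  # عـــــــــــادي
)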
According to PEP 263: Defining Python Source Code Encodings, first you need to tell Python that the whole source file is UTF-8 encoded by adding a comment like this to the first line:
# -*- coding: utf-8 -*-
Furthermore, try adding 'ur' before the string so that it's raw and Unicode:
state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)
I've also edited your regex so that it matches. Three dots match exactly three characters, but on a UTF-8 byte string each dot consumes a single byte, so a multi-byte character can be split in the middle and the match will not work as expected.
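To see why the dots misfire, compare byte length and character length (a small illustration, assuming a UTF-8 source file):
# -*- coding: utf-8 -*-
# Each Arabic letter here is two bytes in UTF-8, so '.' applied to a
# byte string consumes half a character at a time.
word = "عادي"                       # a byte string
print len(word)                     # 8 (bytes)
print len(word.decode('utf-8'))     # 4 (characters)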
import re
b="united thats weak. See ya 👋"
print b.decode('utf-8') #output: u'united thats weak. See ya \U0001f44b'
print re.findall(r'[\U0001f600-\U0001f650]',b.decode('utf-8'),flags=re.U) # output: [u'S']
How can I get the output \U0001f44b? Please help.
The emojis that I need to handle are "😀❤️😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏🚀🚁🚂🚃🚄🚅🚆🚇🚈🚉🚊🚋🚌🚍🚎🚏🚐🚑🚒🚓🚔🚕🚖🚗🚘🚙🚚🚛🚜🚝🚞🚟🚠🚡🚢🚣🚤🚥🚦🚧🚨🚩🚪🚫🚬🚭🚮🚯🚰🚱🚲🚳🚴🚵🚶🚷🚸🚹🚺🚻🚼🚽🚾🚿🛀🛁🛂🛃🛄🛅🛋🛌🛍🛎🛏🛐🛠🛡🛢🛣🛤🛥🛩🛫🛬🛰🛳🤐🤑🤒🤓🤔🤕🤖🤗🤘🦀🦁🦂🦃🦄🧀"
Searching for a unicode range works exactly the same as searching for any other character range, but you'll need to represent the strings correctly. Here is a working example:
#coding: utf-8
import re
b=u"united thats weak. See ya 😇 "
assert re.findall(u'[\U0001f600-\U0001f650]',b) == [u'😇']
assert re.findall(ur'[😀-🙏]',b) == [u'😇']
Notes:
You need #coding: utf-8 or similar on the first or second line of your program.
In your example, the emoji that you used, U+1F44B, is not in the range U+1F600 to U+1F650. In my example, I used one that is.
If you want to use \U to include a unicode character, you can't use the raw string prefix (r'').
But if you use the characters themselves (instead of \U escapes), then you can use the raw string prefix.
You need to ensure that both the pattern and the input string are unicode strings; neither of them may be a UTF-8-encoded byte string.
But you don't need the re.U flag unless your pattern includes \s, \w, or similar.
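For the wider emoji list in the question, one range is not enough. A hedged sketch that combines several blocks (the block boundaries here are my own choices, and astral-plane ranges like these need a "wide" Unicode build of Python 2):
# coding: utf-8
import re
EMOJI = re.compile(u'[\U0001F300-\U0001F5FF'   # misc symbols & pictographs
                   u'\U0001F600-\U0001F64F'    # emoticons
                   u'\U0001F680-\U0001F6FF'    # transport & map symbols
                   u'\U0001F900-\U0001F9FF'    # supplemental symbols
                   u'\u2764]')                 # heavy black heart
assert EMOJI.findall(u"See ya \U0001F44B") == [u'\U0001f44b']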
I have the following regular expression that almost works fine.
WORD_REGEXP = re.compile(r"[a-zA-Zá-úÁ-Úñ]+")
It includes lower- and upper-case letters with and without an accent, plus the Spanish letter «ñ». Unfortunately, it also matches (I don't know why) characters that are also used in Spanish, like «¡» or «¿», which I would like to exclude as well.
In a line like ¡España, olé! I would like to extract just España and olé, by means of the regular expression.
How can I exclude these two characters («¿», «¡») in the regular expression?
According to stribizhe, it seems the regex is OK, so the problem must lie elsewhere. Here is the full Python code:
import re
linea = "¡Arriba Éspáña, ¿olé!"
WORD_REGEXP = re.compile(r"([a-zA-Zá-úÁ-Úñ]+)", re.UNICODE)
palabras = WORD_REGEXP.findall(linea)
for pal in palabras:
    pal = unicode(pal,'latin1').encode('latin1', 'replace')
    print pal
The result is the following:
¡Arriba
Éspáña
¿olé
Use the special sequence \w; according to the documentation:
If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
Note, however, that your string must be a unicode string:
import re
linea = u"¡Arriba Éspáña, ¿olé!"
regex = re.compile(r"\w+", re.UNICODE)
regex.findall(linea)
# [u'Arriba', u'\xc9sp\xe1\xf1a', u'ol\xe9']
NOTE: The cause of your error is that your regex is being interpreted as UTF-8 bytes:
Your pattern r'([a-zA-Zá-úÁ-Úñ]+)' is not defined as a unicode string, so it's encoded to UTF-8 by your text editor and read by Python as '([a-zA-Z\xc3\xa1-\xc3\xba\xc3\x81-\xc3\x9a\xc3\xb1]+)'; note the \xc3 bytes (the UTF-8 lead byte of each accented character).
You can confirm that by printing the repr of WORD_REGEXP.pattern. To see the individual bytes as characters, decode the byte pattern as Latin-1:
patt = r"([a-zA-Zá-úÁ-Úñ]+)"
print patt.decode('latin1')
Or, broken into the class items the re module actually sees:
a-z
A-Z
\xc3
\xa1-\xc3
\xba
\xc3
\x81-\xc3
\x9a
\xc3
\xb1
Simplifying, you are effectively using the pattern
a-zA-Z\x81-\xc3
That last range covers a lot of bytes, including both UTF-8 bytes of «¡» (\xc2\xa1) and «¿» (\xc2\xbf), which is why those characters show up in your matches!
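To see this concretely, here is a sketch of the repr check mentioned above, run from a UTF-8 source file:
# coding: utf-8
# What the re module receives when the pattern is left as a byte string:
patt = "([a-zA-Zá-úÁ-Úñ]+)"
print repr(patt)
# '([a-zA-Z\xc3\xa1-\xc3\xba\xc3\x81-\xc3\x9a\xc3\xb1]+)'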
It's better to use code points. The code points for those characters are
¡ = \xA1
¿ = \xBF
which fall outside the ranges of your accented characters. Python's re does not support the \x{...} syntax, so written with Python's \xHH escapes (in a unicode pattern) the class becomes:
[a-zA-Z\xE1-\xFA\xC1-\xDA\xF1]+
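As a quick check (a sketch; note that these Latin-1 ranges also sweep in a few symbols such as × (\xD7) and ÷ (\xF7)):
# coding: utf-8
import re
linea = u"¡Arriba Éspáña, ¿olé!"
# the class above, written as a unicode pattern with Python's \xHH escapes
palabras = re.findall(u"[a-zA-Z\xE1-\xFA\xC1-\xDA\xF1]+", linea)
print palabras  # [u'Arriba', u'\xc9sp\xe1\xf1a', u'ol\xe9']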
I am using a regex to replace quotes within an input string. My data contains two 'types' of quotes:
" and “
There's a very subtle difference between the two. Currently, I am explicitly mentioning both these types in my regex:
\"*\“*
I am afraid, though, that future data may contain a different 'type' of quote on which my regex would fail. How many different types of quotes exist? Is there a way to normalize these to just one type so that my regex won't break on unseen data?
Edit -
My input data consists of HTML files, and I am unescaping HTML entities and URL-encoded sequences while decoding to ASCII:
escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))
where line is each line of the HTML file. I need to 'ignore' decoding errors because the files in my database don't all share the same encoding, and I don't know a file's encoding before reading it.
Edit2
I am unable to do this with the replace function. I tried replace('"','') but it doesn't replace the other type of quote, '“'. If I add that character in another replace call, it throws a non-ASCII character error.
Condition
No external libraries allowed; only the Python standard library may be used.
I don't think there is a "quotation marks" character class in Python's regex implementation so you'll have to do the matching yourself.
You could keep a list of common quotation mark unicode characters (here's a list for a good start) and build the part of regex that matches quotation marks programmatically.
I can only help you with the original question about quotation marks. As it turns out, Unicode defines many properties per character, and these are all available through the Unicode Character Database. "Quotation mark" is one of these properties.
How many different types of quotes exist?
29, according to Unicode, see below.
The Unicode standard brings us a definitive text file on Unicode properties, PropList.txt, among which a list of quotation marks. Since Python does not support all Unicode properties in regular expressions, you cannot currently use \p{QuotationMark}. However, it's trivial to create a regular expression character class:
# placed on multiple lines for readability; remove the spaces
# and then use this in your regex in place of the current quotes
[\u0022 \u0027 \u00AB \u00BB
\u2018 \u2019 \u201A \u201B
\u201C \u201D \u201E \u201F
\u2039 \u203A \u300C \u300D
\u300E \u300F \u301D \u301E
\u301F \uFE41 \uFE42 \uFE43
\uFE44 \uFF02 \uFF07 \uFF62
\uFF63]
As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.
Turns out there's a much easier way to do this. Just prefix the regex you write in Python with u (note that when combined with a raw string in Python 2, the prefix order is ur, not ru):
regexp = ur'\"*\“*'
Make sure you use the re.UNICODE flag when you want to compile/search/match your regex to your string.
re.findall(regexp, string, re.UNICODE)
Don't forget to include the
#!/usr/bin/python
# -*- coding:utf-8 -*-
at the start of the source file to make sure unicode strings can be written in your source file.
I'm testing in Python whether a certain string contains a substring, as follows:
if substr in str:
do_something()
The problem arises when substr contains letters with diacritics and other unusual characters.
How would you recommend testing with such letters?
Thank you
I do not know of any problems specific to diacritics in Python. The following works for me:
u"ł" in u"źdźbło"
>>> True
Edit:
u"ł" in u"źdźblo"
>>> False
The matching is exact. If diacritics-insensitive matching is what you want, specify this in your question and see Fredrik's answer.
Edit2: Yes, for string literals containing non-ASCII characters you need to specify the encoding in the source file. Something like this should work:
# coding: utf-8
Use the solution outlined in this SO post to remove all diacritics prior to the testing.
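For reference, a common approach is to decompose with NFD and drop the combining marks; a minimal sketch using only the standard library:
# coding: utf-8
import unicodedata

def strip_diacritics(s):
    # decompose, then drop combining marks (category Mn)
    decomposed = unicodedata.normalize('NFD', s)
    return u''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

print strip_diacritics(u"źdźbło")  # zdzbło (ł has no decomposition, so it survives)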
I have a UTF-8 string with combining diacritics. I want to match it with the \w regex sequence. \w matches characters that have precomposed accents, but not a Latin character followed by combining diacritics.
>>> re.match("a\w\w\wz", u"aoooz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> print u"ao\u00F3oz"
aoóoz
>>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE)
>>> print u"aoo\u0301oz"
aóooz
(Looks like the SO Markdown processor is having trouble with the combining diacritics in the above, but there is a ́ on the last line.)
Is there any way to match combining diacritics with \w? I don't want to normalise the text, because it comes from a filename and I don't want to take on a whole 'filename unicode normalization' step yet. This is Python 2.5.
I've just noticed a new "regex" package on PyPI (if I understand correctly, it is a test version of a package that may someday replace the stdlib re module).
It seems to have (among other things) more possibilities with regard to unicode. For example, it supports \X, which is used to match a single grapheme (whether it uses combining or not). It also supports matching on unicode properties, blocks and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned before is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).
Note that this makes \X more or less the unicode equivalent of ., not of \w, so in your case, \w\p{M}* is what you need.
It is (for now) a non-stdlib package, and I don't know how ready it is (it doesn't come in a binary distribution), but you might want to give it a try, as it seems to be the easiest/most "correct" answer to your question. (Otherwise, I think you're down to explicitly using character ranges, as described in my comment to the previous answer.)
See also this page with information on unicode regular expressions, that might also contain some useful information for you (and can serve as documentation for some of the things implemented in the regex package).
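A sketch of what that looks like, assuming the regex package is installed (pip install regex):
# coding: utf-8
import regex  # third-party; pip install regex

# \w\p{M}* = one word character plus any trailing combining marks
print regex.match(ur"a(?:\w\p{M}*){3}z", u"aoo\u0301oz") is not None  # True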
You can use unicodedata.normalize to compose the combining diacritics into one unicode character.
>>> import re
>>> from unicodedata import normalize
>>> re.match(u"a\w\w\wz", normalize("NFC", u"aoo\u0301oz"), re.UNICODE)
<_sre.SRE_Match object at 0x00BDCC60>
I know you said you didn't want to normalize, but I don't think there will be a problem with this solution: you're only normalizing the string to match against, and you don't have to change the filename itself.