Find all matches for unicode characters in a string in Python

import re
b="united thats weak. See ya 👋"
print b.decode('utf-8') #output: u'united thats weak. See ya \U0001f44b'
print re.findall(r'[\U0001f600-\U0001f650]',b.decode('utf-8'),flags=re.U) # output: [u'S']
How do I get the output \U0001f44b? Please help.
Emojis that I need to handle are "😀❤️😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏🚀🚁🚂🚃🚄🚅🚆🚇🚈🚉🚊🚋🚌🚍🚎🚏🚐🚑🚒🚓🚔🚕🚖🚗🚘🚙🚚🚛🚜🚝🚞🚟🚠🚡🚢🚣🚤🚥🚦🚧🚨🚩🚪🚫🚬🚭🚮🚯🚰🚱🚲🚳🚴🚵🚶🚷🚸🚹🚺🚻🚼🚽🚾🚿🛀🛁🛂🛃🛄🛅🛋🛌🛍🛎🛏🛐🛠🛡🛢🛣🛤🛥🛩🛫🛬🛰🛳🤐🤑🤒🤓🤔🤕🤖🤗🤘🦀🦁🦂🦃🦄🧀"

Searching for a unicode range works exactly the same as searching for any other character range, but you'll need to represent the strings correctly. Here is a working example:
#coding: utf-8
import re
b=u"united thats weak. See ya 😇 "
assert re.findall(u'[\U0001f600-\U0001f650]',b) == [u'😇']
assert re.findall(ur'[😀-🙏]',b) == [u'😇']
Notes:
You need #coding: utf-8 or similar on the first or second line of your program.
In your example, the emoji that you used, U+1F44B, is not in the range U+1F600 to U+1F650. In my example, I used one that is.
If you want to use \U to include a unicode character, you can't use the raw string prefix (r'').
But if you use the characters themselves (instead of \U escapes), then you can use the raw string prefix.
You need to ensure that both the pattern and the input string are unicode strings. Neither of them may be a UTF-8-encoded byte string.
But you don't need the re.U flag unless your pattern includes \s, \w, or similar.
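The emoji listed in the question span more blocks than U+1F600 to U+1F650 alone. Here is a sketch of a wider pattern; the block ranges below are an assumption based on the characters listed in the question, not an exhaustive emoji definition, and the sketch assumes a wide (UCS-4) Python 2 build, since narrow builds store non-BMP characters as surrogate pairs and ranges like these misbehave there:
#coding: utf-8
import re

# Assumed blocks; extend as needed for your data.
EMOJI = re.compile(u'[\u2764'                # HEAVY BLACK HEART
                   u'\U0001F300-\U0001F5FF'  # Misc Symbols and Pictographs (includes U+1F44B)
                   u'\U0001F600-\U0001F64F'  # Emoticons
                   u'\U0001F680-\U0001F6FF'  # Transport and Map Symbols
                   u'\U0001F900-\U0001F9FF'  # Supplemental Symbols and Pictographs
                   u']')

b = "united thats weak. See ya 👋"
print EMOJI.findall(b.decode('utf-8'))  # [u'\U0001f44b']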

Related

Regular expression including and excluding characters

I have the following regular expression that almost works fine.
WORD_REGEXP = re.compile(r"[a-zA-Zá-úÁ-Úñ]+")
It includes lower and upper case letters with and without an accent, plus the Spanish letter «ñ». Unfortunately, it also includes (I don't know why) characters that are also used in Spanish, like «¡» or «¿», which I would like to exclude as well.
From a line like «¡España, olé!» I would like to extract just España and olé by means of the regular expression.
How can I exclude these two characters («¿», «¡») in the regular expression?
According to stribizhe, it seems the regex is OK, so the problem must lie elsewhere. I include the full Python code:
import re
linea = "¡Arriba Éspáña, ¿olé!"
WORD_REGEXP = re.compile(r"([a-zA-Zá-úÁ-Úñ]+)", re.UNICODE)
palabras = WORD_REGEXP.findall(linea)
for pal in palabras:
    pal = unicode(pal,'latin1').encode('latin1', 'replace')
    print pal
The result is the following:
¡Arriba
Éspáña
¿olé
Use the special sequence \w. According to the documentation:
If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
Note, however, that your string must be a unicode string:
import re
linea = u"¡Arriba Éspáña, ¿olé!"
regex = re.compile(r"\w+", re.UNICODE)
regex.findall(linea)
# [u'Arriba', u'\xc9sp\xe1\xf1a', u'ol\xe9']
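Those \xNN escapes are just the repr of the unicode results; printing each word displays the accents (assuming a terminal that can show them):
>>> for palabra in regex.findall(linea):
...     print palabra
...
Arriba
Éspáña
olé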
NOTE: The cause of your error is that your regex is being interpreted as UTF-8, e.g.:
Your pattern r'([a-zA-Zá-úÁ-Úñ]+)' is not defined as a unicode string, so it is encoded to UTF-8 by your text editor and read by Python as '([a-zA-Z\xc3\xa1-\xc3\xba\xc3\x81-\xc3\x9a\xc3\xb1]+)'; note the sequences starting with \xc3 (the UTF-8 lead byte for these characters).
You can confirm that by printing the repr of WORD_REGEXP. So the actual pattern used by the re module is:
patt = r"([a-zA-Zá-úÁ-Úñ]+)"
print patt.decode('latin1')
Broken into its pieces, the re module parses that byte pattern as:
a-z
A-Z
\xc3
\xa1-\xc3
\xba
\xc3
\x81-\xc3
\x9a
\xc3
\xb1
Simplifying, you are actually using the pattern
a-zA-Z\x81-\xc3
and that last range covers a lot of characters!
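You can watch the byte-level matching happen. A short sketch, assuming Python 2 and a UTF-8 source file, using the byte pattern shown above:
#coding: utf-8
import re

patt = '([a-zA-Zá-úÁ-Úñ]+)'  # a byte pattern: '([a-zA-Z\xc3\xa1-\xc3\xba\xc3\x81-\xc3\x9a\xc3\xb1]+)'
linea = '¡Arriba'            # as UTF-8 bytes: '\xc2\xa1Arriba'
# The bytes \xc2 and \xa1 of '¡' land inside the accidental ranges, so it matches:
print re.findall(patt, linea)[0]  # prints ¡Arriba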
It's better to use code points. The code points for those characters are
¡ - \xA1
¿ - \xBF
which fall outside the range of your accented characters. Written with code point escapes (Python syntax, rather than the \x{..} form used by some other regex flavors), the class becomes:
[a-zA-Z\xE1-\xFA\xC1-\xDA\xF1]+
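Equivalently, just declaring the original pattern and the input as unicode strings fixes the byte-range problem. A minimal sketch, assuming Python 2:
#coding: utf-8
import re

linea = u"¡Arriba Éspáña, ¿olé!"
# A unicode pattern, so á-ú really is the code point range U+00E1-U+00FA
WORD_REGEXP = re.compile(u"[a-zA-Zá-úÁ-Úñ]+")
print WORD_REGEXP.findall(linea)
# [u'Arriba', u'\xc9sp\xe1\xf1a', u'ol\xe9']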

How to remove this \xa0 from a string in python?

I have the following string:
word = u'Buffalo,\xa0IL\xa060625'
I don't want the "\xa" in there. How can I get rid of it? The string I want is:
word = 'Buffalo, IL 60625'
The most robust way would be to use the unidecode module to convert all non-ASCII characters to their closest ASCII equivalent automatically.
The character \xa0 (not \xa as you stated) is a NO-BREAK SPACE, and the closest ASCII equivalent would of course be a regular space.
import unidecode
word = unidecode.unidecode(word)
If you know for sure that is the only character you don't want, you can .replace it:
>>> word.replace(u'\xa0', ' ')
u'Buffalo, IL 60625'
If you need to handle all non-ASCII characters, encoding and replacing the bad characters might be a good start:
>>> word.encode('ascii', 'replace')
'Buffalo,?IL?60625'
There is no \xa there. If you try to put that into a string literal, you're going to get a syntax error if you're lucky, or it's going to swallow up the next attempted character if you're not, because \x sequences always have to be followed by two hexadecimal digits.
What you have is \xa0, which is an escape sequence for the character U+00A0, aka "NO-BREAK SPACE".
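You can see the escape swallowing the digit in a Python 2 REPL:
>>> len(u'\xa0')  # one character, U+00A0: the '0' was consumed by the \x escape
1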
I think you want to replace them with spaces, but whatever you want to do is pretty easy to write:
word.replace(u'\xa0', u' ') # replaced with space
word.replace(u'\xa0', u'0') # closest to what you were literally asking for
word.replace(u'\xa0', u'') # removed completely
You can use unicodedata to get rid of such characters via compatibility normalization (NFKD decomposes the NO-BREAK SPACE into a plain space):
>>> from unicodedata import normalize
>>> normalize('NFKD', word)
u'Buffalo, IL 60625'
This seems to work for getting rid of non-ASCII characters:
fixedword = word.encode('ascii','ignore')
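Note that 'ignore' drops the characters entirely, so the words run together; 'replace' or an explicit .replace() preserves the spacing:
>>> word = u'Buffalo,\xa0IL\xa060625'
>>> word.encode('ascii', 'ignore')
'Buffalo,IL60625'
>>> word.replace(u'\xa0', u' ').encode('ascii')
'Buffalo, IL 60625'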

Python regex sub function with matching group

I'm trying to get a python regex sub function to work but I'm having a bit of trouble. Below is the code that I'm using.
string = 'á:tdfrec'
newString = re.sub(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", ur"\1:\2", string)
#newString = re.sub(ur"([a|e|i|o|ä|ë|ö|á|é|í|ó|à|è|ì|ò])([a|e|i|o|ä|ë|ö|á|é|í|ó|ú|à|è|ì|ò]):", ur"\1:\2", string)
print newString
# a:́tdfrec is printed
So the above code is not working the way I intend. It doesn't display correctly here, but the printed string has the acute accent over the colon: the regex is moving the acute accent from the a onto the :. For the string I'm declaring, this regex is not supposed to be applied at all. My intention is for this regex to apply only to the following examples:
aä:dtcbd becomes a:ädtcbd
adfseì:gh becomes adfse:ìgh
éò:fdbh becomes é:òfdbh
but my regex is being applied when I don't want it to be. I think the problem is that the second character set followed by the colon (i.e. á:) is what causes the regex to be applied. I've been staring at this for a while and tried a few other things, and I feel like this should work, but I'm missing something. Any help is appreciated!
The following code with the re.UNICODE flag also doesn't achieve the desired output:
>>> import re
>>> original = u'á:tdfrec'
>>> pattern = re.compile(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", re.UNICODE)
>>> print pattern.sub(ur'\1:\2', original)
á:tdfrec
Is it because of the diacritic, as in the Tony the Pony example with les misérable? The diacritic ends up on the wrong character after reversing the string:
>>> original = u'les misérable'
>>> print ''.join([i for i in reversed(original)])
elbarésim sel
Edit: it's definitely an issue with the combining diacritics; you need to normalize both the regular expression and the strings you are trying to match. For example:
import re
import unicodedata
regex = unicodedata.normalize('NFC', ur'([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):')
string = unicodedata.normalize('NFC', u'aä:dtcbd')
newString = re.sub(regex, ur'\1:\2', string)
Here is an example that shows why you might hit an issue without the normalization. The string u'á' could either be the single code point LATIN SMALL LETTER A WITH ACUTE (U+00E1), or it could be two code points, LATIN SMALL LETTER A (U+0061) followed by COMBINING ACUTE ACCENT (U+0301). These will probably look the same, but they behave very differently in a regex, because you can match the combining accent as its own character. That is what is happening with the string 'á:tdfrec': a regular 'a' is captured in group 1, and the combining diacritic is captured in group 2.
By normalizing both the regex and the string you are matching you ensure this doesn't happen, because the NFC normalization will replace the diacritic and the character before it with a single equivalent character.
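A short sketch of the two forms, assuming Python 2:
#coding: utf-8
import unicodedata

precomposed = u'\u00e1'  # LATIN SMALL LETTER A WITH ACUTE
decomposed = u'a\u0301'  # 'a' followed by COMBINING ACUTE ACCENT
print precomposed == decomposed                                # False
print len(precomposed), len(decomposed)                        # 1 2
print unicodedata.normalize('NFC', decomposed) == precomposed  # True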
Original answer below.
I think your issue here is that the string you are attempting to do the replacement on is a byte string, not a Unicode string.
If these are string literals make sure you are using the u prefix, e.g. string = u'aä:dtcbd'. If they are not literals you will need to decode them, e.g. string = string.decode('utf-8') (although you may need to use a different codec).
You should probably also normalize your string, because part of the issue may have something to do with combining diacritics.
Note that in this case the re.UNICODE flag will not make a difference, because that only changes the meaning of character class shorthands like \w and \d. The important thing here is that if you are using a Unicode regular expression, it should probably be applied to a Unicode string.
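Pulling the pieces together, a sketch (assuming Python 2) that normalizes both the pattern and the input and runs all four of the asker's strings:
#coding: utf-8
import re
import unicodedata

regex = unicodedata.normalize('NFC', ur'([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):')
for s in (u'aä:dtcbd', u'adfseì:gh', u'éò:fdbh', u'á:tdfrec'):
    print re.sub(regex, ur'\1:\2', unicodedata.normalize('NFC', s))
# a:ädtcbd
# adfse:ìgh
# é:òfdbh
# á:tdfrec   (unchanged, as intended)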

How can I use regular expression for unicode string in python?

Hi, I want to use a regular expression for Unicode (UTF-8) text in the following string:
</td><td>عـــــــــــادي</td><td> 40.00</td>
I want to pick "عـــــــــــادي" out; how can I do this?
My code for this is:
state = re.findall(r'td>...</td',s)
Thanks
I ran across something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change some things. I'm just sharing this, hoping it will be useful to someone else.
>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"
Make your string unicode by placing a u before the quotation marks
>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)
Set the flag to unicode, so that it will match unicode strings as well (see docs).
(Alternatively, you can use your local language to set a range. For Russian this would be [а-яА-Я], so:
pattern = re.compile(ur'>([а-яА-Я\s]+)<')
In that case, you don't have to set a flag anymore, since you're not using a special sequence.)
>>> match = pattern.findall(string)
>>> for i in match:
... print i
...
Я люблю мороженое
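For the Arabic text in the original question, the same range idea applies; this assumes the basic Arabic block U+0600-U+06FF, which covers these letters and the tatweel (U+0640) used for the elongation:
#coding: utf-8
import re

s = u"</td><td>عـــــــــــادي</td><td> 40.00</td>"
print re.findall(ur'[\u0600-\u06FF]+', s)[0]  # عـــــــــــادي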
According to PEP 263: Defining Python Source Code Encodings, first you need to tell Python the whole source file is UTF-8 encoded by adding a comment like this to the first line:
# -*- coding: utf-8 -*-
Furthermore, try adding 'ur' before the string so that it's raw and Unicode:
state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)
I've also edited your regex to make it match. Three dots mean "exactly three characters", but since you are using UTF-8, which is a multi-byte encoding, this may not work as expected.

python re codecs ä ö, Finnish language, defining what counts as a word

Is it possible to define that a specific language's characters are considered word characters?
I.e., re does not accept ä and ö as word characters if I search for them in the following way:
Ft=codecs.open('c:\\Python27\\Scripts\\finnish2\\textfields.txt','r','utf-8')
word=Ft.readlines()
word=smart_str(word, encoding='utf-8', strings_only=False, errors='replace')
word=re.sub('[^äÄöÖåÅA-Za-z0-9]',"""\[^A-Za-z0-9]*""", word) ; print 'word= ', word #works in skipping ö,ä,å characters
I would like these characters to be included in [A-Za-z].
How can I define this?
[A-Za-z0-9] will only match the characters listed here, but the docs also mention some other special constructs like:
\w which stands for alphanumeric characters (namely [a-zA-Z0-9_] plus all unicode characters which are declared to be alphanumeric)
\W which stands for all non-alphanumeric characters (the complement of \w)
\d which stands for digits
\b which matches word boundaries (including all rules from the unicode tables)
So you will want to (a) use these constructs instead (which are shorter and maybe easier to read), and (b) tell re to apply the Unicode character tables to them by setting the UNICODE flag, like:
re_word = re.compile(r'\w+', re.U)
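For example, with a unicode input string this matches the Finnish letters as word characters (the sample words here are made up):
#coding: utf-8
import re

re_word = re.compile(r'\w+', re.U)
print re_word.findall(u'tämä on väärä sää')
# [u't\xe4m\xe4', u'on', u'v\xe4\xe4r\xe4', u's\xe4\xe4']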
For a start, you appear to be slightly confused about the args for re.sub.
The first arg is the pattern. You have '[^äÄöÖåÅA-Za-z0-9]', which matches each character that is NOT a Finnish letter or a digit.
The second arg is the replacement. You have """[^A-Za-z0-9]*""" ... so each of those non-Finnish-alphanumeric characters is going to be replaced by the literal string [^A-Za-z0-9]*. It's reasonable to assume that this is not what you want.
What do you want to do?
You need to explain your third line: after your first two lines, word will be a list of unicode objects, which is A Good Thing. However, the encoding= and errors= args indicate that the (unknown to us) smart_str() is converting your lovely unicode back to UTF-8. Processing data as UTF-8 bytes instead of Unicode characters is EVIL, unless you know what you are doing.
What encoding directive do you have at the top of your source file?
Advice: Get your data into unicode. Work on it in unicode. All your string constants should have the u prefix; if you consider that too much wear and tear on your typing fingers, at least put it on the non-ASCII constants e.g. u'[^äÄöÖåÅA-Za-z0-9]'. When you have done all the processing, encode your results for display or storage using an appropriate encoding.
When working with re, consider \w, which will match any alphanumeric character (and the underscore), instead of listing out what is alphabetic in one language. Do use the re.UNICODE flag (see the re docs).
Something like this might do the trick:
pattern = re.compile("(?u)pattern")
or
pattern = re.compile("pattern", re.UNICODE)
