Test for substrings with diacritics in strings - python

I'm testing in Python if certain string contains something as follows
if substr in str:
do_something()
The problem is when substr contains letter with diacritics and other non usual characters.
How would you recommend to do tests with such letters?
thank you

I do not know of any problems specific to diacritics in Python. The following works for me:
u"ł" in u"źdźbło"
>>> True
Edit:
u"ł" in u"źdźblo"
>>> False
The matching is exact. If diacritics-insensitive matching is what you want, specify this in your question and see Fredrik's answer.
Edit2: Yes, for string literals containing non-ascii chars you need to specify the encoding in the source file. Something like this should work:
# coding: utf-8

Use the solution outlined in this SO post to remove all diacritics prior to the testing.

Related

How do I replace all characters in a Python string?

I found a solution on stackoverflow but it doesn't seem to work. I have made a string scanner that checks for character frequency and then replaces all characters with the "real" characters. I've made sure that the character recognition works but when I try replacing all characters in a string they no longer match up with the expected/calculated characters (when I try replacing for example only 2 characters it works fine and matches up perfectly). Here is my replacement code:
print(text.replace(re,'e').replace(rt,'t').replace(ra,'a').replace(ro,'o').replace(ri,'i').replace(rn,'n').replace(rs,'s').replace(rr,'r').replace(rh,'h').replace(rl,'l').replace(ru,'u').replace(rc,'c').replace(rm,'m').replace(rf,'f').replace(ry,'y').replace(rw,'w').replace(rg,'g').replace(rp,'p').replace(rb,'b').replace(rv,'v').replace(rk,'k').replace(rx,'x').replace(rq,'q').replace(rj,'j').replace(rz,'z').replace(rd,'d'))
You might want to take a look at translate. Your code would probably look something like
text = text.translate(str.maketrans('abcd...', ''.join([ra, rb, rc, rd...]))

find all the matches for unicodes in a string in python

import re
b="united thats weak. See ya 👋"
print b.decode('utf-8') #output: u'united thats weak. See ya \U0001f44b'
print re.findall(r'[\U0001f600-\U0001f650]',b.decode('utf-8'),flags=re.U) # output: [u'S']
How to get a output \U0001f44b. Please help
Emojis that i need to handle are "😀❤️😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏🚀🚁🚂🚃🚄🚅🚆🚇🚈🚉🚊🚋🚌🚍🚎🚏🚐🚑🚒🚓🚔🚕🚖🚗🚘🚙🚚🚛🚜🚝🚞🚟🚠🚡🚢🚣🚤🚥🚦🚧🚨🚩🚪🚫🚬🚭🚮🚯🚰🚱🚲🚳🚴🚵🚶🚷🚸🚹🚺🚻🚼🚽🚾🚿🛀🛁🛂🛃🛄🛅🛋🛌🛍🛎🛏🛐🛠🛡🛢🛣🛤🛥🛩🛫🛬🛰🛳🤐🤑🤒🤓🤔🤕🤖🤗🤘🦀🦁🦂🦃🦄🧀"
Searching for a unicode range works exactly the same as searching for any sort of character range. But, you'll need to represent the strings correctly. Here is a working example:
#coding: utf-8
import re
b=u"united thats weak. See ya 😇 "
assert re.findall(u'[\U0001f600-\U0001f650]',b) == [u'😇']
assert re.findall(ur'[😀-🙏]',b) == [u'😇']
Notes:
You need #coding: utf-8 or similar on the first or second line of your program.
In your example, the emoji that you used, U-1f44b is not in the range U-1f600 to U-1f650. In my example, I used one that is.
If you want to use \U to include a unicode character, you can't use the raw string prefix (r'').
But if you use the characters themselves (instead of \U escapes), then you can use the raw string prefix.
You need to ensure that both the pattern and the input string are unicode strings. Neither of them may be UTF8-encoded strings.
But you don't need the re.U flag unless your pattern includes \s, \w, or similar.

Python regex sub function with matching group

I'm trying to get a python regex sub function to work but I'm having a bit of trouble. Below is the code that I'm using.
string = 'á:tdfrec'
newString = re.sub(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", ur"\1:\2", string)
#newString = re.sub(ur"([a|e|i|o|ä|ë|ö|á|é|í|ó|à|è|ì|ò])([a|e|i|o|ä|ë|ö|á|é|í|ó|ú|à|è|ì|ò]):", ur"\1:\2", string)
print newString
# a:́tdfrec is printed
So the the above code is not working the way that I intend. It's not displaying correctly but the string printed has the accute accent over the :. The regex statement is moving the accute accent from over the a to over the :. For the string that I'm declaring this regex is not suppose be applied. My intention for this regex statement is to only be applied for the following examples:
aä:dtcbd becomes a:ädtcbd
adfseì:gh becomes adfse:ìgh
éò:fdbh becomes é:òfdbh
but my regex statement is being applied and I don't want it to be. I think my problem is the second character set followed by the : (ie á:) is what's causing the regex statement to be applied. I've been staring at this for a while and tried a few other things and I feel like this should work but I'm missing something. Any help is appreciated!
The follow code with re.UNICODE flag also doesn't achieve the desired output:
>>> import re
>>> original = u'á:tdfrec'
>>> pattern = re.compile(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", re.UNICODE)
>>> print pattern.sub(ur'\1:\2', string)
á:tdfrec
Is it because of the diacritic and the tony the pony example for les misérable? The diacritic is on the wrong character after reversing it:
>>> original = u'les misérable'
>>> print ''.join([i for i in reversed(original)])
elbarésim sel
edit: Definitely an issue with the combining diacritics, you need to normalize both the regular expression and the strings you are trying to match. For example:
import unicodedata
regex = unicodedata.normalize('NFC', ur'([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):')
string = unicodedata.normalize('NFC', u'aä:dtcbd')
newString = re.sub(regex, ur'\1:\2', string)
Here is an example that shows why you might hit an issue without the normalization. The string u'á' could either be the single code point LATIN SMALL LETTER A WITH ACCUTE (U+00E1) or it could be two code points, LATIN SMALL LETTER A (U+0061) followed by COMBINING ACUTE ACCENT (U+0301). These will probably look the same, but they will have very different behaviors in a regex because you can match the combining accent as its own character. That is what is happening here with the string 'á:tdfrec', a regular 'a' is captured in group 1, and the combining diacritic is captured in group 2.
By normalizing both the regex and the string you are matching you ensure this doesn't happen, because the NFC normalization will replace the diacritic and the character before it with a single equivalent character.
Original answer below.
I think your issue here is that the string you are attempting to do the replacement on is a byte string, not a Unicode string.
If these are string literals make sure you are using the u prefix, e.g. string = u'aä:dtcbd'. If they are not literals you will need to decode them, e.g. string = string.decode('utf-8') (although you may need to use a different codec).
You should probably also normalize your string, because part of the issue may have something to do with combining diacritics.
Note that in this case the re.UNICODE flag will not make a difference, because that only changes the meaning of character class shorthands like \w and \d. The important thing here is that if you are using a Unicode regular expression, it should probably be applied to a Unicode string.

Accommodate two types of quotes in a regex

I am using a regex to replace quotes within in an input string. My data contains two 'types' of quotes -
" and “
There's a very subtle difference between the two. Currently, I am explicitly mentioning both these types in my regex
\"*\“*
I am afraid though that in future data I may get a different 'type' of quote on which my regex may fail. How many different types of quotes exist? Is there way to normalize these to just one type so that my regex won't break for unseen data?
Edit -
My input data consists of HTML files and I am escaping HTML entities and URLs to ASCII
escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))
where line specifies each line in the HTML file. I need to 'ignore' the ASCII as all files in my database don't have the same encoding and I don't know the encoding prior to reading the file.
Edit2
I am unable to do so using replace function. I tried replace('"','') but it doesn't replace the other type of quote '“'. If I add it in another replace function it throws me NON-ASCII character error.
Condition
No external libraries allowed, only native python libraries could be used.
I don't think there is a "quotation marks" character class in Python's regex implementation so you'll have to do the matching yourself.
You could keep a list of common quotation mark unicode characters (here's a list for a good start) and build the part of regex that matches quotation marks programmatically.
I can only help you with the original question about quotations marks. As it turns out, Unicode defines many properties per character and these are all available though the Unicode Character Database. "Quotation mark" is one of these properties.
How many different types of quotes exist?
29, according to Unicode, see below.
The Unicode standard brings us a definitive text file on Unicode properties, PropList.txt, among which a list of quotation marks. Since Python does not support all Unicode properties in regular expressions, you cannot currently use \p{QuotationMark}. However, it's trivial to create a regular expression character class:
// placed on multiple lines for readability, remove spaces
// and then place in your regex in place of the current quotes
[\u0022 \u0027 \u00AB \u00BB
\u2018 \u2019 \u201A \u201B
\u201C \u201D \u201E \u201F
\u2039 \u203A \u300C \u300D
\u300E \u300F \u301D \u301E
\u301F \uFE41 \uFE42 \uFE43
\uFE44 \uFF02 \uFF07 \uFF62
\uFF63]
As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.
Turns out there's a much easier way to do this. Just append the literal 'u' in front of your regex you write in python.
regexp = ru'\"*\“*'
Make sure you use the re.UNICODE flag when you want to compile/search/match your regex to your string.
re.findall(regexp, string, re.UNICODE)
Don't forget to include the
#!/usr/bin/python
# -*- coding:utf-8 -*-
at the start of the source file to make sure unicode strings can be written in your source file.

How can I use regular expression for unicode string in python?

Hi I wanna use regular expression for unicode utf-8 in following string:
</td><td>عـــــــــــادي</td><td> 40.00</td>
I want to pick "عـــــــــــادي" out, how Can I do this?
My code for this is :
state = re.findall(r'td>...</td',s)
Thanks
I ran across something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change some things. I'm just sharing this, hoping it will be useful to someone else.
>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"
Make your string unicode by placing a u before the quotation marks
>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)
Set the flag to unicode, so that it will match unicode strings as well (see docs).
(Alternatively, you can use your local language to set a range. For Russian this would be [а-яА-Я], so:
pattern = re.compile(ur'>([а-яА-Я\s]+)<')
In that case, you don't have to set a flag anymore, since you're not using a special sequence.)
>>> match = pattern.findall(string)
>>> for i in match:
... print i
...
Я люблю мороженое
According to PEP 0264: Defining Python Source Code Encodings, first you need to tell Python the whole source file is UTF-8 encoded by adding a comment like this to the first line:
# -*- coding: utf-8 -*-
Furthermore, try adding 'ur' before the string so that it's raw and Unicode:
state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)
I've also edited your regex to make it match. Three dots mean "exactly three characters", but since you are using UTF-8, which is a multi-byte encoding, this may not work as expected.

Categories

Resources