Why does this Python RegEx pipe not pick out both unicode ranges? - python

A sample string containing both hiragana and katakana unicode characters:
myString = u"Eliminate ひらがな non-alphabetic カタカナ characters"
A pattern to match both ranges, according to:
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
myPattern = u"[\u3041-\u309f]*|[\u30a0-\u30ff]*"
Simple Python regex replace function
import re
print re.sub(myPattern, "", myString)
Returns:
Eliminate non-alphabetic カタカナ characters
The only way I can get it to work is if I use the two ranges separately, one after the other. What is stopping this RegEx from simply picking both sides of the |-pipe?

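You'll need to combine the ranges into one character class. As written, the * lets the first alternative, [\u3041-\u309f]*, match the empty string at any position, so it always succeeds and the second alternative is never tried; the katakana are therefore never consumed: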
myPattern = u"[\u3041-\u309f\u30a0-\u30ff]*"
Demo:
>>> myPattern = u"[\u3041-\u309f\u30a0-\u30ff]*"
>>> print re.sub(myPattern, "", u"Eliminate ひらがな non-alphabetic カタカナ characters")
Eliminate non-alphabetic characters

>>> myPattern = u"[\u3041-\u309f]|[\u30a0-\u30ff]"
>>> print re.sub(myPattern, "", myString)
Eliminate non-alphabetic characters
>>>
EDIT: You can combine the two character classes with the OR operator as well, as in the second demo above, provided you drop the * from each alternative (or use + instead).
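To see why the original pattern stops short, it helps to look at what the first alternative matches when the scan reaches the katakana. The following is a minimal sketch in Python 3 syntax (where every str is Unicode, so the u prefix is dropped); the variable names are illustrative only.

```python
import re

star_alternation = re.compile("[\u3041-\u309f]*|[\u30a0-\u30ff]*")
kana_class = re.compile("[\u3041-\u309f\u30a0-\u30ff]+")

# At a katakana character the hiragana alternative still "succeeds" by
# matching zero characters, so the katakana alternative is never tried.
empty_match = star_alternation.match("カタカナ").group(0)
print(repr(empty_match))

# One class with both ranges (and + rather than *) removes both scripts.
text = "Eliminate ひらがな non-alphabetic カタカナ characters"
cleaned = kana_class.sub("", text)
print(cleaned)
```

Using + instead of * also avoids zero-length matches altogether, which keeps the substitution count down.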

Related

Easiest way to replace a substring

What would be the easiest way to replace a substring within a string when I don't know the exact substring I am looking for and only know the delimiting strings? For example, if I have the following:
mystr = 'wordone wordtwo "In quotes"."Another word"'
I basically want to delete the first quoted words (including the quotes) and the period (.) following so the resulting string is:
'wordone wordtwo "Another word"'
Basically I want to delete the first quoted words and the quotes and the following period.
You are looking for regular expressions here, using the re module:
import re
quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
result = quoted_plus_fullstop.sub('', mystr)
The pattern matches a literal quote, followed by 1 or more characters that are not quotes, followed by another quote and a full stop.
Demo:
>>> import re
>>> mystr = 'wordone wordtwo "In quotes"."Another word"'
>>> quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
>>> quoted_plus_fullstop.sub('', mystr)
'wordone wordtwo "Another word"'
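If the string could contain several quoted parts followed by a period and only the first should go, the count argument of sub() limits the number of replacements. A small sketch in Python 3 syntax; the sample string with three quoted parts is made up for illustration.

```python
import re

quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
sample = 'wordone "first"."second"."third"'  # hypothetical input

# count=1 stops after the first replacement
first_only = quoted_plus_fullstop.sub('', sample, count=1)
# without count, every quoted-part-plus-period is removed
all_removed = quoted_plus_fullstop.sub('', sample)

print(first_only)
print(all_removed)
```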

Regex to get list of all words with specific letters (unicode graphemes)

I'm writing a Python script for a FOSS language learning initiative. Let's say I have an XML file (or to keep it simple, a Python list) with a list of words in a particular language (in my case, the words are in Tamil, which uses a Brahmi-based Indic script).
Given a list of letters, I need to draw out the subset of those words that can be spelled using just those letters.
An English example:
words = ["cat", "dog", "tack", "coat"]
get_words(['o', 'c', 'a', 't']) should return ["cat", "coat"]
get_words(['k', 'c', 't', 'a']) should return ["cat", "tack"]
A Tamil example:
words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words([u'ம', u'ப', u'ட', u'ம்']) should return [u"மடம்", u"படம்"]
get_words([u'ப', u'ம்', u'ட']) should return [u"படம்"]
The order that the words are returned in, or the order that the letters are entered in should not make a difference.
Although I understand the difference between unicode codepoints and graphemes, I'm not sure how they're handled in regular expressions.
In this case, I would want to match only those words that are made up of the specific graphemes in the input list, and nothing else (i.e. the markings that follow a letter should only follow that letter, but the graphemes themselves can occur in any order).
To support characters that can span several Unicode codepoints:
# -*- coding: utf-8 -*-
import re
import unicodedata
from functools import partial
NFKD = partial(unicodedata.normalize, 'NFKD')
def match(word, letters):
    word, letters = NFKD(word), map(NFKD, letters) # normalize
    return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)
words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words = lambda letters: [w for w in words if match(w, letters)]
print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
# -> மடம் படம்
print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
# -> படம்
It assumes that the same character can be used zero or more times in a word.
If you want only words that contain exactly given characters:
import regex # $ pip install regex
chars = regex.compile(r"\X").findall # get all characters
def match(word, letters):
    return sorted(chars(word)) == sorted(letters)
words = ["cat", "dog", "tack", "coat"]
print(" ".join(get_words(['o', 'c', 'a', 't'])))
# -> coat
print(" ".join(get_words(['k', 'c', 't', 'a'])))
# -> tack
Note: there is no cat in the output in this case because cat doesn't use all given characters.
What does it mean to normalize? And could you please explain the syntax of the re.match() regex?
>>> import re
>>> re.escape('.')
'\\.'
>>> c = u'\u00c7'
>>> cc = u'\u0043\u0327'
>>> cc == c
False
>>> re.match(r'%s$' % (c,), cc) # do not match
>>> import unicodedata
>>> norm = lambda s: unicodedata.normalize('NFKD', s)
>>> re.match(r'%s$' % (norm(c),), norm(cc)) # do match
<_sre.SRE_Match object at 0x1364648>
>>> print c, cc
Ç Ç
Without normalization c and cc do not match. The characters are from the unicodedata.normalize() docs.
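To tie the two pieces together, here is a hedged sketch in Python 3 syntax of how normalization and the alternation regex interact. The helper names nfkd and spelled_with are made up for this illustration, and re.fullmatch stands in for the trailing $ used above.

```python
import re
import unicodedata

def nfkd(s):
    # Decompose every character so that equivalent spellings compare equal.
    return unicodedata.normalize("NFKD", s)

def spelled_with(word, letters):
    # Build "a|b|c" from the normalized letters; re.escape guards against
    # letters that happen to be regex metacharacters.
    alternation = "|".join(re.escape(nfkd(l)) for l in letters)
    # (?:a|b|c)+ means: one or more repetitions of any allowed letter;
    # fullmatch requires those repetitions to cover the whole word.
    return re.fullmatch("(?:%s)+" % alternation, nfkd(word)) is not None

words = ["மரம்", "மடம்", "படம்", "பாடம்"]
first = [w for w in words if spelled_with(w, ['ம', 'ப', 'ட', 'ம்'])]
second = [w for w in words if spelled_with(w, ['ப', 'ம்', 'ட'])]
print(first, second)

# The precomposed and the decomposed C-with-cedilla become identical.
same_after_nfkd = nfkd('\u00c7') == nfkd('C\u0327')
```

Because a multi-codepoint letter such as ம் appears as one alternative, its trailing mark can only be consumed together with its base letter.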
EDIT: Okay, don't use any of the answers from here. I wrote them all while thinking Python regular expressions didn't have a word boundary marker, and I tried to work around this lack. Then @Mark Tolonen added a comment that Python has \b as a word boundary marker! So I posted another answer, short and simple, using \b. I'll leave this here in case anyone is interested in seeing solutions that work around the lack of \b, but I don't really expect anyone to be.
It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
Sadly, there isn't a regular expression feature for "word boundary". [EDIT: This turns out not to be correct, as I said in the first paragraph.] We need to make one of our own. There are two possible word beginnings: the start of a line, or whitespace separating our word from the previous word. Similarly, there are two possible word endings: end of a line, or whitespace separating our word from the next word.
Since we will be matching some extra stuff we don't want, we can put parentheses around the part of the pattern we do want.
To match two alternatives, we can make a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special notation to make a group whose contents we don't want to keep: (?:)
So, here is the pattern to match the beginning of a word. Start of line or white space: (?:^|\s)
Here is the pattern for end of word. White space or end of line: (?:\s|$)
Putting it all together, here is our final pattern:
(?:^|\s)([ocat]+)(?:\s|$)
You can build this dynamically. You don't need to hard-code the whole thing.
import re
s_pat_start = r'(?:^|\s)(['
s_pat_end = r']+)(?:\s|$)'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars is now set to the string: "ocat"
s_pat = s_pat_start + set_of_chars + s_pat_end
pat = re.compile(s_pat)
Now, this doesn't in any way check for valid words. If you have the following text:
This is sensible. This not: occo cttc
The pattern I showed you will match on occo and cttc, and those are not really words. They are strings made only of letters from [ocat] though.
So just do the same thing with Unicode strings. (If you are using Python 3.x then all strings are Unicode strings, so there you go.) Put the Tamil characters in the character class and you are good to go.
This has a confusing problem: re.findall() doesn't return all possible matches.
EDIT: Okay, I figured out what was confusing me.
What we want is for our pattern to work with re.findall() so you can collect all the words. But re.findall() only finds non-overlapping patterns. In my example, re.findall() only returned ['occo'] and not ['occo', 'cttc'] as I expected... but this is because my pattern was matching the white space after occo. The match group didn't collect the white space, but it was matched all the same, and since re.findall() wants no overlap between matches, the white space was "used up" and didn't work for cttc.
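The "used up" whitespace can be made visible by comparing the whole match with the captured group. A minimal sketch in Python 3 syntax, using the same sample text:

```python
import re

text = "This is sensible. This not: occo cttc"
consuming = re.compile(r'(?:^|\s)([ocat]+)(?:\s|$)')

first = consuming.search(text)
whole_match = first.group(0)   # includes the spaces on both sides
captured = first.group(1)      # only the word itself
found = consuming.findall(text)

print(repr(whole_match), repr(captured), found)
```

The trailing space belongs to the first match, so the scan resumes at the c of cttc, where neither ^ nor \s can match.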
The solution is to use a feature of Python regular expressions that I have never used before: special syntax that says "must not be preceded by" or "must not be followed by". The sequence \S matches any non-whitespace so we could use that. But punctuation is non-whitespace, and I think we do want punctuation to delimit a word. There is also special syntax for "must be preceded by" or "must be followed by". So here is, I think, the best we can do:
Build a string that means "match when the character class string is at start of line and followed by whitespace, or when character class string is preceded by whitespace and followed by whitespace, or when character class string is preceded by whitespace and followed by end of line, or when character class string is preceded by start of line and followed by end of line".
Here is that pattern using ocat:
r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'
I'm very sorry but I really do think this is the best we can do and still work with re.findall()!
It's actually less confusing in Python code though:
import re
NMGROUP_BEGIN = r'(?:' # begin non-capturing group
NMGROUP_END = r')' # end non-capturing group
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' #begin a character class string
CCS_END = r']+)' # end a character class string
PAT_OR = r'|'
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
s_pat = (NMGROUP_BEGIN +
BOL + CCS + WS_AFTER + PAT_OR +
WS_BEFORE + CCS + WS_AFTER + PAT_OR +
WS_BEFORE + CCS + EOL + PAT_OR +
BOL + CCS + EOL +
NMGROUP_END)
pat = re.compile(s_pat)
text = "This is sensible. This not: occo cttc"
pat.findall(text)
# returns: [('', 'occo', '', ''), ('', '', 'cttc', '')]
So, the crazy thing is that when we have alternative patterns that could match, re.findall() seems to return an empty string for the alternatives that didn't match. So we just need to filter out the length-zero strings from our results:
import itertools as it
raw_results = pat.findall(text)
results = [s for s in it.chain(*raw_results) if s]
# results set to: ['occo', 'cttc']
I guess it might be less confusing to just build four different patterns, run re.findall() on each, and join the results together.
EDIT: Okay, here is the code for building four patterns and trying each. I think this is an improvement.
import re
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' #begin a character class string
CCS_END = r']+)' # end a character class string
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
lst_s_pat = [
    BOL + CCS + WS_AFTER,
    WS_BEFORE + CCS + WS_AFTER,
    WS_BEFORE + CCS + EOL,
    BOL + CCS + EOL
]
lst_pat = [re.compile(s) for s in lst_s_pat]
text = "This is sensible. This not: occo cttc"
result = []
for pat in lst_pat:
    result.extend(pat.findall(text))
# result set to: ['occo', 'cttc']
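As a more compact variant of the same idea, the four cases can probably be collapsed into one pattern with negative lookarounds: "not preceded by a non-whitespace character" covers both start of text and preceding whitespace, and likewise at the end. This is only a sketch in Python 3 syntax, with set_of_chars hard-coded as a stand-in, and it shares the limitation noted above that punctuation does not delimit words.

```python
import re

set_of_chars = "ocat"  # stand-in for the characters obtained elsewhere

# (?<!\S) : no non-whitespace character immediately before
# (?!\S)  : no non-whitespace character immediately after
pat = re.compile(r'(?<!\S)[' + re.escape(set_of_chars) + r']+(?!\S)')

text = "This is sensible. This not: occo cttc"
result = pat.findall(text)
print(result)
```

Since the lookarounds consume nothing, adjacent words no longer compete for the whitespace between them.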
EDIT: Okay, here is a very different approach. I like this one best.
First, we will match all words in the text. A word is defined as one or more characters that are not punctuation and are not white space.
Then, we use a filter to remove words from the above; we keep only words that are made only of the characters we want.
import re
import string
# Create a pattern that matches all characters not part of a word.
#
# Note that '-' has a special meaning inside a character class, but it
# is valid punctuation that we want to match, so put in a backslash in
# front of it to disable the special meaning and just match it.
#
# Use '^' which negates all the chars following.  So, a word is a series
# of characters that are all not whitespace and not punctuation.
WORD_BOUNDARY = string.whitespace + string.punctuation.replace('-', r'\-')
WORD = r'[^' + WORD_BOUNDARY + r']+'
# Create a pattern that matches only the words we want.
set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"
# build up character class string pattern; the trailing '$' makes
# pat.match() accept a word only if every character is in the set
CCS = r'[' + set_of_chars + r']+$'
pat_word = re.compile(WORD)
pat = re.compile(CCS)
text = "This is sensible.  This not: occo cttc"
# This makes it clear how we are doing this.
all_words = pat_word.findall(text)
result = [s for s in all_words if pat.match(s)]
# "lazy" generator expression that yields up good results when iterated
# May be better for very large texts.
result_genexp = (s for s in (m.group(0) for m in pat_word.finditer(text)) if pat.match(s))
# force the expression to expand out to a list
result = list(result_genexp)
# result set to: ['occo', 'cttc']
EDIT: Now I don't like any of the above solutions; please see the other answer, the one using \b, for the best solution in Python.
It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
\b[ocat]+\b Now it only matches on word boundaries. (Thank you very much @Mark Tolonen for educating me about \b.)
So, just build up a pattern like the above, only using the desired character set at runtime, and there you go. You can use this pattern with re.findall() or re.finditer().
import re
words = ["cat", "dog", "tack", "coat"]
def get_words(chars_seq, words_seq=words):
    s_chars = ''.join(chars_seq)
    s_pat = r'\b[' + s_chars + r']+\b'
    pat = re.compile(s_pat)
    return [word for word in words_seq if pat.match(word)]
assert get_words(['o', 'c', 'a', 't']) == ["cat", "coat"]
assert get_words(['k', 'c', 't', 'a']) == ["cat", "tack"]
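One thing to watch when building a character class at runtime is that characters such as -, ^ and ] have a special meaning inside [...]. A hedged sketch in Python 3 syntax that escapes each character first and uses fullmatch in place of the \b anchors (an assumption that works here because every list entry is a single word); the 'a-z' test word is invented for illustration.

```python
import re

def get_words(chars_seq, words_seq):
    # Escape every character so '-', '^' or ']' are taken literally.
    char_class = "[" + "".join(re.escape(c) for c in chars_seq) + "]+"
    pat = re.compile(char_class)
    # fullmatch: the whole word must consist of allowed characters.
    return [word for word in words_seq if pat.fullmatch(word)]

words = ["cat", "dog", "tack", "coat"]
print(get_words(['o', 'c', 'a', 't'], words))
print(get_words(['a', '-', 'z'], ["a-z", "abz"]))
```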
I would not use regular expressions to solve this problem. I would rather use collections.Counter like so:
>>> from collections import Counter
>>> def get_words(word_list, letter_string):
...     return [word for word in word_list if Counter(word) & Counter(letter_string) == Counter(word)]
...
>>> words = ["cat", "dog", "tack", "coat"]
>>> letters = 'ocat'
>>> get_words(words, letters)
['cat', 'coat']
>>> letters = 'kcta'
>>> get_words(words, letters)
['cat', 'tack']
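This solution should also work for other languages. Counter(word) & Counter(letter_string) computes the intersection of the two counters, that is, for each letter x the minimum of the two counts, min(c[x], f[x]). If this intersection equals Counter(word), every letter of the word is available in letter_string at least as many times as the word needs it, so the word is returned as a match.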
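One consequence worth spelling out: the Counter comparison respects how many times each letter is available, whereas a character-class regex lets letters repeat freely. A small sketch in Python 3 syntax; counter_match and class_match are hypothetical helper names, and "tact" is an extra test word not in the original list.

```python
import re
from collections import Counter

def counter_match(word, letters):
    # Each letter may be used at most as often as it occurs in letters.
    return (Counter(word) & Counter(letters)) == Counter(word)

def class_match(word, letters):
    # Letters may be reused any number of times.
    return re.fullmatch("[" + re.escape(letters) + "]+", word) is not None

print(counter_match("cat", "ocat"), counter_match("tact", "ocat"))
print(class_match("cat", "ocat"), class_match("tact", "ocat"))
```

Which behaviour is right depends on whether the given letters are a set of allowed symbols or a limited supply of tiles.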

Python remove anything that is not a letter or number

I'm having a little trouble with Python regular expressions.
What is a good way to remove all characters in a string that are not letters or numbers?
Thanks!
[\w] matches (alphanumeric or underscore).
[\W] matches (not (alphanumeric or underscore)), which is equivalent to (not alphanumeric and not underscore)
You need [\W_] to remove ALL non-alphanumerics.
When using re.sub(), it will be much more efficient if you reduce the number of substitutions (expensive) by matching using [\W_]+ instead of doing it one at a time.
Now all you need is to define alphanumerics:
str object, only ASCII A-Za-z0-9:
re.sub(r'[\W_]+', '', s)
str object, only locale-defined alphanumerics:
re.sub(r'[\W_]+', '', s, flags=re.LOCALE)
unicode object, all alphanumerics:
re.sub(ur'[\W_]+', u'', s, flags=re.UNICODE)
Examples for str object:
>>> import re, locale
>>> sall = ''.join(chr(i) for i in xrange(256))
>>> len(sall)
256
>>> re.sub('[\W_]+', '', sall)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> locale.setlocale(locale.LC_ALL, '')
'English_Australia.1252'
>>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\
x9a\x9c\x9e\x9f\xaa\xb2\xb3\xb5\xb9\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\
xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\
xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\
xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
# above output wrapped at column 80
Unicode example:
>>> re.sub(ur'[\W_]+', u'', u'a_b A_Z \x80\xFF \u0404', flags=re.UNICODE)
u'abAZ\xff\u0404'
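For readers on Python 3, where str is already Unicode and \w is Unicode-aware by default, the same substitution might look like the sketch below; re.ASCII restricts \w to [A-Za-z0-9_]. The sample string is the one from the Unicode example above.

```python
import re

sample = 'a_b A_Z \x80\xff \u0404'

# Default in Python 3: all Unicode alphanumerics are kept.
unicode_alnum = re.sub(r'[\W_]+', '', sample)

# With re.ASCII only A-Za-z0-9 survive.
ascii_alnum = re.sub(r'[\W_]+', '', sample, flags=re.ASCII)

print(ascii(unicode_alnum), ascii(ascii_alnum))
```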
In the char set matching rule [...] you can specify ^ as first char to mean "not in"
import re
re.sub("[^0-9a-zA-Z]", # Anything except 0..9, a..z and A..Z
"", # replaced with nothing
"this is a test!!") # in this string
--> 'thisisatest'
'\W' is the same as [^A-Za-z0-9_]; when the re.LOCALE flag is set, accented characters that your locale treats as alphanumeric are also excluded from it.
>>> re.sub('\W', '', 'text 1, 2, 3...')
'text123'
Maybe you want to keep the spaces or have all the words (and numbers):
>>> re.findall('\w+', 'my. text, --without-- (punctuation) 123')
['my', 'text', 'without', 'punctuation', '123']
You can also try the isalpha and isnumeric methods in the following way (in Python 2, isnumeric is only available on unicode strings, hence the u prefix):
text = u'base, sample test;'
getVals = lambda word: (c for c in word if c.isalpha() or c.isnumeric())
map(lambda word: ''.join(getVals(word)), text.split(' '))
There are other ways you may also consider, e.g. simply loop through the string and skip unwanted chars. Assuming you want to delete all ASCII chars which are not letters or digits:
>>> import string
>>> newstring = [c for c in "a!1#b$2c%3\t\nx" if c in string.letters + string.digits]
>>> "".join(newstring)
'a1b2c3x'
Or use string.translate to map one char to another or to delete some chars, e.g.
>>> todelete = [ chr(i) for i in range(256) if chr(i) not in string.letters + string.digits ]
>>> todelete = "".join(todelete)
>>> "a!1#b$2c%3\t\nx".translate(None, todelete)
'a1b2c3x'
This way you need to calculate the todelete list only once (or it can be hard-coded) and can then reuse it everywhere you need to convert a string.
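In Python 3, str.translate no longer takes a deletechars argument, so the equivalent would go through str.maketrans. A hedged sketch that, like the original, only considers code points below 256:

```python
import string

keep = set(string.ascii_letters + string.digits)

# Characters (code points 0..255) that should be deleted.
todelete = "".join(chr(i) for i in range(256) if chr(i) not in keep)

# The third argument of str.maketrans lists characters to remove.
delete_table = str.maketrans("", "", todelete)

cleaned = "a!1#b$2c%3\t\nx".translate(delete_table)
print(cleaned)
```

The table can be built once at import time and reused for every string.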
You can use a predefined regex class in Python: \W corresponds to the set [^a-zA-Z0-9_]. Then,
import re
s = 'Hello dutrow 123'
re.sub('\W', '', s)
--> 'Hellodutrow123'
You need to be more specific:
What about Unicode "letters"? ie, those with diacriticals.
What about white space? (I assume this is what you DO want to delete along with punctuation)
When you say "letters" do you mean A-Z and a-z in ASCII only?
When you say "numbers" do you mean 0-9 only? What about decimals, separators and exponents?
It gets complex quickly...
A great place to start is an interactive regex site, such as RegExr
You can also get a Python-specific one: Python Regex Tool

Remove duplicate chars using regex?

Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -
import re
re.sub("a*", "a", "aaaa") # gives 'a'
What if I want to replace all duplicate chars (i.e. a,z) with that respective char? How do I do this?
import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'
NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes
>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'
The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.
Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then the entire found portion is replaced with a single occurrence of the found letter.
On a side note...
Your example code for just a is actually buggy:
>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'
You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".
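The empty matches are easy to inspect with finditer. A short sketch in Python 3 syntax: on a string without any a, 'a*' still matches at every position, while 'a+' only touches real runs of a.

```python
import re

# 'a*' matches the empty string at each position of 'bb', including the end.
empty_spans = [m.span() for m in re.finditer('a*', 'bb')]

# 'a+' needs at least one 'a', so nothing is inserted between other chars.
collapsed = re.sub('a+', 'a', 'aaabbbccc')

print(empty_spans)
print(collapsed)
```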
In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this
s="ababacbdefefbcdefde"
while re.search(r'([a-z])(.*)\1', s):
    s = re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print s # prints 'abcdef'
A solution that covers every kind of character, not just letters:
re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')
gives:
'ab['

How to substitute chars using unicode regex range

I am trying to remove chars from a unicode string. I have a whitelist of allowed unicode chars and I would like to remove everything that is not on the list.
allowed_list = ur'[\u0041-\u005A]|[\u0061-\u007A]|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u012F]|\u0131|[\u0386]|[\u0388-\u038A]'
negated_list = ur'[^\u0041-\u005A]|[^\u0061-\u007A]|[^\u00C0-\u00D6]|[^\u00D8-\u00F6]|[^\u00F8-\u012F]|^\u0131|[^\u0386]|[^\u0388-\u038A]'
I am testing it with a subset of my list and I don't get why it is not working.
This removes all but lowercase latin chars:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0061-\u007A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
'rugg'
This removes all but uppercase latin chars:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0041-\u005A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
'AT'
But when I combine them, all chars get removed:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0041-\u005A]|[^\u0061-\u007A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
''
When I tested the regex [^\u0041-\u005A]|[^\u0061-\u007A] on https://pythex.org/ it does what I am expecting, but when I attempt to use it in my code, it is not doing what I want it to. What am I missing?
Thank you in advance!
Your regex is not correct: | succeeds if either alternative matches, and every character is outside at least one of the two ranges.
You need to create one expression with multiple ranges,
[^\u0041-\u005A\u0061-\u007A] will match any characters except range \u0041-\u005A or \u0061-\u007A.
import re
regex = r"[^\u0041-\u005A\u0061-\u007A]"
test_str = "Arugg^]T"
myre = re.compile(regex, re.UNICODE)
result = myre.sub('', test_str)
print(result)
# output,
AruggT
In a positive character class, the items are implicitly OR'd together.
Your regex is then the same as
[\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u012f\u0131\u0386\u0388-\u038a]
But for the Negative regex class [^], items are individually negated then AND'ed together.
That regex is then
[^\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u012f\u0131\u0386\u0388-\u038a]
which is logically the same as
[^\u0041-\u005A] and [^\u0061-\u007A] and [^\u00C0-\u00D6] and [^\u00D8-\u00F6] and [^\u00F8-\u012F] and [^\u0131] and [^\u0386] and [^\u0388-\u038A]
What you tried to do was negate each item, then OR them together
which is not the same.
You are replacing every character that is not in [\u0041-\u005A] or not in [\u0061-\u007A] (the "not" comes from the ^).
No character can be in both ranges at once, so at least one of the two alternatives matches every character, and everything gets replaced by ''.
Use ur'[^\u0041-\u005A\u0061-\u007A]' instead (both ranges inside one [...]).
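The difference between the two patterns can be checked directly. A minimal sketch in Python 3 syntax (no ur prefix needed there), using the test string from the question; the variable names are illustrative.

```python
import re

mystr = "Arugg^]T"

# "not uppercase" OR "not lowercase": true for every single character.
negated_or = re.compile("[^\u0041-\u005A]|[^\u0061-\u007A]")

# "neither uppercase nor lowercase": one negated class with both ranges.
one_class = re.compile("[^\u0041-\u005A\u0061-\u007A]")

removed_everything = negated_or.sub("", mystr)
kept_letters = one_class.sub("", mystr)

print(repr(removed_everything), repr(kept_letters))
```

The remaining whitelist ranges can be appended inside the same negated class in the same way.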
