Regex that match when beginning & end is of the same length

Regex that match when beginning & end is of the same length - python

How do you make a regex that match when the beginning and the end is of the same length?
For example
>>> String = '[[A], [[B]], [C], [[D]]]'
>>> Result = re.findall(pattern, String)
>>> Result
>>> [ '[A]', '[[B]]', '[C]', '[[D]]' ]
Currently I use the pattern \[.*?\] but it resulted in
>>> ['[[A]', '[[B]', '[C]', '[[D]']
Thanks in advance.

You can define such a regular expression for a finite number of beginning/end characters (ie, something like "if it starts and ends with 1, or starts and ends with 2, or etc"). You, however, cannot do this for an unlimited number of characters. This is simply a fact of regular expressions. Regular expressions are the language of finite-state machines, and finite-state machines cannot do counting; at least the power of a pushdown-automaton (context-free grammar) is needed for that.
Put simply, a regular expression can say: "I saw x and then I saw y" but it cannot say "I saw x and then I saw y the same number of times" because it cannot remember how many times it saw x.
However, you can easily do this using the full power of the Python programming language, which is Turing-complete! Turing-complete languages can definitely do counting:
>>> string = '[[A], [[B]], [C], [[D]]]'
>>> sameBrackets = lambda s: len(re.findall('\[',s)) == len(re.findall('\]',s))
>>> filter(sameBrackets, string.split(", "))
['[[B]]', '[C]']

You can't. Sorry.
Python's regular expressions are an extension of "finite state automata", which only allow a finite amount of memory to be kept as you scan through the string for a match. This example requires an arbitrary amount of memory, depending on how many repetitions there are.
The only way in which Python allows more than just finite state is with "backreferences", which let you match an identical copy of a previously matched portion of the string -- but they don't allow you to match something with, say, the same number of characters.
You should try writing this by hand, instead.

To match balanced brackets you need a recursive regular expression. The stock re module doesn't support this syntax, but the alternative regex does:
import regex
r = r'\[(([^\[\]]+)|(?R))*\]'
print regex.match(r, '[[A], [[B]], [C], [[D]] ]') # ok
print regex.match(r, '[[A], [[B]], [C , [[D]] ]') # None
That expression basically says: match something surrounded by brackets, where "something" is either a series of non-brackets ([^\[\]]+) or the whole thing once again (?R).

Related

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But, it gives wrong output for first string. Also, when I tried another one (here):
(.)(.*?)*?\1
But, this one again gives wrong output. What's going wrong here?
The python code is a print statement:
print(re.sub(r'(.)(.*?)\1', '\g<2>', s)) # s is the string

It can be solved without regular expression, like below
>>>''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>>''.join([i for i in s if s.count(i) == 1])
'c'

re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
if newS == s:
break
s = newS
print(newS)
DEMO

Regular expressions doesn't seem to be the ideal solution
they don't handle overlapping so it it needs a loop (like in this answer) and it creates strings over and over (performance suffers)
they're overkill here, we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expression and without O(n**2) complexity, only O(n) using collections.Counter
first count the characters of the string very easily & quickly
then filter the string testing if the count matches using the counter we just created.
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)

EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like #jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub('\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds as long of a pattern for subpattern A that still allows for the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub('\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.

The site explains it well, hover and use the explanation section.
(.)(.*?)\1 Does not remove or match every double occurance. It matches 1 character, followed by anything in the middle sandwiched till that same character is encountered again.
so, for abbcabb the "sandwiched" portion should be bbc between two a
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
if i not in result:
result.append(i)
else:
result.remove(i)
print(''.join(result))
Note that this produces the "last" odd occurrence of a string and not first.
For "first" known occurance, you should use a counter as suggested in this answer . Just change the condition to check for odd counts. pseudo code(count[letter] %2 == 1)

Choose best fitting regex [duplicate]

Is it possible to match 2 regular expressions in Python?
For instance, I have a use-case wherein I need to compare 2 expressions like this:
re.match('google\.com\/maps', 'google\.com\/maps2', re.IGNORECASE)
I would expect to be returned a RE object.
But obviously, Python expects a string as the second parameter.
Is there a way to achieve this, or is it a limitation of the way regex matching works?
Background: I have a list of regular expressions [r1, r2, r3, ...] that match a string and I need to find out which expression is the most specific match of the given string. The way I assumed I could make it work was by:
(1) matching r1 with r2.
(2) then match r2 with r1.
If both match, we have a 'tie'. If only (1) worked, r1 is a 'better' match than r2 and vice-versa.
I'd loop (1) and (2) over the entire list.
I admit it's a bit to wrap one's head around (mostly because my description is probably incoherent), but I'd really appreciate it if somebody could give me some insight into how I can achieve this. Thanks!

Outside of the syntax clarification on re.match, I think I am understanding that you are struggling with taking two or more unknown (user input) regex expressions and classifying which is a more 'specific' match against a string.
Recall for a moment that a Python regex really is a type of computer program. Most modern forms, including Python's regex, are based on Perl. Perl's regex's have recursion, backtracking, and other forms that defy trivial inspection. Indeed a rogue regex can be used as a form of denial of service attack.
To see of this on your own computer, try:
>>> re.match(r'^(a+)+$','a'*24+'!')
That takes about 1 second on my computer. Now increase the 24 in 'a'*24 to a bit larger number, say 28. That take a lot longer. Try 48... You will probably need to CTRL+C now. The time increase as the number of a's increase is, in fact, exponential.
You can read more about this issue in Russ Cox's wonderful paper on 'Regular Expression Matching Can Be Simple And Fast'. Russ Cox is the Goggle engineer that built Google Code Search in 2006. As Cox observes, consider matching the regex 'a?'*33 + 'a'*33 against the string of 'a'*99 with awk and Perl (or Python or PCRE or Java or PHP or ...) Awk matches in 200 microseconds but Perl would require 1015 years because of exponential back tracking.
So the conclusion is: it depends! What do you mean by a more specific match? Look at some of Cox's regex simplification techniques in RE2. If your project is big enough to write your own libraries (or use RE2) and you are willing to restrict the regex grammar used (i.e., no backtracking or recursive forms), I think the answer is that you would classify 'a better match' in a variety of ways.
If you are looking for a simple way to state that (regex_3 < regex_1 < regex_2) when matched against some string using Python or Perl's regex language, I think that the answer is it is very very hard (i.e., this problem is NP Complete)
Edit
Everything I said above is true! However, here is a stab at sorting matching regular expressions based on one form of 'specific': How many edits to get from the regex to the string. The greater number of edits (or the higher the Levenshtein distance) the less 'specific' the regex is.
You be the judge if this works (I don't know what 'specific' means to you for your application):
import re
def ld(a,b):
"Calculates the Levenshtein distance between a and b."
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
a,b = b,a
n,m = m,n
current = range(n+1)
for i in range(1,m+1):
previous, current = current, [i]+[0]*n
for j in range(1,n+1):
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
if a[j-1] != b[i-1]:
change = change + 1
current[j] = min(add, delete, change)
return current[n]
s='Mary had a little lamb'
d={}
regs=[r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb',r'\b\w+mb',
r'Mary.*little lamb',r'.*[lL]ittle [Ll]amb',r'\blittle\b',s,r'little']
for reg in regs:
m=re.search(reg,s)
if m:
print "'%s' matches '%s' with sub group '%s'" % (reg, s, m.group(0))
ld1=ld(reg,m.group(0))
ld2=ld(m.group(0),s)
score=max(ld1,ld2)
print " %i edits regex->match(0), %i edits match(0)->s" % (ld1,ld2)
print " score: ", score
d[reg]=score
print
else:
print "'%s' does not match '%s'" % (reg, s)
print " ===== %s ===== === %s ===" % ('RegEx'.center(10),'Score'.center(10))
for key, value in sorted(d.iteritems(), key=lambda (k,v): (v,k)):
print " %22s %5s" % (key, value)
The program is taking a list of regex's and matching against the string Mary had a little lamb.
Here is the sorted ranking from "most specific" to "least specific":
===== RegEx ===== === Score ===
Mary had a little lamb 0
Mary.*little lamb 7
.*little lamb 11
little lamb 11
.*[lL]ittle [Ll]amb 15
\blittle\b 16
little 16
Mary 18
\b\w+mb 18
lamb 18
.* 22
This based on the (perhaps simplistic) assumption that: a) the number of edits (the Levenshtein distance) to get from the regex itself to the matching substring is the result of wildcard expansions or replacements; b) the edits to get from the matching substring to the initial string. (just take one)
As two simple examples:
.* (or .*.* or .*?.* etc) against any sting is a large number of edits to get to the string, in fact equal to the string length. This is the max possible edits, the highest score, and the least 'specific' regex.
The regex of the string itself against the string is as specific as possible. No edits to change one to the other resulting in a 0 or lowest score.
As stated, this is simplistic. Anchors should increase specificity but they do not in this case. Very short stings don't work because the wild-card may be longer than the string.
Edit 2
I got anchor parsing to work pretty darn well using the undocumented sre_parse module in Python. Type >>> help(sre_parse) if you want to read more...
This is the goto worker module underlying the re module. It has been in every Python distribution since 2001 including all the P3k versions. It may go away, but I don't think it is likely...
Here is the revised listing:
import re
import sre_parse
def ld(a,b):
"Calculates the Levenshtein distance between a and b."
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
a,b = b,a
n,m = m,n
current = range(n+1)
for i in range(1,m+1):
previous, current = current, [i]+[0]*n
for j in range(1,n+1):
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
if a[j-1] != b[i-1]:
change = change + 1
current[j] = min(add, delete, change)
return current[n]
s='Mary had a little lamb'
d={}
regs=[r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb',r'\b\w+mb',
r'Mary.*little lamb',r'.*[lL]ittle [Ll]amb',r'\blittle\b',s,r'little',
r'^.*lamb',r'.*.*.*b',r'.*?.*',r'.*\b[lL]ittle\b \b[Ll]amb',
r'.*\blittle\b \blamb$','^'+s+'$']
for reg in regs:
m=re.search(reg,s)
if m:
ld1=ld(reg,m.group(0))
ld2=ld(m.group(0),s)
score=max(ld1,ld2)
for t, v in sre_parse.parse(reg):
if t=='at': # anchor...
if v=='at_beginning' or 'at_end':
score-=1 # ^ or $, adj 1 edit
if v=='at_boundary': # all other anchors are 2 char
score-=2
d[reg]=score
else:
print "'%s' does not match '%s'" % (reg, s)
print
print " ===== %s ===== === %s ===" % ('RegEx'.center(15),'Score'.center(10))
for key, value in sorted(d.iteritems(), key=lambda (k,v): (v,k)):
print " %27s %5s" % (key, value)
And soted RegEx's:
===== RegEx ===== === Score ===
Mary had a little lamb 0
^Mary had a little lamb$ 0
.*\blittle\b \blamb$ 6
Mary.*little lamb 7
.*\b[lL]ittle\b \b[Ll]amb 10
\blittle\b 10
.*little lamb 11
little lamb 11
.*[lL]ittle [Ll]amb 15
\b\w+mb 15
little 16
^.*lamb 17
Mary 18
lamb 18
.*.*.*b 21
.* 22
.*?.* 22

It depends on what kind of regular expressions you have; as #carrot-top suggests, if you actually aren't dealing with "regular expressions" in the CS sense, and instead have crazy extensions, then you are definitely out of luck.
However, if you do have traditional regular expressions, you might make a bit more progress. First, we could define what "more specific" means. Say R is a regular expression, and L(R) is the language generated by R. Then we might say R1 is more specific than R2 if L(R1) is a (strict) subset of L(R2) (L(R1) < L(R2)). That only gets us so far: in many cases, L(R1) is neither a subset nor a superset of L(R2), and so we might imagine that the two are somehow incomparable. An example, trying to match "mary had a little lamb", we might find two matching expressions: .*mary and lamb.*.
One non-ambiguous solution is to define specificity via implementation. For instance, convert your regular expression in a deterministic (implementation-defined) way to a DFA and simply count states. Unfortunately, this might be relatively opaque to a user.
Indeed, you seem to have an intuitive notion of how you want two regular expressions to compare, specificity-wise. Why not simple write down a definition of specificity, based on the syntax of regular expressions, that matches your intuition reasonably well?
Totally arbitrary rules follow:
Characters = 1.
Character ranges of n characters = n (and let's say \b = 5, because I'm not sure how you might choose to write it out long-hand).
Anchors are 5 each.
* divides its argument by 2.
+ divides its argument by 2, then adds 1.
. = -10.
Anyway, just food for thought, as the other answers do a good job of outlining some of the issues you're facing; hope it helps.

I don't think it's possible.
An alternative would be to try to calculate the number of strings of length n that the regular expression also matches. A regular expression that matches 1,000,000,000 strings of length 15 characters is less specific than one that matches only 10 strings of length 15 characters.
Of course, calculating the number of possible matches is not trivial unless the regular expressions are simple.

Option 1:
Since users are supplying the regexes, perhaps ask them to also submit some test strings which they think are illustrative of their regex's specificity. (i.e. that show their regex is more specific than a competitor's regex.) Collect all the user's submitted test strings, and then test all the regexes against the complete set of test strings.
To design a good regex, the author must have put thought into what strings match and don't match their regex, so it should be easy for them to supply good test strings.
Option 2:
You might try a Monte Carlo approach: Starting with the string that both regexes match, write a generator which generates mutations of that string (permute characters, add/remove characters, etc.) If both regexes match or don't match the same way for each mutation, then the regexes "probably tie". If one matches a mutations that the other doesn't, and vice versa, then they "absolutely tie".
But if one matches a strict superset of mutations then it is "probably less specific" than the other.
The verdict after a large number of mutations may not always be correct, but may be reasonable.
Option 3:
Use ipermute or pyParsing's invert to generate strings which match each regex. This will only work on a regexes that use a limited subset of regex syntax.

I think you could do it by looking the result of matching with the longest result
>>> m = re.match(r'google\.com\/maps','google.com/maps/hello')
>>> len(m.group(0))
15
>>> m = re.match(r'google\.com\/maps2','google.com/maps/hello')
>>> print (m)
None
>>> m = re.match(r'google\.com\/maps','google.com/maps2/hello')
>>> len(m.group(0))
15
>>> m = re.match(r'google\.com\/maps2','google.com/maps2/hello')
>>> len(m.group(0))
16

re.match('google\.com\/maps', 'google\.com\/maps2', re.IGNORECASE)
The second item to re.match() above is a string -- that's why it's not working: the regex says to match a period after google, but instead it finds a backslash. What you need to do is double up the backslashes in the regex that's being used as a regex:
def compare_regexes(regex1, regex2):
"""returns regex2 if regex1 is 'smaller' than regex2
returns regex1 if they are the same
returns regex1 if regex1 is 'bigger' than regex2
otherwise returns None"""
regex1_mod = regex1.replace('\\', '\\\\')
regex2_mod = regex2.replace('\\', '\\\\')
if regex1 == regex2:
return regex1
if re.match(regex1_mod, regex2):
return regex2
if re.match(regex2_mod, regex1):
return regex1
You can change the returns to whatever suits your needs best. Oh, and make sure you are using raw strings with re. r'like this, for example'

Is it possible to match 2 regular expressions in Python?
That certainly is possible. Use parenthetical match groups joined by | for alteration. If you arrange the parenthetical match groups by most specific regex to least specific, the rank in the returned tuple from m.groups() will show how specific your match is. You can also use named groups to name how specific your match is, such as s10 for very specific and s0 for a not so specific match.
>>> s1='google.com/maps2text'
>>> s2='I forgot my goggles at the house'
>>> s3='blah blah blah'
>>> m1=re.match(r'(^google\.com\/maps\dtext$)|(.*go[a-z]+)',s1)
>>> m2=re.match(r'(^google\.com\/maps\dtext$)|(.*go[a-z]+)',s2)
>>> m1.groups()
('google.com/maps2text', None)
>>> m2.groups()
(None, 'I forgot my goggles')
>>> patt=re.compile(r'(?P<s10>^google\.com\/maps\dtext$)|
... (?P<s5>.*go[a-z]+)|(?P<s0>[a-z]+)')
>>> m3=patt.match(s3)
>>> m3.groups()
(None, None, 'blah')
>>> m3.groupdict()
{'s10': None, 's0': 'blah', 's5': None}
If you do not know ahead of time which regex is more specific, this is a much harder problem to solve. You want to have a look at this paper covering security of regex matches against file system names.

I realize that this is a non-solution, but as there is no unambiguous way to tell which is the "most specific match", certainly when it depends on what your users "meant", the easiest thing to do would be to ask them to provide their own priority. For example just by putting the regexes in the right order. Then you can simply take the first one that matches. If you expect the users to be comfortable with regular expressions anyway, this is maybe not too much to ask?

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because thee will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!

Use this:
O,2,((?:[0-9],?){0,}),X
See it in action:http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print matches
Outputs:
['9,6,7,11,8']
Explained:
By wrapping the entire value capture between 2, and ,X in (), you end up capturing that as well. I then used the (?: ) to ignore the inner captured set.

you don't have to use regex
split the string to array
check item 0 == 0 , item 1==2
check last item == X
check item[2:-2] each one of them is a number (is_digit)
that's all

regular expression that reference a match from earlier part of expression

I'm looking for a regular expression that will identify a sequence in which an integer in the text specifies the number of trailing letters at the end of the expression. This specific example applies to identifying insertions and deletions in genetic data in the pileup format.
For example:
If the text I am searching is:
AtT+3ACGTTT-1AaTTa
I need to match the insertions and deletions, which in this case are +3ACG and -1A. The integer (n) portion can be any integer larger than 1, and I must capture the n trailing characters.
I can match a single insertion or deletion with [+-]?[0-9]+[ACGTNacgtn], but I can't figure out how to grab the exact number of trailing ACGTN's specified by the integer.
I apologize if there is an obvious answer here, I have been searching for hours. Thanks!
(UPDATE)
I typically work in Python. The one workaround I've been able to figure out with the re module in python is to call both the integers and span of every in/del and combine the two to extract the appropriate length of text.
For example:
>>> import re
>>> a = 'ATTAA$At^&atAA-1A+1G+4ATCG'
>>> expr = '[+-]?([0-9]+)[ACGTNacgtn]'
>>> ints = re.findall(expr, a) #returns a list of the integers
>>> spans = [i.span() for i in re.finditer(expr,a)]
>>> newspans = [(spans[i][0],spans[i][1]+(int(indel[i])-1)) for i in range(len(spans))]
>>> newspans
>>> [(14, 17), (17, 20), (20, 26)]
The resulting tuples allow me to slice out the indels. Probably not the best syntax, but it works!

You can use regular expression substitution passing a function as replacement... for example
s = "abcde+3fghijkl-1mnopqr+12abcdefghijklmnoprstuvwxyz"
import re
def dump(match):
start, end = match.span()
print s[start:end + int(s[start+1:end])]
re.sub(r'[-+]\d+', dump, s)
#output
# +3fgh
# -1m
# +12abcdefghijkl

It's not directly possible, regexes can't 'count' like that.
But if you're using a programming language that allows callbacks as a regex match evaluator (e.g. C#, PHP), then what you could do is have the regex as [+-]?([0-9]+)([ACGTNacgtn]+) and in the callback trim the trailing characters to the desired length.
e.g. for C#
var regexMatches = new List<string>();
Regex theRegex = new Regex(#"[+-]?([0-9]+)([ACGTNacgtn]+)");
text = theRegex.Replace(text, delegate(Match thisMatch)
{
int numberOfInsertsOrDeletes = Convert.ToInt32(thisMatch.Groups[1].Value);
string trailingString = thisMatch.Groups[2].Value;
if (numberOfInsertsOrDeletes > trailingString.Length)
{ trailingString = trailingString.Substring(0, numberOfInsertsOrDeletes); }
regexMatches.Add(trailingString);
return thisMatch.Groups[0].Value;
});

The simple Perl pattern for matching an integer followed by that number of any character is just:
(\d+)(??{"." x $1})
which is quite straight-forward, I think you’ll agree. For example, this snippet:
my $string = "AtT+3ACGTTT-1AaTTa";
print "Matched $&\n" while $string =~ m{
( \d+ ) # capture an integer into $1
(??{ "." x $1 }) # interpolate that many dots back into pattern
}xg;
Merrily prints out the expected
Matched 3ACG
Matched 1A
EDIT
Oh drat, I see you just added the Python tag since I began editing. Oops. Well, maybe this will be helpful to you anyway.
That said, if what you are actually looking for is fuzzy matching where you allow for some number of insertions and deletions (the edit distance), then Matthew Barnett’s regex library for Python will handle that. That doesn’t seem to be quite what you’re doing, as the insertions and deletions are actually represented in your strings.
But Matthew’s library is really very good and very interesting, and it even does many things that Perl cannot do. :) It’s a drop-in replacement for the standard Python re library.

python regex in pyparsing

How do you make the below regex be used in pyparsing? It should return a list of tokens given the regex.
Any help would be greatly appreciated! Thank you!
python regex example in the shell:
>>> re.split("(\w+)(lab)(\d+)", "abclab1", 3)
>>> ['', 'abc', 'lab', '1', '']
I tried this in pyparsing, but I can't seem to figure out how to get it right because the first match is being greedy, i.e the first token will be 'abclab' instead of two tokens 'abc' and 'lab'.
pyparsing example (high level, i.e non working code):
name = 'abclab1'
location = Word(alphas).setResultsName('location')
lab = CaselessLiteral('lab').setResultsName('environment')
identifier = Word(nums).setResultsName('identifier')
expr = location + lab + identifier
match, start, end = expr.scanString(name).next()
print match.asDict()

Pyparsing's classes are pretty much left-to-right, with lookahead implemented using explicit expressions like FollowedBy (for positive lookahead) and NotAny or the '~' operator (for negative lookahead). This allows you to detect a terminator which would normally match an item that is being repeated. For instance, OneOrMore(Word(alphas)) + Literal('end') will never find a match in strings like "start blah blah end", because the terminating 'end' will get swallowed up in the repetition expression in OneOrMore. The fix is to add negative lookahead in the expression being repeated: OneOrMore(~Literal('end') + Word(alphas)) + Literal('end') - that is, before reading another word composed of alphas, first make sure it is not the word 'end'.
This breaks down when the repetition is within a pyparsing class, like Word. Word(alphas) will continue to read alpha characters as long as there is no whitespace to stop the word. You would have to break into this repetition using something very expensive, like Combine(OneOrMore(~Literal('lab') + Word(alphas, exact=1))) - I say expensive because composition of simple tokens using complex Combine expressions will make for a slow parser.
You might be able to compromise by using a regex wrapped in a pyparsing Regex object:
>>> labword = Regex(r'(\w+)(lab)(\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
This does the right kind of grouping and detection, but does not expose the groups themselves. To do that, add names to each group - pyparsing will treat these like results names, and give you access to the individual fields, just as if you had called setResultsName:
>>> labword = Regex(r'(?P<locn>\w+)(?P<env>lab)(?P<identifier>\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
- env: lab
- identifier: 1
- locn: abc
>>> print labword.parseString("abclab1").asDict()
{'identifier': '1', 'locn': 'abc', 'env': 'lab'}
The only other non-regex approach I can think of would be to define an expression to read the whole string, and then break up the parts in a parse action.

If you strip the subgroup sign(the parenthesis), you'll get the right answer:)
>>> re.split("\w+lab\d+", "abclab1")
['', '']

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regex that match when beginning & end is of the same length - python

Related

Python re.sub() is not replacing every match

Choose best fitting regex [duplicate]

Python Regular Expressions Findall

regular expression that reference a match from earlier part of expression

python regex in pyparsing

Categories

Resources