python regex in pyparsing

How do you make the below regex be used in pyparsing? It should return a list of tokens given the regex.
Any help would be greatly appreciated! Thank you!
Python regex example in the shell:
>>> import re
>>> re.split(r"(\w+)(lab)(\d+)", "abclab1", maxsplit=3)
['', 'abc', 'lab', '1', '']
I tried this in pyparsing, but I can't get it right because the first match is greedy, i.e. the first token is 'abclab' instead of the two tokens 'abc' and 'lab'.
pyparsing example (high level, i.e. non-working code):
from pyparsing import Word, CaselessLiteral, alphas, nums

name = 'abclab1'
location = Word(alphas).setResultsName('location')
lab = CaselessLiteral('lab').setResultsName('environment')
identifier = Word(nums).setResultsName('identifier')
expr = location + lab + identifier
# Finds no match: Word(alphas) consumes 'abclab', leaving nothing for lab
match, start, end = next(expr.scanString(name))
print(match.asDict())

Pyparsing's classes are pretty much left-to-right, with lookahead implemented using explicit expressions like FollowedBy (for positive lookahead) and NotAny or the '~' operator (for negative lookahead). This allows you to detect a terminator which would normally match an item that is being repeated. For instance, OneOrMore(Word(alphas)) + Literal('end') will never find a match in strings like "start blah blah end", because the terminating 'end' will get swallowed up in the repetition expression in OneOrMore. The fix is to add negative lookahead in the expression being repeated: OneOrMore(~Literal('end') + Word(alphas)) + Literal('end') - that is, before reading another word composed of alphas, first make sure it is not the word 'end'.
This breaks down when the repetition is within a pyparsing class, like Word. Word(alphas) will continue to read alpha characters as long as there is no whitespace to stop the word. You would have to break into this repetition using something very expensive, like Combine(OneOrMore(~Literal('lab') + Word(alphas, exact=1))) - I say expensive because composition of simple tokens using complex Combine expressions will make for a slow parser.
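To make the character-at-a-time lookahead concrete, here is a plain-Python sketch of the same logic, written without pyparsing. The helper name read_until_literal is hypothetical; it only mimics what ~Literal('lab') + Word(alphas, exact=1) does on each step, and it is not how pyparsing is implemented.

```python
def read_until_literal(text, stop="lab"):
    """Consume leading letters from text, stopping just before `stop`.

    Hypothetical helper: it checks a negative lookahead before every
    single character, like the Combine(OneOrMore(...)) expression.
    Note that str.isalpha() also accepts non-ASCII letters, unlike
    pyparsing's alphas.
    """
    i = 0
    # Lookahead first, then consume exactly one alphabetic character.
    while i < len(text) and text[i].isalpha() and not text.startswith(stop, i):
        i += 1
    return text[:i], text[i:]


print(read_until_literal("abclab1"))  # ('abc', 'lab1')
```

The per-character check is what makes this approach slow in a parser: the lookahead runs once for every letter consumed.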
You might be able to compromise by using a regex wrapped in a pyparsing Regex object:
>>> labword = Regex(r'(\w+)(lab)(\d+)')
>>> print(labword.parseString("abclab1").dump())
['abclab1']
This does the right kind of grouping and detection, but does not expose the groups themselves. To do that, add names to each group - pyparsing will treat these like results names, and give you access to the individual fields, just as if you had called setResultsName:
>>> labword = Regex(r'(?P<locn>\w+)(?P<env>lab)(?P<identifier>\d+)')
>>> print(labword.parseString("abclab1").dump())
['abclab1']
- env: lab
- identifier: 1
- locn: abc
>>> print(labword.parseString("abclab1").asDict())
{'identifier': '1', 'locn': 'abc', 'env': 'lab'}
The only other non-regex approach I can think of would be to define an expression to read the whole string, and then break up the parts in a parse action.
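As a rough sketch of that last idea, the splitting step could look like the function below. It is plain Python and deliberately leaves out the pyparsing wiring; the name split_lab_name and the marker parameter are assumptions made for illustration, and the function would be called from a parse action on the token that matched the whole string.

```python
def split_lab_name(token, marker="lab"):
    """Split a whole token such as 'abclab1' into (location, environment, identifier).

    Hypothetical helper: the body of a parse action, without the pyparsing part.
    """
    # Peel the trailing digits off the end of the token.
    head = token.rstrip("0123456789")
    identifier = token[len(head):]
    # What is left must be letters ending in the marker, with something before it.
    if (not identifier or not head.isalpha()
            or len(head) <= len(marker)
            or not head.lower().endswith(marker)):
        raise ValueError("not of the form <location>%s<digits>: %r" % (marker, token))
    cut = len(head) - len(marker)
    return head[:cut], head[cut:], identifier


print(split_lab_name("abclab1"))  # ('abc', 'lab', '1')
```

Working from the right-hand end avoids the greediness problem entirely, because the digits and the marker are removed before the location is read.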

If you strip the subgroup markers (the parentheses), you'll get the right answer :)
>>> re.split(r"\w+lab\d+", "abclab1")
['', '']

Related

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex:
(.)(.*?)\1
But it gives the wrong output for the first string. I also tried another one:
(.)(.*?)*?\1
But this one gives the wrong output too. What's going wrong here?
The Python code is a print statement:
print(re.sub(r'(.)(.*?)\1', r'\g<2>', s)) # s is the string

It can be solved without regular expressions, as below:
>>> s = 'abbcabb'
>>> s1 = 'abca'
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'

re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)
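To see the individual passes, a small variant of that loop can record each intermediate string. This is only a sketch: the names PAIR and remove_pairs are made up for illustration, and re.subn is used so the number of replacements tells us when to stop.

```python
import re

# Same pattern as above: a character, a lazy middle part, the same character again.
PAIR = re.compile(r'(.)(.*?)\1')


def remove_pairs(s):
    """Repeat the substitution until nothing changes.

    Returns the final string and the list of intermediate results.
    """
    steps = []
    while True:
        new_s, count = PAIR.subn(r'\g<2>', s)
        if count == 0:  # no match left, so s is final
            return s, steps
        steps.append(new_s)
        s = new_s


print(remove_pairs("abbcabb"))  # ('c', ['bbc', 'c'])
print(remove_pairs("abca"))     # ('bc', ['bc'])
```

Each successful pass removes two characters per match, so the loop always terminates.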

Regular expressions don't seem to be the ideal solution:
they don't handle overlapping matches, so a loop is needed (like in this answer), and that loop creates strings over and over (performance suffers);
they're overkill here, since we just need to count the characters.
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expressions and in O(n) instead of O(n**2) time, using collections.Counter:
first count the characters of the string in a single pass,
then filter the string, testing each character's count against the counter we just created.
Like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)
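A minimal sketch of that bonus, with the wanted count turned into a parameter. The function name keep_chars_with_count and its wanted argument are assumptions, not part of the answer above.

```python
from collections import Counter


def keep_chars_with_count(s, wanted=1):
    """Keep the characters of s that occur exactly `wanted` times, in order."""
    counts = Counter(s)  # one pass to count every character
    return "".join(c for c in s if counts[c] == wanted)


print(keep_chars_with_count("abbcabb"))     # c
print(keep_chars_with_count("abbcabb", 2))  # aa
```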

EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds the longest match for subpattern A that still allows the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.

The site explains it well; hover over the pattern and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything in the middle, sandwiched until that same character is encountered again.
So, for abbcabb the "sandwiched" portion is bbc, between the two a's.
EDIT:
You can try something like this instead, without regexes:
string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))
Note that this keeps the "last" odd occurrence of each character, not the first.
For the "first" occurrence, you should use a counter as suggested in this answer. Just change the condition to check for odd counts. Pseudocode: count[letter] % 2 == 1
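One way to read that suggestion, sketched with collections.Counter. The function name and the seen set are assumptions: the set is there so that a character occurring three or five times is emitted once, at its first position, which is one possible interpretation of "first occurrence".

```python
from collections import Counter


def odd_chars_first_occurrence(s):
    """Keep each character with an odd total count, once, at its first position."""
    counts = Counter(s)
    seen = set()
    kept = []
    for ch in s:
        if counts[ch] % 2 == 1 and ch not in seen:
            seen.add(ch)
            kept.append(ch)
    return "".join(kept)


print(odd_chars_first_occurrence("abbcabb"))  # c
print(odd_chars_first_occurrence("abaca"))    # abc
```

For comparison, the list-based loop above turns "abaca" into "bca", because the surviving a is the last one.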

Regular Expression (find matching characters in order)

Let us say that I have the following string variables:
welcome = "StackExchange 2016"
string_to_find = "Sx2016"
Here, I want to find the string string_to_find inside welcome using regular expressions. I want to see if each character in string_to_find comes in the same order as in welcome.
For instance, this expression would evaluate to True since the 'S' comes before the 'x' in both strings, the 'x' before the '2', the '2' before the 0, and so forth.
Is there a simple way to do this using regex?

The answer is fairly simple. The .* character combination matches 0 or more characters. For your purpose, you would put it between all the characters of the string you are looking for, as in S.*x.*2.*0.*1.*6. If this pattern matches, then the string obeys your condition.
For a general string you would insert the .* pattern between characters, also taking care of escaping special characters like literal dots, stars etc. that may otherwise be interpreted by regex.
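A small sketch of that general case, using re.escape for the escaping. The names subsequence_regex and in_order are made up here, and passing re.DOTALL (so that .* can also cross newlines) is an extra assumption beyond what the answer states.

```python
import re


def subsequence_regex(chars):
    """Build 'c1.*c2.*c3...' with every character escaped."""
    return ".*".join(re.escape(c) for c in chars)


def in_order(chars, text):
    """True if the characters of chars appear in text in the same order."""
    return re.search(subsequence_regex(chars), text, re.DOTALL) is not None


print(in_order("Sx2016", "StackExchange 2016"))  # True
print(in_order("a.b", "axb"))                    # False
```

Without the escaping, the second call would report a match, because the unescaped dot in "a.b" would match the x.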

This function might fit your need
import re
def check_string(text, pattern):
    # re.search rather than re.match, so the sequence may start anywhere in text
    return re.search('.*'.join(pattern), text)
'.*'.join(pattern) creates a pattern with all your characters separated by '.*'. For instance:
>>> ".*".join("Sx2016")
'S.*x.*2.*0.*1.*6'

Use wildcard matches with ., repeating with *:
expression = 'S.*x.*2.*0.*1.*6'
You can also assemble this expression with join():
expression = '.*'.join('Sx2016')
Or just find it without a regular expression, checking whether the location of each of string_to_find's characters within welcome proceeds in ascending order, handling the case where a character in string_to_find is not present in welcome by catching the ValueError:
>>> welcome = "StackExchange 2016"
>>> string_to_find = "Sx2016"
>>> try:
...     result = [welcome.index(c) for c in string_to_find]
... except ValueError:
...     result = None
...
>>> print(result and result == sorted(result))
True
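One caveat with the index-based check: str.index always returns the first occurrence of a character, so repeated characters can give a false negative (looking for "aba" in "baba" yields positions [1, 0, 1]). A sketch that avoids this by consuming the text with an iterator follows; the function name is_subsequence is an assumption.

```python
def is_subsequence(needle, haystack):
    """True if the characters of needle appear in haystack in the same order."""
    remaining = iter(haystack)
    # `c in remaining` advances the iterator just past the first match,
    # so each later character is only searched for after that point.
    return all(c in remaining for c in needle)


print(is_subsequence("Sx2016", "StackExchange 2016"))  # True
print(is_subsequence("aba", "baba"))                   # True
print(is_subsequence("6102xS", "StackExchange 2016"))  # False
```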

For a sequence of characters like Sx2016, the pattern that best serves your purpose is a more specific one:
S[^x]*x[^2]*2[^0]*0[^1]*1[^6]*6
You can perform this kind of check by defining a function like this:
import re
def contains_sequence(text, seq):
    # First character, then "anything but c, followed by c" for each later character
    pattern = seq[0] + ''.join('[^' + c + ']*' + c for c in seq[1:])
    return re.search(pattern, text)
This approach adds a layer of complexity but brings a couple of advantages as well:
It's the fastest one, because the regex engine walks down the string only once, while the dot-star approach goes to the end of the string and backtracks each time a .* is used. Compare on the same string (~1k chars):
Negated class -> 12 steps
Dot star -> 4426 steps
It also works on multiline input strings.
Example code:
>>> sequence = 'Sx2016'
>>> inputs = ['StackExchange2015','StackExchange2016','Stack\nExchange\n2015','Stach\nExchange\n2016']
>>> [x + ': yes' if contains_sequence(x, sequence) else x + ': no' for x in inputs]
['StackExchange2015: no', 'StackExchange2016: yes', 'Stack\nExchange\n2015: no', 'Stach\nExchange\n2016: yes']

RegEx to match a term before OR after another specific term

I'm looking for a squaremeter term in some kind of text using this RegExpression:
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
Works pretty well.
Now, this should only be matched if a string like "Wohnfläche"/"Wohnfl"/"Wfl" exists before OR after it. In other words: the latter term is mandatory, but its position is not.
Writing a RegEx for this is not the issue in general; my problem is how to write it most elegantly. Currently I only see one approach:
^[.]*[Wohnfläche|Wohnfl|Wfl]([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
and a second search, combined with an 'or' statement (I'm using Python):
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2][.]*[Wohnfläche|Wohnfl|Wfl]$
Ugly, isn't it? ;)

You can use alternation like this:
(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)
And check which capture group matched. It is not possible to make the restricting strings optional on both sides of a single pattern; they would simply be ignored.
Demo:
import re
pat = re.compile(r'(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)')
strs = ["12,56m qm Wohnfläche", "14.54 mqm Wohnfl", "Wfl 134 m qm"]
for x in strs:
    m = pat.search(x)
    if m:
        if m.group(1):  # First alternative found a match
            print("{} - {}".format(m.group(1), m.group(2)))
        else:  # Second alternative "won"
            print("{} - {}".format(m.group(3), m.group(4)))

Specify a logical conjunction in the controlling application, like (pseudo-code) <area-regex>.match(string) and <text-regex>.match(string).
This assumes that any pair of matches of the two regexen on the same string will never overlap ( if they did, you'd get a false positive ). Your regexen meet this requirement.
Note that your regex for the textual context contains the additional restriction that your test string either starts or ends with a match, while in your informal description you just require a match to either occur before or after the area spec. This difference is incorporated in pt vs pt_anchored in the code below.
Python fragment (untested):
import re
...
# pa: <area_regex>
# pt: <text_regex>
# pt_anchored: <text_regex>, anchored
# teststring: the string under test
pa = re.compile(r'([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]')
pt = re.compile(r'[.]*[Wohnfläche|Wohnfl|Wfl]')
pt_anchored = re.compile(r'^[.]*[Wohnfläche|Wohnfl|Wfl]|[.]*[Wohnfläche|Wohnfl|Wfl]$')
# search() rather than match(), since either pattern may occur anywhere in the string
if pa.search(teststring) and pt.search(teststring):
    print('Match found')
else:
    print('No match')
...
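A runnable sketch of the conjunction idea. It borrows the corrected sub-patterns from the alternation answer above rather than the character classes from the question; the names AREA, LABEL and find_area are assumptions. Like the fragment above, it relies on the two patterns never overlapping in the same string.

```python
import re

# Sub-patterns taken from the alternation answer above.
AREA = re.compile(r'(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(?:qm|m\u00B2)')
LABEL = re.compile(r'Wohnfläche|Wohnfl|Wfl')


def find_area(text):
    """Return the area figure if text contains both an area and a label."""
    area = AREA.search(text)
    # Logical conjunction: both patterns must occur somewhere in the string.
    if area and LABEL.search(text):
        return area.group(1)
    return None


for sample in ["12,56m qm Wohnfläche", "Wfl 134 m qm", "134 m qm"]:
    print(sample, "->", find_area(sample))
```

The order of the two terms no longer matters, at the cost of not checking that the label sits next to the area figure.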

regular expression that reference a match from earlier part of expression

I'm looking for a regular expression that will identify a sequence in which an integer in the text specifies the number of trailing letters at the end of the expression. This specific example applies to identifying insertions and deletions in genetic data in the pileup format.
For example:
If the text I am searching is:
AtT+3ACGTTT-1AaTTa
I need to match the insertions and deletions, which in this case are +3ACG and -1A. The integer (n) portion can be any positive integer, and I must capture the n trailing characters.
I can match a single insertion or deletion with [+-]?[0-9]+[ACGTNacgtn], but I can't figure out how to grab the exact number of trailing ACGTN's specified by the integer.
I apologize if there is an obvious answer here, I have been searching for hours. Thanks!
(UPDATE)
I typically work in Python. The one workaround I've been able to figure out with the re module in python is to call both the integers and span of every in/del and combine the two to extract the appropriate length of text.
For example:
>>> import re
>>> a = 'ATTAA$At^&atAA-1A+1G+4ATCG'
>>> expr = '[+-]?([0-9]+)[ACGTNacgtn]'
>>> ints = re.findall(expr, a) #returns a list of the integers (as strings)
>>> spans = [i.span() for i in re.finditer(expr,a)]
>>> newspans = [(spans[i][0],spans[i][1]+(int(ints[i])-1)) for i in range(len(spans))]
>>> newspans
[(14, 17), (17, 20), (20, 26)]
The resulting tuples allow me to slice out the indels. Probably not the best syntax, but it works!

You can use regular expression substitution, passing a function as the replacement. For example:
import re
s = "abcde+3fghijkl-1mnopqr+12abcdefghijklmnoprstuvwxyz"
def dump(match):
    start, end = match.span()
    # The digits after the sign say how many extra characters to include
    print(s[start:end + int(s[start+1:end])])
    return match.group(0)  # leave the text itself unchanged
re.sub(r'[-+]\d+', dump, s)
#output
# +3fgh
# -1m
# +12abcdefghijkl

It's not directly possible, regexes can't 'count' like that.
But if you're using a programming language that allows callbacks as a regex match evaluator (e.g. C#, PHP), then what you could do is have the regex as [+-]?([0-9]+)([ACGTNacgtn]+) and in the callback trim the trailing characters to the desired length.
e.g. for C#
var regexMatches = new List<string>();
Regex theRegex = new Regex(@"[+-]?([0-9]+)([ACGTNacgtn]+)");
text = theRegex.Replace(text, delegate(Match thisMatch)
{
    int numberOfInsertsOrDeletes = Convert.ToInt32(thisMatch.Groups[1].Value);
    string trailingString = thisMatch.Groups[2].Value;
    // Trim only when more letters were captured than the integer asks for
    if (numberOfInsertsOrDeletes < trailingString.Length)
    { trailingString = trailingString.Substring(0, numberOfInsertsOrDeletes); }
    regexMatches.Add(trailingString);
    return thisMatch.Groups[0].Value;
});
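Since the question is tagged Python, here is a sketch of the same "match greedily, then trim" idea with re.finditer. The names INDEL and find_indels are assumptions, and unlike the pattern above the sign is made mandatory here so that each result carries its + or -.

```python
import re

# Sign, count, then a greedy run of bases that is trimmed afterwards.
INDEL = re.compile(r'([+-])([0-9]+)([ACGTNacgtn]+)')


def find_indels(pileup):
    """Return every indel as sign + count + exactly `count` bases."""
    found = []
    for m in INDEL.finditer(pileup):
        sign, digits, bases = m.groups()
        count = int(digits)
        if count <= len(bases):  # skip malformed entries with too few bases
            found.append(sign + digits + bases[:count])
    return found


print(find_indels("AtT+3ACGTTT-1AaTTa"))          # ['+3ACG', '-1A']
print(find_indels("ATTAA$At^&atAA-1A+1G+4ATCG"))  # ['-1A', '+1G', '+4ATCG']
```

The greedy run cannot swallow the next indel, because + and - are not in the base character class.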

The simple Perl pattern for matching an integer followed by that number of any character is just:
(\d+)(??{"." x $1})
which is quite straight-forward, I think you’ll agree. For example, this snippet:
my $string = "AtT+3ACGTTT-1AaTTa";
print "Matched $&\n" while $string =~ m{
    ( \d+ )            # capture an integer into $1
    (??{ "." x $1 })   # interpolate that many dots back into pattern
}xg;
Merrily prints out the expected
Matched 3ACG
Matched 1A
EDIT
Oh drat, I see you just added the Python tag since I began editing. Oops. Well, maybe this will be helpful to you anyway.
That said, if what you are actually looking for is fuzzy matching where you allow for some number of insertions and deletions (the edit distance), then Matthew Barnett’s regex library for Python will handle that. That doesn’t seem to be quite what you’re doing, as the insertions and deletions are actually represented in your strings.
But Matthew’s library is really very good and very interesting, and it even does many things that Perl cannot do. :) It’s a drop-in replacement for the standard Python re library.

Regex that match when beginning & end is of the same length

How do you make a regex that match when the beginning and the end is of the same length?
For example:
>>> String = '[[A], [[B]], [C], [[D]]]'
>>> Result = re.findall(pattern, String)
>>> Result
[ '[A]', '[[B]]', '[C]', '[[D]]' ]
Currently I use the pattern \[.*?\] but it results in
['[[A]', '[[B]', '[C]', '[[D]']
Thanks in advance.

You can define such a regular expression for a finite number of beginning/end characters (ie, something like "if it starts and ends with 1, or starts and ends with 2, or etc"). You, however, cannot do this for an unlimited number of characters. This is simply a fact of regular expressions. Regular expressions are the language of finite-state machines, and finite-state machines cannot do counting; at least the power of a pushdown-automaton (context-free grammar) is needed for that.
Put simply, a regular expression can say: "I saw x and then I saw y" but it cannot say "I saw x and then I saw y the same number of times" because it cannot remember how many times it saw x.
However, you can easily do this using the full power of the Python programming language, which is Turing-complete! Turing-complete languages can definitely do counting:
>>> import re
>>> string = '[[A], [[B]], [C], [[D]]]'
>>> sameBrackets = lambda s: len(re.findall(r'\[', s)) == len(re.findall(r'\]', s))
>>> list(filter(sameBrackets, string.split(", ")))
['[[B]]', '[C]']

You can't. Sorry.
Python's regular expressions are an extension of "finite state automata", which only allow a finite amount of memory to be kept as you scan through the string for a match. This example requires an arbitrary amount of memory, depending on how many repetitions there are.
The only way in which Python allows more than just finite state is with "backreferences", which let you match an identical copy of a previously matched portion of the string -- but they don't allow you to match something with, say, the same number of characters.
You should try writing this by hand, instead.
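A hand-written version could be as small as a depth counter. The sketch below is one possible reading of the question: it returns the bracketed items that sit directly inside the outermost pair. The function name bracket_items is an assumption.

```python
def bracket_items(text):
    """Return the balanced bracket groups directly inside the outermost pair."""
    items = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "[":
            depth += 1
            if depth == 2:  # an item opens directly inside the outer pair
                start = i
        elif ch == "]":
            if depth == 2 and start is not None:
                items.append(text[start:i + 1])  # the item closes here
                start = None
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ']' at position %d" % i)
    if depth != 0:
        raise ValueError("unbalanced '['")
    return items


print(bracket_items("[[A], [[B]], [C], [[D]]]"))
# ['[A]', '[[B]]', '[C]', '[[D]]']
```

The counter is the "arbitrary amount of memory" that a plain regular expression lacks.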

To match balanced brackets you need a recursive regular expression. The stock re module doesn't support this syntax, but the alternative regex does:
import regex
r = r'\[(([^\[\]]+)|(?R))*\]'
print(regex.match(r, '[[A], [[B]], [C], [[D]] ]')) # ok
print(regex.match(r, '[[A], [[B]], [C , [[D]] ]')) # None
That expression basically says: match something surrounded by brackets, where "something" is either a series of non-brackets ([^\[\]]+) or the whole thing once again (?R).
