I need to find all the strings matching a pattern with the exception of two given strings.
For example, find all groups of letters with the exception of aa and bb. Starting from this string:
-a-bc-aa-def-bb-ghij-
Should return:
('a', 'bc', 'def', 'ghij')
I tried with this regular expression that captures 4 strings. I thought I was getting close, but (1) it doesn't work in Python and (2) I can't figure out how to exclude a few strings from the search. (Yes, I could remove them later, but my real regular expression does everything in one shot and I would like to include this last step in it.)
I said it doesn't work in Python because I tried this, expecting the exact same result, but instead I get only the first group:
>>> import re
>>> re.search('-(\w.*?)(?=-)', '-a-bc-def-ghij-').groups()
('a',)
I tried with negative look ahead, but I couldn't find a working solution for this case.
You can make use of negative look aheads.
For example,
>>> re.findall(r'-(?!aa|bb)([^-]+)', string)
['a', 'bc', 'def', 'ghij']
- Matches -
(?!aa|bb) Negative lookahead, checks if - is not followed by aa or bb
([^-]+) Matches one or more characters other than -
Edit
The above regex will not match those which start with aa or bb, for example like -aabc-. To take care of that we can add - to the lookaheads like,
>>> re.findall(r'-(?!aa-|bb-)([^-]+)', string)
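As a quick check of the difference between the two lookaheads, here is a small sketch. The sample string with the extra -aabc- token is made up for illustration and is not taken from the question.

```python
import re

# Hypothetical sample: the question's string plus an extra token, "aabc",
# that merely starts with one of the excluded strings.
sample = "-a-bc-aa-aabc-def-bb-ghij-"

# Lookahead without the trailing hyphen: rejects anything starting with aa or bb.
loose = re.findall(r'-(?!aa|bb)([^-]+)', sample)

# Lookahead with the trailing hyphen: rejects only the exact tokens aa and bb.
strict = re.findall(r'-(?!aa-|bb-)([^-]+)', sample)

print(loose)   # ['a', 'bc', 'def', 'ghij']
print(strict)  # ['a', 'bc', 'aabc', 'def', 'ghij']
```

Only the second pattern keeps aabc, which is the behaviour the edit is meant to fix.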
You need to use a negative lookahead to restrict a more generic pattern, and a re.findall to find all matches.
Use
res = re.findall(r'-(?!(?:aa|bb)-)(\w+)(?=-)', s)
or - if your values in between hyphens can be any but a hyphen, use a negated character class [^-]:
res = re.findall(r'-(?!(?:aa|bb)-)([^-]+)(?=-)', s)
Here is the regex demo.
Details:
- - a hyphen
(?!(?:aa|bb)-) - if there is aa- or bb- after the first hyphen, no match should be returned
(\w+) - Group 1 (this value will be returned by the re.findall call) capturing 1 or more word chars OR [^-]+ - 1 or more characters other than -
(?=-) - there must be a - after the word chars. The lookahead is required here to ensure overlapping matches (as this hyphen will be a starting point for the next match).
Python demo:
import re
p = re.compile(r'-(?!(?:aa|bb)-)([^-]+)(?=-)')
s = "-a-bc-aa-def-bb-ghij-"
print(p.findall(s)) # => ['a', 'bc', 'def', 'ghij']
Although a regex solution was asked for, I would argue that this problem can be solved more easily with simpler Python functions, namely string splitting and filtering:
input_list = "-a-bc-aa-def-bb-ghij-"
exclude = set(["aa", "bb"])
result = [s for s in input_list.split('-')[1:-1] if s not in exclude]
This solution has the additional advantage that result could also be turned into a generator and the result list does not need to be constructed explicitly.
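A minimal sketch of that generator variant, assuming the input always starts and ends with a hyphen as in the question. The helper name iter_tokens is hypothetical.

```python
def iter_tokens(text, exclude):
    """Lazily yield the hyphen-separated tokens of text that are not excluded."""
    # str.split still builds the full list of pieces; only the filtering is lazy.
    return (tok for tok in text.split('-')[1:-1] if tok not in exclude)


tokens = iter_tokens("-a-bc-aa-def-bb-ghij-", {"aa", "bb"})
first = next(tokens)   # tokens are produced one at a time
rest = list(tokens)    # whatever has not been consumed yet
print(first, rest)     # a ['bc', 'def', 'ghij']
```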
I'm having trouble understanding regex behaviour when using lookahead.
I have a given string in which I have two overlapping patterns (starting with M and ending with p). My expected output would be MGMTPRLGLESLLEp and MTPRLGLESLLEp. My python code below results in two empty strings which share a common start with the expected output.
Removal of the lookahead (?=) results in only ONE output string which is the larger one. Is there a way to modify my regex term to prevent empty strings so that I can get both results with one regex term?
import re
string = 'GYMGMTPRLGLESLLEpApMIRVA'
pattern = re.compile(r'(?=M(.*?)p)')
sequences = pattern.finditer(string)
for results in sequences:
    print(results.group())
    print(results.start())
    print(results.end())
The overlapping matches trick with a look-ahead makes use of the fact that the (?=...) pattern matches at an empty location, then pulls out the captured group nested inside the look-ahead.
You need to print out group 1, explicitly:
for results in sequences:
    print(results.group(1))
This produces:
GMTPRLGLESLLE
TPRLGLESLLE
You probably want to include the M and p characters in the capturing group:
pattern = re.compile(r'(?=(M.*?p))')
at which point your output becomes:
MGMTPRLGLESLLEp
MTPRLGLESLLEp
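To make the difference between the whole match and the captured group visible, this small sketch collects both for every match, using the string from the question. The variable names are illustrative only.

```python
import re

string = 'GYMGMTPRLGLESLLEpApMIRVA'
pattern = re.compile(r'(?=(M.*?p))')

# group(0) is the zero-width match of the lookahead itself;
# group(1) is the text captured inside the lookahead.
matches = [(m.start(), m.group(0), m.group(1)) for m in pattern.finditer(string)]

for start, whole, captured in matches:
    print(start, repr(whole), captured)
# 2 '' MGMTPRLGLESLLEp
# 4 '' MTPRLGLESLLEp
```

The final M in the string produces no match because no p follows it.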
I'm trying to use Python's findall to find all the hyphenated and non-hyphenated identifiers in a string (this is to plug into existing code, so using any constructs beyond findall won't work). If you imagine code like this:
regex = ...
body = "foo-bar foo-bar-stuff stuff foo-word-stuff"
ids = re.compile(regex).findall(body)
I would like the ids value to be ['foo', 'bar', 'word', 'foo-bar', 'foo-bar-stuff', and 'stuff'] (although not bar-stuff, because it's hyphenated, but does not appear as a standalone space-separated identifier). Order of the array/set is not important.
A simple regex which matches the non-hyphenated identifiers is \w+ and one which matches the hyphenated ones is [\w-]+. However, I cannot figure out one which does both simultaneously (I don't have total control over the code, so cannot concatenate the lists together - I would like to do this in one regex if possible).
I have tried \w|[\w-]+ but since the expression is greedy, this misses out bar for example, only matching -bar since foo has already been matched and it won't retry the pattern from the same starting position. I would like to find matches for (for example) both foo and foo-bar which begin (are anchored) at the same string position (which I think findall simply doesn't consider).
I've been trying some tricks such as lookaheads/lookbehinds such as mentioned, but I can't find any way to make them applicable to my scenario.
Any help would be appreciated.
You may use
import re
s = "foo-bar foo-bar-stuff stuff" #=> {'foo-bar', 'foo', 'bar', 'foo-bar-stuff', 'stuff'}
# s = "A-B-C D" # => {'C', 'D', 'A', 'A-B-C', 'B'}
l = re.findall(r'(?<!\S)\w+(?:-\w+)*(?!\S)', s)
res = []
for e in l:
    res.append(e)
    res.extend(e.split('-'))
print(set(res))
Pattern details
(?<!\S) - no non-whitespace right before
\w+ - 1+ word chars
(?:-\w+)* - zero or more repetitions of
- - a hyphen
\w+ - 1+ word chars
(?!\S) - no non-whitespace right after.
See the pattern demo online.
Note that to get all items, I split the matches with - and add these items to the resulting list. Then, with set, I remove any eventual dupes.
If you don't have to use regex
Just use split (an example is below):
result = []
for x in body.split():
    if x not in result:
        result.append(x)
    for y in x.split('-'):
        if y not in result:
            result.append(y)
This is not possible with findall alone, since it finds all non-overlapping matches, as the documentation says.
All you can do is find all longest matches with \w[-\w]* or something like that, and then generate all valid spans out of them (most probably starting from their split on '-').
Please note that \w[-\w]* will also match 123, 1-a, and a--, so something like (?=\D)\w[-\w]* or (?=\D)\w+(?:-\w+)* could be preferable (but you would still have to filter out the 1 from a-1).
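The non-overlapping behaviour can be seen in a short sketch. The one-word body is a made-up sample, and \w+ is used in the alternation (rather than the bare \w from the question) so that whole words are returned.

```python
import re

body = "foo-bar"  # hypothetical one-token sample

# Alternation: once "foo" is consumed, scanning resumes after it,
# so "foo-bar" is never tried from the same starting position.
alternation = re.findall(r'\w+|[\w-]+', body)

# Longest hyphenated token in one piece, to be split afterwards.
longest = re.findall(r'\w+(?:-\w+)*', body)
parts = longest[0].split('-')

print(alternation)  # ['foo', '-bar']
print(longest)      # ['foo-bar']
print(parts)        # ['foo', 'bar']
```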
Suppose I want to match a string like this:
123(432)123(342)2348(34)
I can match digits like 123 with [\d]* and (432) with \([\d]+\).
How can I match the whole string by repeating either of the 2 patterns?
I tried [[\d]* | \([\d]+\)]+, but this is incorrect.
I am using python re module.
I think you need this regex:
"^(\d+|\(\d+\))+$"
and to avoid catastrophic backtracking you need to change it to a regex like this:
"^(\d|\(\d+\))+$"
You can use a character class to match the whole string:
[\d()]+
But if you want to match the separate parts in separate groups, you can use re.findall with a special regex based on your need, for example:
>>> import re
>>> s="123(432)123(342)2348(34)"
>>> re.findall(r'\d+\(\d+\)',s)
['123(432)', '123(342)', '2348(34)']
>>>
Or :
>>> re.findall(r'(\d+)\((\d+)\)',s)
[('123', '432'), ('123', '342'), ('2348', '34')]
Or you can just use \d+ to get all the numbers :
>>> re.findall(r'\d+',s)
['123', '432', '123', '342', '2348', '34']
If you want to match the pattern \d+\(\d+\) repeatedly, you can use the following regex:
(?:\d+\(\d+\))+
You can achieve it with this pattern:
^(?=.)\d*(?:\(\d+\)\d*)*$
(?=.) ensures there is at least one character (if you want to allow empty strings, remove it).
\d*(?:\(\d+\)\d*)* is an unrolled sub-pattern. Explanation: With a backtracking regex engine, when you have a sub-pattern like (A|B)* where A and B are mutually exclusive (or at least when the end of A or B doesn't match respectively the beginning of B or A), you can rewrite the sub-pattern like this: A*(BA*)* or B*(AB*)*. For your example, it replaces (?:\d+|\(\d+\))*
This new form is more efficient: it reduces the steps needed to obtain a match and avoids a large part of the potential backtracking.
Note that you can improve it more, if you emulate an atomic group (?>....) with this trick (?=(....))\1 that uses the fact that a lookahead is naturally atomic:
^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$
demo (compare the number of steps needed with the previous version and check the debugger to see what happens)
Note: if you don't want two consecutive numbers enclosed in parenthesis, you only need to change the quantifier * with + inside the non-capturing group and to add (?:\(\d+\))? at the end of the pattern, before the anchor $:
^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$
or
^(?=.)(?=(\d*(?:\(\d+\)\d+)*(?:\(\d+\))?))\1$
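A rough way to compare these variants in Python is to run them over a few inputs. Apart from the question's string, the sample strings below are assumptions chosen to exercise the edge cases. This only checks which inputs are accepted and says nothing about the number of steps.

```python
import re

unrolled = re.compile(r'^(?=.)\d*(?:\(\d+\)\d*)*$')
atomic = re.compile(r'^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$')
no_adjacent = re.compile(r'^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$')

# Question's string, two adjacent groups, an unclosed group, the empty string.
samples = ["123(432)123(342)2348(34)", "(1)(2)", "12(3", ""]

# For each sample: (unrolled, atomic emulation, no adjacent groups).
results = {
    s: (bool(unrolled.match(s)), bool(atomic.match(s)), bool(no_adjacent.match(s)))
    for s in samples
}

for s, flags in results.items():
    print(repr(s), flags)
# '123(432)123(342)2348(34)' (True, True, True)
# '(1)(2)' (True, True, False)
# '12(3' (False, False, False)
# '' (False, False, False)
```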
I am looking for an efficient way to extract the shortest repeating substring.
For example:
input1 = 'dabcdbcdbcdd'
output1 = 'bcd'
input2 = 'cbabababac'
output2 = 'ba'
I would appreciate any answer or information related to the problem.
Also, in this post, people suggest that we can use the regular expression like
re=^(.*?)\1+$
to find the smallest repeating pattern in the string. But such an expression does not work in Python and always returns a non-match (I am new to Python, so perhaps I am missing something?).
--- follow up ---
Here the criterion is to look for shortest non-overlap pattern whose length is greater than one and has the longest overall length.
A quick fix for this pattern could be
(.+?)\1+
Your regex failed because it anchored the repeating string to the start and end of the line, only allowing strings like abcabcabc but not xabcabcabcx. Also, the minimum length of the repeated string should be 1, not 0 (or any string would match), therefore .+? instead of .*?.
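The effect of the anchors can be checked with a short sketch. The strings abcabcabc and xabcabcabcx are the ones mentioned above, and the variable names are illustrative.

```python
import re

anchored = re.compile(r'^(.*?)\1+$')
unanchored = re.compile(r'(.+?)\1+')

whole = anchored.match("abcabcabc")      # the string is nothing but repeats
padded = anchored.match("xabcabcabcx")   # extra characters at both ends
found = unanchored.search("xabcabcabcx")

print(whole.group(1))  # abc
print(padded)          # None
print(found.group(1))  # abc
```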
In Python:
>>> import re
>>> r = re.compile(r"(.+?)\1+")
>>> r.findall("cbabababac")
['ba']
>>> r.findall("dabcdbcdbcdd")
['bcd']
But be aware that this regex will only find non-overlapping repeating matches, so in the last example, the solution d will not be found although that is the shortest repeating string. Or see this example, where it can't find abcd because the abc part of the first abcd has been used up in the first match:
>>> r.findall("abcabcdabcd")
['abc']
Also, it may return several matches, so you'd need to find the shortest one in a second step:
>>> r.findall("abcdabcdabcabc")
['abcd', 'abc']
Better solution:
To allow the engine to also find overlapping matches, use
(.+?)(?=\1)
This will find some strings twice or more, if they are repeated enough times, but it will certainly find all possible repeating substrings:
>>> r = re.compile(r"(.+?)(?=\1)")
>>> r.findall("dabcdbcdbcdd")
['bcd', 'bcd', 'd']
Therefore, you should sort the results by length and return the shortest one:
>>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
'd'
The or [""] (thanks to J. F. Sebastian!) ensures that no ValueError is triggered if there's no match at all.
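Wrapped up as a function, the approach might look like the sketch below. The name shortest_repeat is hypothetical, and the length-greater-than-one criterion from the follow-up is not applied here.

```python
import re

# A substring that is immediately followed by a copy of itself.
_REPEAT = re.compile(r'(.+?)(?=\1)')


def shortest_repeat(text):
    """Return the shortest substring directly followed by itself, or ''."""
    return min(_REPEAT.findall(text) or [""], key=len)


print(shortest_repeat("dabcdbcdbcdd"))  # d
print(shortest_repeat("cbabababac"))    # ba
print(repr(shortest_repeat("abc")))     # ''
```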
^ matches at the start of a string. In your example the repeating substrings don't start at the beginning. Similar for $. Without ^ and $ the pattern .*? always matches empty string. Demo:
import re
def srp(s):
    return re.search(r'(.+?)\1+', s).group(1)
print(srp('dabcdbcdbcdd')) # -> bcd
print(srp('cbabababac')) # -> ba
Though it doesn't find the shortest substring.
Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -
import re
re.sub("a*", "a", "aaaa") # gives 'a'
What if I want to replace all duplicate chars (i.e. a,z) with that respective char? How do I do this?
import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'
NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes
>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'
The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.
Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then the entire found portion is replaced with a single occurrence of the found letter.
On a side note...
Your example code for just a is actually buggy:
>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'
You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".
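Both points can be checked with a short sketch. The helper name squeeze is hypothetical. The exact output of the 'a*' version is not asserted here because the handling of empty matches in re.sub changed in Python 3.7, so it depends on the interpreter version.

```python
import re


def squeeze(text):
    """Collapse each run of a repeated lowercase letter to a single letter."""
    return re.sub(r'([a-z])\1+', r'\1', text)


# 'a+' needs at least one a, so it never matches the empty string
# between two non-a characters.
only_a = re.sub('a+', 'a', 'aaabbbccc')

print(squeeze('aabb'))           # ab
print(squeeze('abbccddeeffgg'))  # abcdefg
print(only_a)                    # abbbccc
```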
In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this
import re
s = "ababacbdefefbcdefde"
while re.search(r'([a-z])(.*)\1', s):
    s = re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print(s) # prints 'abcdef'
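The same loop as a reusable function, as a sketch. The name drop_repeats is hypothetical. Each pass removes a later copy of some letter and never its first occurrence, so the first occurrences survive in their original order.

```python
import re

# A lowercase letter, anything in between, then the same letter again.
_DUP = re.compile(r'([a-z])(.*)\1')


def drop_repeats(text):
    """Keep only the first occurrence of each lowercase letter."""
    while _DUP.search(text):
        # Drop the trailing duplicate; keep the letter and the text in between.
        text = _DUP.sub(r'\1\2', text)
    return text


print(drop_repeats("ababacbdefefbcdefde"))  # abcdef
```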
A solution that covers every category of character, not just letters:
re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')
gives:
'ab['