I am looking for an efficient way to extract the shortest repeating substring.
For example:
input1 = 'dabcdbcdbcdd'
output1 = 'bcd'
input2 = 'cbabababac'
output2 = 'ba'
I would appreciate any answer or information related to the problem.
Also, in this post, people suggest that we can use the regular expression like
re=^(.*?)\1+$
to find the smallest repeating pattern in the string. But this expression does not work for me in Python and always returns a non-match (I am new to Python, so perhaps I am missing something?).
--- follow up ---
Here the criterion is to look for the shortest non-overlapping pattern that is longer than one character and has the longest overall length.
A quick fix for this pattern could be
(.+?)\1+
Your regex failed because it anchored the repeating string to the start and end of the line, only allowing strings like abcabcabc but not xabcabcabcx. Also, the minimum length of the repeated string should be 1, not 0 (or any string would match), therefore .+? instead of .*?.
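To make the effect of the anchors concrete, here is a small sketch that compares an anchored variant of the fixed pattern with the unanchored one. The helper name first_repeat and the test strings are only illustrative.

```python
import re

# Anchored variant: the whole string must consist of repetitions.
anchored = re.compile(r"^(.+?)\1+$")
# Unanchored variant: the repetition may start and end anywhere.
unanchored = re.compile(r"(.+?)\1+")

def first_repeat(pattern, text):
    """Return the first repeated unit found by pattern, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(first_repeat(anchored, "abcabcabc"))      # whole string is a repetition
print(first_repeat(anchored, "xabcabcabcx"))    # the surrounding x's prevent a match
print(first_repeat(unanchored, "xabcabcabcx"))  # repetition found inside the string
```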
In Python:
>>> import re
>>> r = re.compile(r"(.+?)\1+")
>>> r.findall("cbabababac")
['ba']
>>> r.findall("dabcdbcdbcdd")
['bcd']
But be aware that this regex will only find non-overlapping repeating matches, so in the last example, the solution d will not be found although that is the shortest repeating string. Or see this example (here it can't find abcd because the abc part of the first abcd has been used up in the first match):
>>> r.findall("abcabcdabcd")
['abc']
Also, it may return several matches, so you'd need to find the shortest one in a second step:
>>> r.findall("abcdabcdabcabc")
['abcd', 'abc']
Better solution:
To allow the engine to also find overlapping matches, use
(.+?)(?=\1)
This will find some strings twice or more, if they are repeated enough times, but it will certainly find all possible repeating substrings:
>>> r = re.compile(r"(.+?)(?=\1)")
>>> r.findall("dabcdbcdbcdd")
['bcd', 'bcd', 'd']
Therefore, you should compare the results by length and return the shortest one:
>>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
'd'
The or [""] (thanks to J. F. Sebastian!) ensures that no ValueError is triggered if there's no match at all.
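Putting the pieces together, a minimal sketch of a helper could look like the following. The function name shortest_repeat and the min_len parameter are illustrative additions, not part of the answer above; min_len is there to cover the follow-up requirement of a pattern longer than one character.

```python
import re

def shortest_repeat(s, min_len=1):
    """Return the shortest substring that is immediately followed by itself.

    min_len is a hypothetical knob: the minimum length of the repeated unit.
    Returns "" when nothing repeats.
    """
    # .{n,}? is the lazy "at least n characters" form of .+?
    pattern = re.compile(r"(.{%d,}?)(?=\1)" % min_len)
    return min(pattern.findall(s) or [""], key=len)

print(shortest_repeat("dabcdbcdbcdd"))             # single characters allowed
print(shortest_repeat("dabcdbcdbcdd", min_len=2))  # unit must be 2+ characters
```

Note that . does not match a newline by default, so this sketch assumes single-line input.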
^ matches at the start of a string. In your example the repeating substrings don't start at the beginning. The same goes for $. Without ^ and $, the pattern .*? always matches the empty string. Demo:
import re
def srp(s):
    return re.search(r'(.+?)\1+', s).group(1)
print(srp('dabcdbcdbcdd')) # -> bcd
print(srp('cbabababac')) # -> ba
Though it doesn't find the shortest substring.
I have a sentence and a regular expression. Is it possible to find out how far into the regex my sentence matches? For example, consider the sentence MMMV and the regex M+V?T*Z+. The regex matches the sentence up to M+V?, and the remaining part of the regex, T*Z+, should be my output.
My approach right now is to break the regex into individual parts, store them in a list, and then match by concatenating the first n parts until the sentence matches. For example, if my regex is M+V?T*Z+, then my list is ['M+', 'V?', 'T*', 'Z+']. I then match my string in a loop, first against M+, then against M+V?, and so on until a complete match is found, and then take the remaining list as the output. Below is the code:
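import re

def remaining_parts(sentence_language):  # wrapper name is illustrative
    re_exp = ['M+', 'V?', 'T*', 'Z+']
    for n in range(len(re_exp)):
        # Join the first n+1 parts and require a match of the whole sentence.
        re_expression = ''.join(re_exp[:n+1])
        if re.match(r'{0}$'.format(re_expression), sentence_language):
            return re_exp[n+1:]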
Is there a better approach to achieve this, maybe by using some parsing library?
Assuming that your regex is rather simple, with no groups, backreferences, lookaheads, etc., e.g. as in your case, following the pattern \w[+*?]?, you can first split it up into parts, as you already do. But then instead of iteratively joining the parts and matching them against the entire string, you can test each part individually by slicing away the already matched parts.
import re

def match(pattern, string):
    res = pat = ""
    for p in re.findall(r"\w[+*?]?", pattern):
        m = re.match(p, string)
        if m:
            g = m.group()
            # Slice away the part of the string that was just matched.
            string = string[len(g):]
            res, pat = res + g, pat + p
        else:
            break
    return pat, res
Example:
>>> for s in "MMMV", "MMVVTTZ", "MTTZZZ", "MVZZZ", "MVTZX":
...     print(*match("M+V?T*Z+", s))
...
M+V?T* MMMV
M+V?T* MMV
M+V?T*Z+ MTTZZZ
M+V?T*Z+ MVZZZ
M+V?T*Z+ MVTZ
Note, however, that in the worst case of having a string of length n and a pattern of n parts, each matching just a single character, this will still have O(n²) for repeatedly slicing the string.
Also, this may fail if two consecutive parts are about the same character, e.g. a?a+b (which should be equivalent to a+b) will not match ab but only aab as the single a is already "consumed" by the a?.
You could get the complexity down to O(n) by writing your own very simple regex matcher for that very reduced sort of regex, but in the average case that might not be worth it, or even slower.
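Short of writing a full matcher by hand, the repeated slicing can be avoided by passing a start position to the compiled pattern's match method. The sketch below is one possible variation under the same assumptions about the pattern (only parts of the form \w[+*?]?); the name match_prefix is made up for illustration, and the a?a+b limitation mentioned above still applies.

```python
import re

def match_prefix(pattern, string):
    """Return (matched part of the pattern, matched part of the string)."""
    pos = 0
    matched_parts = []
    for part in re.findall(r"\w[+*?]?", pattern):
        # Pattern.match(string, pos) starts matching at index pos,
        # so the string itself is never copied.
        m = re.compile(part).match(string, pos)
        if m is None:
            break
        pos = m.end()
        matched_parts.append(part)
    return "".join(matched_parts), string[:pos]

print(match_prefix("M+V?T*Z+", "MMMV"))
print(match_prefix("M+V?T*Z+", "MVTZX"))
```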
You can use () to enclose groups in regex. For example: M+V?(T*Z+), the output you want is stored in the first group of the regex.
I know the question says python, but here you can see the regex in action:
const regex = /M+V?(T*Z+)/;
const str = `MMMVTZ`;
let m = regex.exec(str);
console.log(m[1]);
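For completeness, a Python version of the same idea might look like this (a minimal sketch; the variable name remainder is illustrative):

```python
import re

# The parentheses capture whatever T*Z+ matched.
m = re.search(r"M+V?(T*Z+)", "MMMVTZ")
remainder = m.group(1) if m else None
print(remainder)
```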
I'm trying to use Python's findall to find all the hyphenated and non-hyphenated identifiers in a string (this is to plug into existing code, so using any constructs beyond findall won't work). If you imagine code like this:
regex = ...
body = "foo-bar foo-bar-stuff stuff foo-word-stuff"
ids = re.compile(regex).findall(body)
I would like the ids value to be ['foo', 'bar', 'word', 'foo-bar', 'foo-bar-stuff', and 'stuff'] (although not bar-stuff, because it's hyphenated but does not appear as a standalone space-separated identifier). The order of the array/set is not important.
A simple regex which matches the non-hyphenated identifiers is \w+ and one which matches the hyphenated ones is [\w-]+. However, I cannot figure out one which does both simultaneously (I don't have total control over the code, so I cannot concatenate the lists together; I would like to do this in one regex if possible).
I have tried \w|[\w-]+ but since the expression is greedy, this misses out bar for example, only matching -bar since foo has already been matched and it won't retry the pattern from the same starting position. I would like to find matches for (for example) both foo and foo-bar which begin (are anchored) at the same string position (which I think findall simply doesn't consider).
I've been trying some tricks such as lookaheads/lookbehinds, but I can't find any way to make them applicable to my scenario.
Any help would be appreciated.
You may use
import re
s = "foo-bar foo-bar-stuff stuff" #=> {'foo-bar', 'foo', 'bar', 'foo-bar-stuff', 'stuff'}
# s = "A-B-C D" # => {'C', 'D', 'A', 'A-B-C', 'B'}
l = re.findall(r'(?<!\S)\w+(?:-\w+)*(?!\S)', s)
res = []
for e in l:
    res.append(e)
    res.extend(e.split('-'))
print(set(res))
Pattern details
(?<!\S) - no non-whitespace right before
\w+ - 1+ word chars
(?:-\w+)* - zero or more repetitions of
    - - a hyphen
    \w+ - 1+ word chars
(?!\S) - no non-whitespace right after.
See the pattern demo online.
Note that to get all items, I split the matches on - and add these items to the resulting list. Then, with set, I remove any duplicates.
If you don't have to use a regex, just use split (example below):
result = []
for x in body.split():
    if x not in result:
        result.append(x)
    for y in x.split('-'):
        if y not in result:
            result.append(y)
This is not possible with findall alone, since it finds all non-overlapping matches, as the documentation says.
All you can do is find all longest matches with \w[-\w]* or something like that, and then generate all valid spans out of them (most probably starting from their split on '-').
Please note that \w[-\w]* will also match 123, 1-a, and a--, so something like (?=\D)\w[-\w]* or (?=\D)\w+(?:-\w+)* could be preferable (but you would still have to filter out the 1 from a-1).
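A rough sketch of that two-step approach is shown below. It assumes that the only wanted spans are the full token and its individual hyphen-separated parts, and that parts starting with a digit should be dropped; the names TOKEN and identifiers are illustrative.

```python
import re

# Longest matches: a token that does not start with a digit.
TOKEN = re.compile(r"(?=\D)\w+(?:-\w+)*")

def identifiers(body):
    """Collect each full token plus its hyphen-separated parts."""
    found = set()
    for token in TOKEN.findall(body):
        found.add(token)
        for part in token.split("-"):
            # Filter out purely numeric-looking parts such as the 1 in a-1.
            if not part[0].isdigit():
                found.add(part)
    return found

print(identifiers("foo-bar foo-bar-stuff stuff foo-word-stuff"))
```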
I need to find all the strings matching a pattern with the exception of two given strings.
For example, find all groups of letters with the exception of aa and bb. Starting from this string:
-a-bc-aa-def-bb-ghij-
Should return:
('a', 'bc', 'def', 'ghij')
I tried with this regular expression that captures 4 strings. I thought I was getting close, but (1) it doesn't work in Python and (2) I can't figure out how to exclude a few strings from the search. (Yes, I could remove them later, but my real regular expression does everything in one shot and I would like to include this last step in it.)
I said it doesn't work in Python because I tried this, expecting the exact same result, but instead I get only the first group:
>>> import re
>>> re.search('-(\w.*?)(?=-)', '-a-bc-def-ghij-').groups()
('a',)
I tried with negative look ahead, but I couldn't find a working solution for this case.
You can make use of negative look aheads.
For example,
>>> re.findall(r'-(?!aa|bb)([^-]+)', string)
['a', 'bc', 'def', 'ghij']
- Matches -
(?!aa|bb) Negative lookahead, checks if - is not followed by aa or bb
([^-]+) Matches one or more characters other than -
Edit
The above regex will also reject values that merely start with aa or bb, for example -aabc-. To take care of that, we can add - to the lookaheads:
>>> re.findall(r'-(?!aa-|bb-)([^-]+)', string)
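The difference between the two lookaheads can be seen on an input that contains a value merely starting with aa. The sample string here is made up for illustration and is not from the question.

```python
import re

sample = "-a-aabc-aa-bb-"  # illustrative input

# Rejects anything that starts with aa or bb, including aabc.
loose = re.findall(r"-(?!aa|bb)([^-]+)", sample)
# Rejects only the exact values aa and bb.
strict = re.findall(r"-(?!aa-|bb-)([^-]+)", sample)

print(loose)
print(strict)
```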
You need to use a negative lookahead to restrict a more generic pattern, and a re.findall to find all matches.
Use
res = re.findall(r'-(?!(?:aa|bb)-)(\w+)(?=-)', s)
or - if your values in between hyphens can be any but a hyphen, use a negated character class [^-]:
res = re.findall(r'-(?!(?:aa|bb)-)([^-]+)(?=-)', s)
Here is the regex demo.
Details:
- - a hyphen
(?!(?:aa|bb)-) - if there is aa- or bb- after the first hyphen, no match should be returned
(\w+) - Group 1 (this value will be returned by the re.findall call) capturing 1 or more word chars OR [^-]+ - 1 or more characters other than -
(?=-) - there must be a - after the word chars. The lookahead is required here to ensure overlapping matches (as this hyphen will be a starting point for the next match).
Python demo:
import re
p = re.compile(r'-(?!(?:aa|bb)-)([^-]+)(?=-)')
s = "-a-bc-aa-def-bb-ghij-"
print(p.findall(s)) # => ['a', 'bc', 'def', 'ghij']
Although a regex solution was asked for, I would argue that this problem can be solved more easily with simpler Python functions, namely string splitting and filtering:
input_list = "-a-bc-aa-def-bb-ghij-"
exclude = set(["aa", "bb"])
result = [s for s in input_list.split('-')[1:-1] if s not in exclude]
This solution has the additional advantage that result could also be turned into a generator and the result list does not need to be constructed explicitly.
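As a sketch of the generator variant, the list comprehension becomes a generator expression, so items are produced lazily. The function name iter_allowed is hypothetical.

```python
def iter_allowed(text, exclude=frozenset({"aa", "bb"})):
    """Lazily yield the hyphen-separated values that are not excluded."""
    # [1:-1] drops the empty strings before the first and after the last hyphen.
    return (s for s in text.split("-")[1:-1] if s not in exclude)

for value in iter_allowed("-a-bc-aa-def-bb-ghij-"):
    print(value)
```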
Hopefully this post goes better..
So I am stuck on this feature of this program that will return the whole word where a certain keyword is specified.
i.e., if I tell it to look for the word "I=" in the string "blah blah blah blah I=1mV blah blah etc?", it should return the whole word where it is found, so in this case it would return I=1mV.
I have tried a bunch of different approaches, such as,
text = "One of the values, I=1mV is used"
print(re.split('I=', text))
However, this returns the same String without I in it, so it would return
['One of the values, ', '1mV is used']
If I try regex solutions, I run into the problem that the number could have more than 1 digit, so the code below only works if the number is 1 digit. If the value were I=10mV, it would only return the first digit, but if I put [/0-9] in twice, the code no longer works with a 1-digit value.
text = "One of the values, I=1mV is used"
print(re.findall("I=[/0-9]", text))
['I=1']
When I tried using re.search,
text = "One of the values, I=1mV is used"
print(re.search("I=", text))
<_sre.SRE_Match object at 0x02408BF0>
What is a good way to retrieve the word (In this case, I want to retrieve I=1mV) and cut out the rest of the string?
A better way would be to split the text into words first:
>>> text = "One of the values, I=1mV is used"
>>> words = text.split()
>>> words
['One', 'of', 'the', 'values,', 'I=1mV', 'is', 'used']
And then filter the words to find the one you need:
>>> [w for w in words if 'I=' in w]
['I=1mV']
This returns a list of all words with I= in them. We can then just take the first element found:
>>> [w for w in words if 'I=' in w][0]
'I=1mV'
Done! What we can do to clean this up a bit is to just look for the first match, rather than checking every word. We can use a generator expression for that:
>>> next(w for w in words if 'I=' in w)
'I=1mV'
Of course you could adapt the if condition to fit your needs better; you could for example use str.startswith() to check if the word starts with a certain string or re.match() to check if the word matches a pattern.
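For instance, a small sketch using str.startswith() with a default for the no-match case could look like this (the function name is illustrative). Passing a default to next() avoids a StopIteration when no word qualifies.

```python
def first_word_with_prefix(text, prefix):
    """Return the first whitespace-separated word starting with prefix, or None."""
    return next((w for w in text.split() if w.startswith(prefix)), None)

print(first_word_with_prefix("One of the values, I=1mV is used", "I="))
print(first_word_with_prefix("nothing to see here", "I="))
```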
Using string methods
For the record, your attempt to split the string in two halves, using I= as the separator, was nearly correct. Instead of using str.split(), which discards the separator, you could have used str.partition(), which keeps it.
>>> my_text = "Loadflow current was I=30.63kA"
>>> my_text.partition("I=")
('Loadflow current was ', 'I=', '30.63kA')
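Building on that, a minimal sketch that returns just the word could look like the following. The function name word_from_key is hypothetical, and the sketch assumes the value ends at the next whitespace; whitespace directly after the key is skipped.

```python
def word_from_key(text, key="I="):
    """Return key plus the value that follows it, or None if key is absent."""
    before, sep, after = text.partition(key)
    if not sep:  # partition returns an empty separator when key is not found
        return None
    # Keep only the characters up to the next whitespace.
    value = after.split(maxsplit=1)[0] if after.strip() else ""
    return sep + value

print(word_from_key("Loadflow current was I=30.63kA"))
print(word_from_key("One of the values, I=1mV is used"))
```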
Using regular expressions
A more flexible and robust solution is to use a regular expression:
>>> import re
>>> pattern = r"""
... I= # specific string "I="
... \s* # Possible whitespace
... -? # possible minus sign
... \s* # possible whitespace
... \d+ # at least one digit
... (\.\d+)? # possible decimal part
... """
>>> m = re.search(pattern, my_text, re.VERBOSE)
>>> m
<_sre.SRE_Match object at 0x044CCFA0>
>>> m.group()
'I=30.63'
This accounts for a lot more possibilities (negative numbers, integer or decimal numbers).
Note the use of:
Quantifiers to say how many of each thing you want.
    a* - zero or more a's
    a+ - at least one a
    a? - "optional" - one or zero a's
Verbose regular expression (re.VERBOSE flag) with comments - much easier to understand the pattern above than the non-verbose equivalent, I=\s*-?\s*\d+(\.\d+)?.
Raw strings for regexp patterns, r"..." instead of plain strings "..." - means that literal backslashes don't have to be escaped. The pattern above contains backslashes (\s, \d, \.), which is why it is written as a raw string, and one day you'll need to match C:\Program Files\... and on that day you will definitely need raw strings.
Exercises
Exercise 1: How do you extend this so that it can match the unit as well? And how do you extend this so that it can match the unit as either mA, A, or kA? Hint: "Alternation operator".
Exercise 2: How do you extend this so that it can match numbers in engineering notation, i.e. "1.00e3", or "-3.141e-4"?
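One possible approach to both exercises, offered as a sketch rather than the definitive answer: add an optional exponent to the number and an alternation for the unit. The names CURRENT and parse_current and the named groups are illustrative choices.

```python
import re

CURRENT = re.compile(r"""
    I=                          # specific string "I="
    \s*                         # possible whitespace
    (?P<value>
        -?\s*\d+(?:\.\d+)?      # sign, integer part, optional decimals
        (?:[eE][+-]?\d+)?       # optional exponent, e.g. e3 or e-4
    )
    \s*                         # possible whitespace
    (?P<unit>mA|kA|A)           # alternation: one of three units
""", re.VERBOSE)

def parse_current(text):
    """Return (value, unit) as strings, or None if nothing matches."""
    m = CURRENT.search(text)
    return (m.group("value"), m.group("unit")) if m else None

print(parse_current("Loadflow current was I=30.63kA"))
print(parse_current("I=-3.141e-4 A"))
```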
import re
text = "One of the values, I=1mV is used"
l = re.split('I=', text)
print(l[1].split()[0])
If you have more than one I=, do the above for every element of l after the first, since index 0 holds the text before the first I=.
This is a good way since it also works if someone writes "One of the values, I= 1mV is used",
and I guess you want to get that I is 1mV.
BTW, I is current and its units are amperes, not volts :)
With your re.findall attempt you would want to add a + which means one or more.
Here are some examples:
import re
test = "This is a test with I=1mV, I=1.414mv, I=10mv and I=1.618mv."
result = re.findall(r'I=[\d\.]+m[vV]', test)
print(result)
test = "One of the values, I=1mV is used"
result = re.search(r'I=([\d\.]+m[vV])', test)
print(result.group(1))
The first print is: ['I=1mV', 'I=1.414mv', 'I=10mv', 'I=1.618mv']
I've grouped everything other than I= in the re.search example,
so the second print is: 1mV
in case you are interested in extracting that.
Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -
import re
re.sub("a*", "a", "aaaa") # gives 'a'
What if I want to replace all duplicate chars (i.e. a-z) with that respective char? How do I do this?
import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'
NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes
>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'
The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.
Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then the entire found portion is replaced with a single occurrence of the found letter.
On a side note...
Your example code for just a is actually buggy:
>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'
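You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more". (On Python 3.7 and later the output above differs slightly, because empty matches adjacent to a previous match are replaced as well, but the problem is the same.)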
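A quick check of the 'a+' variant on the same input, as a minimal sketch:

```python
import re

# + requires at least one a, so no empty matches are replaced.
collapsed = re.sub(r"a+", "a", "aaabbbccc")
print(collapsed)
```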
In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this
s = "ababacbdefefbcdefde"
while re.search(r'([a-z])(.*)\1', s):
    s = re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print(s) # prints 'abcdef'
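For comparison, the non-regex route mentioned in the question can be sketched with a dict, which is a different technique from the loop above. It relies on dicts preserving insertion order, which is guaranteed from Python 3.7 on; the function name is illustrative.

```python
def drop_repeated_chars(s):
    """Keep only the first occurrence of each character, in order."""
    # dict.fromkeys keeps one key per distinct character, in first-seen order.
    return "".join(dict.fromkeys(s))

print(drop_repeated_chars("ababacbdefefbcdefde"))
```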
A solution covering all character categories:
re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')
gives:
'ab['