Print substrings between substrings - python

Very general case, but I failed over and over trying to solve it, and the proposed solutions I found had similar problems. (I think this case should be useful for anyone trying to extract specific sets of info from large pieces of code or structured files like logs.)
sample string:
"123string 1abcabcstring 2123string 3abc123string...nabc"
substring A: "123"
substring B: "abc"
Let's say that we want to find all substrings that lie between substring A and substring B, but not the ones between B and A, nor the ones between A and B that themselves contain B ("string 1abc" should not be printed).
The result printed on the console should look like this:
string 1
string 3
string...n

This is perfectly suited for regular expressions, in particular re.findall to get multiple matches:
>>> s="123string 1abcabcstring 2123string 3abc123string...nabc"
>>> import re
>>> re.findall('123(.*?)abc', s)
['string 1', 'string 3', 'string...n']
This will get a sequence of characters between 123 and abc. Using .*? instead of .* is important so that it matches the shortest possible string, i.e. up to the first occurrence of "abc". Otherwise it would match up to the last "abc" in the string.
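To see the difference concretely, here is a small sketch that runs both the non-greedy and the greedy pattern on the sample string from the question. The variable names lazy and greedy are only illustrative.

```python
import re

s = "123string 1abcabcstring 2123string 3abc123string...nabc"

# Non-greedy: each match stops at the first "abc" after a "123".
lazy = re.findall(r"123(.*?)abc", s)

# Greedy: a single match runs from the first "123" to the last "abc".
greedy = re.findall(r"123(.*)abc", s)

print(lazy)
print(greedy)
```

The greedy version returns one long string that swallows the inner delimiters, which is exactly the behaviour the question wants to avoid.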

The re module is your friend for such problems:
>>> import re
>>> s = "123string 1abcabcstring 2123string 3abc123string...nabc"
>>> s1 = "123"
>>> s2 = "abc"
>>> m = re.findall(s1+ "(.*?)"+ s2, s)
>>> m
['string 1', 'string 3', 'string...n']
That way you can even keep the delimiting strings in variables.
Of course, if the delimiting strings contain special regex characters, they must be escaped. For example, for ab( I would write s1 = r"ab\(", or let re.escape("ab(") do the escaping.
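As a minimal sketch of the escaping point, the helper below wraps both delimiters in re.escape() before building the pattern. The function name between is hypothetical, not something from the answers above.

```python
import re

def between(text, start, end):
    """Return every shortest substring found between `start` and `end`."""
    # re.escape() neutralises characters such as "(" or "." in the delimiters.
    pattern = re.escape(start) + "(.*?)" + re.escape(end)
    return re.findall(pattern, text)

print(between("123string 1abcabcstring 2123string 3abc", "123", "abc"))
print(between("ab(first)cd ab(second)cd", "ab(", ")cd"))
```

Note that . does not match newlines by default, so if the text between the delimiters may span several lines, re.findall would also need flags=re.DOTALL.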

Related

Python regex - Find and count char sequence

I am trying to find (and count) a sequence of joined or separated chars, like this : "abc" ("b" must follow "a" and "c" must follow "b". Case insensitive)
"A big duck!" -> the pattern should be matched once.
"A big duckabc!" -> The pattern should be matched twice.
The more I read about regex, the less I know. Is this a matter of using lookahead?
You can use the regex a.*?b.*?c to find a, followed by b, followed by c, with some optional characters in between. The *? makes those in-between strings non-greedy (otherwise you would get only one match for the second example).
>>> p = "a.*?b.*?c"
>>> re.findall(p, "A big duck!", flags=re.I) # re.I == ignore case
['A big duc']
>>> re.findall(p, "A big duckabc!", flags=re.I)
['A big duc', 'abc']
You can also construct that regex from the characters you want to join:
>>> chars = "abc"
>>> p = ".*?".join(chars)
To get the number of matches, just get the len of the result list.
Note: This does not handle overlapping matches, i.e. re.findall(p, "aaabbbccc", flags=re.I) will return only one match. Please clarify whether this is an issue.
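If overlapping matches do matter, one common approach is to wrap the pattern in a lookahead so that a match is attempted at every position. The sketch below assumes that is the desired behaviour; count_sequence and its overlapping parameter are hypothetical names.

```python
import re

def count_sequence(chars, text, overlapping=False):
    """Count occurrences of `chars` in order, with anything in between."""
    # Build e.g. "a.*?b.*?c"; escape each character in case it is special.
    pattern = ".*?".join(re.escape(c) for c in chars)
    if overlapping:
        # A lookahead consumes nothing, so a match is tried at every position.
        pattern = "(?=(" + pattern + "))"
    return len(re.findall(pattern, text, flags=re.I))

print(count_sequence("abc", "A big duck!"))
print(count_sequence("abc", "A big duckabc!"))
print(count_sequence("abc", "aaabbbccc", overlapping=True))
```

With overlapping=True, each a that is eventually followed by a b and a c counts once, so "aaabbbccc" yields three matches instead of one.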

Finding longest consecutive sequence of a character

Assume we have a string like 'w q a a a a a e d a a', I would like to find the longest sequence of 'a' with length of at least 2, which is 'a a a a a' in the above example. I tried the following:
re.findall(r'(a a a*)', text)
but it only gives the shortest possible match. Then I tried:
re.findall(r'([^a] a a a* [^a])', text)
but the result for the above example string is empty. How can I do that?
That's because you have spaces between your a characters. You can use a character class that matches any combination of a and space with length 5 or more:
>>> re.findall(r'([a ]{5,})', text)
[' a a a a a ']
Also note that you don't need a capture group around the whole regex. Instead, you can use a non-capturing group containing a and a space, repeated at least twice (so that shorter runs are rejected), and since you want just one match you can use re.search(), which returns the first such run:
>>> M = re.search(r'(?:a ){2,}', text)
>>>
>>> M.group(0)
'a a a a a '
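Since re.search() only returns the first run, a string whose longest run comes later needs one more step. A possible sketch, assuming the characters are always separated by single spaces, collects every run and keeps the longest. The name longest_run and its parameters are hypothetical.

```python
import re

def longest_run(text, char="a", min_count=2):
    """Longest run of `char` separated by single spaces, or "" if none."""
    c = re.escape(char)
    # One `char`, then at least (min_count - 1) repetitions of " char".
    pattern = rf"\b{c}(?: {c}){{{min_count - 1},}}\b"
    return max(re.findall(pattern, text), key=len, default="")

print(longest_run("w q a a a a a e d a a"))
print(longest_run("a a x a a a"))
```

The \b anchors keep the pattern from matching an a inside a longer word; this assumes char is a letter or digit.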

regex issue with numbers (python)

I have a string that looks like this:
<123, 321>
The numbers can range from 0 to 999.
I need to get those coordinates into two variables as fast as possible, so I thought about regex. I've already split the string into two parts, and now I need to isolate the integer from all the other characters.
I've tried this pattern:
^-?[0-9]+$
but the output is:
[]
any help? :)
If your strings always follow the same format <123, 321>, then this should be a little bit faster than the regex approach:
def str_translate(s):
    # Python 3 form; in Python 2 this was s.translate(None, " <>")
    return s.translate(str.maketrans("", "", " <>")).split(',')
In [52]: str_translate("<123, 321>")
Out[52]: ['123', '321']
All you need to do is get rid of the anchors (^ and $):
>>> import re
>>> string = "<123, 321>"
>>> re.findall(r"-?[0-9]+", string)
['123', '321']
>>>
Note: the anchors ^ and $ at the start and end of the pattern -?[0-9]+ require that the whole string be a single (optionally negative) integer.
That is, the regex engine attempts to match -?[0-9]+ from the start of the string (^) to the end of the string ($). It fails because < cannot be matched by -?[0-9]+.
Without the anchors, re.findall finds all substrings that match the pattern -?[0-9]+, that is, the numbers.
"^-?[0-9]+$" will only match a string that contains a number and nothing else.
You want to match a group and then extract that group:
>>> pattern = re.compile("(-?[0-9]+)")
>>> pattern.findall("<123, 321>")
['123', '321']
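To get from the list of strings to the two variables the question asks for, a short sketch could look like this. The names COORD and parse_point are hypothetical, and the input is assumed to contain exactly two numbers.

```python
import re

COORD = re.compile(r"-?[0-9]+")

def parse_point(s):
    """Turn a string like "<123, 321>" into a pair of ints."""
    # Unpacking raises ValueError unless exactly two numbers are found.
    x, y = (int(n) for n in COORD.findall(s))
    return x, y

x, y = parse_point("<123, 321>")
print(x, y)
```

Compiling the pattern once at module level avoids repeating the pattern lookup on every call, which may matter if many strings are parsed.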

Searching a string and returning only things I specify

Hopefully this post goes better..
So I am stuck on this feature of this program that will return the whole word where a certain keyword is specified.
ie - If I tell it to look for the word "I=" in the string "blah blah blah blah I=1mV blah blah etc?", that it returns the whole word where it is found, so in this case, it would return I=1mV.
I have tried a bunch of different approaches, such as,
text = "One of the values, I=1mV is used"
print(re.split('I=', text))
However, this returns the same String without I in it, so it would return
['One of the values, ', '1mV is used']
If I try regex solutions, I run into the problem that the number could have more than 1 digit, so the code below only works if the number is 1 digit. If the value were I=10mV, it would only return I=1, but if I put [/0-9] in twice, the code no longer works with a 1-digit value.
text = "One of the values, I=1mV is used"
print(re.findall("I=[/0-9]", text))
['I=1']
When I tried using re.search,
text = "One of the values, I=1mV is used"
print(re.search("I=", text))
<_sre.SRE_Match object at 0x02408BF0>
What is a good way to retrieve the word (In this case, I want to retrieve I=1mV) and cut out the rest of the string?
A better way would be to split the text into words first:
>>> text = "One of the values, I=1mV is used"
>>> words = text.split()
>>> words
['One', 'of', 'the', 'values,', 'I=1mV', 'is', 'used']
And then filter the words to find the one you need:
>>> [w for w in words if 'I=' in w]
['I=1mV']
This returns a list of all words with I= in them. We can then just take the first element found:
>>> [w for w in words if 'I=' in w][0]
'I=1mV'
Done! To clean this up a bit, we can look for just the first match rather than checking every word. We can use a generator expression for that:
>>> next(w for w in words if 'I=' in w)
'I=1mV'
Of course you could adapt the if condition to fit your needs better. You could, for example, use str.startswith() to check whether the word starts with a certain string, or re.match() to check whether the word matches a pattern.
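As a sketch of the str.startswith() variant, the helper below also passes a default to next() so that a text without the keyword returns None instead of raising StopIteration. The function name first_word_with is hypothetical.

```python
def first_word_with(text, prefix):
    """Return the first whitespace-separated word starting with `prefix`."""
    # The second argument to next() is returned when nothing matches,
    # which avoids a StopIteration on texts without the keyword.
    return next((w for w in text.split() if w.startswith(prefix)), None)

print(first_word_with("One of the values, I=1mV is used", "I="))
print(first_word_with("nothing to see here", "I="))
```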
Using string methods
For the record, your attempt to split the string in two halves, using I= as the separator, was nearly correct. Instead of using str.split(), which discards the separator, you could have used str.partition(), which keeps it.
>>> my_text = "Loadflow current was I=30.63kA"
>>> my_text.partition("I=")
('Loadflow current was ', 'I=', '30.63kA')
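To go from the three-part tuple to the single word, a minimal sketch could glue the separator back onto the first token that follows it. The name word_after is hypothetical.

```python
def word_after(text, key):
    """Return `key` plus the first token that follows it, or None."""
    before, sep, after = text.partition(key)
    if not sep:
        return None  # partition() returns an empty separator if key is absent
    # First whitespace-delimited token after the key (may be missing).
    rest = after.split(None, 1)
    return sep + (rest[0] if rest else "")

print(word_after("Loadflow current was I=30.63kA", "I="))
print(word_after("One of the values, I=1mV is used", "I="))
```

Because split() skips leading whitespace, an input written as "I= 1mV" would come back as "I=1mV".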
Using regular expressions
A more flexible and robust solution is to use a regular expression:
>>> import re
>>> pattern = r"""
... I= # specific string "I="
... \s* # Possible whitespace
... -? # possible minus sign
... \s* # possible whitespace
... \d+ # at least one digit
... (\.\d+)? # possible decimal part
... """
>>> m = re.search(pattern, my_text, re.VERBOSE)
>>> m
<_sre.SRE_Match object at 0x044CCFA0>
>>> m.group()
'I=30.63'
This accounts for a lot more possibilities (negative numbers, integer or decimal numbers).
Note the use of:
Quantifiers to say how many of each thing you want.
a* - zero or more as
a+ - at least one a
a? - "optional" - one or zero as
Verbose regular expression (re.VERBOSE flag) with comments - much easier to understand the pattern above than the non-verbose equivalent, I=\s*-?\s*\d+(\.\d+)?.
Raw strings for regexp patterns, r"..." instead of plain strings "..." - means that literal backslashes don't have to be escaped. The pattern above relies on this for \s, \d and \., and one day you'll need to match C:\Program Files\... and on that day you will definitely need raw strings.
Exercises
Exercise 1: How do you extend this so that it can match the unit as well? And how do you extend this so that it can match the unit as either mA, A, or kA? Hint: "Alternation operator".
Exercise 2: How do you extend this so that it can match numbers in engineering notation, i.e. "1.00e3", or "-3.141e-4"?
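One possible answer to both exercises, offered as a sketch rather than the intended solution: the alternation mA|kA|A covers the units, and an optional exponent group covers engineering notation. The names CURRENT, parse_current and the group names value and unit are hypothetical.

```python
import re

CURRENT = re.compile(r"""
    I= \s*                 # literal "I=", optional whitespace
    (?P<value>
        -? \d+ (?:\.\d+)?  # integer or decimal, optional minus sign
        (?:[eE][+-]?\d+)?  # optional exponent, e.g. e3 or e-4
    )
    \s*
    (?P<unit>mA|kA|A)      # alternation: one of the three units
""", re.VERBOSE)

def parse_current(text):
    """Return (value, unit) as strings, or None if nothing matches."""
    m = CURRENT.search(text)
    return (m.group("value"), m.group("unit")) if m else None

print(parse_current("Loadflow current was I=30.63kA"))
print(parse_current("I=-3.141e-4A"))
```

A value with an unlisted unit, such as I=1mV, produces no match at all, which may or may not be what you want.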
import re
text = "One of the values, I=1mV is used"
l = re.split('I=', text)
print(l[1].split()[0])  # split() with no argument also copes with "I= 1mV"
If you have more than one I=, do the above for every index in l from 1 onward, since l[0] is the text before the first I=.
This is a good approach since someone could write "One of the values, I= 1mV is used"
and I guess you still want to get that I is 1mV.
BTW, I is current and its unit is amperes, not volts :)
With your re.findall attempt you would want to add a + which means one or more.
Here are some examples:
import re
test = "This is a test with I=1mV, I=1.414mv, I=10mv and I=1.618mv."
result = re.findall(r'I=[\d\.]+m[vV]', test)
print(result)
test = "One of the values, I=1mV is used"
result = re.search(r'I=([\d\.]+m[vV])', test)
print(result.group(1))
The first print is: ['I=1mV', 'I=1.414mv', 'I=10mv', 'I=1.618mv']
I've grouped everything other than I= in the re.search example,
so the second print is: 1mV
in case you are interested in extracting just that.

Shortest Repeating Sub-String

I am looking for an efficient way to extract the shortest repeating substring.
For example:
input1 = 'dabcdbcdbcdd'
output1 = 'bcd'
input2 = 'cbabababac'
output2 = 'ba'
I would appreciate any answer or information related to the problem.
Also, in this post, people suggest that we can use the regular expression like
re=^(.*?)\1+$
to find the smallest repeating pattern in the string. But this expression does not work in Python and always returns a non-match (I am new to Python, so perhaps I am missing something?).
--- follow up ---
Here the criterion is to look for the shortest non-overlapping pattern whose length is greater than one and which has the longest overall length.
A quick fix for this pattern could be
(.+?)\1+
Your regex failed because it anchored the repeating string to the start and end of the line, only allowing strings like abcabcabc but not xabcabcabcx. Also, the minimum length of the repeated string should be 1, not 0 (or any string would match), therefore .+? instead of .*?.
In Python:
>>> import re
>>> r = re.compile(r"(.+?)\1+")
>>> r.findall("cbabababac")
['ba']
>>> r.findall("dabcdbcdbcdd")
['bcd']
But be aware that this regex will only find non-overlapping repeating matches, so in the last example, the solution d will not be found although that is the shortest repeating string. Or see this example, where it can't find abcd because the abc part of the first abcd has been used up in the first match:
>>> r.findall("abcabcdabcd")
['abc']
Also, it may return several matches, so you'd need to find the shortest one in a second step:
>>> r.findall("abcdabcdabcabc")
['abcd', 'abc']
Better solution:
To allow the engine to also find overlapping matches, use
(.+?)(?=\1)
This will find some strings twice or more, if they are repeated enough times, but it will certainly find all possible repeating substrings:
>>> r = re.compile(r"(.+?)(?=\1)")
>>> r.findall("dabcdbcdbcdd")
['bcd', 'bcd', 'd']
Therefore, you should sort the results by length and return the shortest one:
>>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
'd'
The or [""] (thanks to J. F. Sebastian!) ensures that no ValueError is triggered if there's no match at all.
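The lookahead approach can be wrapped in a small helper. The sketch below also adds a minimum length, as one way to approximate the "length greater than one" part of the follow-up; it does not attempt the "longest overall length" criterion. The name shortest_repeat and the min_len parameter are hypothetical.

```python
import re

def shortest_repeat(s, min_len=1):
    """Shortest substring of at least `min_len` chars that is immediately repeated."""
    # (.{n,}?) captures a candidate; the lookahead (?=\1) checks that the
    # same text follows without consuming it, so later matches may overlap it.
    pattern = re.compile(r"(.{%d,}?)(?=\1)" % min_len)
    return min(pattern.findall(s) or [""], key=len)

print(shortest_repeat("dabcdbcdbcdd"))
print(shortest_repeat("dabcdbcdbcdd", min_len=2))
print(shortest_repeat("cbabababac", min_len=2))
```

With min_len=2 the single-character repeat d is skipped, so the first example gives bcd, matching the expected output in the question.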
^ matches at the start of a string. In your example the repeating substrings don't start at the beginning. The same goes for $. Without ^ and $ the pattern .*? always matches the empty string. Demo:
import re
def srp(s):
    return re.search(r'(.+?)\1+', s).group(1)
print(srp('dabcdbcdbcdd')) # -> bcd
print(srp('cbabababac')) # -> ba
Though it doesn't find the shortest substring.
