Find substrings matching a pattern allowing overlaps [duplicate] - python

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 3 years ago.
So I have strings that form concatenated 1's and 0's with length 12. Here are some examples:
100010011100
001111110000
001010100011
I want to isolate the sections of each string that start with a 1, continue with one or more zeros, and end with a 1.
So for the first string, I would want ['10001','1001']
For the second string, I would want nothing returned
For the third string, I would want ['101','101','10001']
I've tried using a combination of positive lookahead and positive lookbehind, but it isn't working. This is what I've come up with so far [(?<=1)0][0(?=1)]

For a non-regex approach, you can split the string on 1. The matches you want are any elements in the resulting list with a 0 in it, excluding the first and last elements of the array.
Code:
myStrings = [
    "100010011100",
    "001111110000",
    "001010100011"
]
for s in myStrings:
    matches = ["1"+z+"1" for i, z in enumerate(s.split("1")[:-1]) if (i>0) and ("0" in z)]
    print(matches)
Output:
#['10001', '1001']
#[]
#['101', '101', '10001']

I suggest writing a simple regex: r'10+1'. Then use python logic to find each match using re.search(). After each match, start the next search at the position after the beginning of the match.

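A single search can't return overlapping matches, so search repeatedly and restart each search at the closing 1 of the previous match.
import re

def parse(s):
    pattern = re.compile(r'(10+1)')
    match = pattern.search(s)
    while match:
        yield match[0]
        # Resume at the last character of this match so that its closing 1
        # can open the next match.
        match = pattern.search(s, match.end()-1)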
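As an alternative to restarting the search by hand, the pattern can be wrapped in a lookahead so that each match consumes no characters, which lets re.findall report overlapping occurrences in one pass. This is only a minimal sketch of that technique; the helper name find_overlapping is made up for illustration and does not come from any of the answers.

```python
import re

# The capture group sits inside a zero-width lookahead, so the engine only
# advances one character per match and overlapping hits are kept.
OVERLAPPING = re.compile(r'(?=(10+1))')

def find_overlapping(s):
    """Return every '1', one or more '0', '1' run in s, overlaps included."""
    return OVERLAPPING.findall(s)

for s in ["100010011100", "001111110000", "001010100011"]:
    print(find_overlapping(s))
```

Because the pattern contains exactly one capture group, findall returns the captured text rather than the (empty) lookahead match itself.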

Related

In a string, how to find the index of the first character of the nth occurrence of a substring Python [duplicate]

This question already has answers here:
Find the nth occurrence of substring in a string
(27 answers)
Closed 1 year ago.
Suppose I have a pretty long string longString and a much shorter substring substring. I want to find the index of the first character for the nth occurrence of substring in longString. In other words, suppose substring = "stackoverflow", and I want to find the nth occurrence of "stackoverflow" in longString, and find the index of the first character of substring (which is the letter s).
Example:
longString = "stackoverflow_is_stackoverflow_not_stackoverflow_even_though_stackoverflow"
substring = "stackoverflow"
n = 2
Thus, in the above example, the index of the s in the 2nd occurrence of "stackoverflow" is 17.
I would like to find an efficient and fast way of doing so.
Here's a pretty short way:
def index_of_nth_occurrence(longstring, substring, n):
    return len(substring.join(longstring.split(substring)[:n]))

longstring = "stackoverflow_is_stackoverflow_not_stackoverflow_even_though_stackoverflow"
substring = "stackoverflow"
n = 2
print(index_of_nth_occurrence(longstring, substring, n))
# 17
The trick here is using str.split() to find non-overlapping occurrences of the substring, then join back the first n of them, and check how many characters that totals up to. The very next character after would be the first character of the nth occurrence of the substring.
This may be less efficient than an iterative/manual approach, and will ignore overlapping matches, but it's quick and easy.
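For comparison, the iterative approach mentioned above could look roughly like the following sketch, built on repeated str.find calls. The function name find_nth and its overlapping flag are assumptions made for illustration; unlike the split-based version, it returns -1 when there are fewer than n occurrences.

```python
def find_nth(haystack, needle, n, overlapping=False):
    """Return the start index of the n-th (1-based) occurrence, or -1."""
    if n < 1 or not needle:
        return -1
    # Step past the whole needle for non-overlapping matches,
    # or by a single character to allow overlaps.
    step = 1 if overlapping else len(needle)
    index = haystack.find(needle)
    while index != -1 and n > 1:
        index = haystack.find(needle, index + step)
        n -= 1
    return index

long_string = "stackoverflow_is_stackoverflow_not_stackoverflow_even_though_stackoverflow"
print(find_nth(long_string, "stackoverflow", 2))
```

The loop stops as soon as str.find reports -1, so no work is done past the last occurrence.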

Select element with special character with regex and Python

From a list of strings ('16','160','1,2','100,11','1','16:','16:00'), I want to keep only the elements that
either have a comma between two digits (e.g. 1,2 or 100,11)
or have two digits (without comma) that are NOT followed by ":" (i.e. followed by nothing: e.g 16, or followed by anything but ":": e.g. 160)
I tried the following code using regex in Python:
import re
string = ['16','160','1,2','100,11','1','16:','16:00']
pattern_rate = re.compile(r'(?:[\d],[\d]|[\d][\d][^:]*)')
rate = list(filter(pattern_rate.search,string))
print(rate)
Print:
['16', '160', '1,2', '100,11', '16:', '16:00']
To be correct, the script should keep the first three items and reject the rest, but my script fails at rejecting the last two items. I guess I'm using the "[^:]" sign incorrectly.
"To be correct, the script should keep the first three items and reject the rest"
You can match either 2 or more digits, or match 2 digits with a comma in between.
As the list contains only numbers, you could use re.match to start the match at the beginning of the string instead of re.search.
(?:\d{2,}|\d,\d)\Z
Explanation
(?: Non capture group
\d{2,} Match 2 or more digits
| Or
\d,\d Match 2 digits with a comma in between
) Close non capture group
\Z End of string
import re
string = ['16','160','1,2','100,11','1','16:','16:00']
pattern_rate = re.compile(r'(?:\d{2,}|\d,\d)\Z')
rate = list(filter(pattern_rate.match,string))
print(rate)
Output
['16', '160', '1,2']
I recommend looking a bit deeper into a regex guide.
100 is a number, not a single digit, so \d alone will not match it. Also, a character class [..] with a single element inside is unnecessary unless you intend to negate or otherwise transform it.
The first query can be represented by (?:\d+,\d+). It's a non-capturing group that detects comma-separated numbers, each at least one digit long.
Your second query matches two consecutive digits followed by any number (*) of non-colon characters. Because * also allows zero characters, the two digits in '16:' and '16:00' match on their own, which is why those items are not rejected.
You'll want to use something similar to (?:\d{2,}(?!:)). It's a non-capturing group matching two or more digits that are not followed by a colon. ?! designates a negative lookahead.
In your Python code, you'll want to use pattern_rate.match instead of pattern_rate.search, as search accepts a match anywhere in the string while match only accepts one that starts at the beginning of the string.
pattern_rate = re.compile(r'(?:\d+,\d+)|(?:\d{2,}(?!:))')
rate = list(filter(pattern_rate.match, string))
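To see how the choice of method changes the result for this pattern, here is a small sketch comparing search, match and fullmatch on the list from the question. The extra item '16abc' is made up purely to show where match and fullmatch differ; the variable names are also illustrative.

```python
import re

# '16abc' is a hypothetical extra item, not part of the original list.
strings = ['16', '160', '1,2', '100,11', '1', '16:', '16:00', '16abc']
pattern_rate = re.compile(r'(?:\d+,\d+)|(?:\d{2,}(?!:))')

# search: a match anywhere in the string is enough.
by_search = [s for s in strings if pattern_rate.search(s)]
# match: the match must start at the beginning of the string.
by_match = [s for s in strings if pattern_rate.match(s)]
# fullmatch: the match must cover the entire string.
by_fullmatch = [s for s in strings if pattern_rate.fullmatch(s)]

print(by_search)
print(by_match)
print(by_fullmatch)
```

With search, '16:00' slips through because the trailing '00' matches on its own; with match, '16abc' still passes because only the start is anchored.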
Not sure you need regex for that:
string = ['16','160','1,2','100,11','1','16:','16:00']
keep = []
for elem in string:
    if ("," in elem and len(elem) == 3) or (":" not in elem and "," not in elem and len(elem) >= 2):
        keep.append(elem)
print(keep)
Output:
['16', '160', '1,2']
Although less elegant, this tends to be faster than using regex.

How to match a full string, instead of partial string? [duplicate]

This question already has answers here:
Order of regular expression operator (..|.. ... ..|..)
(1 answer)
Checking whole string with a regex
(5 answers)
Closed 2 years ago.
pattern = (1|2|3|4|5|6|7|8|9|10|11|12)
str = '11'
This only matches '1', not '11'. How to match the full '11'? I changed it to:
pattern = (?:1|2|3|4|5|6|7|8|9|10|11|12)
It is the same.
I am testing here first:
https://regex101.com/
It is matching 1 instead of 11 because you have 1 before 11 in your alternation. If you use re.findall then it will match 1 twice for input string 11.
However, to match numbers from 1 to 12 you can shorten the alternation and use:
\b(?:1[0-2]|[1-9])\b
It is safer to use word boundaries to avoid matching digits inside a longer word or number.
Regex alternation always tries the left alternative before the right one.
In an alternation, you'd therefore put the longest alternative first.
However, factoring should take precedence.
(1|2|3|4|5|6|7|8|9|10|11|12)
then it turns into
1[012]?|[2-9]
https://regex101.com/r/qmlKr0/1
I purposely didn't add boundary parts as
everybody has their own preference.
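The effect of alternation order can be checked directly in Python. The sketch below compares the original alternation, a longest-first ordering, and the factored form; the variable names are illustrative, and re.fullmatch is shown as one further option for requiring the whole string to match.

```python
import re

original = re.compile(r'1|2|3|4|5|6|7|8|9|10|11|12')
longest_first = re.compile(r'12|11|10|9|8|7|6|5|4|3|2|1')
factored = re.compile(r'1[012]?|[2-9]')

# The first alternative that succeeds wins, so '1' shadows '11'.
print(original.match('11').group())
print(original.findall('11'))
# Trying the longer alternatives first, or factoring, yields the full number.
print(longest_first.match('11').group())
print(factored.match('11').group())
# fullmatch forces the engine to backtrack until the whole string is covered.
print(original.fullmatch('11').group())
```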
do you mean this solution?
[\d]+

Regex to find repeating numbers even if they are separated

I'm trying to create a regular expression that will tell me if I have two or more repeating numbers in a string separated by a comma. For example "10,2,3,4,5,6,7,8,9,10" would return true because there are two tens.
I think I am close. So far I have:
if re.match(r"(\d+),.+\1",line):
Thanks!
You don't need regex for this. Just convert into a list using split, then convert that into a set (which will contain only the unique numbers in the list) and compare the lengths:
line = "10,2,3,4,5,6,7,8,9,10"
lst = line.split(',')
unq = set(lst)
if len(lst) != len(unq):
    # non-unique numbers present
    print('Duplicate number found')
If you want to use regex, you need to use re.search rather than re.match, as re.match requires matches to begin at the start of the string, which would preclude matching 2 in "1,2,3,4,5,6,2,7,8,9,10". Also, you need to surround your (\d+) with word boundaries (\b), so that 1 in "1,2,3,4,5,6,2,7,8,9,10" doesn't then match against the 1 in 10. This regex will give you the results you want:
import re

m = re.search(r'\b(\d+)\b.*\b\1\b', line)
if m:
    print('Duplicate number ' + m.group(1))
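The two approaches from this answer can be wrapped into small helpers and checked against a case where the word boundaries matter. This is a sketch only; the function names first_duplicate_regex and has_duplicate_set are invented for illustration.

```python
import re

DUPLICATE = re.compile(r'\b(\d+)\b.*\b\1\b')

def first_duplicate_regex(line):
    """Return the first repeated number found by the regex, or None."""
    m = DUPLICATE.search(line)
    return m.group(1) if m else None

def has_duplicate_set(line):
    """Return True if any comma-separated item occurs more than once."""
    numbers = line.split(',')
    return len(numbers) != len(set(numbers))

print(first_duplicate_regex("10,2,3,4,5,6,7,8,9,10"))
print(first_duplicate_regex("1,2,3,4,5,6,2,7,8,9,10"))
# Without \b, the 1 would be "found again" inside 10.
print(first_duplicate_regex("1,2,3,10"))
print(has_duplicate_set("1,2,3,10"))
```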

Understanding * (zero or more) operator using re.search() [duplicate]

This question already has answers here:
Difference between * and + regex
(7 answers)
Closed 5 years ago.
I am new to python and was going through "Google for Education" python course
Now, the line below confuses me:
* -- 0 or more occurrences of the pattern to its left
(all the examples are in python3)
e.g. 1
In [1]: re.search(r"pi*", "piiig!!").group()
Out[1]: 'piii'
This is fine since "pi" has 1 occurrence, so it is returned
e.g. 2
In [2]: re.search(r"i*", "piiig!!").group()
Out[2]: ''
Why does it not return "i"? In fact, from my understanding, it should be returning "iii". But the result is an empty string.
Also, what exactly does "0 or more" mean? I searched on Google, but everywhere it just says * -- 0 or more. But if there are 0 occurrences of an expression, doesn't the match succeed even if the expression isn't there? What is the point of searching then?
I am so confused by this. Can you please help me by explaining this or pointing me in the right direction?
I hope the right explanation will also resolve this issue of mine:
In [3]: re.search(r"i?", "piiig!!").group()
Out[3]: ''
I have tried the examples in Spyder 3.2.4
The explanation is a bit more complicated than the answers we have seen so far.
First, unlike re.match() the primitive operation re.search() checks for a match anywhere in the string (this is what Perl does by default) and finds the pattern once:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
If we follow every step of the regex engine while it tries to find a match, we can observe the following for the pattern i* and the test string piigg!!:
As you can see, the first character (at position 0) produces a match because p is zero times i and the result is an empty match (and not p - because we do not search for p or any other character).
At the second character (position 1) the second match (spanning to position 2) is found since ii is zero or more times i. At position 3 there is another empty match, and so on.
Because re.search only returns the first match it sticks with the first empty match at position 0. That's why you get the (confusing) result you have posted:
In [2]: re.search(r"i*", "piiig!!").group()
Out[2]: ''
In order to match every occurrence, you need re.findall():
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
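To make the positions of these matches visible, re.finditer can be used in place of re.findall, since each match object carries its span. The sketch below uses the string from the question and assumes a recent Python 3 interpreter.

```python
import re

text = 'piiig!!'

# Each match object reports where it starts and ends; empty matches have
# identical start and end positions.
spans = [m.span() for m in re.finditer(r'i*', text)]
groups = [m.group() for m in re.finditer(r'i*', text)]

for span, group in zip(spans, groups):
    print(span, repr(group))

# re.search stops at the first of these, the empty match at position 0.
print(re.search(r'i*', text).span())
```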
You need to use * (0 or more) and + (1 or more) properly to get your desired output.
E.g. 1 matches because you have applied * only to "i"; this pattern will capture "p" followed by any number of "i", such as "p" or "pi".
E.g. 2: if you need to match only "i", you need to use "+" instead of "*".
If you use "*"
In: re.search(r"pi*g", "piiig!!").group()
This will match if your input contains "pig", "piig" or "pg".
If you use "+"
In: re.search(r"pi+g", "piiig!!").group()
This will match if your input contains "pig" or "piig", but not "pg".
Because '' is the first matched result of r'i*' and 'iii' is the second matched result.
In [1]: import re
In [2]: re.findall(r'i*', 'piiig!!')
Out[2]: ['', 'iii', '', '', '', '']
This website also explains how regular expressions work.
https://regex101.com/r/XVPXMv/1
The special character * means 0 or more occurrences of the preceding character. For example, a* matches 0 or more occurrences of a, which could be '', 'a', 'aa', etc. This happens because '' has 0 occurrences of a.
To get iii you should use + instead of *, which gives you the first non-empty sequence of 'i', which is iii:
re.search("i+", "piiig!!").group()
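The three quantifiers discussed in this thread can be compared side by side on the string from the question. This is a small illustrative sketch; the variable names are arbitrary.

```python
import re

text = "piiig!!"

star = re.search(r"i*", text)      # zero or more: satisfied by '' at position 0
plus = re.search(r"i+", text)      # one or more: has to skip ahead to the first i
optional = re.search(r"i?", text)  # zero or one: also satisfied by '' at position 0

print(repr(star.group()), star.span())
print(repr(plus.group()), plus.span())
print(repr(optional.group()), optional.span())

# With a literal prefix the match is pinned to the "p", so * can grow greedily.
print(re.search(r"pi*", text).group())

# "pg" contains zero i's: * accepts that, + does not.
print(re.search(r"pi*g", "pg") is not None)
print(re.search(r"pi+g", "pg") is not None)
```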
