How to find the index of undetermined pattern in a string? [duplicate] - python

This question already has answers here:
Python Regex - How to Get Positions and Values of Matches
(4 answers)
Closed 6 years ago.
I want to find the indices of all occurrences of at least two zeros followed by at least two ones (e.g., '0011', '00011', '000111' and so on) in a string (called 'S').
The string S may look like:
'00111001100011'
The code I tried can only spot occurrences of '0011', and strangely returns the index of the first '1'. For example for the S above, my code returns 2 instead of 0:
index = []
index = [n for n in range(len(S)) if S.find('0011', n) == n]
Then I tried to use a regular expression, but the regex I found can't express the specific digits I want (like '0' and '1').
Could anyone kindly come up with a solution, and tell me why my first result returns the index of '1' instead of '0'? Lots of thanks in advance!

In the following code, the regex defines a single instance of the required pattern of digits: 0{2,} matches two or more zeros and 1{2,} matches two or more ones. The regex's finditer method then yields successive non-overlapping matches in the given string S. match.start() gives the starting position of each match, and the resulting list is assigned to starts.
import re
S = '00111001100011'
r = re.compile(r'(0{2,}1{2,})')
starts = [match.start() for match in r.finditer(S)]
print(starts)
# [0, 5, 9]
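If the matched text or the end position is also needed, the same match objects expose them. The sketch below is only an illustration: the names matches and fixed are made up here. It also runs the fixed-string scan from the question on the same S for comparison. Because '0011' is a fixed string, it can only anchor on the last two zeros of a longer run, so the last index it reports is 10 rather than 9.

```python
import re

S = '00111001100011'
pattern = re.compile(r'0{2,}1{2,}')

# (start, end, matched text) for every non-overlapping run
matches = [(m.start(), m.end(), m.group()) for m in pattern.finditer(S)]
print(matches)
# [(0, 5, '00111'), (5, 9, '0011'), (9, 14, '00011')]

# The fixed-string scan from the question, for comparison
fixed = [n for n in range(len(S)) if S.find('0011', n) == n]
print(fixed)
# [0, 5, 10]
```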

Related

Finding all possible 5-letters combinations in a word using Regex (Python) [duplicate]

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
How do I get a substring of a string in Python? [duplicate]
(16 answers)
Closed 2 years ago.
I'm new to Python and to Regex. Here is my current problem, for which I have not managed to find any straight answer online.
I have a string of 5 or more characters, for which I need to search for all the possible combinations of 5 characters.
I wonder if it's doable with regular expressions (instead of, say, creating a list of all possible 5-character combinations and then testing them in loop with my string).
For example, let's say my string is "stackoverflow". I need an expression that could give me a list containing all the possible combinations of 5 successive letters, such as: ['stack', 'tacko', 'ackov', ...] (but not 'stcko' or 'wolfr', for example).
That's what I would try:
import re
word = "stackoverflow"
list = re.findall(r".....", word)
But printing this list would only give:
['stack', 'overf']
Thus it seems that a position can only be matched once: a 5-character combination cannot include a position that has already been matched.
Could anyone help me better understand how regexes work in this situation, and whether what I want is even possible directly using regular expressions?
Thanks!
If the letters are always consecutive, this will work:
wd = "stackoverflow"
lst = ["".join(wd[i:i+5]) for i in range(len(wd)-4)]
print(lst)
Output
['stack', 'tacko', 'ackov', 'ckove', 'kover', 'overf', 'verfl', 'erflo', 'rflow']
I think you could just use a simple loop with a sliding window of size 5:
word = "stackoverflow"
result = []
# len(word) - 4 start positions, so the last window 'rflow' is included
for i in range(len(word) - 4):
    result.append(word[i:i+5])
print(result)
This is efficient: for a fixed window size it runs in O(n) time.
This happens because, as the findall docstring says, it returns all non-overlapping matches:
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)
Each match consumes the characters it covers, so the next search starts after them. See the non-regex solutions in the other answers to this question.
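For completeness, overlapping windows can also be obtained with a regex by wrapping the pattern in a zero-width lookahead, which matches without consuming any characters. This is a minimal sketch, not taken from the answers above; the variable name windows is just illustrative.

```python
import re

word = "stackoverflow"

# (?=...) asserts that five characters follow without consuming them,
# so the scan advances one position at a time; the inner group captures them.
windows = re.findall(r'(?=(.{5}))', word)
print(windows)
# ['stack', 'tacko', 'ackov', 'ckove', 'kover', 'overf', 'verfl', 'erflo', 'rflow']
```

The result is the same list that the slicing approach produces, so for this task the choice is mostly a matter of taste.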

How can I use two patterns in re.search()? [duplicate]

This question already has answers here:
Python RegEx that matches char followed/preceded by same char but uppercase/lowercase
(2 answers)
Closed 3 years ago.
I have the following string: string = DdCcaaBbbB. I want to delete all pairs of the same letter that have one of the following forms, where x is any letter: xX, Xx.
And I want to delete them one by one, in the example, first I would delete Dd, after Cc, Bb and finally bB.
What I have done so far is:
for letter in string.lower():
    try:
        string = string.replace(re.search(letter + letter.upper(), string).group(),'')
    except:
        try:
            string = string.replace(re.search(letter.upper() + letter, string).group(),'')
        except:
            pass
But I am sure this is not the most pythonic way to do it. What came to my mind, and thus the question, is whether I could combine the two patterns I am searching for. Any other suggestion or improvement is more than welcome!
I think you can do a case-insensitive regex search to find all combinations of the same two letters, then have a function check if they're of the xX or Xx format before deciding if it should be replaced (by nothing) or left alone.
import re

def replacer(match):
    text = match.group()
    # Remove the pair only when the two letters differ in case (xX or Xx)
    if (text[0].islower() and text[1].isupper()) or (text[0].isupper() and text[1].islower()):
        return ""
    return text

string = "DdCcaaBbbB"
pattern = r'([a-z])\1'
new_string = re.sub(pattern, replacer, string, flags=re.IGNORECASE)
There is a downside to this approach. Because the case-insensitive pattern consumes both characters of every pair it matches, overlapping pairs are never tested. So if you have an input string like 'BBbb', it will match the two capital Bs and the two lowercase bs and not replace either pair, and it won't check the Bb pair in the middle.
Unfortunately I don't think regex can solve that problem, since it has no way to transform cases in the middle of its search. We're already a bit beyond the bounds of the most basic regular expression specifications, since we need to use a backreference to even get as far as we did.
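One way around the overlap problem is to drop regex altogether and use a stack. This is a different technique from the re.sub approach above, and the function name remove_case_pairs is made up for illustration. Note one assumption that changes behaviour: deletions cascade, so letters that become adjacent after a removal are checked again. That goes further than a single left-to-right pass.

```python
def remove_case_pairs(text):
    """Remove adjacent xX / Xx pairs, re-checking neighbours after each removal."""
    stack = []
    for ch in text:
        # A pair is the same letter in opposite case, e.g. 'b' next to 'B'.
        if stack and stack[-1] != ch and stack[-1].swapcase() == ch:
            stack.pop()       # drop the pair: the stored letter and the current one
        else:
            stack.append(ch)
    return "".join(stack)


print(remove_case_pairs("DdCcaaBbbB"))  # aa
print(remove_case_pairs("BBbb"))        # prints an empty line: both Bb pairs cancel
```

If cascading is not wanted, this sketch would need a different rule for what counts as adjacent after a deletion.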

Why is my regex returning 2 results if there is only one in the string? [duplicate]

This question already has answers here:
strange behavior of parenthesis in python regex
(3 answers)
Closed 4 years ago.
I am trying to extract an ID from a string with python3. The regex returns more than one item, despite there being only one in the text:
>>> text_total = 'Lore Ippsum Ref. 116519LN Perlmutt'
>>> re.findall(r"Ref\.? ?(([A-Z\d\.]+)|([\d.]+))", text_total)
[('116519LN', '116519LN', '')]
I am looking for a single trimmed result, if possible without it being a list at all.
That's why my original line is:
[x for x in re.findall(r"Ref\.? ?(([A-Z\d\.]+)|([\d.]+))", text_total)][0]
The regex has an OR as I am also trying to match
Lore Ippsum Ref. 1166AB.39AZU2.123 Lore Ippsum
How can I retrieve just one result from the text and match both conditions?
Your groups inside your OR group, so to speak, are "capturing groups". You need to make them non capturing using the ?: syntax inside those groups, and allow the outer group to stay as a capturing group.
import re
text_total = 'Lore Ippsum Ref. 116519LN Perlmutt'
re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text_total)
#result ['116519LN']
Note that this still returns multiple matches if the text contains more than one. You can use re.search to get just the first match.
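A minimal sketch of the re.search variant follows. The helper name extract_ref is hypothetical, and the pattern is simplified on the assumption that [\d.]+ adds nothing here, since every string it matches is already matched by [A-Z\d.]+.

```python
import re

# One capturing group; the alternation is dropped because [A-Z\d.]+
# already covers everything [\d.]+ would match.
pattern = re.compile(r"Ref\.? ?([A-Z\d.]+)")

def extract_ref(text):
    """Return the first ID after 'Ref', or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(extract_ref('Lore Ippsum Ref. 116519LN Perlmutt'))
# 116519LN
print(extract_ref('Lore Ippsum Ref. 1166AB.39AZU2.123 Lore Ippsum'))
# 1166AB.39AZU2.123
```

re.search returns a single match object (or None), so the result is a plain string rather than a list.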
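You don't necessarily need an OR; you can use Ref\.? ?([a-zA-Z. 0-9]+) instead (note the space at the end of the regex, which is used as the end of the match). Be aware that the character class itself contains a space, so the greedy match runs to the last space it can reach; this works for the sample text but can capture extra words when more than one word follows the ID.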
import re
pattern = r"Ref\.? ?([a-zA-Z. 0-9]+) "
text_total = "Lore Ippsum Ref. 116519LN Perlmutt"
results = re.findall(pattern, text_total)
print(results[0])

Regex substring one mismatch in any location of string

Can someone explain why the code below returns an empty list:
>>> import re
>>> m = re.findall("(SS){e<=1}", "PSSZ")
>>> m
[]
I am trying to find the total number of occurrences of SS (and incorporating the possibility of up to one mismatch) within PSSZ.
I saw a similar example of code here: Search for string allowing for one mismatch in any location of the string
You need to remove the e<= characters from inside the range quantifier. With the re module, a range quantifier must take one of these forms:
{n} repeats the previous token exactly n times.
{min,max} repeats the previous token from min to max times.
So it would be,
m = re.findall("(SS){1}", "PSSZ")
or
m = re.findall(r'SS','PSSZ')
Update:
>>> re.findall(r'(?=(S.|.S))', 'PSSZ')
['PS', 'SS', 'SZ']
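The {e<=1} syntax comes from the third-party regex package's fuzzy matching; the standard re module, as far as I know, does not treat it as a quantifier, which would explain the empty list. Without that package, "up to one mismatch" can be sketched with a plain sliding window that counts differing positions. The function name fuzzy_find and its parameters are made up for illustration.

```python
def fuzzy_find(text, pattern, max_mismatches=1):
    """Return (index, window) pairs whose Hamming distance to pattern is small enough."""
    k = len(pattern)
    hits = []
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        # Count the positions where the window and the pattern differ
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits.append((i, window))
    return hits


print(fuzzy_find("PSSZ", "SS"))
# [(0, 'PS'), (1, 'SS'), (2, 'SZ')]
```

This only handles substitutions, not insertions or deletions, and it generalises to longer patterns without having to spell out every alternative by hand as in (S.|.S).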

Why empty string is on every string? [duplicate]

This question already has answers here:
Why is True returned when checking if an empty string is in another?
(5 answers)
Closed 5 years ago.
For example:
>>> s = 'python'
>>> s.index('')
0
>>> s.index('p')
0
This is because the substring of length 0 starting at index 0 in 'python' is equal to the empty string:
>>> s[0:0]
''
Of course every substring of length zero of any string is equal to the empty string.
You can see "python" as "the empty string, followed by a p, followed by fifteen more empty strings, followed by a y, followed by forty-two empty strings, ...".
Point being, empty strings don't take any space, so there's no reason why it should not be there.
The index method could be specified like this:
s.index(t) returns a value i such that s[i : i+len(t)] is equal to t
If you substitute the empty string for t, this reads: "returns a value i such that s[i:i] is equal to the empty string". And indeed, the value 0 is a correct return value according to this specification.
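The specification above can be checked directly in a few lines. This is only a quick sketch; the variable name positions is illustrative.

```python
s = 'python'

# The empty string is "in" every string
print('' in s)            # True

# Every index from 0 to len(s) satisfies s[i:i] == ''
positions = [i for i in range(len(s) + 1) if s[i:i] == '']
print(positions)          # [0, 1, 2, 3, 4, 5, 6]

# str.count sees one empty string at each of those positions
print(s.count(''))        # 7

# index returns the first such position at or after the start argument
print(s.index('', 3))     # 3
```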
