Python re.sub() is not replacing every match

Python re.sub() is not replacing every match - python

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But, it gives wrong output for first string. Also, when I tried another one (here):
(.)(.*?)*?\1
But, this one again gives wrong output. What's going wrong here?
The python code is a print statement:
print(re.sub(r'(.)(.*?)\1', '\g<2>', s)) # s is the string

It can be solved without regular expression, like below
>>>''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>>''.join([i for i in s if s.count(i) == 1])
'c'

re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
if newS == s:
break
s = newS
print(newS)
DEMO

Regular expressions doesn't seem to be the ideal solution
they don't handle overlapping so it it needs a loop (like in this answer) and it creates strings over and over (performance suffers)
they're overkill here, we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all elements each time.
It can be solved without regular expression and without O(n**2) complexity, only O(n) using collections.Counter
first count the characters of the string very easily & quickly
then filter the string testing if the count matches using the counter we just created.
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)

EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like #jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub('\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds as long of a pattern for subpattern A that still allows for the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub('\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.

The site explains it well, hover and use the explanation section.
(.)(.*?)\1 Does not remove or match every double occurance. It matches 1 character, followed by anything in the middle sandwiched till that same character is encountered again.
so, for abbcabb the "sandwiched" portion should be bbc between two a
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
if i not in result:
result.append(i)
else:
result.remove(i)
print(''.join(result))
Note that this produces the "last" odd occurrence of a string and not first.
For "first" known occurance, you should use a counter as suggested in this answer . Just change the condition to check for odd counts. pseudo code(count[letter] %2 == 1)

Related

How do you find all instances of a substring, followed by a certain number of dynamic characters?

I'm trying to find all instances of a specific substring(a!b2 as an example) and return them with the 4 characters that follow after the substring match. These 4 following characters are always dynamic and can be any letter/digit/symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.

When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']

import re
print (re.search(r'a!b2(.{4})')).group(1))
.{4} matches any 4 characters except special characters.
group(0) is the complete match of the searched string. You can read about group id here.

If you're only looking for how to grab the following 4 characters using Regex, what you are probably looking to use is the curly brace indicator for quantity to match: '{}'.
They go into more detail in the post here, but essentially you would do [a-Z][0-9]{X,Y} or (.{X,Y}), where X to Y is the number of characters you're looking for (in your case, you would only need {4}).
A more Pythonic way to solve this problem would be to make use of string slicing, and the index function however.
Eg. given an input_string, when you find the substring at index i using index, then you could use input_string[i+len(sub_str):i+len(sub_str)+4] to grab those special characters.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index: start_index + 4]
Outputs (to show it works with <4 as well): efg
Index also allows you to give start and end indexes for the search, so you could also use it in a loop if you wanted to find it for every sub string, with the start of the search index being the previous found index + 1.

Pythonic way to find the last position in a string matching a negative regex

In Python, I try to find the last position in an arbitrary string that does match a given pattern, which is specified as negative character set regex pattern. For example, with the string uiae1iuae200, and the pattern of not being a number (regex pattern in Python for this would be [^0-9]), I would need '8' (the last 'e' before the '200') as result.
What is the most pythonic way to achieve this?
As it's a little tricky to quickly find method documentation and the best suited method for something in the Python docs (due to method docs being somewhere in the middle of the corresponding page, like re.search() in the re page), the best way I quickly found myself is using re.search() - but the current form simply must be a suboptimal way of doing it:
import re
string = 'uiae1iuae200' # the string to investigate
len(string) - re.search(r'[^0-9]', string[::-1]).start()
I am not satisfied with this for two reasons:
- a) I need to reverse string before using it with [::-1], and
- b) I also need to reverse the resulting position (subtracting it from len(string) because of having reversed the string before.
There needs to be better ways for this, likely even with the result of re.search().
I am aware of re.search(...).end() over .start(), but re.search() seems to split the results into groups, for which I did not quickly find a not-cumbersome way to apply it to the last matched group. Without specifying the group, .start(), .end(), etc, seem to always match the first group, which does not have the position information about the last match. However, selecting the group seems to at first require the return value to temporarily be saved in a variable (which prevents neat one-liners), as I would need to access both the information about selecting the last group and then to select .end() from this group.
What's your pythonic solution to this? I would value being pythonic more than having the most optimized runtime.
Update
The solution should be functional also in corner cases, like 123 (no position that matches the regex), empty string, etc. It should not crash e.g. because of selecting the last index of an empty list. However, as even my ugly answer above in the question would need more than one line for this, I guess a one-liner might be impossible for this (simply because one needs to check the return value of re.search() or re.finditer() before handling it). I'll accept pythonic multi-line solutions to this answer for this reason.

You can use re.finditer to extract start positions of all matches and return the last one from list. Try this Python code:
import re
print([m.start(0) for m in re.finditer(r'\D', 'uiae1iuae200')][-1])
Prints:
8
Edit:
For making the solution a bit more elegant to behave properly in for all kind of inputs, here is the updated code. Now the solution goes in two lines as the check has to be performed if list is empty then it will print -1 else the index value:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
lst = [m.start() for m in re.finditer(r'\D', s)]
print(s, '-->', lst[-1] if len(lst) > 0 else None)
Prints the following, where if no such index is found then prints None instead of index:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
Edit 2:
As OP stated in his post, \d was only an example we started with, due to which I came up with a solution to work with any general regex. But, if this problem has to be really done with \d only, then I can give a better solution which would not require list comprehension at all and can be easily written by using a better regex to find the last occurrence of non-digit character and print its position. We can use .*(\D) regex to find the last occurrence of non-digit and easily print its index using following Python code:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
m = re.match(r'.*(\D)', s)
print(s, '-->', m.start(1) if m else None)
Prints the string and their corresponding index of non-digit char and None if not found any:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
And as you can see, this code doesn't need to use any list comprehension and is better as it can just find the index by just one regex call to match.
But in case OP indeed meant it to be written using any general regex pattern, then my above code using comprehension will be needed. I can even write it as a function that can take the regex (like \d or even a complex one) as an argument and will dynamically generate a negative of passed regex and use that in the code. Let me know if this indeed is needed.

To me it sems that you just want the last position which matches a given pattern (in this case the not a number pattern).
This is as pythonic as it gets:
import re
string = 'uiae1iuae200'
pattern = r'[^0-9]'
match = re.match(fr'.*({pattern})', string)
print(match.end(1) - 1 if match else None)
Output:
8
Or the exact same as a function and with more test cases:
import re
def last_match(pattern, string):
match = re.match(fr'.*({pattern})', string)
return match.end(1) - 1 if match else None
cases = [(r'[^0-9]', 'uiae1iuae200'), (r'[^0-9]', '123a'), (r'[^0-9]', '123'), (r'[^abc]', 'abcabc1abc'), (r'[^1]', '11eea11')]
for pattern, string in cases:
print(f'{pattern}, {string}: {last_match(pattern, string)}')
Output:
[^0-9], uiae1iuae200: 8
[^0-9], 123a: 3
[^0-9], 123: None
[^abc], abcabc1abc: 6
[^1], 11eea11: 4

This does not look Pythonic because it's not a one-liner, and it uses range(len(foo)), but it's pretty straightforward and probably not too inefficient.
def last_match(pattern, string):
for i in range(1, len(string) + 1):
substring = string[-i:]
if re.match(pattern, substring):
return len(string) - i
The idea is to iterate over the suffixes of string from the shortest to the longest, and to check if it matches pattern.
Since we're checking from the end, we know for sure that the first substring we meet that matches the pattern is the last.

The difference between ( [^,]) and (.,) in regular expression? Using python

When I tried to transform the string into a dict-like form, I met this problem
s = '&a: 12, &b:13, &c:14, &d: 15' # the string I want to convert
Before converting it, I tried to find all the matched results at first so I used
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
result = dict_form.findall(s)
print(result) # [('&a:', ' 12, &b:13, &c:14')]
It's quite unexpected, and a little bit messy
But when I tried another way to match the string:
dict_form1 = re.compile(r'(&[a-zA-Z]*:)([^,]*)')
result = dict_form1.findall(s)
print(result) # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14'), ('&d:', ' 15')]
This time, I get a better one with key and item separately stored in a tuple.
The only difference I made was (.), into [^,]
The first one I thought was to find anything until it matches a comma
The second one I thought was to find anything but comma
What's the difference?

In the first instance:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
the (.*) operator is greedy. This means it will match everything up to the last comma, which is why you see the match extend up to &c:14.
In the second instance, by excluding the comma, you are forcing the match to be bound by a comma-- it's like saying "match everything until we hit a comma". This will cause the matching behavior you were expecting in the first place.

as have been said the .* will be greedy and try to match as much as possible, to make it non-greedy use the question mark (?) as in .*?. In your code:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*?),')
result = dict_form.findall(s)
print(result)
Another maybe easier solution is to just use string splits instead of regex:
result = [_s.split(':') for _s in s.split(',')]

Print everything after a few specific characters in a string

My intention was use the index method to search for either a colon (:) or a equal sign (=) in the string and print everything after that character but I realized it's not syntactically possible as it's written below with the OR statement. So is there another way to write this piece of code? (I wasn't able to come up with a simple way to write this without getting into loops and if statements)
l='Name = stack'
pos=l.index(':' or '=')
print (' '.join(l[pos+1:-1].split())) #this just gets rid of the whitespaces

Assuming your example as above, the long way (explanation of each piece below):
pos = max(l.find(':'), l.find('='), 0)
print(l[pos:].strip())
Here's a way to shorten it to one line, with an explanation of each part in the order it's evaluated in.
print(l[max(l.find(':'),l.find('='),0):].strip())
#--------------- Breakdown
# max -> highest of values; find returns -1 if it isn't there.
# using a 0 at the end means if ':'/'=' aren't in the string, print the whole thing.
# l.find(),l.find() -> check the two characters, using the higher due to max()
# l[max():] -> use that higher value until the end (implied with empty :])
# .strip() -> remove whitespace

import re
l='Name = stack'
print(re.split(':|=', l)[-1])
Regular expression split on either character, then take the last result.
You didn't mention if there was guaranteed to be one or the other separator and not both, always a separator, not more than one separator... this might not do what you want, depending.

You should limit the number of splits to one, using maxsplit in re.split():
import re
s1 = 'name1 = x1 and noise:noise=noise'
s2 = 'name2: x2 and noise:noise=noise'
print(re.split(':|=', s1, maxsplit=1)[-1].strip())
print(re.split(':|=', s2, maxsplit=1)[-1].strip())
Output:
x1 and noise:noise=noise
x2 and noise:noise=noise

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because thee will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!

Use this:
O,2,((?:[0-9],?){0,}),X
See it in action:http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print matches
Outputs:
['9,6,7,11,8']
Explained:
By wrapping the entire value capture between 2, and ,X in (), you end up capturing that as well. I then used the (?: ) to ignore the inner captured set.

you don't have to use regex
split the string to array
check item 0 == 0 , item 1==2
check last item == X
check item[2:-2] each one of them is a number (is_digit)
that's all

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python re.sub() is not replacing every match - python

It can be solved without regular expression, like below >>>''.join([i for i in s1 if s1.count(i) == 1]) 'bc' >>>''.join([i for i in s if s.count(i) == 1]) 'c'

Related

How do you find all instances of a substring, followed by a certain number of dynamic characters?

Pythonic way to find the last position in a string matching a negative regex

The difference between ( [^,]) and (.,) in regular expression? Using python

Print everything after a few specific characters in a string

Python Regular Expressions Findall

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python re.sub() is not replacing every match - python

It can be solved without regular expression, like below >>>''.join([i for i in s1 if s1.count(i) == 1]) 'bc' >>>''.join([i for i in s if s.count(i) == 1]) 'c'

Related

How do you find all instances of a substring, followed by a certain number of dynamic characters?

Pythonic way to find the last position in a string matching a negative regex

The difference between ( [^,]*) and (.*,) in regular expression? Using python

Print everything after a few specific characters in a string

Python Regular Expressions Findall

Categories

Resources

The difference between ( [^,]) and (.,) in regular expression? Using python