Match same number of repetitions of character as repetitions of captured group - python

I would like to clean up some input that was logged from my keyboard, using Python and regex, especially where backspace was used to fix a mistake.
Example 1:
[in]: 'Helloo<BckSp> world'
[out]: 'Hello world'
This can be done with
re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')
Example 2:
However when I have several backspaces, I don't know how to delete exactly the same number of characters before:
[in]: 'Helllo<BckSp><BckSp>o world'
[out]: 'Hello world'
(Here I want to remove 'l' and 'o' before the two backspaces).
I could simply use re.sub(r'[^>]<BckSp>', '', line) several times until there is no <BckSp> left but I would like to find a more elegant / faster solution.
Does anyone know how to do this ?

It looks like Python does not support recursive regex. If you can use another language, you could try this:
.(?R)?<BckSp>
See: https://regex101.com/r/OirPNn/1

It isn't very efficient but you can do that with the re module:
(?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1
This way you don't have to count, the pattern only uses the repetition.
(?:
    [^<]  # a character to remove
    (?=   # lookahead to reach the corresponding <BckSp>
        [^<]*  # skip characters until the first <BckSp>
        (      # capture group 1: contains the <BckSp>s
            (?=(\1?))\2  # emulate an atomic group in place of \1?+
                         # The idea is to add the <BckSp>s already matched in the
                         # previous repetitions if any to be sure that the following
                         # <BckSp> isn't already associated with a character
            <BckSp>  # corresponding <BckSp>
        )
    )
)+  # each time the group is repeated, the capture group 1 is growing with a new <BckSp>
\1  # matches all the consecutive <BckSp> and ensures that there's no more character
    # between the last character to remove and the first <BckSp>
You can do the same with the regex module, but this time you don't need to emulate the possessive quantifier:
(?:[^<](?=[^<]*(\1?+<BckSp>)))+\1
But with the regex module, you can also use recursion (as @Fallenhero noted):
[^<](?R)?<BckSp>

Since there is no support for recursion/subroutine calls, no atomic groups/possessive quantifiers in Python re, you may remove these chars followed with backspaces in a loop:
import re
s = "Helllo\b\bo world"
r = re.compile("^\b+|[^\b]\b")
while r.search(s):
    s = r.sub("", s)
print(s)
See the Python demo
The "^\b+|[^\b]\b" pattern will find 1+ backspace chars at the string start (with ^\b+) and [^\b]\b will find all non-overlapping occurrences of any char other than a backspace followed with a backspace.
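The same approach works when a backspace is expressed as some entity/tag such as a literal <BckSp>:
import re
s = "Helllo<BckSp><BckSp>o world"
r = re.compile("^(?:<BckSp>)+|.<BckSp>", flags=re.S)
while r.search(s):
    # count=1: replace only the leftmost match per pass, so that "." can
    # never consume the final ">" of a preceding <BckSp> marker
    s = r.sub("", s, count=1)
print(s)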
See another Python demo

Slightly verbose, but you can use a lambda to count the number of <BckSp> occurrences and slice the preceding text accordingly to get your final output.
>>> import re
>>> bk = '<BckSp>'
>>> s = 'Helllo<BckSp><BckSp>o world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello world
>>> s = 'Helloo<BckSp> world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello world
>>> s = 'Helloo<BckSp> worl<BckSp>d'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello word
>>> s = 'Helllo<BckSp><BckSp>o world<BckSp><BckSp>k'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello work

In case the marker is a single character, you could use a stack, which gives you the result in a single pass:
s = "Helllo\b\bo world"
res = []
for c in s:
    if c == '\b':
        if res:
            del res[-1]
    else:
        res.append(c)
print(''.join(res)) # Hello world
In case the marker is literally '<BckSp>' or some other string with length greater than 1, you can use replace to turn it into '\b' and use the solution above. This only works if you know that '\b' doesn't occur in the input. If you can't designate a substitute character, you could use split and process the results:
s = 'Helllo<BckSp><BckSp>o world'
res = []
for part in s.split('<BckSp>'):
    if res:
        del res[-1]
    res.extend(part)
print(''.join(res)) # Hello world
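For reuse, the split idea can be wrapped in a small function. This is only a sketch: the name apply_backspaces and the default marker are my own choices, and it assumes the marker never appears as ordinary text in the input.

```python
def apply_backspaces(text, marker="<BckSp>"):
    """Replay backspace markers in a single pass over the text."""
    first, *rest = text.split(marker)
    kept = list(first)
    for part in rest:
        # Every split boundary stands for one marker:
        # drop the last kept character, if there is one.
        if kept:
            kept.pop()
        kept.extend(part)
    return "".join(kept)

print(apply_backspaces("Helllo<BckSp><BckSp>o world"))  # Hello world
print(apply_backspaces("Helllo\b\bo world", "\b"))      # Hello world
```

Markers with nothing left to delete (for example at the very start of the input) are simply ignored.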

Related

How to efficiently pass or ignore some tokens resolved by a python regex?

I am applying a function to a list of tokens as follows:
def replace(e):
    return e
def foo(a_string):
    l = []
    for e in a_string.split():
        l.append(replace(e.lower()))
    return ' '.join(l)
With the string:
s = 'hi how are you today 23:i ok im good 1:i'
The function foo corrects the spelling of the tokens in s. However, there are some cases that I would like to ignore, for example 12:i or 2:i. How can I apply foo to all the tokens that are not matched by the regex \d{2}\b:i\b|\d{1}\b:i\b? That is, I would like foo to ignore all tokens of the form 23:i, 01:e or 1:i. I was thinking of using a regex, but maybe there is a better way of doing this.
The expected output would be:
'hi how are you today 23:i ok im good 1:e'
In other words the function foo ignores tokens with the form nn:i or n:i, where n is a number.
You may use
import re
def replace(e):
return e
s = 'hi how are you today 23:i ok im good 1:e'
rx = r'(?<!\S)(\d{1,2}:[ie])(?!\S)|\S+'
print(re.sub(rx, lambda x: x.group(1) if x.group(1) else replace(x.group().lower()), s))
See the Python demo online and the regex demo.
The (?<!\S)(\d{1,2}:[ie])(?!\S)|\S+ pattern matches
(?<!\S)(\d{1,2}:[ie])(?!\S) - 1 or 2 digits, : and i or e that are enclosed with whitespaces or string start/end positions (with the substring captured into group 1)
| - or
\S+ - 1+ non-whitespace chars.
Once Group 1 matches, its value is pasted back as is, else, the lowercased match is passed to the replace method and the result is returned.
Another regex approach:
rx = r'(?<!\S)(?!\d{1,2}:[ie](?!\S))\S+'
s = re.sub(rx, lambda x: replace(x.group().lower()), s)
See another Python demo and a regex demo.
Details
(?<!\S) - checks if the char immediately to the left is a whitespace or asserts the string start position
(?!\d{1,2}:[ie](?!\S)) - a negative lookahead that fails the match if, immediately to the right of the current location, there is 1 or 2 digits, :, i or e, and then a whitespace or end of string should follow
\S+ - 1+ non-whitespace chars.
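If you would rather keep the token loop from the question, the same check can be done per token with re.fullmatch. A minimal sketch: the CORRECTIONS dict is a hypothetical stand-in for the real spelling corrector, and PROTECTED is just a name I picked.

```python
import re

# Tokens such as "23:i", "1:e" or "01:e" are passed through untouched.
PROTECTED = re.compile(r"\d{1,2}:[ie]")

# Hypothetical stand-in for the real spelling corrector.
CORRECTIONS = {"im": "I'm"}

def replace(token):
    return CORRECTIONS.get(token, token)

def foo(a_string):
    out = []
    for token in a_string.split():
        if PROTECTED.fullmatch(token):
            out.append(token)  # protected token: keep as is
        else:
            out.append(replace(token.lower()))
    return " ".join(out)

print(foo("hi how are you today 23:i ok im good 1:e"))
# hi how are you today 23:i ok I'm good 1:e
```

Like the original foo, this collapses runs of whitespace because of split().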
Try this:
s = ' '.join([i for i in s.split() if ':e' not in i])

Removing specific duplicated characters from a string in Python

How can I delete specific duplicated characters from a string in Python, but only when they occur one right after another? For example:
I have the string
string = "Hello _my name is __Alex"
I need to delete the duplicate _ only where two of them are adjacent (__) and get a string like this:
string = "Hello _my name is _Alex"
If I use set I get this:
string = "_yoiHAemnasxl"
(Big edit: oops, I missed that you only want to de-duplicate certain characters and not others. Retrofitting solutions...)
I assume you have a string that represents all the characters you want to de-duplicate. Let's call it to_remove, and say that it's equal to "_.-". So only underscores, periods, and hyphens will be de-duplicated.
You could use a regex to match multiple successive repeats of a character, and replace them with a single character.
>>> import re
>>> to_remove = "_.-"
>>> s = "Hello... _my name -- is __Alex"
>>> pattern = "(?P<char>[" + re.escape(to_remove) + "])(?P=char)+"
>>> re.sub(pattern, r"\1", s)
'Hello. _my name - is _Alex'
Quick breakdown:
?P<char> assigns the symbolic name char to the first group.
we put to_remove inside the character matching set, []. It's necessary to call re.escape because hyphens and other characters may have special meaning inside the set otherwise.
(?P=char) refers back to the character matched by the named group "char".
The + matches one or more repetitions of that character.
So in aggregate, this means "match any character from to_remove that appears more than once in a row". The second argument to sub, r"\1", then replaces that match with the first group, which is only one character long.
Alternative approach: write a generator expression that takes only characters that don't match the character preceding them.
>>> "".join(s[i] for i in range(len(s)) if i == 0 or not (s[i-1] == s[i] and s[i] in to_remove))
'Hello. _my name - is _Alex'
Alternative approach #2: use groupby to identify groups of consecutive identical characters, then join the values together, using to_remove membership testing to decide how many values should be added.
>>> import itertools
>>> "".join(k if k in to_remove else "".join(v) for k,v in itertools.groupby(s, lambda c: c))
'Hello. _my name - is _Alex'
Alternative approach #3: call re.sub once for each member of to_remove. A bit expensive if to_remove contains a lot of characters.
>>> for c in to_remove:
...     s = re.sub(rf"({re.escape(c)})\1+", r"\1", s)
...
>>> s
'Hello. _my name - is _Alex'
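As a quick sanity check, the regex and groupby variants can be wrapped in functions and compared. This is a sketch only: the function names are made up, and to_remove is assumed to be a non-empty string of single characters.

```python
import itertools
import re

def squeeze_regex(s, to_remove):
    # A character from to_remove followed by one or more copies of itself.
    pattern = "(?P<char>[" + re.escape(to_remove) + "])(?P=char)+"
    return re.sub(pattern, r"\g<char>", s)

def squeeze_groupby(s, to_remove):
    # Runs of a character in to_remove shrink to one copy; other runs stay whole.
    return "".join(
        key if key in to_remove else "".join(run)
        for key, run in itertools.groupby(s)
    )

s = "Hello... _my name -- is __Alex"
print(squeeze_regex(s, "_.-"))    # Hello. _my name - is _Alex
print(squeeze_groupby(s, "_.-"))  # Hello. _my name - is _Alex
```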
Simple re.sub() approach:
import re
s = "Hello _my name is __Alex aa"
result = re.sub(r'(\S)\1+', '\\1', s)
print(result)
\S - any non-whitespace character
\1+ - backreference to the 1st parenthesized captured group (one or more occurrences)
The output:
Helo _my name is _Alex a
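Note that (\S)\1+ collapses every repeated non-whitespace character, which is why "Hello" became "Helo" and "aa" became "a". If only the underscore should be de-duplicated, as in the question, a narrower pattern does it. A small sketch:

```python
import re

s = "Hello _my name is __Alex aa"
# Two or more consecutive underscores become a single one; nothing else changes.
result = re.sub(r"_{2,}", "_", s)
print(result)  # Hello _my name is _Alex aa
```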

python regex - replace newline (\n) to something else

I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?
EDIT To replace multiple continuous newline characters (\n) to ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) is to assert "newline characters followed by Capital Letter". REGEX DEMO.
Well, let's take a look at your regex ([\n]+)([A-Z])+. The first part, ([\n]+), is fine: it matches multiple occurrences of a newline into one group (note that this won't match the carriage return \r). The second part, ([A-Z])+, leads to your error: it matches a single uppercase letter into a capturing group, multiple times if there are several uppercase letters. Each repetition resets the group to the last matched uppercase letter, which is then used for the replacement.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
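Running that snippet shows the effect. Here is the same thing with the match inspected first (the variable names are only for illustration):

```python
import re

text = "Inc\n\nABRAXAS"
m = re.search(r'([\n]+)([A-Z])+', text)
print(repr(m.group(0)))  # '\n\nABRAXAS' (the whole word is consumed)
print(m.group(2))        # S (only the last repetition is kept)

replaced = re.sub(r'([\n]+)([A-Z])+', r"____\2", text)
print(replaced)          # Inc____S
```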
You could simply place the + inside the capturing group, so multiple uppercase letters are matched into it. You could also just leave it out, as it doesn't make a difference, how many of these uppercase letters follow.
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns
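For example, with Windows-style line endings (the sample string is made up for illustration):

```python
import re

text = "Inc\r\n\r\nContact"
# Any run of carriage returns and/or newlines becomes one separator.
cleaned = re.sub(r'[\r\n]+', '____', text)
print(cleaned)  # Inc____Contact
```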
Try:
import re
p = re.compile(r'\r?\n')
test_str = "(2 months)\n\nML"
subst = "_"
result = re.sub(p, subst, test_str)
It will reduce the string to
(2 months)__ML

Stripping variable borders with python re

How does one replace a pattern when the substitution itself is a variable?
I have the following string:
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
I would like to retain only the right-most word in the brackets ('merited', 'eaten', 'go'), stripping away what surrounds these words, thus producing:
merited and eaten and go
I have the regex:
p = '''\[\[[a-zA-Z]*\[|]*([a-zA-Z]*)\]\]'''
...which produces:
>>> re.findall(p, s)
['merited', 'eaten', 'go']
However, as this varies, I don't see a way to use re.sub() or s.replace().
import re
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
p = r'''\[\[[a-zA-Z]*?[|]*([a-zA-Z]*)\]\]'''
re.sub(p, r'\1', s)
The ? makes the first [a-zA-Z]* lazy, so that for [[go]] it matches the empty (shortest) string and the second one gets the actual go string.
\1 substitutes the first (in this case the only) match group of the pattern for each non-overlapping match in the string s. r'\1' is used so that \1 is not interpreted as the character with code 0x1.
Well, first you need to fix your regex to capture the whole group:
>>> s = '[[merit|merited]] and [[eat|eaten]] and [[go]]'
>>> p = r'(\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\])'
>>> re.findall(p, s)
[('[[merit|merited]]', 'merited'), ('[[eat|eaten]]', 'eaten'), ('[[go]]', 'go')]
This matches the whole [[whateverisinhere]] and separates the whole match as group 1 and just the final word as group 2. You can then use the \2 token to replace the whole match with just group 2:
>>> re.sub(p,r'\2',s)
'merited and eaten and go'
or change your pattern to:
p = '\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
which gets rid of grouping the entire match as group 1 and only groups what you want. you can then do:
>>> re.sub(p,r'\1',s)
to have the same effect.
POST EDIT:
I forgot to mention that I actually changed your regex so here is the explanation:
\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]
\[\[                           \]\]  #literal matches of brackets
    (?:           )*                 #non-capturing group that can match 0 or more of what's inside
       [a-zA-Z]*\|                   #matches any word that is followed by a '|' character
                    (   ...   )      #captures into group one the final word
I feel like this is stronger than what you originally had because it will also change if there are more than 2 options:
>>> s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'
>>> p = '\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
>>> re.sub(p,r'\1',s)
'merited and eaten and go'
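On the "substitution is a variable" part of the question: re.sub also accepts a function as the replacement, so the text to insert can be computed per match. A sketch of that alternative (keep_last_option is a made-up name), which grabs everything between the brackets and keeps what follows the last '|':

```python
import re

def keep_last_option(text):
    # Capture everything between [[ and ]], then keep the part after the last '|'.
    return re.sub(r"\[\[([^\]]*)\]\]", lambda m: m.group(1).split("|")[-1], text)

print(keep_last_option("[[merit|merited]] and [[ate|eat|eaten]] and [[go]]"))
# merited and eaten and go
```

Unlike the patterns above, this does not restrict the bracket content to letters.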

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that what is called a "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or string boundaries, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
You could try the below regex also,
(?<= |^)(\S+)(?: \1){2,}(?= |$)
Sample code,
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max: out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
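The same loop can also be written with itertools, if you prefer. A sketch under the same assumptions (words are whitespace-separated and extra spaces are lost); the name cap_repeats and the limit parameter are my own:

```python
from itertools import groupby, islice

def cap_repeats(s, limit=2):
    # groupby yields runs of identical consecutive words;
    # islice keeps at most `limit` words from each run.
    return " ".join(
        word
        for _, run in groupby(s.split())
        for word in islice(run, limit)
    )

print(cap_repeats("bye! bye! bye!"))  # bye! bye!
```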
Try the following:
import re
s = "bye! bye! bye!"  # your string
s = re.sub( r'(\S+) (?:\1 ?){2,}', r'\1 \1', s )
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)
