How to use regex to only keep first n repeated words - python

If I have an input sentence
input = 'ok ok, it is very very very very very hard'
and what I want to do is to keep only the first three occurrences of any repeated word:
output = 'ok ok, it is very very very hard'
How can I achieve this with re or regex module in python?

One option could be to use a capturing group with a backreference and use that in the replacement.
((\w+)(?: \2){2})(?: \2)*
Explanation
( Capture group 1
(\w+) Capture group 2, match 1+ word chars (the example data only uses word characters; to make sure the match is not part of a larger word, use a word boundary \b)
(?: \2){2} Repeat 2 times matching a space and a backreference to group 2. Instead of a single space you could use [ \t]+ to match 1+ spaces or tabs, or \s+ to match 1+ whitespace chars (note that \s also matches a newline)
) Close group 1
(?: \2)* Match 0+ times a space and a backreference to group 2 to match the same words that you want to remove
For example
import re
regex = r"((\w+)(?: \2){2})(?: \2)*"
s = "ok ok, it is very very very very very hard"
result = re.sub(regex, r"\1", s)
if result:
    print(result)
Result
ok ok, it is very very very hard
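The same idea can be generalised to any n by building the quantifier from a parameter. This is only a minimal sketch and not part of the answer above: the helper name keep_first_n is hypothetical, and it assumes repeated words are separated by whitespace only. It uses the \s+ and \b variants mentioned in the explanation.

```python
import re

def keep_first_n(text, n=3):
    """Collapse runs of a repeated word to at most n occurrences."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Group 1 holds the word plus (n - 1) repeats; the trailing group
    # consumes the surplus repeats, which the replacement drops.
    pattern = rf"\b((\w+)(?:\s+\2){{{n - 1}}})(?:\s+\2)+\b"
    return re.sub(pattern, r"\1", text)

print(keep_first_n("ok ok, it is very very very very very hard"))
```

The trailing \b keeps a longer word that merely starts with the repeated word (for example "then" after "the the the") from being treated as another repeat.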

You can group a word and use a backreference to it to match only words that occur more than three times in a row:
import re
s = 'ok ok, it is very very very very very hard'
print(re.sub(r'\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b', r'\1', s))
This outputs:
ok ok, it is very very very hard

One solution with re.sub with custom function:
s = 'ok ok, it is very very very very very hard'
def replace(n=3):
    last_word, cnt = '', 0
    current_word = yield
    while True:
        if last_word == current_word:
            cnt += 1
        else:
            cnt = 0
        last_word = current_word
        if cnt >= n:
            current_word = yield ''
        else:
            current_word = yield current_word
import re
replacer = replace()
next(replacer)
print(re.sub(r'\s*[\w]+\s*', lambda g: replacer.send(g.group(0)), s))
Prints:
ok ok, it is very very very hard
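If the generator protocol feels heavy, the same stateful-callback idea can probably be written with a closure. This is a rough sketch under my own assumptions, not the answer's code: limit_repeats is a made-up name, and it compares only the word itself, so words separated by punctuation (as in "ok, ok") still count as consecutive.

```python
import re

def limit_repeats(text, n=3):
    """Drop every consecutive repeat of a word beyond the first n."""
    last_word, count = None, 0

    def repl(match):
        nonlocal last_word, count
        word = match.group(2)
        # Compare only the word, not the whitespace in front of it.
        count = count + 1 if word == last_word else 1
        last_word = word
        # Keep the token (leading whitespace + word) for the first n occurrences.
        return match.group(0) if count <= n else ""

    return re.sub(r"(\s*)(\w+)", repl, text)

print(limit_repeats("ok ok, it is very very very very very hard"))
```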

Related

Extract substring from a python string

I want to extract the string before the 9 digit number below:
tmp = place1_128017000_gw_cl_mask.tif
The output should be place1
I could do this:
tmp.split('_')[0] but I also want the solution to work for:
tmp = place1_place2_128017000_gw_cl_mask.tif where the result would be:
place1_place2
You can assume that the number will always be 9 digits long.
Using a regular expression with a lookahead, this is a simple solution:
import re
tmp = "place1_place2_128017000_gw_cl_mask.tif"
m = re.search(r'.+(?=_\d{9}_)', tmp)
print(m.group())
Result:
place1_place2
Note that \d{9} matches exactly 9 digits. The part of the regex inside (?= ... ) is a lookahead: it is not included in the match itself, but the match only succeeds if the lookahead pattern immediately follows it.
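Since re.search returns None when nothing matches, calling m.group() directly would raise an AttributeError on a filename without the 9-digit part. A small sketch of a guarded wrapper (the function name prefix_before_number is just an illustration):

```python
import re

def prefix_before_number(filename):
    """Return the text before '_<9 digits>_', or None if it is absent."""
    # The lookahead checks for the number without making it part of the match.
    m = re.search(r".+(?=_\d{9}_)", filename)
    return m.group() if m else None

print(prefix_before_number("place1_128017000_gw_cl_mask.tif"))
print(prefix_before_number("place1_place2_128017000_gw_cl_mask.tif"))
```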
Assuming we can phrase your problem as wanting the substring up to, but not including, the underscore that is followed by an all-digit segment, we can try:
import re
tmp = "place1_place2_128017000_gw_cl_mask.tif"
m = re.search(r'^([^_]+(?:_[^_]+)*)_\d+_', tmp)
print(m.group(1)) # place1_place2
Use a regular expression:
import re
places = (
    "place1_128017000_gw_cl_mask.tif",
    "place1_place2_128017000_gw_cl_mask.tif",
)
# Raw string, so that \d is passed to the regex engine unchanged.
pattern = re.compile(r"(place\d+(?:_place\d+)*)_\d{9}")
for p in places:
    matched = pattern.match(p)
    if matched:
        print(matched.group(1))
prints:
place1
place1_place2
The regex works like this (adjust as needed, e.g., for fewer than 9 digits or a variable number of digits):
( starts a capture
place\d+ matches "place" followed by one or more digits
(?: starts a group, but does not capture it (no need to capture)
_place\d+ matches an underscore followed by another "place" plus digits
) closes the group
* repeats the previous group zero or more times
) closes the capture
_\d{9} matches an underscore followed by 9 digits
The result is in the first (and only) capture group.
Here's a possible solution without regex (unoptimized!):
def extract(s):
    parts = []
    for x in s.split('_'):
        # Stop at the first chunk that is exactly 9 digits long.
        if len(x) == 9 and x.isdigit():
            return '_'.join(parts)
        parts.append(x)
    return None  # no 9-digit chunk found
tmp = 'place1_128017000_gw_cl_mask.tif'
tmp2 = 'place1_place2_128017000_gw_cl_mask.tif'
print(extract(tmp)) # place1
print(extract(tmp2)) # place1_place2

Can't get regex patterns right

I made a function that replaces multiple instances of a single character with multiple patterns depending on the character location.
There were two ways I found to accomplish this:
This one looks horrible but it works:
def xSubstitution(target_string):
    while target_string.casefold().find('x') != -1:
        x_finded = target_string.casefold().find('x')
        if (x_finded == 0 and target_string[1] == ' ') or (target_string[x_finded-1] == ' ' and
                ((target_string[-1] == 'x' or 'X') or target_string[x_finded+1] == ' ')):
            target_string = target_string.replace(target_string[x_finded], 'ecks', 1)
        elif (target_string[x_finded+1] != ' '):
            target_string = target_string.replace(target_string[x_finded], 'z', 1)
        else:
            target_string = target_string.replace(target_string[x_finded], 'cks', 1)
    return(target_string)
This one technically works, but I just can't get the regex patterns right:
import re
def multipleRegexSubstitutions(sentence):
    patterns = {(r'^[xX]\s'): 'ecks ', (r'[^\w]\s?[xX](?!\w)'): 'ecks',
                (r'[\w][xX]'): 'cks', (r'[\w][xX][\w]'): 'cks',
                (r'^[xX][\w]'): 'z',(r'\s[xX][\w]'): 'z'}
    regexes = [
        re.compile(p)
        for p in patterns
    ]
    for regex in regexes:
        for match in re.finditer(regex, sentence):
            match_location = sentence.casefold().find('x', match.start(), match.end())
            sentence = sentence.replace(sentence[match_location], patterns.get(regex.pattern), 1)
    return sentence
From what I can tell, the only problem in the second function is the regex patterns. Could someone help me?
EDIT: Sorry, I forgot to mention that the regexes look for the different x characters in a string: an X at the beginning of a word is replaced with 'Z', an X in the middle or at the end of a word with 'cks', and a lone 'x' character with 'ecks'.
You need \b (word boundary) and \B (a position that is not a word boundary):
Replace an X at the beginning of a word with 'Z':
re.sub(r'\bX\B', 'Z', s, flags=re.I)
In the middle or at the end of a word, replace it with 'cks':
re.sub(r'\BX', 'cks', s, flags=re.I)
If it is a lone 'x' character, replace it with 'ecks':
re.sub(r'\bX\b', 'ecks', s, flags=re.I)
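The three substitutions above can be chained, since none of the replacement strings contains an x. A quick sketch of how that might look end to end (the wrapper name spell_out_x and the sample sentence are mine, not from the question):

```python
import re

def spell_out_x(s):
    s = re.sub(r"\bX\B", "Z", s, flags=re.I)     # x at the start of a word
    s = re.sub(r"\BX", "cks", s, flags=re.I)     # x in the middle or at the end
    s = re.sub(r"\bX\b", "ecks", s, flags=re.I)  # lone x
    return s

print(spell_out_x("Xavier has an x in taxi and box"))
# Zavier has an ecks in tacksi and bocks
```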
I would just use the following set of substitutions for this:
string = re.sub(r"\b[Xx]\b", "ecks", string)
string = re.sub(r"\b[Xx](?!\s)", "Z", string)
string = re.sub(r"(?<=\w)[Xx](?=\w)", "cks", string)
Here, (?!\s) asserts that the next character is not whitespace, and \b is a word boundary.
Edit: Without the lookbehind, the last regex would also match x or X at the beginning of a word, so we use
(?<=\w)[xX](?=\w)
to make sure there is a word character \w both before and after the x or X. Note that this also means an x at the end of a word is left unchanged.
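As an alternative to running several substitutions in a row, the three cases could be handled in a single pass with one alternation and a replacement function. This is a sketch only, assuming the rules from the question's edit (lone x, x at the start of a word, x anywhere else); the names X_PATTERN and replace_x are hypothetical.

```python
import re

# Alternatives are tried left to right, so the most specific case comes first.
X_PATTERN = re.compile(r"(\bx\b)|(\bx)|x", re.IGNORECASE)

def replace_x(s):
    def repl(m):
        if m.group(1):   # lone x
            return "ecks"
        if m.group(2):   # x at the start of a word
            return "Z"
        return "cks"     # x in the middle or at the end of a word
    return X_PATTERN.sub(repl, s)

print(replace_x("Xavier has an x in taxi and box"))
# Zavier has an ecks in tacksi and bocks
```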

Replace subtext of a word

I want to replace this string
ramesh#gmail.com
to
rxxxxh#gxxxl.com
this is what I have done so far
print( re.sub(r'([A-Za-z](.*)[A-Za-z]#)','x', i))
One way is to use capturing groups and, in the replacement, return the groups that should stay unchanged as they are, while replacing each group that should be masked with a run of x characters of the same length.
For the second and the fourth group, use a negated character class [^...], which matches any character except the listed ones.
\b([A-Za-z])([^#\s]*)([A-Za-z]#[A-Za-z])([^#\s.]*)([A-Za-z])\b
For example
import re
i = "ramesh#gmail.com"
res = re.sub(
    r'\b([A-Za-z])([^#\s]*)([A-Za-z]#[A-Za-z])([^#\s.]*)([A-Za-z])\b',
    lambda x: x.group(1) + "x" * len(x.group(2)) + x.group(3) + "x" * len(x.group(4)) + x.group(5),
    i)
print(res)
Output
rxxxxh#gxxxl.com
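If the long lambda is hard to read, the masking could be split into a small helper. The following is a sketch under the assumption that the input always looks like name#domain.rest; mask_middle and mask_address are illustrative names, and anything that does not fit that layout is returned unchanged.

```python
import re

def mask_middle(part):
    """Keep the first and last character and replace the rest with x."""
    if len(part) <= 2:
        return part
    return part[0] + "x" * (len(part) - 2) + part[-1]

def mask_address(address):
    # Assumed layout: <name>#<domain>.<rest>
    m = re.fullmatch(r"([^#\s]+)#([^#\s.]+)(\.\S+)", address)
    if not m:
        return address
    return mask_middle(m.group(1)) + "#" + mask_middle(m.group(2)) + m.group(3)

print(mask_address("ramesh#gmail.com"))
```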

Finding all occurrences + substrings of a word

I have the 'main' word, "LAUNCHER", and 2 other words, "LAUNCH" and "LAUNCHER". I want to find out (using regex) which words are in the 'main' word. I'm using findall with the regex "(LAUNCH)|(LAUNCHER)", but this only returns LAUNCH and not both of them. How do I fix this?
import re
mainword = "launcher"
words = "(launch|launcher)"
matches = re.findall(words,mainword)
for match in matches:
    print(match)
You can try something like this:
import re
mainword = "launcher"
words = "(launch|launcher)"
for x in (re.findall(r"[A-Za-z##]+|\S", words)):
    if x in mainword:
        print(x)
Result:
launch
launcher
If you're not required to use regular expressions, this can be done more efficiently with the in operator and a simple loop or list comprehension:
mainWord = "launcher"
words = ["launch","launcher"]
matches = [ word for word in words if word in mainWord ]
# case insensitive...
matchWord = mainWord.lower()
matches = [ word for word in words if word.lower() in matchWord ]
Even if you do require regex, a loop is needed because re.findall() only returns non-overlapping matches:
import re
pattern = re.compile("launcher|launch")
mainWord = "launcher"
matches = []
startPos = 0
lastMatch = None
while startPos < len(mainWord):
    if lastMatch:
        match = pattern.match(mainWord, lastMatch.start(), lastMatch.end()-1)
    else:
        match = pattern.match(mainWord, startPos)
    if not match:
        if not lastMatch:
            break
        startPos = lastMatch.start() + 1
        lastMatch = None
        continue
    matches.append(mainWord[match.start():match.end()])
    lastMatch = match
print(matches)
Note that, even with this loop, the longer words must appear before the shorter ones when you use the | operator in the regular expression. This is because alternation is ordered rather than longest-match: the engine tries the alternatives left to right and takes the first one that matches.
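If regex is required but a single combined pattern is not, one simpler option is to search for each candidate word separately, which sidesteps both the overlap and the ordering issue. A minimal sketch (words_found_in is a made-up helper name, and the case-insensitive flag is my assumption based on the upper-case words in the question):

```python
import re

def words_found_in(main_word, candidates):
    """Return the candidates that occur anywhere in main_word."""
    return [
        w for w in candidates
        # re.escape treats each candidate as literal text, not as a pattern.
        if re.search(re.escape(w), main_word, flags=re.IGNORECASE)
    ]

print(words_found_in("LAUNCHER", ["launch", "launcher", "lunch"]))
```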

How to perform a transform on a matched group in a substitution in python [duplicate]

I tried to combine these steps into one, but it doesn't work:
w = re.sub('[0-9]', r'9', w)
w = re.sub('[A-Z]', r'X', w)
w = re.sub('[a-z]', r'x', w)
Does anybody know how to turn strings such as XXxxxx999 into Xx9?
You may use a callback method as a replacement argument like this:
import re
rx = r'([0-9]+)|([A-Z]+)|[a-z]+'
w = "XXxxxx999"
def repl(m):
    if m.group(1):    # if ([0-9]+) matched
        return '9'    # replace with 9
    elif m.group(2):  # if ([A-Z]+) matched
        return 'X'    # replace with X
    else:             # if [a-z]+ matched
        return 'x'    # replace with x
print(re.sub(rx, repl, w)) # => Xx9
The regex matches:
([0-9]+) - Group 1: 1+ digits
| - or
([A-Z]+) - Group 2: 1+ uppercase letters
| - or
[a-z]+ - 1+ lowercase letters.
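A variation on the same callback idea is to name the groups and look the replacement up in a dict via the match object's lastgroup attribute, which avoids the if/elif chain as more character classes are added. This is only a sketch; the names SHAPES, RX and shape are illustrative.

```python
import re

SHAPES = {"digit": "9", "upper": "X", "lower": "x"}
RX = re.compile(r"(?P<digit>[0-9]+)|(?P<upper>[A-Z]+)|(?P<lower>[a-z]+)")

def shape(text):
    # Only one named group takes part in each match, so m.lastgroup
    # is the name of the alternative that matched.
    return RX.sub(lambda m: SHAPES[m.lastgroup], text)

print(shape("XXxxxx999"))  # Xx9
```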
