Regex with customized word boundaries in Python - python

I'm using a function called findlist to return a list of all the positions of a certain string within a text, with regex to look for word boundaries. But I want to ignore the character ( and only consider the other word boundaries, so that it will find split in var split but not in split(a). Is there any way to do this?
import re
def findlist(input, place):
    return [m.span() for m in re.finditer(input, place)]
str = '''
var a = 'a b c'
var split = a.split(' ')
'''
instances = findlist(r"\b%s\b" % ('split'), str)
print(instances)

You may check if there is a ( after the trailing word boundary with a negative lookahead (?!\():
instances = findlist(r"\b{}\b(?!\()".format('split'), s)
The (?!\() lookahead is checked right after the whole word has been matched; if there is a ( immediately to the right of the word, the match fails.
See the Python demo:
import re
def findlist(input_data, place):
    return [m.span() for m in re.finditer(input_data, place)]
s = '''
var a = 'a b c'
var split = a.split(' ')
'''
instances = findlist(r"\b{}\b(?!\()".format('split'), s)
print(instances) # => [(21, 26)]
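The same idea can be wrapped in a small reusable helper. This is only a sketch: the name find_word_spans and its excluded_followers parameter are made up for illustration, and re.escape is used on the assumption that the searched word might contain regex metacharacters.

```python
import re

def find_word_spans(word, text, excluded_followers='('):
    """Return spans of `word` as a whole word, skipping hits that are
    immediately followed by any character in `excluded_followers`.

    Hypothetical helper; `excluded_followers` must be a non-empty string.
    """
    pattern = r'\b{}\b(?![{}])'.format(re.escape(word), re.escape(excluded_followers))
    return [m.span() for m in re.finditer(pattern, text)]

s = '''
var a = 'a b c'
var split = a.split(' ')
'''
print(find_word_spans('split', s))  # [(21, 26)]
```

Putting the excluded characters in a character class keeps the lookahead short when more than one follower should be ignored.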

Related

efficient way to split multi-word hashtag in python

Given a text like
THIS is a #hashtag and this is a #multiWordHashtag
I need to output
THIS is a hashtag and this is a multi Word Hashtag
For now, I use this function:
def do_process_eng_hashtag(input_text: str):
    result = []
    for word in input_text.split():
        if word.startswith('#') and len(word) > 1:
            word = list(word)
            word[1] = word[1].upper()
            word = ''.join(word)
            word = ' '.join(re.findall('[A-Z][^A-Z]*', word))
        result.append(word)
    return ' '.join(result)
But I wonder if there is a more efficient and neat way to do so?
Using re.sub:
You can specify a replacement function:
def do_process_eng_hashtag(input_text: str) -> str:
    return re.sub(
        r'#[a-z]\S*',
        lambda m: ' '.join(re.findall('[A-Z][^A-Z]*|[a-z][^A-Z]*', m.group().lstrip('#'))),
        input_text,
    )
The replacement function (the lambda) splits each hashtag into multiple words:
>>> re.findall('[A-Z][^A-Z]*|[a-z][^A-Z]*', '#multiWordHashtag'.lstrip('#'))
['multi', 'Word', 'Hashtag']
>>> do_process_eng_hashtag('THIS is a #hashtag and this is a #multiWordHashtag')
'THIS is a hashtag and this is a multi Word Hashtag'
You can use a function with re.sub like so:
import re
example = 'THIS is a #hashtag and this is a #multiWordHashtag'
def rep(m):
    s = m.group(1)
    return ' '.join(re.split(r'(?=[A-Z])', s))
>>> re.sub(r'#(\w+)', rep, example)
'THIS is a hashtag and this is a multi Word Hashtag'
Works like this:
re.sub(r'#(\w+)', rep, example) calls the rep function with a match object for each hashtag.
The rep function then uses a lookahead to split the string before each capital letter:
>>> re.split(r'(?=[A-Z])','multiWordHashtag')
['multi', 'Word', 'Hashtag']
>>> re.split(r'(?=[A-Z])','hashtag')
['hashtag']
The ' '.join() adds space delimiters. If there are no capitals (i.e., the argument to join is a list of length 1), just the string is returned.
You can modify the regex in re.sub(r'#(\w+)', rep, example) to whatever YOU consider a hashtag. Perhaps re.sub(r'#([a-zA-Z]+)', rep, example)?
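To make the difference between the two hashtag patterns concrete, here is a small sketch. The sample text is invented for illustration, and splitting on the zero-width lookahead assumes Python 3.7 or later.

```python
import re

def rep(m):
    # Split the captured tag text before every capital letter.
    return ' '.join(re.split(r'(?=[A-Z])', m.group(1)))

text = 'see #multiWordHashtag and #2fast'
print(re.sub(r'#(\w+)', rep, text))        # see multi Word Hashtag and 2fast
print(re.sub(r'#([a-zA-Z]+)', rep, text))  # see multi Word Hashtag and #2fast
```

With the letters-only pattern, a tag that starts with a digit is not treated as a hashtag and is left untouched.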
Alternatively, you can combine Python splitting with the same regex to detect upper case:
def word_func(s):
    return ' '.join(re.split(r'(?=[A-Z])', s[1:]))
' '.join([word_func(s) if s.startswith('#') else s for s in example.split()])
# same output

How can we remove word with repeated single character?

I am trying to collapse words that consist of a single repeated character using regex in Python, for example:
good => good
gggggggg => g
What I have tried so far is the following:
re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
The problem with the above solution is that it changes good to god, and I only want to collapse words that consist of a single repeated character.
A better approach here is to use a set
def modify(s):
    # Create a set from the string
    c = set(s)
    # If the set holds only one character, return it as a string
    if len(c) == 1:
        return ''.join(c)
    # Else return the original string
    else:
        return s
print(modify('good'))
print(modify('gggggggg'))
If you want to use regex, mark the start and end of the string in the regex with ^ and $ (inspired by @bobblebubble's comment)
import re
def modify(s):
    # Substitute with a regex that only matches if a single character is repeated,
    # anchored at the start and end of the string
    out = re.sub(r'^([a-z])\1+$', r'\1', s)
    return out
print(modify('good'))
print(modify('gggggggg'))
The output will be
good
g
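If the words sit inside a longer text rather than in separate strings, the same backreference can be bounded with \b instead of ^ and $. This is a sketch under that assumption; the function name collapse_repeated_words and the sample sentence are invented for illustration.

```python
import re

def collapse_repeated_words(text):
    # \b on both sides limits the match to a whole word made of one repeated letter.
    return re.sub(r'\b([a-z])\1+\b', r'\1', text)

print(collapse_repeated_words('good gggggggg food zz'))  # good g food z
```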
If you do not want to use a set in your method, this should do the trick:
def simplify(s):
    l = len(s)
    if l>1 and s.count(s[0]) == l:
        return s[0]
    return s
print(simplify('good'))
print(simplify('abba'))
print(simplify('ggggg'))
print(simplify('g'))
print(simplify(''))
output:
good
abba
g
g
Explanations:
Compute the length of the string.
Count the characters that are equal to the first one and compare that count with the string length.
Depending on the result, return the first character or the whole string.
You can use the Trim method:
Take a look at this example:
"ggggggg".Trim('g'); // note: Trim strips every leading and trailing g, so this particular string comes back empty
Update:
For characters in the middle of the string, use this function, thanks to this answer
in C#:
public static string RemoveDuplicates(string input)
{
    return new string(input.ToCharArray().Distinct().ToArray());
}
in python:
used = set()
unique = [x for x in mylist if x not in used and (used.add(x) or True)]
However, I think none of these answers handles a case like aaaaabbbbbcda: the string ends with an a that does not appear in the result (abcd). For this kind of situation, use this function that I wrote:
In:
def unique(s):
    used = set()
    ret = list()
    s = list(s)
    for x in s:
        if x not in used:
            ret.append(x)
            used = set()
            used.add(x)
    return ret
print(unique('aaaaabbbbbcda'))
out:
['a', 'b', 'c', 'd', 'a']
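Collapsing runs of equal consecutive characters can also be sketched with itertools.groupby from the standard library. The function name collapse_runs is made up for illustration; this is an alternative technique, not the code used in the answer above.

```python
from itertools import groupby

def collapse_runs(s):
    # groupby yields one (key, group) pair per run of equal consecutive items.
    return [key for key, _ in groupby(s)]

print(collapse_runs('aaaaabbbbbcda'))  # ['a', 'b', 'c', 'd', 'a']
```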

In python, how to 'if finditer(...) has no matches'?

I would like to do something when finditer() does not find anything.
import re
pattern = "1"
string = "abc"
matched_iter = re.finditer(pattern, string)
# <if matched_iter is empty (no match found)>:
#     do something.
# else:
for m in matched_iter:
    print(m.group())
The best thing I could come up with is to keep track of found manually:
mi_no_find = re.finditer(r'\w+',"$$%%%%") # not matching.
found = False
for m in mi_no_find:
    print(m.group())
    found = True
if not found:
    print("Nothing found")
Related posts that don't answer:
Counting finditer matches: Number of regex matches (I don't need to count, I just need to know if there are no matches).
finditer vs match: different behavior when using re.finditer and re.match (says you always have to loop over the iterator returned by finditer)
[edit]
- I have no interest in enumerating or counting the total output. I only need found/not-found actions.
- I understand I can put the finditer results into a list, but this would be inefficient for large strings. One objective is to have low memory utilization.
Updated 04/10/2020
Use re.search(pattern, string) to check if a pattern exists.
pattern = "1"
string = "abc"
if re.search(pattern, string) is None:
    print('do this because nothing was found')
Returns:
do this because nothing was found
If you want to iterate over the matches, place the re.finditer() loop inside the re.search() check.
pattern = '[A-Za-z]'
string = "abc"
if re.search(pattern, string) is not None:
    for thing in re.finditer(pattern, string):
        print('Found this thing: ' + thing[0])
Returns:
Found this thing: a
Found this thing: b
Found this thing: c
Therefore, if you wanted both options, use the else: clause with the if re.search() conditional.
pattern = "1"
string = "abc"
if re.search(pattern, string) is not None:
    for thing in re.finditer(pattern, string):
        print('Found this thing: ' + thing[0])
else:
    print('do this because nothing was found')
Returns:
do this because nothing was found
Previous reply below (not sufficient; just read the above)
If .finditer() does not match the pattern, none of the commands inside the related loop are executed.
So:
Set the variable before the loop that iterates over the regex results.
Read the variable after (and outside of) that loop.
This way, if the regex call returns nothing, the loop won't execute and the variable read after the loop still holds the value it was set to.
Below, example 1 demonstrates the regex finding the pattern. Example 2 shows the regex not finding the pattern, so the variable within the loop is never set.
Example 3 shows my suggestion: the variable is set before the regex loop, so if the regex does not find a match (and therefore never enters the loop), the variable read after the loop still holds its initial value, confirming that the pattern was not found.
Remember to import the re module.
EXAMPLE 1 (Searching for the characters 'he' in the string 'hello world' will return 'he')
my_string = 'hello world'
pat = '(he)'
regex = re.finditer(pat,my_string)
for a in regex:
    b = str(a.groups()[0])
    print(b)
# returns 'he'
EXAMPLE 2 (Searching for the characters 'ab' in the string 'hello world' does not match anything, so the 'for a in regex:' loop does not execute and never assigns the b variable a value.)
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
for a in regex:
    b = str(a.groups()[0])
    print(b)
# no return
EXAMPLE 3 (Searching for the characters 'ab' again, but this time setting the variable b to 'CAKE' before the loop, and calling the variable b after, outside of the loop returns the initial variable - i.e. 'CAKE' - since the loop did not execute).
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
b = 'CAKE' # sets the variable prior to the for loop
for a in regex:
    b = str(a.groups()[0])
print(b) # calls the variable after (and outside) the loop
# returns 'CAKE'
It's also worth noting that these examples read a.groups()[0], so the pattern must use parentheses to mark the start and end of a capturing group; without a group you would read a.group() instead.
pattern = '(ab)' # use this with a.groups()[0]
pattern = 'ab' # avoid this with a.groups()[0]; it has no group to read
To tie back to the initial question:
Since finding nothing means the for loop (for a in regex) never executes, the user can preload the variable and then check after the loop whether it still holds the original value. This lets the user know that nothing was found.
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
b = 'CAKE' # sets the variable prior to the for loop
for a in regex:
    b = str(a.groups()[0])
if b == 'CAKE':
    # action taken if nothing is returned
    pass
If performance isn't an issue, simply use findall or list(finditer(...)), which returns a list.
Otherwise, you can "peek" into the iterator with next, then loop as normal unless it raises StopIteration. Though there are other ways to do it, this is the simplest to me:
import itertools
import re
pattern = "1"
string = "abc"
matched_iter = re.finditer(pattern, string)
try:
    first_match = next(matched_iter)
except StopIteration:
    print("No match!") # action for no match
else:
    for m in itertools.chain([first_match], matched_iter):
        print(m.group())
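The peek-and-chain idea can be sketched as a small helper so the check is written once. The name matches_or_none is hypothetical, and passing a default to next() is used here in place of catching StopIteration; memory use stays low because the matches are still produced lazily.

```python
import itertools
import re

def matches_or_none(pattern, string):
    """Return an iterator over all matches, or None if there are none."""
    matches = re.finditer(pattern, string)
    first = next(matches, None)  # the default avoids StopIteration
    if first is None:
        return None
    # Put the consumed first match back in front of the remaining ones.
    return itertools.chain([first], matches)

found = matches_or_none(r'\w+', '$$%%%%')
if found is None:
    print('Nothing found')
else:
    for m in found:
        print(m.group())
```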
You can probe the iterator with next and then chain the results back together, catching StopIteration, which means the iterator was empty:
import itertools as it
matches = iter([])
try:
    probe = next(matches)
except StopIteration:
    print('empty')
else:
    for m in it.chain([probe], matches):
        print(m)
Regarding your solution you could check m directly, setting it to None beforehand:
matches = iter([])
m = None
for m in matches:
    print(m)
if m is None:
    print('empty')
This returns the original string if it contains fewer than n matches.
Otherwise it replaces the nth match in the string.
For more reference: https://docs.python.org/2/howto/regex.html
import re
Input_Str = "FOOTBALL"
def replacing(Input_String, char_2_replace, replaced_char, n):
    pattern = re.compile(char_2_replace)
    if len(re.findall(pattern, Input_String)) >= n:
        where = [m for m in pattern.finditer(Input_String)][n-1]
        before = Input_String[:where.start()]
        after = Input_String[where.end():]
        newString = before + replaced_char + after
    else:
        newString = Input_String
    return newString
print(replacing(Input_Str, 'L', 'X', 4))
I know this answer is late, but it is very suitable for Python 3.8+
You can use the new walrus operator := along with next(iterator[, default]) to handle 'no matches' in re.finditer(pattern, string, flags=0), somewhat like this:
import re
pattern_ = "1"
string_ = "abc"
def is_match():
    was_found = False
    matches = re.finditer(pattern_, string_) # create the iterator once, outside the loop
    while (match := next(matches, None)) is not None:
        was_found = True
        yield match.group() # or just print it
    return was_found

Combinatorial product of regex substitutions

I am trying to produce string variants by applying substitutions optionally.
For example, one substitution scheme is removing any sequence of blank characters.
Rather than replacing all occurrences like
>>> re.sub(r'\s+', '', 'a b c')
'abc'
I need instead two variants for each occurrence: one in which the substitution is performed and one in which it is not.
For the string 'a b c'
I want to have the variants
['a b c', 'a bc', 'ab c', 'abc']
i.e. the cross product of all binary decisions (the result obviously includes the original string).
For this case, the variants can be produced using re.finditer and itertools.product:
import itertools
import re
def vary(target, pattern, subst):
    occurrences = [m.span() for m in pattern.finditer(target)]
    for path in itertools.product((True, False), repeat=len(occurrences)):
        variant = ''
        anchor = 0
        for (start, end), apply_this in zip(occurrences, path):
            if apply_this:
                variant += target[anchor:start] + subst
                anchor = end
        variant += target[anchor:]
        yield variant
This produces the desired output for the above example:
>>> list(vary('a b c', re.compile(r'\s+'), ''))
['abc', 'ab c', 'a bc', 'a b c']
However, this solution only works for fixed-string replacements.
Advanced features from re.sub like group references can't be done like that,
as in the following example for inserting a space after a sequence of digits inside a word:
re.sub(r'\B(\d+)\B', r'\1 ', 'abc123def')
How can the approach be extended or changed to accept any valid argument to re.sub
(without writing a parser for interpreting group references)?
Thinking about making subst a callable that gets access to match data finally made me learn about MatchObject.expand. So, as an approximation, with subst staying a raw string,
def vary(target, pattern, subst):
    matches = [m for m in pattern.finditer(target)]
    occurrences = [m.span() for m in matches]
    for path in itertools.product((True, False), repeat=len(occurrences)):
        variant = ''
        anchor = 0
        for match, (start, end), apply_this in zip(matches, occurrences, path):
            if apply_this:
                variant += target[anchor:start] + match.expand(subst)
                anchor = end
        variant += target[anchor:]
        yield variant
I am not sure, though, that this covers all the needed flexibility in referring to the subject string, being bound to the corresponding match. An indexed power set of the split string came to mind, but I guess that's not far from the parser mentioned.
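As a sketch of the callable idea mentioned above, the generator can accept either a template string or a function, mirroring the two kinds of repl that re.sub takes. The callable branch is an assumption added for illustration and is not part of the original answer.

```python
import itertools
import re

def vary(target, pattern, subst):
    """Yield every variant of target with each match either replaced or kept."""
    matches = list(pattern.finditer(target))
    for path in itertools.product((True, False), repeat=len(matches)):
        parts = []
        anchor = 0
        for match, apply_this in zip(matches, path):
            if apply_this:
                parts.append(target[anchor:match.start()])
                # A callable receives the match object; a string is expanded as a template.
                parts.append(subst(match) if callable(subst) else match.expand(subst))
                anchor = match.end()
        parts.append(target[anchor:])
        yield ''.join(parts)

print(list(vary('abc123def', re.compile(r'\B(\d+)\B'), r'\1 ')))
# ['abc123 def', 'abc123def']
print(list(vary('a1b', re.compile(r'[a-z]'), lambda m: m.group().upper())))
# ['A1B', 'A1b', 'a1B', 'a1b']
```

Because every replacement is computed from a match against the original string, a substitution can never be re-matched by a later step.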
How about this:
def vary(target, pattern, subst):
    numOccurences = len(pattern.findall(target))
    for path in itertools.product((True, False), repeat=numOccurences):
        variant = ''
        remainingStr = target
        for currentFlag in path:
            if currentFlag:
                remainingStr = pattern.sub(subst, remainingStr, 1)
            else:
                currentMatch = pattern.search(remainingStr)
                variant += remainingStr[:currentMatch.end()]
                remainingStr = remainingStr[currentMatch.end():]
        variant += remainingStr
        yield variant
For each match, we either let re.sub() do its job (with a count of 1 to stop after one substitution), or we snatch away the unchanged portion of the string.
Trying it out with your examples like this
target = 'a b c'
pattern = re.compile(r'\s+')
subst = ''
print(list(vary(target, pattern, subst)))
target = 'abc123def'
pattern = re.compile(r'\B(\d+)\B')
subst = r'\1 '
print(list(vary(target, pattern, subst)))
I get
['abc', 'ab c', 'a bc', 'a b c']
['abc123 def', 'abc123def']

How to find and replace nth occurrence of word in a sentence using python regular expression?

Using python regular expression only, how to find and replace nth occurrence of word in a sentence?
For example:
str = 'cat goose mouse horse pig cat cow'
new_str = re.sub(r'cat', r'Bull', str)
new_str = re.sub(r'cat', r'Bull', str, 1)
new_str = re.sub(r'cat', r'Bull', str, 2)
In the sentence above, the word 'cat' appears two times. I want the 2nd occurrence of 'cat' to be changed to 'Bull', leaving the 1st 'cat' untouched. My final sentence would look like:
"cat goose mouse horse pig Bull cow". In my code above I tried 3 different calls and could not get what I wanted.
Use a negative lookahead like below.
>>> s = "cat goose mouse horse pig cat cow"
>>> re.sub(r'^((?:(?!cat).)*cat(?:(?!cat).)*)cat', r'\1Bull', s)
'cat goose mouse horse pig Bull cow'
^ asserts that we are at the start of the string.
(?:(?!cat).)* matches any character, zero or more times, as long as the substring cat does not start at that character.
cat matches the first cat substring.
(?:(?!cat).)* again matches any run of characters that does not contain cat.
Now enclose all of these patterns in a capturing group, ((?:(?!cat).)*cat(?:(?!cat).)*), so that the captured characters can be referred to later.
cat then matches the second cat substring.
OR
>>> s = "cat goose mouse horse pig cat cow"
>>> re.sub(r'^(.*?(cat.*?){1})cat', r'\1Bull', s)
'cat goose mouse horse pig Bull cow'
Change the number inside the {} to choose which occurrence of the string cat is replaced: with n-1 inside the braces, the nth occurrence is replaced.
To replace the third occurrence of the string cat, put 2 inside the curly braces:
>>> re.sub(r'^(.*?(cat.*?){2})cat', r'\1Bull', "cat goose mouse horse pig cat foo cat cow")
'cat goose mouse horse pig cat foo Bull cow'
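The {n-1} pattern can be sketched as a function that builds the regex for any literal word. The name replace_nth is hypothetical; re.escape and the function replacement are additions meant to keep special characters in the word or in the replacement from being interpreted.

```python
import re

def replace_nth(text, word, replacement, n):
    """Replace the n-th (1-based) occurrence of the literal `word` in `text`."""
    w = re.escape(word)
    # Group 1 holds everything up to, but not including, the n-th occurrence.
    pattern = r'^((?:.*?%s){%d}.*?)%s' % (w, n - 1, w)
    return re.sub(pattern, lambda m: m.group(1) + replacement, text,
                  count=1, flags=re.DOTALL)

s = 'cat goose mouse horse pig cat cow'
print(replace_nth(s, 'cat', 'Bull', 2))  # cat goose mouse horse pig Bull cow
```

If the text has fewer than n occurrences, the pattern does not match and the text comes back unchanged.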
I use a simple function that lists all occurrences, picks the nth one's position, and uses it to split the original string into two substrings. It then replaces the first occurrence in the second substring and joins the substrings back into the new string:
import re
def replacenth(string, sub, wanted, n):
    where = [m.start() for m in re.finditer(sub, string)][n-1]
    before = string[:where]
    after = string[where:]
    newString = before + after.replace(sub, wanted, 1)
    print(newString)
For these variables:
string = 'ababababababababab'
sub = 'ab'
wanted = 'CD'
n = 5
outputs:
ababababCDabababab
Notes:
The list built for where holds the positions of all matches, and the nth one is picked from it. List indices usually start at 0, not 1, which is why the index is n-1 while n itself is the ordinal of the occurrence. My example replaces the 5th occurrence. If you indexed with n directly, you would need n to be 4 to reach the 5th occurrence. Which convention you use usually depends on the function that generates n.
This should be the simplest way, but it isn't regex-only as you originally wanted.
Sources and some links in addition:
where construction: How to find all occurrences of a substring?
string splitting: https://www.daniweb.com/programming/software-development/threads/452362/replace-nth-occurrence-of-any-sub-string-in-a-string
similar question: Find the nth occurrence of substring in a string
Here's a way to do it without a regex:
def replaceNth(s, source, target, n):
    inds = [i for i in range(len(s) - len(source)+1) if s[i:i+len(source)]==source]
    if len(inds) < n:
        return # or maybe raise an error
    s = list(s) # can't assign to string slices. So, let's listify
    s[inds[n-1]:inds[n-1]+len(source)] = target # do n-1 because we start from the first occurrence of the string, not the 0-th
    return ''.join(s)
Usage:
In [278]: s
Out[278]: 'cat goose mouse horse pig cat cow'
In [279]: replaceNth(s, 'cat', 'Bull', 2)
Out[279]: 'cat goose mouse horse pig Bull cow'
In [280]: print(replaceNth(s, 'cat', 'Bull', 3))
None
I would define a function that will work for every regex:
import re
def replace_ith_instance(string, pattern, new_str, i = None, pattern_flags = 0):
    # If i is None - replacing last occurrence
    match_obj = re.finditer(r'{0}'.format(pattern), string, flags = pattern_flags)
    matches = [item for item in match_obj]
    if i is None:
        i = len(matches)
    if len(matches) == 0 or len(matches) < i:
        return string
    match = matches[i - 1]
    match_start_index = match.start()
    match_len = len(match.group())
    return '{0}{1}{2}'.format(string[0:match_start_index], new_str, string[match_start_index + match_len:])
A working example:
str = 'cat goose mouse horse pig cat cow'
ns = replace_ith_instance(str, 'cat', 'Bull', 2)
print(ns)
The output:
cat goose mouse horse pig Bull cow
Another example:
str2 = 'abc abc def abc abc'
ns = replace_ith_instance(str2, r'abc\s*abc', '666')
print(ns)
The output:
abc abc def 666
How to replace the nth needle with word:
s.replace(needle,'$$$',n-1).replace(needle,word,1).replace('$$$',needle)
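The one-liner above can be sketched as a function. The name replace_nth_plain and the NUL placeholder are assumptions made for illustration; the trick only works if the placeholder never occurs in the text, so the sketch checks for that first.

```python
def replace_nth_plain(s, needle, word, n, placeholder='\x00'):
    """Replace the n-th (1-based) occurrence of needle using only str.replace."""
    if placeholder in s:
        raise ValueError('placeholder already occurs in the input string')
    masked = s.replace(needle, placeholder, n - 1)  # hide the first n-1 needles
    replaced = masked.replace(needle, word, 1)      # the next needle is the n-th one
    return replaced.replace(placeholder, needle)    # restore the hidden needles

s = 'cat goose mouse horse pig cat cow'
print(replace_nth_plain(s, 'cat', 'Bull', 2))  # cat goose mouse horse pig Bull cow
```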
You can match the two occurrences of "cat", keep everything before the second occurrence (\1) and add "Bull":
new_str = re.sub(r'(cat.*?)cat', r'\1Bull', str, 1)
We do only one substitution to avoid also replacing the fourth, sixth, etc. occurrence of "cat" (when there are at least four occurrences), as pointed out in Avinash Raj's comment.
If you want to replace the n-th occurrence and not the second, use the following (the outer group keeps all of the text before the n-th occurrence, since a repeated group only captures its last repetition):
n = 2
new_str = re.sub('((?:cat.*?){%d})cat' % (n - 1), r'\1Bull', str, 1)
BTW you should not use str as a variable name, since it shadows Python's built-in str type.
Create a repl function to pass into re.sub(). Except... the trick is to make it a class so you can track the call count.
class ReplWrapper(object):
    def __init__(self, replacement, occurrence):
        self.count = 0
        self.replacement = replacement
        self.occurrence = occurrence
    def repl(self, match):
        self.count += 1
        if self.occurrence == 0 or self.occurrence == self.count:
            return match.expand(self.replacement)
        else:
            # leave every other occurrence unchanged
            return match.group(0)
Then use it like this:
myrepl = ReplWrapper(r'Bull', 0) # replaces all instances in a string
new_str = re.sub(r'cat', myrepl.repl, str)
myrepl = ReplWrapper(r'Bull', 1) # replaces 1st instance in a string
new_str = re.sub(r'cat', myrepl.repl, str)
myrepl = ReplWrapper(r'Bull', 2) # replaces 2nd instance in a string
new_str = re.sub(r'cat', myrepl.repl, str)
I'm sure there is a more clever way to avoid using a class, but this seemed straightforward enough to explain. Also, be sure to return match.expand(), as just returning the replacement value is not technically correct if someone decides to use \1 type templates.
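One possible way to avoid the class is a closure around itertools.count. This is only a sketch: the name nth_repl is made up, and unlike ReplWrapper it does not treat an occurrence of 0 as "replace everything".

```python
import itertools
import re

def nth_repl(replacement, occurrence):
    """Build a repl function that expands `replacement` only on the n-th call."""
    calls = itertools.count(1)  # counts the matches seen so far
    def repl(match):
        if next(calls) == occurrence:
            return match.expand(replacement)
        return match.group(0)   # leave every other match unchanged
    return repl

text = 'cat goose mouse horse pig cat cow'
print(re.sub(r'cat', nth_repl('Bull', 2), text))  # cat goose mouse horse pig Bull cow
```

A fresh repl function is needed for each re.sub call, because the counter is not reset between calls.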
I approached this by generating a 'grouped' version of the desired catch pattern relative to the entire string, then applying the sub directly to that instance.
The parent function is regex_n_sub, and collects the same inputs as the re.sub() method.
The catch pattern is passed to get_nsubcatch_catch_pattern() with the instance number. Inside, a list comprehension generates n copies of the pattern '.*?' (match any character, 0 or more repetitions, non-greedy). This pattern represents the space between the occurrences of the catch_pattern that come before the nth one.
Next, the input catch_pattern is placed between each pair of these 'space patterns', and the result is wrapped in parentheses to form the first group.
The second group is just the catch_pattern wrapped in parentheses, so when the two groups are combined, the result is a pattern for 'all of the text up to the nth occurrence of the catch pattern', followed by that occurrence. This new_catch_pattern has two groups built in, so the second group, containing the nth occurrence of the catch_pattern, can be substituted.
The replace pattern is passed to get_nsubcatch_replace_pattern() and combined with the prefix r'\g<1>', forming the pattern \g<1> + replace_pattern. The \g<1> part re-inserts group 1 of the catch pattern unchanged, and the replace pattern that follows it takes the place of group 2, the nth occurrence.
The code below is verbose only for a clearer understanding of the process flow; it can be reduced as desired.
--
The example below should run stand-alone, and corrects the 4th instance of "I" to "me":
"When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."
with
"When I go to the park and I am alone I think the ducks laugh at me but I'm not sure."
import regex as re  # third-party regex package; the standard-library re module also works for this code
def regex_n_sub(catch_pattern, replace_pattern, input_string, n, flags=0):
    new_catch_pattern, new_replace_pattern = generate_n_sub_patterns(catch_pattern, replace_pattern, n)
    return_string = re.sub(new_catch_pattern, new_replace_pattern, input_string, 1, flags)
    return return_string
def generate_n_sub_patterns(catch_pattern, replace_pattern, n):
    new_catch_pattern = get_nsubcatch_catch_pattern(catch_pattern, n)
    new_replace_pattern = get_nsubcatch_replace_pattern(replace_pattern, n)
    return new_catch_pattern, new_replace_pattern
def get_nsubcatch_catch_pattern(catch_pattern, n):
    space_string = '.*?'
    space_list = [space_string for i in range(n)]
    first_group = catch_pattern.join(space_list)
    first_group = first_group.join('()')
    second_group = catch_pattern.join('()')
    new_catch_pattern = first_group + second_group
    return new_catch_pattern
def get_nsubcatch_replace_pattern(replace_pattern, n):
    new_replace_pattern = r'\g<1>' + replace_pattern
    return new_replace_pattern
### use test ###
catch_pattern = 'I'
replace_pattern = 'me'
test_string = "When I go to the park and I am alone I think the ducks laugh at I but I'm not sure."
regex_n_sub(catch_pattern, replace_pattern, test_string, 4)
This code can be copied directly into a workflow; the regex_n_sub() call returns the modified string.
Please let me know if implementation fails!
Thanks!
None of the current answers fitted what I needed, so here is one based on aleskva's:
import re
def replacenth(string, pattern, replacement, n):
    assert n != 0
    matches = list(re.finditer(pattern, string))
    if len(matches) < abs(n):
        return string
    m = matches[n-1 if n > 0 else len(matches) + n]
    return string[0:m.start()] + replacement + string[m.end():]
It accepts negative match numbers (n = -1 replaces the last match) and any regex pattern, and it's efficient. If there are fewer than n matches, the original string is returned.
