I'm trying to match a whole word exactly using a regex, where the word I'm searching for is a variable coming from user input. I've tried this:
regex = r"\b(?=\w)" + re.escape(user_input) + r"\b"
if re.match(regex, string_to_search[i], re.IGNORECASE):
<some code>...
but it matches every occurrence of the string. It matches "var"->"var" which is correct but also matches "var"->"var"iable and I only want it to match "var"->"var" or "string"->"string"
Input: "sword"
String_to_search = "There once was a swordsmith that made a sword"
Desired output: Match "sword" to "sword" and not "swordsmith"
It seems you want a pattern that matches an entire string. Note that the \b word boundary is needed when you want to find partial matches. When you need a full string match, you need anchors. Since re.match anchors the match at the start of the string, all you need is $ (end of string position) at the end of the pattern:
regex = '{}$'.format(re.escape(user_input))
and then use
re.match(regex, search_string, re.IGNORECASE)
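If a match on the entire string is really what is wanted, re.fullmatch (available since Python 3.4) expresses that without a hand-written anchor. A minimal sketch; the helper name is_exact is made up for illustration and is not from the answer above:

```python
import re

def is_exact(user_input, candidate):
    # is_exact is a hypothetical helper name used only for this sketch.
    # re.fullmatch succeeds only if the whole candidate matches the pattern.
    return re.fullmatch(re.escape(user_input), candidate, re.IGNORECASE) is not None

print(is_exact("sword", "Sword"))       # True
print(is_exact("sword", "swordsmith"))  # False
```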
You can try re.finditer like that:
>>> import re
>>> user_input = "var"
>>> text = "var variable var variable"
>>> regex = r"(?=\b%s\b)" % re.escape(user_input)
>>> [m.start() for m in re.finditer(regex, text)]
[0, 13]
It'll find all matches iteratively.
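Applied to the sentence from the question, the same idea might look like the sketch below. The function name find_whole_word is an assumption for illustration. It uses (?<!\w) and (?!\w) instead of \b, which behaves the same for ordinary words but also works when the user input starts or ends with a non-word character such as "c++":

```python
import re

def find_whole_word(word, text):
    # find_whole_word is a hypothetical name used for illustration.
    # The lookarounds require that no word character touches the match
    # on either side, so "sword" inside "swordsmith" is rejected.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return [m.span() for m in re.finditer(pattern, text, re.IGNORECASE)]

text = "There once was a swordsmith that made a sword"
print(find_whole_word("sword", text))  # [(40, 45)]
```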
The problem is simple: I'm given a random string and a random pattern, and I'm told to get all the possible combinations of that pattern that occur in the string and mark them with [target] and [endtarget] at the beginning and end.
For example:
given the following text: "XuyZB8we4"
and the following pattern: "XYZAB"
The expected output would be: "[target]X[endtarget]uy[target]ZB[endtarget]8we4".
I already got the part that identifies all the words, but I can't find a way of placing the [target] and [endtarget] strings before and after each match (called match in the code).
import re
def tagger(text, search):
    place_s = "[target]"
    place_f = "[endtarget]"
    pattern = re.compile(rf"[{search}]+")
    matches = pattern.finditer(text)
    for match in matches:
        print(match)
    return test_string
test_string = "alsikjuyZB8we4 aBBe8XAZ piarBq8 Bq84Z "
pattern = "XYZAB"
print(tagger(test_string, pattern))
I also tried the for with the sub method, but I couldn't get it to work.
for match in matches:
    re.sub(match.group(0), place_s + match.group(0) + place_f, text)
return text
re.sub allows you to use backreferences to matched groups in the replacement string. So you need to enclose your pattern in parentheses (or create a named group), and then it will replace all matches in the entire string at once with your desired replacement:
In [10]: re.sub(r'([XYZAB]+)', r'[target]\1[endtarget]', test_string)
Out[10]: 'alsikjuy[target]ZB[endtarget]8we4 a[target]BB[endtarget]e8[target]XAZ[endtarget] piar[target]B[endtarget]q8 [target]B[endtarget]q84[target]Z[endtarget] '
With this approach, re.finditer is not needed at all.
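Folded back into the tagger function from the question, this could look like the following sketch. It assumes search is a non-empty string of characters to look for; the start and end parameters are added here for illustration only. The characters are passed through re.escape so that input like ] or ^ cannot change the meaning of the character class:

```python
import re

def tagger(text, search, start="[target]", end="[endtarget]"):
    # Assumption: `search` is a non-empty string of candidate characters.
    # Escaping keeps characters such as ']' or '^' literal inside [...].
    pattern = re.compile("[" + re.escape(search) + "]+")
    # A function replacement inserts the tags literally, so the tags
    # themselves never go through backslash processing.
    return pattern.sub(lambda m: start + m.group(0) + end, text)

print(tagger("XuyZB8we4", "XYZAB"))
# [target]X[endtarget]uy[target]ZB[endtarget]8we4
```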
How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.
You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any character except a newline (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
You can use
import re
l = 'https://'+'homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
Note that you used an empty string as the replacement argument. In the above snippet, the parts before and after .*? are captured; \g<1> refers to the first group value and \2 refers to the second group value in the replacement pattern. The unambiguous backreference form (\g<X>) is used to avoid backreference issues, since there is a digit right after the backreference.
Since the replacement text contains no backslashes, there is no need to preprocess (escape) anything in it.
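When the replacement text is not under your control and might contain backslashes or digits, one option is to pass a function to re.sub instead of a template string, because the function's return value is inserted literally. A minimal sketch; replace_between is a hypothetical helper name, not part of the answer above:

```python
import re

def replace_between(text, before, after, replacement):
    # replace_between is a hypothetical helper used only for illustration.
    pattern = re.compile(
        "(" + re.escape(before) + ").*?(" + re.escape(after) + ")",
        re.DOTALL,
    )
    # The callable's result is used as-is: no \1 or \g<1> parsing happens.
    return pattern.sub(lambda m: m.group(1) + replacement + m.group(2), text)

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
print(replace_between(l, 'page1/', '_type-A', '222.6'))
# https://homepage.com/home/page1/222.6_type-A/go
```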
This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses positive lookbehind and lookahead assertions, (?<=...) and (?=...). A match only occurs if the text is preceded and followed by what the assertions specify, but the assertions themselves are not consumed. This means re.sub looks for text with page1/ before it and _type after it, but only replaces the part in between.
I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repeated words should be replaced by two.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that a "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or string boundaries, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
You could try the below regex also,
(?<= |^)(\S+)(?: \1){2,}(?= |$)
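Sample code (using the third-party regex module, because the standard re module rejects the variable-width lookbehind (?<= |^)):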
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max: out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
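The same loop can also be written with itertools.groupby, which groups consecutive equal words so that each run can simply be truncated. This is an alternative sketch, not the code from the answer; the name max_repeats_groupby and the parameter name limit are chosen here for illustration. Like the loop above, it collapses runs of whitespace into single spaces:

```python
from itertools import groupby, islice

def max_repeats_groupby(s, limit=2):
    # Hypothetical variant of max_repeats, shown for illustration.
    out = []
    # groupby yields one group per run of consecutive identical words.
    for _, run in groupby(s.split()):
        # Keep at most `limit` words from each run.
        out.extend(islice(run, limit))
    return ' '.join(out)

print(max_repeats_groupby("bye! bye! bye!"))  # bye! bye!
```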
Try the following:
import re
s = "bye! bye! bye!"  # your string
s = re.sub( r'(\S+) (?:\1 ?){2,}', r'\1 \1', s )
You can see a sample code here: http://codepad.org/YyS9JCLO
import re
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any word.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)
This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what the documentation says about re.match(): "If zero or more characters at the beginning of string match the regular expression pattern ..."
In your case, the string doesn't have a digit (\d) at the beginning. But for \w, it has t at the beginning of your string.
If you want to check for a digit in your string using the same mechanism, prepend .* to your regex:
digit_regex = re.compile(r'.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.
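A short sketch of the difference, reusing the string from the question (written for Python 3, unlike the session in the question):

```python
import re

string = 'this is a string with a 4 digit in it'
digit_regex = re.compile(r'\d')

# match only tries at position 0, where the string has 't', not a digit.
print(digit_regex.match(string))           # None
# search scans forward until the pattern matches somewhere.
print(digit_regex.search(string).group())  # 4
# findall returns every non-overlapping match as a list.
print(digit_regex.findall(string))         # ['4']
```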
I was wondering if it's possible to use regex with python to capture a word, or a part of the word (if it's at the end of the string).
Eg:
target word - potato
string - "this is a sentence about a potato"
string - "this is a sentence about a potat"
string - "this is another sentence about a pota"
Thanks!
import re
def get_matcher(word, minchars):
    # Build an alternation of prefixes, longest first: potato|potat|pota
    reg = '|'.join([word[0:i] for i in range(len(word), minchars - 1, -1)])
    return re.compile('(%s)$' % (reg))
matcher = get_matcher('potato', 4)
for s in ["this is a sentence about a potato", "this is a sentence about a potat", "this is another sentence about a pota"]:
    print(matcher.search(s).groups())
OUTPUT
('potato',)
('potat',)
('pota',)
I don't know how to match a regex in Python, but the regex would be:
"\bp$|\bpo$|\bpot$|\bpota$|\bpotat$|\bpotato$"
This would match anything from p to potato if it's the last word in the string, but not, for example, something like "foopotato", if this is what you want.
The | denotes an alternative, the \b is a "word boundary", so it matches a position (not a character) between a word- and a non-word character. And the $ matches the end of the string (also a position).
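In Python, an alternation like this does not have to be typed by hand. The sketch below builds it from the target word; prefix_pattern and its minchars parameter are hypothetical names used for illustration, and a single \b and $ are shared by all alternatives instead of being repeated in each one:

```python
import re

def prefix_pattern(word, minchars=1):
    # prefix_pattern is a hypothetical helper; minchars is an assumed
    # parameter giving the shortest prefix that should still match.
    # Prefixes are listed longest first and escaped individually.
    prefixes = [re.escape(word[:i]) for i in range(len(word), minchars - 1, -1)]
    return re.compile(r"\b(?:" + "|".join(prefixes) + r")$")

patt = prefix_pattern("potato")
print(patt.search("this is a sentence about a potat").group())  # potat
print(patt.search("this is a sentence about a foopotato"))      # None
```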
Use the $ to match at the end of a string. For example, the following would match 'potato' only at the end of a string (first example):
"potato$"
This would match all of your examples:
"pota[to]{0,2}$"
However, there is some risk of also matching "potao" or "potaot".
import re
patt = re.compile(r'(p|po|pot|pota|potat|potato)$')
patt.search(string)
I was tempted to use r'po?t?a?t?o?$', but that would also match poto or pott.
No, you can't do that with a regex as far as I know, without pointless (p|po|pot ...) matches which are excessive. Instead, just pick off the last word, and match that using a substring:
match = re.search(r'\S+$', haystack)
# match is None if haystack is empty or ends with whitespace
if match and needle.startswith(match.group(0)):
    print("matches")
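Wrapped in a function, the last-word approach might look like this sketch. The name ends_with_partial and the minchars parameter are assumptions added for illustration; minchars plays the same role as the minimum length in the get_matcher answer:

```python
import re

def ends_with_partial(haystack, needle, minchars=1):
    # Hypothetical helper: True if the last whitespace-delimited token of
    # haystack is a prefix of needle that is at least minchars long.
    match = re.search(r'\S+$', haystack)
    if match is None:  # empty string, or the string ends with whitespace
        return False
    last = match.group(0)
    return len(last) >= minchars and needle.startswith(last)

print(ends_with_partial("this is another sentence about a pota", "potato"))  # True
print(ends_with_partial("this is a sentence about a foopotato", "potato"))   # False
```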