Regex to find strings containing substring, but not ending on same substring - python

I'm trying to write a regex that checks if a string contains the substring "ing", but it must not end in "ing".
So the word sing would not work but singer would.
I think I have figured out how to make sure that the string does not end with ing, for that I'm using
(!<?(ing))$
But I can't seem to get it to work when I want the word to contain "ing" as well. I was thinking something like
(\w+(ing))(!<?(ing))$
But that does not work. All of my attempts that sort of work also take in more than one word. So it will match singer but not singer crafting; it should still match singer here, just not crafting.

You may use the pattern:
ing(?=\w)
This would only be true for words which contain ing which is also followed by another word character. Here is an example:
import re
inp = 'singer'
if re.search(r'ing(?=\w)', inp):
    print('singer is a MATCH')
inp = 'sing'
if re.search(r'ing(?=\w)', inp):
    print('sing is a MATCH')
This prints:
singer is a MATCH
Edit:
To match entire words containing a non-terminal ing, I suggest using re.findall:
inp = "Madonna is a singer who likes to sing."
matches = re.findall(r'\b\w*ing\w+\b', inp)
print(matches) # prints ['singer']

If the word can not end with ing but must contain ing:
\b\w*ing(?!\w*ing\b)\w+
Explanation
\b A word boundary
\w* Match 0+ word characters
ing Match the required ing
(?!\w*ing\b) Negative lookahead, assert the ing is not at the end of the word
\w+ Match 1+ word chars so that there must be at least a single char following
For example
import re
items = ["singer","singing","ing","This is a ing testing singalong"]
pattern = r"\b\w*ing(?!\w*ing\b)\w+\b"
for item in items:
    result = re.findall(pattern, item)
    if result:
        print(result)
Output
['singer']
['singalong']
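The lookahead can end in either \b or $. Here is a minimal sketch of how the two forms differ once the word is not the last thing in the string; the sample sentence is made up for illustration:

```python
import re

# Hypothetical sample: "singing" ends in "ing" but is not at the end of the string
text = "She was singing loudly"

with_dollar = re.findall(r"\b\w*ing(?!\w*ing$)\w+\b", text)
with_boundary = re.findall(r"\b\w*ing(?!\w*ing\b)\w+\b", text)

print(with_dollar)    # the $ form only rules out "ing" at the very end of the string
print(with_boundary)  # the \b form rules out "ing" at the end of any word
```

If the input is always a single word, the two forms should behave the same; the \b form is the safer choice for running text.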

You can use this pattern:
import re
pattern = re.compile(r'\w*ing\w+')
print(pattern.match('sing')) # No match
print(pattern.match('singer')) # Match
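One thing worth checking with this simpler pattern is a word such as singing, which contains ing and also ends in it. The sketch below compares it with the lookahead variant from the other answer; the check helper and the word list are illustrative assumptions, not part of either answer:

```python
import re

simple = re.compile(r'\w*ing\w+')
strict = re.compile(r'\b\w*ing(?!\w*ing\b)\w+\b')

def check(pattern, word):
    # Hypothetical helper: True if the whole word matches the pattern
    return pattern.fullmatch(word) is not None

words = ["sing", "singer", "singing", "ringtone"]
simple_hits = [w for w in words if check(simple, w)]
strict_hits = [w for w in words if check(strict, w)]

print(simple_hits)  # "singing" is accepted here: "s" + "ing" + "ing"
print(strict_hits)  # the lookahead rejects "singing"
```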

Related

A regex pattern that matches all words starting from a word with an s and stopping before a word that starts with an s

I'm trying to capture words in a string such that the first word starts with an s, and the regex stops matching if the next word also starts with an s.
For example. I have the string " Stack, Code and StackOverflow". I want to capture only " Stack, Code and " and not include "StackOverflow" in the match.
This is what I am thinking:
Start with a space followed by an s.
Match everything except if the group is a space and an s (I'm using negative lookahead).
The regex I have tried:
(?<=\s)S[a-z -,]*(?!(\sS))
I don't know how to make it work.
I think this should work; the regex is adapted from another thread. I have also included a non-regex solution, which tracks the first word starting with an 's' and the next word starting with an 's', and takes the words in that range.
import re
teststring = " Stack, Code and StackOverflow"
extractText = re.search(r"(\s)[sS][^*\s]*[^sS]*", teststring)
print(extractText[0])
# Non-regex solution
listwords = teststring.split(' ')
start = None  # index of the first word starting with s/S
end = len(listwords)  # index of the next such word, if there is one
for i, word in enumerate(listwords):
    if word.lower().startswith('s'):
        if start is None:
            start = i
        else:
            end = i
            break
newstring = " " + " ".join(listwords[start:end])
print(newstring)
Output
Stack, Code and
Stack, Code and
You could use for example a capture group:
(S(?<!\S.).*?)\s*S(?<!\S.)
Explanation
( Capture group 1
S(?<!\S.) Match S and assert that there is a whitespace boundary to the left of the S (no non-whitespace character directly before it)
.*? Match any character, as few as possible
) Close group
\s* Match optional whitespace chars
S(?<!\S.) Match S and assert that there is a whitespace boundary to the left of the S (no non-whitespace character directly before it)
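To see what the S(?<!\S.) construct accepts, the sketch below lists the positions where it matches and compares them with the more common leading form (?<!\S)S. The sample string is made up for illustration:

```python
import re

# Hypothetical sample with S at the start, after a space, and inside words
text = "Stack aS StackOverflow xS S"

# Lookbehind placed after the S: the "." covers the S itself
trailing = [m.start() for m in re.finditer(r"S(?<!\S.)", text)]
# Lookbehind placed before the S
leading = [m.start() for m in re.finditer(r"(?<!\S)S", text)]

print(trailing)
print(leading)
```

Both forms should report the same positions: only an S at the start of the string or directly after whitespace qualifies.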
Example code:
import re
pattern = r"(S(?<!\S.).*?)\s*S(?<!\S.)"
s = "Stack, Code and StackOverflow"
m = re.search(pattern, s)
if m:
    print(m.group(1))
Output
Stack, Code and
Another option is to use a lookahead to assert the S to the right without consuming it, which allows multiple consecutive matches:
S(?<!\S.).*?(?=\s*S(?<!\S.))
import re
pattern = r"S(?<!\S.).*?(?=\s*S(?<!\S.))"
s = "Stack, Code and StackOverflow test Stack"
print(re.findall(pattern, s))
Output
['Stack, Code and', 'StackOverflow test']

How to match regex in python?

describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do
I am trying to extract sg-ezsrzerzer from it (that is, everything from sg- up to the double quote). I'm using Python.
I currently have:
import re
a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
test = re.findall(r'\bsg-.*\b', a)
print(test)
output is
['sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do']
How do I only get ['sg-ezsrzerzer']?
The pattern (?<=group_id=\>").+?(?=\") would work nicely if the goal is to extract the group_id value within a given string formatted as in your example.
(?<=group_id=\>") Looks behind for the sub-string group_id=>" before the string to be matched.
.+? Matches one or more of any character lazily.
(?=\") Looks ahead for the character " following the match (effectively making the expression .+ match any character except a closing ").
If you only want to extract sub-strings where the group_id starts with sg-, you can add this to the matching part of the pattern as follows: (?<=group_id=\>")sg\-.+?(?=\")
import re
s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
results = re.findall(r'(?<=group_id=\>").+?(?=\")', s)
print(results)
Output
['sg-ezsrzerzer']
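A short sketch of the sg- restricted variant mentioned above, written without the optional backslashes. The second input line is a made-up example of a group_id that does not start with sg-:

```python
import re

pattern = re.compile(r'(?<=group_id=>")sg-.+?(?=")')

lines = [
    'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do',
    # Hypothetical line whose group_id has a different prefix
    'describe aws_security_group({:group_id=>"other-123"}) do',
]

found = [pattern.findall(line) for line in lines]
print(found)
```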
Of course you could alternatively use re.search instead of re.findall to find the first instance of a sub-string matching the above pattern in a given string - depends on your use case I suppose.
import re
s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
result = re.search(r'(?<=group_id=\>").+?(?=\")', s)
if result:
    result = result.group()
print(result)
Output
sg-ezsrzerzer
If you decide to use re.search, note that it returns None if no match is found in the input string and a re.Match object if there is one, hence the if statement and the call to result.group() to extract the matching string in the above example.
The pattern \bsg-.*\b matches too much as the .* will match until the end of the string, and will then backtrack to the first word boundary, which is after the o and the end of string.
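The difference can be seen by running the greedy pattern next to one that cannot cross the closing quote. This is only a sketch; the negated character class [^"]* is one possible way to stop at the quote, not the only one:

```python
import re

a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'

# Greedy .* runs to the end of the string, then backs off to a word boundary
greedy = re.findall(r'\bsg-.*\b', a)
# [^"]* can never consume the closing double quote
bounded = re.findall(r'\bsg-[^"]*', a)

print(greedy)
print(bounded)
```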
If you are using re.findall you can also use a capture group instead of lookarounds and the group value will be in the result.
:group_id=>"(sg-[^"\r\n]+)"
The pattern matches:
:group_id=>" Match literally
(sg-[^"\r\n]+) Capture group 1 match sg- and 1+ times any char except " or a newline
" Match the double quote
For example
import re
pattern = r':group_id=>"(sg-[^"\r\n]+)"'
s = "describe aws_security_group({:group_id=>\"sg-ezsrzerzer\", :vpc_id=>\"vpc-zfds54zef4s\"}) do"
print(re.findall(pattern, s))
Output
['sg-ezsrzerzer']
Match until the first word boundary with \w+:
import re
a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
test = re.findall(r'\bsg-\w+', a)
print(test[0])
This prints sg-ezsrzerzer.
Explanation:
\b   the boundary between a word char (\w) and something that is not a word char
sg-  the literal text 'sg-'
\w+  word characters (a-z, A-Z, 0-9, _), one or more times, matching as many as possible

regex match a word after a certain character

I would like to match a word when it is after a char m or b
So for example, when the word is men, I would like to return en (only the word that is following m), if the word is beetles then return eetles
Initially I tried (m|b)\w+ but it matches the entire men not en
How do I write regex expression in this case?
Thank you!
You could get the match only using a positive lookbehind asserting what is on the left is either m or b using character class [mb] preceded by a word boundary \b
(?<=\b[mb])\w+
(?<= Positive lookbehind, assert what is directly to the left is
\b[mb] Word boundary, match either m or b
) Close lookbehind
\w+ Match 1 + word chars
If nothing may follow the word characters, you can assert a whitespace boundary on the right using (?!\S)
(?<=\b[mb])\w+(?!\S)
Example code
import re
test_str = ("beetles men")
regex = r"(?<=\b[mb])\w+"
print(re.findall(regex, test_str))
Output
['eetles', 'en']
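To illustrate the (?!\S) variant, the sketch below runs both patterns on a made-up string in which some words are followed by punctuation:

```python
import re

# Hypothetical sample: "men," and "mice-like" are followed by non-whitespace
text = "men, beetles mice-like"

plain = re.findall(r"(?<=\b[mb])\w+", text)
strict = re.findall(r"(?<=\b[mb])\w+(?!\S)", text)

print(plain)   # takes the word characters after every leading m or b
print(strict)  # keeps only matches followed by whitespace or the end of the string
```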
You may use
\b[mb](\w+)
NOTE: When your known prefixes include multicharacter sequences, say, you want to find words starting with m or be, you will have to use a non-capturing group rather than a character class: \b(?:m|be)(\w+). The current solution can thus be written as \b(?:m|b)(\w+) (however, a character class here looks more natural, unless you have to build the regex dynamically).
Details
\b - a word boundary
[mb] - m or b
(\w+) - Capturing group 1: any one or more word chars, letters, digits or underscores. To match only letters, use ([^\W\d_]+) instead.
Python demo:
import re
rx = re.compile(r'\b[mb](\w+)')
text = "The words are men and beetles."
# First occurrence:
m = rx.search(text)
if m:
print(m.group(1)) # => en
# All occurrences
print( rx.findall(text) ) # => ['en', 'eetles']
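(?<=[mb])\w+
You can use the regex above. It matches the word characters that directly follow an m or b. Because there is no \b in the lookbehind, it also matches after an m or b in the middle of a word.
(?<=[mb]): positive lookbehind, the previous character must be m or b
\w+: matches one or more word characters (equal to [a-zA-Z0-9_]+)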
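A small sketch of what the word boundary changes, using a made-up input where m and b also appear inside a word:

```python
import re

# Hypothetical sample: "amber" has m and b in the middle of the word
text = "amber men"

without_boundary = re.findall(r"(?<=[mb])\w+", text)
with_boundary = re.findall(r"(?<=\b[mb])\w+", text)

print(without_boundary)  # also picks up "ber" from inside "amber"
print(with_boundary)     # only words that start with m or b
```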

Searching for a pattern in a sentence with regex in python

I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = "The special code is 034567 in this particular case and not 98675"
In this example, I am interested in capturing the number 034567, which comes after the phrase special code, and also the start and end index of the number 034567.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?
Your expression requires a literal space followed by any whitespace (\s) after code, then \w. matches one word char and one character other than a line break char, and then \s followed by a literal space again requires two whitespace characters. The text has only a single space between words, so nothing matches.
You may simply match one or more whitespace characters between words using \s+, and to match any chunk of non-whitespace characters, use \S+ instead of \w.
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
    print(m.group(1)) # 034567
    print(m.span(1)) # (20, 26)
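Since the question also asks for the start and end index, here is a small sketch that reads them separately with m.start(1) and m.end(1) and uses them to slice the text. It assumes a match is found, as it is for this sample text:

```python
import re

text = 'The special code is 034567 in this particular case and not 98675'
m = re.search(r'special code\s+\S+\s+(\d+)', text)

# Indices refer to capture group 1; end is exclusive, as with slicing
start, end = m.start(1), m.end(1)
number = text[start:end]

print(start, end, number)
```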
Use re.findall with a capture group:
import re
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']

Regex? Match part of or whole word

I was wondering if it's possible to use regex with python to capture a word, or a part of the word (if it's at the end of the string).
Eg:
target word - potato
string - "this is a sentence about a potato"
string - "this is a sentence about a potat"
string - "this is another sentence about a pota"
Thanks!
import re
def get_matcher(word, minchars):
    # Alternation of prefixes, longest first, e.g. potato|potat|pota
    reg = '|'.join([word[0:i] for i in range(len(word), minchars - 1, -1)])
    return re.compile('(%s)$' % (reg))
matcher = get_matcher('potato', 4)
for s in ["this is a sentence about a potato", "this is a sentence about a potat", "this is another sentence about a pota"]:
    print(matcher.search(s).groups())
OUTPUT
('potato',)
('potat',)
('pota',)
I don't know how to match a regex in Python, but the regex would be:
"\bp$|\bpo$|\bpot$|\bpota$|\bpotat$|\bpotato$"
This would match anything from p to potato if it's the last word in the string, and would not match something like "foopotato", if that is what you want.
The | denotes an alternative, the \b is a "word boundary", so it matches a position (not a character) between a word- and a non-word character. And the $ matches the end of the string (also a position).
Use the $ to match at the end of a string. For example, the following would match 'potato' only at the end of a string (first example):
"potato$"
This would match all of your examples:
"pota[to]{0,2}$"
However, there is some risk of also matching "potao" or "potaot".
import re
patt = re.compile(r'(p|po|pot|pota|potat|potato)$')
patt.search(string)
I was tempted to use r'po?t?a?t?o?$', but that would also match poto or pott.
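One way to avoid both the long alternation and the poto/pott problem is to nest the optional parts, so that each later letter is only allowed if the previous one was present. The sketch below builds such a pattern; prefix_pattern and its minchars parameter are hypothetical names introduced here for illustration:

```python
import re

def prefix_pattern(word, minchars=1):
    """Hypothetical helper: match word, or a prefix of it with at least
    minchars characters, as the last word of a string."""
    tail = ""
    # Build nested optional groups from the last letter inwards
    for ch in reversed(word[minchars:]):
        tail = "(?:" + re.escape(ch) + tail + ")?"
    return re.compile(r"\b" + re.escape(word[:minchars]) + tail + "$")

patt = prefix_pattern("potato", 4)
print(patt.pattern)

for s in ["this is a sentence about a potato",
          "this is a sentence about a potat",
          "this is another sentence about a pota",
          "this is a sentence about a potao"]:
    m = patt.search(s)
    print(m.group(0) if m else None)
```

With minchars set to 4 the generated pattern should be \bpota(?:t(?:o)?)?$, which accepts pota, potat and potato but not potao.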
No, you can't do that with a regex as far as I know, without pointless (p|po|pot ...) matches which are excessive. Instead, just pick off the last word, and match that using a substring:
import re
needle = 'potato'
haystack = 'this is a sentence about a potat'
match = re.search(r'\S+$', haystack)
if match and needle.startswith(match.group(0)):
    print('matches')
