Regex? Match part of or whole word - python

I was wondering if it's possible to use regex with python to capture a word, or a part of the word (if it's at the end of the string).
Eg:
target word - potato
string - "this is a sentence about a potato"
string - "this is a sentence about a potat"
string - "this is another sentence about a pota"
Thanks!

import re
def get_matcher(word, minchars):
    # Alternation of the prefixes of word, longest first, down to minchars characters
    reg = '|'.join(re.escape(word[0:i]) for i in range(len(word), minchars - 1, -1))
    return re.compile('(%s)$' % reg)
matcher = get_matcher('potato', 4)
for s in ["this is a sentence about a potato", "this is a sentence about a potat", "this is another sentence about a pota"]:
    print(matcher.search(s).groups())
OUTPUT
('potato',)
('potat',)
('pota',)

I don't know how to run a regex in Python, but the regex would be:
"\bp$|\bpo$|\bpot$|\bpota$|\bpotat$|\bpotato$"
This matches anything from p to potato if it is the last word in the string, and it does not match something like "foopotato", if that is what you want.
The | denotes an alternative. The \b is a "word boundary": it matches a position (not a character) between a word character and a non-word character. The $ matches the end of the string (also a position).
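In Python, this alternation can be handed to re.search as a raw string, so that \b is read as a word boundary rather than a backspace character. A minimal sketch; the helper name last_word_match is made up here for illustration:

```python
import re

# Raw string: in a normal string literal Python would turn \b into a backspace.
pattern = re.compile(r"\bp$|\bpo$|\bpot$|\bpota$|\bpotat$|\bpotato$")

def last_word_match(s):
    """Return the matched text, or None if s does not end in a prefix of 'potato'."""
    m = pattern.search(s)
    return m.group(0) if m else None

print(last_word_match("this is a sentence about a potat"))      # potat
print(last_word_match("this is a sentence about a foopotato"))  # None
```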

Use the $ to match at the end of a string. For example, the following would match 'potato' only at the end of a string (first example):
"potato$"
This would match all of your examples:
"pota[to]{0,2}$"
(The quantifier has to start at 0; with {1,2} the string ending in "pota" would not match.) However, there is some risk of also matching "potao" or "potaot".

import re
patt = re.compile(r'(p|po|pot|pota|potat|potato)$')
patt.search(string)
I was tempted to use r'po?t?a?t?o?$', but that would also match poto or pott.
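One way to keep the compactness of the optional-character idea without accepting poto or pott is to nest the optional groups, so each later letter is only allowed when all earlier ones are present. The sketch below is an illustration, not part of the original answers; prefix_pattern and its minchars parameter are hypothetical names.

```python
import re

def prefix_pattern(word, minchars=1):
    """Compile a regex matching a prefix of word (at least minchars long) as the last word."""
    head, tail = word[:minchars], word[minchars:]
    nested = ""
    # Build from the inside out: for 'potato' with minchars=4 this gives (?:t(?:o)?)?
    for ch in reversed(tail):
        nested = "(?:%s%s)?" % (re.escape(ch), nested)
    return re.compile(r"\b" + re.escape(head) + nested + "$")

patt = prefix_pattern("potato", 4)
print(patt.pattern)
for s in ["a potato", "a potat", "a pota", "a potao", "a foopotato"]:
    print(s, bool(patt.search(s)))
```

Because the groups are nested rather than independent, "potao" is rejected: the o is only reachable after the second t.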

As far as I know, you can't do that with a regex without spelling out every prefix, (p|po|pot ...), which is excessive. Instead, just pick off the last word and compare it against the start of the target word:
match = re.search(r'\S+$', haystack)
if match and needle.startswith(match.group(0)):
    pass  # the last word is a prefix of needle, so it matches
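The last-word approach can be wrapped in a small function. This is a sketch under the assumption that words are separated by whitespace; the function name and the minchars parameter are invented here.

```python
import re

def ends_with_prefix_of(haystack, needle, minchars=1):
    """True if the last whitespace-delimited word of haystack is a prefix of needle."""
    match = re.search(r"\S+$", haystack)
    if match is None:  # empty string, or the string ends in whitespace
        return False
    last_word = match.group(0)
    return len(last_word) >= minchars and needle.startswith(last_word)

print(ends_with_prefix_of("this is a sentence about a potat", "potato"))  # True
print(ends_with_prefix_of("this is a sentence about a potao", "potato"))  # False
```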

Related

How to match regex in python?

describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do
I'm trying to extract sg-ezsrzerzer from the line above (everything from sg- up to the closing double quote). I'm using Python.
I currently have:
import re
a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
test = re.findall(r'\bsg-.*\b', a)
print(test)
output is
['sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do']
How do I only get ['sg-ezsrzerzer']?

The pattern (?<=group_id=\>").+?(?=\") would work nicely if the goal is to extract the group_id value within a given string formatted as in your example.
(?<=group_id=\>") Looks behind for the sub-string group_id=>" before the string to be matched.
.+? Matches one or more of any character lazily.
(?=\") Looks ahead for the character " following the match (effectively making the expression .+ match any character except a closing ").
If you only want to extract sub-strings where the group_id starts with sg- then you can simply add this to the matching part of the pattern as follows (?<=group_id=\>")sg\-.+?(?=\")
import re
s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
results = re.findall(r'(?<=group_id=\>").+?(?=\")', s)
print(results)
Output
['sg-ezsrzerzer']
Of course you could alternatively use re.search instead of re.findall to find the first instance of a sub-string matching the above pattern in a given string - depends on your use case I suppose.
import re
s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
result = re.search(r'(?<=group_id=\>").+?(?=\")', s)
if result:
    result = result.group()
print(result)
Output
sg-ezsrzerzer
If you decide to use re.search, note that it returns None if no match is found in the input string and an re.Match object if one is, hence the if statement and the call to result.group() to extract the matching string in the example above.

The pattern \bsg-.*\b matches too much because .* first runs to the end of the string and then backtracks only as far as the nearest word boundary, which sits between the final o and the end of the string.
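The difference between the greedy pattern and one that cannot cross the closing quote can be seen side by side. A short sketch using the string from the question; the variable names greedy and bounded are chosen here for illustration:

```python
import re

a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'

# Greedy: .* runs to the end of the line before backtracking to a word boundary.
greedy = re.findall(r'\bsg-.*\b', a)

# Bounded: [^"]* stops at the closing double quote.
bounded = re.findall(r'\bsg-[^"]*', a)

print(greedy)   # the whole tail of the line, as in the question
print(bounded)  # ['sg-ezsrzerzer']
```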
If you are using re.findall you can also use a capture group instead of lookarounds and the group value will be in the result.
:group_id=>"(sg-[^"\r\n]+)"
The pattern matches:
:group_id=>" Match literally
(sg-[^"\r\n]+) Capture group 1 match sg- and 1+ times any char except " or a newline
" Match the double quote
For example
import re
pattern = r':group_id=>"(sg-[^"\r\n]+)"'
s = "describe aws_security_group({:group_id=>\"sg-ezsrzerzer\", :vpc_id=>\"vpc-zfds54zef4s\"}) do"
print(re.findall(pattern, s))
Output
['sg-ezsrzerzer']

Match until the first word boundary with \w+:
import re
a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
test = re.findall(r'\bsg-\w+', a)
print(test[0])
EXPLANATION
\b    the boundary between a word char (\w) and something that is not a word char
sg-   the literal text 'sg-'
\w+   word characters (a-z, A-Z, 0-9, _), one or more times, matching as many as possible
Results: sg-ezsrzerzer

Regex to find strings containing substring, but not ending on same substring

I'm trying to write a regex that checks if a string contains the substring "ing", but it must not end in "ing".
So the word sing would not work but singer would.
I think I have figured out how to make sure that the string does not end with ing, for that I'm using
(!<?(ing))$
But I can't seem to get it to work when I want the word to contain "ing" as well. I was thinking something like
(\w+(ing))(!<?(ing))$
But that does not work, and every solution that sort of works also takes in more than one word. So it matches singer but fails on singer crafting, where it should still match singer, just not crafting.

You may use the pattern:
ing(?=\w)
This only matches words in which ing is followed by another word character. Here is an example:
import re
inp = 'singer'
if re.search(r'ing(?=\w)', inp):
    print('singer is a MATCH')
inp = 'sing'
if re.search(r'ing(?=\w)', inp):
    print('sing is a MATCH')
This prints:
singer is a MATCH
Edit:
To match entire words containing a non-terminal ing, I suggest using re.findall:
inp = "Madonna is a singer who likes to sing."
matches = re.findall(r'\b\w*ing\w+\b', inp)
print(matches) # prints ['singer']

If the word can not end with ing but must contain ing:
\b\w*ing(?!\w*ing\b)\w+
Explanation
\b A word boundary
\w* Match 0+ word characters
ing Match the required ing
(?!\w*ing\b) Negative lookahead, assert that the word does not end in a later ing
\w+ Match 1+ word chars so that there must be at least a single char following
For example
import re
items = ["singer","singing","ing","This is a ing testing singalong"]
pattern = r"\b\w*ing(?!\w*ing\b)\w+"
for item in items:
    result = re.findall(pattern, item)
    if result:
        print(result)
Output
['singer']
['singalong']

You can use this pattern:
import re
pattern = re.compile(r'\w*ing\w+')
print(pattern.match('sing')) # No match
print(pattern.match('singer')) # Match
Note that this pattern also matches a word such as singing, where a second ing ends the word.
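When the input is already a single word rather than a sentence, the same condition can be checked without a regex at all. This is a different technique from the answers above: a small sketch using plain string methods, with a helper name made up here.

```python
def has_inner_ing(word):
    """True if word contains 'ing' but does not end in 'ing'."""
    return "ing" in word and not word.endswith("ing")

for w in ["sing", "singer", "singing", "ingot"]:
    print(w, has_inner_ing(w))
```

For scanning whole sentences, the regex versions remain the better fit, since they find the words as well as test them.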

Looking for Italicized Text In Python using re.sub

The gist is that I'm writing a function that uses re.sub to strip the italic tags from a string and duplicate the text they enclosed. The function takes one argument, sentence, which contains a string.
A few examples:
sentence = <i>All of this text is italicized.</i>
Return value = "All of this text is italicized. All of this text is italicized."
sentence = <i>beep</i><i>bop</i><i>boop</i><i>bonk</i>
Return value: "beep beepbop bopboop boopbonk bonk"
sentence = "I <i>Like</i>, food because <i>it's so great</i>!"
return value: "I Like Like food because it's so great it's so great!".
Here's what I have so far:
pattern = r'<.*?>'
return re.sub(pattern, i, sentence)
Can anyone help?

First, your pattern is wrong: <.*?> matches only the tags themselves (and a greedy <.*> would instead match everything from the first < to the last >), so neither gives you the text you want to duplicate. Second, for i in sentence makes no sense: iterating over a string gives you the single characters of the string, which won't match your pattern anyway.
This, however, seems to do what you want:
return re.sub('<i>(.*?)</i>', r'\1 \1', sentence)
\1 is a reference to whatever the first capturing group, ie. (.*?), has matched, and it is used twice to achieve the doubling effect.
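Wrapped in a complete function, the substitution can be checked against the examples from the question. A minimal sketch; the function name duplicate_italics is an assumption, since the question does not name it.

```python
import re

def duplicate_italics(sentence):
    """Replace each <i>...</i> span with its inner text written twice."""
    # (.*?) is lazy, so each pair of tags is handled on its own.
    return re.sub(r"<i>(.*?)</i>", r"\1 \1", sentence)

print(duplicate_italics("<i>beep</i><i>bop</i><i>boop</i><i>bonk</i>"))
# beep beepbop bopboop boopbonk bonk
```

Text outside the tags is left untouched, so in the third example the comma after Like is kept in the result.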

Python Find entire word in string using regex and user input

I'm trying to match an entire word exactly using regex, where the word I'm searching for is a variable value coming from user input. I've tried this:
regex = r"\b(?=\w)" + re.escape(user_input) + r"\b"
if re.match(regex, string_to_search[i], re.IGNORECASE):
<some code>...
but it matches every occurrence of the string. It matches "var"->"var" which is correct but also matches "var"->"var"iable and I only want it to match "var"->"var" or "string"->"string"
Input: "sword"
String_to_search = "There once was a swordsmith that made a sword"
Desired output: Match "sword" to "sword" and not "swordsmith"

It seems you want a pattern that matches an entire string. Note that the \b word boundary is needed when you want to find partial matches. When you need a full string match, you need anchors. Since re.match anchors the match at the start of the string, all you need is $ (end of string position) at the end of the pattern:
regex = '{}$'.format(re.escape(user_input))
and then use
re.match(regex, search_string, re.IGNORECASE)

You can try re.finditer like that:
>>> import re
>>> user_input = "var"
>>> text = "var variable var variable"
>>> regex = r"(?=\b%s\b)" % re.escape(user_input)
>>> [m.start() for m in re.finditer(regex, text)]
[0, 13]
It'll find all matches iteratively.
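Applied to the sword example from the question, a whole-word search might look like the sketch below. It swaps \b for the lookarounds (?<!\w) and (?!\w), which behave the same for ordinary words and also cope with input that starts or ends with punctuation; the function name is invented for illustration.

```python
import re

def find_whole_word(word, text):
    """Return the spans of every whole-word, case-insensitive occurrence of word in text."""
    # re.escape keeps user input such as 'c++' from being read as regex syntax.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return [m.span() for m in re.finditer(pattern, text, re.IGNORECASE)]

text = "There once was a swordsmith that made a sword"
for start, end in find_whole_word("sword", text):
    print(start, end, text[start:end])
```

Only the final sword is reported; the occurrence inside swordsmith is skipped because a word character follows it.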

Identifying lines with consecutive upper case letters

I'm looking for logic in Python that finds capitalized words in a line. For example, I have a *.txt file containing:
aaa
adadad
DDD_AAA
Dasdf Daa
I would like to search only for the lines which have 2 or more capital words after each other (in the above case DDD_AAA).

Regexes are the way to go:
import re
pattern = r"([A-Z]+_[A-Z]+)" # matches CAPITALS_CAPITALS only
match = re.search(pattern, text) # text is the line being tested
if match:
    print(match.group(0))
You have to figure out what exactly you are looking for though.

Presuming your definition of a "capital word" is a string of two or more uppercase alphabet (non-numeric) characters, i.e. [A-Z], and assuming that what separates one "capital word" from another is not quite the complementary set ([^A-Z]) but rather the complementary set to the alphanumeric characters, i.e. [^a-zA-Z0-9], you're looking for a regex like
\b[A-Z]{2,}\b.*\b[A-Z]{2,}\b
I say like because the above is not exactly correct: \b counts the underscore _ as a word character. Replace the \bs with [^a-zA-Z0-9]s wrapped in lookaround assertions (to make them zero-width, like \b), and you have the correct regex:
(?<=[^a-zA-Z0-9]|^)[A-Z]{2,}(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]{2,}(?=[^a-zA-Z0-9]|$)
Here's a Rubular demo.
Finally, if you consider a one-character word, a "word", then simply do away with the {2,} quantifiers:
(?<=[^a-zA-Z0-9]|^)[A-Z]+(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]+(?=[^a-zA-Z0-9]|$)
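One caveat for Python: the re module rejects the lookbehind (?<=[^a-zA-Z0-9]|^) because its alternatives differ in width. A sketch of an equivalent form using negative lookarounds, which assert "no alphanumeric character here" and therefore also cover the start and end of the line; the constant names are made up for this example.

```python
import re

# Negative lookarounds are fixed-width, so Python's re accepts them.
NOT_ALNUM_BEFORE = r"(?<![a-zA-Z0-9])"
NOT_ALNUM_AFTER = r"(?![a-zA-Z0-9])"
CAPITAL_WORD = NOT_ALNUM_BEFORE + r"[A-Z]{2,}" + NOT_ALNUM_AFTER

# Two capital words on the same line, with anything in between.
TWO_CAPITAL_WORDS = re.compile(CAPITAL_WORD + r".*" + CAPITAL_WORD)

sample_lines = ["aaa", "adadad", "DDD_AAA", "Dasdf Daa"]
matching = [line for line in sample_lines if TWO_CAPITAL_WORDS.search(line)]
print(matching)  # ['DDD_AAA']
```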

print(re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", search_text))
should work to match two words that both start with a capital letter.
For your specific example:
lines = []
for line in file:  # file is an already opened file object
    if re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", line):
        lines.append(line)
print(lines)
Basically, look into regexes!

Here you go:
import re
with open("r1.txt") as f:
    lines = f.readlines()
for line in lines:
    if re.match(r'[^\w]*[A-Z]+[ _][A-Z]+[^\w]*', line) is not None:
        print(line.strip("\n"))
Output:
DDD_AAA
