Searching for a pattern in a sentence with regex in python - python

I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = The special code is 034567 in this particular case and not 98675
In this example, I am interested in capturing the number 034657 which comes after the phrase special code and also the start and end index of the the number 034657.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?

Your expression matches a space and any whitespace with \s pattern, then \w. matches any word char and any character other than a line break char, and then again \s requires two whitespaces, any whitespace and a space.
You may simply match any 1+ whitespaces using \s+ between words, and to match any chunk of non-whitespaces, instead of \w., you may use \S+.
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
print(m.group(1)) # 034567
print(m.span(1)) # (20, 26)
See the Python demo and the regex demo.

Use re.findall with a capture group:
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']

Related

Regular expression for substitution of similar pattern in a string in Python

I want to use a regular expression to detect and substitute some phrases. These phrases follow the
same pattern but deviate at some points. All the phrases are in the same string.
For instance I have this string:
/this/is//an example of what I want /to///do
I want to catch all the words inside and including the // and substitute them with "".
To solve this, I used the following code:
import re
txt = "/this/is//an example of what i want /to///do"
re.search("/.*/",txt1, re.VERBOSE)
pattern1 = r"/.*?/\w+"
a = re.sub(pattern1,"",txt)
The result is:
' example of what i want '
which is what I want, that is, to substitute the phrases within // with "". But when I run the same pattern on the following sentence
"/this/is//an example of what i want to /do"
I get
' example of what i want to /do'
How can I use one regex and remove all the phrases and //, irrespective of the number of // in a phrase?
In your example code, you can omit this part re.search("/.*/",txt1, re.VERBOSE) as is executes the command, but you are not doing anything with the result.
You can match 1 or more / followed by word chars:
/+\w+
Or a bit broader match, matching one or more / followed by all chars other than / or a whitspace chars:
/+[^\s/]+
/+ Match 1+ occurrences of /
[^\s/]+ Match 1+ occurrences of any char except a whitespace char or /
Regex demo
import re
strings = [
"/this/is//an example of what I want /to///do",
"/this/is//an example of what i want to /do"
]
for txt in strings:
pattern1 = r"/+[^\s/]+"
a = re.sub(pattern1, "", txt)
print(a)
Output
example of what I want
example of what i want to
You can use
/(?:[^/\s]*/)*\w+
See the regex demo. Details:
/ - a slash
(?:[^/\s]*/)* - zero or more repetitions of any char other than a slash and whitespace
\w+ - one or more word chars.
See the Python demo:
import re
rx = re.compile(r"/(?:[^/\s]*/)*\w+")
texts = ["/this/is//an example of what I want /to///do", "/this/is//an example of what i want to /do"]
for text in texts:
print( rx.sub('', text).strip() )
# => example of what I want
# example of what i want to

How to match regex in python?

describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do
I try to filter the sg-ezsrzerzer out of it (so I want to filter on start sg- till double quote). I'm using python
I currently have:
import re
a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
test = re.findall(r'\bsg-.*\b', a)
print(test)
output is
['sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do']
How do I only get ['sg-ezsrzerzer']?
The pattern (?<=group_id=\>").+?(?=\") would work nicely if the goal is to extract the group_id value within a given string formatted as in your example.
(?<=group_id=\>") Looks behind for the sub-string group_id=>" before the string to be matched.
.+? Matches one or more of any character lazily.
(?=\") Looks ahead for the character " following the match (effectively making the expression .+ match any character except a closing ").
If you only want to extract sub-strings where the group_id starts with sg- then you can simply add this to the matching part of the pattern as follows (?<=group_id=\>")sg\-.+?(?=\")
import re
s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
results = re.findall('(?<=group_id=\>").+?(?=\")', s)
print(results)
Output
['sg-ezsrzerzer']
Of course you could alternatively use re.search instead of re.findall to find the first instance of a sub-string matching the above pattern in a given string - depends on your use case I suppose.
import re
s = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
result = re.search('(?<=group_id=\>").+?(?=\")', s)
if result:
result = result.group()
print(result)
Output
'sg-ezsrzerzer'
If you decide to use re.search you will find that it returns None if there is no match found in your input string and an re.Match object if there is - hence the if statement and call to s.group() to extract the matching string if present in the above example.
The pattern \bsg-.*\b matches too much as the .* will match until the end of the string, and will then backtrack to the first word boundary, which is after the o and the end of string.
If you are using re.findall you can also use a capture group instead of lookarounds and the group value will be in the result.
:group_id=>"(sg-[^"\r\n]+)"
The pattern matches:
:group_id=>" Match literally
(sg-[^"\r\n]+) Capture group 1 match sg- and 1+ times any char except " or a newline
" Match the double quote
See a regex demo or a Python demo
For example
import re
pattern = r':group_id=>"(sg-[^"\r\n]+)"'
s = "describe aws_security_group({:group_id=>\"sg-ezsrzerzer\", :vpc_id=>\"vpc-zfds54zef4s\"}) do"
print(re.findall(pattern, s))
Output
['sg-ezsrzerzer']
Match until the first word boundary with \w+:
import re
a = 'describe aws_security_group({:group_id=>"sg-ezsrzerzer", :vpc_id=>"vpc-zfds54zef4s"}) do'
test = re.findall(r'\bsg-\w+', a)
print(test[0])
See Python proof.
EXPLANATION
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
sg- 'sg-'
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
Results: g-ezsrzerzer

regex to remove every hyphen except between two words

I am cleaning a text and I would like to remove all the hyphens and special characters. Except for the hyphens between two words such as: tic-tacs, popcorn-flavoured.
I wrote the below regex but it removes every hyphen.
text='popcorn-flavoured---'
new_text=re.sub(r'[^a-zA-Z0-9]+', '',text)
new_text
I would like the output to be:
popcorn-flavoured
You can replace matches of the regular expression
-(?!\w)|(?<!\w)-
with empty strings.
Regex demo <¯\_(ツ)_/¯> Python demo
The regex will match hyphens that are not both preceded and followed by a word character.
Python's regex engine performs the following operations.
- match '-'
(?!\w) the previous character is not a word character
|
(?<!\w) the following character is not a word character
- match '-'
(?!\w) is a negative lookahead; (?<!\w) is a negative lookbehind.
As an alternative, you could capture a hyphen between word characters and keep that group in the replacement. Using an alternation, you could match the hyphens that you want to remove.
(\w+-\w+)|-+
Explanation
(\w+-\w+) Capture group 1, match 1+ word chars, hyphen and 1+ word chars
| Or
-+ Match 1+ times a hyphen
Regex demo | Python demo
Example code
import re
regex = r"(\w+-\w+)|-+"
test_str = ("popcorn-flavoured---\n"
"tic-tacs")
result = re.sub(regex, r"\1", test_str)
print (result)
Output
popcorn-flavoured
tic-tacs
You can use findall() to get that part that matches your criteria.
new_text = re.findall('[\w]+[-]?[\w]+', text)[0]
Play around with it with other inputs.
You can use
p = re.compile(r"(\b[-]\b)|[-]")
result = p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
Test
With:
text='popcorn-flavoured---'
Output (result):
popcorn-flavoured
Explanation
This pattern detects hyphens between two words:
(\b[-]\b)
This pattern detects all hyphens
[-]
Regex substitution
p.sub(lambda m: (m.group(1) if m.group(1) else " "), text)
When hyphen detected between two words m.group(1) exists, so we maintain things as they are
else "")
Occurs when the pattern was triggered by [-] then we substitute a "" for the hyphen removing it.

Regex : matching integers inside of brackets

I am trying to take off bracketed ends of strings such as version = 10.9.8[35]. I am trying to substitute the integer within brackets pattern
(so all of [35], including brackets) with an empty string using the regex [\[+0-9*\]+] but this also matches with numbers not surrounded by brackets. Am I not using the + quantifier properly?
You could match the format of the number and then match one or more digits between square brackets.
In the replacement using the first capturing group r'\1'
\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]
\b Word boundary
( Capture group 1
[0-9]+ Match 1+ digits
(?:\.[0-9]+)+ Match a . and 1+ digits and repeat that 1 or more times
) Close group
\[[0-9]+\] Match 1+ digits between square brackets
Regex demo
For example
import re
regex = r"\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]"
test_str = "version = 10.9.8[35]"
result = re.sub(regex, r'\1', test_str)
print (result)
Output
version = 10.9.8
No need for regex
s = '10.9.8[35]'
t = s[:s.rfind("[")]
print(t)
But if you insist ;-)
import re
s = '10.9.8[35]'
t = re.sub(r"^(.*?)[[]\d+[]]$", r"\1", s)
print(t)
Breakdown of regex:
^ - begins with
() - Capture Group 1 you want to keep
.*? - Any number of chars (non-greedy)
[[] - an opening [
\d+ 1+ digit
[]] - closing ]
$ - ends with
\1 - capture group 1 - used in replace part of regex replace. The bit you want to keep.
Output in both cases:
10.9.8
Use regex101.com to familiarise yourself more. If you click on any of the regex samples at bottom right of the website, it will give you more info. You can also use it to generate regex code in a variety of languages too. (not good for Java though!).
There's also a great series of Python regex videos on Youtube by PyMoondra.
A simpler regex solution:
import re
pattern = re.compile(r'\[\d+\]$')
s = '10.9.8[35]'
r = pattern.sub('', s)
print(r) # 10.9.8
The pattern matches square brackets at the end of a string with one or more number inside. The sub then replaces the square brackets and number with an empty string.
If you wanted to use the number in the square brackets just change the sub expression such as:
import re
pattern = re.compile(r'\[(\d+)\]$')
s = '10.9.8[35]'
r = pattern.sub(r'.\1', s)
print(r) # 10.9.8.35
Alternatively as said by the other answer you can just find it and splice to get rid of it.

Regex similar to a Hearst Pattern in Python

I'm trying to come up with a regex similiar to the ones listed here for Hearst Patterns in order to get the following results:
NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
Doing re.search(regex, sentence) for each of this sentences I want to match this 2 groupsNP_The_Eleventh_Air_Force NP_a_Numbered_Air_Force
This is my attempt but it doesn't get any matches:
(NP_\\w+ (, )?is (NP_\\w+ ?))
In both sentences I think (, )? is not present, but the part before between parenthesis is so you could make that part optional instead.
Also move the last parenthesis from )) to (NP_\w+) to create the first group.
The pattern including the optional comma and space could be:
(NP_\w+)(?: \([^()]+\))? (?:, )?is (NP_\w+ ?)
Regex demo
If you don't need the space at the end and the comma space is not present, you pattern could be:
(NP_\w+)(?: \([^()]+\))? is (NP_\w+)
(NP_\w+) Capture group 1 Match NP_ and 1+ word chars
(?: \([^()]+\))? Optionally match a space and a part with parenthesis
is Match literally
(NP_\w+) Capture group 2 Match NP_ and 1+ word chars
See a regex demo | Python demo
For example
import re
regex = r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)"
test_str = "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF)."
matches = re.search(regex, test_str)
if matches:
print(matches.group(1))
print(matches.group(2))
Output
NP_The_Eleventh_Air_Force
NP_a_Numbered_Air_Force
I got one, quite simple:
regex = r"NP.\w+ ?Forces?\b
You can see how it works out, it's a online tool to write and test regex for multiple languages:
https://regex101.com/r/KKH3D3/1/

Categories

Resources