Extract a string between two sets of patterns in Python

I am trying to extract a substring between two sets of patterns using re.search().
On the left, there can be either 0x or 0X, and on the right there can be either U, a space, or \n. The result should not contain the boundary patterns. For example, 0x1234U should result in 1234.
I tried the following search pattern: (0x|0X)(.*)(U| |\n), but it includes the left and right patterns in the result.
What would be the correct search pattern?

You could also use a single capture group and read it with .group(1):
0[xX](.*?)[U\s]
The pattern matches:
0[xX] Match either 0x or 0X
(.*?) Capture in group 1 any character except a newline, as few times as possible
[U\s] Match either U or a whitespace character (which also matches a newline)
import re
s = r"0x1234U"
pattern = r"0[xX](.*?)[U\s]"
m = re.search(pattern, s)
if m:
    print(m.group(1))
Output
1234
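If the value between the markers is always hexadecimal, the capture can be narrowed to hex digits and applied to a whole text at once. This is a minimal sketch under that assumption; the names HEX_VALUE and extract_hex are made up for illustration, and the lookahead is used here so the terminator is not consumed.

```python
import re

# Assumption: the value between the markers is a run of hex digits.
HEX_VALUE = re.compile(r"0[xX]([0-9a-fA-F]+)(?=[U\s])")

def extract_hex(text):
    """Return every hex payload that is followed by U or whitespace."""
    return [m.group(1) for m in HEX_VALUE.finditer(text)]

print(extract_hex("0x1234U 0XbeefU\n0x12 "))  # ['1234', 'beef', '12']
print(extract_hex("0x1234"))                   # [] (no terminator follows)
```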

You could use a combination of lookbehind and lookahead with a non-greedy match pattern in between:
import re
pattern = r"(?<=0[xX])(.*?)(?=[U\s])"
print(re.findall(pattern, "---0x1234U...0X456a "))
# ['1234', '456a']
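With lookarounds the boundaries are not part of the match, so the whole match is already the value and its span points straight at it. A small sketch of that, using a hypothetical helper name values_with_spans:

```python
import re

# Lookarounds keep 0x/0X and the terminator out of the match itself.
BETWEEN = re.compile(r"(?<=0[xX]).*?(?=[U\s])")

def values_with_spans(text):
    """Return (value, start, end) for each match; indices refer to text."""
    return [(m.group(), m.start(), m.end()) for m in BETWEEN.finditer(text)]

print(values_with_spans("---0x1234U...0X456a "))
# [('1234', 5, 9), ('456a', 15, 19)]
```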

Related

How to return full substring from partial substring match in python as a list?

I have strings of different lengths that have to be checked for substrings matching the patterns "tion", "ex", "ph", "ost", "ast", "ist", ignoring case and position (prefix, suffix, or middle of a word). The matching words have to be returned in a new list, rather than only the matching substrings. With the code below I get a list of the matching substrings, not the full words.
def latin_ish_words(text):
    import re
    pattern=re.compile(r"tion|ex|ph|ost|ast|ist")
    matches=pattern.findall(text)
    return matches
latin_ish_words("This functions as expected")
The result is: ['tion', 'ex']
How can I return the whole word rather than the matching substring in the new list?

You can use any of the following:
pattern=re.compile(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*")
pattern=re.compile(r"[a-zA-Z]*?(?:tion|ex|ph|ost|ast|ist)[a-zA-Z]*")
pattern=re.compile(r"[^\W\d_]*?(?:tion|ex|ph|ost|ast|ist)[^\W\d_]*")
The regex matches
\w*? - zero or more but as few as possible word chars
(?:tion|ex|ph|ost|ast|ist) - one of the strings
\w* - zero or more but as many as possible word chars
The [a-zA-Z] part will match only ASCII letters, and [^\W\d_] will match any Unicode letters.
Mind the use of the non-capturing group with re.findall, as otherwise, the captured substrings will also get their way into the output list.
If you need to only match letter words, and you need to match them as whole words, add word boundaries, r"\b[a-zA-Z]*?(?:tion|ex|ph|ost|ast|ist)[a-zA-Z]*\b".
For example:
import re
def latin_ish_words(text):
    pattern=re.compile(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*")
    return pattern.findall(text)
print(latin_ish_words("This functions as expected"))
# => ['functions', 'expected']
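The difference between the three character sets only shows up on input with digits, underscores, or non-ASCII letters. The following sketch runs all three on a made-up sample string (the dictionary keys and the sample text are illustrative, not from the original post):

```python
import re

ALTS = r"(?:tion|ex|ph|ost|ast|ist)"
patterns = {
    "word chars": re.compile(r"\w*?" + ALTS + r"\w*"),
    "ASCII letters": re.compile(r"[a-zA-Z]*?" + ALTS + r"[a-zA-Z]*"),
    "Unicode letters": re.compile(r"[^\W\d_]*?" + ALTS + r"[^\W\d_]*"),
}

# Sample text: one accented letter, one underscore and one digit.
text = "expos\u00e9 post_2 photo"
results = {name: p.findall(text) for name, p in patterns.items()}
for name, found in results.items():
    print(name, found)
# word chars ['exposé', 'post_2', 'photo']
# ASCII letters ['expos', 'post', 'photo']
# Unicode letters ['exposé', 'post', 'photo']
```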

Regarding "ignoring the case":
pattern=re.compile(r"tion|ex|ph|ost|ast|ist")
matches=pattern.findall(text)
does not do that. Consider the following example:
import re
pattern=re.compile(r"tion|ex|ph|ost|ast|ist")
text = "SCREAMING TEXT"
print(pattern.findall(text))
output
[]
even though the text contains EX. Add the re.IGNORECASE flag like so:
import re
pattern=re.compile(r"tion|ex|ph|ost|ast|ist", re.IGNORECASE)
text = "SCREAMING TEXT"
print(pattern.findall(text))
output
['EX']
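The flag only fixes the case handling; the result is still the matched fragment. To get whole words, as the question asks, the same flag can be combined with optional word characters on both sides. A minimal sketch, assuming \w is an acceptable definition of a word character here:

```python
import re

WORDS = re.compile(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*", re.IGNORECASE)

def latin_ish_words(text):
    """Return every word that contains one of the fragments, in any case."""
    return WORDS.findall(text)

print(latin_ish_words("SCREAMING TEXT about Phones"))  # ['TEXT', 'Phones']
```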

For a case insensitive match with whitespace boundaries you could use:
(?i)(?<!\S)\w*(?:tion|ex|ph|[oia]st)\w*(?!\S)
The pattern matches:
(?i) Inline modifier for a case insensitive match (Or use re.I)
(?<!\S) Assert a whitespace boundary to the left
\w* Match optional word characters
(?: Non capture group
tion|ex|ph|[oia]st Match either tion, ex, ph, or one of ost, ist, ast using a character class
) Close non capture group
\w* Match optional word characters
(?!\S) Assert a whitespace boundary to the right
def latin_ish_words(text):
    import re
    pattern = r"(?i)(?<!\S)\w*(?:tion|ex|ph|[oia]st)\w*(?!\S)"
    return re.findall(pattern, text)
print(latin_ish_words("This functions as expected"))
Output
['functions', 'expected']
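Whitespace boundaries are stricter than \b: a word followed directly by punctuation is skipped. The sketch below contrasts the two on a made-up sentence; the \b variant is an assumption added for comparison, not part of the answer above.

```python
import re

WS_BOUNDED = re.compile(r"(?i)(?<!\S)\w*(?:tion|ex|ph|[oia]st)\w*(?!\S)")
# Hypothetical variant for comparison: word boundaries instead of whitespace.
WORD_BOUNDED = re.compile(r"(?i)\b\w*(?:tion|ex|ph|[oia]st)\w*\b")

text = "An expected function, as expected."
ws_result = WS_BOUNDED.findall(text)
word_result = WORD_BOUNDED.findall(text)
print(ws_result)    # ['expected']
print(word_result)  # ['expected', 'function', 'expected']
```

Which one is preferable depends on whether "function," and "expected." should count as matching words.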

Searching for a pattern in a sentence with regex in python

I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = "The special code is 034567 in this particular case and not 98675"
In this example, I am interested in capturing the number 034567, which comes after the phrase special code, and also the start and end index of the number 034567.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?

Your expression requires a literal space followed by another whitespace character (\s) after "code", then \w. matches one word character followed by any character other than a line break, and then \s followed by a literal space again requires two whitespace characters. The text has only single spaces between the words, so nothing matches.
Instead, match one or more whitespace characters between words with \s+, and match any run of non-whitespace characters with \S+ rather than \w..
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
    print(m.group(1)) # 034567
    print(m.span(1)) # (20, 26)
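The span of group 1 can be used to slice the original text, which is a quick way to check the indices. A short sketch, with find_code as a hypothetical helper name:

```python
import re

CODE = re.compile(r"special code\s+\S+\s+(\d+)")

def find_code(text):
    """Return (digits, start, end) of the number after the phrase, or None."""
    m = CODE.search(text)
    if m is None:
        return None
    start, end = m.span(1)
    return m.group(1), start, end

text = "The special code is 034567 in this particular case and not 98675"
digits, start, end = find_code(text)
print(digits, start, end)         # 034567 20 26
print(text[start:end] == digits)  # True
```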

Use re.findall with a capture group:
import re
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']
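Because the word between the phrase and the number is optional in this pattern, it also accepts a number that follows the phrase directly. A small sketch on made-up inputs; note that findall returns only the captured digits, not their positions:

```python
import re

PHRASE = re.compile(r"\bspecial code (?:\S+\s+)?(\d+)")

direct_and_spaced = PHRASE.findall("special code 42 and special code is 7")
too_far = PHRASE.findall("special code is not 7")
print(direct_and_spaced)  # ['42', '7']
print(too_far)            # [] (two words between phrase and number)
```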

Regex : matching integers inside of brackets

I am trying to take off bracketed ends of strings such as version = 10.9.8[35]. I am trying to substitute the integer within brackets pattern
(so all of [35], including brackets) with an empty string using the regex [\[+0-9*\]+] but this also matches with numbers not surrounded by brackets. Am I not using the + quantifier properly?

You could match the format of the number and then match one or more digits between square brackets.
In the replacement, use the first capturing group, r'\1'.
\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]
\b Word boundary
( Capture group 1
[0-9]+ Match 1+ digits
(?:\.[0-9]+)+ Match a . and 1+ digits and repeat that 1 or more times
) Close group
\[[0-9]+\] Match 1+ digits between square brackets
For example
import re
regex = r"\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]"
test_str = "version = 10.9.8[35]"
result = re.sub(regex, r'\1', test_str)
print(result)
Output
version = 10.9.8
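As for why the original attempt over-matches: [\[+0-9*\]+] is one character class, so the + and * inside it are literal characters rather than quantifiers, and the pattern matches any single bracket, digit, plus, or asterisk. The sketch below shows that next to the pattern from this answer; strip_suffix is a hypothetical helper name.

```python
import re

text = "version = 10.9.8[35]"

# The question's pattern is a single character class: one character per match.
print(re.findall(r"[\[+0-9*\]+]", text))
# ['1', '0', '9', '8', '[', '3', '5', ']']

VERSION_SUFFIX = re.compile(r"\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]")

def strip_suffix(s):
    """Drop a bracketed integer that directly follows a dotted number."""
    return VERSION_SUFFIX.sub(r"\1", s)

print(strip_suffix(text))                # version = 10.9.8
print(strip_suffix("items[35] in 2.0"))  # items[35] in 2.0
```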

No need for regex
s = '10.9.8[35]'
t = s[:s.rfind("[")]
print(t)
But if you insist ;-)
import re
s = '10.9.8[35]'
t = re.sub(r"^(.*?)[[]\d+[]]$", r"\1", s)
print(t)
Breakdown of regex:
^ - begins with
() - Capture Group 1 you want to keep
.*? - Any number of chars (non-greedy)
[[] - an opening [
\d+ 1+ digit
[]] - closing ]
$ - ends with
\1 - capture group 1 - used in replace part of regex replace. The bit you want to keep.
Output in both cases:
10.9.8
Use regex101.com to familiarise yourself more. If you click on any of the regex samples at bottom right of the website, it will give you more info. You can also use it to generate regex code in a variety of languages (not good for Java though!).
There's also a great series of Python regex videos on YouTube by PyMoondra.
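One caveat with s[:s.rfind("[")]: rfind returns -1 when there is no bracket, so the slice silently drops the last character. A guarded non-regex variant could look like the sketch below; strip_bracket_suffix is a hypothetical name, and it assumes the suffix to remove is digits only.

```python
def strip_bracket_suffix(s):
    """Remove a trailing [digits] suffix; return s unchanged otherwise."""
    head, sep, tail = s.rpartition("[")
    # sep is empty when there is no "[" at all.
    if sep and tail.endswith("]") and tail[:-1].isdigit():
        return head
    return s

print(strip_bracket_suffix("10.9.8[35]"))  # 10.9.8
print(strip_bracket_suffix("10.9.8"))      # 10.9.8
print("10.9.8"[:"10.9.8".rfind("[")])      # 10.9. (the unguarded slice)
```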

A simpler regex solution:
import re
pattern = re.compile(r'\[\d+\]$')
s = '10.9.8[35]'
r = pattern.sub('', s)
print(r) # 10.9.8
The pattern matches square brackets at the end of a string with one or more digits inside. The sub then replaces the square brackets and number with an empty string.
If you wanted to use the number in the square brackets just change the sub expression such as:
import re
pattern = re.compile(r'\[(\d+)\]$')
s = '10.9.8[35]'
r = pattern.sub(r'.\1', s)
print(r) # 10.9.8.35
Alternatively, as the other answer says, you can just find the bracket and slice the string to get rid of it.

Regex similar to a Hearst Pattern in Python

I'm trying to come up with a regex similar to the ones listed here for Hearst Patterns in order to get the following results:
NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
Doing re.search(regex, sentence) for each of these sentences, I want to match these 2 groups: NP_The_Eleventh_Air_Force and NP_a_Numbered_Air_Force
This is my attempt but it doesn't get any matches:
(NP_\\w+ (, )?is (NP_\\w+ ?))

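In both sentences, the (, )? part is not present, but a part between parentheses can appear before "is" (as in the second sentence), so you could make that part optional instead.
Also move the last closing parenthesis from the end of the pattern to directly after the first NP_\w+, giving (NP_\w+), to create the first group.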
The pattern including the optional comma and space could be:
(NP_\w+)(?: \([^()]+\))? (?:, )?is (NP_\w+ ?)
If you don't need the space at the end and the comma and space are not present, your pattern could be:
(NP_\w+)(?: \([^()]+\))? is (NP_\w+)
(NP_\w+) Capture group 1 Match NP_ and 1+ word chars
(?: \([^()]+\))? Optionally match a space and a part with parenthesis
is Match literally
(NP_\w+) Capture group 2 Match NP_ and 1+ word chars
For example
import re
regex = r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)"
test_str = "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF)."
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
    print(matches.group(2))
Output
NP_The_Eleventh_Air_Force
NP_a_Numbered_Air_Force
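The same pattern can be checked against both sentences from the question, including the one with the parenthesized abbreviation. A short sketch, with is_a_pair as a hypothetical helper name:

```python
import re

IS_A = re.compile(r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)")

def is_a_pair(sentence):
    """Return the two noun phrases around "is", or None if there is no match."""
    m = IS_A.search(sentence)
    return m.groups() if m else None

sentences = [
    "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of "
    "NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).",
    "NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of "
    "NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).",
]
pairs = [is_a_pair(s) for s in sentences]
for pair in pairs:
    print(pair)
# ('NP_The_Eleventh_Air_Force', 'NP_a_Numbered_Air_Force') for both sentences
```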

I got one, quite simple:
regex = r"NP.\w+ ?Forces?\b"
You can see how it works on regex101, an online tool for writing and testing regexes in multiple languages:
https://regex101.com/r/KKH3D3/1/

python regular expression : how to remove all punctuation characters from a string but keep those between numbers?

I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, keeping only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept, while the equals sign after 123 should be removed.
Here is my Python script.
import re
s = "中国，中，。》％国foo中¥国bar#中123=国％中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)
the expected output should be
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an extra equals sign in the output.

Your regex matches "=" against [^\u4e00-\u9fff0-9a-zA-Z]+, which succeeds. For the character to be removed, the lookbehind (?<=[^0-9]) and the lookahead (?=[^0-9]) must both succeed as well, i.e. there must be a non-digit on both sides. Here "=" is preceded by the digit 3, so the lookbehind fails and the character is kept. This means your code actually keeps any non-alphanumeric, non-Chinese characters that have a digit on either side.
You can try the following regex:
u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
You can use it as such:
import re
s = "中国，中，。》％国foo中¥国bar#中123=国％中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(''.join(res))

I suggest matching and capturing these characters in between digits (to restore them later in the output), and just match them in other contexts.
In Python 2, it will look like
import re
s = u"中国，中，。》％国foo中¥国bar#中123=国％中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+'
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"", s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国
If you need to preserve those symbols inside any Unicode digits, you need to replace [0-9] with \d and pass the re.UNICODE flag to the regex.
The regex will look like
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+
It works like this:
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+) - Group 1 capturing
[0-9]+ - 1+ digits
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
[0-9]+ - 1+ digits
| - or
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
In Python 2.x, when a group does not participate in the match, x.group(1) is None and a \1 backreference in the re.sub replacement string raises an error, which is why a lambda expression is required to check whether Group 1 matched first.
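For Python 3, the same approach could be sketched as follows. The lambda form works on any Python 3 version; the plain r"\1" template relies on the documented behaviour since Python 3.5 that unmatched groups are replaced with an empty string.

```python
import re

s = "中国，中，。》％国foo中¥国bar#中123=国％中国12-34中国"
pat_block = "[^\u4e00-\u9fff0-9a-zA-Z]+"
pattern = re.compile("([0-9]+{0}[0-9]+)|{0}".format(pat_block))

# Keep group 1 (digits, symbols, digits) when it matched, else drop the match.
res = pattern.sub(lambda m: m.group(1) or "", s)
# Python 3.5+ only: an unmatched group in the template becomes "".
res_template = pattern.sub(r"\1", s)

print(res)                  # 中国中国foo中国bar中123国中国12-34中国
print(res == res_template)  # True
```

One limitation to be aware of: the digits on the right of a kept symbol are consumed by the match, so in a chain such as 12-34-56 only the first hyphen survives and the result is 12-3456.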
