Python Regex: To capture all words within nested parentheses - python

I am trying to extract all words within nested parentheses by using regex. Here is an example of my .txt file:
hello ((
(alpha123_4rf)
45beta_Frank))
Red5Great_Sam_Fun
I have tried this with regex:
r'[\((?\(??(^\()?\))]'
but have not been able to get the desired output. I want my output to be like this:
((
(alpha123_4rf)
45beta_Frank))
What am I doing wrong? Any help is greatly appreciated!

Try this pattern (?s)\([^(]*\((.+)\)[^)]*\)
Explanation:
(?s) - flag: single-line (DOTALL) mode, so . also matches newline characters
\( - match ( literally
[^(]* - match zero or more characters other than (
\( - match ( literally
(.+) - match one or more of any characters and store them in the first capturing group
\) - match ) literally
[^)]* - match zero or more characters other than )
\) - match ) literally
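To try the pattern in Python, here is a minimal sketch. It assumes the file content has already been read into a string (the variable names text, whole, inner and words are only illustrative). The whole match, group(0), keeps the outer parentheses, while group(1) holds only what sits between them.

```python
import re

# Sample text from the question, assumed to be already read from the .txt file
text = "hello ((\n(alpha123_4rf)\n45beta_Frank))\nRed5Great_Sam_Fun"

pattern = re.compile(r"(?s)\([^(]*\((.+)\)[^)]*\)")
m = pattern.search(text)

whole = m.group(0) if m else ""   # the match including the outer parentheses
inner = m.group(1) if m else ""   # only the text captured by (.+)

# If just the words are needed, pull them out of the matched block
words = re.findall(r"\w+", whole)

print(whole)
print(words)
```

If the input can contain several such blocks, the greedy (.+) will span from the first block to the last one, so this sketch only fits inputs with a single nested block.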

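If the parentheses directly follow each other, this simpler solution would also work:
import re

def find_brackets(text):
    # (?s) lets . match newlines; the raw string keeps the backslashes intact
    rx = r"(?s)\(\((.+)\)\)"
    z = re.search(rx, text)
    if z:
        return z[0]  # the whole match, including the outer parentheses
    else:
        return ''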

Related

regex doesn't seem to work on the input given as expected

My regex doesn't seem to work as expected, can someone help me fixing it?
import re
a = """
xyz # (.C (0),
.H (1)
)
mv [F-1:0] (/*AUTOINST*/
except_check
#(
.a (m),
.b (w),
.c (x),
.d (1),
.e (1)
)
data_check
(// Outputs
abc
#(
.a (b::c)
)
mask
(/*AUTOINST*/
"""
op = re.findall(r'^\s*(\w+)\s*$\n(?:^\s*[^\w\s].*$\n)*^\s*(\w+)\s*\(', a, re.MULTILINE)
for i in op:
    print(i)
This is the output I get:
('except_check', 'data_check')
('abc', 'mask')
This is the expected output:
('xyz', 'mv')
('except_check', 'data_check')
('abc', 'mask')
Somehow, the regex doesn't work for first block of input and works fine for other two blocks of input.
"(\w+)\s+#\s?(\D*\S*\D*\s*\d?\W+)\s*(\w+)"gm
The pattern above works with the global and multiline flags (the trailing gm); you can simplify it further.
Here is a regex with the minimal changes:
^\s*(\w+)(?:\s*[^\w\s].*$\n)*^\s*(\w+)[^()]*\(
See the regex demo.
The \s*$\n(?:^\s*[^\w\s] part is replaced with (?:\s*[^\w\s], as the first word in your first block is not followed by a line break.
At the end, \s*\( is replaced with [^()]*\( because there are chars other than whitespace between the word you want to extract and the ( char.
Details:
^ - start of a line (granted you use re.M)
\s* - zero or more whitespaces
(\w+) - Group 1: one or more word chars
(?:\s*[^\w\s].*$\n)* - zero or more occurrences of zero or more whitespaces, a char that is neither a word char nor whitespace, the rest of the line and an LF char
^ - start of a line
\s* - zero or more whitespaces
(\w+) - Group 2: one or more word chars
[^()]* - zero or more chars other than ( and )
\( - a ( char.
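As a quick check of the explanation above, the following sketch runs both the original pattern and the adjusted one against the input from the question (with the indentation as shown there). The variable names original and fixed are only illustrative.

```python
import re

a = """
xyz # (.C (0),
.H (1)
)
mv [F-1:0] (/*AUTOINST*/
except_check
#(
.a (m),
.b (w),
.c (x),
.d (1),
.e (1)
)
data_check
(// Outputs
abc
#(
.a (b::c)
)
mask
(/*AUTOINST*/
"""

# Pattern from the question: requires a line break right after the first word
original = re.findall(
    r'^\s*(\w+)\s*$\n(?:^\s*[^\w\s].*$\n)*^\s*(\w+)\s*\(', a, re.MULTILINE)

# Adjusted pattern: the first word may be followed by "# (" on the same line,
# and the second word may be followed by non-parenthesis chars before "("
fixed = re.findall(
    r'^\s*(\w+)(?:\s*[^\w\s].*$\n)*^\s*(\w+)[^()]*\(', a, re.MULTILINE)

for pair in fixed:
    print(pair)
```

Note that [^()]* also matches line breaks, which is what lets the second word and its ( sit on different lines.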
Or, I think you can leverage the recursion feature available in the PyPI regex module. Run pip install regex in the terminal/console and then use
import regex
a = 'your_string_here'
rx = r'^\s*(\w+)\s*#\s*(\((?:[^()]++|(?2))*\))\s*(\w+)'
matches = [(x.group(1), x.group(3)) for x in regex.finditer(rx, a, regex.M)]
Here is the regex demo. It matches:
^ - start of a line
\s* - zero or more whitespaces
(\w+) - Group 1: one or more word chars
\s*#\s* - a # enclosed with zero or more whitespaces
(\((?:[^()]++|(?2))*\)) - Group 2: a ( char, then any zero or more occurrences of any one or more chars other than ( and ) or Group 2 pattern, and then a )
\s* - zero or more whitespaces
(\w+) - Group 3: one or more word chars.

What does this regex pattern match?

regex = re.compile(r"\s*[-*+]\s*(.+)")
Especially this part: \s*[-*+]
I want to match this string:
[John](person)is good and [Mary](person) is good too.
But it fails.
Does the \s*[-*+] mean the following:
matches an optional space, followed by one of the characters: -, *, +
This is in Python.
Pattern \s*[-*+]\s*(.+) means:
\s* - match zero or more whitespaces
[-*+] - match one character from the set: - or * or +
\s* - match zero or more whitespaces
(.+) - match one or more of any characters and store it inside capturing group (. means any character and brackets denote capturing group)
In your sentence, the pattern won't match anything because the sentence contains none of the characters from the set -*+.
It would match, for example * (person) is good too. in
[John](person)is good and [Mary] * (person) is good too.
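A small sketch to confirm this behaviour with re.search (the variable names are only illustrative). Note that the leading \s* also consumes the space in front of the *, so the whole match starts one character earlier than the * itself.

```python
import re

regex = re.compile(r"\s*[-*+]\s*(.+)")

no_bullet = "[John](person)is good and [Mary](person) is good too."
with_star = "[John](person)is good and [Mary] * (person) is good too."

# None of -, *, + occurs in the first sentence, so there is no match
first = regex.search(no_bullet)

# The second sentence contains a *, so the pattern matches from there on
second = regex.search(with_star)

print(first)
print(repr(second.group(0)))
print(repr(second.group(1)))
```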
In order to match names and their description in brackets use \[([^\]]+)\]\(([^)]+)
Explanation:
\[ - match [ literally
([^\]]+) - match one or more characters other than ] and store them in the first capturing group
\] - match ] literally
\( - match ( literally
([^)]+) - match one or more characters other than ) and store them in the second capturing group
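In Python, this pattern can be used with re.findall, which returns one (name, description) tuple per match because the pattern has two capturing groups. A minimal sketch, with illustrative variable names:

```python
import re

sentence = "[John](person)is good and [Mary](person) is good too."

# Group 1: text inside [...], group 2: text inside the following (...)
pairs = re.findall(r"\[([^\]]+)\]\(([^)]+)", sentence)

for name, description in pairs:
    print(name, description)
```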

Or condition for literal string in regex expression

I have the following regex expression
re.findall('\(([0-9].*?)\)', a[a.find('('):].strip())
defined for strings like
asdasdasd (21345-asdasdasd)
to retrieve what is inside the parentheses when it starts with a number. But I also want to be able to retrieve the content when it starts with the string 'NA', like:
asdasdasd (NA-asdasdasd)
I've tried:
re.findall('\(([0-9].*?)\)|\((NA.*?)\)', a[a.find('('):].strip())
but it produces tuples. How should I write it? Thank you in advance!
You may capture the substring between parentheses when the text inside starts with digits or NA, followed by - and then any chars other than ), using
re.findall(r'\(((?:[0-9]+|NA)-[^)]*)\)', a)
See the regex demo.
Details
\( - a (
((?:[0-9]+|NA)-[^)]*) - Capturing group (this value will be returned by re.findall):
    (?:[0-9]+|NA) - 1 or more digits or NA
    - - a hyphen
    [^)]* - 0+ chars other than )
\) - a ) char.
See the Python demo:
import re
strs = ['asdasdasd (21345-asdasdasd)', 'asdasdasd (NA-asdasdasd)']
for s in strs:
    print(re.findall(r'\(((?:[0-9]+|NA)-[^)]*)\)', s))
Output:
['21345-asdasdasd']
['NA-asdasdasd']
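As for why the attempt in the question produces tuples: when a pattern has more than one capturing group, re.findall returns a tuple per match with one slot per group, and groups that did not take part in the match show up as empty strings. The sketch below contrasts that with a single-group variant that stays close to the question's original pattern; one_group is only an illustration, not the pattern from the answer above.

```python
import re

strs = ['asdasdasd (21345-asdasdasd)', 'asdasdasd (NA-asdasdasd)']

# Two capturing groups: each result is a 2-tuple, one slot left empty
two_groups = r'\(([0-9].*?)\)|\((NA.*?)\)'
tuples = [re.findall(two_groups, s) for s in strs]

# One capturing group with a non-capturing alternation inside: plain strings
one_group = r'\(((?:[0-9]|NA).*?)\)'
flat = [re.findall(one_group, s) for s in strs]

print(tuples)
print(flat)
```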

regex of symbolic expression grouped

In Python, I am trying to parse an expression like this with a regex:
function_1(param_1,param_2,param_3)+function_2(param_4,param_5)*function_3(param_6)+function_4()-function_5(param_7,param_8,param_9,param_10)
I am using this regex
(?P<perf_name>\w*?)\((?P<perf_param>[\w]+)*(?:,*(?P<perf_param2>[\w]+)?)*\)
but I'm stuck because so far I can't get the params that are not next to a bracket (param_2, param_8 and param_9).
Plus, I am pretty sure there is some solution that would let me use a single perf_param group instead of the two groups perf_param and perf_param2.
Any ideas?
You should do that in 2 steps:
(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)
This regex will get you the name and params as two groups. Then, just split the second group with ,.
import re
p = re.compile(r'(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)')
s = "function_1(param_1,param_2,param_3)+function_2(param_4,param_5)*function_3(param_6)+function_4()-function_5(param_7,param_8,param_9,param_10)"
res = [(x.group("perf_name"), x.group("perf_params").split(",")) for x in p.finditer(s)]
print(res)
# => [('function_1', ['param_1', 'param_2', 'param_3']), ('function_2', ['param_4', 'param_5']), ('function_3', ['param_6']), ('function_4', ['']), ('function_5', ['param_7', 'param_8', 'param_9', 'param_10'])]
See the Python demo
The regex matches:
(?P<perf_name>\w*) - 0 or more alphanumeric/underscore characters
\( - a literal (
(?P<perf_params>\w*(?:,\w+)*) - 0+ word characters (\w*) followed with 0+ sequences of a comma and 1+ word characters
\) - closing ).
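One detail of the two-step approach: for a call without arguments, such as function_4(), splitting the empty params string gives [''] rather than an empty list. If that matters, a small guard handles it. This is only a sketch on a shortened input; the calls dict is an illustrative name.

```python
import re

p = re.compile(r'(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)')
s = "function_3(param_6)+function_4()-function_5(param_7,param_8)"

calls = {}
for m in p.finditer(s):
    raw = m.group("perf_params")
    # ''.split(',') would give [''], so map an empty params string to []
    calls[m.group("perf_name")] = raw.split(",") if raw else []

print(calls)
```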

Match first parenthesis with Python

From a string such as
70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30
I want to get the first parenthesized content linux;u;android4.2.1;zh-cn.
My code looks like this:
s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
re.search("(\d+)\s.+\((\S+)\)", s).group(2)
but the result is the last brackets' contents khtml,likegecko.
How to solve this?
The main issue you have is the greedy dot matching .+ pattern. It grabs the whole string you have, and then backtracks, yielding one character from the right at a time, trying to accommodate for the subsequent patterns. Thus, it matches the last parentheses.
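To see the backtracking effect, one can wrap the .+ in an extra capturing group and look at what it ended up consuming. This is just a diagnostic sketch; the extra group and the names consumed and captured are not part of the original code.

```python
import re

s = ("70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30"
     "(khtml,likegecko)version/4.0mobilesafari/534.30")

# Same pattern as in the question, with .+ wrapped in a group for inspection
m = re.search(r"(\d+)\s(.+)\((\S+)\)", s)

consumed = m.group(2)   # what the greedy .+ kept after backtracking
captured = m.group(3)   # what ended up inside the parenthesized group

print(consumed)
print(captured)
```

The greedy .+ swallows the first pair of parentheses and gives back only as much as is needed for the rest of the pattern to match, which is why the last pair is the one that gets captured.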
You can use
^(\d+)\s[^(]+\(([^()]+)\)
See the regex demo. Here, the [^(]+ restricts the matching to the characters other than ( (so, it cannot grab the whole line up to the end) and get to the first pair of parentheses.
Pattern explanation:
^ - string start (NOTE: If the number appears not at the start of the string, remove this ^ anchor)
(\d+) - Group 1: 1 or more digits
\s - a whitespace (if it is not a required character, it can be removed since the subsequent negated character class will match the space)
[^(]+ - 1+ characters other than (
\( - a literal (
([^()]+) - Group 2 matching 1+ characters other than ( and )
\) - closing ).
Here is the IDEONE demo:
import re
p = re.compile(r'^(\d+)\s[^(]+\(([^()]+)\)')
test_str = "70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30"
print(p.findall(test_str))
# or using re.search if the number is not at the beginning of the string
m = re.search(r'(\d+)\s[^(]+\(([^()]+)\)', test_str)
if m:
    print("Number: {0}\nString: {1}".format(m.group(1), m.group(2)))
# [('70849', 'linux;u;android4.2.1;zh-cn')]
# Number: 70849
# String: linux;u;android4.2.1;zh-cn
You can use a negated class \(([^)]*)\) to match anything between ( and ):
>>> import re
>>> s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
>>> m = re.search(r"(\d+)[^(]*\(([^)]*)\)", s)
>>> print(m.group(1))
70849
>>> print(m.group(2))
linux;u;android4.2.1;zh-cn
