In Python, I am trying to write a regex for an expression like this:
function_1(param_1,param_2,param_3)+function_2(param_4,param_5)*function_3(param_6)+function_4()-function_5(param_7,param_8,param_9,param_10)
I am using this regex
(?P<perf_name>\w*?)\((?P<perf_param>[\w]+)*(?:,*(?P<perf_param2>[\w]+)?)*\)
but I'm stuck because so far I can't get the params that are not next to a bracket (param_2, param_8 and param_9).
Plus, I am pretty sure there is some solution that would let me use a single perf_param group instead of the two groups perf_param and perf_param2.
Any ideas?
You should do that in 2 steps:
(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)
This regex will get you the name and params as two groups. Then, just split the second group with ,.
import re
p = re.compile(r'(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)')
s = "function_1(param_1,param_2,param_3)+function_2(param_4,param_5)*function_3(param_6)+function_4()-function_5(param_7,param_8,param_9,param_10)"
res = [(x.group("perf_name"), x.group("perf_params").split(",")) for x in p.finditer(s)]
print(res)
# => [('function_1', ['param_1', 'param_2', 'param_3']), ('function_2', ['param_4', 'param_5']), ('function_3', ['param_6']), ('function_4', ['']), ('function_5', ['param_7', 'param_8', 'param_9', 'param_10'])]
See the Python demo
The regex matches:
(?P<perf_name>\w*) - 0 or more alphanumeric/underscore characters
\( - a literal (
(?P<perf_params>\w*(?:,\w+)*) - 0+ word characters (\w*) followed by 0+ sequences of a comma and 1+ word characters ((?:,\w+)*)
\) - closing ).
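As the output above shows, splitting an empty parameter string gives [''] for function_4(). If an empty list is preferred, a minimal sketch could guard the split; the names CALL_RE and parse_calls are made up for this illustration:

```python
import re

# Same pattern as in the answer; CALL_RE and parse_calls are illustrative names.
CALL_RE = re.compile(r'(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)')

def parse_calls(expr):
    """Return (name, [params]) pairs; a call without params gets an empty list."""
    calls = []
    for m in CALL_RE.finditer(expr):
        raw = m.group("perf_params")
        # "".split(",") would give [''], so only split a non-empty string
        params = raw.split(",") if raw else []
        calls.append((m.group("perf_name"), params))
    return calls

print(parse_calls("function_4()-function_5(param_7,param_8)"))
# => [('function_4', []), ('function_5', ['param_7', 'param_8'])]
```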
I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. On the site regexr, the regular expression matches the parts that I want but gives a warning. I'm not an expert in regex, so I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
Regex demo
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
This does not work, however, for the string defined in the OP's code, which contains a question mark; that string differs from the two examples given in the statement of the question.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of the two space characters in a character class merely to make it visible.
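For reference, here is a small runnable sketch of that split applied to the two example strings from the question. The names SPLIT_RE and split_groups are only illustrative:

```python
import re

# Split on spaces that follow ')' or precede '('.
SPLIT_RE = re.compile(r'(?<=\)) +| +(?=\()')

def split_groups(text):
    return SPLIT_RE.split(text)

print(split_groups("(so) what (are you trying to say)"))
# => ['(so)', 'what', '(are you trying to say)']
print(split_groups("what (do you mean)"))
# => ['what', '(do you mean)']
```

Because only the spaces are consumed, the parentheses stay attached to their text.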
First time ever using regular expressions and can't get it working although there's quite a few examples in stackoverflow already.
How can I extract integers which are in a string inside bracket?
Example:
dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x
would return
[43, 32, -99]
A '+' or '-' is okay if it's at the beginning of the brackets, but not if it's in the middle or at the end. If there is a '+' sign at the beginning, it should be dropped (+54 --> 54).
Been trying :
re.findall('\[[-]?\d+\]',str)
but it's not working the way I want.
If you need to fail the match in [ +-34 ] (i.e. if you needn't extract a negative number if there is a + before it) you will need to use
\[\s*(?:\+|(-))?(\d+)\s*]
and when getting a match, concat the Group 1 and Group 2 values. See this regex demo.
Details
\[ - a [ char
\s* - 0+ whitespaces
(?:\+|(-))? - an optional + char or - char (a - is captured into Group 1)
(\d+) - Capturing group 2: 1+ digits
\s* - 0+ whitespaces
] - a ] char.
In Python,
import re
text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x"
numbers_text = [f"{x}{y}" for x, y in re.findall(r'\[\s*(?:\+|(-))?(\d+)\s*]', text)]
numbers = list(map(int, numbers_text))
# => [43, 32, -99] for both
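If rejecting input such as [ +-34 ] is not a requirement, a simpler variant with a single capturing group should be enough. This is only a sketch under that assumption; SIMPLE_RE is a made-up name:

```python
import re

# One capturing group: an optional '+' is skipped, an optional '-' is kept.
SIMPLE_RE = re.compile(r'\[\s*\+?(-?\d+)\s*]')

text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x"
numbers = [int(n) for n in SIMPLE_RE.findall(text)]
print(numbers)  # => [43, 32, -99]

# Unlike the two-group pattern, this one accepts "+-":
print(SIMPLE_RE.findall("[ +-34 ]"))  # => ['-34']
```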
If you want to extract integers from a string the code that I use is this:
def stringToNumber(inputStr):
    myNumberList = []
    for s in inputStr.split():
        newString = ''.join(i for i in s if i.isdigit())
        if (len(newString) != 0):
            myNumberList.append(newString)
    return myNumberList
I hope it works for you.
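A quick check of this helper against the example string from the question may be useful. Note that it returns strings, ignores the brackets and the signs, and joins all digits within one whitespace-separated token, so it does not produce [43, 32, -99]:

```python
def stringToNumber(inputStr):
    # Same logic as the helper above: collect the digits of each token.
    myNumberList = []
    for s in inputStr.split():
        newString = ''.join(i for i in s if i.isdigit())
        if len(newString) != 0:
            myNumberList.append(newString)
    return myNumberList

result = stringToNumber("dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x")
print(result)  # => ['43', '555', '43', '32', '99']
```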
If you've not done so I suggest you switch to the PyPI regex module. Using it here with regex.findall and the following regular expression allows you to extract just what you need.
r'\[ *\+?\K-?\d+(?= *\])'
regex engine <¯\(ツ)/¯> Python code
At the regex tester pass your cursor across the regex for details about individual tokens.
The regex engine performs the following operations.
\[ : match '['
\ * : match 0+ spaces
\+? : optionally match '+'
\K : forget everything matched so far and reset
start of match to current position
-? : optionally match '-'
\d+ : match 1+ digits
(?= *\]) : use positive lookahead to assert the last digit
: matched is followed by 0+ spaces then ']'
I am trying to take off bracketed ends of strings such as version = 10.9.8[35]. I am trying to substitute the integer within brackets pattern
(so all of [35], including brackets) with an empty string using the regex [\[+0-9*\]+] but this also matches with numbers not surrounded by brackets. Am I not using the + quantifier properly?
You could match the format of the number and then match one or more digits between square brackets.
In the replacement, use the first capturing group, r'\1'.
\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]
\b Word boundary
( Capture group 1
[0-9]+ Match 1+ digits
(?:\.[0-9]+)+ Match a . and 1+ digits and repeat that 1 or more times
) Close group
\[[0-9]+\] Match 1+ digits between square brackets
Regex demo
For example
import re
regex = r"\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]"
test_str = "version = 10.9.8[35]"
result = re.sub(regex, r'\1', test_str)
print (result)
Output
version = 10.9.8
No need for regex
s = '10.9.8[35]'
t = s[:s.rfind("[")]
print(t)
But if you insist ;-)
import re
s = '10.9.8[35]'
t = re.sub(r"^(.*?)[[]\d+[]]$", r"\1", s)
print(t)
Breakdown of regex:
^ - begins with
() - Capture Group 1 you want to keep
.*? - Any number of chars (non-greedy)
[[] - an opening [
\d+ 1+ digit
[]] - closing ]
$ - ends with
\1 - capture group 1 - used in replace part of regex replace. The bit you want to keep.
Output in both cases:
10.9.8
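One caveat with the slicing version: str.rfind returns -1 when there is no "[", so s[:s.rfind("[")] would silently drop the last character. A guarded sketch, with strip_bracket_suffix as a hypothetical helper name, might look like this:

```python
def strip_bracket_suffix(s):
    """Drop a trailing "[...]" part; return s unchanged if there is none."""
    idx = s.rfind("[")
    if idx == -1 or not s.endswith("]"):
        return s
    return s[:idx]

print(strip_bracket_suffix("10.9.8[35]"))  # 10.9.8
print(strip_bracket_suffix("10.9.8"))      # 10.9.8

# Without the guard, a string with no bracket loses its last character:
print("10.9.8"[:"10.9.8".rfind("[")])      # 10.9.
```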
Use regex101.com to familiarise yourself more. If you click on any of the regex samples at bottom right of the website, it will give you more info. You can also use it to generate regex code in a variety of languages too. (not good for Java though!).
There's also a great series of Python regex videos on Youtube by PyMoondra.
A simpler regex solution:
import re
pattern = re.compile(r'\[\d+\]$')
s = '10.9.8[35]'
r = pattern.sub('', s)
print(r) # 10.9.8
The pattern matches square brackets at the end of a string with one or more digits inside. The sub then replaces the square brackets and the number with an empty string.
If you wanted to use the number in the square brackets just change the sub expression such as:
import re
pattern = re.compile(r'\[(\d+)\]$')
s = '10.9.8[35]'
r = pattern.sub(r'.\1', s)
print(r) # 10.9.8.35
Alternatively, as said in the other answer, you can just find the bracket and slice the string to get rid of it.
I have the following regex expression
re.findall('\(([0-9].*?)\)', a[a.find('('):].strip())
defined for strings like
asdasdasd (21345-asdasdasd)
to retrieve what is inside the parentheses when it starts with a number. But I also want to be able to retrieve what is inside when it starts with the string 'NA', like:
asdasdasd (NA-asdasdasd)
I've tried:
re.findall('\(([0-9].*?)\)|\((NA.*?)\)', a[a.find('('):].strip())
but produces a tuple. How would it be? Thank you in advance!
You may capture the substring between parentheses when the text inside starts with digits / NA followed with - and any other chars other than ( and ) using
re.findall(r'\(((?:[0-9]+|NA)-[^)]*)\)', a)
See the regex demo.
Details
\( - a (
((?:[0-9]+|NA)-[^)]*) - Capturing group (this value will be returned by re.findall):
(?:[0-9]+|NA) - 1 or more digits or NA
- - a hyphen
[^)]* - 0+ chars other than )
\) - a ) char.
See the Python demo:
import re
strs = ['asdasdasd (21345-asdasdasd)', 'asdasdasd (NA-asdasdasd)']
for s in strs:
    print(re.findall(r'\(((?:[0-9]+|NA)-[^)]*)\)', s))
Output:
['21345-asdasdasd']
['NA-asdasdasd']
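As for why the attempt in the question produced tuples: re.findall returns one tuple per match whenever the pattern contains more than one capturing group, with an empty string for a group that did not take part. A short sketch comparing the two-group pattern with a single-group rewrite of it (the variable names are illustrative):

```python
import re

s = 'asdasdasd (NA-asdasdasd)'

# Two capturing groups: each match is a tuple with one slot per group.
two_groups = re.findall(r'\(([0-9].*?)\)|\((NA.*?)\)', s)
print(two_groups)  # => [('', 'NA-asdasdasd')]

# One capturing group with a non-capturing alternation inside: plain strings.
one_group = re.findall(r'\(((?:[0-9]|NA).*?)\)', s)
print(one_group)   # => ['NA-asdasdasd']
```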
I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, and keep only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the
hyphen in 12-34 should be kept while the equals sign after 123 should be removed.
Here is my python script.
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)
the expected output should be
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an extra equal sign in the output?
Your regex only removes a run of characters matched by [^\u4e00-\u9fff0-9a-zA-Z]+ when both lookarounds succeed, i.e. when the character before the run (lookbehind (?<=[^0-9])) and the character after it (lookahead (?=[^0-9])) are both non-digits. The "=" is preceded by 3, so the lookbehind fails and the character is kept. This means your code actually keeps any non-alphanumeric, non-Chinese characters which have a digit on either side.
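To see this behaviour in isolation, here is a small sketch that applies the pattern from the question to a few short ASCII samples. The function name remove_punct is made up for this illustration:

```python
import re

# Pattern from the question, unchanged.
PATTERN = '(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])'

def remove_punct(text):
    return re.sub(PATTERN, '', text)

for sample in ["a=b", "1=b", "a=1", "1-2"]:
    print(sample, "->", remove_punct(sample))
# a=b -> ab    (non-digits on both sides: removed)
# 1=b -> 1=b   (digit before: kept)
# a=1 -> a=1   (digit after: kept)
# 1-2 -> 1-2   (digits on both sides: kept)
```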
You can try the following regex:
u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
You can use it as such:
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(''.join(res))
I suggest matching and capturing these characters in between digits (to restore them later in the output), and just match them in other contexts.
In Python 2, it will look like
import re
s = u"中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+'
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"" ,s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国
See the Python demo
If you need to preserve those symbols inside any Unicode digits, you need to replace [0-9] with \d and pass the re.UNICODE flag to the regex.
The regex will look like
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+
It works like this:
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+) - Group 1 capturing
[0-9]+ - 1+ digits
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
[0-9]+ - 1+ digits
| - or
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
In Python 2.x, when a group is not matched in re.sub, the backreference to it is None, that is why a lambda expression is required to check if Group 1 matched first.
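For readers on Python 3, a rough equivalent of the snippet above is sketched below. It assumes Python 3.5 or later for the second variant, since from that version on an unmatched group in a replacement template is replaced with an empty string, so the lambda becomes optional. In Python 3, str patterns are Unicode-aware by default, so no u prefix or encode call is needed.

```python
import re

s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
pat_block = '[^\u4e00-\u9fff0-9a-zA-Z]+'
pattern = '([0-9]+{0}[0-9]+)|{0}'.format(pat_block)

# Variant 1: same lambda idea as above; works on any Python 3 version.
res_lambda = re.sub(pattern, lambda m: m.group(1) or '', s)

# Variant 2 (assumes Python 3.5+): an unmatched Group 1 becomes ''.
res_backref = re.sub(pattern, r'\1', s)

print(res_lambda)   # => 中国中国foo中国bar中123国中国12-34中国
print(res_backref)  # => 中国中国foo中国bar中123国中国12-34中国
```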