I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
          ^                                                      ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0, \g<0> can be used to call the full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular (atomic grouping since Ruby 1.9.3)
JavaScript API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by @jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
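As a rough illustration (not part of the original answer), the unrolled pattern can be generated for any fixed depth instead of being written out by hand. The helper name nested_paren_pattern is made up for this sketch; it uses only Python's standard re module, which has no recursion support:

```python
import re

def nested_paren_pattern(depth):
    """Build an unrolled, non-recursive pattern allowing `depth` levels of nesting."""
    # Innermost level: a pair of parentheses with no parentheses inside.
    pattern = r"\([^)(]*\)"
    for _ in range(depth):
        # Wrap the previous level:
        # "(" non-parens, then any number of (inner group + non-parens), ")"
        pattern = r"\([^)(]*(?:" + pattern + r"[^)(]*)*\)"
    return pattern

text = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
match = re.search(nested_paren_pattern(3), text)
print(match.group(0))
# prints: (text here(possible text)text(possible text(more text)))
```

With depth=3 the generated string is the same as the unrolled pattern shown above. Input nested more deeply than the chosen depth is only matched from an inner level outwards, so the depth still has to be chosen to fit the data.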
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
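A minimal sketch of that counter-based scan in Python (the function name is illustrative, and it assumes plain ( and ) with no quoting or escaping to worry about):

```python
def outermost_parens(text):
    """Return the first top-level balanced '(...)' span in text, or None."""
    depth = 0      # number of '(' not yet matched by a ')'
    start = None   # index where the current outer group opened
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # The counter is back to zero: this is the final closing parenthesis.
                return text[start:i + 1]
    return None

print(outermost_parens("START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"))
# prints: (text here(possible text)text(possible text(more text)))
```

A stray ) before any ( is simply skipped here; whether that should be an error instead depends on the input you expect.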
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, an FSA can remember only the current state; it has no information about the previous states.
In the above diagram, S1 and S2 are two states where S1 is the starting and final state. So if we try with the string 0110, the transition goes as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.
In the above problem, we need to know the number of opening parentheses; this means it has to be stored somewhere. But since FSAs cannot do that, a regular expression cannot be written.
However, an algorithm can be written to do this task. Such algorithms generally fall under Pushdown Automata (PDA). A PDA is one level above an FSA: it has an additional stack to store extra information. PDAs can solve the above problem, because we can 'push' each opening parenthesis onto the stack and 'pop' one off whenever we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match; otherwise they do not.
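The push/pop idea can be sketched in a few lines of Python. This is only an illustration of the stack approach described above, with a made-up function name, and it handles a single bracket type:

```python
def is_balanced(text):
    """Return True if every '(' in text has a matching ')' and vice versa."""
    stack = []
    for ch in text:
        if ch == "(":
            stack.append(ch)      # push the opening parenthesis
        elif ch == ")":
            if not stack:
                return False      # a ')' with nothing left to match
            stack.pop()           # pop the matching '('
    return not stack              # balanced only if the stack is empty

print(is_balanced("(text here(possible text)text)"))  # True
print(is_balanced("(()"))                             # False
```

With one bracket type the stack only ever holds identical items, so a plain counter would do the same job; the stack becomes necessary once several bracket types can be mixed.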
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.
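To see what this regex does in practice, here is a small Python check (the sample string and the function name are my own). The greedy .* runs from just after the first ( to just before the last ), whether or not those two are a matching pair:

```python
import re

def between_first_and_last(text):
    """Return the text between the first '(' and the last ')', or None."""
    m = re.search(r"(?<=\().*(?=\))", text)
    return m.group(0) if m else None

print(between_first_and_last("abc(def(ghi)jkl)mno(pqr)stu"))
# prints: def(ghi)jkl)mno(pqr
```

Note that the two groups (def(ghi)jkl) and (pqr) are glued together into one result, which is exactly the limitation described above.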
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns, and regular expressions are the right tool for solving such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers)
{
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenDelimiter = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenDelimiter = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions; see @dehmann.
If it's just first open to last close, see @Zach.
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile(r"""
    ( .*? )
    ( \( | \) | \[ | \] | \{ | \} | \< | \> |
      \' | \" | BEGIN | END | $ )
    ( .* )
    """, re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)
        if m.group(1) != "":
            lst.append(m.group(1))
        if m.group(2) == term:
            return lst, m.group(3)
        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print("\ninput", s)
        try:
            lst, s = matchnested(s)
            print("output", lst)
        except ValueError as e:
            print(str(e))
    print("done")
You need the first and last parentheses. Use something like this:
str.indexOf('('); - gives the index of the first occurrence
str.lastIndexOf(')'); - gives the index of the last one
So you need the substring between them:
String searchedString = str.substring(str.indexOf('('), str.lastIndexOf(')') + 1); // + 1 keeps the closing parenthesis
Because JS regex doesn't support recursive matching, I couldn't make balanced parentheses matching work with a regex.
So this is a simple JavaScript for-loop version that turns "method(arg)" strings into an array:
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
let ops = []
let method, arg
let isMethod = true
let open = []
for (const char of str) {
// skip whitespace
if (char === ' ') continue
// append method or arg string
if (char !== '(' && char !== ')') {
if (isMethod) {
(method ? (method += char) : (method = char))
} else {
(arg ? (arg += char) : (arg = char))
}
}
if (char === '(') {
// nested parenthesis should be a part of arg
if (!isMethod) arg += char
isMethod = false
open.push(char)
} else if (char === ')') {
open.pop()
// check end of arg
if (open.length < 1) {
isMethod = true
ops.push({ method, arg })
method = arg = undefined
} else {
arg += char
}
}
}
return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
the result is like
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
    """Return a list of (start, end) index pairs for balanced {...} blocks in data."""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i, v in enumerate(data):
        if v == '{':
            count_open += 1
            # compare with None so that a brace at index 0 is not lost
            if start_pos is None:
                start_pos = i
        if v == '}':
            count_close += 1
            if count_open == count_close and end_pos is None:
                end_pos = i + 1
        if start_pos is not None and end_pos is not None:
            code_snippets.append((start_pos, end_pos))
            start_pos = None
            end_pos = None
    return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it may be useful to some coming here to search for a nested-structure regexp:
Parse parameters from a function string (with nested structures) in JavaScript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORK even if some/all params are not set
* @return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)
Related
I am trying to parse the SQLite sources for error messages and my current approach has most cases covered, I think.
My regex:
(?:sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError)\([^;\"]+\"([^)]+)\"(?:,|\)|:)
Source snippet (not valid C, only for demonstration):
sqlite3ErrorMsg(pParse, variable);
sqlite3ErrorMsg(pParse, "row value misused");
){
sqlite3ErrorMsg(pParse, "no \"such\" function: %.*s", nId, zId);
pNC->nErr++;
}else if( wrong_num_args ){
sqlite3ErrorMsg(pParse,"wrong number of arguments to function %.*s()",
nId, zId);
pNC->nErr++;
}
if( pExpr->iTable<0 ){
sqlite3ErrorMsg(pParse,
"second argument to likelihood must be a "
"constant between 0.0 and 1.0");
pNC->nErr++;
}
}else if( wrong_num_args ){
sqlite3ErrorMsg(pParse,"factory must return a cursor, not \\w+",
nId);
pNC->nErr++;
This successfully outputs the following capture groups:
row value misused
no \"such\" function: %.*s
second argument to likelihood must be a "
"constant between 0.0 and 1.0
factory must return a cursor, not \\w+
However, it misses wrong number of arguments to function %.*s() - because of the ().
Regex101 example
I have also tried to capture from " to " with a negative look-behind to allow escaped \" (as not to skip over no \"such\" function: %.*s), but I could not get it to work, because my regex-foo is not that strong and there's also the cases of the multiline strings.
I've also tried to combine the answers from Regex for quoted string with escaping quotes with my regex, but that did not work for me, either.
The general idea is:
There's a function call with one of the three mentioned function names (sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError), followed by a non-string parameter that I'm not interested in, followed by at least one parameter that may be either a variable (don't want that) or a string (that's what I'm looking for!), followed by an optional arbitrary number of parameters.
The string that I want may be a multiline-string and may also contain escaped quotes, parenthesis and whatever else is allowed in a C string.
I'm using Python 3.7
You may consider the following pattern:
(?:sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError)\s*\(\s*\w+,((?:\s*"[^"\\]*(?:\\.[^"\\]*)*")+)
See the regex demo. You will need to remove the delimiting double quotes manually from each line in a match.
Details:
(?:sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError) - one of the three substrings
\s*\(\s* - a ( char enclosed with zero or more whitespaces
\w+ - one or more word chars
, - a comma
((?:\s*"[^"\\]*(?:\\.[^"\\]*)*")+) - Group 1: one or more repetitions of
\s* - zero or more whitespace
" - a "
[^"\\]* - zero or more chars other than \ and "
(?:\\.[^"\\]*)* - zero or more repetitions of a \ and then any char followed with zero or more chars other than " and \
" - a " char.
Sample Python code:
import re
file = "sqlite3ErrorMsg(pParse, variable); \n sqlite3ErrorMsg(pParse, \"row value misused\");\n ){\n sqlite3ErrorMsg(pParse, \"no \\\"such\\\" function: %.*s\", nId, zId);\n pNC->nErr++;\n }else if( wrong_num_args ){\n sqlite3ErrorMsg(pParse,\"wrong number of arguments to function %.*s()\",\n nId, zId);\n pNC->nErr++;\n }\n if( pExpr->iTable<0 ){\n sqlite3ErrorMsg(pParse,\n \"second argument to likelihood must be a \"\n \"constant between 0.0 and 1.0\");\n pNC->nErr++;\n }\n }else if( wrong_num_args ){\n sqlite3ErrorMsg(pParse,\"factory must return a cursor, not \\\\w+\", \n nId);\n pNC->nErr++;"
rx = r'(?:sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError)\s*\(\s*\w+,((?:\s*"[^"\\]*(?:\\.[^"\\]*)*")+)'
matches = [" ".join(map(lambda x: x.strip(' "'), m.strip().splitlines())) for m in re.findall(rx, file)]
print(matches)
Output:
['row value misused', 'no \\"such\\" function: %.*s', 'wrong number of arguments to function %.*s()', 'second argument to likelihood must be a constant between 0.0 and 1.0', 'factory must return a cursor, not \\\\w+']
I am writing code to find a specific pattern in a given string using Python or Perl. I had some success finding the pattern using C, but Python or Perl is mandatory for this assignment and I am very new to both of these languages.
My string looks like this (Amino acid sequence) :-
MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL
The pattern I want to find is
KXXXXXX(K\R)XR
Please note that the letters between K and K\R are not fixed. However, there is only one letter between K\R and R. So, in the given string my pattern looks like this and lies between letters 54 and 65 (if I counted correctly), based on a "smallest pattern" search:
KYMLSVDNLKLR
Previously, I was using C if-else condition to break this given string and printed out word count (not fully successful).
printf("%c", word[i]);
if ((word[i] == 'K' || word[i] == 'R') && word[i+2] == 'R') {
printf("\n");
printf("%d\n",i);
}
I agree it didn't capture everything. If anyone can help me solve this problem, that would be great.
You say you want the match to be non-greedy, but that doesn't make sense. I think you are trying to find the minimal match. If so, that's very hard to do. This is the regex match you need:
/
K
(?: (?: [^KR] | R(?!.R) )+
| .
)
[KR]
.
R
/sx
However, it wouldn't surprise me if there's a bug. The only sure way to find a minimal match is to find all possible matches.
my $match;
while (/(?= ( K.+[KR].R ) )/sxg) {
if (!defined($match) || length($1) < length($match)) {
$match = $1;
}
}
But this will be far slower, especially for long strings.
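For the Python side of the question, roughly the same brute-force idea could look like the sketch below. It is an illustration under my own assumptions rather than a translation of the code above: it uses a lazy .+? so that each starting K yields its shortest candidate, and the name shortest_match is made up.

```python
import re

def shortest_match(seq):
    """Return (start_index, text) of the shortest K...[KR].R match, or None."""
    best = None
    # The lookahead lets candidates overlap: one candidate per starting K.
    for m in re.finditer(r"(?=(K.+?[KR].R))", seq):
        if best is None or len(m.group(1)) < len(best[1]):
            best = (m.start(1), m.group(1))
    return best

seq = "MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL"
print(shortest_match(seq))
# prints: (53, 'KYMLSVDNLKLR')
```

The index is zero-based, so 53 corresponds to letter 54 in the question's counting.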
Regardless of the language, this looks like a task suitable for regular expressions.
Here is an example of how you could do the regex in Python. If you want the index where the match starts, you can do:
import re
m = re.search(r'K(?:[A-JL-Z]+?|K)[KR][A-Z]R', s)
print(m.start())  # prints index
print(m.group())  # prints matching string
Or, as @bunji points out, you can use finditer as well:
for m in re.finditer(r'K(?:[A-JL-Z]+?|K)[KR][A-Z]R', s):
    print(m.start())  # prints index
    print(m.group())  # prints matching string
I only did it this way because I hate backtracking in my regular expressions. But I do find it's usually faster if I perform the most restrictive part of the match first, which in this case is made simpler by reversing the input and the search pattern. This should stop at the first (shortest) possible match, rather than finding the longest match and then hunting down the shortest.
#!/usr/bin/perl
use strict;
use warnings;
my $pattern = "MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL";
my $reverse = reverse $pattern;
my $length = length $reverse;
if( $reverse =~ /(R.[KR][^K]+K)/ ) {
my $match = $1;
$match = reverse $match;
my $start_p = $length-$+[0];
my $end_p = $length-$-[0]-1;
my $where = $start_p + length $match;
print "FOUND ...\n";
print "0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\n";
print $pattern."\n";
printf "%${where}s\n", $match;
print "Found pattern '$match' starting at position '$start_p' and ending at position '$end_p'\n";
# test it
if( $pattern =~ /$match/ ) {
if( $start_p == $-[0] && $end_p == $+[0]-1 ) {
print "Test successful, match found in original pattern.\n";
} else {
print "Test failed, you screwed something up!\n";
}
} else {
print "Hmmm, pattern '$match' wasn't found in '$pattern'?\n";
}
} else {
print "Dang, no match was found!\n";
}
I'm not certain if the elimination of back-tracking here would outweigh the performance hit of the reversing. I guess it would depend greatly on the sizes of both the input string and the length of what could possibly match.
$> perl ./search.pl
FOUND ...
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL
KYMLSVDNLKLR
Found pattern 'KYMLSVDNLKLR' starting at position '53' and ending at position '64'
Test successful, match found in original pattern.
I apologize to those that don't understand why I started at zero.
And a bit more real world example - which will find interwoven matches.
#!/usr/bin/perl
use strict;
use warnings;
# NOTE THE INPUT WAS MODIFIED FROM OP
my $input = "MKTSGNQDEILVIRKKRKRRGWKLTINNIRGRIMRGRKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFALKGR";
my $rstart = length $input;
my( $match, $start, $end ) = rsearch( $input, "R.[KR].+?K" );
while( $match ) {
print "Found matching pattern '$match' starting at offset '$start' and ending at offset $end\n";
$input = substr $input, 0, $end;
( $match, $start, $end ) = rsearch( $input, "R.[KR].+?K" );
}
exit(0);
sub rsearch {
my( $input, $pattern ) = @_;
my $reverse = reverse $input;
if( $reverse =~ /($pattern)/ ) {
my $length = length $reverse;
$match = reverse $1;
$start = $length-$+[0];
$end = $length-$-[0]-1;
return( $match, $start, $end );
}
return( undef );
}
perl ./search.pl
Found matching pattern 'KHIFALKGR' starting at offset '85' and ending at offset 93
Found matching pattern 'KYMLSVDNLKLR' starting at offset '64' and ending at offset 75
Found matching pattern 'KLTINNIRGRIMRGR' starting at offset '22' and ending at offset 36
Found matching pattern 'KLTINNIRGR' starting at offset '22' and ending at offset 31
Found matching pattern 'KRKRR' starting at offset '15' and ending at offset 19
Found matching pattern 'KKRKR' starting at offset '14' and ending at offset 18
Found matching pattern 'KTSGNQDEILVIRKKR' starting at offset '1' and ending at offset 16
I am implementing a Python grammar in ANTLR4 but I am facing the same problem with INDENT and DEDENT discussed here: ANTLR4- dynamically inject token
The solution I am trying is to convert the solution by Ter that can be found here http://antlr3.org/grammar/1078018002577/python.tar.gz (override nextToken and insert imaginary tokens).
The problem is that this solution assumes that we have a lexer rule like:
LEADING_WS
: {getColumn()==1}?
// match spaces or tabs, tracking indentation count
( ' ' { spaces++; }
| '\t' { spaces += 8; spaces -= (spaces % 8); }
| '\014' // formfeed is ok
)+
{
}
...
but I keep getting an error because actions in a lexer rule must be the last element on a single outermost alternative.
Can anyone help me find a solution?
Thanks a lot!
You need to move your calculation involving spaces to either the end of the LEADING_WS rule or your implementation of nextToken. At the end of LEADING_WS it could look like the following.
LEADING_WS
: {getColumn()==1}?
// match spaces or tabs, tracking indentation count
[ \t]+
{spaces = computeSpaces(_input.getText());}
;
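The computeSpaces helper used in that action is not defined in the answer. As an assumption about what it would do, here is the tab-stop arithmetic from the original rule written as a small Python sketch (Python only for illustration; in the actual grammar it would be a method written in the lexer's target language):

```python
def compute_spaces(leading_ws, tab_width=8):
    """Return the indentation column reached after the given leading whitespace."""
    spaces = 0
    for ch in leading_ws:
        if ch == " ":
            spaces += 1
        elif ch == "\t":
            # Same arithmetic as the original rule: jump to the next tab stop.
            spaces += tab_width
            spaces -= spaces % tab_width
    return spaces

print(compute_spaces("  \t"))   # 8
print(compute_spaces("\t  "))   # 10
```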
With Java, I can split the string and give some detailed explanations
String x = "a" + // First
"b" + // Second
"c"; // Third
// x = "abc"
How can I do the equivalent in Python?
I can split the string, but I can't add a comment to each part like I do in Java.
x = "a" \
"b" \
"c"
I need this feature for explaining regular expression usage.
Pattern p = Pattern.compile("rename_method\\(" + // ignore 'rename_method('
"\"([^\"]*)\"," + // find '"....",'
This
x = ( "a" #foo
"b" #bar
)
will work.
The magic is done here by the parentheses: Python automatically continues lines inside any unterminated brackets (([{). Note that Python also automatically concatenates string literals when they're placed next to each other (we don't even need the + operator!), which is really cool.
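Applied to the regular expression from the question, a minimal sketch could look like this. Only the two pattern pieces shown in the question are used, and the sample input string is invented for the example:

```python
import re

pattern = re.compile(
    r'rename_method\('   # match the literal text 'rename_method('
    r'"([^"]*)",'        # capture the first quoted argument, followed by a comma
)

m = pattern.search('rename_method("old_name", "new_name")')
print(m.group(1))  # old_name
```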
If you want to do it specifically for regular expressions, you can do it pretty easily with the re.VERBOSE flag. From the Python docs (scroll down a bit to see the documentation for the VERBOSE flag):
charref = re.compile(r"""
&[#] # Start of a numeric entity reference
(
0[0-7]+ # Octal form
| [0-9]+ # Decimal form
| x[0-9a-fA-F]+ # Hexadecimal form
)
; # Trailing semicolon
""", re.VERBOSE)
In Perl:
if ($test =~ /^id\:(.*)$/ ) {
print $1;
}
In Python:
import re
test = 'id:foo'
match = re.search(r'^id:(.*)$', test)
if match:
    print(match.group(1))
In Python, regular expressions are available through the re library.
The r before the string indicates that it is a raw string literal, meaning that backslashes are not treated specially (otherwise every backslash would need to be escaped with another backslash in order for a literal backslash to make its way into the regex string).
I have used re.search here because this is the closest equivalent to Perl's =~ operator. There is another function re.match which does the same thing but only checks for a match starting at the beginning of the string (counter-intuitive to a Perl programmer's definition of "matching"). See this explanation for full details of the differences between the two.
Also note that there is no need to escape the : since it is not a special character in regular expressions.
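A quick way to see the difference between the two functions (the sample strings are made up for illustration):

```python
import re

text = "user id:foo"

# re.match only tries at position 0, so it finds nothing here.
print(re.match(r"id:(.*)$", text))               # None
# re.search scans forward until it finds a match.
print(re.search(r"id:(.*)$", text).group(1))     # foo
# When the string starts with the pattern, both behave the same.
print(re.match(r"id:(.*)$", "id:foo").group(1))  # foo
```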
match = re.match("^id:(.*)$", test)
if match:
    print(match.group(1))
Use a RegexObject like stated here:
http://docs.python.org/library/re.html#regular-expression-objects
I wrote this Perl to Python regex converter when I had to rewrite a bunch of Perl regex'es (a lot) to Python's re package calls. It covers some basic stuff, but might be still helpful in many ways:
import re

def convert_re(perl_re, string_var='column_name',
               test_value=None, expected_test_result=None):
    '''
    Returns a Perl `s///` substitution converted to a call to Python's `re.sub`
    '''
    match = re.match(r"(\w+)/(.+)/(.*)/(\w*)", perl_re)
    if not match:
        raise ValueError("Not a Perl regex? " + perl_re)
    if not match.group(1) == 's':
        raise ValueError("This function is only for `s` Perl regexpes (substitutes), i.e s/a/b/")
    flags = match.group(4)
    if 'g' in flags:
        count = 0  # all matches
        flags = flags.replace('g', '')  # remove g
    else:
        count = 1  # one exact match only
    if not flags:
        flags = 0
    # change any group references in replacements like $2 to group references like \g<2>
    replacement = match.group(3)
    replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    python_code = "re.sub(r'{regexp}', r'{replacement}', {string}{count}{flags})".format(
        regexp=match.group(2)
        , replacement=replacement
        , string=string_var
        , count=", count={}".format(count) if count else ''
        , flags=", flags={}".format(flags) if flags else ''
    )
    if test_value:
        print("Testing Perl regular expression {} with value '{}':".format(perl_re, test_value))
        print("(generated equivalent Python code: {} )".format(python_code))
        # run the generated code in its own namespace so the result can be read back in Python 3
        namespace = {'re': re}
        exec('{}=r"{}"; test_result={}'.format(string_var, test_value, python_code), namespace)
        test_result = namespace['test_result']
        assert test_result == expected_test_result, "produced={} expected={}".format(test_result, expected_test_result)
        print("Test OK.")
    return string_var + " = " + python_code

print(convert_re(r"s/^[ 0-9-]+//", test_value=' 2323 col', expected_test_result='col'))
print(convert_re(r"s/[+-]/_/g", test_value='a-few+words', expected_test_result='a_few_words'))