What I am trying to match is something like this:
public FUNCTION_NAME
FUNCTION_NAME proc near
......
FUNCTION_NAME endp
FUNCTION_NAME can be:
version_etc
version_etc_arn
version_etc_ar
and my pattern is:
pattern = "public\s+" + func_name + "[\s\S]*" + func_name + "\s+endp"
and match with:
match = re.findall(pattern, content)
So currently I find that if func_name equals version_etc, then the pattern will match
all of version_etc_arn, version_etc_ar, and version_etc,
which means if the pattern is:
"public\s+" + "version_etc" + "[\s\S]*" + "version_etc" + "\s+endp"
then it will match:
public version_etc_arn
version_etc_arn proc near
......
version_etc_arn endp
public version_etc_ar
version_etc_ar proc near
......
version_etc_ar endp
public version_etc
version_etc proc near
......
version_etc endp
And I am trying to just match:
public version_etc
version_etc proc near
......
version_etc endp
Am I wrong? Could anyone give me some help?
Thank you!
[\s\S]* matches 0-or-more of anything, including the _arn that you are trying to exclude. Thus, you need to require a whitespace after func_name:
pattern = r"(?sm)public\s+{f}\s.*^{f}\s+endp".format(f=func_name)
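For instance, a runnable sketch of that fix (the sample content here is made up; re.escape is an extra precaution in case func_name ever contains regex metacharacters, and the non-greedy .*? goes slightly beyond the answer, keeping a single match from spanning several functions):

import re

content = """\
public version_etc_arn
version_etc_arn proc near
version_etc_arn endp
public version_etc
version_etc proc near
version_etc endp
"""

func_name = "version_etc"
pattern = r"(?sm)public\s+{f}\s.*?^{f}\s+endp".format(f=re.escape(func_name))
print(re.findall(pattern, content))
# ['public version_etc\nversion_etc proc near\nversion_etc endp']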
I am trying to create a function that will return a string from the text based on these conditions:
If 'recurring payment authorized on' is in the string, get the first word after 'on'
If 'recurring payment' is in the string, get everything before it
Currently I have written the following:
# will be used in an apply statement for a column in a dataframe
def parser(x):
    x_list = x.split()
    if " recurring payment authorized on " in x and x_list[-1] != "on":
        return x_list[x_list.index("on") + 1]
    elif " recurring payment" in x:
        return ' '.join(x_list[:x_list.index("recurring")])
    else:
        return None
However this code looks awkward and is not robust. I want to use regex to match those strings.
Here are some examples of what this function should return:
recurring payment authorized on usps abc should return usps
usps recurring payment abc should return usps
Any help on writing regex for this function will be appreciated. The input string will only contain text; there will be no numerical or special characters.
Using Regex with lookahead and lookbehind pattern matching
import re

def parser(x):
    # Patterns to search
    pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
    pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')
    m = pattern_on.search(x)
    if m:
        return m.group(0)
    m = pattern_recur.search(x)
    if m:
        return m.group(0)
    return None
tests = [
    "recurring payment authorized on usps abc",
    "usps recurring payment abc",
    "recurring payment authorized on att xxx xxx",
    "recurring payment authorized on 25.05.1980 xxx xxx",
    "att recurring payment xxxxx",
    "12.14.14. att recurring payment xxxxx",
]
for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
Output
In text: recurring payment authorized on usps abc
Found: usps
In text: usps recurring payment abc
Found: usps
In text: recurring payment authorized on att xxx xxx
Found: att
In text: recurring payment authorized on 25.05.1980 xxx xxx
Found: 25.05.1980
In text: att recurring payment xxxxx
Found: att
In text: 12.14.14. att recurring payment xxxxx
Found: 12.14.14. att
Explanation
Lookahead and Lookbehind pattern matching
Regex Lookbehind
(?<=foo) Lookbehind Asserts that what immediately precedes the current
position in the string is foo
So in pattern: r'(?<= authorized on )(.*?)(\s+)'
foo is " authorized on "
(.*?) - matches any character (? causes it not to be greedy)
(\s+) - matches at least one whitespace
So the above causes (.*?) to capture all characters after " authorized on " until the first whitespace character.
Regex Lookahead
(?=foo) Lookahead Asserts that what immediately follows the current position in the string is foo
So with: r'^(.*?)\s(?=recurring payment)'
foo is 'recurring payment'
^ - means at beginning of the string
(.*?) - matches any character (non-greedy)
\s - matches white space
Thus, (.*?) will match all characters from beginning of string until we get whitespace followed by "recurring payment"
Better Performance
Desirable since you're applying this to a DataFrame column, which may have lots of rows.
Take the pattern compilation out of the parser and place it at module level (roughly a 33% reduction in time).
def parser(x):
    # Use predefined patterns (pattern_on, pattern_recur) from globals
    m = pattern_on.search(x)
    if m:
        return m.group(0)
    m = pattern_recur.search(x)
    if m:
        return m.group(0)
    return None
# Define patterns to search
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

tests = [
    "recurring payment authorized on usps abc",
    "usps recurring payment abc",
    "recurring payment authorized on att xxx xxx",
    "recurring payment authorized on 25.05.1980 xxx xxx",
    "att recurring payment xxxxx",
    "12.14.14. att recurring payment xxxxx",
]
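If you want to measure the difference yourself, here is a rough timeit sketch (numbers vary by machine; note that re also caches compiled patterns internally, so much of the gap is the repeated cache lookup and compile-call overhead):

import re
import timeit

text = "recurring payment authorized on usps abc"
pattern_precompiled = re.compile(r'(?<= authorized on )(.*?)(\s+)')

def precompiled():
    return pattern_precompiled.search(text)

def inline():
    return re.compile(r'(?<= authorized on )(.*?)(\s+)').search(text)

print(timeit.timeit(precompiled, number=100000))
print(timeit.timeit(inline, number=100000))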
I am not sure that this level of complexity requires RegEx.
Hoping that RegEx is not a strict requirement for you, here's a solution that doesn't use it:
examples = [
    'stuff more stuff recurring payment authorized on ExampleA useless data',
    'other useless data ExampleB recurring payment',
    'Not matching phrase payment example authorized',
]

def extract_data(phrase):
    result = None
    if "recurring payment authorized on" in phrase:
        result = phrase.split("recurring payment authorized on")[1].split()[0]
    elif "recurring payment" in phrase:
        result = phrase.split("recurring payment")[0]
    return result

for example in examples:
    print(extract_data(example))
Output
ExampleA
other useless data ExampleB
None
Not sure if this is any faster, but Python's regex engine has conditionals:
if authorized on is present, then
match the next run of non-space characters, else
match everything that occurs before recurring.
Note that the result will be in capturing group 2 or 3 depending on which matched.
import re

def xstr(s):
    if s is None:
        return ''
    return str(s)

def parser(x):
    # Pattern to search, with a regex conditional on group 1
    pattern = re.compile(r"(authorized\son\s)?(?(1)(\S+)|(^.*) recurring)")
    m = pattern.search(x)
    if m:
        return xstr(m.group(2)) + xstr(m.group(3))
    return None
tests = [
    "recurring payment authorized on usps abc",
    "usps recurring payment abc",
    "recurring payment authorized on att xxx xxx",
    "recurring payment authorized on 25.05.1980 xxx xxx",
    "att recurring payment xxxxx",
    "12.14.14. att recurring payment xxxxx",
]
for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
You can do this with a single regex and without explicit lookahead or lookbehind.
Please let me know if this works, and how it performs against @DarryIG's solution.
import re
from collections import namedtuple

ResultA = namedtuple('ResultA', ['s'])
ResultB = namedtuple('ResultB', ['s'])

RX = re.compile(r'((?P<init>.*) )?recurring payment ((authorized on (?P<authorized>\S+))|(?P<rest>.*))')

def parser(x):
    '''https://stackoverflow.com/questions/59600852/use-regex-to-match-multiple-words-in-sequence

    >>> parser('recurring payment authorized on usps abc')
    ResultB(s='usps')
    >>> parser('usps recurring payment abc')
    ResultA(s='usps')
    >>> parser('recurring payment authorized on att xxx xxx')
    ResultB(s='att')
    >>> parser('recurring payment authorized on 25.05.1980 xxx xxx')
    ResultB(s='25.05.1980')
    >>> parser('att recurring payment xxxxx')
    ResultA(s='att')
    >>> parser('12.14.14. att recurring payment xxxxx')
    ResultA(s='12.14.14. att')
    '''
    m = RX.match(x)
    if m is None:
        return None  # consider ValueError
    recurring = m.groupdict()['init'] or m.groupdict()['rest']
    authorized = m.groupdict()['authorized']
    if (recurring is None) == (authorized is None):
        raise ValueError('invalid input')
    if recurring is not None:
        return ResultA(recurring)
    else:
        return ResultB(authorized)
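Since the examples in the docstring are doctests, a quick way to verify them (a small addition, not part of the original answer):

if __name__ == '__main__':
    import doctest
    doctest.testmod()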
I'm trying to implement a record definition DSL using Lark. It is based on indentation, which makes things a bit more complex.
Lark is a great tool, but I'm facing some difficulties.
Here is a snippet of the DSL I'm implementing:
record Order :
    """Order record documentation
    should have arbitrary size"""
    field1 Int
    field2 Datetime:
        """Attributes should also have
        multiline documentation"""
    field3 String "inline documentation also works"
and here is the grammar used:
?start: (_NEWLINE | redorddef)*
simple_type: NAME
multiline_doc: MULTILINE_STRING _NEWLINE
inline_doc: INLINE_STRING
?element_doc: ":" _NEWLINE _INDENT multiline_doc _DEDENT | inline_doc
attribute_name: NAME
attribute_simple_type: attribute_name simple_type [element_doc] _NEWLINE
attributes: attribute_simple_type+
_recordbody: _NEWLINE _INDENT [multiline_doc] attributes _DEDENT
redorddef: "record" NAME ":" _recordbody
MULTILINE_STRING: /"""([^"\\]*(\\.[^"\\]*)*)"""/
INLINE_STRING: /"([^"\\]*(\\.[^"\\]*)*)"/
_WS_INLINE: (" "|/\t/)+
COMMENT: /#[^\n]*/
_NEWLINE: ( /\r?\n[\t ]*/ | COMMENT )+
%import common.CNAME -> NAME
%import common.INT
%ignore /[\t \f]+/ // WS
%ignore /\\[\t \f]*\r?\n/ // LINE_CONT
%ignore COMMENT
%declare _INDENT _DEDENT
It works fine for the record definition's multiline string doc and for inline attribute docs, but it doesn't work for an attribute's multiline string doc.
The code I use to execute is this:
import sys
import pprint
from pathlib import Path
from lark import Lark, UnexpectedInput
from lark.indenter import Indenter
scheman_data_works = '''
record Order :
    """Order record documentation
    should have arbitrary size"""
    field1 Int
    # field2 Datetime:
    #     """Attributes should also have
    #     multiline documentation"""
    field3 String "inline documentation also works"
'''

scheman_data_wrong = '''
record Order :
    """Order record documentation
    should have arbitrary size"""
    field1 Int
    field2 Datetime:
        """Attributes should also have
        multiline documentation"""
    field3 String "inline documentation also works"
'''
grammar = r'''
?start: (_NEWLINE | redorddef)*
simple_type: NAME
multiline_doc: MULTILINE_STRING _NEWLINE
inline_doc: INLINE_STRING
?element_doc: ":" _NEWLINE _INDENT multiline_doc _DEDENT | inline_doc
attribute_name: NAME
attribute_simple_type: attribute_name simple_type [element_doc] _NEWLINE
attributes: attribute_simple_type+
_recordbody: _NEWLINE _INDENT [multiline_doc] attributes _DEDENT
redorddef: "record" NAME ":" _recordbody
MULTILINE_STRING: /"""([^"\\]*(\\.[^"\\]*)*)"""/
INLINE_STRING: /"([^"\\]*(\\.[^"\\]*)*)"/
_WS_INLINE: (" "|/\t/)+
COMMENT: /#[^\n]*/
_NEWLINE: ( /\r?\n[\t ]*/ | COMMENT )+
%import common.CNAME -> NAME
%import common.INT
%ignore /[\t \f]+/ // WS
%ignore /\\[\t \f]*\r?\n/ // LINE_CONT
%ignore COMMENT
%declare _INDENT _DEDENT
'''
class SchemanIndenter(Indenter):
    NL_type = '_NEWLINE'
    OPEN_PAREN_types = ['LPAR', 'LSQB', 'LBRACE']
    CLOSE_PAREN_types = ['RPAR', 'RSQB', 'RBRACE']
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 4

scheman_parser = Lark(grammar, parser='lalr', postlex=SchemanIndenter())
print(scheman_parser.parse(scheman_data_works).pretty())
print("\n\n")
print(scheman_parser.parse(scheman_data_wrong).pretty())
and the result is:
redorddef
  Order
  multiline_doc  """Order record documentation
    should have arbitrary size"""
  attributes
    attribute_simple_type
      attribute_name  field1
      simple_type     Int
    attribute_simple_type
      attribute_name  field3
      simple_type     String
      inline_doc      "inline documentation also works"
Traceback (most recent call last):
  File "schema_parser.py", line 83, in <module>
    print(scheman_parser.parse(scheman_data_wrong).pretty())
  File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/lark.py", line 228, in parse
    return self.parser.parse(text)
  File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/parser_frontends.py", line 38, in parse
    return self.parser.parse(token_stream, *[sps] if sps is not NotImplemented else [])
  File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/parsers/lalr_parser.py", line 68, in parse
    for token in stream:
  File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/indenter.py", line 31, in process
    for token in stream:
  File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/lexer.py", line 319, in lex
    for x in l.lex(stream, self.root_lexer.newline_types, self.root_lexer.ignore_types):
  File "/Users/branquif/Dropbox/swf_projects/schema-manager/.venv/lib/python3.7/site-packages/lark/lexer.py", line 167, in lex
    raise UnexpectedCharacters(stream, line_ctr.char_pos, line_ctr.line, line_ctr.column, state=self.state)
lark.exceptions.UnexpectedCharacters: No terminal defined for 'f' at line 11 col 2

field3 String "inline documentation also
^
I understand indented grammars are more complex, and Lark seems to make them easier, but I cannot find the mistake here.
PS: I also tried pyparsing, without success with this same scenario, and it would be hard for me to move to PLY, given the amount of code that would probably be needed.
The bug comes from misplaced _NEWLINE terminals. Generally, it's recommended to make sure rules are balanced, in terms of their role in the grammar. So here's how you should have defined element_doc:
?element_doc: ":" _NEWLINE _INDENT multiline_doc _DEDENT
| inline_doc _NEWLINE
Notice the added newline, which means that no matter which of the two options the parser takes, it ends in a similar state, syntax-wise (_DEDENT also matches a newline).
The second change, as a consequence of the first one, is:
attribute_simple_type: attribute_name simple_type (element_doc|_NEWLINE)
As element_doc already handles newlines, we shouldn't try to match it twice.
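To see the fix in action, here is a minimal check; it is a sketch that textually patches the two rules, reusing grammar, SchemanIndenter, and scheman_data_wrong from the question's script:

# Swap the two corrected rules into the original grammar string.
fixed_grammar = grammar.replace(
    '?element_doc: ":" _NEWLINE _INDENT multiline_doc _DEDENT | inline_doc',
    '?element_doc: ":" _NEWLINE _INDENT multiline_doc _DEDENT | inline_doc _NEWLINE',
).replace(
    'attribute_simple_type: attribute_name simple_type [element_doc] _NEWLINE',
    'attribute_simple_type: attribute_name simple_type (element_doc|_NEWLINE)',
)

fixed_parser = Lark(fixed_grammar, parser='lalr', postlex=SchemanIndenter())
print(fixed_parser.parse(scheman_data_wrong).pretty())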
You mentioned trying pyparsing, otherwise I would have left your question alone.
Whitespace-sensitive parsing is not great with pyparsing, but it does make an effort on this kind of case, using pyparsing.indentedBlock. There is some level of distress in writing this, but it can be done.
import pyparsing as pp

COLON = pp.Suppress(':')
tpl_quoted_string = pp.QuotedString('"""', multiline=True) | pp.QuotedString("'''", multiline=True)
quoted_string = pp.ungroup(tpl_quoted_string | pp.quotedString().addParseAction(pp.removeQuotes))
RECORD = pp.Keyword("record")
ident = pp.pyparsing_common.identifier()

field_expr = (ident("name")
              + ident("type") + pp.Optional(COLON)
              + pp.Optional(quoted_string)("docstring"))

indent_stack = []
STACK_RESET = pp.Empty()

def reset_indent_stack(s, l, t):
    indent_stack[:] = [pp.col(l, s)]

STACK_RESET.addParseAction(reset_indent_stack)

record_expr = pp.Group(STACK_RESET
                       + RECORD - ident("name") + COLON + pp.Optional(quoted_string)("docstring")
                       + (pp.indentedBlock(field_expr, indent_stack))("fields"))

record_expr.ignore(pp.pythonStyleComment)
If your example is written to a variable 'sample', do:
print(record_expr.parseString(sample).dump())
And get:
[['record', 'Order', 'Order record documentation\n should have arbitrary size', [['field1', 'Int'], ['field2', 'Datetime', 'Attributes should also have\n multiline documentation'], ['field3', 'String', 'inline documentation also works']]]]
[0]:
  ['record', 'Order', 'Order record documentation\n should have arbitrary size', [['field1', 'Int'], ['field2', 'Datetime', 'Attributes should also have\n multiline documentation'], ['field3', 'String', 'inline documentation also works']]]
  - docstring: 'Order record documentation\n should have arbitrary size'
  - fields: [['field1', 'Int'], ['field2', 'Datetime', 'Attributes should also have\n multiline documentation'], ['field3', 'String', 'inline documentation also works']]
    [0]:
      ['field1', 'Int']
      - name: 'field1'
      - type: 'Int'
    [1]:
      ['field2', 'Datetime', 'Attributes should also have\n multiline documentation']
      - docstring: 'Attributes should also have\n multiline documentation'
      - name: 'field2'
      - type: 'Datetime'
    [2]:
      ['field3', 'String', 'inline documentation also works']
      - docstring: 'inline documentation also works'
      - name: 'field3'
      - type: 'String'
  - name: 'Order'
I am trying to parse strings that look like this.
<Report Type="Final Report" SiteName="Get Dataset" Name="Get Metadata" Description="Get Metadata" From="2019-01-16 00:00" Thru="2019-01-16 23:59" obj_device="479999" locations="69,31,">
<Objective Type="Availability">
<Goal>99.99</Goal>
<Actual>100.00</Actual>
<Compliant>Yes</Compliant>
<Errors>0</Errors>
<Checks>2880</Checks>
</Objective>
<Objective Type="Uptime">
<Goal/>
<Actual/>
<Compliant/>
<Errors>0</Errors>
<Checks>0</Checks>
</Objective>
I want to use regex to find 'Description' and get the string between the quotes, so I want 'Get Metadata'. Then I want to find 'From' and get the string between the quotes, '2019-01-16 00:00'. Finally, I want to find 'Thru' and get the string between the quotes, '2019-01-16 23:59'. How can I do this with 3 separate regex commands and parse the results into 3 separate strings? TIA.
You can do this with 1 regex pattern
import re

# string holds the report text to search
pattern = re.compile('Description="(.*)" From="(.*)" Thru="(.*)" obj')
for desc, frm, thru in re.findall(pattern=pattern, string=string):
    print(desc)
    print(frm)
    print(thru)

# output
# Get Metadata
# 2019-01-16 00:00
# 2019-01-16 23:59
Or you can do the same step with different patterns
pattern_desc = re.compile('Description="(.*)" From')
pattern_frm = re.compile('From="(.*)" Thru')
pattern_thru = re.compile('Thru="(.*)" obj')
re.findall(pattern_desc, string)
# output: ['Get Metadata']
re.findall(pattern_frm, string)
# output: ['2019-01-16 00:00']
re.findall(pattern_thru, string)
# output: ['2019-01-16 23:59']
This regex should give you the content of description, the others should be similar:
'Description="([\w\s]+)" From'
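For instance, a quick usage sketch (the input line is shortened from the question's report tag):

import re

line = 'Name="Get Metadata" Description="Get Metadata" From="2019-01-16 00:00"'
m = re.search(r'Description="([\w\s]+)" From', line)
if m:
    print(m.group(1))  # Get Metadata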
I put together a little working example with a regex to get the data you are looking for.
import re
long_string = '''
<Report Type="Final Report" SiteName="Get Dataset" Name="Get Metadata" Description="Get Metadata" From="2019-01-16 00:00" Thru="2019-01-16 23:59" obj_device="479999" locations="69,31,">
<Objective Type="Availability">
<Goal>99.99</Goal>
<Actual>100.00</Actual>
<Compliant>Yes</Compliant>
<Errors>0</Errors>
<Checks>2880</Checks>
</Objective>
<Objective Type="Uptime">
<Goal/>
<Actual/>
<Compliant/>
<Errors>0</Errors>
<Checks>0</Checks>
</Objective>
'''
match = re.search('Description=\"(.+?)\" From=\"(.+?)\" Thru=\"(.+?)\"', long_string)
if match:
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))
It gives this output:
Get Metadata
2019-01-16 00:00
2019-01-16 23:59
Hope this helps.
The three regex patterns you need for capturing the mentioned values are:
Description="([^"]*)"
From="([^"]*)"
Thru="([^"]*)"
You can generate these dynamically in a function and re-use it to find the value for any key. Try this Python demo:
import re

def getValue(text, key):
    m = re.search(key + '="([^"]*)"', text)
    if m:
        return m.group(1)
s = '''<Report Type="Final Report" SiteName="Get Dataset" Name="Get Metadata" Description="Get Metadata" From="2019-01-16 00:00" Thru="2019-01-16 23:59" obj_device="479999" locations="69,31,">
<Objective Type="Availability">
<Goal>99.99</Goal>
<Actual>100.00</Actual>
<Compliant>Yes</Compliant>
<Errors>0</Errors>
<Checks>2880</Checks>
</Objective>
<Objective Type="Uptime">
<Goal/>
<Actual/>
<Compliant/>
<Errors>0</Errors>
<Checks>0</Checks>
</Objective>'''
print('Description: ' + getValue(s,'Description'))
print('From: ' + getValue(s,'From'))
print('Thru: ' + getValue(s,'Thru'))
Prints,
Description: Get Metadata
From: 2019-01-16 00:00
Thru: 2019-01-16 23:59
In pure python, it should be something like this:
xml = '<Report Type="Final Report" SiteName="Get Dataset" Name="Get Metadata" Description="Get Metadata" From="2019-01-16 00:00" Thru="2019-01-16 23:59" obj_device="479999" locations="69,31,"><Objective Type="Availability"><Goal>99.99</Goal><Actual>100.00</Actual><Compliant>Yes</Compliant><Errors>0</Errors><Checks>2880</Checks></Objective><Objective Type="Uptime"><Goal/><Actual/><Compliant/><Errors>0</Errors><Checks>0</Checks></Objective>'
report = xml.split('>')[0]
description = report.split("Description=\"")[1].split("\" From=\"")[0]
from_ = report.split("From=\"")[1].split("\" Thru=\"")[0]
thru = report.split("Thru=\"")[1].split("\" obj_device=\"")[0]
I want to parse a sentence into a binary parse of this form (the format used in the SNLI corpus):
sentence:"A person on a horse jumps over a broken down airplane."
parse: ( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )
I'm unable to find a parser which does this.
note: This question has been asked earlier (How to get a binary parse in Python), but the answers are not helpful. I was unable to comment because I do not have the required reputation.
Here is some sample code which will erase the labels for each node in the tree.
package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class PrintTreeWithoutLabelsExample {

  public static void main(String[] args) {
    // set up pipeline properties
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse");
    // use faster shift reduce parser
    props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
    props.setProperty("parse.maxlen", "100");
    props.setProperty("parse.binaryTrees", "true");
    // set up Stanford CoreNLP pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // build annotation for text
    Annotation annotation = new Annotation("The red car drove on the highway.");
    // annotate the review
    pipeline.annotate(annotation);
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      Tree sentenceConstituencyParse = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
      for (Tree subTree : sentenceConstituencyParse.subTrees()) {
        if (!subTree.isLeaf())
          subTree.setLabel(CoreLabel.wordFromString(""));
      }
      TreePrint treePrint = new TreePrint("oneline");
      treePrint.printTree(sentenceConstituencyParse);
    }
  }
}
I analyzed the accepted version and, as I needed something in Python, I made a simple function that creates the same results. For parsing the sentences I adapted the version found at the referenced link.
import re
import string
from stanfordcorenlp import StanfordCoreNLP
from nltk import Tree
from functools import reduce

regex = re.compile('[%s]' % re.escape(string.punctuation))

def parse_sentence(sentence):
    nlp = StanfordCoreNLP(r'./stanford-corenlp-full-2018-02-27')
    sentence = regex.sub('', sentence)
    result = nlp.parse(sentence)
    result = result.replace('\n', '')
    result = re.sub(' +', ' ', result)
    nlp.close()  # Do not forget to close! The backend server will consume a lot of memory.
    return result.encode("utf-8")

def binarize(parsed_sentence):
    sentence = parsed_sentence.replace("\n", "")
    for pattern in ["ROOT", "SINV", "NP", "S", "PP", "ADJP", "SBAR",
                    "DT", "JJ", "NNS", "VP", "VBP", "RB"]:
        sentence = sentence.replace("({}".format(pattern), "(")
    sentence = re.sub(' +', ' ', sentence)
    return sentence
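A usage sketch tying the two functions together (this assumes the CoreNLP distribution is unpacked at the path hard-coded above; parse_sentence returns UTF-8 bytes, so decode before binarizing):

parsed = parse_sentence("A person on a horse jumps over a broken down airplane.")
print(binarize(parsed.decode("utf-8")))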
Neither my version nor the accepted one delivers exactly the results presented in the SNLI or MultiNLI corpus, because those gather pairs of single leaves of the tree into one node. An example from the MultiNLI corpus shows
"( ( The ( new rights ) ) ( are ( nice enough ) ) )",
whereas both answers here return
'( ( ( ( The) ( new) ( rights)) ( ( are) ( ( nice) ( enough)))))'.
I am not an expert in NLP, so I hope this does not make any difference. At least it does not for my applications.
My template engine translates
"some data #{substitution_expression} some other data"
into
"some data" + (substitution_expression) + "some other data"
But if "some data" or "some other data" contains double quotes, the evaluation fails.
I have to add slashes before these quotes, but I cannot come up with the right regular expression for that.
Any help?
UPDATE:
This is how the template engine works:
It gets a template string, e.g.
template = 'template string "quoted text" #{expression}'
It changes the template string with a simple regexp:
template = '"%s"' % re.compile(r'\#{(.*)}').sub(r'" + (\1) + "', template)
# template == "template string "quoted text"" + (expression) + ""
# here is the problem with "quoted text" - it needs \ before the quotes
This string is inserted into a lambda, and the resulting code string is evaled:
return eval("lambda tpl_args: %s" % modified_template_string)
The lambda is called later in the program with some tpl_args to generate the result string.
Have you tried the re.DOTALL flag?
re.compile(r'\#{(.*)}', re.DOTALL)
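Beyond the flag, here is a minimal sketch of the escaping idea itself (compile_template is a hypothetical helper, not part of your engine): split on #{...} so literal chunks and expressions alternate, backslash-escape quotes, backslashes, and newlines in the literal chunks, then splice everything back together for the eval:

import re

SUBST = re.compile(r'\#\{(.*?)\}', re.DOTALL)

def compile_template(template):
    # With a capturing group, re.split alternates literal chunks
    # (even indexes) and expressions (odd indexes).
    parts = SUBST.split(template)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Escape characters that would break the generated string literal.
            part = part.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')
            pieces.append('"%s"' % part)
        else:
            pieces.append('(%s)' % part)
    return eval("lambda tpl_args: %s" % " + ".join(pieces))

render = compile_template('template string "quoted text" #{tpl_args}')
print(render("expression value"))
# template string "quoted text" expression value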