Pyparsing: dblQuotedString parsing differently in nestedExpr - python

I'm working on a grammar to parse search queries (not evaluate them, just break them into component pieces). Right now I'm working with nestedExpr, just to grab the different 'levels' of each term, but I seem to have an issue if the first part of a term is in double quotes.
A simple version of the grammar:
QUOTED = QuotedString(quoteChar = '“', endQuoteChar = '”', unquoteResults = False).setParseAction(remove_curlies)
WWORD = Word(alphas8bit + printables.replace("(", "").replace(")", ""))
WORDS = Combine(OneOrMore(dblQuotedString | QUOTED | WWORD), joinString = ' ', adjacent = False)
TERM = OneOrMore(WORDS)
NESTED = OneOrMore(nestedExpr(content = TERM))
query = '(dog* OR boy girl w/3 ("girls n dolls" OR friends OR "best friend" OR (friends w/10 enemies)))'
Calling NESTED.parseString(query) returns:
[['dog* OR boy girl w/3', ['"girls n dolls"', 'OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
The first dblQuotedString instance is separate from the rest of the term at the same nesting, which doesn't occur to the second dblQuotedString instance, and also doesn't occur if the quoted bit is a QUOTED instance (with curly quotes) rather than dblQuotedString with straight ones.
Is there something special about dblQuotedString that I'm missing?
NOTE: I know that operatorPrecedence can break up search terms like this, but I have some limits on what can be broken apart, so I'm testing if I can use nestedExpr to work within those limits.

nestedExpr takes an optional keyword argument ignoreExpr, to take an expression that nestedExpr should use to ignore characters that would otherwise be interpreted as nesting openers or closers, and the default is pyparsing's quotedString, which is defined as sglQuotedString | dblQuotedString. This is to handle strings like:
(this has a tricky string "string with )" )
Since the default ignoreExpr is quotedString, the ')' in quotes is not misinterpreted as the closing parenthesis.
However, your content argument also matches on dblQuotedString. The leading quoted string is matched internally by nestedExpr by way of skipping over quoted strings that may contain "()"s, then your content is matched, which also matches quoted strings. You can suppress nestedExpr's ignore expression using a NoMatch:
NESTED = OneOrMore(nestedExpr(content = TERM, ignoreExpr=NoMatch()))
which should now give you:
[['dog* OR boy girl w/3',
['"girls n dolls" OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
You'll find more details and examples at https://pythonhosted.org/pyparsing/pyparsing-module.html#nestedExpr

Related

Python: strip function definition using regex

I am a very beginner of programming and reading the book "Automate the boring stuff with Python'. In Chapter 7, there is a project practice: the regex version of strip(). My code below does not work (I use Python 3.6.1). Could anyone help?
import re
string = input("Enter a string to strip: ")
strip_chars = input("Enter the characters you want to be stripped: ")
def strip_fn(string, strip_chars):
if strip_chars == '':
blank_start_end_regex = re.compile(r'^(\s)+|(\s)+$')
stripped_string = blank_start_end_regex.sub('', string)
print(stripped_string)
else:
strip_chars_start_end_regex = re.compile(r'^(strip_chars)*|(strip_chars)*$')
stripped_string = strip_chars_start_end_regex.sub('', string)
print(stripped_string)
You can also use re.sub to substitute the characters in the start or end.
Let us say if the char is 'x'
re.sub(r'^x+', "", string)
re.sub(r'x+$', "", string)
The first line as lstrip and the second as rstrip
This just looks simpler.
When using r'^(strip_chars)*|(strip_chars)*$' string literal, the strip_chars is not interpolated, i.e. it is treated as a part of the string. You need to pass it as a variable to the regex. However, just passing it in the current form would result in a "corrupt" regex because (...) in a regex is a grouping construct, while you want to match a single char from the define set of chars stored in the strip_chars variable.
You could just wrap the string with a pair of [ and ] to create a character class, but if the variable contains, say z-a, it would make the resulting pattern invalid. You also need to escape each char to play it safe.
Replace
r'^(strip_chars)*|(strip_chars)*$'
with
r'^[{0}]+|[{0}]+$'.format("".join([re.escape(x) for x in strip_chars]))
I advise to replace * (zero or more occurrences) with + (one or more occurrences) quantifier because in most cases, when we want to remove something, we need to match at least 1 occurrence of the unnecessary string(s).
Also, you may replace r'^(\s)+|(\s)+$' with r'^\s+|\s+$' since the repeated capturing groups will keep on re-writing group values upon each iteration slightly hampering the regex execution.
#! python
# Regex Version of Strip()
import re
def RegexStrip(mainString,charsToBeRemoved=None):
if(charsToBeRemoved!=None):
regex=re.compile(r'[%s]'%charsToBeRemoved)#Interesting TO NOTE
return regex.sub('',mainString)
else:
regex=re.compile(r'^\s+')
regex1=re.compile(r'$\s+')
newString=regex1.sub('',mainString)
newString=regex.sub('',newString)
return newString
Str=' hello3123my43name is antony '
print(RegexStrip(Str))
Maybe this could help, it can be further simplified of course.

How to use stringed regex as proper regex with raw literalization

I have a list of regexes in string form (created after parsing natural language text which were search queries). I want to use them for searching text now. Here is how I am doing it right now-
# given that regex_list=["r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'", "r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'"....]
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
mo=re.search(string_regex %gs1,sent,re.I)
if mo:
print(mo.group())
What I need is to be able to use these string regexes, but also have Python's raw literal notation on them, as we all should for regex queries. Now about these expressions - I have natural text search commands like -
LINE_CONTAINS foo(+)
Which I use pyparsing to convert to regex like r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))' based on a grammar. I send a list of these human rules to the pyparsing code and it gives me back a list of ~100 of these regexes. These regexes are constructed in string format.
This is the MCVE version of the code that generates these strings that are supposed to act as regexes -
from pyparsing import *
import re
def parse_hrr(received_sentences):
UPTO, AND, OR, WORDS, CHARACTERS = map(Literal, "UPTO AND OR WORDS CHARACTERS".split())
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()
LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Literal,
"""LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
keyword = UPTO | WORDS | AND | OR | BEFORE | AFTER | JOIN | LINE_CONTAINS | PARA_STARTSWITH
class Node(object):
def __init__(self, tokens):
self.tokens = tokens
def generate(self):
pass
class LiteralNode(Node):
def generate(self):
return "(%s)" %(re.escape(''.join(self.tokens[0]))) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
def __repr__(self):
return repr(self.tokens[0])
class ConsecutivePhrases(Node):
def generate(self):
join_these=[]
tokens = self.tokens[0]
for t in tokens:
tg = t.generate()
join_these.append(tg)
seq = []
for word in join_these[:-1]:
if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word): #or if the first part of the regex in word:
seq.append(word + "")
else:
seq.append(word + "\s+")
seq.append(join_these[-1])
result = "".join(seq)
return result
class AndNode(Node):
def generate(self):
tokens = self.tokens[0]
join_these=[]
for t in tokens[::2]:
tg = t.generate()
tg_mod = tg[0]+r'?=.*\b'+tg[1:][:-1]+r'\b)' # to place the regex commands at the right place
join_these.append(tg_mod)
joined = ''.join(ele for ele in join_these)
full = '('+ joined+')'
return full
class OrNode(Node):
def generate(self):
tokens = self.tokens[0]
joined = '|'.join(t.generate() for t in tokens[::2])
full = '('+ joined+')'
return full
class LineTermNode(Node):
def generate(self):
tokens = self.tokens[0]
ret = ''
dir_phr_map = {
'LINE_CONTAINS': lambda a: r"((?:(?<=[\W_])" + a + r"(?=[\W_]|$))456", #%gs1, sent, re.I)",
'PARA_STARTSWITH':
lambda a: ("r'(^" + a + "(?=[\W_]|$))' 457") if 'gene' in repr(a) #%gs1, s, re.I)"
else ("r'(^" + a + "(?=[\W_]|$))' 458")} #,s, re.I
for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
ret = dir_phr_map[line_dir](phr_term.generate())
return ret
## THE GRAMMAR
word = ~keyword + Word(alphas, alphanums+'-_+/()')
some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
phrase_item = some_words
phrase_expr = infixNotation(phrase_item,
[
(None, 2, opAssoc.LEFT, ConsecutivePhrases),
(AND, 2, opAssoc.LEFT, AndNode),
(OR, 2, opAssoc.LEFT, OrNode),
],
lpar=Suppress('{'), rpar=Suppress('}')
) # structure of a single phrase with its operators
line_term = Group((LINE_CONTAINS|PARA_STARTSWITH)("line_directive") +
(phrase_expr)("phrases")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = line_term.setParseAction(LineTermNode)
###########################################################################################
mrrlist=[]
for t in received_sentences:
t = t.strip()
try:
parsed = line_contents_expr.parseString(t)
temp_regex = parsed[0].generate()
mrrlist.append(temp_regex)
return(mrrlist)
So basically, the code is stringing together the regex. Then I add the necessary parameters like re.search, %gs1 etc .to have the complete regex search query. I want to be able to use these string regexes for searching, hence I had earlier thought eval() would convert the string to its corresponding Python expression here, which is why I used it - I was wrong.
TL;DR - I basically have a list of strings that have been created in the source code, and I want to be able to use them as regexes, using Python's raw literal notation.
Your issue seems to stem from a misunderstanding of what raw string literals do and what they're for. There's no magic raw string type. A raw string literal is just another way of creating a normal string. A raw literal just gets parsed a little bit differently.
For instance, the raw string r"\(foo\)" can also be written "\\(foo\\)". The doubled backslashes tell Python's regular string parsing algorithm that you want an actual backslash character in the string, rather than the backslash in the literal being part of an escape sequence that gets replaced by a special character. The raw string algorithm doesn't the extra backslashes since it never replaces escape sequences.
However, in this particular case the special treatment is not actually necessary, since the \( and \) are not meaningful escape sequences in a Python string. When Python sees an invalid escape sequence, it just includes it literally (backslash and all). So you could also use "\(foo\)" (without the r prefix) and it will work just fine too.
But it's not generally a good idea to rely upon backslashes being ignored however, since if you edit the string later you might inadvertently add an escape sequence that Python does understand (when you really wanted the raw, un-transformed version). Since regex syntax has a number of its own escape sequences that are also escape sequences in Python (but with different meanings, such as \b and \1), it's a best practice to always write regex patterns with raw strings to avoid introducing issues when editing them.
Now to bring this around to the example code you've shown. I have no idea why you're using eval at all. As far as I can tell, you've mistakenly wrapped extra quotes around your regex patterns for no good reason. You're using exec to undo that wrapping. But because only the inner strings are using raw string syntax, by the time you eval them you're too late to avoid Python's string parsing messing up your literals if you have any of the troublesome escape sequences (the outer string will have already parsed \b for instance and turned it into the ASCII backspace character \x08).
You should tear the exec code out and fix your literals to avoid the extra quotes. This should work:
regex_list=[r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))', # use raw literals, with no extra quotes!
r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'] # unnecessary backslashes?
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
mo=re.search(string_regex %gs1,sent,re.I) # no eval here!
if mo:
print(mo.group())
This example works for me (it prints foo(+)). Note that you've got some extra unnecessary backslashes in your second pattern (before the spaces). Those are harmless, but might be adding even more confusion to a complicate subject (regex are notoriously hard to understand).

pyparsing with starting and ending string being the same

Related to : Python parsing bracketed blocks
I have a file with the following format :
#
here
are
some
strings
#
and
some
others
#
with
different
levels
#
of
#
indentation
#
#
#
So a block is defined by a starting #, and a trailing #. However, the trailing # of the n-1th block is also the starting # of the nth block.
I am trying to write a function that, given this format, would retrieve the content of each blocks, and that could also be recursive.
To start with, I started with regexes but I abandonned quite fast (I think you guessed why), so I tried using pyparsing, yet I can't simply write
print(nestedExpr('#','#').parseString(my_string).asList())
Because it raises a ValueError Exception (ValueError: opening and closing strings cannot be the same).
Knowing that I cannot change the input format, do I have any better option than pyparsing for this one ?
I also tried using this answer : https://stackoverflow.com/a/1652856/740316, and replaced the {/} with #/# yet it fails to parse the string.
Unfortunately (for you), your grouping is not dependent only on the separating '#' characters, but also on the indent levels (otherwise, ['with','different','levels'] would be at the same level as the previous group ['and','some','others']). Parsing indent-sensitive grammars is not a strong suit for pyparsing - it can be done, but it is not pleasant. To do so we will use the pyparsing helper macro indentedBlock, which also requires that we define a list variable that indentedBlock can use for its indentation stack.
See the embedded comments in the code below to see how you might use one approach with pyparsing and indentedBlock:
from pyparsing import *
test = """\
#
here
are
some
strings
#
and
some
others
#
with
different
levels
#
of
#
indentation
#
#
#"""
# newlines are significant for line separators, so redefine
# the default whitespace characters for whitespace skipping
ParserElement.setDefaultWhitespaceChars(' ')
NL = LineEnd().suppress()
HASH = '#'
HASH_SEP = Suppress(HASH + Optional(NL))
# a normal line contains a single word
word_line = Word(alphas) + NL
indent_stack = [1]
# word_block is recursive, since word_blocks can contain word_blocks
word_block = Forward()
word_group = Group(OneOrMore(word_line | ungroup(indentedBlock(word_block, indent_stack))) )
# now define a word_block, as a '#'-delimited list of word_groups, with
# leading and trailing '#' characters
word_block <<= (HASH_SEP +
delimitedList(word_group, delim=HASH_SEP) +
HASH_SEP)
# the overall expression is one large word_block
parser = word_block
# parse the test string
parser.parseString(test).pprint()
Prints:
[['here', 'are', 'some', 'strings'],
['and',
'some',
'others',
[['with', 'different', 'levels'], ['of', [['indentation']]]]]]

python regex in pyparsing

How do you make the below regex be used in pyparsing? It should return a list of tokens given the regex.
Any help would be greatly appreciated! Thank you!
python regex example in the shell:
>>> re.split("(\w+)(lab)(\d+)", "abclab1", 3)
>>> ['', 'abc', 'lab', '1', '']
I tried this in pyparsing, but I can't seem to figure out how to get it right because the first match is being greedy, i.e the first token will be 'abclab' instead of two tokens 'abc' and 'lab'.
pyparsing example (high level, i.e non working code):
name = 'abclab1'
location = Word(alphas).setResultsName('location')
lab = CaselessLiteral('lab').setResultsName('environment')
identifier = Word(nums).setResultsName('identifier')
expr = location + lab + identifier
match, start, end = expr.scanString(name).next()
print match.asDict()
Pyparsing's classes are pretty much left-to-right, with lookahead implemented using explicit expressions like FollowedBy (for positive lookahead) and NotAny or the '~' operator (for negative lookahead). This allows you to detect a terminator which would normally match an item that is being repeated. For instance, OneOrMore(Word(alphas)) + Literal('end') will never find a match in strings like "start blah blah end", because the terminating 'end' will get swallowed up in the repetition expression in OneOrMore. The fix is to add negative lookahead in the expression being repeated: OneOrMore(~Literal('end') + Word(alphas)) + Literal('end') - that is, before reading another word composed of alphas, first make sure it is not the word 'end'.
This breaks down when the repetition is within a pyparsing class, like Word. Word(alphas) will continue to read alpha characters as long as there is no whitespace to stop the word. You would have to break into this repetition using something very expensive, like Combine(OneOrMore(~Literal('lab') + Word(alphas, exact=1))) - I say expensive because composition of simple tokens using complex Combine expressions will make for a slow parser.
You might be able to compromise by using a regex wrapped in a pyparsing Regex object:
>>> labword = Regex(r'(\w+)(lab)(\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
This does the right kind of grouping and detection, but does not expose the groups themselves. To do that, add names to each group - pyparsing will treat these like results names, and give you access to the individual fields, just as if you had called setResultsName:
>>> labword = Regex(r'(?P<locn>\w+)(?P<env>lab)(?P<identifier>\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
- env: lab
- identifier: 1
- locn: abc
>>> print labword.parseString("abclab1").asDict()
{'identifier': '1', 'locn': 'abc', 'env': 'lab'}
The only other non-regex approach I can think of would be to define an expression to read the whole string, and then break up the parts in a parse action.
If you strip the subgroup sign(the parenthesis), you'll get the right answer:)
>>> re.split("\w+lab\d+", "abclab1")
['', '']

Replace a string located between

Here is my problem: in a variable that is text and contains commas, I try to delete only the commas located between two strings (in fact [ and ]). For example using the following string:
input = "The sun shines, that's fine [not, for, everyone] and if it rains, it Will Be better."
output = "The sun shines, that's fine [not for everyone] and if it rains, it Will Be better."
I know how to use .replace for the whole variable, but I can not do it for a part of it.
There are some topics approaching on this site, but I did not manage to exploit them for my own question, e.g.:
Repeatedly extract a line between two delimiters in a text file, Python
Python finding substring between certain characters using regex and replace()
replace string between two quotes
import re
Variable = "The sun shines, that's fine [not, for, everyone] and if it rains, it Will Be better."
Variable1 = re.sub("\[[^]]*\]", lambda x:x.group(0).replace(',',''), Variable)
First you need to find the parts of the string that need to be rewritten (you do this with re.sub). Then you rewrite that parts.
The function var1 = re.sub("re", fun, var) means: find all substrings in te variable var that conform to "re"; process them with the function fun; return the result; the result will be saved to the var1 variable.
The regular expression "[[^]]*]" means: find substrings that start with [ (\[ in re), contain everything except ] ([^]]* in re) and end with ] (\] in re).
For every found occurrence run a function that convert this occurrence to something new.
The function is:
lambda x: group(0).replace(',', '')
That means: take the string that found (group(0)), replace ',' with '' (remove , in other words) and return the result.
You can use an expression like this to match them (if the brackets are balanced):
,(?=[^][]*\])
Used something like:
re.sub(r",(?=[^][]*\])", "", str)
Here is a non-regex method. You can replace your [] delimiters with say [/ and /], and then split on the / delimiter. Then every odd string in the split list needs to be processed for comma removal, which can be done while rebuilding the string in a list comprehension:
>>> Variable = "The sun shines, that's fine [not, for, everyone] and if it rains,
it Will Be better."
>>> chunks = Variable.replace('[','[/').replace(']','/]').split('/')
>>> ''.join(sen.replace(',','') if i%2 else sen for i, sen in enumerate(chunks))
"The sun shines, that's fine [not for everyone] and if it rains, it Will Be
better."
If you don't fancy learning regular expressions (see other responses on this page), you can use the partition command.
sentence = "the quick, brown [fox, jumped , over] the lazy dog"
left, bracket, rest = sentence.partition("[")
block, bracket, right = rest.partition("]")
"block" is now the part of the string in between the brackets, "left" is what was to the left of the opening bracket and "right" is what was to the right of the opening bracket.
You can then recover the full sentence with:
new_sentence = left + "[" + block.replace(",","") + "]" + right
print new_sentence # the quick, brown [fox jumped over] the lazy dog
If you have more than one block, you can put this all in a for loop, applying the partition command to "right" at every step.
Or you could learn regular expressions! It will be worth it in the long run.

Categories

Resources