How to construct a "leading" equivalent to FollowedBy subclass in PyParsing - python

I'm trying to clean up some code by removing leading or trailing whitespace using PyParsing. Removing leading whitespace was quite easy, as I could use the FollowedBy class, which matches a string but does not include it. Now I need the same for whitespace that follows my identifying string.
Here is a small example:
from pyparsing import *
insource = """
annotation (Documentation(info="
<html>
<b>FOO</b>
</html>
"));
"""
# Working replacement:
HTMLStartref = OneOrMore(White(' \t\n')) + (FollowedBy(CaselessLiteral('<html>')))
## Not working because of non-existing "LeadBy"
# HTMLEndref = LeadBy(CaselessLiteral('</html>')) + OneOrMore(White(' \t\n')) + FollowedBy('"')
out = Suppress(HTMLStartref).transformString(insource)
out2 = Suppress(HTMLEndref).transformString(out)
As output one gets:
>>> print out
annotation (Documentation(info="<html>
<b>FOO</b>
</html>
"));
and should get:
>>> print out2
annotation (Documentation(info="<html>
<b>FOO</b>
</html>"));
I looked at the documentation but could not find a "LeadBy" equivalent to FollowedBy, or any other way to achieve that.

What you are asking for is something like "lookbehind", that is, match only if something is preceded by a particular pattern. I don't really have an explicit class for that at the moment, but for what you want to do you can still transform left-to-right: match the leading part as well, leave it in the output, and suppress only the whitespace.
Here are a couple of ways to address your problem:
# define expressions to match leading and trailing
# html tags, and just suppress the leading or trailing whitespace
opener = White().suppress() + Literal("<html>")
closer = Literal("</html>") + White().suppress()
# define a single expression to match either opener
# or closer - have to add leaveWhitespace() call so that
# we catch the leading whitespace in opener
either = opener|closer
either.leaveWhitespace()
print(either.transformString(insource))
# alternative, if you know what the tag will look like:
# match 'info=<some double quoted string>', and use a parse
# action to extract the contents within the quoted string,
# call strip() to remove leading and trailing whitespace,
# and then restore the original '"' characters (which are
# auto-stripped by the QuotedString class by default)
infovalue = QuotedString('"', multiline=True)
infovalue.setParseAction(lambda t: '"' + t[0].strip() + '"')
infoattr = "info=" + infovalue
print(infoattr.transformString(insource))
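For comparison, the same clean-up can be sketched with the standard-library re module, which supports lookbehind directly. This swaps in a different technique rather than using pyparsing, and the pattern assumes that the closing tag is always followed by whitespace and a double quote, as in the sample input. (As far as I know, newer pyparsing releases also ship a PrecededBy class for this kind of check; it is not used here.)

```python
import re

insource = """
annotation (Documentation(info="
<html>
<b>FOO</b>
</html>
"));
"""

# First alternative: whitespace that is followed by <html> (lookahead).
# Second alternative: whitespace that is preceded by </html> (lookbehind)
# and followed by the closing double quote (lookahead).
pattern = re.compile(r'\s+(?=<html>)|(?<=</html>)\s+(?=")', re.IGNORECASE)

cleaned = pattern.sub('', insource)
print(cleaned)
```

Only the whitespace is consumed by the match, so the tags themselves stay in the output untouched.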

Related

Python re.sub always returns the original string value and ignores given pattern

My code below
old = """
B07K6VMVL5
B071XQ6H38
B0B7F6Q9BH
B082KTHRBT
B0B78CWZ91
B09T8TJ65B
B09K55Z433
"""
duplicate = """
B0B78CWZ91
B09T8TJ65B
B09K55Z433
"""
final = re.sub(r"\b{}\b".format(duplicate),"",old)
print(final)
final always prints the contents of old unchanged. I want the duplicate values to be removed from old.
The triple-quoted string should not start or end on a new line, since that introduces a \n character at each end. Try with
# the last three lines of old are the ones that will be removed
old = """B07K6VMVL5
B071XQ6H38
B0B7F6Q9BH
B082KTHRBT
B0B78CWZ91
B09T8TJ65B
B09K55Z433"""
duplicate = """B0B78CWZ91
B09T8TJ65B
B09K55Z433"""
and the result will no longer be equal to old.
Output
B07K6VMVL5
B071XQ6H38
B0B7F6Q9BH
B082KTHRBT
Alternatively use the block string like this
"""\
B0B78CWZ91
B09T8TJ65B
B09K55Z433\
"""
It seems you can use
final = re.sub(r"(?!\B\w){}(?<!\w\B)".format(re.escape(duplicate.strip())),"",old)
Note several things here:
duplicate.strip() - whitespace on either end of duplicate may prevent a match, so strip() removes it
re.escape(...) - any special regex characters in the text are properly escaped
(?!\B\w) and (?<!\w\B) are dynamic adaptive word boundaries. They require a word boundary only on a side where the search text starts or ends with a word character.
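Put together with the data from the question, the suggestion above could look like the sketch below. The second half is an alternative that is not part of the answers: it assumes the real goal is to drop individual IDs, wherever they appear, rather than one contiguous block of text.

```python
import re

old = """
B07K6VMVL5
B071XQ6H38
B0B7F6Q9BH
B082KTHRBT
B0B78CWZ91
B09T8TJ65B
B09K55Z433
"""
duplicate = """
B0B78CWZ91
B09T8TJ65B
B09K55Z433
"""

# strip() removes the surrounding newlines, re.escape() makes the text literal
pattern = r"(?!\B\w){}(?<!\w\B)".format(re.escape(duplicate.strip()))
final = re.sub(pattern, "", old)
print(final)

# Alternative sketch (an assumption about the intent): treat both strings
# as whitespace-separated IDs and keep the IDs that are not duplicates.
dupes = set(duplicate.split())
kept = [item for item in old.split() if item not in dupes]
print("\n".join(kept))
```

The set-based version also works when the duplicate IDs are not adjacent or appear in a different order.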

Delete CHOSEN special character function

Please help, because I'm losing my mind. I can find similar problems, but none of them is this specific.
- I'm trying to create a simple compiler in Tkinter, with a function to delete a chosen special character.
- I've got a button for each character (dot, colon, etc.), and I want to create a function that takes a special character as an argument and then deletes it from the ScrolledText field. Here is my best try:
import re
content = 'Test. test . .test'
special = '.'
def delchar(char):
    adjustedchar = str("'[" + char + "]'")
    p = re.compile(adjustedchar)
    newcontent = p.sub('', content)
    print(newcontent)
delchar(special)
Output (nothing has changed):
>>> 'Test. test . .test'
What's going on here? How can I make it work? Is there a better solution?
I know that I could create a separate function for each character (I tried, and it works), but that would create 10 unnecessary functions. I want to keep it DRY. Also, my next function is going to do the same thing, just using user input.
What doesn't work is that argument. If I would print eg. adjustedchar, I'd get:
'[.]'
It's a format that re.compile() should accept, right?
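Your code almost works. The problem is that . (a dot) is a special character in a regular expression, so it has to be escaped, and the pattern should be compiled without the extra quote characters.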
Change your code to:
import re
content = 'Test. test . .test'
special = r'\.'  # raw string, so the backslash reaches the regex engine
def delchar(char):
    adjustedchar = str("'[" + char + "]'")  # no longer used
    p = re.compile(char)
    newcontent = p.sub('', content)
    print(newcontent)
delchar(special)
You can also check by making special = 't'. In your function you can do checks for the special characters.
You need to re.compile with the pattern you want to match, not with the replace-content:
import re
content = 'Test. test . .test'
special = '.'
def delchar(char):
    adjustedchar = str("'[" + char + "]'")
    p = re.compile("["+char+"]") # replace the dots, not '.'
    newcontent = p.sub(adjustedchar, content) # with adjustedchar,change to '' if you like
    print(newcontent)
delchar(special)
Your content does not contain '.' (a dot wrapped in single quotes), so nothing is replaced. If you change the pattern to "[.]", you are looking for literal dots to replace, not dots flanked by '.
Output:
Test'[.]' test '[.]' '[.]'test
You could just as well use string replace: 'Test. test . .test'.replace(".", "'.'")
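A generic version that works for any chosen character, whether or not it is special in a regex, might look like the following sketch. Passing the text in as a parameter, instead of reading the global content, is a choice made here and not something from the question; in the Tkinter program the text would come from the ScrolledText widget.

```python
import re

def delchar(text, char):
    """Return text with every occurrence of char removed."""
    # re.escape makes characters such as '.', '*' or '[' match literally
    return re.sub(re.escape(char), '', text)

content = 'Test. test . .test'
print(delchar(content, '.'))
print(delchar(content, 't'))
```

For a single literal character no regex is needed at all: content.replace(char, '') gives the same result.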

How to use stringed regex as proper regex with raw literalization

I have a list of regexes in string form (created after parsing natural language text which were search queries). I want to use them for searching text now. Here is how I am doing it right now-
# given that regex_list=["r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'", "r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'"....]
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
    mo=re.search(string_regex %gs1,sent,re.I)
    if mo:
        print(mo.group())
What I need is to be able to use these string regexes, but also have Python's raw literal notation on them, as we all should for regex queries. Now about these expressions - I have natural text search commands like -
LINE_CONTAINS foo(+)
Which I use pyparsing to convert to regex like r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))' based on a grammar. I send a list of these human rules to the pyparsing code and it gives me back a list of ~100 of these regexes. These regexes are constructed in string format.
This is the MCVE version of the code that generates these strings that are supposed to act as regexes -
from pyparsing import *
import re
def parse_hrr(received_sentences):
    # BEFORE, AFTER and JOIN are referenced by `keyword` below but were never
    # defined; they are assumed here to be plain literals like the others
    UPTO, AND, OR, WORDS, CHARACTERS, BEFORE, AFTER, JOIN = map(
        Literal, "UPTO AND OR WORDS CHARACTERS BEFORE AFTER JOIN".split())
    LBRACE,RBRACE = map(Suppress, "{}")
    integer = pyparsing_common.integer()
    LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Literal,
        """LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
    keyword = UPTO | WORDS | AND | OR | BEFORE | AFTER | JOIN | LINE_CONTAINS | PARA_STARTSWITH
    class Node(object):
        def __init__(self, tokens):
            self.tokens = tokens
        def generate(self):
            pass
    class LiteralNode(Node):
        def generate(self):
            return "(%s)" %(re.escape(''.join(self.tokens[0]))) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
        def __repr__(self):
            return repr(self.tokens[0])
    class ConsecutivePhrases(Node):
        def generate(self):
            join_these=[]
            tokens = self.tokens[0]
            for t in tokens:
                tg = t.generate()
                join_these.append(tg)
            seq = []
            for word in join_these[:-1]:
                if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word): #or if the first part of the regex in word:
                    seq.append(word + "")
                else:
                    seq.append(word + "\s+")
            seq.append(join_these[-1])
            result = "".join(seq)
            return result
    class AndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            join_these=[]
            for t in tokens[::2]:
                tg = t.generate()
                tg_mod = tg[0]+r'?=.*\b'+tg[1:][:-1]+r'\b)' # to place the regex commands at the right place
                join_these.append(tg_mod)
            joined = ''.join(ele for ele in join_these)
            full = '('+ joined+')'
            return full
    class OrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            joined = '|'.join(t.generate() for t in tokens[::2])
            full = '('+ joined+')'
            return full
    class LineTermNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            dir_phr_map = {
                'LINE_CONTAINS': lambda a: r"((?:(?<=[\W_])" + a + r"(?=[\W_]|$))456", #%gs1, sent, re.I)",
                'PARA_STARTSWITH':
                    lambda a: ("r'(^" + a + "(?=[\W_]|$))' 457") if 'gene' in repr(a) #%gs1, s, re.I)"
                    else ("r'(^" + a + "(?=[\W_]|$))' 458")} #,s, re.I
            for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
                ret = dir_phr_map[line_dir](phr_term.generate())
            return ret
    ## THE GRAMMAR
    word = ~keyword + Word(alphas, alphanums+'-_+/()')
    some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
    phrase_item = some_words
    phrase_expr = infixNotation(phrase_item,
                                [
                                (None, 2, opAssoc.LEFT, ConsecutivePhrases),
                                (AND, 2, opAssoc.LEFT, AndNode),
                                (OR, 2, opAssoc.LEFT, OrNode),
                                ],
                                lpar=Suppress('{'), rpar=Suppress('}')
                                ) # structure of a single phrase with its operators
    line_term = Group((LINE_CONTAINS|PARA_STARTSWITH)("line_directive") +
                      (phrase_expr)("phrases")) # basically giving structure to a single sub-rule having line-term and phrase
    line_contents_expr = line_term.setParseAction(LineTermNode)
    ###########################################################################################
    mrrlist=[]
    for t in received_sentences:
        t = t.strip()
        try:
            parsed = line_contents_expr.parseString(t)
            temp_regex = parsed[0].generate()
            mrrlist.append(temp_regex)
        except ParseException:
            # assumed handling (the original except clause is not shown):
            # skip sentences that do not match the grammar
            continue
    return(mrrlist)
So basically, the code is stringing together the regex. Then I add the necessary parameters like re.search, %gs1, etc. to have the complete regex search query. I want to be able to use these string regexes for searching; I had earlier thought eval() would convert the string to its corresponding Python expression here, which is why I used it. I was wrong.
TL;DR - I basically have a list of strings that have been created in the source code, and I want to be able to use them as regexes, using Python's raw literal notation.
Your issue seems to stem from a misunderstanding of what raw string literals do and what they're for. There's no magic raw string type. A raw string literal is just another way of creating a normal string. A raw literal just gets parsed a little bit differently.
For instance, the raw string r"\(foo\)" can also be written "\\(foo\\)". The doubled backslashes tell Python's regular string parsing algorithm that you want an actual backslash character in the string, rather than the backslash in the literal being part of an escape sequence that gets replaced by a special character. The raw string algorithm doesn't need the extra backslashes since it never replaces escape sequences.
However, in this particular case the special treatment is not strictly necessary, since \( and \) are not recognized escape sequences in a Python string. When Python sees an unrecognized escape sequence, it includes it literally (backslash and all), although recent Python versions emit a warning for it (a DeprecationWarning since 3.6, a SyntaxWarning since 3.12). So "\(foo\)" (without the r prefix) produces the same string.
But it's not generally a good idea to rely upon backslashes being ignored however, since if you edit the string later you might inadvertently add an escape sequence that Python does understand (when you really wanted the raw, un-transformed version). Since regex syntax has a number of its own escape sequences that are also escape sequences in Python (but with different meanings, such as \b and \1), it's a best practice to always write regex patterns with raw strings to avoid introducing issues when editing them.
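The difference is easy to check in a few lines. This is only an illustration of the points above; the sample text "a foo here" is made up for the demonstration.

```python
import re

# A raw literal and a regular literal with doubled backslashes
# describe exactly the same string.
assert r"\(foo\)" == "\\(foo\\)"

# In a regular literal, "\b" is one character: the ASCII backspace.
assert "\b" == "\x08" and len("\b") == 1

# In a raw literal, r"\b" is two characters (backslash and 'b'),
# which the regex engine then reads as a word boundary.
assert len(r"\b") == 2

raw_hit = re.search(r"\bfoo\b", "a foo here")    # word-boundary search
plain_hit = re.search("\bfoo\b", "a foo here")   # searches for backspaces
print(raw_hit is not None, plain_hit is not None)
```

The second search silently looks for backspace characters around foo, which is the kind of bug the raw-string habit prevents.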
Now to bring this around to the example code you've shown. I have no idea why you're using eval at all. As far as I can tell, you've mistakenly wrapped extra quotes around your regex patterns for no good reason, and you're using eval to undo that wrapping. But because only the inner strings use raw string syntax, by the time you eval them it's too late to stop Python's string parsing from messing up your literals if they contain any of the troublesome escape sequences (the outer string will already have parsed \b, for instance, and turned it into the ASCII backspace character \x08).
You should tear the eval code out and fix your literals to avoid the extra quotes. This should work:
regex_list=[r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))', # use raw literals, with no extra quotes!
r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'] # unnecessary backslashes?
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
    mo=re.search(string_regex %gs1,sent,re.I) # no eval here!
    if mo:
        print(mo.group())
This example works for me (it prints foo(+)). Note that you've got some unnecessary backslashes in your second pattern (before the spaces). Those are harmless, but might add even more confusion to a complicated subject (regexes are notoriously hard to understand).
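The damage done by the outer, non-raw literal can be seen without calling eval at all. The two variable names below are made up for this illustration.

```python
# The outer literal is a regular string, so \b becomes a backspace
# before anything could ever evaluate the inner r'...' text.
wrapped = "r'\bfoo\b'"

# Making the outer literal raw keeps the backslashes, but it is still
# a string that merely contains quote characters, not a pattern.
wrapped_raw = r"r'\bfoo\b'"

print(len(wrapped), len(wrapped_raw))
print("\x08" in wrapped, "\x08" in wrapped_raw)
```

Either way the extra quoting layer buys nothing, which is why dropping it is the simpler fix.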

pyparsing with starting and ending string being the same

Related to : Python parsing bracketed blocks
I have a file with the following format :
#
here
are
some
strings
#
and
some
others
    #
    with
    different
    levels
    #
    of
        #
        indentation
        #
    #
#
So a block is defined by a starting #, and a trailing #. However, the trailing # of the n-1th block is also the starting # of the nth block.
I am trying to write a function that, given this format, would retrieve the content of each block, and that could also be recursive.
I started with regexes but abandoned them quite fast (I think you can guess why), so I tried using pyparsing, yet I can't simply write
print(nestedExpr('#','#').parseString(my_string).asList())
Because it raises a ValueError Exception (ValueError: opening and closing strings cannot be the same).
Knowing that I cannot change the input format, do I have any better option than pyparsing for this one?
I also tried using this answer : https://stackoverflow.com/a/1652856/740316, and replaced the {/} with #/# yet it fails to parse the string.
Unfortunately (for you), your grouping is not dependent only on the separating '#' characters, but also on the indent levels (otherwise, ['with','different','levels'] would be at the same level as the previous group ['and','some','others']). Parsing indent-sensitive grammars is not a strong suit for pyparsing - it can be done, but it is not pleasant. To do so we will use the pyparsing helper macro indentedBlock, which also requires that we define a list variable that indentedBlock can use for its indentation stack.
See the embedded comments in the code below to see how you might use one approach with pyparsing and indentedBlock:
from pyparsing import *
test = """\
#
here
are
some
strings
#
and
some
others
    #
    with
    different
    levels
    #
    of
        #
        indentation
        #
    #
#"""
# newlines are significant for line separators, so redefine
# the default whitespace characters for whitespace skipping
ParserElement.setDefaultWhitespaceChars(' ')
NL = LineEnd().suppress()
HASH = '#'
HASH_SEP = Suppress(HASH + Optional(NL))
# a normal line contains a single word
word_line = Word(alphas) + NL
indent_stack = [1]
# word_block is recursive, since word_blocks can contain word_blocks
word_block = Forward()
word_group = Group(OneOrMore(word_line | ungroup(indentedBlock(word_block, indent_stack))) )
# now define a word_block, as a '#'-delimited list of word_groups, with
# leading and trailing '#' characters
word_block <<= (HASH_SEP +
                delimitedList(word_group, delim=HASH_SEP) +
                HASH_SEP)
# the overall expression is one large word_block
parser = word_block
# parse the test string
parser.parseString(test).pprint()
Prints:
[['here', 'are', 'some', 'strings'],
 ['and',
  'some',
  'others',
  [['with', 'different', 'levels'], ['of', [['indentation']]]]]]
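If pyparsing is not a hard requirement, the same nesting can be recovered with a small indentation stack in plain Python. This is a sketch of a different technique, not what the answer above does. The function name parse_hash_blocks is made up, and the code assumes that nesting is expressed with leading spaces, that a '#' at the current indent separates groups, and that a '#' at a deeper indent opens a nested block.

```python
def parse_hash_blocks(text):
    """Parse '#'-delimited, indentation-nested blocks into nested lists."""
    root = None
    stack = []  # (indent, block) pairs; a block is a list of groups

    def close_deeper(indent):
        # Close every block that is indented deeper than the current line.
        while stack and stack[-1][0] > indent:
            _, block = stack.pop()
            if block and not block[-1]:
                block.pop()  # the closing '#' left an empty group behind

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        close_deeper(indent)
        if stripped == '#':
            if stack and stack[-1][0] == indent:
                stack[-1][1].append([])            # separator: new group
            else:
                block = [[]]                       # opener: new block
                if stack:
                    stack[-1][1][-1].append(block)
                else:
                    root = block
                stack.append((indent, block))
        else:
            stack[-1][1][-1].append(stripped)      # ordinary word line
    close_deeper(-1)
    return root


sample = """\
#
here
are
some
strings
#
and
some
others
    #
    with
    different
    levels
    #
    of
        #
        indentation
        #
    #
#"""

print(parse_hash_blocks(sample))
```

The sketch does no error handling: a word line that appears before the first '#' raises an IndexError, and tabs count as a single column.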

how to split very long regular expression in python

I have a regular expression which is very long.
vpa_pattern = '(VAP) ([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}): (.*)'
My code to match group as follows:
class ReExpr:
    def __init__(self):
        self.string=None
    def search(self,regexp,string):
        self.string=string
        self.rematch = re.search(regexp, self.string)
        return bool(self.rematch)
    def group(self,i):
        return self.rematch.group(i)
m = ReExpr()
if m.search(vpa_pattern,line):
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
I tried to split the regular expression pattern across multiple lines in the following ways:
vpa_pattern = '(VAP) \
([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}):\
(.*)'
I even tried:
vpa_pattern = re.compile(('(VAP) \
([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}):\
(.*)'))
But the above methods do not work. Between the groups I have a space, after the closing parenthesis and before the next opening one. I guess it is not picked up when I split the pattern over multiple lines.
Look at re.X flag. It allows comments and ignores white spaces in regex.
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
Python also allows writing a string in parts if the parts are enclosed in parentheses:
>>> text = ("alfa" "beta"
...         "gama")
>>> text
'alfabetagama'
or in your code:
text = ("alfa" "beta"
        "gama" "delta"
        "omega")
print(text)
will print
alfabetagamadeltaomega
It's actually quite simple. You already use the {} notation. Use it again. So instead of:
'([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}):'
which is just [0-9A-Fa-f]{2}: repeated 5 times, followed by one more [0-9A-Fa-f]{2}, you can use:
'((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}):'
The inner group is non-capturing, (?:...), because a repeated capturing group only keeps its last repetition; written this way, group 2 still holds the whole address.
We can even simplify it further by using \d to represent digits:
'((?:[\dA-Fa-f]{2}:){5}[\dA-Fa-f]{2}):'
NOTE: Depending on what re function you use, you can pass in re.IGNORECASE and simplify that chunk down to [\da-f]{2}:
So your final regex is:
'(VAP) ((?:[\dA-Fa-f]{2}:){5}[\dA-Fa-f]{2}): (.*)'
