strip a verbose python regex - python

I have a verbose python regex string (with lots of whitespace and comments) that I'd like to convert to "normal" style (for export to javascript). In particular, I need this to be quite reliable. If there's any demonstrably correct way to do this, it's what I want. For example, a naive implementation would destroy a regex like r' \# # A literal hash character', which is not OK.
The best way to do this would be to coerce the python re module to give me back a non-verbose representation of my regex, but I don't see a way to do that.

I believe you only need to address these two issues to strip a verbose regex:
delete comments to the end of line
delete unescaped whitespace
Try this, which chains the two as separate regex substitutions:
import re
def unverbosify_regex_simple(verbose):
    WS_RX = r'(?<!\\)((\\{2})*)\s+'
    # The inline flag must lead the pattern; Python 3.11+ rejects a trailing (?m).
    CM_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'
    return re.sub(WS_RX, "\\1", re.sub(CM_RX, "\\1", verbose))
The above is a simplified version that leaves escaped spaces as-is. The resulting output will be a little harder to read but should work for regex platforms.
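One way to gain some confidence in the simplified version is to compile the original pattern with re.VERBOSE and the stripped pattern without it, then check that both agree on a few inputs. The sketch below does that; the sample pattern and the sample inputs are made up for illustration, and the multiline flag is passed through the flags argument rather than inline.

```python
import re

def unverbosify_regex_simple(verbose):
    # Same two substitutions as in the answer, with MULTILINE passed explicitly.
    ws_rx = r'(?<!\\)((\\{2})*)\s+'
    cm_rx = r'(?<!\\)((\\{2})*)#.*$'
    no_comments = re.sub(cm_rx, r'\1', verbose, flags=re.MULTILINE)
    return re.sub(ws_rx, r'\1', no_comments)

# Hypothetical verbose pattern used only for this check.
verbose = r"""
    (\d+)      # some digits
    \#         # a literal hash character
    \ +        # one or more literal spaces
    (\w+)      # a word
"""
stripped = unverbosify_regex_simple(verbose)
print(stripped)

# Both forms should accept and reject the same inputs with the same groups.
samples = ["12# abc", "7#   x", "12 abc", "#abc"]
for s in samples:
    a = re.match(verbose, s, re.VERBOSE)
    b = re.match(stripped, s)
    assert (a is None) == (b is None)
    if a:
        assert a.groups() == b.groups()
```

Agreement on a handful of samples is evidence, not proof; inputs such as a hash or space inside a character class are not covered by this check.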
Alternatively, for a slightly more complex answer that "unescapes" spaces (i.e., '\ ' => ' ') and returns what I think most people would expect:
import re
def unverbosify_regex(verbose):
    # The inline (?m) flag must lead the pattern; Python 3.11+ rejects it elsewhere.
    CM1_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'
    CM2_RX = r'(\\)?((\\{2})*)(#)'
    WS_RX = r'(\\)?((\\{2})*)(\s)\s*'
    def strip_escapes(match):
        ## if the number of slashes is even: delete the space and retain the slashes
        if match.group(1) is None:
            return match.group(2)
        ## if the number of slashes is odd: delete one slash and keep the space (or hash)
        elif match.group(1) == '\\':
            return match.group(2) + match.group(4)
        ## error
        else:
            raise Exception
    not_verbose_regex = re.sub(WS_RX, strip_escapes,
                               re.sub(CM2_RX, strip_escapes,
                                      re.sub(CM1_RX, "\\1", verbose)))
    return not_verbose_regex
UPDATE: added comments to explain even v. odd slash counting. Fixed first group in CM_RX to retain full 'comment' if slash count is odd.
UPDATE 2: Fixed comments regex, which was not dealing with escaped hashes properly. Should handle both "\# #escaped hash" as well as "# comment with \# escaped hash" and "\\# comment"
UPDATE 3: Added a simplified version that doesn't clean up escaped spaces.
UPDATE 4: Further simplification to eliminate variable-length negative lookbehind (and reverse/reverse trick)
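The cases named in UPDATE 2 can be run through the fuller version to see what it returns. This is a sketch that restates the function with lower-case names and an explicit re.MULTILINE flag; the fourth case (an escaped space) is added here for illustration and is not from the updates.

```python
import re

def unverbosify_regex(verbose):
    cm1_rx = r'(?<!\\)((\\{2})*)#.*$'    # unescaped '#' through end of line
    cm2_rx = r'(\\)?((\\{2})*)(#)'       # escaped '#' left over after cm1
    ws_rx = r'(\\)?((\\{2})*)(\s)\s*'    # run of whitespace, possibly escaped

    def strip_escapes(match):
        if match.group(1) is None:
            # Even number of backslashes: drop the character, keep the backslashes.
            return match.group(2)
        # Odd number: drop one backslash, keep the escaped character.
        return match.group(2) + match.group(4)

    no_comments = re.sub(cm1_rx, r'\1', verbose, flags=re.MULTILINE)
    return re.sub(ws_rx, strip_escapes, re.sub(cm2_rx, strip_escapes, no_comments))

cases = [
    r"\# #escaped hash",
    r"# comment with \# escaped hash",
    r"\\# comment",
    r"a\ b  c",
]
for verbose in cases:
    print(repr(verbose), "->", repr(unverbosify_regex(verbose)))
```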

Related

inserting variable into regular expression

I am trying to write code that can extract the values of variables from a text file.
So if the file was
"bob= 1255 mike = 13"
then when I specified bob as var_name it would extract 1255, and so on.
I based my code on this, but it doesn't seem to be working:
var_name = 'bob'
regexp = re.compile(r''+var_name+'.*?([0-9.-]+)')
with open("textfile") as s:
    for line in s:
        match = regexp.match(line)
        if match:
            print(match.group(1))
var_name = 'mike'
regexp = re.compile(r''+var_name+'.*?([0-9.-]+)')
with open("textfile") as s:
    for line in s:
        match = regexp.match(line)
        if match:
            print(match.group(1))
You are using re.match, which only finds things at the start of the string (and mike is not at the start of the string). Use re.search, which finds things at any position.
Slightly off-topic: Note that r'...' does not mean "regexp literal". It means "raw string literal". The purpose of it is to avoid having to escape backslashes inside the string. Now, '' very obviously does not contain any backslashes, so r'' is not at all different from ''. On the other hand, .*?([0-9.-]+) is complex enough that we are not sure whether or not there are (or will be) any backslashes in it - and yet you don't make it into a raw string literal. Puzzling. :) I would have written var_name + r'.*?([0-9.-]+)', without the useless r'' +...
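To make the raw-string point concrete, here is a small sketch comparing raw and ordinary literals. The variable names are only for illustration.

```python
# A raw string literal only changes how backslashes in the source are read.
assert r'' == ''            # no backslashes, so the r prefix changes nothing
assert r'\d' == '\\d'       # two characters either way: a backslash and 'd'
assert len(r'\n') == 2      # backslash followed by 'n'
assert len('\n') == 1       # a single newline character

# The prefix belongs on the part that may contain backslashes.
var_name = 'bob'
pattern = var_name + r'.*?([0-9.-]+)'
print(pattern)
```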
You did not mention what does work / what does not work.
Instead of .*? you should use \s*=\s*. Otherwise you can catch things like #edsakjj*kjn - and I assume you do not want this.
You may also make sure that the number is really a number: -?\d+(\.\d+)?: an optional - (minus, for negative numbers), mandatory digit(s), and optionally a decimal mark followed by digit(s).
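Putting the suggestions from these answers together (search instead of match, an explicit = separator, and a stricter number pattern), a minimal sketch could look like the following. The helper name extract_value is hypothetical, and re.escape is added on the assumption that variable names might contain regex metacharacters.

```python
import re

def extract_value(var_name, text):
    # re.escape guards against regex metacharacters in the variable name.
    pattern = re.escape(var_name) + r'\s*=\s*(-?\d+(?:\.\d+)?)'
    # search() scans the whole string; match() would only try position 0.
    found = re.search(pattern, text)
    return found.group(1) if found else None

line = "bob= 1255 mike = 13"
print(extract_value("bob", line))
print(extract_value("mike", line))
```

The value comes back as a string; convert it with int() or float() if a number is needed.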
Regarding python code, I am not your guy, sorry :(

Issues with string appending - python

I'm trying to append to a string in Python, and the following code
buildVersion = request.values.get("buildVersion", None)
pathToSave = 'processedImages/%s/' % buildVersion
print(pathToSave)
prints out
processedImages/V41
/
I'm expecting the string to be of the format: processedImages/V41/
It doesn't seem to be a newline character:
pathToSave = pathToSave.replace("\n", "")
This didn't really help.
It might not be relevant to the actual question, but in addition to Alex Martelli's answer, I would also check whether buildVersion exists in the first place, because otherwise all the solutions posted here will give you other errors:
import re
# (inside the request-handling function)
buildVersion = request.values.get('buildVersion')
if buildVersion is not None:
    return 'processedImages/{}/'.format(re.sub(r'\W+', '', buildVersion))
else:
    return None
It might be a \r or other special whitespace character. Just clean up buildVersion of all such whitespace before executing
pathToSave = 'processedImages/%s/' % buildVersion
You can approach the clean-up task in several ways -- for example, if valid characters in buildVersion are only "word characters" (letters, digits, underscore), something like
import re
buildVersion = re.sub(r'\W+', '', buildVersion)
would usefully clean up even whitespace inside the string. It's hard to be more specific without knowing exactly what characters you need to accept in buildVersion, of course.
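To find out which character is actually hiding in the value, repr() is a quick diagnostic. The sketch below uses a made-up value with a trailing carriage return and newline (an assumption; the real request value is not shown in the question) and cleans it with str.strip().

```python
build_version = "V41\r\n"   # hypothetical value with trailing whitespace

# repr() makes invisible characters visible.
print(repr(build_version))

# strip() removes leading and trailing whitespace, including \r and \n.
cleaned = build_version.strip()
path_to_save = 'processedImages/%s/' % cleaned
print(path_to_save)
```

Unlike the \W+ substitution, strip() leaves characters such as dots or hyphens inside the version string untouched.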

With pyparsing, how do you parse a quoted string that ends with a backslash

I'm trying to use pyparsing to parse quoted strings under the following conditions:
The quoted string might contain internal quotes.
I want to use backslashes to escape internal quotes.
The quoted string might end with a backslash.
I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).
Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?
Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):
import pyparsing as pp
import re
# A single-quoted string having:
# - Internal escaped quote.
# - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"
# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks
# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print
print rgx.search(txt).group(0)
Output:
txt: 'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks: ["ab'"]
'ab\'cd\'
Update
Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.
Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\
# demo.txt
foo = 'ab\'cd\\'
My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because there are two backslashes at the end, rather than just one.
with open('demo.txt') as fh:
    txt = fh.read().split()[-1].strip()
parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks # ["ab'cd\\\\"]
I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:
qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)
I think you're misunderstanding the use of escQuote. According to the docs:
escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)
So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:
>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})
Notice that the backslash was replaced with '.
As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:
>>> txt = r"'ab\'xxxxxxx"
>>> print rgx.search(txt).group(0)
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print rgx.search(txt).group(0)
'ab\'xxxxxxx'
By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.
Although I can't come up with an example at the moment, I also suspect that this has the potential to cause "catastrophic backtracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like
'one string \' 'or two'
we can't tell whether this is two valid strings (one string \ and or two) or one with invalid material after it (one string \' and the non-string tokens or two followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
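For contrast, here is a sketch of a plain regex in which the backslash is a general escape character everywhere, so the end of each string can be decided locally. This is an illustration using only the re module, not pyparsing's own pattern, and the helper name find_quoted is made up.

```python
import re

# Opening quote, then any run of ordinary characters or backslash escapes,
# then the closing quote. A backslash always consumes the next character.
QUOTED = re.compile(r"'(?:[^'\\\n\r]|\\.)*'")

def find_quoted(text):
    found = QUOTED.search(text)
    return found.group(0) if found else None

print(find_quoted(r"foo = 'ab\'cd\\'"))         # the whole quoted string
print(find_quoted(r"'ab\'xxxxxxx"))             # unterminated, so no match
print(find_quoted(r"'one string \' 'or two'"))  # ends at the first unescaped quote
```

With this rule a string that should end in a backslash has to be written with a doubled backslash, as in the demo.txt example from the question.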
What is it about this code that is not working for you?
from pyparsing import *
s = r"foo = 'ab\'cd\\'" # <--- IMPORTANT - use a raw string literal here
ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue
results = strAssign.parseString(s)
print(results.asList()) # displays repr form of each element
for r in results:
    print(r) # displays str form of each element
# count the backslashes
backslash = '\\'
print(results[-1].count(backslash))
prints:
['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2
EDIT:
So "\'" becomes just "'", but "\\" is parsed yet stays as "\\" instead of being unescaped to "\". Looks like a bug in QuotedString. For now you can add this workaround:
import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
Which will take every escaped character sequence and just give back the escaped character alone, without the leading '\'.
I'll add this in the next patch release of pyparsing.
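The substitution used in that workaround can be tried on its own, without pyparsing. In this sketch, raw_body stands for the text between the outer quotes in demo.txt, and the helper name unescape is made up.

```python
import re

def unescape(body):
    # Replace each backslash-plus-character pair with the character alone.
    return re.sub(r'\\(.)', r'\g<1>', body)

raw_body = r"ab\'cd\\"      # the text between the outer quotes in demo.txt
print(unescape(raw_body))
```

Because the substitution scans left to right and consumes two characters at a time, a doubled backslash collapses to a single one rather than escaping whatever follows it.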
PyParsing's QuotedString parser does not handle quoted strings that end with backslashes. This is a fundamental limitation, that doesn't have any easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.
This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" will raise an exception, while r"bar\\" will include both backslashes in the output.
The reason you are getting truncated output (rather than an exception) from your current code is because you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1 = re.search(r'_abc:(.*)xyz', line, re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character, including newlines.
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
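If only the text between the two markers is wanted, a capturing group does the job. The following is a sketch of that variant: the function name text_between and the shortened sample text are made up, and the group is non-greedy on the assumption that the match should stop at the first suffix after the prefix.

```python
import re

def text_between(prefix, suffix, text):
    # (.*?) is non-greedy, so it stops at the first suffix after the prefix.
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    found = re.search(pattern, text, re.DOTALL)
    return found.group(1) if found else None

sample = "sjask_abc:\ncdf_1:\n  dfsflks %jd\nzjddjh -15\nxyz"
print(repr(text_between("_abc:", "xyz", sample)))

# Without re.DOTALL the dot cannot cross a newline, so nothing matches.
print(re.search(r"_abc:(.*?)xyz", sample))
```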

Repeating a python regular expression until a certain char

I want to get all of the text until a ! appears. Example
some textwfwfdsfosjtortjk\n
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf\n
sfsgdfgdfgdgdfgdg\n
!
The number of lines before the ! changes so I can't hardcode a reg exp like this
"+\n^.+\n^.+"
I am using re.MULTILINE, but should I be using re.DOTALL?
Thanks
Why does this need a regular expression?
index = text.find('!')
if index > -1:
    text = text[:index] # or [:index + 1] to keep the '!', too
So you want to match everything from the beginning of the input up to (but not including) the first ! character? This should do it:
re.match(r'[^!]*', input)
If there are no exclamation points this will match the whole string. If you want to match only strings with ! in them, add a lookahead:
re.match(r'[^!]*(?=!)', input)
The MULTILINE flag is not needed because there are no anchors (^ and $), and DOTALL isn't needed because there are no dots.
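A short sketch of both patterns on a multi-line input, and on an input without any exclamation point. The sample strings are made up.

```python
import re

text = "some text\nmore text\nlast line\n! trailing"

# [^!] matches newlines too, so no flags are needed.
print(repr(re.match(r'[^!]*', text).group()))

# Without any '!', the plain pattern matches everything...
print(repr(re.match(r'[^!]*', "no bang here").group()))
# ...while the lookahead version does not match at all.
print(re.match(r'[^!]*(?=!)', "no bang here"))
```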
Following the Python philosophy of "Easier to Ask Forgiveness Than Permission" (EAFP), I suggest you create a subroutine which is easy to understand and later maintain, should your separator change.
SEPARATOR = u"!"
def process_string(s):
    try:
        return s[:s.index(SEPARATOR)]
    except ValueError:
        return s
This function will return the string from the beginning up to, and not including, whatever you defined as separator. If the separator is not found, it will return the whole string. The function works regardless of new lines. If your separator changes, simply change SEPARATOR and you are good to go.
ValueError is the exception raised when you request the index of a character that is not in the string (try it in the command line: "Hola".index("1") will raise ValueError: substring not found). The workflow assumes that most of the time you expect the SEPARATOR character to be in the string, so you attempt that first without asking for permission (testing whether SEPARATOR is in the string); if that fails (the index method raises ValueError), you ask forgiveness (return the string as originally received). This approach (EAFP) is considered Pythonic when it applies, as it does in this case.
No regular expressions needed; this is a simple problem.
Look into a 'lookahead' for that particular character you're reading, and match the whole first part as a pattern instead.
I'm not sure exactly how Python's regex reader is different from Ruby, but you can play with it in rubular.com
Maybe something like:
([^!]*(?=\!))
(Just tried this, seems to work)
It should do the job.
re.compile('(.*?)!', re.DOTALL).match(yourString).group(1)
I think you're making this more complex than it needs to be. Your reg exp just needs to say "repeat(any character except !) followed by !". Remember [^!] means "any character except !".
So, like this:
>>> import re
>>> rexp = re.compile("([^!]*)!")
>>> test = """sdasd
... asdasdsa
... asdasdasd
... asdsadsa
... !"""
>>> rexp.findall(test)
['sdasd\nasdasdsa\nasdasdasd\nasdsadsa\n']
>>>
re.DOTALL should be sufficient:
import re
text = """some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
!"""
rExp = re.compile(r"(.*)!", re.S)
print(rExp.search(text).groups()[0])
some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
