I'm building a lexer using ply in Python. I have two tokens called TkConjuncion (which refers to logical and) and TkDisjuncion (which refers to logical or).
The rules for both of them are written as follows (there are other rules as well, but they are irrelevant here):
t_TkDisjuncion = '\\\/'
t_TkConjuncion = '\/\\'
Here \\\/ is meant to match \/ and \/\\ is meant to match /\. But when I test my code, it says:
ERROR: Invalid regular expression for rule 't_TkConjuncion'.
unbalanced parenthesis
The \\ is read by the lexer as \, so t_TkDisjuncion is accepted, but I don't understand why the other token isn't. I've been researching on the web but found nothing.
Any ideas of why this is happening?
I don't know, but I bet there's more than one level of backslash interpretation going on. Python certainly does a level when it compiles the string literals (it collapses \\ to \, but leaves unrecognized escapes such as \/ alone, backslash included). The actual strings you create in your example are
\\/
and
\/\
If ply goes on to embed those in a regular expression without escaping them first (this is the part I don't know about, but I think it's likely), then the trailing backslash in the second string will act to escape whatever follows it. That is likely to be a right parenthesis, hence the "unbalanced parenthesis" complaint.
Anyway, try making these raw strings instead:
t_TkDisjuncion = r'\\\/'
t_TkConjuncion = r'\/\\'
The "r" prefix prevents Python from treating backslashes specially, so that the actual strings those lines create are
\\\/
and
\/\\
If those are then embedded in a regular expression without escaping them first (which is up to ply, not up to you), they'll do what you intended.
EDIT: I'm pretty sure that's it. Looking at the ply docs, tokens are indeed specified using regexps, and the docs recommend raw strings for exactly this reason (to avoid the double interpretation of backslashes described above).
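For reference, a minimal self-contained sketch of the fixed lexer (the token names come from the question; the rest is standard ply boilerplate). Note that the forward slash needs no escaping in a regex, so the shorter r'\\/' and r'/\\' work just as well; I use those here:

import ply.lex as lex

tokens = ('TkDisjuncion', 'TkConjuncion')

t_TkDisjuncion = r'\\/'   # the regex \\/ matches the two characters \/
t_TkConjuncion = r'/\\'   # the regex /\\ matches the two characters /\
t_ignore = ' \t'

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input(r'\/ /\ ')
for tok in lexer:
    print(tok.type, tok.value)   # TkDisjuncion \/  then  TkConjuncion /\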
I am reading the Shinken source code in shinken/misc/perfdata.py and I finally found a regex that I cannot understand:
metric_pattern = re.compile('^([^=]+)=([\d\.\-\+eE]+)([\w\/%]*);?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE]+)?;?([\d\.\-\+eE]+)?;?\s*')
What confuses me is: what does \/ mean in ([\w\/%]*)?
You're rightfully confused, because that regex must have been written by someone who doesn't know Python regexes well.
In some languages (e.g. JavaScript), regexes are delimited by slashes. That means that if you need an actual slash in your regex, you have to escape it. Since Python doesn't use slashes, there's no need to escape the slash (but it doesn't cause an error, either).
Much more worrisome is that the author failed to use a raw string. In many cases that won't matter (because Python treats "\d" as "\\d", which then correctly translates to the regex \d), but in other cases it will cause problems. One example is "\b", which means "a backspace character" in a plain string literal, not "a word boundary anchor" like the regex \b would.
Also, the author has escaped a lot of characters that didn't need escaping at all. The entire regex could be rewritten as
metric_pattern = re.compile(r'^([^=]+)=([\d.+eE-]+)([\w/%]*);?([\d.+eE:~#-]+)?;?([\d.+eE:~#-]+)?;?([\d.+eE-]+)?;?([\d.+eE-]+)?;?\s*')
and even then, I'm surprised that it works at all. It looks very chaotic to me and is definitely not foolproof. For example, there appears to be big potential for catastrophic backtracking, meaning that users could freeze your server with malicious input.
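To see both points in the interpreter (plain re, nothing specific to Shinken):

import re

# \/ and / are the same thing to Python's regex engine:
print(re.findall(r'\/', 'a/b'))   # ['/']
print(re.findall(r'/', 'a/b'))    # ['/']

# but forgetting the raw-string prefix can silently change a pattern:
print(re.findall('\bfoo\b', 'foo'))    # []      -- '\b' here is a backspace character
print(re.findall(r'\bfoo\b', 'foo'))   # ['foo'] -- r'\b' is a word boundary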
str = r'c:\path\to\folder\' # my comment
IDE: Eclipse
Python2.6
When the last character in the string is a backslash, it seems like it will escape the last single quote and treat my comment as part of the string. But the raw string is supposed to ignore all escape characters, right? What could be wrong? Thanks.
Raw string literals don't treat backslashes as initiating escape sequences, except when the immediately following character is the quote character delimiting the literal; in that case the backslash prevents the quote from terminating the literal (but both characters stay in the string).
The design motivation is that raw string literals really exist only for the convenience of entering regular expression patterns – that is all, no other design objective exists for such literals. And RE patterns never need to end with a backslash, but they might need to include all kinds of quote characters, whence the rule.
Many people do try to use raw string literals to enter Windows paths the way they're used to (with backslashes), but as you've noticed, this use breaks down when you do need a path to end with a backslash. Usually the simplest solution is to use forward slashes, which Microsoft's C runtime and all versions of Python support as totally equivalent in paths:
s = 'c:/path/to/folder/'
(Side note: don't shadow builtin names, like str, with your own identifiers. It's a horrible practice with no upside, and unless you get into the habit of avoiding it, one day you'll find yourself with a nasty-to-debug problem, when one part of your code tramples over a builtin name and another part needs to use that name in its real meaning.)
It's IMHO an inconsistency in Python, but it's described in the documentation. Go to the second last paragraph:
http://docs.python.org/reference/lexical_analysis.html#string-literals
r"\" is not a valid string literal
(even a raw string cannot end in an
odd number of backslashes)
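To illustrate the rule and the usual workarounds (plain Python, nothing assumed):

s = r'can\'t'   # both characters stay: the string is can\'t
print(len(s))   # 6

path = 'c:/path/to/folder/'          # forward slashes work in Windows paths
path = r'c:\path\to\folder' + '\\'   # or append the trailing backslash separately
path = 'c:\\path\\to\\folder\\'      # or escape everything in a normal string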
I have a regular expression which works perfectly well (although I am sure it is weak) in .NET/C#:
((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))
I am trying to move it over to Python, but I seem to be running into a formatting issue (invalid expression exception).
It is a lame question/request, but I have been staring at this for a while, but nothing obvious is jumping out at me.
Note: I am simply trying
r = re.compile('((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))')
Thanks,
Scott
There are some syntax incompatibilities between .NET regexps and PCRE/Python regexps:
(?<name>...) is (?P<name>...)
(?...) does not exist, and as I don't know what it is used for in .NET, I can't guess an equivalent. A Google code search did not give me any pointer to what it could be used for.
Besides, you should use Python raw strings (r"I am a raw string") instead of normal strings when writing regexps: raw strings do not interpret escape sequences (like \n). But that is not the problem in your example, since you're not using any recognized escape sequence that would be replaced (\s means nothing as an escape sequence, so it is left alone).
Is "(?" there to prevent creation of a separate group? In Python's re's, this is "(:?". Try this:
r = re.compile(r'((^|\s))(:?<tag>\#(:?<tagname>(\w|\+)+))(:?($|\s|\.))')
Also, note the use of a raw string literal (the "r" character just before the quotes). Raw literals suppress '\' escaping, so that your '\' characters pass straight through to re (otherwise, you'd need '\\' for every '\').
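For what it's worth, here is a hedged sketch of a full Python translation, using named groups as described in the other answer, and a lookahead in place of the trailing .NET (?(...)) construct; whether that preserves the original intent is an assumption on my part. Note that \# and (\w|\+)+ simplify to # and [\w+]+:

import re

# (^|\s)   -- start of string or whitespace before the tag
# (?=...)  -- lookahead standing in for the .NET conditional at the end
r = re.compile(r'(^|\s)(?P<tag>#(?P<tagname>[\w+]+))(?=$|\s|\.)')

m = r.search('porting #regex from .NET')
print(m.group('tag'), m.group('tagname'))   # #regex regex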
What is the best string literal delimiter in Python, and why? Single ' or double "? And most importantly, why?
I'm a beginner in Python and I'm trying to stick with just one. I know that in PHP, for example, ' is preferred because PHP does not search for variables inside single-quoted strings. Is it the same in Python?
' because it's one keystroke less than ". Save your wrists!
They're otherwise identical (except you have to escape whichever you choose to use, if they appear inside the string).
Consider these strings:
"Don't do that."
'I said, "okay".'
"""She said, "That won't work"."""
Which quote is "best"?
Semantically there is no difference in Python; use either. Python also provides the handy triple string delimiter """ or ''' which can simplify multi-line quotes. There is also the raw string literal (r"..." or r'...') to inhibit \ escapes. The Language Reference has all the details.
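For reference, the forms mentioned above side by side:

a = 'single'
b = "double"
c = """a triple-quoted string
can span lines and contain ' and " freely"""
d = r'raw\nstring'   # the backslash and the n stay as two separate characters
print(a, b, c, d)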
For string constants containing a single quote, use the double quote as delimiter, and the other way around if you need a double quote inside.
Quick, shiftless typing leads to single quote delimiters.
>>> "it's very simple"
>>> 'reference to the "book"'
Single and double quotes act identically in Python. Escapes (\n) always work, and there is no variable interpolation. (If you don't want escapes, you can use the r prefix, as in r"\n".)
Since I'm coming from a Perl background, I have a habit of using single quotes for plain strings and double-quotes for formats used with the % operator. But there is really no difference.
Other answers are about nested quoting. Another point of view I've come across, but am not sure I subscribe to, is to use single quotes (') for characters (which are strings, but ord/chr are quite picky) and double quotes (") for strings. This disambiguates between a string that is supposed to be one character and one that just happens to be one character.
Personally, I find most touch typists aren't noticeably affected by the "load" of using the shift key. YMMV on that part. Going down the "it's faster not to use shift" road is a slippery slope. It's also faster to use hyper-condensed variable/function/class/module names. Everyone just so loves the fast and short 8.3 DOS file names too. :) Pick what makes semantic sense to you, then optimize.
This is a rule I have heard about:
") If the string is for human consuption, that is interface text or output, use ""
') If the string is a specifier, like a dictionary key or an option, use ''
I think a well-enforced rule like that can make sense for a project, but it's nothing I would personally care much about. I like the above since I read it, but I always use "" (since I learned C first, way back?).
I don't think there is a single best string delimiter. I like to use different delimiters to indicate different kinds of string. Specifically, I like to use "..." to delimit strings that are used for interpolation or that are natural-language messages, and '...' to delimit small symbol-like strings. This gives me a subtle extra clue to the expected use of the string literal.
I try to always use raw strings (r"...") for regular expressions because (1) I don't have to escape backslash characters and (2) my editor recognises this convention and does syntax highlighting inside the regex.
The stylistic issues of single- vs. double-quotes are covered in question 56011.
I'm parsing a source file, and I want to "suppress" strings. What I mean by this is transform every string like "bla bla bla +/*" to something like "string" that is deterministic and does not contain any characters that may confuse my parser, because I don't care about the value of the strings. One of the issues here is string formatting using e.g. "%s", please see my remark about this below.
Take for example the following pseudo code, that may be the contents of a file I'm parsing. Assume strings start with ", and escaping the " character is done by "":
print(i)
print("hello**")
print("hel"+"lo**")
print("h e l l o "+
"hello\n")
print("hell""o")
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")
Should be transformed to the following result:
print(i)
print("string")
print("string"+"string")
print("string"
"string")
print("string")
print(str(123)+"string")
print(uppercase("string")+"string")
Currently I treat this as a special case in the code (i.e. I detect the beginning of a string and "manually" run until its end, with several sub-special cases along the way). If there's a Python library function I can use, or a nice regex that would make my code more efficient, that would be great.
A few remarks:
I would like the "start-of-string" character to be a variable, e.g. ' vs ".
I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with this right now, but if there's any well established best practice I would like to know about it.
The thing bothering me most about this "suppression" is the case of string formatting with the likes of '%s', which are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but if any of you have suggestions about how to deal with it, that would be great. Please note I'm not interested in the specific type or formatting of the in-string tokens; it's enough for me to know that there are tokens inside the string, and how many. A remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
I'm not quite sure about the escaping of the start-string character. What would you say are the common ways this is implemented in most programming languages? Is it enough to assume doubling (e.g. "") or a two-character escape sequence (e.g. \")? Do I need to handle other cases (think of languages like Java, C/C++, PHP, C#)?
Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file (a small sketch follows these options).
Option 2: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.
Option 3: For most languages you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):
import re
sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
                   lambda match: match.group(1) or '"string"', source_code)
The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\", \\\" etc. all work fine).
When you are building your regexp, make sure to match comments (so your substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and take care to match backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).
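Here is a minimal sketch of what Option 1 might look like in practice (Python 3 standard library only; suppress_strings is a hypothetical helper name). Passing 2-tuples to untokenize keeps it robust at the cost of slightly different whitespace in the output:

import io
import tokenize

def suppress_strings(source):
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            out.append((tokenize.STRING, '"string"'))   # replace every literal
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)

print(suppress_strings('print("hel" + "lo")\n'))   # print ("string"+"string")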
Use a dedicated parser for each language — especially since people have already done that work for you. Most of the languages you mentioned have a grammar.
Nowhere do you mention that you are taking an approach based on a lexer and parser. If in fact you are not, have a look at e.g. the tokenize module (which is probably what you want), or the third-party module PLY (Python Lex-Yacc). Your problem needs a systematic approach, and these tools (and others) provide it.
(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)
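As a concrete example of that last remark, here is one simplified way to count %-style directives inside a string's text (a sketch only: it recognizes %% and the common one-letter conversions, and count_format_tokens is a hypothetical helper name):

import re

# escaped %% or a conversion: optional flags, width, precision, one type letter
FMT = re.compile(r'%(?:%|[-+ #0]*\d*(?:\.\d+)?[diouxXeEfFgGcrs])')

def count_format_tokens(text):
    # count real conversions, skipping the escaped %% form
    return sum(1 for m in FMT.finditer(text) if m.group() != '%%')

print(count_format_tokens('x=%s y=%05.2f done %%'))   # 2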