Regex From .NET to Python - python

I have a regular expression which works perfectly well (although I am sure it is weak) in .NET/C#:
((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))
I am trying to move it over to Python, but I seem to be running into a formatting issue (invalid expression exception).
It is a lame question/request, but I have been staring at this for a while and nothing obvious is jumping out at me.
Note: I am simply trying
r = re.compile('((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))')
Thanks,
Scott

There are some syntax incompatibilities between .NET regexps and PCRE/Python regexps:
(?<name>...) is (?P<name>...)
(?(...)) is not accepted by Python in this form. In .NET it is a conditional construct; Python's re only supports conditionals that test whether a group matched, written (?(id/name)yes|no). If the intent is to require that the tag be followed by end of string, whitespace, or a dot, a lookahead such as (?=$|\s|\.) is probably the closest equivalent.
Besides, you should use Python raw strings (r"I am a raw string") instead of normal strings when expressing regexps: raw strings do not interpret escape sequences (like \n). That is not the problem in your example, though, as you're not using any recognized escape sequence that would be replaced (\s does not mean anything as a string escape sequence, so it is left alone).
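The differences listed above can be tried out with a small sketch. It assumes the trailing .NET construct was meant as "followed by end of string, whitespace, or a dot", and it simplifies the character alternation to a character class; the sample text is made up for illustration.

```python
import re

# Assumption: the trailing .NET group was intended as a lookahead
# ("followed by end of string, whitespace, or a dot").
TAG_RE = re.compile(r'(?:^|\s)(?P<tag>#(?P<tagname>[\w+]+))(?=$|\s|\.)')

sample = "Learning #python and #c++ today. #regex"  # hypothetical input
tagnames = [m.group('tagname') for m in TAG_RE.finditer(sample)]
print(tagnames)
```

Named groups are then read with m.group('tag') and m.group('tagname'), much like Groups["tag"] in .NET.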

Is "(?" there to prevent creation of a separate group? In Python's re, a non-capturing group is written "(?:...)" and a named group is written "(?P<name>...)". Try this:
r = re.compile(r'((^|\s))(?P<tag>\#(?P<tagname>(\w|\+)+))(?:($|\s|\.))')
Also, note the use of a raw string literal (the "r" character just before the quotes). Raw literals suppress '\' escaping, so that your '\' characters pass straight through to re (otherwise, you'd need '\\' for every '\').

Related

Should I use raw string by default?

I know raw string, like r'hello world', prevents escaping.
Is it a good practice to always prepend the r symbol even if the string doesn't have any escaping sequences?
Say my exception needs a string literal as its message, or I need to connect to a website whose URL is a string literal. Neither contains a backslash. Are there any performance differences between raw strings and regular strings?
The r sigil means "backslashes in this string are literal backslashes". Putting this sigil on a string which doesn't contain any backslashes is harmless but sometimes mildly confusing to a human reader. A better approach is probably to only use this sigil when you actually need it.
Situations where the string may not contain backslashes at the moment, but where you might expect to add one in the future, such as in regular expressions and Windows file paths, would probably qualify as useful exceptions.
re.findall(r'hello', string) # what if we switch to r'hello\.'?
with open(r'file.txt') as handle: # what if we switch to r'sub\file.txt'?
It would be easy to forget to add the r when you add a backslash, so supplying it in advance has some merit here.
You can do that in Python, but I don't recommend it as a default: in a raw string, an escape such as '\n' is no longer a newline but a literal backslash followed by n. Reserve it for regular expressions and Windows paths.
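A short sketch of the points made in this thread: the r prefix changes nothing when there is no backslash, and changes the value as soon as there is one. The URL and file name are placeholders.

```python
# Without backslashes, the raw and the regular literal are the same string.
plain = 'http://example.com/index.html'   # placeholder URL
raw = r'http://example.com/index.html'
same_without_backslash = (plain == raw)

# With backslashes they differ: '\n' is one character, r'\n' is two.
newline_len = len('\n')
raw_newline_len = len(r'\n')

# Forgetting the r on a Windows-style path silently changes it:
# '\f' is the form feed escape.
path_plain = 'sub\file.txt'
path_raw = r'sub\file.txt'

print(same_without_backslash, newline_len, raw_newline_len, path_plain == path_raw)
```

Both forms produce an ordinary str object; the prefix only affects how the source text is read.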

Ply unbalanced parentheses in regular expression

I'm building a lexer using ply in Python. I have 2 tokens called TkConjuncion (which refers to logical and) and TkDisjuncion (which refers to logical or).
The rules for both of them are written as follows (there are other rules as well but irrelevant):
t_TkDisjuncion = '\\\/'
t_TkConjuncion = '\/\\'
Where \\\/ is \/ and \/\\ is /\. But when I test my code it says:
ERROR: Invalid regular expression for rule 't_TkConjuncion'.
unbalanced parenthesis
The \\ is read by the lexer as \, so it accepts t_TkDisjuncion, but I don't understand why it doesn't accept the other token. I've been researching on the web but I found nothing.
Any ideas of why this is happening?
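I don't know, but I bet there's more than 1 level of backslash interpretation going on. Python certainly does a level when it compiles the string literals. Since \/ is not a recognized string escape, Python keeps that backslash, while \\ collapses to a single backslash. The actual strings you create in your example are
\\/
and
\/\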
If ply goes on to embed those in a regular expression without escaping them first (this is the part I don't know about - but think it's likely), then the trailing backslash in the second string will act to escape whatever follows it. Which is likely to be a right parenthesis, and hence an "unbalanced parentheses" complaint.
Anyway, try making these raw strings instead:
t_TkDisjuncion = r'\\\/'
t_TkConjuncion = r'\/\\'
The "r" prefix prevents Python from treating backslashes specially, so that the actual strings those lines create are
\\\/
and
\/\\
If those are then embedded in a regular expression without escaping them first (which is up to ply, not up to you), they'll do what you intended.
EDIT I'm pretty sure that's it. Looking at the ply docs, tokens are indeed specified using regexps, and the docs recommend using raw strings because of this (to avoid the double interpretation of backslashes I talked about above).
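The double interpretation can be reproduced with re alone. This is only a sketch: the wrap helper is hypothetical and assumes the lexer embeds each rule in a named group; ply's actual internals may differ. The non-raw values are written with explicit escapes so the sketch itself contains no unrecognized escape sequences.

```python
import re

# The values the non-raw literals '\\\/' and '\/\\' evaluate to.
disj_plain = '\\\\/'   # backslash, backslash, slash
conj_plain = '\\/\\'   # backslash, slash, backslash

# The raw versions suggested above.
disj_raw = r'\\\/'
conj_raw = r'\/\\'

def wrap(name, pattern):
    # Hypothetical: assumes each rule ends up inside a named group.
    return '(?P<%s>%s)' % (name, pattern)

def compiles(pattern):
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The trailing backslash of conj_plain escapes the closing parenthesis.
print(compiles(wrap('TkDisjuncion', disj_plain)))
print(compiles(wrap('TkConjuncion', conj_plain)))
print(compiles(wrap('TkConjuncion', conj_raw)))
```

With the raw strings, both wrapped patterns compile and match the two-character operators.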

basic regex operations in Python

I'm following a tutorial about regular expression. I'm getting an error when doing the following:
regex = r'(+|-)?\d*\.?\d*'
Apparently, Python doesn't like (+|-). What could be the problem?
Also, what could be the issue with not adding r ahead of the regex?
You need to escape + in regular expressions to get a literal +, because it usually means "one or more instances of something":
regex = r'(\+|-)?\d*\.?\d*'
And r makes it a "raw" string. Without the r, the regular expression escape sequences will be interpreted as string escape sequences, and they'll cause all sorts of problems. (\b being a backspace instead of a word boundary, and that kind of thing.)
+ is a special character. You can use brackets to specify a set of characters, which is better than using an "or" with the pipe character in this case:
regex = r'([+-])?\d*\.?\d*'
Otherwise, you just need to escape it in your original version:
regex = r'(\+|-)?\d*\.?\d*'
Using the r is the preferred way of specifying a regex string in python because it indicates a raw string, which should not be interpreted and reduces the amount of escaping you must perform with backslashes. It is just a python regex idiom you will see everywhere.
r'(\+|-)?\d*\.?\d*'
#'(\\+|-)?\\d*\\.?\\d*'
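A quick check of the answers above, as a sketch: the unescaped + is rejected at compile time, the character-class version matches signed decimals, and the raw literal equals the doubled-backslash literal. The test strings are made up for illustration.

```python
import re

# The original pattern fails to compile: '+' has nothing to repeat.
try:
    re.compile(r'(+|-)?\d*\.?\d*')
    compile_error = None
except re.error as exc:
    compile_error = exc

# Character-class version from the answer above.
number = re.compile(r'[+-]?\d*\.?\d*')
matches_negative = number.fullmatch('-3.14') is not None
matches_word = number.fullmatch('abc') is not None

# The raw literal and the doubled-backslash literal are the same string.
same_pattern = r'(\+|-)?\d*\.?\d*' == '(\\+|-)?\\d*\\.?\\d*'

# Without r, '\b' is a backspace character, not a word boundary.
backspace = '\b' == '\x08'

print(compile_error is not None, matches_negative, matches_word, same_pattern, backspace)
```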

Python raw literal string [duplicate]

str = r'c:\path\to\folder\' # my comment
IDE: Eclipse
Python2.6
When the last character in the string is a backslash, it seems like it will escape the last single quote and treat my comment as part of the string. But the raw string is supposed to ignore all escape characters, right? What could be wrong? Thanks.
Raw string literals don't treat backslashes as initiating escape sequences except when the immediately-following character is the quote-character that is delimiting the literal, in which case the backslash does escape it.
The design motivation is that raw string literals really exist only for the convenience of entering regular expression patterns – that is all, no other design objective exists for such literals. And RE patterns never need to end with a backslash, but they might need to include all kinds of quote characters, whence the rule.
Many people do try to use raw string literals to enable them to enter Windows paths the way they're used to (with backslashes), but as you've noticed this use breaks down when you do need a path to end with a backslash. Usually, the simplest solution is to use forward slashes, which Microsoft's C runtime and all versions of Python support as totally equivalent in paths:
s = 'c:/path/to/folder/'
(side note: don't shadow builtin names, like str, with your own identifiers. It's a horrible practice, without any upside, and unless you get into the habit of avoiding it, one day you'll find yourself with a nasty-to-debug problem, when some part of your code tramples over a builtin name and another part needs to use the builtin name in its real meaning).
It's IMHO an inconsistency in Python, but it's described in the documentation. Go to the second last paragraph:
http://docs.python.org/reference/lexical_analysis.html#string-literals
r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes)
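If a trailing backslash is really needed, one common workaround is to append it from a regular literal. The sketch below also shows that a backslash before the delimiting quote stays in a raw string; the path is the one from the question.

```python
# Raw part for the body of the path, regular literal for the final backslash.
folder = r'c:\path\to\folder' + '\\'
print(folder)

# In a raw string, backslash-quote keeps BOTH characters.
quoted = r'\''
print(len(quoted))

# The forward-slash alternative needs no escaping at all.
folder_fwd = 'c:/path/to/folder/'
```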

Python: what kind of literal delimiter is "better" to use?

What is the best literal delimiter in Python and why? Single ' or double "? And most important, why?
I'm a beginner in Python and I'm trying to stick with just one. I know that in PHP, for example " is preferred, because PHP does not try to search for the 'string' variable. Is the same case in Python?
' because it's one keystroke less than ". Save your wrists!
They're otherwise identical (except you have to escape whichever you choose to use, if they appear inside the string).
Consider these strings:
"Don't do that."
'I said, "okay".'
"""She said, "That won't work"."""
Which quote is "best"?
Semantically there is no difference in Python; use either. Python also provides the handy triple string delimiter """ or ''' which can simplify multi-line quotes. There is also the raw string literal (r"..." or r'...') to inhibit \ escapes. The Language Reference has all the details.
For string constants containing a single quote use the double quote as delimiter.
The other way around, if you need a double quote inside.
Quick, shiftless typing leads to single quote delimiters.
>>> "it's very simple"
>>> 'reference to the "book"'
Single and double quotes act identically in Python. Escapes (\n) always work, and there is no variable interpolation. (If you don't want escapes, you can use the r flag, as in r"\n".)
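A minimal sketch confirming the points above: the delimiter does not change the resulting string, escapes behave the same in both, and nothing is interpolated.

```python
# Same string, different delimiters.
a = "Don't do that."
b = 'Don\'t do that.'
same_text = (a == b)

# Triple quotes allow both kinds of quote inside without escaping.
c = """She said, "That won't work"."""

# Escapes work with either delimiter; the r prefix turns them off.
lengths = (len("\n"), len('\n'), len(r"\n"))

# No variable interpolation: the text is kept as written.
name = "world"
literal = "hello $name"

print(same_text, c, lengths, literal)
```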
Since I'm coming from a Perl background, I have a habit of using single quotes for plain strings and double-quotes for formats used with the % operator. But there is really no difference.
Other answers are about nested quoting. Another point of view I've come across, but I'm not sure I subscribe to, is to use single quotes (') for characters (which are strings, but ord/chr are quite picky) and to use double quotes for strings. That disambiguates between a string that is supposed to be one character and one that just happens to be one character.
Personally I find most touch typists aren't affected noticeably by the "load" of using the shift key. YMMV on that part. Going down the "it's faster to not use the shift" road is a slippery slope. It's also faster to use hyper-condensed variable/function/class/module names. Everyone just so loves the fast and short 8.3 DOS file names too. :) Pick what makes semantic sense to you, then optimize.
This is a rule I have heard about:
") If the string is for human consumption, that is interface text or output, use ""
') If the string is a specifier, like a dictionary key or an option, use ''
I think a well-enforced rule like that can make sense for a project, but it's nothing that I would personally care much about. I like the above, since I read it, but I always use "" (since I learned C first way back?).
I don't think there is a single best string delimiter. I like to use different delimiters to indicate different kinds of string. Specifically, I like to use "..." to delimit strings that are used for interpolation or that are natural language messages, and '...' to delimit small symbol-like strings. This gives me a subtle extra clue to the expected use for the string literal.
I try to always use raw strings (r"...") for regular expressions because (1) I don't have to escape backslash characters and (2) my editor recognises this convention and does syntax highlighting inside the regex.
The stylistic issues of single- vs. double-quotes are covered in question 56011.
