I know raw string, like r'hello world', prevents escaping.
Is it a good practice to always prepend the r symbol even if the string doesn't have any escaping sequences?
Say my exception needs some string literal explanation, I need to connect to a website whose url is a string literal. They don't have backslash. Are there any performance differences between raw string and regular string?
The r sigil means "backslashes in this string are literal backslashes". Putting this sigil on a string which doesn't contain any backslashes is harmless but sometimes mildly confusing to a human reader. A better approach is probably to only use this sigil when you actually need it.
Situations where the string may not contain backslashes at the moment, but where you might expect to add one in the future, such as in regular expressions and Windows file paths, would probably qualify as useful exceptions.
re.findall(r'hello', string) # what if we switch to r'hello\.'?
with open(r'file.txt') as handle: # what if we switch to r'sub\file.txt'?
It would be easy to forget to add the r when you add a backslash, so supplying it in advance has some merit here.
You can do that in Python. But I don't recommend that because if you add something like '\n', it won't work well. You can use that in Regex and paths on Windows.
Related
I have a series of regular expression patterns defined for automated processing of text. Due to the design of the program, it's better to have these patterns separate in a text file, namely a JSON file. The pattern in Python is of r'' type, but all I can provide is a string. I'd like to retain functionalities such as grouping. I'd like to have features such as entities ([A-z]), so I'm not talking about escaping everything.
I'm using Python 3.4. How do I properly load these patterns into the re module? And what kind of escaping problem should I watch out for?
I am not sure what you want but have a look at this.:
If you have a file called input.txt containing \d+
Then you can use it this way:
import re
f=open("input.txt","r")
x="asasd3243sdfdsf23234sdsdf"
print re.findall(r""+f.readline(),x)
Output:['3243', '23234']
When you use r mode you need not escape anything.
The r'' thing in Python is not a different type than simple ''. The r'' syntax simply creates a string that looks exactly like the one you typed, so the \n sequence stays as \n, and isn't turned into a new line (same thing happens to other special characters). This little r simply escapes everything you type.
Check it yourself with this two simple lines in the console:
print('test \n test')
print(r'test \n test')
print(type(r''))
print(type(''))
Now, while you read lines from JSON file, the escaping is done for you. I don't know how will you create the JSON file, but you should take a look at the json module, and the load method, that will allow you to read a JSON file.
You can use re.escape to escape the strings. However this is escaping everything and you might want some special chars. I'd just use the strings and be careful about placing \ in the right places.
BTW: If you have many regular expressions, matching might get slow. You might want to consider some alternatives such esmre.
This question already has answers here:
Why can't Python's raw string literals end with a single backslash?
(14 answers)
Closed 7 months ago.
str = r'c:\path\to\folder\' # my comment
IDE: Eclipse
Python2.6
When the last character in the string is a backslash, it seems like it will escape the last single quote and treat my comment as part of the string. But the raw string is supposed to ignore all escape characters, right? What could be wrong? Thanks.
Raw string literals don't treat backslashes as initiating escape sequences except when the immediately-following character is the quote-character that is delimiting the literal, in which case the backslash does escape it.
The design motivation is that raw string literals really exist only for the convenience of entering regular expression patterns – that is all, no other design objective exists for such literals. And RE patterns never need to end with a backslash, but they might need to include all kinds of quote characters, whence the rule.
Many people do try to use raw string literals to enable them to enter Windows paths the way they're used to (with backslashes) – but as you've noticed this use breaks down when you do need a path to end with a backslash. Usually, the simplest solution is to use forward slashes, which Microsoft's C runtime and all version of Python support as totally equivalent in paths:
s = 'c:/path/to/folder/'
(side note: don't shadow builtin names, like str, with your own identifiers – it's a horrible practice, without any upside, and unless you get into the habit of avoiding that horrible practice one day you'll find yourseld with a nasty-to-debug problem, when some part of your code tramples over a builtin name and another part needs to use the builtin name in its real meaning).
It's IMHO an inconsistency in Python, but it's described in the documentation. Go to the second last paragraph:
http://docs.python.org/reference/lexical_analysis.html#string-literals
r"\" is not a valid string literal
(even a raw string cannot end in an
odd number of backslashes)
I have a regular expression which works perfectly well (although I am sure it is weak) in .NET/C#:
((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))
I am trying to move it over to Python, but I seem to be running into a formatting issue (invalid expression exception).
It is a lame question/request, but I have been staring at this for a while, but nothing obvious is jumping out at me.
Note: I am simply trying
r = re.compile('((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))')
Thanks,
Scott
There are some syntax incompatibilities between .NET regexps and PCRE/Python regexps :
(?<name>...) is (?P<name>...)
(?...) does not exist, and as I don't know what it is used for in .NET I can't guess any equivalent. A Google codesearch do not give me any pointer to what it could be used for.
Besides, you should use Python raw strings (r"I am a raw string") instead of normal strings when expressing regexps : raw strings do not interpret escape sequences (like \n). But it is not the problem in your example as you're not using any known escape sequence which could be replaced (\s does not mean anything as an escape sequence, so it is not replaced).
Is "(?" there to prevent creation of a separate group? In Python's re's, this is "(:?". Try this:
r = re.compile(r'((^|\s))(:?<tag>\#(:?<tagname>(\w|\+)+))(:?($|\s|\.))')
Also, note the use of a raw string literal (the "r" character just before the quotes). Raw literals suppress '\' escaping, so that your '\' characters pass straight through to re (otherwise, you'd need '\\' for every '\').
What is the best literal delimiter in Python and why? Single ' or double "? And most important, why?
I'm a beginner in Python and I'm trying to stick with just one. I know that in PHP, for example " is preferred, because PHP does not try to search for the 'string' variable. Is the same case in Python?
' because it's one keystroke less than ". Save your wrists!
They're otherwise identical (except you have to escape whichever you choose to use, if they appear inside the string).
Consider these strings:
"Don't do that."
'I said, "okay".'
"""She said, "That won't work"."""
Which quote is "best"?
Semantically there is no difference in Python; use either. Python also provides the handy triple string delimiter """ or ''' which can simplify multi-line quotes. There is also the raw string literal (r"..." or r'...') to inhibit \ escapes. The Language Reference has all the details.
For string constants containing a single quote use the double quote as delimiter.
The other way around, if you need a double quote inside.
Quick, shiftless typing leads to single quote delimiters.
>>> "it's very simple"
>>> 'reference to the "book"'
Single and double quotes act identically in Python. Escapes (\n) always work, and there is no variable interpolation. (If you don't want escapes, you can use the r flag, as in r"\n".)
Since I'm coming from a Perl background, I have a habit of using single quotes for plain strings and double-quotes for formats used with the % operator. But there is really no difference.
Other answers are about nested quoting. Another point of view I've come across, but I'm not sure I subscribe to, is to use single-quotes(') for characters (which are strings, but ord/chr are quick picky) and to use double-quotes for strings. Which disambiguates between a string that is supposed to be one character and one that just happens to be one character.
Personally I find most touch typists aren't affected noticably by the "load" of using the shift-key. YMMV on that part. Going down the "it's faster to not use the shift" is a slippery slope. It's also faster to use hyper-condensed variable/function/class/module names. Everyone just so loves the fast and short 8.3 DOS files names too. :) Pick what makes semantic sense to you, then optimize.
This is a rule I have heard about:
") If the string is for human consuption, that is interface text or output, use ""
') If the string is a specifier, like a dictionary key or an option, use ''
I think a well-enforced rule like that can make sense for a project, but it's nothing that I would personally care much about. I like the above, since I read it, but I always use "" (since I learned C first wayy back?).
I don't think there is a single best string delimiter. I like to use different delimiters to indicate different kinds of string. Specifically, I like to use "..." to delimit stings that are used for interpolation or that are natural language messages, and '...' to delimit small symbol-like strings. This gives me a subtle extra clue to the expected use for the string literal.
I try to always use raw strings (r"...") for regular expressions because (1) I don't have to escape backslash characters and (2) my editor recognises this convention and does syntax highlighting inside the regex.
The stylistic issues of single- vs. double-quotes are covered in question 56011.
I'm parsing a source file, and I want to "suppress" strings. What I mean by this is transform every string like "bla bla bla +/*" to something like "string" that is deterministic and does not contain any characters that may confuse my parser, because I don't care about the value of the strings. One of the issues here is string formatting using e.g. "%s", please see my remark about this below.
Take for example the following pseudo code, that may be the contents of a file I'm parsing. Assume strings start with ", and escaping the " character is done by "":
print(i)
print("hello**")
print("hel"+"lo**")
print("h e l l o "+
"hello\n")
print("hell""o")
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")
Should be transformed to the following result:
print(i)
print("string")
print("string"+"string")
print("string"
"string")
print("string")
print(str(123)+"string")
print(uppercase("string")+"string")
Currently I treat it as a special case in the code (i.e. detect beginning of a string, and "manually" run until its end with several sub-special cases on the way). If there's a Python library function i can use or a nice regex that may make my code more efficient, that would be great.
Few remarks:
I would like the "start-of-string" character to be a variable, e.g. ' vs ".
I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with this right now, but if there's any well established best practice I would like to know about it.
The thing bothering me the most about this "suppression" is the case of string formatting with the likes of '%s', that are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but if any of you have suggestions about how to deal with this that would be great. Please note I'm not interested in the specific type or formatting of the in-string tokens, it's enough for me to know that there are tokens inside the string (how many). Remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
I'm not quite sure about the escaping of the start-string character. What would you say are the common ways this is implemented in most programming languages? Is the assumption of double-occurrence (e.g. "") or any set of two characters (e.g. '\"') to escape enough? Do I need to treat other cases (think of languages like Java, C/C++, PHP, C#)?
Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.
Option 3: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.
Option 2: For most of the languages, you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):
import re
sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
lambda match: match.group(1) or '"string"', source_code)
The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\, \\", \\\" etc. all work fine).
When you are building your regexp, make sure to match comments (so your regexp substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention you match backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).
Use a dedicated parser for each language — especially since people have already done that work for you. Most of the languages you mentioned have a grammar.
Nowhere do you mention that you take an approach using a lexer and parser. If in fact you do not, have a look at e.g. the tokenize module (which is probably what you want), or the 3rd party module PLY (Python Lex-Yacc). Your problem needs a systematic approach, and these tools (and others) provide it.
(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)