How to auto-escape a string in python? - python

Does such a thing exist to turn This is a path into This\ is\ a\ path?
Preferably with all the standard escape sequences such as \t, \n, etc.
The unicode_escape codec makes it easy enough to turn real control characters back into their escape sequences:
print("foo\nbar".encode("unicode_escape").decode("utf-8"))
I'll admit my first response to this question was "use pathlib.Path", and anyone else looking at this should probably do that too. However, after digging in and asking questions, that doesn't work for my situation. It really has to be a tool that turns a string into its escaped version, in the style of \t, \n, \r, etc. It has nothing to do with paths, I guess. It's easy enough to write a simple function, but I'm curious whether something already exists.
The web has a URL encoder, so asking for an equivalent for the \t/\n style isn't that crazy, is it?
http://www.utilities-online.info/urlencode/#.XVYwRZNKg5c
test this string
and this one
test%20this%20string%0Aand%20this%20one
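For reference, the percent-encoding that tool produces is one call with the standard library; a minimal sketch, assuming Python 3:
import urllib.parse

s = "test this string\nand this one"

# Percent-encode, matching the online tool's output
encoded = urllib.parse.quote(s)
print(encoded)  # test%20this%20string%0Aand%20this%20one

# And decode it back again
print(urllib.parse.unquote(encoded) == s)  # True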

In Python there is a replace method that swaps one substring in a string for another:
string = 'This is a path'
string = string.replace(' ', '\\ ')
print(string)  # This\ is\ a\ path
Note that replace returns a new string rather than changing the original, so you have to assign the result back. The first argument is what you want to change and the second is what you want to replace it with. By default every occurrence is replaced; an optional third argument limits how many replacements are made, e.g. string.replace(' ', '\\ ', 1) only touches the first space.
You can also use regex but I think this is the most 'pythonic' way of doing things since you don't need any other modules.
Also, I'm not sure why you would need to encode a string. You're definitely making it more difficult for yourself than it should be.
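If the goal really is the shell-style This\ is\ a\ path form, the standard library's shlex.quote is worth knowing about, although it wraps the string in quotes rather than inserting backslashes; a small sketch:
import shlex

path = "This is a path"

# shlex.quote makes the string safe to paste into a POSIX shell
print(shlex.quote(path))         # 'This is a path'

# For literal backslash-escaped spaces, str.replace is enough
print(path.replace(" ", "\\ "))  # This\ is\ a\ path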

Related

Should I use raw string by default?

I know raw string, like r'hello world', prevents escaping.
Is it a good practice to always prepend the r symbol even if the string doesn't have any escaping sequences?
Say my exception needs a string-literal message, or I need to connect to a website whose URL is a string literal; neither contains a backslash. Are there any performance differences between raw strings and regular strings?
The r sigil means "backslashes in this string are literal backslashes". Putting this sigil on a string which doesn't contain any backslashes is harmless but sometimes mildly confusing to a human reader. A better approach is probably to only use this sigil when you actually need it.
Situations where the string may not contain backslashes at the moment, but where you might expect to add one in the future, such as in regular expressions and Windows file paths, would probably qualify as useful exceptions.
re.findall(r'hello', string) # what if we switch to r'hello\.'?
with open(r'file.txt') as handle: # what if we switch to r'sub\file.txt'?
It would be easy to forget to add the r when you add a backslash, so supplying it in advance has some merit here.
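A concrete illustration of why forgetting the r matters once a backslash appears: without it, \b is a backspace character rather than a word boundary, so the pattern silently stops matching:
import re

print(re.findall('\bword\b', 'a word here'))   # []
print(re.findall(r'\bword\b', 'a word here'))  # ['word']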
You can do that in Python, but I don't recommend it, because if the string later contains something like '\n', the raw prefix changes its meaning: it becomes a backslash followed by 'n' instead of a newline. Raw strings are most useful for regexes and for Windows paths.
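A quick way to see that difference in the REPL:
# In a raw string the escape is left alone: two characters, backslash and 'n'
print(len('\n'), repr('\n'))    # 1 '\n'
print(len(r'\n'), repr(r'\n'))  # 2 '\\n'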

How can I identify invisible characters in python strings?

SHORT VERSION
I am retrieving a database value which contains a short but full HTML structure. I want to strip away all of the HTML tags and end up with a single value. The HTML surrounding my relevant info is always the same; I just need to figure out what kind of line breaks, tabs or whitespace the string contains so that I can make a match and remove it.
Is there a place I can paste the String online, or another way I can check the actual content of the String, so that I'll be able to remove it?
LONG VERSION, and what I've already tried:
The string is retrieved from an HP Quality Center database. When printed to the console during automated test execution, the problematic part shows up as two whitespace characters; when pasted into Word, Eclipse or the QC script editor, it shows up as a line break.
I've tried to replace the whitespaces with \n, double whitespace and ¶. Nothing works.
I am translating this script from a working VBScript. The problematic invisible characters are defined as vbcrlf and VBCRLF there. For some reason they use lower case in the replace string before the relevant parameter value, and upper case in the string that comes after my relevant substring. They are defined as variables and are not inside the string itself: <html>"&vbcrlf&"<body>"&vbcrlf&"<div...
This website suggests that I should use \n https://answers.yahoo.com/question/index?qid=20070506205148AAmr92N, as they write:
vbCrLf = "\n" # Carriage return/linefeed combination
I am a little confused by the inconsistency of the upper/lower case use here though...
EDIT:
After googling "carriage return/linefeed combination", I learned that it can be written as \r\n here: Order of carriage return and new line feed.
But I spent an awful long time finding it, and it doesn't answer my question, of how I better can identify exactly what kind of invisible characters a string contains. I'll leave the question open.
To view the contents of a string (including its "hidden" values) you can always do:
print( [data] )
# or
print( repr(data) )
If you're in a system which you described in the comments you can also do:
with open('/var/log/debug.log', 'w') as fh:
    fh.write( str( [data] ) )
This will however just give you a general idea of what your data looks like, but if that solves your question or problem then that is great. If you need further assistance, edit your question or submit a new one :)
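Beyond repr(), the unicodedata module can put a name to every suspicious character, which makes CR/LF or no-break-space issues obvious at a glance; a small sketch with made-up sample data:
import unicodedata

data = "two\r\nlines\tand a\u00a0no-break space"

print(repr(data))  # escape sequences become visible

# Name every character that isn't plain printable ASCII
for ch in data:
    if not ch.isprintable() or ord(ch) > 126:
        print(hex(ord(ch)), unicodedata.name(ch, "<control>"))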

A simple regexp in python

My program is a simple calculator, so I need to parse the expression which the user types in order to make the input more user-friendly. I know I can do it with regular expressions, but I'm not familiar enough with them.
So I need transform a input like this:
import re
input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
re.some_stuff( ,input_user) # ????
in this:
"23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))"
just adding these single quotes inside the parentheses. How can I do that?
UPDATE:
To be clearer: I want to add a single quote after every occurrence of "MM(" and before the ")" which comes after it, and after every occurrence of "func(" and before the "," which comes after it.
This is the sort of thing where regexes can work, but they can potentially result in major problems unless you consider exactly what your input will be like. For example, can whatever is inside MM(...) contain parentheses of its own? Can the first expression in func( contain a comma? If the answers to both questions is no, then the following could work:
input_user2 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user2)
However, this will not work if the answer to either question is yes, and even without that could cause problems depending upon what sort of inputs you expect to receive. Essentially, the first re.sub here looks for MM( ('MM('), followed by any number (including 0) of characters that aren't a close-parenthesis ('([^)]*)') that are then stored as a group (caused by the extra parentheses), and then a close-parenthesis. It replaces that section with the string in the second argument, where \1 is replaced by the first and only group from the pattern. The second re.sub works similarly, looking for any number of characters that aren't a comma.
If the answer to either question is yes, then regexps aren't appropriate for the parsing, as your language would not be regular. The answer to this question, while discussing a different application, may give more insight into that matter.
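For reference, a complete run of the two substitutions, chaining the second onto the result of the first, reproduces the desired output:
import re

input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"

step1 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", step1)

print(output)
# 23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))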

Python: what kind of literal delimiter is "better" to use?

What is the best literal delimiter in Python and why? Single ' or double "? And most important, why?
I'm a beginner in Python and I'm trying to stick with just one. I know that in PHP, for example, " is preferred because PHP does not try to search for the 'string' variable. Is it the same in Python?
' because it's one keystroke less than ". Save your wrists!
They're otherwise identical (except that you have to escape whichever quote you choose, if it appears inside the string).
Consider these strings:
"Don't do that."
'I said, "okay".'
"""She said, "That won't work"."""
Which quote is "best"?
Semantically there is no difference in Python; use either. Python also provides the handy triple string delimiter """ or ''' which can simplify multi-line quotes. There is also the raw string literal (r"..." or r'...') to inhibit \ escapes. The Language Reference has all the details.
For string constants containing a single quote use the double quote as delimiter.
The other way around, if you need a double quote inside.
Quick, shiftless typing leads to single quote delimiters.
>>> "it's very simple"
>>> 'reference to the "book"'
Single and double quotes act identically in Python. Escapes (\n) always work, and there is no variable interpolation. (If you don't want escapes, you can use the r flag, as in r"\n".)
Since I'm coming from a Perl background, I have a habit of using single quotes for plain strings and double-quotes for formats used with the % operator. But there is really no difference.
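For what it's worth, a quick interactive check confirms the two quote styles are interchangeable and that only the r prefix changes escape handling:
print('spam' == "spam")       # True: same value, same type
print(len("\t"), len(r"\t"))  # 1 2: escapes work in both, r keeps the backslash
print('%s!' % "either way")   # either way!  (% formatting ignores the quote style)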
Other answers are about nested quoting. Another point of view I've come across, but I'm not sure I subscribe to, is to use single quotes (') for characters (which are strings, but ord/chr are quite picky) and double quotes for strings. This disambiguates between a string that is supposed to be one character and one that just happens to be one character.
Personally I find most touch typists aren't noticeably affected by the "load" of using the shift key. YMMV on that part. Going down the "it's faster to not use the shift" road is a slippery slope. It's also faster to use hyper-condensed variable/function/class/module names. Everyone just so loves the fast and short 8.3 DOS file names too. :) Pick what makes semantic sense to you, then optimize.
This is a rule I have heard about:
") If the string is for human consuption, that is interface text or output, use ""
') If the string is a specifier, like a dictionary key or an option, use ''
I think a well-enforced rule like that can make sense for a project, but it's nothing that I would personally care much about. I like the one above since I read it, but I always use "" (since I learned C first, way back?).
I don't think there is a single best string delimiter. I like to use different delimiters to indicate different kinds of string. Specifically, I like to use "..." to delimit strings that are used for interpolation or that are natural language messages, and '...' to delimit small symbol-like strings. This gives me a subtle extra clue to the expected use of the string literal.
I try to always use raw strings (r"...") for regular expressions because (1) I don't have to escape backslash characters and (2) my editor recognises this convention and does syntax highlighting inside the regex.
The stylistic issues of single- vs. double-quotes are covered in question 56011.

Parsing in Python: what's the most efficient way to suppress/normalize strings?

I'm parsing a source file, and I want to "suppress" strings. What I mean by this is transform every string like "bla bla bla +/*" to something like "string" that is deterministic and does not contain any characters that may confuse my parser, because I don't care about the value of the strings. One of the issues here is string formatting using e.g. "%s", please see my remark about this below.
Take for example the following pseudo code, that may be the contents of a file I'm parsing. Assume strings start with ", and escaping the " character is done by "":
print(i)
print("hello**")
print("hel"+"lo**")
print("h e l l o "+
"hello\n")
print("hell""o")
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")
Should be transformed to the following result:
print(i)
print("string")
print("string"+"string")
print("string"
"string")
print("string")
print(str(123)+"string")
print(uppercase("string")+"string")
Currently I treat it as a special case in the code (i.e. detect the beginning of a string, and "manually" run until its end, with several sub-special cases on the way). If there's a Python library function I can use, or a nice regex that may make my code more efficient, that would be great.
Few remarks:
I would like the "start-of-string" character to be a variable, e.g. ' vs ".
I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with this right now, but if there's any well established best practice I would like to know about it.
The thing bothering me the most about this "suppression" is the case of string formatting with the likes of '%s', that are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but if any of you have suggestions about how to deal with this that would be great. Please note I'm not interested in the specific type or formatting of the in-string tokens, it's enough for me to know that there are tokens inside the string (how many). Remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
I'm not quite sure about the escaping of the start-string character. What would you say are the common ways this is implemented in most programming languages? Is it enough to assume a doubled quote (e.g. "") or a backslash-quote pair (e.g. \") as the escape? Do I need to treat other cases (think of languages like Java, C/C++, PHP, C#)?
Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.
Option 2: For most languages you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):
import re
sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
                   lambda match: match.group(1) or '"string"', source_code)
The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\", \\\" etc. all work fine).
When you are building your regexp, make sure to match comments (so your substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention to matching backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).
Option 3: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.
Use a dedicated parser for each language — especially since people have already done that work for you. Most of the languages you mentioned have a grammar.
Nowhere do you mention that you take an approach using a lexer and parser. If in fact you do not, have a look at e.g. the tokenize module (which is probably what you want), or the 3rd party module PLY (Python Lex-Yacc). Your problem needs a systematic approach, and these tools (and others) provide it.
(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)
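A minimal sketch of the tokenize-based approach for Python source (the helper name suppress_strings is just for illustration); every string token is replaced with "string", and the point where tok.string still holds the original literal is where you could also count %s-style directives:
import io
import tokenize

def suppress_strings(source):
    """Replace every string literal in a piece of Python source with "string"."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            # tok.string is still the original literal here, so inspect it
            # (e.g. with a regex for %s directives) before discarding it
            tok = tok._replace(string='"string"')
        out.append(tok)
    return tokenize.untokenize(out)

source = 'print(uppercase("h e l l o") + "g o o d b y e")\n'
print(suppress_strings(source), end="")
# print(uppercase("string") + "string")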
