Python: regular expressions in control structures [duplicate] - python

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to check if text is “empty” (spaces, tabs, newlines) in Python?
I am trying to write a short function to process lines of text in a file. When it encounters a line with significant content (meaning more than just whitespace), it is to do something with that line. The control structure I wanted was
if '\S' in line: do something
or
if r'\S' in line: do something
(I tried the same combinations with double quotes also, and yes I had imported re.) The if statement above, in all the forms I tried, always returns False. In the end, I had to resort to the test
if re.search('\S', line) is not None: do something
This works, but it feels a little clumsy in relation to a simple if statement. My question, then, is why isn't the if statement working, and is there a way to do something as (seemingly) elegant and simple?
I have another question unrelated to control structures, but since my suspicion is that it is also related to a possibly illegal use of regular expressions, I'll ask it here. If I have a string
s = " \t\tsome text \t \n\n"
The code
s.strip('\s')
returns the same string complete with spaces, tabs, and newlines (r'\s' is no different). The code
s.strip()
returns "some text". This, even though strip called with no character string supposedly defaults to stripping whitespace characters, which to my mind is exactly what the expression '\s' is doing. Why is the one stripping whitespace and the other not?
Thanks for any clarification.

Python string functions are not aware of regular expressions, so if you want to use them you have to use the re module.
However, if you are only interested in finding out whether a string is entirely whitespace or not, you can use the str.isspace() method:
>>> 'hello'.isspace()
False
>>> ' \n\t '.isspace()
True

This is what you're looking for
if not line.isspace(): do something
Also, str.strip does not use regular expressions.

If you really just want to find out whether the line consists only of whitespace characters, a regex is a little overkill. You should go for the following instead:
if text.strip():
    pass  # do stuff
which is basically the same as:
if text.strip() != "":
    pass  # do stuff
Python evaluates every non-empty string to True. So if text consists only of whitespace-characters, text.strip() equals "" and therefore evaluates to False.
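The two tests suggested so far differ on one edge case, which a short sketch can show. The helper name has_content and the sample lines are made up for illustration; the point is only that ''.isspace() is False, so not line.isspace() treats a completely empty string as having content, while line.strip() does not.

```python
def has_content(line):
    # Hypothetical helper: True when the line holds more than whitespace.
    return bool(line.strip())

samples = ["text\n", "  \t\n", "\n", ""]  # made-up sample lines
for line in samples:
    # Columns: the line, the strip() test, the isspace() test
    print(repr(line), has_content(line), not line.isspace())
```

Lines read from a file normally keep their trailing newline, so the empty-string case mostly matters when lines have already been split or stripped.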

The expression '\S' in line does the same thing as any other string in line test; it tests whether the string on the left occurs inside the string on the right. It does not implicitly compile a regular expression and search for a match. This is a good thing. What if you were writing a program that manipulated regular expressions input by the user and you actually wanted to test whether some sub-expression like \S was in the input expression?
Likewise, read the documentation of str.strip. Does it say that it will treat its input as a regular expression and remove matching strings? No. If you want to do something with regular expressions, you have to actually tell Python that, not expect it to somehow guess that you meant a regular expression this time while other times you just meant a plain string. While you might think of searching for a regular expression as very similar to searching for a string, they are completely different operations as far as the language implementation is concerned. And most str methods wouldn't even make sense when applied to a regular expression.
Because the match objects returned by re.search are "truthy" in boolean context (like most class instances), and a failed search returns None, you can at least shorten your if statement by dropping the is not None test. The rest of the line is necessary to actually tell Python what you want. As for your str.strip case (or other cases where you want to do something similar to a string operation but with a regular expression), have a look at the functions in the re module; there are a number of convenience functions in there that can be helpful. Or else it should be pretty easy to implement a re_strip function yourself.
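To make the points above concrete, here is a minimal sketch. The function re_strip is a hypothetical helper written for this example (it is not part of the re module), and the sample strings are invented.

```python
import re

s = " \t\tsome text \t \n\n"

# 'in' is a plain substring test: it looks for a backslash followed by 'S'.
literal_hit = r"\S" in "a pattern such as \\S"   # True: those two characters occur
literal_miss = r"\S" in "hello"                  # False: no backslash in the text

# re.search returns a match object (truthy) or None (falsy),
# so the 'is not None' comparison can be dropped.
if re.search(r"\S", s):
    print("line has content")

# str.strip(chars) removes any of the given characters from both ends;
# here the set is {'\\', 's'}, so the whitespace is left alone.
unchanged = s.strip(r"\s") == s

def re_strip(text, pattern=r"\s+"):
    # Hypothetical helper: remove leading and trailing runs matching 'pattern'.
    return re.sub(r"\A(?:%s)|(?:%s)\Z" % (pattern, pattern), "", text)

print(repr(re_strip(s)))
```

For plain whitespace, s.strip() remains the simpler choice; a helper like re_strip only pays off when the thing to remove really needs a pattern.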

Related

Matching characters in two Python strings

I am trying to print the shared characters between 2 sets of strings in Python, I am doing this with the hopes of actually finding how to do this using nothing but python regular expressions (I don't know regex so this might be a good time to learn it).
So if first_word = "peepa" and second_word = "poopa" I want the return value to be: "pa"
since in both variables the characters that are shared are p and a. So far I am following the documentation on how to use the re module, but I can't seem to grasp the basic concepts of this.
Any ideas as to how would I solve this problem?
This sounds like a problem where you want to find the intersection of characters between the two strings. The quickest way would be to do this:
>>> set(first_word).intersection(second_word)
set(['a', 'p'])
I don't think regular expressions are the right fit for this problem.
Use sets. Casting a string to a set returns an iterable with unique letters. Then you can retrieve the intersection of the two sets.
match = set(first_word.lower()) & set(second_word.lower())
Using regular expressions
This problem is tailor made for sets. But, you ask for "how to do this using nothing but python regular expressions."
Here is a start:
>>> import re
>>> re.sub('[^peepa]', '', "poopa")
'ppa'
The above uses regular expressions to remove from "poopa" every letter that was not already in "peepa". (As you see it leaves duplicated letters which sets would not do.)
In more detail, re.sub does substitutions based on regular expressions. [peepa] is a regular expression that means any of the letters peepa. The regular expression [^peepa] means anything that is not in peepa. Anything matching this regular expression is replaced with the empty string "", that is, it is removed. What remains are only the common letters.
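As a sketch of how the regex approach above could be generalized: the function name shared_chars is invented for this example, re.escape is used so that characters such as ^, ] or - in the first word cannot break the character class, and dict.fromkeys drops the repeated letters while keeping first-seen order.

```python
import re

def shared_chars(first_word, second_word):
    # Hypothetical helper: characters of second_word that also occur in first_word.
    if not first_word:
        return ""  # an empty class "[^]" would not be a valid pattern
    not_in_first = "[^" + re.escape(first_word) + "]"
    kept = re.sub(not_in_first, "", second_word)   # "poopa" -> "ppa"
    return "".join(dict.fromkeys(kept))            # drop repeats, keep order

print(shared_chars("peepa", "poopa"))
```

The set-based answers remain shorter; this version is only worth it if the exercise is specifically about the re module.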

A simple regexp in python

My program is a simple calculator, so I need to parse the expression the user types in order to make the input more user-friendly. I know I can do it with regular expressions, but I'm not familiar enough with them.
So I need to transform an input like this:
import re
input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
re.some_stuff( ,input_user) # ????
in this:
"23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))"
just adding these single quotes inside the parentheses. How can I do that?
UPDATE:
To be clearer, I want to add a single quote after every occurrence of "MM(" and before the ")" that follows it, and after every occurrence of "func(" and before the "," that follows it.
This is the sort of thing where regexes can work, but they can potentially result in major problems unless you consider exactly what your input will be like. For example, can whatever is inside MM(...) contain parentheses of its own? Can the first expression in func( contain a comma? If the answer to both questions is no, then the following could work:
input_user2 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user2)
However, this will not work if the answer to either question is yes, and even without that it could cause problems depending upon what sort of inputs you expect to receive. Essentially, the first re.sub here looks for MM( (the pattern 'MM\('), followed by any number (including 0) of characters that aren't a close-parenthesis ('([^\)]*)'), which are stored as a group (because of the extra parentheses), and then a close-parenthesis. It replaces that section with the string in the second argument, where \1 is replaced by the first and only group from the pattern. The second re.sub works similarly, looking for any number of characters that aren't a comma.
If the answer to either question is yes, then regexps aren't appropriate for the parsing, as your language would not be regular. The answer to this question, while discussing a different application, may give more insight into that matter.
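The two substitutions can be chained and tried on the input from the question. This is only a sketch under the assumptions stated above (no parentheses inside MM(...), no comma in the first argument of func); the function name quote_arguments and the second test string are made up for illustration.

```python
import re

def quote_arguments(expression):
    # Hypothetical helper: quote the argument of MM(...) and the first argument of func(...).
    quoted = re.sub(r"MM\(([^)]*)\)", r"MM('\1')", expression)
    # The second substitution runs on the result of the first one.
    return re.sub(r"func\(([^,]*),", r"func('\1',", quoted)

input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
print(quote_arguments(input_user))

# Nested parentheses break the assumption: the closing quote lands too early.
print(quote_arguments("MM(Ca(OH)2)"))
```

The second call illustrates why nested input needs a real parser rather than a regular expression.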

What does the "s!" operator in Perl do?

I have this Perl snippet from a script that I am translating into Python. I have no idea what the "s!" operator is doing; some sort of regex substitution. Unfortunately searching Google or Stackoverflow for operators like that doesn't yield many helpful results.
$var =~ s!<foo>.+?</foo>!!;
$var =~ s!;!/!g;
What is each line doing? I'd like to know in case I run into this operator again.
And, what would equivalent statements in Python be?
s!foo!bar! is the same as the more common s/foo/bar/, except that foo and bar can contain unescaped slashes without causing problems. It replaces the first occurrence of the regex foo with bar. The version with g replaces all occurrences.
It's doing exactly the same as $var =~ s///. i.e. performing a search and replace within the $var variable.
In Perl you can define the delimiting character following the s. Why? So that, for example, if you're matching '/', you can specify another delimiting character ('!' in this case) and not have to escape the character you're matching. Otherwise you'd end up with (say)
s/;/\//g;
which is a little more confusing.
The perlre documentation has more info on this.
Perl lets you choose the delimiter for many of its constructs. This makes it easier to see what is going on in expressions like
$str =~ s{/foo/bar/baz/}{/quux/};
As you can see though, not all delimiters have the same effects. Bracketing characters (<>, [], {}, and ()) use different characters for the beginning and ending. And ?, when used as a delimiter to a regex, causes the regexes to match only once between calls to the reset() operator.
You may find it helpful to read perldoc perlop (in particular the sections on m/PATTERN/msixpogc, ?PATTERN?, and s/PATTERN/REPLACEMENT/msixpogce).
s! is syntactic sugar for the 'proper' s/// operator. Basically, you can substitute whatever delimiter you want instead of the '/'s.
As to what each line is doing, the first line matches the first occurrence of the regex <foo>.+?</foo> and replaces the whole lot with nothing. The second matches every occurrence of the regex ; and replaces it with /.
s/// is the substitute operator. It takes a regular expression and a substitution string.
s/regex/replace string/;
It supports most (all?) of the normal regular expression switches, which are used in the normal way (by appending them to the end of the operator).
s is the substitution operator. Usually it is in the form of s/foo/bar/, but you can replace the / separator characters with some other character like !. Using other separator characters may make working with things like paths a lot easier since you don't need to escape path separators.
See manual page for further info.
You can find similar functionality for Python in the re module.
s is the substitution operator. Normally this uses '/' for the delimiter:
s/foo/bar/
, but this is not required: a number of other characters can be used as delimiters instead. In this case, '!' has been used as the delimiter, presumably to avoid the need to escape the '/' characters in the actual text to be substituted.
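In your specific case, the first line removes the first stretch of text matching '<foo>.+?</foo>'; i.e. it removes the first 'foo' element together with its content (which must be non-empty, since .+? needs at least one character).
The second line replaces all ';' characters with '/' characters, globally (all occurrences).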
The python equivalent code uses the re module:
f=re.sub(searchregx,replacement_str,line)
And the python equivalent is to use the re module.
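A possible translation of the two Perl lines from the question, as a sketch; the sample value of var is made up. Without the g flag Perl replaces only the first match, which corresponds to count=1 in re.sub, while re.sub replaces every match by default.

```python
import re

var = "a;<foo>x</foo>;b<foo>y</foo>"  # made-up sample input

# $var =~ s!<foo>.+?</foo>!!;   (no g flag: only the first match is removed)
var = re.sub(r"<foo>.+?</foo>", "", var, count=1)

# $var =~ s!;!/!g;              (g flag: every ';' becomes '/')
var = re.sub(r";", "/", var)

print(var)
```

For a fixed string such as ';' the second step could equally be written as var.replace(';', '/'), which avoids the regex engine altogether.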

Python: what kind of literal delimiter is "better" to use?

What is the best literal delimiter in Python and why? Single ' or double "? And most important, why?
I'm a beginner in Python and I'm trying to stick with just one. I know that in PHP, for example " is preferred, because PHP does not try to search for the 'string' variable. Is the same case in Python?
' because it's one keystroke less than ". Save your wrists!
They're otherwise identical (except you have to escape whichever you choose to use, if they appear inside the string).
Consider these strings:
"Don't do that."
'I said, "okay".'
"""She said, "That won't work"."""
Which quote is "best"?
Semantically there is no difference in Python; use either. Python also provides the handy triple string delimiter """ or ''' which can simplify multi-line quotes. There is also the raw string literal (r"..." or r'...') to inhibit \ escapes. The Language Reference has all the details.
For string constants containing a single quote use the double quote as delimiter.
The other way around, if you need a double quote inside.
Quick, shiftless typing leads to single quote delimiters.
>>> "it's very simple"
>>> 'reference to the "book"'
Single and double quotes act identically in Python. Escapes (\n) always work, and there is no variable interpolation. (If you don't want escapes, you can use the r flag, as in r"\n".)
Since I'm coming from a Perl background, I have a habit of using single quotes for plain strings and double-quotes for formats used with the % operator. But there is really no difference.
Other answers are about nested quoting. Another point of view I've come across, but I'm not sure I subscribe to, is to use single quotes (') for characters (which are strings, but ord/chr are quite picky) and to use double quotes for strings. This disambiguates between a string that is supposed to be one character and one that just happens to be one character.
Personally I find most touch typists aren't affected noticeably by the "load" of using the shift key. YMMV on that part. Going down the "it's faster to not use the shift" route is a slippery slope. It's also faster to use hyper-condensed variable/function/class/module names. Everyone just so loves the fast and short 8.3 DOS file names too. :) Pick what makes semantic sense to you, then optimize.
This is a rule I have heard about:
") If the string is for human consumption, that is interface text or output, use ""
') If the string is a specifier, like a dictionary key or an option, use ''
I think a well-enforced rule like that can make sense for a project, but it's nothing that I would personally care much about. I have liked the above rule since I read it, but I always use "" (since I learned C first, way back?).
I don't think there is a single best string delimiter. I like to use different delimiters to indicate different kinds of string. Specifically, I like to use "..." to delimit strings that are used for interpolation or that are natural language messages, and '...' to delimit small symbol-like strings. This gives me a subtle extra clue to the expected use for the string literal.
I try to always use raw strings (r"...") for regular expressions because (1) I don't have to escape backslash characters and (2) my editor recognises this convention and does syntax highlighting inside the regex.
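A small sketch of the points made in these answers: the two quote characters produce identical strings, escapes are processed in both, and only the r prefix changes how backslashes are read. The variable names and the sample pattern are chosen purely for illustration.

```python
import re

same = 'abc' == "abc"         # both literals build the same str object value

escaped = "\n"                # one character: a newline
raw = r"\n"                   # two characters: a backslash and the letter n

# A raw string keeps regex escapes readable; "\\d+\\.\\d+" would be the non-raw spelling.
number = re.compile(r"\d+\.\d+")

print(same, len(escaped), len(raw), number.findall("23.40*1200"))
```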
The stylistic issues of single- vs. double-quotes are covered in question 56011.

Parsing in Python: what's the most efficient way to suppress/normalize strings?

I'm parsing a source file, and I want to "suppress" strings. What I mean by this is transform every string like "bla bla bla +/*" to something like "string" that is deterministic and does not contain any characters that may confuse my parser, because I don't care about the value of the strings. One of the issues here is string formatting using e.g. "%s", please see my remark about this below.
Take for example the following pseudo code, that may be the contents of a file I'm parsing. Assume strings start with ", and escaping the " character is done by "":
print(i)
print("hello**")
print("hel"+"lo**")
print("h e l l o "+
"hello\n")
print("hell""o")
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")
Should be transformed to the following result:
print(i)
print("string")
print("string"+"string")
print("string"
"string")
print("string")
print(str(123)+"string")
print(uppercase("string")+"string")
Currently I treat it as a special case in the code (i.e. detect the beginning of a string, and "manually" run until its end with several sub-special cases on the way). If there's a Python library function I can use or a nice regex that may make my code more efficient, that would be great.
A few remarks:
I would like the "start-of-string" character to be a variable, e.g. ' vs ".
I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with this right now, but if there's any well established best practice I would like to know about it.
The thing bothering me the most about this "suppression" is the case of string formatting with the likes of '%s', that are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but if any of you have suggestions about how to deal with this that would be great. Please note I'm not interested in the specific type or formatting of the in-string tokens, it's enough for me to know that there are tokens inside the string (how many). Remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
I'm not quite sure about the escaping of the start-string character. What would you say are the common ways this is implemented in most programming languages? Is the assumption of double-occurrence (e.g. "") or any set of two characters (e.g. '\"') to escape enough? Do I need to treat other cases (think of languages like Java, C/C++, PHP, C#)?
Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.
Option 3: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.
Option 2: For most of the languages, you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):
import re
sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
lambda match: match.group(1) or '"string"', source_code)
The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\, \\", \\\" etc. all work fine).
When you are building your regexp, make sure to match comments (so your regexp substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention you match backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).
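The substitution from Option 2 can be wrapped in a function and tried on a small sample, as a sketch. The names STRING_OR_COMMENT, sanitize and sample are made up for this example; the pattern itself is the one given above, so the same limitation applies (no """ or ''' in the input).

```python
import re

# Group 1 captures a comment; the other two alternatives match '...' and "..." strings.
STRING_OR_COMMENT = re.compile(
    r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"'
)

def sanitize(source_code):
    # Hypothetical helper: keep comments untouched, replace every string literal.
    return STRING_OR_COMMENT.sub(
        lambda match: match.group(1) or '"string"', source_code
    )

# Made-up sample: a '#' inside a string, a quote inside a comment, an escaped quote.
sample = 'x = "a # b"  # note "q"\ny = \'it\\\'s\'\n'
print(sanitize(sample))
```

Because the comment alternative is tried at every position, a # inside a string is only protected when the string starts earlier on the line than the #, which is the normal left-to-right scanning order.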
Use a dedicated parser for each language — especially since people have already done that work for you. Most of the languages you mentioned have a grammar.
Nowhere do you mention that you take an approach using a lexer and parser. If in fact you do not, have a look at e.g. the tokenize module (which is probably what you want), or the 3rd party module PLY (Python Lex-Yacc). Your problem needs a systematic approach, and these tools (and others) provide it.
(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)
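For Python source specifically, the tokenize suggestion could look roughly like the sketch below. The function name suppress_strings is invented here; the code uses the positions reported for each STRING token to splice in a placeholder, working from the end of the file backwards so earlier positions stay valid. One assumption to be aware of: from Python 3.12 on, f-strings are reported as separate FSTRING_* tokens rather than STRING, so this sketch leaves them untouched.

```python
import io
import tokenize

def suppress_strings(source_code, placeholder='"string"'):
    # Hypothetical helper: replace every Python string literal with a placeholder.
    lines = source_code.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
    strings = [tok for tok in tokens if tok.type == tokenize.STRING]
    for tok in reversed(strings):  # last literal first, so offsets remain correct
        (start_row, start_col), (end_row, end_col) = tok.start, tok.end
        head = lines[start_row - 1][:start_col]
        tail = lines[end_row - 1][end_col:]
        # A triple-quoted literal may span several lines; collapse them into one.
        lines[start_row - 1:end_row] = [head + placeholder + tail]
    return "".join(lines)

source = 'print("hello**")\nprint("hel" + "lo**")\nprint(i)\n'
print(suppress_strings(source))
```

Counting formatting directives such as %s could then be done on tok.string before it is replaced, for example with a small regular expression as the note above suggests.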
