Could anyone tell me what the [Cc] in this code is called? I know what it does but I have no idea what it is called.
#!/usr/bin/perl
$sentence = "Big cat sat.";
$sentence =~ /[Cc]at/;
print "$`, $&, $'\n"; #prints Big, cat, sat.
Also does anyone know what is the perl equivalent of python 2.7's re.search? All I keep finding is something about python's replace being mutable and does not really say anything about search.
Bracketed groups of characters are called character classes or character sets.
Regular expressions have a simple formal definition with just a few operations. One of these operations is alternation. Alternations allow you to match against the union of two sets of strings. Character sets are syntax for an alternation over a group of single character strings. More commonly when we talk about alternations in regular expressions we are referring to the use of the vertical bar | which matches the union of the expressions on either side of the bar.
I don't really understand the close votes, but you've made the mistake of asking more than one question!
It's hard to know what's tripping you up, but this may help
The pattern /[Cc]at/ as a whole is a regular expression, regexp or regex, while the particular component [Cc] is called a character class, which matches any of a set of characters; in this case an upper or lower-case C character. It's documented in the Python documentation for Regular Expression Syntax, which calls it just a "set of characters", and speaks about things like \d (numeric digits) and \w ("word" characters) as character classes. In Perl, the square-bracket construct is also a character class
The documentation for re.search on the same page is fairly simple, and you seem to have used its Perl equivalent in your code so I don't understand the problem you're having
In Python,
object = re.search(pattern, string)
checks for the occurrence of pattern anywhere in string and sets object to a match object if one is found, or None otherwise
This is the same in Perl as using the binding operator =~ like this
my $result = $string =~ /pattern/
which sets $result to a true value if a match was found, or false otherwise
Take a look at the Python documentation for search() vs. match()
re.match is identical to re.search, except that the match must occur at the very start of the string
Related
My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.
Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
This question's answers are a community effort. Edit existing answers to improve this post. It is not currently accepting new answers or interactions.
I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?
The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.
If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn how to write and debug your own patterns but also how to understand patterns written by others.
Start simple
Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.
Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.
If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)
Order from the menu
Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.
The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].
Think of character classes as menus: pick just one.
Helpful shortcuts
Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).
The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
Once is not enough
From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are
* (zero or more times)
+ (one or more times)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)
Putting some of these blocks together, the pattern [Nn]*ick matches all of
ick
Nick
nick
Nnick
nNick
nnick
(and so on)
The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
A few other useful examples:
[0-9]+ (and its equivalent \d+) matches any non-negative integer
\d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
Grouping
A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.
To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
Alternation
Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).
For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
Escaping
Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
Greediness
Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.
For example, say the input is
"Hello," she said, "How are you?"
You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.
To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question works. It matches the sequence of a literal left-parenthesis, followed by one or more characters, and terminated by a right-parenthesis.
If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)
Anchors
Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.
Say you want to match comments of the form
-- This is a comment --
you'd write ^--\s+(.+)\s+--$.
Build your own
Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.
Tools for writing and debugging regexes:
RegExr (for JavaScript)
Perl: YAPE: Regex Explain
Regex Coach (engine backed by CL-PPCRE)
RegexPal (for JavaScript)
Regular Expressions Online Tester
Regex Buddy
Regex 101 (for PCRE, JavaScript, Python, Golang, Java 8)
I Hate Regex
Visual RegExp
Expresso (for .NET)
Rubular (for Ruby)
Regular Expression Library (Predefined Regexes for common scenarios)
Txt2RE
Regex Tester (for JavaScript)
Regex Storm (for .NET)
Debuggex (visual regex tester and helper)
Books
Mastering Regular Expressions, the 2nd Edition, and the 3rd edition.
Regular Expressions Cheat Sheet
Regex Cookbook
Teach Yourself Regular Expressions
Free resources
RegexOne - Learn with simple, interactive exercises.
Regular Expressions - Everything you should know (PDF Series)
Regex Syntax Summary
How Regexes Work
JavaScript Regular Expressions
Footnote
†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to check if text is “empty” (spaces, tabs, newlines) in Python?
I am trying to write a short function to process lines of text in a file. When it encounters a line with significant content (meaning more than just whitespace), it is to do something with that line. The control structure I wanted was
if '\S' in line: do something
or
if r'\S' in line: do something
(I tried the same combinations with double quotes also, and yes I had imported re.) The if statement above, in all the forms I tried, always returns False. In the end, I had to resort to the test
if re.search('\S', line) is not None: do something
This works, but it feels a little clumsy in relation to a simple if statement. My question, then, is why isn't the if statement working, and is there a way to do something as (seemingly) elegant and simple?
I have another question unrelated to control structures, but since my suspicion is that it is also related to a possibly illegal use of regular expressions, I'll ask it here. If I have a string
s = " \t\tsome text \t \n\n"
The code
s.strip('\s')
returns the same string complete with spaces, tabs, and newlines (r'\s' is no different). The code
s.strip()
returns "some text". This, even though strip called with no character string supposedly defaults to stripping whitespace characters, which to my mind is exactly what the expression '\s' is doing. Why is the one stripping whitespace and the other not?
Thanks for any clarification.
Python string functions are not aware of regular expressions, so if you want to use them you have to use the re module.
However if you are only interested in finding out of a string is entirely whitespace or not, you can use the str.isspace() function:
>>> 'hello'.isspace()
False
>>> ' \n\t '.isspace()
True
This is what you're looking for
if not line.isspace(): do something
Also, str.strip does not use regular expressions.
If you are really just want to find out if the line only consists of whitespace characters regex is a little overkill. You should got for the following instead:
if text.strip():
#do stuff
which is basically the same as:
if not text.strip() == "":
#do stuff
Python evaluates every non-empty string to True. So if text consists only of whitespace-characters, text.strip() equals "" and therefore evaluates to False.
The expression '\S' in line does the same thing as any other string in line test; it tests whether the string on the left occurs inside the string on the right. It does not implicitly compile a regular expression and search for a match. This is a good thing. What if you were writing a program that manipulated regular expressions input by the user and you actually wanted to test whether some sub-expression like \S was in the input expression?
Likewise, read the documentation of str.strip. Does it say that will treat it's input as a regular expression and remove matching strings? No. If you want to do something with regular expressions, you have to actually tell Python that, not expect it to somehow guess that you meant a regular expression this time while other times it just meant a plain string. While you might think of searching for a regular expression as very similar to searching for a string, they are completely different operations as far as the language implementation is concerned. And most str methods wouldn't even make sense when applied to a regular expression.
Because re.match objects are "truthy" in boolean context (like most class instances), you can at least shorten your if statement by dropping the is not None test. The rest of the line is necessary to actually tell Python what you want. As for your str.strip case (or other cases where you want to do something similar to a string operation but with a regular expression), have a look at the functions in the re module; there are a number of convenience functions on there that can be helpful. Or else it should be pretty easy to implement a re_split function yourself.
Perhaps a silly question, but though google returned lots of similar cases, I could not find this exact situation: what regular expression will match all string NOT containing a particular string. For example, I want to match any string that does not contain 'foo_'.
Now,
re.match('(?<!foo_).*', 'foo_bar')
returns a match. While
re.match('(?<!foo_)bar', 'foo_bar')
does not.
I tried the non-greedy version:
re.match('(?<!foo_).*?', 'foo_bar')
still returns a match.
If I add more characters after the ),
re.search('(?<!foo_)b.*', 'foo_bar')
it returns None, but if the target string has more trailing chars:
re.search('(?<!foo_)b.*', 'foo_barbaric')
it returns a match.
I intentionally kept out the initial .* or .*? in the re. But same thing happens with that.
Any ideas why this strange behaviour? (I need this as a single regular expression - to be entered as a user input).
You're using lookbehind assertions where you need lookahead assertions:
re.match(r"(?!.*foo_).*", "foo_bar")
would work (i. e. not match).
(?!.*foo_) means "Assert that it is impossible to match .*foo_ from the current position in the string. Since you're using re.match(), that position is automatically defined as the start of the string.
Try this pattern instead:
^(?!.*foo_).*
This uses the ^ metacharacter to match from the beginning of the string, and then uses a negative look-ahead that checks for "foo_". If it exists, the match will fail.
Since you gave examples using both re.match() and re.search(), the above pattern would work with both approaches. However, when you're using re.match() you can safely omit the usage of the ^ metacharacter since it will match at the beginning of the string, unlike re.search() which matches anywhere in the string.
I feel like there is a good chance that you could just design around this with a conditional statement.
(It would be nice if we knew specifically what you're trying to accomplish).
Why not:
if not re.match("foo", something):
do_something
else:
print "SKipping this"
I have this Perl snippet from a script that I am translating into Python. I have no idea what the "s!" operator is doing; some sort of regex substitution. Unfortunately searching Google or Stackoverflow for operators like that doesn't yield many helpful results.
$var =~ s!<foo>.+?</foo>!!;
$var =~ s!;!/!g;
What is each line doing? I'd like to know in case I run into this operator again.
And, what would equivalent statements in Python be?
s!foo!bar! is the same as the more common s/foo/bar/, except that foo and bar can contain unescaped slashes without causing problems. What it does is, it replaces the first occurence of the regex foo with bar. The version with g replaces all occurences.
It's doing exactly the same as $var =~ s///. i.e. performing a search and replace within the $var variable.
In Perl you can define the delimiting character following the s. Why ? So, for example, if you're matching '/', you can specify another delimiting character ('!' in this case) and not have to escape or backtick the character you're matching. Otherwise you'd end up with (say)
s/;/\//g;
which is a little more confusing.
Perlre has more info on this.
Perl lets you choose the delimiter for many of its constructs. This makes it easier to see what is going on in expressions like
$str =~ s{/foo/bar/baz/}{/quux/};
As you can see though, not all delimiters have the same effects. Bracketing characters (<>, [], {}, and ()) use different characters for the beginning and ending. And ?, when used as a delimiter to a regex, causes the regexes to match only once between calls to the reset() operator.
You may find it helpful to read perldoc perlop (in particular the sections on m/PATTERN/msixpogc, ?PATTERN?, and s/PATTERN/REPLACEMENT/msixpogce).
s! is syntactic sugar for the 'proper' s/// operator. Basically, you can substitute whatever delimiter you want instead of the '/'s.
As to what each line is doing, the first line is matching occurances of the regex <foo>.+?</foo> and replacing the whole lot with nothing. The second is matching the regex ; and replacing it with /.
s/// is the substitute operator. It takes a regular expression and a substitution string.
s/regex/replace string/;
It supports most (all?) of the normal regular expression switches, which are used in the normal way (by appending them to the end of the operator).
s is the substitution operator. Usually it is in the form of s/foo/bar/, but you can replace // separator characters some other characters like !. Using other separator charaters may make working with things like paths a lot easier since you don't need to escape path separators.
See manual page for further info.
You can find similar functionality for python in re-module.
s is the substitution operator. Normally this uses '/' for the delimiter:
s/foo/bar/
, but this is not required: a number of other characters can be used as delimiters instead. In this case, '!' has been used as the delimiter, presumably to avoid the need to escape the '/' characters in the actual text to be substituted.
In your specific case, the first line removes text matching '.+?'; i.e. it removes 'foo' tags with or without content.
The second line replaces all ';' characters with '/' characters, globally (all occurences).
The python equivalent code uses the re module:
f=re.sub(searchregx,replacement_str,line)
And the python equivalent is to use the re module.