Regex from python list - python

Today, I found out that regex r"['a', 'b']" matches 'a, b'.
Why is that? What do the comma and ' mean inside []?
Thank you.

[] is used to define character sets in regular expressions. The expression will match if the string contains any of the characters in that set.
Your regular expression:
r"['a', 'b']"
says "match if the string contains ', a, a comma, a space, or b". As @Patrick Haugh mentions in his comment, your expression is equivalent to [', ab] (the set also contains the space that follows the comma). Repeating the same character in the set does nothing.
http://www.regexpal.com/ is a great site for testing your regular expressions. It can help break it down for you and explain what your expression does and why it matches on certain strings.
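A quick way to confirm this is to compare the two patterns directly. This is only a sketch, written for Python 3 (it should also run on 2.7); the names list_like and explicit are illustrative and do not come from the question.

```python
import re

# The pattern from the question and the equivalent explicit character set.
list_like = re.compile(r"['a', 'b']")
explicit = re.compile(r"[', ab]")

sample = "a, b"

# findall returns every single character of the sample that is in the set.
list_like_hits = list_like.findall(sample)
explicit_hits = explicit.findall(sample)

print(list_like_hits)                   # ['a', ',', ' ', 'b']
print(explicit_hits == list_like_hits)  # True
```

Each hit is one character long, which shows that the brackets describe a set of characters, not a list of strings.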

Related

Python 2.7 regular expression match issue

Suppose I am using the following regular expression. Logically, it means: match anything that starts with the prefix foo: and continues with anything that is not a space. The match group is the part after the prefix foo:.
My question is: what exactly does "anything" mean in Python 2.7? Any ASCII character, or something else? If anyone could point me to some documentation, that would be great. Thanks.
a = re.compile('foo:([^ ]+)')
thanks in advance,
Lin
Try:
a = re.compile(r'foo:(\S+)')
\S means anything but whitespace.
I recommend you check out http://pythex.org.
It's really good for testing out regular expressions and has a decent cheat-sheet.
UPDATE:
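The dot (.) matches any character except a newline, unless the re.DOTALL flag is set. On a Python 2.7 unicode string that covers every Unicode code point; on a plain str it matches individual bytes.
The regular expression metacharacter which matches any character (other than a newline, by default) is . (dot).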
a = re.compile('foo:(.+)')
The character class [^ ] matches any one character which isn't one of the characters between the square brackets (a literal space, in this example). The quantifier + specifies one or more repetitions of the preceding expression.
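The difference between the three patterns suggested in this thread shows up as soon as the input contains whitespace other than a plain space. The sketch below uses a made-up sample string (text) and assumes Python 3, although it should behave the same on 2.7.

```python
import re

# Hypothetical input: a tab after "bar" and a space after "baz".
text = "foo:bar\tbaz qux"

# [^ ] stops only at a literal space, so the tab is included.
not_space = re.search(r"foo:([^ ]+)", text).group(1)
# \S stops at any whitespace, including the tab.
not_whitespace = re.search(r"foo:(\S+)", text).group(1)
# . matches everything up to the end of the line.
any_char = re.search(r"foo:(.+)", text).group(1)

print(repr(not_space))       # 'bar\tbaz'
print(repr(not_whitespace))  # 'bar'
print(repr(any_char))        # 'bar\tbaz qux'
```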

search for string embedded in {} after keyword

How can I get the string embedded in {} after a keyword, where the number of characters between the keyword and the braces {} is unknown. e.g.:
includegraphics[x=2]{image.pdf}
the keyword would be includegraphics and the string to be found is image.pdf, but the text in between [x=2] could have anything between the two [].
So I want to ignore all characters between the keyword and { or I want to ignore everything between []
Use re.findall
>>> sample = 'includegraphics[x=2]{image.pdf}'
>>> re.findall('includegraphics.*?{(.*?)}',sample)
['image.pdf']
Explanation:
The re module deals with regular expressions in Python. Its findall function is useful for finding all occurrences of a pattern in a string.
A regular expression for the pattern you are interested in is 'includegraphics.*?{(.*?)}'. Here . symbolizes "any character", while the * means 0 or more times. The question mark makes this a non-greedy operation. From the documentation:
The *, +, and ? qualifiers are all greedy; they match as much
text as possible. Sometimes this behaviour isn’t desired; if the RE
<.*> is matched against <H1>title</H1>, it will match the entire
string, and not just <H1>. Adding ? after the qualifier makes it
perform the match in non-greedy or minimal fashion; as few characters
as possible will be matched. Using .*? in the previous expression will
match only <H1>.
Please note that while using .*? should be fine in your case, in general it's better to use more specific character classes, such as \w for word characters (letters, digits and underscore) and \d for digits, when you know in advance what the content is going to consist of.
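The greedy and non-greedy behaviour described in the quoted documentation can be checked with the same <H1> example. This is a small sketch for Python 3; the third pattern, a negated character class, is an extra illustration that is not part of the answer above.

```python
import re

html = "<H1>title</H1>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.*>", html).group()
# Non-greedy: .*? grabs as little as possible, so the match ends at the first '>'.
lazy = re.match(r"<.*?>", html).group()
# A negated class gives the same result here without a lazy quantifier.
negated = re.match(r"<[^>]*>", html).group()

print(greedy)   # <H1>title</H1>
print(lazy)     # <H1>
print(negated)  # <H1>
```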
Use re.search
re.search(r'includegraphics\[[^\[\]]*\]\{([^}]*)\}', s).group(1)
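Note that calling .group(1) directly raises AttributeError when nothing matches, because re.search then returns None. One possible way around that is a small helper like the sketch below. The names GRAPHICS_RE and graphics_file are hypothetical, and making the [...] part optional is an assumption on my side, not something the question asked for.

```python
import re

# Hypothetical helper pattern: the bracketed options are optional here.
GRAPHICS_RE = re.compile(r"includegraphics(?:\[[^\[\]]*\])?\{([^}]*)\}")


def graphics_file(line):
    """Return the text inside {} after includegraphics, or None if absent."""
    match = GRAPHICS_RE.search(line)
    return match.group(1) if match else None


print(graphics_file("includegraphics[x=2]{image.pdf}"))  # image.pdf
print(graphics_file("includegraphics{plot.png}"))        # plot.png
print(graphics_file("no graphics here"))                 # None
```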

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub('[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub('([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.
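Beyond \1, re.sub also accepts the \g<...> form of a backreference and a function as the replacement. The sketch below shows both for the zebra432 example; it assumes Python 3, and the variable names are only illustrative.

```python
import re

source = "zebra432"

# \g<1> is an unambiguous form of \1, useful when a digit follows the reference
# (r"lion\10" would be read as a reference to group 10).
with_suffix = re.sub(r"[a-z]*(\d+)", r"lion\g<1>0", source)

# Named groups can be referenced as \g<name>.
named = re.sub(r"[a-z]*(?P<num>\d+)", r"lion\g<num>", source)

# A callable receives the match object and returns the replacement text,
# so the replacement can be computed from the matched groups.
incremented = re.sub(
    r"[a-z]*(\d+)",
    lambda m: "lion" + str(int(m.group(1)) + 1),
    source,
)

print(with_suffix)  # lion4320
print(named)        # lion432
print(incremented)  # lion433
```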

Regular Expression returns nothing

I have written the following regular expression to return everything except letters of the alphabet. However, this regular expression returns nothing. What would be the right regular expression for this case?
Regex:
r'[^[a-z]+]'
Regards
You are misusing the character class syntax []. Here is the correct version (lowercase only):
r'[^a-z]+'
If you want to match from the start to the end of the string and exclude uppercase letters as well:
r'^[^a-zA-Z]+$'
And here is how you can use it:
print re.findall(r'([^a-zA-Z]+)', input_string)
() captures the group so that it is returned once the matching is done.
This is how the regex engine sees your regex:
[^[a-z]+ # Not any of these characters '[', nor a-z
] # literal ']'
So, as @Sajuj says, you just need to remove the outer square brackets: [^a-z]+
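The breakdown above can be verified by running both patterns on a sample. This is a sketch for Python 3 with made-up inputs; the warnings filter is there because recent Python versions emit a FutureWarning for the unescaped [ inside the set.

```python
import re
import warnings

text = "abc 123 def"

with warnings.catch_warnings():
    # Recent Python versions warn about the unescaped '[' inside the set.
    warnings.simplefilter("ignore", FutureWarning)
    # The original pattern needs a literal ']' after the run of characters,
    # so it finds nothing in ordinary text...
    broken = re.findall(r"[^[a-z]+]", text)
    # ...and only matches when a ']' actually follows.
    broken_with_bracket = re.findall(r"[^[a-z]+]", "abc123]def")

# The corrected pattern returns each run of non-lowercase characters.
fixed = re.findall(r"[^a-z]+", text)

print(broken)               # []
print(broken_with_bracket)  # ['123]']
print(fixed)                # [' 123 ']
```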

python regular expressions to return complete result

How can I get what was matched from a python regular expression?
re.match("^\\\w*", "/welcome")
All Python returns is a match object, but I want the matched text itself; how do I do that?
Just use the re.findall function.
>>> re.findall("a+", 'sdaaddaa')
['aa', 'aa']
You could use a group.
res = re.search("^(\\\w*)", "/welcome")
if res:
    res.group(1)
Calling the group() method of the returned match object without any arguments will return the matched portion of the string.
The regular expression "^\\\w*" will match a string beginning with a backslash followed by 0 or more w characters. The string you are searching begins with a forward slash so your regex won't match. That's why you aren't getting anything back.
Note that if you printed your regex string, it would contain \\w. The \\ means "match a single backslash" and the w means "match a literal w". If you want a backslash followed by word characters, you need to escape the first backslash, and the easiest way is to use a raw string: r"^\\\w*" would match "\\welcome" but still would not match "/welcome".
Notice that your "^" says the match has to start at the beginning of the string (or of a line, in multiline mode). RegexBuddy doesn't tell you that by default.
Maybe you could tell us what exactly you are trying to find?
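The cases discussed above can be tried side by side. The sketch below assumes Python 3 and writes the question's pattern as the raw string r"^\\w*", which is the same sequence of characters as the "^\\\w*" literal. The last pattern, with a forward slash, is only a guess at what the asker wanted.

```python
import re

# Same characters as the "^\\\w*" literal in the question:
# a literal backslash followed by zero or more 'w' characters.
literal_w = r"^\\w*"
# A literal backslash followed by zero or more word characters.
word_chars = r"^\\\w*"
# Assumption: the asker actually wanted a forward slash plus word characters.
forward_slash = r"^/\w*"

no_match = re.match(literal_w, "/welcome")
backslash_w = re.match(literal_w, "\\welcome").group()
backslash_word = re.match(word_chars, "\\welcome").group()
slash_word = re.match(forward_slash, "/welcome").group()

print(no_match)        # None
print(backslash_w)     # \w
print(backslash_word)  # \welcome
print(slash_word)      # /welcome
```

In each successful case, group() with no arguments returns the whole matched text.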
