Python 2.7 regular expression match issue - python

Suppose I am using the following regular expression. Logically, it should match anything that starts with the prefix foo: and continues with one or more characters that are not a space. The match group is the part after the prefix foo:.
My question is: what exactly does "anything" mean in Python 2.7? Any ASCII character, or something else? If anyone could point me to some documentation, that would be great. Thanks.
a = re.compile('foo:([^ ]+)')
thanks in advance,
Lin

Try:
a = re.compile(r'foo:(\S+)')
\S matches any character that is not whitespace, so unlike [^ ] it also stops at tabs and newlines.
I recommend you check out http://pythex.org.
It's really good for testing out regular expressions and has a decent cheat-sheet.
UPDATE:
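The dot (.) matches any character except a newline (unless the re.DOTALL flag is set); it is not limited to ASCII. On a Python 2.7 byte string (str) it matches any single byte, and on a unicode string it matches any single character.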
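A minimal sketch of how [^ ] and \S differ in practice. The sample text is made up for illustration; it contains a non-ASCII character and a tab to show that neither class is limited to ASCII and that only \S stops at a tab.

```python
import re

# Hypothetical sample input: a non-ASCII letter, then a tab, then a space.
text = u'foo:caf\xe9\tbar baz'

# [^ ] excludes only the literal space, so the tab is part of the match.
by_space = re.search(u'foo:([^ ]+)', text).group(1)

# \S excludes every whitespace character, so the match stops at the tab.
by_whitespace = re.search(u'foo:(\\S+)', text).group(1)

print(repr(by_space))
print(repr(by_whitespace))
```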

The regular expression metacharacter which matches any character (except a newline, unless the re.DOTALL flag is set) is . (dot).
a = re.compile('foo:(.+)')
The character class [^ ] matches any one character which isn't one of the characters between the square brackets (a literal space, in this example). The quantifier + specifies one or more repetitions of the preceding expression.
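To make the difference concrete, here is a small sketch comparing the dot with the negated class on an invented multi-line input. It only illustrates the matching rules described above.

```python
import re

# Hypothetical input containing a space and a newline after the prefix.
text = 'foo:one two\nthree'

# The dot stops at the newline by default.
dot = re.search(r'foo:(.+)', text).group(1)

# With re.DOTALL the dot also matches the newline.
dot_all = re.search(r'foo:(.+)', text, re.DOTALL).group(1)

# The negated class stops at the first space.
not_space = re.search(r'foo:([^ ]+)', text).group(1)

print(repr(dot), repr(dot_all), repr(not_space))
```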

Related

Python regex: alternative positive lookbehind assertion

I have the following regex expression which is meant to find the "IF" keyword (case insensitive) in a string. Some constraints are imposed:
It should be preceded by a whitespace or a ) character (from a previous expression)
It should be followed by whitespace or ( character
The below expression accomplishes these constraints. However, this expression does not find the keyword when it's located at the start of a string (if(foo, 1, 2) for instance).
Using something like ^|(?<=[\s\)])(?i)if(?=[\s\(]) does not seem to work. I tried (?:^|[\s\)]) but that also seems to capture the space in front of the keyword.
This is what I have so far:
(?<=[\s\)])(?i)if(?=[\s\(])
You may use an alternation group with two zero-width assertions:
(?i)(?:^|(?<=[\s)]))if(?=[\s(])
    ^^^^^^^^^^^^^^^^
Here, (?:^|(?<=[\s)])) matches:
^ - start of string
| - or
(?<=[\s)]) - a location that is immediately preceded with a whitespace or ) character.
Note that in Python 2.7 the (?i) inline case-insensitive modifier affects the whole pattern regardless of where it is located in it (from Python 3.11 on, it must appear at the start of the pattern), so I suggest moving it to the pattern start for better visibility.
Also, there is no need to escape ( and ) inside character classes, [...] constructs, as they are treated as literal parentheses inside them.
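A quick way to check the alternation-group pattern is to run it over a few inputs. The sample strings below are invented for illustration; only the pattern comes from the answer.

```python
import re

# (?i) at the start; the group matches either start of string
# or a position preceded by whitespace or ')'.
pattern = re.compile(r'(?i)(?:^|(?<=[\s)]))if(?=[\s(])')

# Hypothetical test inputs.
samples = ['if(foo, 1, 2)', 'x) IF (y)', 'a If b', 'elif(x)', 'diff (x)']

# True where the keyword is found under the stated constraints.
results = [bool(pattern.search(s)) for s in samples]
print(results)
```

The last two samples contain "if" inside a longer word, so they should not match.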
The problem is that | is applied at the top level, so it is an alternation between:
^ and (?<=[\s\)])(?i)if(?=[\s\(]).
Just add non-capturing group around ^ and (?<=[\s\)]):
(?:^|(?<=[\s\)]))(?i)if(?=[\s\(])
You can solve the problem (for this particular case that only involves single characters) using a double negation:
(?<![^\s)])
(not preceded by a character that is not a whitespace nor a closing parenthesis). This condition includes the start of the string too.
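As a sanity check, the double-negation lookbehind can be compared with the alternation form on the same inputs. This is only a sketch with made-up sample strings; both patterns are written with (?i) at the start.

```python
import re

# Alternation of two zero-width assertions.
alternation = re.compile(r'(?i)(?:^|(?<=[\s)]))if(?=[\s(])')

# Double negation: not preceded by a character that is
# neither whitespace nor ')'. This also succeeds at position 0.
double_negation = re.compile(r'(?i)(?<![^\s)])if(?=[\s(])')

# Hypothetical test inputs.
samples = ['if(foo, 1, 2)', 'x) IF (y)', 'a If b', 'elif(x)', 'diff (x)']

alt_hits = [bool(alternation.search(s)) for s in samples]
neg_hits = [bool(double_negation.search(s)) for s in samples]
print(alt_hits == neg_hits)
```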

Regex, better way

How do you separate a regex, that could be matched multiple times within a string, if the delimiter is within the string, ie:
Well then 'Bang bang swing'(BBS) aota 'Bing Bong Bin'(BBB)
With the regex: "'.+'(\S+)"
It would match from Everything from 'Bang ... (BBB) instead of matching 'Bang bang swing'(BBS) and 'Bing Bong Bin'(BBB)
I have a manner of making this work with regex: '[A-z0-9-/?|q~`!##$%^&*()_-=+ ]+'(\S+)
But this is excessive, and honestly I hate that it even works correctly.
I'm fairly new to regexes, and beginning with Python's implementation of them is apparently not the smartest way to start.
To get a substring from one character up to another character, where neither can appear in-between, you should always consider using negated character classes.
The [negated] character class matches any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters. If you don't want a negated character class to match line breaks, you need to include the line break characters in the class. [^0-9\r\n] matches any character that is not a digit or a line break.
So, you can use
'[^']*'\([^()]*\)
Here,
'[^']*' - matches ' followed by 0 or more characters other than ' and then followed by a ' again
\( - matches a literal ( (it must be escaped)
[^()]* - matches 0 or more characters other than ( and ) (they do not have to be escaped inside a character class)
\) - matches a literal ) (must be escaped outside a character class).
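The breakdown above can be tried directly on the string from the question. In this sketch the greedy pattern is the question's regex with the parentheses escaped, so that both patterns return whole matches from re.findall.

```python
import re

text = "Well then 'Bang bang swing'(BBS) aota 'Bing Bong Bin'(BBB)"

# Greedy dot: runs from the first quote to the last "(...)".
greedy = re.findall(r"'.+'\(\S+\)", text)

# Negated character classes: each quoted part is matched separately.
negated = re.findall(r"'[^']*'\([^()]*\)", text)

print(greedy)
print(negated)
```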
If you might have 1 or more single quotes before (...) part, you will need an unrolled lazy matching regex:
'[^']*(?:'(?!\([^()]*\))[^']*)*'\([^()]*\)
Here, '[^']*(?:'(?!\([^()]*\))[^']*)*' matches the same text as '.*?' with the DOTALL flag, but it is much more efficient because the unrolled pattern executes linearly.
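A short sketch of the case the unrolled pattern is meant for: a quoted part that itself contains a single quote. The input string is hypothetical, chosen only to show where the simpler pattern goes wrong.

```python
import re

# Hypothetical input: the first quoted part contains an apostrophe.
text = "He said 'it's fine'(OK) and 'ok'(X)"

simple = re.compile(r"'[^']*'\([^()]*\)")
unrolled = re.compile(r"'[^']*(?:'(?!\([^()]*\))[^']*)*'\([^()]*\)")

# The simple pattern starts at the inner apostrophe instead.
simple_hits = simple.findall(text)

# The unrolled pattern lets a quote through unless "(...)" follows it.
unrolled_hits = unrolled.findall(text)

print(simple_hits)
print(unrolled_hits)
```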
EDIT:
When input strings are not complex and short, lazy dot matching turns out more efficient. However, when complexity grows, lazy dot matching may cause issues.
How about this regular expression
'.+?'\(\S+\)

search for string embedded in {} after keyword

How can I get the string embedded in {} after a keyword, where the number of characters between the keyword and the braces {} is unknown. e.g.:
includegraphics[x=2]{image.pdf}
the keyword would be includegraphics and the string to be found is image.pdf, but the text in between [x=2] could have anything between the two [].
So I want to ignore all characters between the keyword and { or I want to ignore everything between []
Use re.findall
>>> sample = 'includegraphics[x=2]{image.pdf}'
>>> re.findall('includegraphics.*?{(.*?)}',sample)
['image.pdf']
Explanation:
The re module deals with regular expressions in Python. Its findall method is useful to find all occurrences of a pattern in a string.
A regular expression for the pattern you are interested in is 'includegraphics.*?{(.*?)}'. Here . symbolizes "any character", while the * means 0 or more times. The question mark makes this a non-greedy operation. From the documentation:
The *, +, and ? qualifiers are all greedy; they match as much
text as possible. Sometimes this behaviour isn’t desired; if the RE
<.*> is matched against <H1>title</H1>, it will match the entire
string, and not just <H1>. Adding ? after the qualifier makes it
perform the match in non-greedy or minimal fashion; as few characters
as possible will be matched. Using .*? in the previous expression will
match only <H1>.
Please note that while in your case using .*? should be fine, in general it's better to use more specialized character classes, such as \w for word characters (letters, digits and underscore) and \d for digits, when you know in advance what the content is going to consist of.
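The greedy versus non-greedy behaviour quoted from the documentation can be reproduced in a few lines. This is just an illustration of the quoted passage, followed by the findall call from the answer.

```python
import re

html = '<H1>title</H1>'

# Greedy: .* grabs as much as possible, so the whole string matches.
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? grabs as little as possible, so each tag matches.
lazy = re.findall(r'<.*?>', html)

# The same idea applied to the sample from the question.
sample = 'includegraphics[x=2]{image.pdf}'
names = re.findall(r'includegraphics.*?{(.*?)}', sample)

print(greedy, lazy, names)
```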
Use re.search
re.search(r'includegraphics\[[^\[\]]*\]\{([^}]*)\}', s).group(1)
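One thing to keep in mind with the re.search one-liner is that re.search returns None when nothing matches, so calling .group(1) directly would raise an AttributeError. Below is a minimal sketch with a guard; the helper name graphic_name and the variant with an optional [...] part are assumptions added for illustration, not part of the original answer.

```python
import re

# Pattern from the answer: the [...] part is required.
REQUIRED = re.compile(r'includegraphics\[[^\[\]]*\]\{([^}]*)\}')

# Assumed variant: the [...] part is made optional with (?: ... )?.
OPTIONAL = re.compile(r'includegraphics(?:\[[^\[\]]*\])?\{([^}]*)\}')

def graphic_name(s, pattern=REQUIRED):
    """Return the text inside {} after the keyword, or None if absent."""
    m = pattern.search(s)
    return m.group(1) if m else None

print(graphic_name('includegraphics[x=2]{image.pdf}'))
print(graphic_name('includegraphics{image.pdf}'))
print(graphic_name('includegraphics{image.pdf}', OPTIONAL))
```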

Match first instance of Python regex search

I'm looking to match the first instance of text enclosed in double square brackets using regular expressions. Currently, I am doing
regex = re.compile("(?<=(\[\[)).*(?=\]\])")
r = regex.search(line)
which works for lines like
[[string]]
returns string
but when I try it on a separate line:
[[string]] ([[string2]], [[string3]])
The result is
string]] ([[string2]], [[string3
What am I missing?
Python *, +, ? and {n,m} quantifiers are greedy by default
Patterns quantified with the above quantifiers will match as much as they can by default. In your case, this means the first set of brackets and the last. In Python, you can make any quantifier non-greedy (or "lazy") by adding a ? after it. In your case, this would translate to .*? in the middle part of your expression.
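Applied to the line from the question, the change looks roughly like this. The findall variant at the end is an extra suggestion, not something the answer proposes; it uses a capturing group instead of lookarounds to collect every bracketed item.

```python
import re

line = '[[string]] ([[string2]], [[string3]])'

# Greedy: runs from the first [[ to the last ]].
greedy = re.search(r'(?<=\[\[).*(?=\]\])', line).group(0)

# Lazy: stops at the first ]].
lazy = re.search(r'(?<=\[\[).*?(?=\]\])', line).group(0)

# Assumed extension: collect all bracketed items with a capturing group.
all_items = re.findall(r'\[\[(.*?)\]\]', line)

print(greedy)
print(lazy)
print(all_items)
```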

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam". We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python not to treat the backslashes as escape characters, so they reach the regex engine unchanged.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
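For comparison, here is a sketch that extracts the same substring both ways, once with the lookbehind and once with a capturing group. The capturing pattern is one possible way to write it and is not taken from any of the answers.

```python
import re

url = 'http://sampleurl.com/(K(ThinkCode))/profile/view.aspx'

# Lookbehind: the match itself is the wanted text.
with_lookbehind = re.search(r'(?<=\(K\()[^)]*', url).group(0)

# Capturing group: match the whole "(K(...))" and pull out the inner part.
mo = re.search(r'\(K\(([^)]*)\)\)', url)
groups = mo.groups()

print(with_lookbehind)
print(groups)
```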
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
import re
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
# Escape the literal parentheses and dots; / needs no escaping in Python.
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))  # ThinkCode
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to match special characters literally in a regex, you need to escape them, such as \( and \\ (in Python a / needs no escaping, although \/ is harmless).
Matching things inside nested parentheses is quite a pain in regex. If the format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, capture characters until you see a close paren, then make sure there is one more close paren somewhere after it.
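Run against the sample URL, that pattern behaves as sketched below. This is only a quick check of the description, using re.search and the URL from the question.

```python
import re

url = 'http://sampleurl.com/(K(ThinkCode))/profile/view.aspx'

mo = re.search(r'\(.*?\((.*?)\).*?\)', url)

# group(0) is the whole parenthesized part, group(1) the inner text.
whole = mo.group(0)
inner = mo.group(1)

print(whole)
print(inner)
```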
import re
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
re.sub(r'^.*\((\w+)\).*', r'\1', mystr)
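A caveat with the re.sub approach: when the pattern does not match, re.sub returns the input string unchanged rather than signalling a failure. A small sketch, where the second URL is a made-up example without parentheses:

```python
import re

pattern = r'^.*\((\w+)\).*'

mystr = 'http://sampleurl.com/(K(ThinkCode))/profile/view.aspx'
# Hypothetical URL with no parenthesized part.
other = 'http://sampleurl.com/profile/view.aspx'

extracted = re.sub(pattern, r'\1', mystr)

# No match: the string comes back as it was.
unchanged = re.sub(pattern, r'\1', other)

print(extracted)
print(unchanged == other)
```

If the distinction matters, re.search with a check for None makes the no-match case explicit.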
