Python regex - (\w+) results different output when used with complex expression - python

I have doubt on python regex operation. Here you go my sample test.
>>>re.match(r'(\w+)','a-b') gives an output
>>> <_sre.SRE_Match object at 0x7f51c0033210>
>>>re.match(r'(\w+):(\d+)','a-b:1')
>>>
Why does the 2nd regex condition doesn't give match object though the 1st regex gives match object for a normal string match condition, irrespective of special characters is available in the string?
However, \w+ will matches for [a-z,A-Z,_]. I'm not clear why (\w+) gives matched object for the string 'a-b'. How can I check whether the given string doesn't contain any special characters?

Taking a look at the actual match will give you an idea of what happens.
>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x0000000002DE45D0>
>>> _.groups()
('a',)
As you can see, the expression matched a. The character sequence \w only contains actual word characters, but not separators like dashes. So you can’t actually match a-b using just a \w+.
Now in the second expression one might think that it would match b:1 at least, given that \w+ matches b and :(\d+) does match the 1. However it does not happen due to how re.match works. As the documentation hints, it only tries to match “at the beginning of string”. So when using re.match there is an implicit ^ at the beginning of the expression that makes it only match from the start. So it actually tries to find a match starting with a.
Instead, you can use re.search which actually looks in the whole string if it can match the expression anywhere. So there, you will get a result:
>>> re.search(r'(\w+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('b', '1')
For further information on the search vs. match topic, check this section in the manual.
And finally, if you want to match dashes too, you can use a character sequence [\w-] for example:
>>> re.match(r'([\w-]+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('a-b', '1')

The first matches the a - one or more word chars.
The second is one or more word chars immediately followed by a : which there aren't...
[a-z,A-Z,_] (the equivalent of \w) means a to z and A to Z - it isn't the literal hyphen in this context, if you did want a hyphen, put it as the first or last character of a character class.

Match's docs say
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding MatchObject
instance.
match method will return the matched object if it finds a match at the beginning of the string. (\w+) matches a in a-b.
print re.match(r'(\w+)','a-b').group()
will print
a
In the second case ((\w+):(\d+)), the actual string which gets matched is b:1, which is not at the beginning of the string. That's why its returning None.
How can I check whether the given string doesn't contain any special characters?
I would say, the second regular expression which you have used should be enough and match function should be enough. I insist on match, since there are differences between match and search http://docs.python.org/2.7/library/re.html#search-vs-match
Remember, you

Related

Match only final character of regex

I have some Python code that involves a lot of re.sub() commands. In some cases, I want to replace a character but only if it comes after certain other characters. The following is an example of how I currently am doing this in python:
secStress = "[aeiou],"[-1]
So my input for this would be a string like "a,s I walk, I hum." And I want to replace that first comma but not the "a" that comes before it.
The problem is that Python doesn't like when I give it a variable as input for re.sub(). Is there a way I can write a regex that specifies that only its final character is supposed to be matched?
You are looking for either a capturing group/backreference or a positive lookbehind solution:
s = "a,s I walk, I hum."
# Capturing group / backreference
print(re.sub(r"([aeiou]),", r"\1", s))
# Positive lookbehind
print(re.sub(r"(?<=[aeiou]),", "", s))
See the Python demo.
First approach details
The ([aeiou]) is a capturing group that matches a vowel and stores it in a special memory buffer that you can refer to from the replacement pattern using backreferences. Here, the Group ID is 1, so you can access that value using r"\1".
Second approach details
The (?<=[aeiou]) is a positive lookbehind that only checks (but does not add the text to the match value) if there is a vowel immediately before the current position. So, only those commas are matched that are preceded with a vowel and it is enough to replace with an empty string to get rid of the comma since it is the only symbol kept in the match.
If I understand you correctly,
>>> import re
>>> def doit(matchobj):
... return matchobj.group()[0]
...
>>> re.sub(r'[aeiou],', doit, "a,s I walk, I hum.")
'as I walk, I hum.'
If the regex matches then doit is called with the object that matched. Whatever string doit returns (and it must be a string) is put in place of the match.

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub('[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub('([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.

Can't find the correct regex syntax to match newline or end of string

This feels like a really simple question, but I can't find the answer anywhere.
(Notes: I'm using Python, but this shouldn't matter.)
Say I have the following string:
s = "foo\nbar\nfood\nfoo"
I am simply trying to find a regex that will match both instances of "foo", but not "food", based on the fact that the "foo" in "food" is not immediately followed by either a newline or the end of the string.
This is perhaps an overly complicated way to express my question, but it gives something concrete to work with.
Here are some of the things I have tried, with results (Note: the result I want is [foo\n, foo]):
foo[\n\Z] => ['foo\n']
foo(\n\Z) => ['\n', ''] <= This seems to match the newline and EOS, but not the foo
foo($|\n) => ['\n', '']
(foo)($|\n) => [(foo,'\n'), (foo,'')] <= Almost there, and this is a useable plan B, but I would like to find the perfect solution.
The only thing I found that does work is:
foo$|foo\n => ['foo\n', `'foo']
This is fine for such a simple example, but it is easy to see how it could become unwieldy with a much larger expression (and yes, this foo thing is a stand in for the larger expression I am actually using).
Interesting aside: The closest SO question I could find to my problem was this one: In regex, match either the end of the string or a specific character
Here, I could simply substitute \n for my 'specific character'. Now, the accepted answer uses the regex /(&|\?)list=.*?(&|$)/. I notice that the OP was using JavaScript (question was tagged with the javascript tag), so maybe the JavaScript regex interpreter is different, but when I use the exact strings given in the question with the above regex in Python, I get bad results:
>>> findall("(&|\?)list=.*?(&|$)", "index.php?test=1&list=UL")
[('&', '')]
>>> findall("(&|\?)list=.*?(&|$)", "index.php?list=UL&more=1")
[('?', '&')]
So, I'm stumped.
>>> import re
>>> re.findall(r'foo(?:$|\n)', "foo\nbar\nfood\nfoo")
['foo\n', 'foo']
(?:...) makes a non-capturing group.
This works because (from re module reference):
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
You could use re.MULTILINE and include an optional linebreak after the $ in your pattern:
s = "foo\nbar\nfood\nfoo"
pattern = re.compile('foo$\n?', re.MULTILINE)
print re.findall(pattern, s)
# -> ['foo\n', 'foo']
If you're only concerned with foo:
In [42]: import re
In [43]: strs="foo\nbar\nfood\nfoo"
In [44]: re.findall(r'\bfoo\b',strs)
Out[44]: ['foo', 'foo']
\b is denotes a word boundary:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore character. Note that formally, \b is
defined as the boundary between a \w and a \W character (or vice
versa), or between \w and the beginning/end of the string, so the
precise set of characters deemed to be alphanumeric depends on the
values of the UNICODE and LOCALE flags. For example, r'\bfoo\b'
matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or
'foo3'. Inside a character range, \b represents the backspace
character, for compatibility with Python’s string literals.
(Source)

Extract portions of text if Regex in Python

I have a a previously matched pattern such as:
<a href="somelink here something">
Now I wish to extract only the value of a specific attribute(s) in the tag such but this may be anything an occur anywhere in the tag.
regex_pattern=re.compile('href=\"(.*?)\"')
Now I can use the above to match the attribute and the value part but I need to extract only the (.*?) part. (Value)
I can ofcourse strip href=" and " later but I'm sure I can use regex properly to extract only the required part.
In simple words I want to match
abcdef=\"______________________\"
in the pattern but want only the
____________________
Part
How do I do this?
Just use re.search('href=\"(.*?)\"', yourtext).group(1) on the matched string yourtext and it will yield the matched group.
Take a look at the .group() method on regular expression MatchObject results.
Your regular expression has an explicit group match group (the part in () parethesis), and the .group() method gives you direct access to the string that was matched within that group. MatchObject are returned by several re functions and methods, including the .search() and .finditer() functions.
Demonstration:
>>> import re
>>> example = '<a href="somelink here something">'
>>> regex_pattern=re.compile('href=\"(.*?)\"')
>>> regex_pattern.search(example)
<_sre.SRE_Match object at 0x1098a2b70>
>>> regex_pattern.search(example).group(1)
'somelink here something'
From the Regular Expression syntax documentation on the (...) parenthesis syntax:
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

regular expression question (python)

I want to read a word html file and grab any words which contain letters of a name but not print them if the words are longer than the name
# compiling the regular expression:
keyword = re.compile(r"^[(rR)|(yY)|(aA)|(nN)]{5}$/")
if keyword.search (line):
print line,
i am grabbing the words with this but don't seem to be limiting the size properly.
it seems you are looking for keyword.match() instead of keyword.search(). you should read this part of the python documentation which discusses the difference between match and search.
also, your regular expression seems completely off... [ and ] delimits a set of characters, so you can't put groups and have a logic around the groups. as written, your expression will also match all (, ) and |. you may try the following:
keyword = re.compile(r"^[rRyYaAnN]{5}$")
Your RE "^[(rR)|(yY)|(aA)|(nN)]{5}$/" will never never never give a matching in any string on earth and elsewhere, I think, because of the '/' character after '$'
See the results of the RE without this '/':
import re
pat = re.compile("^[(rR)|(yY)|(aA)|(nN)]{5}$")
for ch in ('arrrN','Aar)N','()|Ny','NNNNN',
'marrrN','12Aar)NUUU','NNNNN!'):
print ch.ljust(15),pat.search(ch)
result
arrrN <_sre.SRE_Match object at 0x011C8EC8>
Aar)N <_sre.SRE_Match object at 0x011C8EC8>
()|Ny <_sre.SRE_Match object at 0x011C8EC8>
NNNNN <_sre.SRE_Match object at 0x011C8EC8>
marrrN None
12Aar)NUUU None
NNNNN! None
My advice: think of [.....] in a RE as representing ONE character at ONE position. So every character that is between the brackets is one of the options of represented character.
Moreover, as said by Adrien Plisson, between brackets [......] a lot of special characters lost their speciality. Hence '(', ')','|' don't define group and OR, they represent just these characters as some of the options along with the letters 'aArRyYnN'
.
"^[rRyYaAnN]{1,5}$" will match only strings as 'r',ar','YNa','YYnA','Nanny'
If you want to match the same words anywhere in a text, you will need "[rRyYaAnN]{1,5}"

Categories

Resources