Why does this pattern not match? - python

Using this pattern:
(?<=\(\\\\).*(?=\))
and this subject string: '(\\Drafts) "/" "&g0l6P3ux-"'
I was expecting to match Drafts
However, it is not working. Can someone explain why?
I am using the re module in Python. The following is what I did:
>>> pattern = re.compile("(?<=\(\\\\).*?(?=\\))")
>>> pattern.pattern
'(?<=\\(\\\\).*?(?=\\))'
>>> two
'(\\Drafts) "/" "&g0l6P3ux-"'
>>> match = pattern.search(two)
>>> match
<_sre.SRE_Match object at 0x1096e45e0>
>>> match.groups()
()
>>> match.group(0)
'Drafts'
>>>
My question is: why does groups() return nothing while group(0) returns the right answer?

match.groups() is empty because your pattern does not define any capturing groups. match.group(0) is the complete match, while match.group(1) would be the first capturing group if there was one.
To improve readability you should express regex patterns as raw strings. Yours can be written as
r"(?<=\(\\).*?(?=\))"
To break it down, there is a lookbehind for literal (\, then .*? and finally a lookahead for literal ).
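As a minimal sketch of the difference (the variable names here are illustrative and not from the question), wrapping the lazy part in parentheses is enough to make groups() non-empty:

```python
import re

subject = '(\\Drafts) "/" "&g0l6P3ux-"'

# No capturing group: only group(0), the whole match, is available.
no_groups = re.search(r"(?<=\(\\).*?(?=\))", subject)
print(no_groups.group(0))   # Drafts
print(no_groups.groups())   # ()

# Same pattern, with the lazy part wrapped in a capturing group.
one_group = re.search(r"(?<=\(\\)(.*?)(?=\))", subject)
print(one_group.group(1))   # Drafts
print(one_group.groups())   # ('Drafts',)
```

In both cases group(0) is the full match; groups() only ever lists the parenthesized capturing groups.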

Related

Replace a substring between two substrings

How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.
You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any character (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
You can use
import re
l = 'https://'+'homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
Note that you used an empty string as the replacement argument. In the above snippet, the parts before and after .*? are captured; in the replacement pattern, \g<1> refers to the first group value and \2 refers to the second group value. The unambiguous backreference form (\g<X>) is used because a digit follows right after the backreference, and \1 followed by a digit would be read as a different group number.
Since replace_with contains no backslashes, there is no need to preprocess (escape) anything in it.
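If the replacement text could contain backslashes, one option is to pass a function instead of a replacement string, since the function's return value is inserted as-is. The following is a sketch under that assumption; the value of replace_with is hypothetical and chosen only to contain a backslash:

```python
import re

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = r'rev\2'  # hypothetical value containing a backslash

# A callable replacement is used literally, so the backslash in
# replace_with is not parsed as an escape or a group reference.
result = re.sub(
    r'(page1/).*?(_type-A)',
    lambda m: m.group(1) + replace_with + m.group(2),
    l,
)
print(result)
```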
This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses positive lookbehind and lookahead assertions, (?<=...) and (?=...). A match only occurs if the text is preceded and followed by what the assertions specify, but the assertions themselves are not consumed. This means re.sub looks for text with page1/ before it and _type after it, but only replaces the part in between.
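To see that the assertions are not part of the match, one can inspect the match boundaries directly. This is only an illustrative check, using the same string and pattern as above:

```python
import re

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
m = re.search(r"(?<=page1/).*?(?=_type)", l)

# The match covers only the text between the two assertions.
print(repr(m.group(0)))                    # '222.6 a'
# The surrounding text is checked but stays outside the match.
print(l[:m.start()].endswith('page1/'))    # True
print(l[m.end():].startswith('_type'))     # True
```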

python RE white space in the pattern

I am writing a Python script to find a tag name in a string like this:
string='Tag Name =LIC100 State =TRUE'
If I use an expression like this
re.search('Name(.*)State',string)
I get " =LIC100". I would like to get just LIC100.
Any suggestions on how to set up the pattern to eliminate the whitespace and the equals sign?
That is because (.*) matches zero or more characters other than line breaks, from Name up to the last State. You may restrict the pattern in Group 1 to just non-whitespace characters:
import re
string='Tag Name =LIC100 State =TRUE'
m = re.search(r'Name\s*=(\S*)',string)
if m:
    print(m.group(1))
Pattern details:
Name - a literal char sequence
\s* - 0+ whitespaces
= - a literal =
(\S*) - Group 1 capturing 0+ chars other than whitespace (or \S+ can be used to match 1 or more chars other than whitespace).
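The choice between \S* and \S+ only matters when the value is missing. A small sketch of the difference follows; the second input string is hypothetical and not from the question:

```python
import re

pattern_star = re.compile(r'Name\s*=(\S*)')
pattern_plus = re.compile(r'Name\s*=(\S+)')

full = 'Tag Name =LIC100 State =TRUE'
empty = 'Tag Name = State =TRUE'  # hypothetical input with no tag value

print(pattern_star.search(full).group(1))          # LIC100
print(repr(pattern_star.search(empty).group(1)))   # ''
print(pattern_plus.search(empty))                  # None
```

With \S* a missing value yields an empty group, while \S+ makes the whole search fail, which may be easier to detect.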
The easiest solution would probably just be to strip it out after the fact, like so:
s = " =LIC100 "
s = s.strip('= ')
print(s)
#LIC100
If you insist on doing it within the regex, you can try something like:
reg = r'Name[ =]+([A-Za-z0-9]+)\s+State'
Your current regex is failing because (.*) captures all characters until the occurrence of State. Instead of capturing everything, you can use a positive lookbehind to describe what precedes, but is not included in, the content you actually want to capture. In this case, "Name =" precedes the match, so we can stick it in the lookbehind assertion as (?<=Name =), then match word characters up to the next non-word character (here, the next whitespace):
>>> import re
>>> s = 'Tag Name =LIC100 State =TRUE'
>>> r = re.compile(r"(?<=Name =)\w*")
>>> print(r.search(s))
<_sre.SRE_Match object; span=(10, 16), match='LIC100'>
>>> print(r.search(s).group(0))
LIC100
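One caveat with this approach: Python's re module requires lookbehind patterns to have a fixed width, so a lookbehind such as (?<=Name\s*=) is rejected at compile time. The sketch below shows this, with a capturing group as a fallback; the input with extra spaces is hypothetical:

```python
import re

s = 'Tag Name   =LIC100 State =TRUE'  # hypothetical input with extra spaces

# A variable-width lookbehind does not compile in the re module.
try:
    re.compile(r'(?<=Name\s*=)\w*')
    compiled = True
except re.error:
    compiled = False
print(compiled)      # False

# A capturing group handles the variable amount of whitespace instead.
m = re.search(r'Name\s*=(\w*)', s)
print(m.group(1))    # LIC100
```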
Following the tips above, I managed to find a nice solution.
Actually, the string I am trying to process has some non-printable characters. It is like this
"Tag Name\x00=LIC100\x00\tState=TRUE"
Using the concept of lookahead and lookbehind I found the following solution:
import re
s = 'Tag Name\x00=LIC100\x00\tState=TRUE'
T=re.search(r'(?<=Name\x00=)(.*)(?=\x00\tState)',s)
print(T.group(0))
The nice thing about this is that the result does not have any non-printable characters in it.
<_sre.SRE_Match object; span=(10, 16), match='LIC100'>

Why does the regex (.*?(?: *?\n)) capture newlines?

Consider the text below:
foobar¬
nextline
The regex (.*?(?: *?\n)) matches foobar¬
where ¬ denotes a newline \n.
Why does the regex match it? Shouldn't the non-capturing group exclude it?
Tested on Regex101 for the python dialect.
“Non-capturing group” refers to the fact that matches within that group will not be available as separate groups in the resulting match object. For example:
>>> re.search('(foo)(bar)', 'foobarbaz').groups()
('foo', 'bar')
>>> re.search('(foo)(?:bar)', 'foobarbaz').groups()
('foo',)
However, everything that is part of an expression is matched and as such appears in the resulting match (Group 0 shows the whole match):
>>> re.search('(foo)(bar)', 'foobarbaz').group(0)
'foobar'
>>> re.search('(foo)(?:bar)', 'foobarbaz').group(0)
'foobar'
If you don’t want to match that part but still want to make sure it’s there, you can use a lookahead expression:
>>> re.search('(foo)(?=bar)', 'foobarbaz')
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> re.search('(foo)(?=bar)', 'foobaz')
None
So in your case, you could use (.*?(?= *?\n)).
The \n is captured because the non-capturing group is inside the capturing group:
>>> s = 'foobar\nnextline'
>>> re.search(r'(.*?(?: *?\n))', s).groups()
('foobar\n',)
If you don't want that, place the non-capturing group outside of the capturing one:
>>> re.search(r'(.*?)(?: *?\n)', s).groups()
('foobar',)
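The three variants discussed here can be compared side by side. This is a small sketch; the test string has two trailing spaces added purely for illustration:

```python
import re

s = 'foobar  \nnextline'  # two trailing spaces added for illustration

inside = re.search(r'(.*?(?: *?\n))', s)      # non-capturing group inside group 1
outside = re.search(r'(.*?)(?: *?\n)', s)     # non-capturing group after group 1
lookahead = re.search(r'(.*?)(?= *?\n)', s)   # lookahead instead of a group

print(repr(inside.group(1)))      # 'foobar  \n'
print(repr(outside.group(1)))     # 'foobar'
print(repr(outside.group(0)))     # 'foobar  \n'
print(repr(lookahead.group(0)))   # 'foobar'
```

Only the lookahead keeps the spaces and the newline out of the overall match (group 0); moving the non-capturing group outside only keeps them out of group 1.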

python regular expression grouping

My regular expression goal:
"If the sentence has a '#' in it, group all the stuff to the left of the '#' and group all the stuff to the right of the '#'. If the sentence doesn't have a '#', then just return the entire sentence as one group"
Examples of the two cases:
A) '120x4#Words' -> ('120x4', 'Words')
B) '120x4#9.5' -> ('120x4#9.5')
I made a regular expression that parses case A correctly
(.*)(?:#(.*))
# List the groups found
>>> r.groups()
(u'120x4', u'words')
But of course this won't work for case B -- I need to make "# and everything to the right of it" optional
So I tried to use the '?' "zero or one" operator on that second grouping to indicate it's optional.
(.*)(?:#(.*))?
But it gives me bad results. The first grouping eats up the entire string.
# List the groups found
>>> r.groups()
(u'120x4#words', None)
Guess I'm either misunderstanding the none-or-one '?' operator and how it works on groupings or I am misunderstanding how the first group is acting greedy and grabbing the entire string. I did try to make the first group 'reluctant', but that gave me a total no-match.
(.*?)(?:#(.*))?
# List the groups found
>>> r.groups()
(u'', None)
Simply use the standard str.split function:
s = '120x4#Words'
x = s.split( '#' )
If you still want a regex solution, use the following pattern:
([^#]+)(?:#(.*))?
(.*?)#(.*)|(.+)
This should work. See demo.
http://regex101.com/r/oC3nN4/14
Use re.split:
>>> import re
>>> a='120x4#Words'
>>> re.split('#',a)
['120x4', 'Words']
>>> b='120x4#9.5'
>>> re.split('#',b)
['120x4#9.5']
>>>
Here's a verbose re solution. But, you're better off using str.split.
import re
REGEX = re.compile(r'''
    \A
    (?P<left>.*?)
    (?:
        [#]
        (?P<right>.*)
    )?
    \Z
''', re.VERBOSE)
def parse(text):
    match = REGEX.match(text)
    if match:
        return tuple(filter(None, match.groups()))
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
Better solution
def parse(text):
    return text.split('#', maxsplit=1)
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
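For the regex route, the following sketch shows why the lazy attempt in the question returned ('', None) and how excluding '#' from the first group avoids it. The input without a '#' is hypothetical, chosen to exercise the one-group case:

```python
import re

with_hash = '120x4#Words'
without_hash = '120x4 9.5'  # hypothetical input containing no '#'

# Lazy first group plus an optional tail: matching nothing at all already
# satisfies the pattern, so the engine stops immediately.
lazy = re.match(r'(.*?)(?:#(.*))?', with_hash)
print(lazy.groups())                             # ('', None)

# Excluding '#' from the first group forces it to stop at the separator.
pattern = re.compile(r'([^#]+)(?:#(.*))?')
print(pattern.fullmatch(with_hash).groups())     # ('120x4', 'Words')
print(pattern.fullmatch(without_hash).groups())  # ('120x4 9.5', None)
```

Using fullmatch here is a design choice so that the pattern must account for the entire input rather than a prefix.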

Python - why doesn't this simple regex work?

This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what the documentation for re.match() says: "If zero or more characters at the beginning of string match the regular expression pattern ..."
In your case, the string does not have a digit (\d) at the beginning. For \w, however, it has t at the beginning of the string.
If you want to check for a digit in your string using the same mechanism, prepend .* to your regex:
digit_regex = re.compile('.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.
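A short sketch of the three methods on the string from the question, written for Python 3:

```python
import re

string = 'this is a string with a 4 digit in it'
digit_regex = re.compile(r'\d')

# match() only tries at the beginning of the string.
print(digit_regex.match(string))            # None
# search() scans forward for the first match anywhere in the string.
print(digit_regex.search(string).group())   # 4
# findall() returns every non-overlapping match as a list.
print(digit_regex.findall(string))          # ['4']
```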
