Python re match at specific point in string - python

If I have a given string s in Python, is it possible to easily check if a regex matches the string starting at a specific position i in the string?
I would rather not slice the entire string from i to the end as it doesn't seem very scalable (ruling out re.match I think).

re.match doesn't support this directly. However, if you pre-compile your regular expression with re.compile (often a good idea anyway), the compiled pattern's match and search methods both take an optional pos parameter:
The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.
Example:
import re
s = 'this is a test 4242 did you get it'
pat = re.compile('[a-zA-Z]+ ([0-9]+)')
print(pat.match(s, 10).group(0))
Output:
test 4242
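The difference between pos and slicing that the documentation describes can be sketched as follows. This uses only the standard re module; the sample string and variable names are made up for illustration:

```python
import re

s = 'abc123def'
digits = re.compile(r'\d+')
anchored = re.compile(r'^\d+')

# match() with pos tries the pattern at index 3 without copying the string.
m = digits.match(s, 3)
print(m.span(), m.group())   # (3, 6) 123

# endpos (exclusive) limits how far the match may extend.
m_short = digits.match(s, 3, 5)
print(m_short.group())       # 12

# '^' still refers to the real start of the string, so this is None...
print(anchored.match(s, 3))

# ...whereas slicing builds a new string whose start is the old index 3.
print(anchored.match(s[3:]).group())   # 123
```

If the pattern relies on '^', slicing and pos give different answers, so pick the one whose semantics you actually want.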

Although re.match does not support this, the third-party regex module (originally intended as a replacement for the re module) has a treasure trove of extra features, including pos and endpos arguments for search, match, sub, and subn. It is not part of the standard library, but it can be installed with pip; at the time of writing it worked for Python versions 2.5 through 3.4. Here's an example:
>>> import regex
>>> regex.match(r'\d+', 'abc123def')
>>> regex.match(r'\d+', 'abc123def', pos=3)
<regex.Match object; span=(3, 6), match='123'>
>>> regex.match(r'\d+', 'abc123def', pos=3, endpos=5)
<regex.Match object; span=(3, 5), match='12'>

Related

Python Regex \pL matching issues

I'm trying to match a list of keywords, taking care to handle all Latin characters (e.g. accented ones).
Here's an example:
import regex as re
p = r'((?!\pL)|^)blah((?!\pL)|$)'
print(re.search(p, "blah u"))
print(re.search(p, "blahé u"))
print(re.search(p, "éblah u"))
print(re.search(p, "blahaha"))
gives:
<regex.Match object; span=(0, 4), match='blah'>
None
None
None
Which looks correct. However:
print(re.search(p, "u blah"))
gives:
None
This is wrong, as I expect a match for "u blah".
I've also tried to use Python's built-in re module, but I cannot get it to work with \pL or \p{Latin} because of "bad escape" errors. I've also tried non-raw strings (without the "r"), but even after doubling the backslashes (\\\\pL) I just can't get this to work right.
Note: I'm using Python 3.8
The problem with your ((?!\pL)|^)blah((?!\pL)|$) regex is that the first alternative in the ((?!\pL)|^) group can never succeed. (?!\pL) is a negative lookahead that fails the match if the next char is a letter, and the next char to match is always the b in blah. Only the ^ alternative can work, so your regex is equivalent to ^blah((?!\pL)|$) and only matches at the start of the string.
Note that (?!\pL) already matches at the end of the string, so ((?!\pL)|$) = (?!\pL).
You should use
(?<!\pL)blah(?!\pL)
See the regex demo (switched to PCRE for the demo purposes).
Note that the re-compatible version of the regex is
(?<![^\W\d_])blah(?![^\W\d_])
See the regex demo.
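As a quick check of the re-compatible pattern, here is a minimal sketch using only the standard library. The sample strings are the ones from the question, with é written as the escape \u00e9 so that the precomposed character is used:

```python
import re

# Stdlib-only version: [^\W\d_] stands in for "any letter" (\pL).
p = re.compile(r'(?<![^\W\d_])blah(?![^\W\d_])')

samples = ["blah u", "u blah", "blah\u00e9 u", "\u00e9blah u", "blahaha"]

# Map each sample to whether the keyword is found as a standalone word.
results = {text: bool(p.search(text)) for text in samples}
for text, found in results.items():
    print(f"{text!r}: {found}")
```

Only "blah u" and "u blah" should report True; the other three have a letter directly before or after blah.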

Python regex Doesn't Match a Simple Pattern

I am trying to match a very simple pattern using Python's regex package (I am new to regex). I don't understand the following behavior:
import regex
regex.match('economy', 'promising.\n\nARTICLE 4\n\nECONOMY The economy')
or
regex.match('ARTICLE', 'promising.\n\nARTICLE 4\n\nECONOMY The economy')
doesn't match anything. Of course if I do
regex.match('economy', 'economy')
it does match. Why is that the case?
Also, if I want to match 'ARTICLE' case-sensitively in the above example, what would be the right way to do it?
I am using version 2016.1.10 of regex.
match looks for a match at the start of the string. If you want to match other than the start you need to use search.
I don't have regex installed here, but it should behave the same as re.
>>> import re
>>> re.search('economy', 'promising.\n\nARTICLE 4\n\nECONOMY The economy')
<_sre.SRE_Match object; span=(35, 42), match='economy'>
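To make the contrast concrete, here is a small sketch with the string from the question. It uses the standard re module, on the assumption (stated in the answer above) that regex behaves the same way for these calls; the variable names are illustrative:

```python
import re

text = 'promising.\n\nARTICLE 4\n\nECONOMY The economy'

# match() only tries the pattern at index 0, so this is None.
at_start = re.match('economy', text)

# search() scans forward and finds the lowercase word near the end.
anywhere = re.search('economy', text)

# Matching is case-sensitive by default; with re.IGNORECASE the
# uppercase 'ECONOMY' is found first.
ignoring_case = re.search('economy', text, re.IGNORECASE)

print(at_start)                                      # None
print(anywhere.span())                               # (35, 42)
print(ignoring_case.span(), ignoring_case.group())   # (23, 30) ECONOMY
```

Since matching is case-sensitive unless a flag says otherwise, re.search('ARTICLE', text) already finds only the uppercase spelling.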

Python Regex find standalone case

I have a string of time for example:
text = '2010; 04/20/2010; 04/2009'
I want to only find the first standalone '2010', but applying the following code:
re.findall(r'\d{4}', text)
will also find the second '2010' embedded in the mm/dd/yyyy format.
Is there a way to achieve this (not using the ';' sign)?
You can use re.search to find only the first occurrence:
>>> import re
>>> text = '2010; 04/20/2010; 04/2009'
>>> re.search(r'\d{4}', text)
<_sre.SRE_Match object; span=(0, 4), match='2010'>
>>> re.search(r'\d{4}', text).group()
'2010'
>>>
From the documentation:
re.search(pattern, string, flags=0)
Scan through string looking for
the first location where the regular expression pattern produces a
match, and return a corresponding match object. Return None if no
position in the string matches the pattern; note that this is
different from finding a zero-length match at some point in the
string.
Emphasis mine.
I don't know whether you have to use a regex, but str.find() in Python 3 returns the lowest index at which the substring you are looking for starts. From there, if you know the length of that substring (which I assume you do), you can extract it with a slice in one more line of code. I'm not sure whether it's better or worse than a regex, but it seems like a less complex way to do the same thing in this case. See the Python docs on str.find() for details.
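The str.find() idea can be sketched as follows, assuming the value to look for ('2010' here) is known in advance. The second half is an alternative that is not taken from the answers above: a lookaround pattern that treats a year as standalone when it is not directly next to a digit or a slash.

```python
import re

text = '2010; 04/20/2010; 04/2009'
target = '2010'  # assumption: the substring is known beforehand

# str.find returns the lowest index where target starts, or -1 if absent.
start = text.find(target)
first = text[start:start + len(target)] if start != -1 else None
print(start, first)   # 0 2010

# Alternative (not from the answers above): four digits that are neither
# preceded nor followed by a digit or '/'.
standalone = re.findall(r'(?<![\d/])\d{4}(?![\d/])', text)
print(standalone)     # ['2010']
```

The lookaround version does not depend on the ';' separator, but it does assume that dates always attach the year with a slash.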

Python regex numbers and underscores

I'm trying to get a list of files from a directory whose file names follow this pattern:
PREFIX_YYYY_MM_DD.dat
For example
FOO_2016_03_23.dat
Can't seem to get the right regex. I've tried the following:
pattern = re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
>>> []
pattern = re.compile(r'*(\d{4})_(\d{2})_(\d{2}).dat')
>>> sre_constants.error: nothing to repeat
Regex is certainly a weakpoint for me. Can anyone explain where I'm going wrong?
To get the files, I'm doing:
files = [f for f in os.listdir(directory) if pattern.match(f)]
PS, how would I allow for .dat and .DAT (case insensitive file extension)?
Thanks
You have two issues with your expression:
re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
The first one, as a previous comment stated, is that the . right before dat should be escaped by putting a backslash (\) before it. Otherwise the regex engine will treat it as a special character, because in a regex . means "any character".
Besides that, your expression doesn't handle an uppercase extension. You should add a group with dat and DAT as the possible choices.
With both changes made, it should look like:
re.compile(r'(\d{4})_(\d{2})_(\d{2})\.(?:dat|DAT)')
As an extra note, I added ?: at the beginning of the group to make it non-capturing, so it is left out of the match results.
Use pattern.search() instead of pattern.match().
pattern.match() always matches from the start of the string (which includes the PREFIX).
pattern.search() searches anywhere within the string.
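Putting these points together, a minimal sketch might look like this. The list of names is a hypothetical stand-in for os.listdir(directory), and the trailing '$' is an extra assumption so that nothing may follow the extension:

```python
import re

# Escaped dot, case-insensitive extension via re.IGNORECASE, and '$' so
# that the name must end with the extension.
pattern = re.compile(r'_(\d{4})_(\d{2})_(\d{2})\.dat$', re.IGNORECASE)

# Hypothetical directory listing, standing in for os.listdir(directory).
names = ['FOO_2016_03_23.dat', 'BAR_2015_12_01.DAT',
         'FOO_2016_03_23xdat', 'notes.txt']

# search() is used because the date does not start at index 0.
files = [name for name in names if pattern.search(name)]
print(files)   # ['FOO_2016_03_23.dat', 'BAR_2015_12_01.DAT']
```

Using the re.IGNORECASE flag also accepts mixed-case extensions such as .Dat, which the (?:dat|DAT) group would reject.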
Does this do what you want?
>>> import re
>>> pattern = r'\A[a-z]+_\d{4}_\d{2}_\d{2}\.dat\Z'
>>> string = 'FOO_2016_03_23.dat'
>>> re.search(pattern, string, re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 18), match='FOO_2016_03_23.dat'>
>>>
It appears to match the format of the string you gave as an example.
The following should match for what you requested.
[^_]+[_]\d{4}[_]\d{2}[_]\d{2}[\.]\w+
I recommend using https://regex101.com/ (for Python regular expressions) or http://regexr.com/ (for JavaScript regular expressions) in the future if you want to validate your regular expressions.

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print(re.sub(r'[a-z]*\d+', 'lion', 'zebra432'))  # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print(re.sub(r'[a-z]*\d+', r'lion\d+', 'zebra432'))
I want that to print 'lion432'. Obviously, it does not: the \d+ in the replacement is not interpreted as a pattern (Python 2 prints 'lion\d+' literally, and Python 3.7+ raises a "bad escape" error). Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub(r'[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub(r'([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw string for the replacement so that Python does not treat \1 as an escape sequence.
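Two related options in the standard re module may also be useful: named groups referenced with \g<name>, and a function as the replacement. The sketch below is illustrative only; the group names and the double_number helper are made up for the example:

```python
import re

# Named groups can be referenced with \g<name> in the replacement.
named = re.sub(r'(?P<word>[a-z]*)(?P<num>\d+)', r'lion\g<num>', 'zebra432')
print(named)      # lion432

# A function receives each match object and returns the replacement text,
# which allows computed replacements.
def double_number(match):
    return 'lion' + str(int(match.group(1)) * 2)

computed = re.sub(r'[a-z]*(\d+)', double_number, 'zebra432')
print(computed)   # lion864
```

A function replacement is the more general tool when the new text has to be computed from the match rather than just rearranged.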
