I am trying to match a very simple pattern using Python's regex package (I am new to regex). I don't understand the following behavior:
import regex
regex.match('economy', 'promising.\n\nARTICLE 4\n\nECONOMY The economy')
or
regex.match('ARTICLE', 'promising.\n\nARTICLE 4\n\nECONOMY The economy')
doesn't match anything. Of course if I do
regex.match('economy', 'economy')
it matches. Why is that the case?
Also, if I want to match case sensitive 'ARTICLE' in the above example, what is the right way to do it?
I am using version 2016.1.10 of regex.
match only looks for a match at the start of the string. To find a match anywhere else in the string, you need to use search.
I don't have regex installed here but it should be the same as re.
>>> re.search('economy', 'promising.\n\nARTICLE 4\n\nECONOMY The economy')
<_sre.SRE_Match object; span=(35, 42), match='economy'>
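To make the difference concrete, here is a small sketch using the standard-library re module (assuming, as above, that regex behaves the same way for these calls). The variable names are only for illustration, and re.IGNORECASE is shown as one option in case the match should not depend on case.

```python
import re

text = 'promising.\n\nARTICLE 4\n\nECONOMY The economy'

# match() only tries at position 0, so it finds nothing here.
at_start = re.match('economy', text)

# search() scans forward and returns the first occurrence anywhere.
anywhere = re.search('economy', text)

# Patterns are case sensitive by default: 'ARTICLE' matches only the
# upper-case word.
article = re.search('ARTICLE', text)

# With re.IGNORECASE, 'economy' also matches the earlier 'ECONOMY'.
ignore_case = re.search('economy', text, re.IGNORECASE)

print(at_start)            # None
print(anywhere.span())     # (35, 42)
print(article.span())      # (12, 19)
print(ignore_case.span())  # (23, 30)
```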
I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:
oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")
In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
cleaned = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9]+)", cleaned).group(1)
Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.
Notice that if you can use groups, you generally do not need lookbehinds. So how about
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)
In practice:
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 14), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
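As a rough sketch of how this might be wrapped up for reuse: the helper name extract_account below is made up for illustration, and the first part only demonstrates that the standard re module rejects the variable-width lookbehind from the question.

```python
import re

# The lookbehind from the question contains optional parts (\s?), so its
# width is not fixed and re refuses to compile it.
try:
    re.compile(r"(?<=ORIG\s?:\s?/\s?)[A-Z0-9]+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False


def extract_account(line):
    """Hypothetical helper: return the token after 'ORIG:/', or None."""
    m = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", line)
    return m.group(1) if m else None


print(lookbehind_ok)                        # False
print(extract_account("ORIG : / AB123"))    # AB123
print(extract_account("ORIG:/AB123"))       # AB123
print(extract_account("no account here"))   # None
```

Returning None instead of calling .group(1) directly avoids an AttributeError on lines that do not contain the marker.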
In the case you described, use a capture group instead of the lookbehind:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is some sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print(re.search(p, test_str).group(1))
Unless you need overlapping matches, replacing a look-behind with a capturing group is rather straightforward.
print(re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", test_str))
You can use findall directly: when the regex contains groups, it returns the groups rather than the whole match.
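A short illustration of that findall behaviour, using a made-up input string with two occurrences (the text value is an assumption, not data from the question):

```python
import re

# Made-up sample input with two account markers.
text = "ORIG:/AB123 then ORIG : / CD456"

# With one capturing group, findall returns just the captured parts.
accounts = re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", text)

# Without a group, findall returns the whole matches instead.
whole = re.findall(r"ORIG\s?:\s?/\s?[A-Z0-9]+", text)

print(accounts)  # ['AB123', 'CD456']
print(whole)     # ['ORIG:/AB123', 'ORIG : / CD456']
```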
I need to understand why my regular expression is matching greedily when I am telling it not to.
Given string='.GATA..GATA..ETS..ETS.'
Return the shortest substring of GATA...ETS
I use the regex pattern pattern = r'(GATA).*?(ETS)'
syntax_finder=re.compile(pattern,re.IGNORECASE)
for match in syntax_finder.finditer(string):
    print(match)
Returns <re.Match object; span=(1, 16), match='GATA..GATA..ETS'>
However, I want it to return 'GATA..ETS'
Does anyone know why this is happening?
I am not looking for a solution to this exact matching problem. I will be doing a lot of these types of searches with more complicated patterns of GATA and ETS, but I will always want it to return the shortest match.
Thanks!
Does anyone know why this is happening?
The regex matches non-greedily. It finds the first GATA and then, because .*? is used rather than .*, matches until the first ETS after that. It just happens that there is another GATA in the way, which you don't want - but which non-greedy matching doesn't care about.
I will be doing a lot of these types of searches with more complicated patterns of GATA and ETS
Then regexes are probably underpowered for the job. My suggestion is to use them to split the string into GATA, ETS and intervening portions (tokenization), and then use other techniques to find the patterns in that sequence (parsing).
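A minimal sketch of that tokenize-then-parse idea, assuming the only tokens of interest are the literal strings GATA and ETS. The function name shortest_gata_ets is invented for this example, and the "parsing" step is deliberately simple: it just pairs each ETS with the nearest GATA before it.

```python
import re


def shortest_gata_ets(sequence):
    """Return the shortest substring running from a GATA to a later ETS."""
    # Tokenization: record each GATA/ETS occurrence with its position.
    tokens = [(m.group().upper(), m.start(), m.end())
              for m in re.finditer(r"GATA|ETS", sequence, re.IGNORECASE)]

    # Parsing: for every ETS, the nearest preceding GATA gives the
    # shortest candidate ending at that ETS; keep the overall shortest.
    best = None
    last_gata_start = None
    for kind, start, end in tokens:
        if kind == "GATA":
            last_gata_start = start
        elif last_gata_start is not None:
            candidate = sequence[last_gata_start:end]
            if best is None or len(candidate) < len(best):
                best = candidate
    return best


print(shortest_gata_ets('.GATA..GATA..ETS..ETS.'))  # GATA..ETS
```

More complicated arrangements of GATA and ETS would change only the loop over the token list, not the regex.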
I am not looking for a solution to this exact matching problem.
But I can't resist :)
>>> re.search(r'(GATA)((?<!GAT)A|[^A])*?(ETS)', '.GATA..GATA..ETS..ETS.')
<_sre.SRE_Match object; span=(7, 16), match='GATA..ETS'>
Here we use a negative lookbehind assertion: while scanning the part between GATA and ETS, we only allow an A if it is not preceded by GAT.
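For comparison, a different technique that gives the same result on this input is a negative lookahead checked before every character of the gap (sometimes called a tempered dot). This is only a sketch of an alternative, not the pattern used above:

```python
import re

# Between GATA and ETS, consume a character only if a new GATA does not
# start at that position.
pattern = re.compile(r"GATA(?:(?!GATA).)*?ETS", re.IGNORECASE)

m = pattern.search('.GATA..GATA..ETS..ETS.')
print(m.group())  # GATA..ETS
print(m.span())   # (7, 16)
```

Note that . does not match a newline unless re.DOTALL is passed, which may or may not matter for the real data.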
I'm using the regex module in Python 3, and I want to be able to check a string to match that of a "zero or singly-templated" C++ datatype, such as Foo, Foo<Bar>, Foo<Bar<Baz>>, Foo<Bar<Baz<Hello<World>>>>, etc.
At the moment, I have (<X(?R)?>)*, where X is some text. This almost works for all of those examples given, it's just that they have to be surrounded by <> pairs themselves as well.
I'm looking for a way to be able to have some text out front of what's considered this recursive portion. Is this possible with regular expressions?
The regex module does allow recursive patterns (such expressions aren't strictly regular, which is why you're getting conflicting information). You just need to add a base case:
(?>\w+<(?R)>)|\w+
This matches with
regex.match(r"(?>\w+<(?R)>)|\w+", "Foo<Bar<Baz>>")
# <regex.Match object; span=(0, 13), match='Foo<Bar<Baz>>'>
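If installing the third-party regex module is not an option, the same grammar (a name, optionally followed by one nested name in angle brackets) can be checked with a small hand-written recursive function on top of the standard re module. This is a swapped-in technique, not a recursive regex, and the function name is_templated_type is made up for the sketch. Unlike the match call above, it requires the whole string to fit the grammar.

```python
import re

_NAME = re.compile(r"\w+")


def is_templated_type(text):
    """Check that text looks like Foo, Foo<Bar>, Foo<Bar<Baz>>, and so on."""

    def parse(pos):
        # Base case: a plain name starting at pos.
        m = _NAME.match(text, pos)
        if not m:
            return None
        pos = m.end()
        # Recursive case: name '<' nested-type '>'.
        if pos < len(text) and text[pos] == "<":
            inner = parse(pos + 1)
            if inner is None or inner >= len(text) or text[inner] != ">":
                return None
            return inner + 1
        return pos

    # Valid only if parsing consumed the entire string.
    return parse(0) == len(text)


print(is_templated_type("Foo"))            # True
print(is_templated_type("Foo<Bar<Baz>>"))  # True
print(is_templated_type("Foo<Bar"))        # False
```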
I'm trying to get a list of files from a directory whose file names follow this pattern:
PREFIX_YYYY_MM_DD.dat
For example
FOO_2016_03_23.dat
Can't seem to get the right regex. I've tried the following:
pattern = re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
>>> []
pattern = re.compile(r'*(\d{4})_(\d{2})_(\d{2}).dat')
>>> sre_constants.error: nothing to repeat
Regex is certainly a weak point for me. Can anyone explain where I'm going wrong?
To get the files, I'm doing:
files = [f for f in os.listdir(directory) if pattern.match(f)]
PS, how would I allow for .dat and .DAT (case insensitive file extension)?
Thanks
You have two issues with your expression:
re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
The first one, as a previous comment stated, is that the . right before dat should be escaped by putting a backslash (\) before it. Otherwise Python treats it as a special character, because in a regex . means "any character" (by default, any character except a newline).
Besides that, your expression doesn't handle the uppercase extension. You should add a group with dat and DAT as the possible choices.
With both changes made, it should look like:
re.compile(r'(\d{4})_(\d{2})_(\d{2})\.(?:dat|DAT)')
As an extra note, I added ?: at the beginning of the group to make it non-capturing, so it does not show up in the match's groups.
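A quick sketch of what those two changes do in practice. The file names are invented, and search is used here so that the prefix does not get in the way:

```python
import re

fixed = re.compile(r'(\d{4})_(\d{2})_(\d{2})\.(?:dat|DAT)')
capturing = re.compile(r'(\d{4})_(\d{2})_(\d{2})\.(dat|DAT)')
unescaped = re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')

# The non-capturing group keeps the extension out of groups().
date_parts = fixed.search('FOO_2016_03_23.DAT').groups()
with_ext = capturing.search('FOO_2016_03_23.DAT').groups()

# An unescaped dot accepts any character in that position.
loose_hit = unescaped.search('FOO_2016_03_23xdat') is not None
strict_hit = fixed.search('FOO_2016_03_23xdat') is not None

print(date_parts)  # ('2016', '03', '23')
print(with_ext)    # ('2016', '03', '23', 'DAT')
print(loose_hit)   # True
print(strict_hit)  # False
```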
Use pattern.search() instead of pattern.match().
pattern.match() always matches from the start of the string (which includes the PREFIX).
pattern.search() searches anywhere within the string.
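Here is a rough comparison of the options on a hard-coded list of names (the list stands in for os.listdir(directory) and is made up for the example). It also shows fullmatch with an explicit prefix, as one possible way to avoid accepting names with extra trailing text, and re.IGNORECASE for the .dat/.DAT part of the question.

```python
import re

# Stand-in for os.listdir(directory); these names are assumptions.
names = ['FOO_2016_03_23.dat', 'BAR_2015_12_01.DAT',
         'notes.txt', 'FOO_2016_03_23.dat.bak']

pattern = re.compile(r'\d{4}_\d{2}_\d{2}\.dat', re.IGNORECASE)

# match() is anchored at the start, where the prefix sits: no hits.
with_match = [n for n in names if pattern.match(n)]

# search() finds the date part anywhere, including in the .bak file.
with_search = [n for n in names if pattern.search(n)]

# fullmatch() with the prefix spelled out accepts only complete names.
whole_name = re.compile(r'[a-z]+_\d{4}_\d{2}_\d{2}\.dat', re.IGNORECASE)
with_fullmatch = [n for n in names if whole_name.fullmatch(n)]

print(with_match)      # []
print(with_search)     # three names, including 'FOO_2016_03_23.dat.bak'
print(with_fullmatch)  # ['FOO_2016_03_23.dat', 'BAR_2015_12_01.DAT']
```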
Does this do what you want?
>>> import re
>>> pattern = r'\A[a-z]+_\d{4}_\d{2}_\d{2}\.dat\Z'
>>> string = 'FOO_2016_03_23.dat'
>>> re.search(pattern, string, re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 18), match='FOO_2016_03_23.dat'>
It appears to match the format of the string you gave as an example.
The following should match what you requested.
[^_]+[_]\d{4}[_]\d{2}[_]\d{2}[\.]\w+
I recommend using https://regex101.com/ (for Python regular expressions) or http://regexr.com/ (for JavaScript regular expressions) in the future if you want to validate your regular expressions.
If I have a given string s in Python, is it possible to easily check if a regex matches the string starting at a specific position i in the string?
I would rather not slice the entire string from i to the end as it doesn't seem very scalable (ruling out re.match I think).
re.match doesn't support this directly. However, if you pre-compile your regular expression with re.compile (often a good idea anyway), the compiled pattern's match and search methods both take an optional pos parameter. From the documentation:
The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.
Example:
import re
s = 'this is a test 4242 did you get it'
pat = re.compile('[a-zA-Z]+ ([0-9]+)')
print(pat.match(s, 10).group(0))
Output:
test 4242
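The caveat about '^' in the quoted documentation is easy to miss, so here is a small sketch of it. The anchored pattern is an extra example added only to show the difference between passing pos and slicing:

```python
import re

s = 'this is a test 4242 did you get it'
pat = re.compile(r'[a-zA-Z]+ ([0-9]+)')

# Matching starts at index 10; spans are relative to the full string.
m = pat.match(s, 10)
print(m.group(0))  # test 4242
print(m.group(1))  # 4242
print(m.span())    # (10, 19)

# '^' still refers to the real start of the string, not to pos.
anchored = re.compile(r'^test')
at_pos = anchored.match(s, 10)
in_slice = anchored.match(s[10:])
print(at_pos)            # None
print(in_slice.group())  # test
```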
Although re.match does not support this, the new regex module (intended to replace the re module) has a treasure trove of new features, including pos and endpos arguments for search, match, sub, and subn. Although not official yet, the regex module can be pip installed and works for Python versions 2.5 through 3.4. Here's an example:
>>> import regex
>>> regex.match(r'\d+', 'abc123def')
>>> regex.match(r'\d+', 'abc123def', pos=3)
<regex.Match object; span=(3, 6), match='123'>
>>> regex.match(r'\d+', 'abc123def', pos=3, endpos=5)
<regex.Match object; span=(3, 5), match='12'>
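For what it's worth, the same three calls can also be reproduced with the standard re module as long as the pattern is compiled first, since a compiled pattern's match accepts pos and endpos. A minimal sketch, passing them positionally:

```python
import re

digits = re.compile(r'\d+')

no_pos = digits.match('abc123def')           # tries at index 0 only
from_3 = digits.match('abc123def', 3)        # pos=3
from_3_to_5 = digits.match('abc123def', 3, 5)  # pos=3, endpos=5

print(no_pos)               # None
print(from_3.span())        # (3, 6)
print(from_3_to_5.group())  # 12
```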