Python: regex: find if exists, else ignore - python

I need help with the re module. I have this pattern:
pattern = re.compile(r'''first_condition\((.*)\)
extra_condition\((.*)\)
testing\((.*)\)
other\((.*)\)''', re.UNICODE)
This is what happens when I run the regex on the following text:
text = '''first_condition(enabled)
extra_condition(disabled)
testing(example)
other(something)'''
result = pattern.findall(text)
print(result)
[('enabled', 'disabled', 'example', 'something')]
But if one or two lines are missing, the regex returns an empty list. E.g. my text is:
text = '''first_condition(enabled)
other(something)'''
What I want to get:
[('enabled', '', '', 'something')]
I could do it in several commands, but I think that would be slower than doing it in one regex. The original code uses sed, so it is very fast. I could keep using sed, but I need a cross-platform way to do it. Is this possible? Thanks!
P.S. It would also be great if the order of the lines could be free, not fixed:
text = '''other(something)
first_condition(enabled)'''
must return exactly the same result:
[('enabled', '', '', 'something')]

I would parse it to a dictionary first:
import re
keys = ['first_condition', 'extra_condition', 'testing', 'other']
d = dict(re.findall(r'^(.*)\((.*)\)$', text, re.M))
result = [d.get(key, '') for key in keys]
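As a rough, self-contained sketch of the dictionary approach (the helper name extract and the sample text are illustrative assumptions, not part of the original code), the same lookup copes with missing lines and with lines in any order:

```python
import re

KEYS = ['first_condition', 'extra_condition', 'testing', 'other']

def extract(text, keys=KEYS):
    # Collect every "name(value)" line into a dict, then read the
    # keys back in the fixed order, defaulting to '' when absent.
    found = dict(re.findall(r'^(\w+)\((.*)\)$', text, re.M))
    return tuple(found.get(key, '') for key in keys)

text = '''other(something)
first_condition(enabled)'''
print([extract(text)])  # [('enabled', '', '', 'something')]
```

Because the values are looked up by name, the order of the lines in the input no longer matters.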

Use a non-capturing group for the optional part, and make the group optional by putting a question mark after it.
Example:
pat = re.compile(r'a\(([^)]+)\)(?:b\((?P<bgr>[^)]+)\))?')
Sorry but I can't test this right now.
The above requires a string like a(foo) and grabs the text in the parens as group 1.
Then it optionally matches a string like b(foo), and if it is matched it will be saved as a named group with the name bgr.
Note that I didn't use .* to match inside the parens but [^)]+. This definitely stops matching when it reaches the closing paren, and requires at least one character. You could use [^)]* if the parens can be empty.
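A minimal sketch of how such an optional group behaves, using the pattern with its non-capturing group closed before the question mark (the sample strings are made up for illustration):

```python
import re

# a(...) is required; b(...) is wrapped in an optional non-capturing group.
pat = re.compile(r'a\(([^)]+)\)(?:b\((?P<bgr>[^)]+)\))?')

with_b = pat.match('a(foo)b(bar)')
without_b = pat.match('a(foo)')

print(with_b.group(1), with_b.group('bgr'))        # foo bar
print(without_b.group(1), without_b.group('bgr'))  # foo None
```

When the optional part is absent, the named group is simply None rather than causing the whole match to fail.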
These patterns are getting complicated so you might want to use verbose patterns with comments.
To have several optional patterns that might appear in any order, put them all inside a non-capturing group and separate them with vertical bars. You will need to use named groups because you won't know the order. Put an asterisk after the non-capturing group to allow any number of the alternative patterns to be present (including zero if none are present).
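The any-order idea could be sketched roughly as follows. The group names (first, extra, testing, other), the helper extract_any_order, and the choice to allow whitespace between entries are assumptions made for this illustration; in Python's re, a named group inside a repeated group keeps the value of its last successful match.

```python
import re

ANY_ORDER = re.compile(r'''
    (?:
        first_condition\((?P<first>[^)]*)\)   # each alternative has its own named group
      | extra_condition\((?P<extra>[^)]*)\)
      | testing\((?P<testing>[^)]*)\)
      | other\((?P<other>[^)]*)\)
      | \s                                    # newline or other whitespace between entries
    )*                                        # any number of entries, including zero
''', re.VERBOSE)

def extract_any_order(text):
    m = ANY_ORDER.fullmatch(text)
    if m is None:
        return None  # the text contains something other than the known entries
    # Groups that never matched are None; report them as ''.
    return tuple(m.group(name) or '' for name in ('first', 'extra', 'testing', 'other'))

print(extract_any_order('other(something)\nfirst_condition(enabled)'))
# ('enabled', '', '', 'something')
```

This is stricter than the dictionary approach: any unexpected line makes the whole match fail, which may or may not be what you want.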

Related

Regex Puzzle: Match a pattern only if it is between two $$ without indefinite look behind

I am writing a snippet for the Vim plugin UltiSnips which will trigger on a regex pattern (as supported by Python 3). To avoid conflicts I want to make sure that my snippet only triggers when contained somewhere inside of $$___$$. Note that the trigger pattern might contain an indefinite string in front or behind it. So as an example I might want to match all "a" in "$$ccbbabbcc$$" but not "ccbbabbcc". Obviously this would be trivial if I could simply use indefinite look behind. Alas, I may not as this isn't .NET and vanilla Python will not allow it. Is there a standard way of implementing this kind of expression? Note that I will not be able to use any python functions. The expression must be a self-contained trigger.
If what you are looking for only occurs once between the '$$', then:
\$\$.*?(a)(?=.*?\$\$)
The pattern breaks down as follows:
\$\$ Matches '$$'
.*? Matches 0 or more characters non-greedily
(a) Captures the 'a'
(?=.*?\$\$) The match must be followed by 0 or more arbitrary characters and then '$$'
This allows you to match all 3 a characters in the following example.
The code:
import re
s = "$$ccbbabbcc$$xxax$$bcaxay$$"
print(re.findall(r'\$\$.*?(a)(?=.*?\$\$)', s))
Prints:
['a', 'a', 'a']
The following should work:
re.findall(r"\${2}.+\${2}", stuff)
Breakdown:
\${2} looks for two '$'
.+ then looks for one or more of any character
\${2} then looks for two '$' again
I believe this regex would work to match the a within the $$:
text = '$$ccbbabbcc$$ccbbabbcc'
re.findall(r'\${2}.*(a).*\${2}', text)
# prints
['a']
Alternatively:
A simple approach (requiring two checks instead of one regex) would be to first find all parts enclosed in your quoting text, then check whether your search string is present within them.
Example:
text = '$$ccbbabbcc$$ccbbabbcc'
search_string = 'a'
parts = re.findall(r'\${2}.+\${2}', text)
[p for p in parts if search_string in p]
# prints
['$$ccbbabbcc$$']
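One caveat with the greedy .+ is that it runs from the first '$$' to the last one, so several quoted spans on a line are merged into one. A possible variant of the two-step approach, assuming the '$$' markers come in opening/closing pairs (the helper name spans_containing is made up here), uses a non-greedy group instead:

```python
import re

def spans_containing(text, needle):
    # Non-greedy (.*?) pairs each opening '$$' with the nearest closing '$$'.
    spans = re.findall(r'\$\$(.*?)\$\$', text)
    return [span for span in spans if needle in span]

text = '$$ccbbabbcc$$xxax$$bcaxay$$'
print(spans_containing(text, 'a'))  # ['ccbbabbcc', 'bcaxay']
```

Here 'xxax' is skipped because it sits between a closing '$$' and the next opening '$$'.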

Python Regex Skipping Optional Groups

I am trying to extract a doctor's name and title from a string. If "dr" is in the string, I want it to use that as the title and then use the next word as the doctor's name. However, I also want the regex to be compatible with strings that do not have "dr" in them. In that case, it should just match the first word as the doctor's name and assume no title.
I have come up with the following regex pattern:
pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)
As I understand it, this should optionally match the letters "dr" (with or without a following period) and then a space, followed by a series of letters, case-insensitive. The problem is, it seems to only pick up the optional "dr" title if it is at the beginning of the string.
import re
pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)
test1 = "Dr Joseph Fox"
test2 = "Joseph Fox"
test3 = "Optometry by Dr Joseph Fox"
print(pattern.search(test1).groups())
print(pattern.search(test2).groups())
print(pattern.search(test3).groups())
The code returns this:
('Dr ', 'Joseph')
(None, 'Joseph')
(None, 'Optometry')
The first two scenarios make sense to me, but why does the third not find the optional "Dr"? Is there a way to make this work?
You're seeing this behavior because search() returns the leftmost match it can find, and your pattern can already match at the very start of the string. As a result, your regex accepts only the first word of your third string, with the optional first group matching nothing. You can see this by using the findall regex function:
>>> print(pattern.findall(test3))
[('', 'Optometry'), ('', ''), ('', 'by'), ('', ''), ('Dr ', 'Joseph'), ('', ''), ('', 'Fox'), ('', '')]
It's immediately obvious that 'Dr Joseph' was successfully found, but just wasn't the first matching part of your string.
In my experience, trying to coerce regexes to express/capture multiple cases is often asking for inscrutable regexes. Specifically answering your question, I'd prefer to run the string through one regex requiring the 'Dr' title, and if I fail to get any matches, just split on spaces and take the first word (or however you want to go about getting the first word).
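The two-step fallback described above could look roughly like this. The function name parse_doctor, the TITLE_RE pattern, and the (title, name) return shape are assumptions made for the sketch:

```python
import re

# Step 1 pattern: a required "Dr" / "Dr." title followed by the name.
TITLE_RE = re.compile(r'\b(dr\.?)\s+([a-z]+)', re.IGNORECASE)

def parse_doctor(s):
    m = TITLE_RE.search(s)
    if m:
        return m.group(1), m.group(2)
    # Step 2: no title found, fall back to the first word as the name.
    words = s.split()
    return None, (words[0] if words else None)

for s in ("Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"):
    print(parse_doctor(s))
# ('Dr', 'Joseph')
# (None, 'Joseph')
# ('Dr', 'Joseph')
```

Keeping the two cases separate makes each one easy to read and to change independently.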
Regular expression engines match greedily from left to right. In other words: there is no "best" match and the first match will always be returned. You can do a global search, though...check out re.findall().
Your regex basically accepts any word, therefore it will be difficult to choose which one is the name of the doctor even after using findall if the dr is not present.
Is the re.IGNORECASE really important? Are you only interested in the name of the doctor or both name and surname?
I would recommend using a regex that matches two words starting with uppercase and only one space in between, keeping the optional dr before them.
If re.IGNORECASE is really important, maybe it is better to first search for dr, and if that is unsuccessful, store the first word as the name, as proposed before.
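Look at the (?<=...) lookbehind syntax in the Python regex documentation.
Python's re module only accepts fixed-width lookbehinds, so (?<=DR\.? ) raises an error. Use one fixed-width lookbehind per variant instead, which matches only the name that follows the title:
(?:(?<=DR )|(?<=DR\. ))([A-Z]+)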
Your pattern only picks up Dr when the match starts with it; it doesn't look for Dr further along in the string.
Try:
pattern = re.compile(r'(.*DR\.? )?([A-Z]*)', re.IGNORECASE)
Note that the first group then also contains any text that comes before the title.
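If only the title itself should end up in the first group, one possible variation (an assumption of this sketch, not the answerer's pattern) moves the leading text into a non-capturing group and skips it lazily:

```python
import re

# (?:.*?\b(DR\.?) )? lazily skips any leading text up to an optional title;
# only the title itself is captured as group 1.
pattern = re.compile(r'(?:.*?\b(DR\.?) )?([A-Z]+)', re.IGNORECASE)

for s in ("Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"):
    print(pattern.search(s).groups())
# ('Dr', 'Joseph')
# (None, 'Joseph')
# ('Dr', 'Joseph')
```

When no title is present, the optional group is skipped entirely and the first word is taken as the name.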

Using regex to find multiple matches on the same line

I need to build a program that can read multiple lines of code, and extract the right information from each line.
Example text:
no matches
one match <'found'>
<'one'> match <found>
<'three'><'matches'><'found'>
For this case, the program should detect <'found'>, <'one'>, <'three'>, <'matches'> and <'found'> as matches because they all have "<" and "'".
However, I cannot work out a system using regex to account for multiple matches on the same line. I was using something like:
re.search('^<.*>$')
But if there are multiple matches on one line, the extra "<'" and "'>" are taken as part of the .*, without being counted as separate matches. How do I fix this?
This works -
>>> r = re.compile(r"\<\'.*?\'\>")
>>> r.findall(s)
["<'found'>", "<'one'>", "<'three'>", "<'matches'>", "<'found'>"]
Use findall instead of search:
re.findall(r"<'.*?'>", s)
You can use re.findall and match on non > characters inside of the angle brackets:
>>> re.findall('<[^>]*>', "<'three'><'matches'><'found'>")
["<'three'>", "<'matches'>", "<'found'>"]
Non-greedy quantifier '?' as suggested by anubhava is also an option.
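Putting the answers together, a small sketch that runs findall over each line of the example text might look like this. The pattern below is one possible choice: it requires the quotes inside the angle brackets, so <found> on the third line is not reported.

```python
import re

text = """no matches
one match <'found'>
<'one'> match <found>
<'three'><'matches'><'found'>"""

# Require <' ... '> and forbid quotes and '>' inside, so adjacent
# matches on one line stay separate.
pattern = re.compile(r"<'[^'>]*'>")

results = [pattern.findall(line) for line in text.splitlines()]
for found in results:
    print(found)
# []
# ["<'found'>"]
# ["<'one'>"]
# ["<'three'>", "<'matches'>", "<'found'>"]
```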

Replacing only a specific group within a matched expression

I'm parsing text in which I would like to make changes, but only to specific lines.
I have a regular expression pattern that catches the entire line if it's a line of interest, and within the expression I have a remembered group of the thing I would actually like to change.
I would like to be able to change only the specific group within a matched expression, and not replace the entire expression (that would replace the entire line).
For example:
I have a textual file with:
This is a completely silly example.
something something "this should be replaced" bla.
more uninteresting stuff
And I have the regex:
pattern = '.*("[^"]*").*'
Then I catch the second line, but I would like to replace only the "this should be replaced" matched group within the line, not the entire line (so using re.sub(pattern, replacement, string) won't do the job).
Thanks in advance!
What's wrong with
r'"[^"]+"'
The .* before and after the quoted expression can also match the empty string, so you don't need them at all.
re.sub(r'"[^"]+"', 'DEF', 'abc"def"ghi')
# returns 'abcDEFghi'
and your example text will result in:
'This is a completely silly example.\nsomething something DEF bla.\nmore uninteresting stuff'
eumiro's answer is best in this very case, but for the sake of completeness, if you really need to perform some more complicated processing of the pre, inside, and post text, you can simply use multiple groups, like:
'(.*)("[^"]*")(.*)'
(the first group provides the text before, the third the text after; do what you like with them)
Also, you may prefer to forbid " in the pre-part:
'([^"]*)("[^"]*")(.*)'
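As an illustration of the multiple-group idea, the surrounding groups can be written back unchanged in the replacement string of re.sub. The replacement text "REPLACED" is a placeholder chosen for this sketch:

```python
import re

text = ('This is a completely silly example.\n'
        'something something "this should be replaced" bla.\n'
        'more uninteresting stuff')

# \g<1> and \g<3> put the text before and after the quoted part back,
# so only the middle group is actually changed.
result = re.sub(r'([^"]*)("[^"]*")(.*)', r'\g<1>"REPLACED"\g<3>', text)
print(result)
```

The lines without a quoted part never match the pattern and are left as they are.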
re.match and re.search return a "match object". (See the python documentation). Supposing you want to replace group 3 in your RE, pull out its start/end indices and replace the substring directly:
mobj = re.match(pattern, line)
start = mobj.start(3)
end = mobj.end(3)
line = line[:start] + replacement + line[end:]
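The index-splicing approach can be wrapped in a small helper. The name replace_group and its signature are hypothetical; the example uses group 1 because the pattern from the question has a single capturing group:

```python
import re

def replace_group(pattern, line, replacement, group=1):
    m = re.match(pattern, line)
    if m is None or m.group(group) is None:
        return line  # not a line of interest: leave it untouched
    # Splice the replacement in place of the group's span only.
    return line[:m.start(group)] + replacement + line[m.end(group):]

pattern = r'.*("[^"]*").*'
line = 'something something "this should be replaced" bla.'
print(replace_group(pattern, line, '"new text"'))
# something something "new text" bla.
```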

Extracting content BETWEEN a regex python?

Is there a simple method to pull content between a regex? Assume I have the following sample text
SOME TEXT [SOME MORE TEXT] value="ssss" SOME MORE TEXT
My regex is:
compiledRegex = re.compile('\[.*\] value=("|\').*("|\')')
This will obviously return the entire [SOME MORE TEXT] value="ssss", however I only want ssss to be returned since that's what I'm looking for
I can obviously define a parser function but I feel as if python provides some simple pythonic way to do such a task
This is what capturing groups are designed to do.
compiledRegex = re.compile(r'\[.*\] value=(?:"|\')(.*)(?:"|\')')
matches = compiledRegex.search(sampleText) # search, not match: the pattern does not begin at the start of the text
capturedGroup = matches.group(1) # grab contents of first group
The ?: inside the old groups (the parentheses) means that the group is now a non-capturing group; that is, it won't be accessible as a group in the result. I converted them to keep the output simpler, but you can leave them as capturing groups if you prefer (but then you have to use matches.group(2) instead, since the first quote would be the first captured group).
Your original regex is too greedy: r'.*\]' won't stop at the first ']' and the second '.*' won't stop at '"'. To stop at a character c you could use '[^c]*' or '.*?':
regex = re.compile(r"""\[[^]]*\] value=("|')(.*?)\1""")
Example
m = regex.search("""SOME TEXT [SOME MORE TEXT] value="ssss" SOME MORE TEXT""")
print(m.group(2))
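A short sketch of why the \1 backreference is useful here: the closing quote must be the same character as the opening one, so the other kind of quote may appear inside the value. The helper name extract_value and the second sample string are assumptions for illustration:

```python
import re

VALUE_RE = re.compile(r"""\[[^]]*\] value=("|')(.*?)\1""")

def extract_value(text):
    m = VALUE_RE.search(text)
    # Group 1 is the quote character, group 2 the content between the quotes.
    return m.group(2) if m else None

print(extract_value('SOME TEXT [SOME MORE TEXT] value="ssss" SOME MORE TEXT'))  # ssss
print(extract_value("""[x] value='say "hi"' rest"""))  # say "hi"
```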
