I have a more challenging task, but first I am faced with this issue. Given a string s, I want to extract all the groups of characters marked by some delimiter, e.g. parentheses. How can I accomplish this using regular expressions (or any Pythonic way)?
>>> import re
>>> s = '(3,1)-[(7,2),1,(a,b)]-8a'
>>> pattern = r'(\(.+\))'
>>> re.findall(pattern, s)  # EDITED: findall vs. search
['(3,1)-[(7,2),1,(a,b)']
# Desired result
['(3,1)', '(7,2)', '(a,b)']
Use findall() instead of search(). The former finds all occurrences; the latter finds only the first.
Use the non-greedy ? operator. Otherwise, you'll find a match starting at the first ( and ending at the final ).
Note that regular expressions aren't a good tool for finding nested expressions like: ((1,2),(3,4)).
import re
s = '(3,1)-[(7,2),1,(a,b)]-8a'
pattern = r'(\(.+?\))'
print(re.findall(pattern, s))
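For the nested case mentioned above, a small depth counter is one way to avoid regular expressions altogether. This is only a sketch: the function name top_level_groups and the sample string are made up for illustration, and it assumes the delimiters are single characters and reasonably balanced.

```python
def top_level_groups(text, open_ch='(', close_ch=')'):
    """Return the outermost delimited groups, keeping any nested content."""
    groups = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i  # remember where the outermost group begins
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:i + 1])
    return groups


print(top_level_groups('((1,2),(3,4))-(5,6)'))
```

On a flat string such as the one in the question, it returns the same three groups as the non-greedy pattern.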
Use re.findall()
import re
data = '(3,1)-[(7,2),1,(a,b)]-8a'
found = re.findall(r'(\(\w,\w\))', data)  # raw string; matches only single-character items
print(found)
Output:
['(3,1)', '(7,2)', '(a,b)']
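If the items between the parentheses can be longer than one character, a negated character class is a possible alternative. The sketch below is not from the answers above; it assumes you want the innermost groups, since [^()]* refuses to cross another parenthesis.

```python
import re

s = '(3,1)-[(7,2),1,(a,b)]-8a'

# \( ... \) with anything except parentheses in between
flat_groups = re.findall(r'\([^()]*\)', s)
print(flat_groups)

# On nested input, only the innermost pairs are returned
innermost = re.findall(r'\([^()]*\)', '((1,2),(3,4))')
print(innermost)
```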
In regex alternation, is there a way to retrieve which alternation was matched? I just need the first alternation match, not all the alternations that match.
For example, I have a regex like this
pattern = r'(abc.*def|mno.*pqr|mno.*pqrt|.....)'
string = 'mnoxxxpqrt'
I want the output to be 'mno.*pqr'
How should I write the regex statement? Python language is preferred.
To do this efficiently without any iterations, you can put your desired sub-patterns in a list and join them into one alternation pattern with each sub-pattern enclosed in a capture group (so the resulting pattern looks like (abc.*def)|(mno.*pqr) instead of (abc.*def|mno.*pqr)). You can then obtain the group number of the sub-pattern with the Match object's lastindex attribute and in turn obtain the matching sub-pattern from the original list of sub-patterns:
import re
patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
pattern = '|'.join(map('({})'.format, patterns))
string = 'mno_foobar_pqrt'
print(pattern)
print(patterns[re.search(pattern, string).lastindex - 1])
This outputs:
(abc.*def)|(mno.*pqr)|(mno.*pqrt)
mno.*pqr
Demo: https://replit.com/#blhsing/JointBruisedMention
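A variation on the same idea, sketched here with named groups and the Match object's lastgroup attribute, avoids the index arithmetic. The group names (abc_def and so on) are invented labels for this example; any valid Python identifier would do.

```python
import re

# Hypothetical labels mapped to the sub-patterns from the question
named = {
    'abc_def': r'abc.*def',
    'mno_pqr': r'mno.*pqr',
    'mno_pqrt': r'mno.*pqrt',
}

# Builds (?P<abc_def>abc.*def)|(?P<mno_pqr>mno.*pqr)|...
pattern = '|'.join('(?P<{}>{})'.format(name, sub) for name, sub in named.items())

match = re.search(pattern, 'mno_foobar_pqrt')
winner = match.lastgroup          # name of the group that matched
print(winner, named[winner])
```

Looking the sub-pattern up by name keeps working even if the sub-patterns are reordered.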
You can use capture groups:
import re
string = 'abcxxxdef'
patterns = ['abc.*def', 'mno.*pqr']
match = re.match(r'((abc.*def)|(mno.*pqr))', string)
groups = match.groups()
# groups[0] is the outer group (the whole match); groups[1:] line up with patterns
for i in range(1, len(groups)):
    if groups[i] is not None:
        pattern = patterns[i - 1]
        break
print(pattern)
Result: abc.*def
Expressions inside round brackets are capture groups; they correspond to the 1st through last indices of the result. The 0th index is the whole match (here, the outer group).
You then need to find the index of the group that matched. The catch is that your patterns need to be defined beforehand.
Well you could iterate the terms in the regex alternation:
import re
string = 'abcxxxdef'
pattern = r'(abc.*def|mno.*pqr)'
terms = pattern[1:-1].split("|")  # strip the outer parentheses, then split on |
for term in terms:
    if re.search(term, string):
        print("MATCH => " + term)
This prints:
MATCH => abc.*def
The right answer to the question "How should I write the regex statement?" should actually be:
There is no known way to write the regex, using the pattern as given, so that the search result tells you which alternative triggered the match.
Since the given pattern cannot do this, the pattern has to be changed so that the requested information can be extracted from the match.
One way around this regex engine limitation is proposed below, but it requires an additional regex search and can fail for some special alternatives.
The code below allows simpler regex patterns without groups and works the "other way around": it checks which of the alternative patterns matches the string found by the entire regex:
import re
pattern = r'abc.*def|mno.*pqr|mno.*pqrt'
text = 'mnoxxxpqrt'
match = re.match(pattern,text)[0]
print(next(p for p in pattern.split('|') if re.match(p, match)))
It can fail when the matched string, taken on its own, is no longer a match for the individual alternative. This can happen, for example, when the pattern uses a context-dependent assertion such as the non-word-boundary \B (as mentioned in the comments by Kelly Bundy).
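To make that failure mode concrete, here is a small sketch with a made-up pattern and text (neither comes from the question). The \B after foo is satisfied inside the full text, because an x follows, but not when the matched substring is tested in isolation.

```python
import re

pattern = r'foo\B|bar'   # hypothetical alternatives
text = 'foox'            # hypothetical input

# Inside 'foox', foo is followed by a word character, so \B holds
matched = re.match(pattern, text)[0]

# Re-testing the bare substring: 'foo' now ends at a word boundary
hits = [p for p in pattern.split('|') if re.match(p, matched)]

print(repr(matched), hits)   # no alternative is recognised
```

With next() in place of the list comprehension, as in the code above, the empty result would surface as a StopIteration.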
An alternative that does not fail is to run the search with a modified regex pattern. Below is an approach that uses a dictionary to define the alternatives and a function that returns the matched group:
import re
dct_alts = {1: r'(abc.*def)', 2: r'(mno.*pqr)', 3: r'(mno.*pqrt)'}
#           ^-- the dictionary key is the index of the matching group in the found match.
text = 'mnoxxxpqrt'
def get_matched_group(dct_alts, text):
    pattern = '|'.join(dct_alts.values())
    re_match = re.match(pattern, text)
    return dct_alts[re_match.lastindex]
print(get_matched_group(dct_alts, text))
prints
(mno.*pqr)
For the sake of completeness, here is a function that returns a list of all the alternatives that match (not only the first one):
import re
lst_alts = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
text = 'mnoxxxpqrt'
def get_all_matched_groups(lst_alts, text):
    matches = []
    for pattern in lst_alts:
        re_match = re.match(pattern, text)
        if re_match:
            matches.append(pattern)
    return matches
print(get_all_matched_groups(lst_alts, text))
prints
['mno.*pqr', 'mno.*pqrt']
I want to search for some text in a line. As an example, text is:
{'id: 'id-::blabla1::blabal2-A'}
or
{'id: 'id-::blabla3::blabal4-B'}
or
{'id: 'id-::blabla5::blabal6-c'}
I want to find this text: A or B or C. How do I build a regular expression in python to do this?
I think you mean a dictionary although you miss a ' in both cases.
I guess something like this is what you're looking for:
import re
data = {'id': 'id-::blabla1::blabal2-A'}  # renamed from dict to avoid shadowing the built-in
test = re.sub(r'.+?::.+?::.+?-(\w)', r'\1', data['id'])  # test == 'A'
The regex could be simplified, but this is all I can do for you based on this info.
You can start with this one:
{'id: 'id-(?::.*?){2}-([a-zA-Z])'}
See: https://regex101.com/r/XnlUTi/2
([a-zA-Z])
This is the capture group that returns A, B, or c.
import re
content = "{'id: 'id-::blabla1::blabal2-A'}"
pattern = re.compile(r"{'id: 'id-::blabla.*?::blabal.*?-(.*?)'}", re.S)
print(re.findall(pattern, content))
.*? matches any run of characters, as few as possible; (.*?) captures the part you want.
re.findall(pattern, content) returns a list of the captured strings, one for every match of the regular expression.
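Since the wanted letter always sits right before the closing '} in the three sample lines, a shorter pattern anchored at the end of the line may be enough. This is only a sketch under that assumption; the name tail is made up for the example.

```python
import re

lines = [
    "{'id: 'id-::blabla1::blabal2-A'}",
    "{'id: 'id-::blabla3::blabal4-B'}",
    "{'id: 'id-::blabla5::blabal6-c'}",
]

# A dash, one letter (captured), then the closing quote and brace at end of line
tail = re.compile(r"-([A-Za-z])'\}$")

letters = [tail.search(line).group(1) for line in lines]
print(letters)
```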
I have this regex:
"\w{4}[A-D]{1}[a-d]*\s*"
How can I repeat the part [A-D]{1}[a-d]*\s* several times, with something like *?
So if I have the expression:
"Bed0Dabc Babc Cabb99rrAbaaaa Daa6ab"
the regex will give me:
"Bed0Dabc Babc Cabb"
"99rrAbaaaa Daa"
Your regex is invalid and lacks a "\" at the start; your desired output is also invalid, and the second string should be "99rrAbaaaa Daa".
I believe what you mean is groups. This is a pretty basic concept, though, so you should probably read more about regular expressions before using them.
The desired regex:
\w{4}([A-D][a-d]*\s*)+
You should add the \s to the set of characters.
import re
data = 'Bed0Dabc Babc Cabb99rrAbaaaa Daa6ab'
pattern = r'\w{4}(?:[A-D][a-d\s]*)+'
matches = re.findall(pattern, data)
The result:
['Bed0Dabc Babc Cabb', '99rrAbaaaa Daa']
The ?: at the start of the group defines a non-capturing group. If you omit it, re.findall returns only the text last captured by the group, and your result will look like this:
['Cabb', 'Daa']
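If you want to keep the capturing group and still get the whole match, re.finditer gives access to both. A brief sketch using the same data:

```python
import re

data = 'Bed0Dabc Babc Cabb99rrAbaaaa Daa6ab'
capturing = re.compile(r'\w{4}([A-D][a-d\s]*)+')

# group(0) is the entire match; group(1) is the last repetition of the group
full = [m.group(0) for m in capturing.finditer(data)]
last = [m.group(1) for m in capturing.finditer(data)]

print(full)
print(last)
```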
My regular expression goal:
"If the sentence has a '#' in it, group all the stuff to the left of the '#' and group all the stuff to the right of the '#'. If the sentence doesn't have a '#', then just return the entire sentence as one group"
Examples of the two cases:
A) '120x4#Words' -> ('120x4', 'Words')
B) '120x4@9.5' -> ('120x4@9.5')
I made a regular expression that parses case A correctly
(.*)(?:#(.*))
# List the groups found
>>> r.groups()
(u'120x4', u'words')
But of course this won't work for case B -- I need to make "# and everything to the right of it" optional
So I tried to use the '?' "zero or one" operator on that second grouping to indicate it's optional.
(.*)(?:#(.*))?
But it gives me bad results. The first grouping eats up the entire string.
# List the groups found
>>> r.groups()
(u'120x4#words', None)
Guess I'm either misunderstanding the none-or-one '?' operator and how it works on groupings or I am misunderstanding how the first group is acting greedy and grabbing the entire string. I did try to make the first group 'reluctant', but that gave me a total no-match.
(.*?)(?:#(.*))?
# List the groups found
>>> r.groups()
(u'', None)
Simply use the standard str.split function:
s = '120x4#Words'
x = s.split( '#' )
If you still want a regex solution, use the following pattern:
([^#]+)(?:#(.*))?
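A quick check of that pattern, as a sketch. The second sample, '120x4', is a made-up input without a '#'. The block also shows one possible reason the lazy attempt in the question returned an empty group: without an anchor, (.*?) is free to match nothing and stop there. Adding $ forces it to extend to the end of the string.

```python
import re

negated = re.compile(r'([^#]+)(?:#(.*))?')
anchored_lazy = re.compile(r'(.*?)(?:#(.*))?$')   # the lazy version plus an end anchor

results = {}
for sample in ('120x4#Words', '120x4'):
    results[sample] = (negated.match(sample).groups(),
                       anchored_lazy.match(sample).groups())
    print(sample, results[sample])
```

In both patterns the second group is None when there is no '#', so callers may want to filter it out.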
(.*?)#(.*)|(.+)
This should work. See demo:
http://regex101.com/r/oC3nN4/14
Use re.split:
>>> import re
>>> a='120x4#Words'
>>> re.split('#',a)
['120x4', 'Words']
>>> b='120x4@9.5'
>>> re.split('#',b)
['120x4@9.5']
>>>
Here's a verbose re solution. But, you're better off using str.split.
import re
REGEX = re.compile(r'''
\A
(?P<left>.*?)
(?:
[#]
(?P<right>.*)
)?
\Z
''', re.VERBOSE)
def parse(text):
    match = REGEX.match(text)
    if match:
        return tuple(filter(None, match.groups()))
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
Better solution
def parse(text):
    return text.split('#', maxsplit=1)
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
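If you want tuples back, as in the examples in the question, str.partition is another option. This is a sketch; the name parse_partition and the input '120x4' are made up for illustration.

```python
def parse_partition(text, sep='#'):
    # partition splits on the first separator only and reports whether it was found
    left, found, right = text.partition(sep)
    return (left, right) if found else (left,)


print(parse_partition('120x4#Words'))
print(parse_partition('120x4'))
```

Because partition stops at the first separator, any later '#' stays in the right-hand part.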
Is there a way to see if a line contains words that matches a set of regex pattern?
If I have [regex1, regex2, regex3], and I want to see if a line matches any of those, how would I do this?
Right now, I am using re.findall(regex1, line), but it only matches 1 regex at a time.
You can use the built in functions any (or all if all regexes have to match) and a Generator expression to cycle through all the regex objects.
any(regex.match(line) for regex in [regex1, regex2, regex3])
(or any(re.match(regex_str, line) for regex_str in [regex_str1, regex_str2, regex_str3]) if the regexes are not pre-compiled regex objects, of course)
However, that will be inefficient compared to combining your regexes in a single expression. If this code is time- or CPU-critical, you should try instead to compose a single regular expression that encompasses all your needs, using the special | regex operator to separate the original expressions.
A simple way to combine all the regexes is to use the string join method:
re.match("|".join([regex_str1, regex_str2, regex_str3]), line)
A warning about combining the regexes in this way: It can result in wrong expressions if the original ones already do make use of the | operator.
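One cautious way to combine them is to wrap each expression in a non-capturing group before joining, so that each original stays a self-contained alternative. The three sample expressions below are invented for the example, and the sketch uses search rather than match so that a hit anywhere in the line counts.

```python
import re

# Hypothetical expressions; the second one already contains a | of its own
regex_strs = [r'^\d+$', r'cat|dog', r'foo(bar)?']

# (?:...) keeps each original expression isolated inside the alternation
combined = re.compile('|'.join('(?:{})'.format(r) for r in regex_strs))


def matches_any(line):
    return combined.search(line) is not None


print(matches_any('12345'), matches_any('hot dog'), matches_any('nothing here'))
```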
Try this new regex: (regex1)|(regex2)|(regex3). This will match a line with any of the 3 regexes in it.
You could loop through the regex items and do a search.
import re
regexList = [regex1, regex2, regex3]
line = 'line of data'
gotMatch = False
for regex in regexList:
    s = re.search(regex, line)
    if s:
        gotMatch = True
        break
if gotMatch:
    doSomething()
# Quite new to Python, but I had the same problem. I made this to find all matches
# for multiple regular expressions.
import re
regex1 = r"your regex here"
regex2 = r"your regex here"
regex3 = r"your regex here"
regexList = [regex1, regex2, regex3]
your_string = "your string here"
found_regex_list = []  # make a list to add them to
for x in regexList:
    # findall returns an empty list when nothing matches, so no separate check is needed
    found_regex_list.extend(re.findall(x, your_string))