I'm fairly inexperienced with regex, but I need one to match the parameter of a function. This function will appear multiple times in the string, and I would like to return a list of all parameters.
The regex must match:
Alphanumeric and underscore
Inside quotes directly inside parenthesis
After a specific function name
Here's an example string:
Generic3(p, [Generic3(g, [Atom('_xyx'), Atom('y'), Atom('z_')]), Atom('x_1'), Generic2(f, [Atom('x'), Atom('y')])])
and I would like this as output:
['_xyx', 'y', 'z_', 'x_1', 'x', 'y']
What I have so far:
(?<=Atom\(')[\w|_]*
I'm calling this with:
import re
s = "Generic3(p, [Generic3(g, [Atom('x'), Atom('y'), Atom('z')]), Atom('x'), Generic2(f, [Atom('x'), Atom('y')])])"
print(re.match(r"(?<=Atom\(')[\w|_]*", s))
But this just prints None. I feel like I'm nearly there, but I'm missing something, maybe on the Python side to actually return the matches.
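Your regex is close. \w already matches the underscore, so the character class can be reduced to \w+. The real problem is re.match, which only matches at the start of the string; use re.findall to collect every occurrence:
s = "Generic3(p, [Generic3(g, [Atom('_xyx'), Atom('y'), Atom('z_')]), Atom('x_1'), Generic2(f, [Atom('x'), Atom('y')])])"
r = r"(?<=Atom\(')\w+"
final_data = re.findall(r, s)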
You can also try this:
import re
s = "Generic3(p, [Generic3(g, [Atom('_xyx'), Atom('y'), Atom('z_')]), Atom('x_1'), Generic2(f, [Atom('x'), Atom('y')])])"
new_data = re.findall(r"Atom\('(.*?)'\)", s)
Output:
['_xyx', 'y', 'z_', 'x_1', 'x', 'y']
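The None in the question comes from re.match, which only tries the pattern at position 0, whereas re.findall scans the whole string. A minimal sketch of the difference, reusing the example string from the question (the names pattern, anchored and params are only illustrative):

```python
import re

s = ("Generic3(p, [Generic3(g, [Atom('_xyx'), Atom('y'), Atom('z_')]), "
     "Atom('x_1'), Generic2(f, [Atom('x'), Atom('y')])])")

# \w already covers letters, digits and the underscore
pattern = re.compile(r"(?<=Atom\(')\w+")

# match() only tries position 0, where the lookbehind cannot succeed
anchored = pattern.match(s)

# findall() scans the whole string and returns every non-overlapping match
params = pattern.findall(s)

print(anchored)
print(params)
```

Using + instead of * also avoids empty matches for something like Atom('').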
I have strings with multiple parenthesized groups and want to remove every group that contains at least one digit.
I have tried the following. However, the first pattern removes everything from the first open parenthesis to the first close parenthesis that follows a digit. I also tried to restrict the match by excluding an open parenthesis, but that did not work either.
import pandas as pd
names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
data = pd.DataFrame(names, columns=['name'])
# regex=True is needed on recent pandas versions, where str.replace is literal by default
print(data.name.str.replace(r"\(.*?\d+.*?\)", "", regex=True))
# Output: ['d)', 'd(ab)', 'd']
print(data.name.str.replace(r"\((?!\().*[\d]+(?!\().*\)", "", regex=True))
# Output: ['d(', 'd', 'd']
# desired output: ['d', 'd(ab)', 'd(ab)']
This regex seems valid: \([^)\d]*?\d+[^)]*?\)+
>>> import re
>>> pattern = r'\([^)\d]*?\d+[^)]*?\)+'
>>> names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
>>> [re.sub(pattern, '', x) for x in names]
['d', 'd(ab)', 'd(ab)']
I don't know if there are more complex cases but for those that you've supplied and similar, it should do the trick.
Python's built-in re module does not support recursive patterns, but the third-party regex module does. Install it with:
pip install regex
Then you can write something like:
import regex
names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
pattern = r'\((?:[^()]*?\d[^()]*?|(?R))+\)'
print ([regex.sub(pattern, '', x) for x in names])
Output:
['d', 'd(ab)', 'd(ab)']
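If installing a third-party module is not an option, nesting can also be handled with the standard re module by stripping innermost groups repeatedly until nothing changes. This is only a sketch: the name strip_numeric_parens is made up for illustration, and it assumes that an empty pair "()" should be removed as well (that is how the shell left over from 'd((123))' disappears).

```python
import re

# An innermost pair of parentheses that is either empty or contains a digit.
INNERMOST = re.compile(r'\((?:[^()]*\d[^()]*)?\)')

def strip_numeric_parens(text):
    """Remove parenthesized groups containing a digit, innermost first."""
    previous = None
    while previous != text:
        previous = text
        # each pass removes one nesting level
        text = INNERMOST.sub('', text)
    return text

names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
cleaned = [strip_numeric_parens(x) for x in names]
print(cleaned)
```

The loop runs once per nesting level plus one final pass that detects no change.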
I need to replace the value inside a capture group of a regular expression with some arbitrary value; I've had a look at the re.sub, but it seems to be working in a different way.
I have a string like this one :
s = 'monthday=1, month=5, year=2018'
and I have a regex matching it with captured groups like the following :
regex = re.compile('monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})')
now I want to replace the group named d with aaa, the group named m with bbb and group named Y with ccc, like in the following example :
'monthday=aaa, month=bbb, year=ccc'
Basically, I want to keep all the non-captured text and substitute each captured group with some arbitrary value.
Is there a way to achieve the desired result?
Note
This is just an example; I could have other input regexes with a different structure but the same named capturing groups.
Update
Since it seems like most of the people are focusing on the sample data, I add another sample, let's say that I have this other input data and regex :
input = '2018-12-12'
regex = '((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))'
as you can see I still have the same number of capturing groups(3) and they are named the same way, but the structure is totally different... What I need though is as before replacing the capturing group with some arbitrary text :
'ccc-bbb-aaa'
replace capture group named Y with ccc, the capture group named m with bbb and the capture group named d with aaa.
In case regexes are not the best tool for the job, I'm open to other proposals that achieve my goal.
This is a completely backwards use of regex. The point of capture groups is to hold text you want to keep, not text you want to replace.
Since you've written your regex the wrong way, you have to do most of the substitution operation manually:
import re
def replace_groups(pattern, string, replacements):
    """
    Replaces the text captured by named groups.
    """
    pattern = re.compile(pattern)
    # create a dict of {group_index: group_name} for use later
    groupnames = {index: name for name, index in pattern.groupindex.items()}
    def repl(match):
        # we have to split the matched text into chunks we want to keep and
        # chunks we want to replace
        # captured text will be replaced. uncaptured text will be kept.
        text = match.group()
        # match.start(i) and match.end(i) are offsets into the whole string,
        # so shift them to be relative to the matched text
        offset = match.start()
        chunks = []
        lastindex = 0
        for i in range(1, pattern.groups+1):
            groupname = groupnames.get(i)
            if groupname not in replacements:
                continue
            # keep the text between this group and the last
            chunks.append(text[lastindex:match.start(i) - offset])
            # then instead of the captured text, insert the replacement text for this group
            chunks.append(replacements[groupname])
            lastindex = match.end(i) - offset
        chunks.append(text[lastindex:])
        # join all the chunks to obtain the final string with replacements
        return ''.join(chunks)
    # for each occurrence call our custom replacement function
    return re.sub(pattern, repl, string)
>>> replace_groups(pattern, s, {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'})
'monthday=aaa, month=bbb, year=ccc'
You can use string formatting with a regex substitution:
import re
s = 'monthday=1, month=5, year=2018'
s = re.sub(r'(?<=\=)\d+', '{}', s).format(*['aaa', 'bbb', 'ccc'])
Output:
'monthday=aaa, month=bbb, year=ccc'
Edit: for the second example the same trick works if each run of digits is replaced (rather than the whole match) and the replacements are passed in the order the groups appear in the string:
input = '2018-12-12'
new_s = re.sub(r'\d+', '{}', input).format(*["ccc", "bbb", "aaa"])
Extended Python 3.x solution on extended example (re.sub() with replacement function):
import re
d = {'d':'aaa', 'm':'bbb', 'Y':'ccc'} # predefined dict of replace words
pat = re.compile(r'(monthday=)(?P<d>\d{1,2})|(month=)(?P<m>\d{1,2})|(year=)(?P<Y>20\d{2})')
def repl(m):
    # the one named group that took part in this match
    pair = next(t for t in m.groupdict().items() if t[1])
    # preceding `key` for currently replaced sequence (i.e. 'monthday=' or 'month=' or 'year=')
    k = next(filter(None, m.groups()))
    return k + d.get(pair[0], '')
s = 'Data: year=2018, monthday=1, month=5, some other text'
result = pat.sub(repl, s)
print(result)
The output:
Data: year=ccc, monthday=aaa, month=bbb, some other text
For Python 2.7 :
change the line k = next(filter(None, m.groups())) to:
k = filter(None, m.groups())[0]
I suggest you use a loop
import re
regex = re.compile('monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})')
s = 'monthday=1, month=1, year=2017 \n'
s+= 'monthday=2, month=2, year=2019'
regex_as_str = 'monthday={d}, month={m}, year={Y}'
matches = [match.groupdict() for match in regex.finditer(s)]
for match in matches:
    s = s.replace(
        regex_as_str.format(**match),
        regex_as_str.format(**{'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'})
    )
You can do this multiple times with your different regex patterns,
or you can join both patterns into one with alternation ("or", i.e. |).
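For a version that does not depend on the structure of the pattern, the spans of the named groups can be used directly. The following is a sketch rather than a definitive implementation: replace_named_groups is a hypothetical helper name, and it assumes the named groups do not overlap each other. It is checked against both patterns from the question.

```python
import re

def replace_named_groups(pattern, text, replacements):
    """Replace the text captured by each named group listed in `replacements`."""
    compiled = re.compile(pattern)

    def repl(match):
        whole = match.group(0)
        base = match.start()  # group offsets are relative to the full string
        spans = [
            (match.start(name) - base, match.end(name) - base, new)
            for name, new in replacements.items()
            # skip unknown names and groups that did not take part in the match
            if name in compiled.groupindex and match.start(name) != -1
        ]
        # work right to left so earlier offsets stay valid
        for start, end, new in sorted(spans, reverse=True):
            whole = whole[:start] + new + whole[end:]
        return whole

    return compiled.sub(repl, text)

mapping = {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'}
first = replace_named_groups(
    r'monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})',
    'monthday=1, month=5, year=2018', mapping)
second = replace_named_groups(
    r'((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))',
    'on 2018-12-12 ok', mapping)
print(first)
print(second)
```

Because the replacements are applied from the rightmost group to the leftmost, the order of the groups in the pattern does not matter.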
I have this string in python
a = "haha"
result = "hh"
What I would like to achieve is to use a regex to replace all occurrences of "aha" with "h", all "oho" with "h" and all "ehe" with "h".
"h" is just an example. Basically, I would like to retain the centre character. In other words, if it's 'eae', I would like it to be changed to 'a'.
My regex would be this
"aha|oho|ehe"
I thought of doing this
import re
reg = re.compile('aha|oho|ehe')
However, I am stuck on how to achieve this kind of substitution without using loops to iterate through all the possible combinations.
You can use re.sub:
import re
print(re.sub('aha|oho|ehe', 'h', 'haha')) # hh
print(re.sub('aha|oho|ehe', 'h', 'hoho')) # hh
print(re.sub('aha|oho|ehe', 'h', 'hehe')) # hh
print(re.sub('aha|oho|ehe', 'h', 'hehehahoho')) # hhhahh
What about re.sub(r'[aeo]h[aeo]','h',a) ?
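Neither answer keeps an arbitrary centre character, which is what the question asks for ('eae' should become 'a'). One way to sketch that is with a capture group and a backreference. The assumptions here are that the outer letters must be the same vowel out of a, e, o, and that any single character may sit in the middle; keep_centre is just an illustrative name.

```python
import re

# group 1: the outer vowel, group 2: the centre character, \1: the same vowel again
CENTRE = re.compile(r'([aeo])(.)\1')

def keep_centre(text):
    # replace the whole three-character match with its centre character
    return CENTRE.sub(r'\2', text)

print(keep_centre('haha'))
print(keep_centre('eae'))
print(keep_centre('hehehahoho'))
```

To allow mixed outer vowels such as 'ahe', the backreference \1 could be swapped for a second [aeo] class.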
I have a txt file which contains a lot of strings such as
Chr(101)
Chr(97)
Chr(104)
...
I am using the below code to find all occurrences of such strings using regex. What I'd like to do is to replace each occurrence with its evaluated output. So in this case I'd replace the above with:
e
a
h
The code I have is as follows:
with open(oFile, "r") as f:
    for line in f:
        # find all occurrences of Chr(\d+\) and put in a list
        chrList = [str(s) for s in re.findall(r'Chr\(\d+\)', line)]
        # print chrList
        for c in chrList:
            # print eval(c.lower())
            out = re.sub(c, eval(c.lower()), line)
If I print the eval(c.lower()) line then it outputs as expected. However the re.sub line fails with the following error:
raise error, v # invalid expression sre_constants.error: bogus escape (end of line)
Not sure where I'm going wrong here.
You don't have to use distinct search and replace functions. You can invoke eval using the functional form of re.sub:
for line in f:
    out = re.sub(r'Chr\(\d+\)', lambda c: eval(c.group(0).lower()), line)
    print(out)
You're going to want to escape your search pattern, because parentheses are special characters in regular expressions. You can easily do this using re.escape.
out = re.sub(re.escape(c), eval(c.lower()), line)
And as an example:
strings = ['Chr(100)', 'Chr(101)', 'Chr(102)']
values = [re.sub(re.escape(c), eval(c.lower()), c) for c in strings]
# ['d', 'e', 'f']
That being said, why not just use replace()?
out = line.replace(c, eval(c.lower()))
Same thing but without eval() or imports:
strings = ['Chr(100)', 'Chr(101)', 'Chr(102)']
values = [chr(x) for x in (int(c.replace("Chr(", "").replace(")","")) for c in strings)]
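The ideas above can be combined so that the substitution happens in a single pass over a line, without eval. A minimal sketch, assuming the text only contains tokens of the form Chr(<decimal digits>); decode_chr is a made-up helper name:

```python
import re

CHR_CALL = re.compile(r'Chr\((\d+)\)')

def decode_chr(line):
    # the callback receives each match and returns the decoded character;
    # a function replacement is inserted literally, so backslashes are safe
    return CHR_CALL.sub(lambda m: chr(int(m.group(1))), line)

decoded = decode_chr('x = Chr(101) & Chr(97) & Chr(104)')
print(decoded)
```

Capturing only the digits means nothing from the file is ever evaluated as code.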
Is there a way to determine how many capture groups there are in a given regular expression?
I would like to be able to do the following:
def groups(regexp, s):
    """ Returns the first result of re.findall, or an empty default
    >>> groups(r'(\d)(\d)(\d)', '123')
    ('1', '2', '3')
    >>> groups(r'(\d)(\d)(\d)', 'abc')
    ('', '', '')
    """
    import re
    m = re.search(regexp, s)
    if m:
        return m.groups()
    return ('',) * num_of_groups(regexp)
This allows me to do stuff like:
first, last, phone = groups(r'(\w+) (\w+) ([\d\-]+)', 'John Doe 555-3456')
However, I don't know how to implement num_of_groups. (Currently I just work around it.)
EDIT: Following the advice from rslite, I replaced re.findall with re.search.
sre_parse seems like the most robust and comprehensive solution, but requires tree traversal and appears to be a bit heavy.
MizardX's regular expression seems to cover all bases, so I'm going to go with that.
def num_groups(regex):
    return re.compile(regex).groups
f_x = re.search(...)
len_groups = len(f_x.groups())
Something from inside sre_parse might help.
At first glance, maybe something along the lines of:
>>> import sre_parse
>>> sre_parse.parse('(\d)\d(\d)')
[('subpattern', (1, [('in', [('category', 'category_digit')])])),
('in', [('category', 'category_digit')]),
('subpattern', (2, [('in', [('category', 'category_digit')])]))]
I.e. count the items of type 'subpattern':
import sre_parse
def count_patterns(regex):
    """
    >>> count_patterns('foo: \d')
    0
    >>> count_patterns('foo: (\d)')
    1
    >>> count_patterns('foo: (\d(\s))')
    1
    """
    parsed = sre_parse.parse(regex)
    return len([token for token in parsed if token[0] == 'subpattern'])
Note that we're only counting root-level patterns here, so the last example only returns 1. To change this, the tokens would need to be searched recursively.
First of all, if you only need the first result of re.findall, it's better to just use re.search, which returns a match or None.
For the number of groups, you could count the open parentheses '(' except those that are escaped by '\'. You could use another regex for that:
def num_of_groups(regexp):
    rg = re.compile(r'(?<!\\)\(')
    return len(rg.findall(regexp))
Note that this doesn't work if the regex contains non-capturing groups and also if '(' is escaped by using it as '[(]'. So this is not very reliable. But depending on the regexes that you use it might help.
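To make the caveat concrete, the sketch below compares the parenthesis-counting heuristic with the groups attribute of a compiled pattern on a regex that contains both a non-capturing group and a bracketed '('. The names count_open_parens and tricky are only illustrative.

```python
import re

def count_open_parens(regexp):
    # heuristic: count every "(" that is not preceded by a backslash
    return len(re.findall(r'(?<!\\)\(', regexp))

def num_of_groups(regexp):
    # the compiled pattern knows how many capturing groups it has
    return re.compile(regexp).groups

def groups(regexp, s):
    """Return the groups of the first match, or empty strings if none."""
    m = re.search(regexp, s)
    if m:
        return m.groups()
    return ('',) * num_of_groups(regexp)

tricky = r'(?:\d+)-(\w+)[(]'
print(count_open_parens(tricky))  # counts "(?:" and "[(]" as well
print(num_of_groups(tricky))
print(groups(r'(\d)(\d)(\d)', 'abc'))
```

The heuristic overcounts here, while the compiled pattern reports only the single capturing group.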
Using your code as a basis:
def groups(regexp, s):
    """ Returns the first result of re.findall, or an empty default
    >>> groups(r'(\d)(\d)(\d)', '123')
    ('1', '2', '3')
    >>> groups(r'(\d)(\d)(\d)', 'abc')
    ('', '', '')
    """
    import re
    m = re.search(regexp, s)
    if m:
        return m.groups()
    # m is None here, so ask the compiled pattern for its group count instead
    return ('',) * re.compile(regexp).groups
As far as I can tell, a failed match gives you no way to ask how many groups would have been returned, because m is None. The compiled pattern does know its own group count, though, so there is no need to pass the expected number of groups as an argument.
To clarify though: When findall succeeds, you only want the first match to be returned, but when it fails you want a list of empty strings? Because the comment seems to show all matches being returned as a list.